How to evenly distribute data in Apache Pig output files?
I have a Pig Latin script that takes in some XML, uses an XPath UDF to pull out some fields, and then stores the resulting fields:
    register udf-lib-1.0-SNAPSHOT.jar;
    define XPath com.blah.udfs.XPath();
    docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') AS (content:chararray);
    results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
    store results into '$output';

Note that we are using pig-0.12.0 on our cluster, so I pulled the XPath/XMLLoader classes out of pig-0.14 and put them in their own jar so that I can use them in 0.12.
The above script works fine and produces the data that I am looking for. However, it generates more than 1900 part-files with only a few MB in each file. I learned about the default_parallel option, so I set it to 128 to try to get 128 part-files. I ended up having to add a step that forces a reduce phase to achieve this. My script now looks like:
    set default_parallel 128;
    register udf-lib-1.0-SNAPSHOT.jar;
    define XPath com.blah.udfs.XPath();
    docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') AS (content:chararray);
    results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
    forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
    store forced_reduce into '$output';

Again, it produces the expected data, and now I get 128 part-files. My problem now is that the data is not evenly distributed among the part-files: some have 8 GB, others have 100 MB. I should have expected this when grouping by RANDOM() :).
My question is: what would be the preferred way to limit the number of part-files while still keeping them roughly the same size? I'm new to Pig / Pig Latin and assume I'm going about this completely the wrong way.
P.S. The reason I care about the number of part-files is that I want to process the output with Spark, and our Spark cluster seems to do a better job with a smaller number of files.
I'm still trying to find a way to do this directly in the Pig script, but for now my "solution" is to repartition the data within the Spark process that works on the output of the Pig script. I use the RDD.coalesce function to rebalance the data.
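For the Pig-only route mentioned above, here is a minimal sketch of one possible variation. It has not been tested on the cluster described here. Instead of grouping on the raw double returned by RANDOM(), it groups on a small integer bucket, so there are exactly 128 distinct keys. The bucket expression, the explicit PARALLEL clause, and the alias names grouped and balanced are assumptions for illustration, not part of the original script.

```pig
-- Sketch only: assumes the same UDF jar, loader, and parameters as the scripts above.
register udf-lib-1.0-SNAPSHOT.jar;
define XPath com.blah.udfs.XPath();

docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') AS (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;

-- Assumption: map each row to one of 128 integer buckets (0..127)
-- rather than grouping on the raw RANDOM() double.
grouped = GROUP results BY (int)(RANDOM() * 128) PARALLEL 128;

-- Drop the bucket key and emit the original rows again.
balanced = FOREACH grouped GENERATE FLATTEN(results);

store balanced into '$output';
```

This evens out row counts rather than bytes, so part-files can still differ in size if individual documents vary a lot. Each group's bag also holds every row for its bucket, which may be expensive for large inputs.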