How to evenly distribute data in Apache Pig output files?

September 15, 2014

I have a Pig Latin script that takes in some XML, uses an XPath UDF to pull out some fields, and then stores the resulting fields:

  REGISTER udf-lib-1.0-SNAPSHOT.jar;
  DEFINE XPath com.blah.udfs.XPath();
  docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') AS (content:chararray);
  results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
  STORE results INTO '$output';

Note that we are using pig-0.12.0 on our cluster, so I pulled the XPath / XMLLoader classes out of Pig 0.14 and put them in their own jar so that I can use them in 0.12.

The above script works fine and produces the data that I am looking for. However, it produces more than 1900 part files with only a few MB in each file. I learned about the default_parallel option, so I set it to 128 to try to get 128 part files. I ended up having to add a step that forces a reduce phase to achieve this. My script now looks like this:

  SET default_parallel 128;
  REGISTER udf-lib-1.0-SNAPSHOT.jar;
  DEFINE XPath com.blah.udfs.XPath();
  docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') AS (content:chararray);
  results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
  forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
  STORE forced_reduce INTO '$output';

Again, this produces the expected data, and I now get 128 part files. My problem now is that the data is not evenly distributed across the part files: some are 8 GB, others are 100 MB. I should have expected this when grouping by RANDOM() :).

My question is: what would be the preferred way to limit the number of part files while still keeping them roughly the same size? I'm new to Pig / Pig Latin and assume I'm going about this completely the wrong way.

P.S. The reason I care about the number of part files is that I want to process the output with Spark, and our Spark cluster seems to work better with a smaller number of files.

I'm still looking for a way to do this directly in the Pig script, but for now my "solution" is to repartition the data within Spark, in the process that works on the output of the Pig script. I use the RDD.coalesce function to rebalance the data.
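One Pig-only variation that might be worth trying is to group on a bounded integer bucket instead of the raw RANDOM() value, so that the reduce phase sees exactly 128 distinct keys. The following is a minimal sketch, not a verified fix: it reuses the jar, UDF, and loader names from the script above, the bucket count of 128 simply mirrors default_parallel, and it assumes that Pig's default partitioner spreads 128 small integer keys evenly across 128 reducers, which has not been checked on 0.12.

```pig
-- Sketch only: same jar, UDF, and loader as the original script.
SET default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();

docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') AS (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;

-- RANDOM() returns a double in [0, 1), so the cast yields an int bucket in 0..127.
-- Grouping on that bucket forces a reduce phase with at most 128 distinct keys.
forced_reduce = FOREACH (GROUP results BY (int)(RANDOM() * 128) PARALLEL 128)
                GENERATE FLATTEN(results);

STORE forced_reduce INTO '$output';
```

Even if the keys do land one per reducer, this balances record counts rather than bytes, so documents of very different sizes could still produce uneven part files. Each bucket is also materialized as a single large bag on its reducer, which may spill to disk for big inputs.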


Raj T
