Read splittable LZO effectively in transformer #105
Conversation
Reading indexed LZO-compressed files from Spark cannot be achieved with the default `textFile()`, because `textFile()` does not know how to split an LZO file even when an index is available. As a result, an indexed LZO file is read entirely by a single executor, reducing the job's parallelism. To read indexed LZO effectively, `newAPIHadoopFile()` is used instead: it relies on the available index and splits the file into multiple partitions automatically. The compression format of the input files is passed as a CLI argument to control which implementation is used.
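For context, the split-aware read described above can be sketched roughly as follows. This is a hedged illustration, not the exact code in this PR: the bucket path is a placeholder, and it assumes the twitter/hadoop-lzo `LzoTextInputFormat` (which consults the `.lzo.index` files) is on the classpath.

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SparkSession

object LzoReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lzo-read-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Naive approach: textFile() treats each .lzo file as unsplittable,
    // so the whole file lands in one partition on one executor.
    // val lines = sc.textFile("s3://example-bucket/enriched/*.lzo")

    // Split-aware approach: LzoTextInputFormat reads the .lzo.index files
    // and produces one input split per indexed block, restoring parallelism.
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
        "s3://example-bucket/enriched/*.lzo")
      .map { case (_, text) => text.toString }

    // With an index present, this should exceed the number of input files.
    println(lines.getNumPartitions)
    spark.stop()
  }
}
```

Because the input format is chosen in code rather than inferred, a CLI flag (as this PR does) is a natural way to switch between the plain-text and LZO code paths.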
Thanks for your pull request. Is this your first contribution to a Snowplow open source project? Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://github.com/snowplow/snowplow/wiki/CLA to learn more and sign. Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.
This looks really great, @AcidFlow!
Thanks @chuwy! I wasn't sure whether the input compression should have been a new CLI parameter or declared in the "snowflake_config" configuration file (which would then require another pull request for the new schema definition on Iglu Central). What is your opinion on that?
I signed the CLA ;)
Confirmed! @AcidFlow has signed the Contributor License Agreement. Thanks so much.
Yeah, that's a tricky one. I think config files should contain only static information that is:
Doesn't seem that LZO encoding fits any of these, so I'm happy for it to be a CLI option, though I wouldn't oppose it being otherwise.
Do you need a published RC asset, or do you already use your own?
To be honest I am not using it yet, but I can use my own asset until there is an official release :) Thanks for asking!
This should close #104