Read splittable LZO effectively in transformer #105
Conversation
Reading indexed LZO-compressed files from Spark cannot be achieved with the default `textFile()`, because `textFile()` does not know how to split an LZO file even when an index is available. As a result, an indexed LZO file is read entirely by a single executor, reducing the job's parallelism. To read indexed LZO effectively, `newAPIHadoopFile()` is used instead: it relies on the available index and splits the file into multiple partitions automatically. The compression format of the input files is passed as a CLI argument to control which implementation is used.
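For context, the split-aware read described above can be sketched roughly as follows. This is a hedged illustration, not the exact code in this PR: the bucket path is a placeholder, and it assumes the twitter/hadoop-lzo `LzoTextInputFormat` (which consults the `.lzo.index` files) is on the classpath.

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SparkSession

object LzoReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lzo-read-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Naive approach: textFile() treats each .lzo file as unsplittable,
    // so the whole file lands in one partition on one executor.
    // val lines = sc.textFile("s3://example-bucket/enriched/*.lzo")

    // Split-aware approach: LzoTextInputFormat reads the .lzo.index files
    // and produces one input split per indexed block, restoring parallelism.
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
        "s3://example-bucket/enriched/*.lzo")
      .map { case (_, text) => text.toString }

    // With an index present, this should exceed the number of input files.
    println(lines.getNumPartitions)
    spark.stop()
  }
}
```

Because the input format is chosen in code rather than inferred, a CLI flag (as this PR does) is a natural way to switch between the plain-text and LZO code paths.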
Thanks for your pull request. Is this your first contribution to a Snowplow open source project? Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://github.com/snowplow/snowplow/wiki/CLA to learn more and sign. Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.
This looks really great, @AcidFlow!
Thanks @chuwy! I wasn't sure whether the input compression should have been a new CLI parameter or declared in the "snowflake_config" configuration file (which would then require another pull request for the new schema definition on Iglu Central). What is your opinion on that?
I signed the CLA ;)
Confirmed! @AcidFlow has signed the Contributor License Agreement. Thanks so much.
Yeah, that's a tricky one. I think config files should contain only static information that is:
Doesn't seem that LZO encoding fits any of these, so I'm happy for it to be a CLI option, though I wouldn't oppose it being otherwise.
Do you need a published RC asset, or do you already use your own?
To be honest I am not using it yet, but I can use my own asset until there is an official release :) Thanks for asking!
This should close #104