This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

Read splittable LZO effectively in transformer #105

Open · wants to merge 1 commit into master
Conversation

@AcidFlow commented Apr 1, 2020

Reading indexed LZO-compressed files from Spark cannot be achieved with the
default `textFile()`, because `textFile()` does not know how to split an LZO
file even when an index is available.

As a result, each indexed LZO file is read entirely by a single executor,
reducing the job's parallelism.
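For context, splittable reads are only possible when an index has been generated beforehand. hadoop-lzo ships an indexer job for this; a sketch of how it might be invoked (the jar location and input path below are illustrative, not from this PR):

```shell
# Build a .index file next to each .lzo file so that index-aware input
# formats can split it; without an index, each file is one split.
# Requires the hadoop-lzo jar and native libraries to be installed.
hadoop jar /path/to/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  hdfs:///enriched/archive/
```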

To read indexed LZO files efficiently, `newAPIHadoopFile()` is used instead:
it relies on the available index and splits each file into multiple
partitions automatically.

The compression format of input files is passed as a CLI argument to
control which implementation is used.
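A minimal sketch of the approach described above, assuming hadoop-lzo's `LzoTextInputFormat` and a hypothetical helper name (the PR's actual code may differ):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import com.hadoop.mapreduce.LzoTextInputFormat

// Illustrative helper: picks the input implementation based on the
// compression format passed as a CLI argument.
def readInput(sc: SparkContext, path: String, compression: String): RDD[String] =
  compression.toLowerCase match {
    case "lzo" =>
      // newAPIHadoopFile consults the .index files produced by the
      // hadoop-lzo indexer, so each file is split across partitions
      // instead of being read whole by a single executor.
      sc.newAPIHadoopFile(
        path,
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text]
      ).map { case (_, line) => line.toString }
    case _ =>
      sc.textFile(path) // plain text falls back to the default reader
  }
```

Running this requires Spark plus the hadoop-lzo library on the classpath, so it is a sketch rather than a drop-in snippet.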

This should close #104

@snowplowcla (Collaborator) commented

Thanks for your pull request. Is this your first contribution to a Snowplow open source project? Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://github.com/snowplow/snowplow/wiki/CLA to learn more and sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.

@chuwy (Contributor) left a comment

This looks really great, @AcidFlow!

@AcidFlow (Author) commented Apr 2, 2020

Thanks @chuwy !

I wasn't sure whether the input compression should be a new CLI parameter or declared in the "snowflake_config" configuration file (which would then require another pull request for the new schema definition on Iglu Central).

What is your opinion on that?

@AcidFlow (Author) commented Apr 2, 2020

I signed the CLA ;)

@snowplowcla (Collaborator) commented

Confirmed! @AcidFlow has signed the Contributor License Agreement. Thanks so much.

@snowplowcla snowplowcla added cla:yes and removed cla:no labels Apr 2, 2020
@chuwy (Contributor) commented Apr 2, 2020

> What is your opinion on that?

Yeah, that's a tricky one. I think config files should contain only static information, i.e. information that is:

  1. Common between different components: transformer and loader
  2. Common between different components' subcommands: setup, load, migrate
  3. Such that changing it requires further actions: e.g. changing the database requires a manifest change or clean-up (and vice versa), and auth requires configuring a role ARN, etc.

LZO encoding doesn't seem to fit any of these, so I'm happy for it to be a CLI option, though I wouldn't oppose doing it otherwise.

@chuwy (Contributor) commented Apr 2, 2020

Do you need a published RC asset, or do you already use your own?

@AcidFlow (Author) commented Apr 2, 2020

To be honest, I am not using it yet, but I can use my own asset until there is an official release :)

Thanks for asking!


Successfully merging this pull request may close these issues.

Add support for lzo files in input
3 participants