Optimizations to R ETL process (esp. loading). #4

Open: wants to merge 1 commit into base: master

concretevitamin

  • Directly reads from .txt file instead of saving out to .Rdata first
    then reading back again. Prototyped for Regression.
  • Even if the .Rdata step is desired, using fread() has much better
    performance.

I've found this to be much more efficient for benchmarking (tested on an EC2 instance). If this approach looks good, I could certainly make corresponding changes for all queries.

Collaborator

rytaft commented Jul 28, 2015

Sorry this slipped through the cracks and I am only looking at this now. Thanks for submitting your code!

Regarding the changes to vanilla_R_benchmark.R, I have done a bit of testing on the 5000x5000 dataset, and it seems that load() on a binary file is faster than fread() on a text file (6.5 seconds vs. 11.8 seconds). Under what conditions did you find fread() to be faster?
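
For anyone wanting to reproduce this kind of comparison, here is a minimal timing sketch. It is not the code from this pull request: the file paths, the matrix size, and the object names are all made up for illustration, and it assumes the data.table package is installed. It writes the same random matrix once as delimited text and once as a binary .Rdata file, then times each load path with system.time().

```r
# Sketch only: paths, sizes, and object names are hypothetical.
library(data.table)  # provides fread() and fwrite()

set.seed(1)
n <- 500  # small stand-in for the 5000x5000 dataset
m <- matrix(rnorm(n * n), nrow = n)

txt_path   <- tempfile(fileext = ".txt")
rdata_path <- tempfile(fileext = ".Rdata")

# Write the same data as delimited text and as a binary .Rdata file.
fwrite(as.data.table(m), txt_path, sep = ",")
save(m, file = rdata_path)

# Load the binary file into its own environment so the comparison
# does not depend on what is already in the workspace.
env <- new.env()
t_load  <- system.time(load(rdata_path, envir = env))[["elapsed"]]
t_fread <- system.time(dt <- fread(txt_path))[["elapsed"]]

cat(sprintf("load():  %.3f s\nfread(): %.3f s\n", t_load, t_fread))
```

The result will likely depend on the machine, the disk, and the settings used when the .Rdata file was written (save() compresses by default), so a single run on one dataset may not settle the question either way.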

Regarding the changes to generate_Rdata.R, fread() is certainly faster than read.csv(), but it seems to leave the data in a format that doesn't work with the code in vanilla_R_benchmark.R. I haven't done much debugging, but if you have any ideas I'd definitely appreciate them!
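
One possible cause, not confirmed in this thread, is that fread() returns a data.table by default, while read.csv() returns a plain data.frame, and some indexing code behaves differently on the two. The sketch below shows three ways to get back to a base R structure. The input file and column names are hypothetical, and it assumes the data.table package is installed.

```r
# Sketch only: the input file and its columns are hypothetical.
library(data.table)

path <- tempfile(fileext = ".txt")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), path)

# Default: a data.table, which also inherits from data.frame.
dt <- fread(path)

# Option 1: ask fread() for a plain data.frame up front.
df <- fread(path, data.table = FALSE)

# Option 2: convert an existing data.table afterwards.
df2 <- as.data.frame(dt)

# Option 3: convert to a matrix for code that expects one.
mat <- as.matrix(dt)

print(class(dt))
print(class(df))
print(dim(mat))
```

Whether any of these is enough for vanilla_R_benchmark.R depends on what structure that script expects, which is not shown here.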
