Long processing times when handling very large files #12

Open
hammzj opened this issue May 5, 2020 · 3 comments

hammzj commented May 5, 2020

Hello,

I've been here before and I'm back 😊
This gem has become a cornerstone of one of the projects I've developed. In most cases it performs very well, aside from some configuration options we need to customize for each scenario, but so it goes.

Now we are working with larger files, around 1.5 million rows. In some cases it seems to take hours. I've previously tested files between 500,000 and 1,000,000 rows and seen around 15 minutes or more to fully process their diffs with the gem. We can deal with that even though it's not lovely, but anything much longer than that is detrimental.

I'm not sure whether this is an issue with how we provide key_fields or something similar, so I'm mainly writing this issue as a question: what experiences have people had comparing large files? Is this a gem constraint, our own CSVDiff configuration, or something else?

What timings have you recorded when working with files of one million plus rows, with up to 100 columns?
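
For reference, here is a minimal sketch of how we call the gem, based on the CSVDiff.new(left, right, key_fields: ...) usage in the README; the file paths and key column index are placeholders for our real inputs:

```ruby
require 'csv-diff'

# Compare two large CSV extracts, keying rows on the first column.
# The paths and key column index are placeholders, not our real setup.
diff = CSVDiff.new('exports/run_old.csv', 'exports/run_new.csv',
                   key_fields: [0])

# Basic results the gem exposes once the comparison completes.
puts "Adds:    #{diff.adds.size}"
puts "Deletes: #{diff.deletes.size}"
puts "Updates: #{diff.updates.size}"
```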

hammzj commented Oct 27, 2020

@agardiner You're gonna hear from me a lot, but that's because this tool is incredibly valuable to our team.

We have been experiencing very long processing times for files over 100,000 rows, mainly files with 500k+ rows. These runs go on for hours without producing results. Have you done any performance testing with large files?
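
To put rough numbers on it, this is the kind of measurement I've been doing: timing the same keyed diff at increasing row counts. The synthetic files and column names here are just for illustration, and the call follows the README's key_fields usage:

```ruby
require 'benchmark'
require 'csv'
require 'csv-diff'

# Write a synthetic CSV with n data rows and a few columns.
def write_sample(path, n)
  CSV.open(path, 'w') do |csv|
    csv << %w[id col1 col2 col3]
    n.times { |i| csv << [i, "a#{i}", "b#{i}", rand(100)] }
  end
end

# Time the diff at increasing sizes to see how the runtime grows.
[100_000, 250_000, 500_000].each do |n|
  write_sample('left.csv', n)
  write_sample('right.csv', n)
  secs = Benchmark.realtime do
    CSVDiff.new('left.csv', 'right.csv', key_fields: [0])
  end
  puts "#{n} rows: #{secs.round(1)}s"
end
```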

agardiner (Owner) commented

I've not used this with files larger than 100k records, but I'd expect performance to drop off exponentially as your inputs grow. The implementation is pretty simple and works well for small inputs, but it was not designed for speed or to scale to large-volume inputs.
Sorry, I don't have any better news for you.

SerKnight commented

Came across this issue. If I continue using the gem, I can look into adding a debug option so that at least you have some insight into which stage of the process is running.
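
In the meantime, a crude stopgap is to wrap the call with a background thread that reports elapsed time, so a multi-hour run at least shows signs of life. This is plain Ruby around the call, not an option the gem actually provides, and the paths and key column are placeholders:

```ruby
require 'csv-diff'

started = Time.now

# Report elapsed time once a minute while the diff is running.
ticker = Thread.new do
  loop do
    sleep 60
    warn "csv-diff still running after #{((Time.now - started) / 60).round} min"
  end
end

diff = CSVDiff.new('left.csv', 'right.csv', key_fields: [0])
ticker.kill

puts "Finished in #{(Time.now - started).round}s with #{diff.diffs.size} differences"
```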
