Users with big CSV files, say over 500MB in size, who also run a large number of VUs directly experience our issues with handling large files.
As a result, we would like k6 to offer an alternative way to handle CSV files. Ideally, it would be streaming-based and hold only a subset of the data in memory at a time. That way k6's memory footprint would remain sustainable for such users.
Non-final target API
```javascript
import http from 'k6/http';
import { csv } from 'k6/files'

let filename = '10M-rows.csv';
// username, password, email
// pawel, test123, [email protected]
// ...

// not using the old open() api.
// let fileContents = open(filename);
let fileHandle = streamingOpenFileHandler(filename);

const csvHandler = csv.objectReader(fileHandle.stream, {
  delimiter: ',',
  consumptionStrategy: 'uniqueSequential', // VU-safe, non-repeating.
  endOfFileHandling: 'startFromBeginning', // what to do when we run out of rows
})

export default function () {
  let object = csvHandler.next() // unique row across all VUs
  object.username

  const res = http.post('http://test.k6.io/login', {
    user: object.username,
    pass: object.password
  });
}
```
Prerequisites
However, being able to provide such an alternative implementation of a CSV parser that would work both for open-source and cloud users is currently blocked by issues listed in "improving the handling of large files in k6".
Namely, we would first need the ability to access such files, seek through them, and stream their content without having to decompress them on disk first, and without having to load their whole content into memory first. Another prerequisite would be an API that allows opening and reading files separately, as opposed to storing their content in memory.
I would like to argue for pivoting the initial implementation to an API and techniques that are applicable to the current architecture of k6 and its capabilities.
That is, I would argue that it will be more beneficial for users if the new CSV parser can return an array, maybe even a SharedArray directly, instead of having to also get #3014 to work.
This means users would be able to use it as more of a drop-in replacement, making the data parameterization code they use today faster and lighter.
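For illustration only, a drop-in usage of such a parser might look roughly like the sketch below; the `k6/files` module and `csv.parse()` helper are assumptions borrowed from the proposal above, not an agreed API:

```javascript
import { SharedArray } from 'k6/data';
import { csv } from 'k6/files'; // assumed module name, does not exist in k6 today

// Hypothetical: the parser returns the rows as a plain array (or perhaps a
// SharedArray directly), so it slots straight into the usual data
// parameterization pattern without needing streaming file access first.
const users = new SharedArray('users', function () {
  return csv.parse(open('users.csv'), { header: true }); // csv.parse() and its options are assumed
});

export default function () {
  const row = users[__VU % users.length];
  console.log(row.username, row.password);
}
```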
Due to afero, in practice we will load the file into memory twice before we even start reading it, and I doubt that uploading an archive of more than 100MB to the cloud will work reliably either way. At the minimum, it would likely require changing k6's archive format.
Additionally, this seems like the usual CPU vs. memory trade-off, where users can either keep parsing the same CSV file concurrently in every VU (using CPU) or keep one precalculated copy that uses a lot of memory.
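To make that trade-off concrete, this is roughly the choice scripts face today (a sketch of the common k6 + papaparse parameterization pattern; the local `./papaparse.js` bundle and the file name are placeholders):

```javascript
import papaparse from './papaparse.js'; // papaparse bundled locally alongside the script
import { SharedArray } from 'k6/data';

// Option A (CPU and duplication): every VU runs the init code, so every VU
// opens and parses its own copy of the CSV file.
// const rows = papaparse.parse(open('./users.csv'), { header: true }).data;

// Option B (memory): parse once and keep a single precalculated, read-only
// copy that all VUs share.
const rows = new SharedArray('users', function () {
  return papaparse.parse(open('./users.csv'), { header: true }).data;
});

export default function () {
  const row = rows[Math.floor(Math.random() * rows.length)];
  console.log(row.username);
}
```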
I would also be interested in how much memory, for example, a 100MB or 1GB CSV file ends up using.
From the last experiments we tried, the biggest current problem with papaparse is that it becomes a lot slower once you go over 5MB; it wasn't how much memory k6 was using to parse the file, or to hold it afterwards.
This doesn't stop us from having the remaining parts of the API in the future as well.
P.S. In theory you can iterate over it with next(), but I expect that to be a lot slower, especially with await needing to unwind the stack back and forth.
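For completeness, per-row iteration with an async next() could look something like the sketch below; it reuses the assumed objectReader API from the proposal above, and the await on every row is where the stack-unwinding overhead would come from:

```javascript
import { csv } from 'k6/files'; // assumed module, as in the proposal above

// Assumed API from the proposal; none of this exists in k6 today.
const fileHandle = streamingOpenFileHandler('10M-rows.csv');
const csvHandler = csv.objectReader(fileHandle.stream, { delimiter: ',' });

export default async function () {
  // Suspends the VU until the next row is available; paying this cost on
  // every single row is the expected slowdown compared to indexing into a
  // fully parsed array.
  const row = await csvHandler.next();
  if (row === null) {
    return; // out of rows (behaviour would depend on endOfFileHandling)
  }
  console.log(row.username);
}
```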