Add a streaming-based CSV parser to k6 #2976

Closed
oleiade opened this issue Mar 13, 2023 · 2 comments

oleiade (Member) commented Mar 13, 2023

Users with big CSV files, say larger than 500MB, who use a large number of VUs directly run into our issues with handling large files.

As a result, we would like k6 to offer an alternative way to handle CSV files. Ideally, we would like it to be streaming-based, holding only a subset of the data in memory at a time. That way, k6's memory footprint would remain sustainable for such users.

Non-final target API

import http from 'k6/http';
import { csv } from 'k6/files';

let filename = '10M-rows.csv';

// The file's rows look like:
// username, password, email
// pawel, test123, [email protected]
// ...

// Not using the old open() API, which loads the whole file into memory:
// let fileContents = open(filename);

let fileHandle = streamingOpenFileHandler(filename);

const csvHandler = csv.objectReader(fileHandle.stream, {
  delimiter: ',',
  consumptionStrategy: 'uniqueSequential', // VU-safe, non-repeating
  endOfFileHandling: 'startFromBeginning', // what to do when we run out of rows
});

export default function () {
  let object = csvHandler.next(); // unique row across all VUs; fields are
                                  // accessible as properties, e.g. object.username

  const res = http.post('http://test.k6.io/login', {
    user: object.username,
    pass: object.password,
  });
}

Prerequisites

However, providing such an alternative CSV parser implementation that works for both open-source and cloud users is currently blocked by the issues listed in "improving the handling of large files in k6".

Namely, we would first need the ability to access such files, seek through them, and stream their content without having to decompress them on disk first, and without having to load their whole content into memory. Another prerequisite is an API that allows opening and reading files as separate operations, as opposed to storing their content in memory.
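For illustration, here is a rough sketch of what such a file API could look like from a script's point of view. The module path and the openFile, read, and seek names below are hypothetical, meant only to convey the shape of the prerequisite, not an existing k6 API:

import { openFile, SeekMode } from 'k6/files'; // hypothetical module and names

// Opening returns a handle without reading the file's contents into memory.
const file = openFile('10M-rows.csv');

export default function () {
  // Read at most 1KB from the current offset into a preallocated buffer.
  const buffer = new Uint8Array(1024);
  const bytesRead = file.read(buffer);

  // Seek back to the beginning, so the content can be streamed again.
  file.seek(0, SeekMode.Start);
}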

mstoykov (Contributor) commented

I would like to argue for pivoting the initial implementation to an API and techniques that are applicable to the current architecture of k6 and its capabilities.

That is, I would argue that it will be more beneficial for users if the new csv parser can return an array ... maybe even directly a SharedArray, instead of having to also get #3014 to work.

This means that users will be able to use it as more of a drop-in replacement, making the data parametrization code they use today faster and lighter. A sketch of what that could look like follows.
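For instance, assuming a hypothetical csv.parse function in the proposed k6/files module, the drop-in usage could mirror the SharedArray pattern users already rely on with papaparse today:

import { SharedArray } from 'k6/data';
import { csv } from 'k6/files'; // hypothetical module, reusing the issue's import

// The parsing function runs once, in a single VU; every other VU then
// shares the same read-only copy of the resulting array.
const users = new SharedArray('users', function () {
  // csv.parse is hypothetical: it would read and parse the whole file at
  // once and return an array of row objects, much like papaparse does today.
  return csv.parse('10M-rows.csv', { header: true });
});

export default function () {
  const row = users[Math.floor(Math.random() * users.length)];
  // row.username, row.password, ...
}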

Due to afero, in practice we will load the file twice in memory before we even start reading it, and I doubt that uploading an archive of more than 100MB to the cloud will work reliably either way. Supporting that will likely require changing the k6 archive format at a minimum.

Additionally, this seems like the usual CPU vs. memory trade-off: users can either keep parsing the same CSV file concurrently in every VU (using CPU) or keep one precalculated copy that uses a lot of memory.

I would be interested in how much memory, for example, a 100MB or 1GB CSV file ends up using.

From the last experiments we ran, the current biggest problem with papaparse is that it becomes a lot slower once you go over 5MB ... it wasn't how much memory k6 was using to parse the file, or to hold it afterwards.

This doesn't stop us from adding the remaining parts of the API in the future as well.

P.S. In theory you can iterate with next(), but I expect that will be a lot slower, especially with await needing to unwind the stack back and forth.

oleiade (Member, Author) commented Sep 18, 2024

Addressed by #3743.
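For reference, a minimal usage sketch of the streaming parser that shipped with #3743, based on the k6/experimental/csv module's documented API (experimental, and subject to change):

import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

// open() is asynchronous; k6 supports top-level await in the init context.
const file = await open('10M-rows.csv');

// The parser reads the file incrementally instead of loading it all in memory.
const parser = new csv.Parser(file, { skipFirstLine: true });

export default async function () {
  // next() returns an iterator-like { done, value } pair, where value holds
  // the columns of the current row as an array of strings.
  const { done, value } = await parser.next();
  if (done) {
    return;
  }

  const [username, password] = value;
}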

oleiade closed this as completed Sep 18, 2024