Add a streaming-based CSV parser to k6 #2976

Closed
oleiade opened this issue Mar 13, 2023 · 2 comments

oleiade (Member) commented Mar 13, 2023

Users with big CSV files, say larger than 500MB, who use a large number of VUs directly run into our issues with handling large files.

As a result, we would like k6 to offer an alternative way to handle CSV files. Ideally, we would like it to be streaming-based, holding only a subset of the data in memory at a time. That way, k6's memory footprint would remain sustainable for such users.

Non-final target API

import http from 'k6/http';
import { csv } from 'k6/files';

let filename = '10M-rows.csv';

// The file's rows look like:
// username, password, email
// pawel, test123, [email protected]
// ...

// Not using the old open() API, which loads the whole file into memory:
// let fileContents = open(filename);

let fileHandle = streamingOpenFileHandler(filename);

const csvHandler = csv.objectReader(fileHandle.stream, {
  delimiter: ',',
  consumptionStrategy: 'uniqueSequential', // VU-safe, non-repeating
  endOfFileHandling: 'startFromBeginning', // what to do when we run out of rows
});

export default function () {
  let object = csvHandler.next(); // unique row across all VUs; fields are
                                  // accessible as properties, e.g. object.username

  const res = http.post('http://test.k6.io/login', {
    user: object.username,
    pass: object.password,
  });
}

Prerequisites

However, providing such an alternative CSV parser implementation that works for both open-source and cloud users is currently blocked by the issues listed in "improving the handling of large files in k6".

Namely, we would first need the ability to access such files, seek through them, and stream their content without having to decompress them on disk first, and without having to load their whole content into memory. Another prerequisite is an API that allows opening and reading files as separate operations, as opposed to storing their content in memory.
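For illustration, here is a rough sketch of what such a file API could look like from a script's point of view. The module path and the openFile, read, and seek names below are hypothetical, meant only to convey the shape of the prerequisite, not an existing k6 API:

import { openFile, SeekMode } from 'k6/files'; // hypothetical module and names

// Opening returns a handle without reading the file's contents into memory.
const file = openFile('10M-rows.csv');

export default function () {
  // Read at most 1KB from the current offset into a preallocated buffer.
  const buffer = new Uint8Array(1024);
  const bytesRead = file.read(buffer);

  // Seek back to the beginning, so the content can be streamed again.
  file.seek(0, SeekMode.Start);
}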

mstoykov (Contributor) commented

I would like to argue for pivoting the initial implementation to an API and techniques that are applicable to the current architecture of k6 and its capabilities.

That is, I would argue that it will be more beneficial for users if the new csv parser can return an array ... maybe even directly a SharedArray, instead of having to also get #3014 to work.

This means that users will be able to use it as more of a drop-in replacement, making the data parametrization code they use today faster and lighter. A sketch of what that could look like follows.
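For instance, assuming a hypothetical csv.parse function in the proposed k6/files module, the drop-in usage could mirror the SharedArray pattern users already rely on with papaparse today:

import { SharedArray } from 'k6/data';
import { csv } from 'k6/files'; // hypothetical module, reusing the issue's import

// The parsing function runs once, in a single VU; every other VU then
// shares the same read-only copy of the resulting array.
const users = new SharedArray('users', function () {
  // csv.parse is hypothetical: it would read and parse the whole file at
  // once and return an array of row objects, much like papaparse does today.
  return csv.parse('10M-rows.csv', { header: true });
});

export default function () {
  const row = users[Math.floor(Math.random() * users.length)];
  // row.username, row.password, ...
}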

Due to afero, in practice we will load the file twice in memory before we even start reading it, and I doubt that uploading an archive of more than 100MB to the cloud will work reliably either way. Supporting that will likely require changing the k6 archive format at a minimum.

Additionally, this seems like the usual CPU vs. memory trade-off: users can either keep parsing the same CSV file concurrently in every VU (using CPU) or keep one precalculated copy that uses a lot of memory.

I would be interested in how much memory, for example, a 100MB or 1GB CSV file ends up using.

From the last experiments we ran, the current biggest problem with papaparse is that it becomes a lot slower once you go over 5MB ... it wasn't how much memory k6 was using to parse the file, or to hold it afterwards.

This doesn't stop us from adding the remaining parts of the API in the future as well.

P.S. In theory you can iterate with next(), but I expect that will be a lot slower, especially with await needing to unwind the stack back and forth.

oleiade (Member, Author) commented Sep 18, 2024

Addressed by #3743.
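For reference, a minimal usage sketch of the streaming parser that shipped with #3743, based on the k6/experimental/csv module's documented API (experimental, and subject to change):

import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

// open() is asynchronous; k6 supports top-level await in the init context.
const file = await open('10M-rows.csv');

// The parser reads the file incrementally instead of loading it all in memory.
const parser = new csv.Parser(file, { skipFirstLine: true });

export default async function () {
  // next() returns an iterator-like { done, value } pair, where value holds
  // the columns of the current row as an array of strings.
  const { done, value } = await parser.next();
  if (done) {
    return;
  }

  const [username, password] = value;
}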

oleiade closed this as completed Sep 18, 2024