
CSV API #1021

Closed
bookmoons opened this issue May 16, 2019 · 6 comments
Labels
evaluation needed · feature

Comments

@bookmoons
Contributor

I'm trying to propose an API for the CSV item on the roadmap, something I can set a target for. What about this?

The suggestion is to include CSV parsing of both files and response data. For files it can use streaming to minimize memory usage. I think segmenting isn't deployed yet, but a partitioning ability can be built in that segmenting can later plug into.

A possible API in the init context. With segmenting, these could access partitioned data.

const data = openCsv("data.csv");
data.next(); // [ "apple", "pie" ]
data.next(); // [ "orange", "juice" ]
data.next(); // [ "ants", "mashed" ]
data.next(); // null === EOF

const data = openCsv("data.csv", { header: true });
data.next(); // { item: "apple", food: "pie" }
data.next(); // { item: "orange", food: "juice" }
data.next(); // { item: "ants", food: "mashed" }

const data = openCsv("data.csv", { columns: [ "ingredient", "recipe" ] });
data.next(); // { ingredient: "apple", recipe: "pie" }
data.next(); // { ingredient: "orange", recipe: "juice" }
data.next(); // { ingredient: "ants", recipe: "mashed" }

A possible API for response data.

const response = getResponse();
const data = response.csv();
/*
 * [
 *   [ "apple", "pie" ],
 *   [ "orange", "juice" ],
 *   [ "ants", "mashed" ]
 * ]
 */

const response = getResponse();
const data = response.csv({ header: true });
/*
 * [
 *   { item: "apple", food: "pie" },
 *   { item: "orange", food: "juice" },
 *   { item: "ants", food: "mashed" }
 * ]
 */

const response = getResponse();
const data = response.csv({ columns: [ "ingredient", "recipe" ] });
/*
 * [
 *   { ingredient: "apple", recipe: "pie" },
 *   { ingredient: "orange", recipe: "juice" },
 *   { ingredient: "ants", recipe: "mashed" }
 * ]
 */

That leaves these options:

  • header - Boolean. Interpret first line as column header. Return objects.
  • columns - Array. Explicit column names. Return objects.
@na--
Member

na-- commented May 16, 2019

Thanks for this proposal! Unfortunately, there are some issues with it, most of which stem from the somewhat poor design of the existing k6 APIs that you've chosen to emulate... The rest are due to the data/work streaming and segmentation functionality we want to have. Sorry for the long post in advance, but I'll try to explain what I mean and where we're trying to go with the new file/CSV/JSON/XML/etc. APIs. I'm not sure if I can currently propose an alternative API, since I have to think and research a bit more, but hopefully this summary of the situation will help me as well.

So, to start with, the current open() function is sorely lacking in multiple different ways (a short sketch of the current pattern follows this list):

  • Contrary to normal programming language convention, it not only opens the file, it actually reads all of its contents into memory.
  • Not only that, but despite k6 intentionally not having the functionality to write or modify files, the file contents aren't stored in memory just once; instead, each VU has its own unique copy. So, especially with big files and/or many VUs, RAM usage quickly explodes.
  • It has an optional second argument for returning a binary result, but as I explained in Better support for binary data #1020, its result is not very usable for JS manipulation and use.
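
For reference, here's a minimal sketch of the current pattern described above (the status quo, not a proposal): open() runs in the init context, returns the whole file as a string, and every VU ends up with its own copy because each VU re-runs the init code.

const contents = open("data.csv");     // whole file read into memory, once per VU
const lines = contents.split(/\r?\n/); // any parsing happens on the in-memory copy

export default function () {
  // each VU indexes into its own private copy of `lines`
  const row = lines[__ITER % lines.length].split(",");
  console.log(row[0]);
}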

What we need instead is a new function, or more likely a set of functions, that will satisfy some of our requirements (a purely illustrative sketch follows this list):

  • be able to read a whole file, but keep only a single copy in memory that can be accessed, in a random-access and read-only fashion, by all of the VUs
  • not read the whole file into memory, but instead be able to read from it in a streaming fashion, similar to the io.Reader Go interface
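
Purely as an illustration of those two requirements, a hypothetical shape for such functions could look like this (openShared() and openStream() are made-up names; nothing like this exists in k6 yet):

// Hypothetical sketch only; neither function exists in k6.
const shared = openShared("data.csv"); // requirement 1: one read-only copy for all VUs
const firstLine = shared.readLine(0);  // hypothetical random-access read

const stream = openStream("data.csv"); // requirement 2: io.Reader-style streaming
const chunk = stream.read(4096);       // hypothetical: read up to 4 KiB, never the whole file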

Somewhat orthogonal to the above concepts, in all cases (plain strings, the 2 points above, maybe even binary data), we will also need a way for VUs to read the data bit by bit, in various increments - bytes, characters, lines, null, untilChar(c), etc. And that also has at least 2 different flavors:

  • overlapping, where each VU is able to read a complete copy of all of the data
  • non-overlapping, where each VU reads a unique new character/line/segment from the file

It's in the second case where execution segments (#997) come in, since they offer a way to achieve non-overlapping reads of a single piece of static data without synchronization between VUs on different k6 instances. And even though the execution segments work for the new schedulers (#1007), it's not clear how we should expose them to the JS data traversal APIs. And it's currently an open question how we can allow users to loop over files multiple times. And how do we signal things like "end of file"...

In any case, up until now, I was describing only raw files, but CSV/JSON/XML/HTML/etc. are things that would have to work on top of the above machinery. Or at least on top of the first set of functions for opening a file - CSV parsing should be able to work with a simple string (i.e. the current result of an open() call), with a shared-memory file, and with a streaming file. And the CSV API should offer reading modes similar to the ones I described above (sketched after this list):

  • parse the whole raw file into one huge array of arrays/maps
  • overlapping reads, where each VU can call data.next() and get the same results as the other VUs
  • non-overlapping reads with segmentation, where each VU will get unique results when calling data.next()
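
A purely illustrative sketch of those three modes, using a made-up parseCsv() function and a made-up segment option (neither exists yet):

// Hypothetical sketch; parseCsv() and its options are made up.
const whole = parseCsv(open("data.csv")).all();  // 1) one big array of rows

const shared = parseCsv(open("data.csv"));       // 2) overlapping reads:
shared.next();                                   //    every VU sees row 1, then row 2, ...

const parts = parseCsv(open("data.csv"), { segment: true }); // 3) non-overlapping reads:
parts.next();                                    //    each VU only gets rows from its own
                                                 //    execution segment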

So, to get back to your proposal, I'm against an openCsv() function simply because opening a file shouldn't be tied to actually parsing it as a CSV/JSON/XML/etc., those should be 2 completely separate things.

Similarly, I think that the current html() and json() methods of http.Response aren't the best API design. The HTTP response object has a body property, which can very easily be passed on to an HTML/JSON/CSV/XML/etc. parsing function that can deal with it. That way, that function can also be reused for files, strings, websocket responses and whatever else we introduce. The tight coupling between the different format parsing functions and HTTP responses isn't very helpful and is only forcing us to spaghettify our code. So, in short, despite the unfortunate presence of http.Response.html() and http.Response.json(), I don't think we should continue the trend and add http.Response.csv().
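
To make that decoupling concrete, here's a small sketch; http.get(), res.body and open() are existing k6 APIs, while parseCsv() and the URL are made-up placeholders for whatever the parsing function ends up being:

import http from "k6/http";

// parseCsv() is hypothetical; the point is that it only needs a string.
const fromFile = parseCsv(open("data.csv"), { header: true }); // works on file contents...

export default function () {
  const res = http.get("https://example.com/data.csv");
  const rows = parseCsv(res.body, { header: true });           // ...and on response bodies
}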

I'll finish by saying that I actually like the parts of your proposal that deal with column names and header parsing! They are very intuitive and user-friendly. Some other CSV-specific comments:

  • we'll likely also want to support different delimiters, so that should also be configurable
  • maybe add support for comments, lazy quotes, things like that - see the Go csv.Reader docs for more possible options (a small illustrative sketch follows this list)
  • we need to figure out how we'll deal with error handling and corner cases - do malformed CSV rows raise an exception, etc.
  • this shouldn't be a problem, but we need to check how well goja can support the case where we can return an array of arrays when there aren't column names and an array of objects when there are
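
As a purely illustrative sketch, those options could mirror the fields of Go's csv.Reader, which k6 would presumably use under the hood (the JS option names below are made up; Comma, Comment and LazyQuotes are the actual Go fields):

// Hypothetical options object; none of this is implemented.
const data = parseCsv(open("data.tsv"), {
  delimiter: "\t",                     // csv.Reader.Comma: use tabs instead of commas
  comment: "#",                        // csv.Reader.Comment: skip lines starting with '#'
  lazyQuotes: true,                    // csv.Reader.LazyQuotes: tolerate stray quotes
  columns: [ "ingredient", "recipe" ], // explicit column names, as in the proposal above
});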

And finally, links to some of the connected issues: #532, #592, #992, #997, #1020

@na-- added the evaluation needed and feature labels May 16, 2019
@greyireland

What we need instead is a new function, or more likely - a set of functions, that will satisfy some of our requirements:

be able to read a whole file, but keep only a single copy in memory that can be accessed, in a random-access and read-only fashion, by all of the VUs
not read the whole file in memory, instead be able to read from it in a streaming fashion, similar to the io.Reader Go interface

I want to know how to implement this feature. Can you give some advice?
I have 1 GB of HTTP request data and I want to replay it with 100 VUs. How can I share the request data?

@na--
Member

na-- commented Oct 10, 2019

@greyireland, sorry for the late response. As I explained in my previous comment, implementing this is currently blocked by execution segments (implemented in #1007, which we'll hopefully finish and merge soon) and by us not having a clear picture of how the APIs that deal with streams/shared data/data partitioning/etc. should look.

You can relatively easily implement a streaming CSV reader for your use case, or adapt the one implemented in #612. Unfortunately, I can't guarantee that we'd merge something like that until we have a clearer idea of how everything needs to look.
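
For anyone with the same question, here's a rough sketch of the kind of workaround that was already possible with the existing API (the file name and column layout are made up; every VU still loads the whole file, but each VU only replays its own slice of the rows):

import http from "k6/http";

// Workaround sketch only, not the planned API.
const lines = open("requests.csv").split(/\r?\n/).filter((l) => l.length > 0);
const VUS = 100; // must match the configured number of VUs

export default function () {
  // __VU is 1-based and __ITER is this VU's 0-based iteration counter,
  // so every (VU, iteration) pair maps to a distinct row index.
  const idx = (__VU - 1) + __ITER * VUS;
  if (idx >= lines.length) return;
  const url = lines[idx].split(",")[0]; // assumes the first column holds the URL
  http.get(url);
}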

@sandeepojha930

@greyireland, did you achieve your requirement?

@greyireland

@greyireland, did you achieve your requirement?

Yes, I read the data from a file and share it.

@olegbespalov
Contributor

After an internal discussion, we've decided to close this task since the proposed API needs to be rethought and could be heavily impacted by the new File API (in particular streaming). #2978

@olegbespalov closed this as not planned Dec 4, 2023