
CSV API #1021

Closed
bookmoons opened this issue May 16, 2019 · 6 comments
Labels
evaluation needed · feature

Comments

@bookmoons
Contributor

I'm trying to propose an API for the CSV item on the roadmap, something I can set a target for. What about this?

The suggestion is to include CSV parsing of both files and response data. For files it can use streaming to minimize memory usage. I think segmenting isn't deployed yet, but a partitioning ability can be built in that segmenting can later plug into.

A possible API in the init context. With segmenting, these could access partitioned data.

const data = openCsv("data.csv");
data.next(); // [ "apple", "pie" ]
data.next(); // [ "orange", "juice" ]
data.next(); // [ "ants", "mashed" ]
data.next(); // null === EOF

const data = openCsv("data.csv", { header: true });
data.next(); // { item: "apple", food: "pie" }
data.next(); // { item: "orange", food: "juice" }
data.next(); // { item: "ants", food: "mashed" }

const data = openCsv("data.csv", { columns: [ "ingredient", "recipe" ] });
data.next(); // { ingredient: "apple", recipe: "pie" }
data.next(); // { ingredient: "orange", recipe: "juice" }
data.next(); // { ingredient: "ants", recipe: "mashed" }

A possible API for response data.

const response = getResponse();
const data = response.csv();
/*
 * [
 *   [ "apple", "pie" ],
 *   [ "orange", "juice" ],
 *   [ "ants", "mashed" ]
 * ]
 */

const response = getResponse();
const data = response.csv({ header: true });
/*
 * [
 *   { item: "apple", food: "pie" },
 *   { item: "orange", food: "juice" },
 *   { item: "ants", food: "mashed" }
 * ]
 */

const response = getResponse();
const data = response.csv({ columns: [ "ingredient", "recipe" ] });
/*
 * [
 *   { ingredient: "apple", recipe: "pie" },
 *   { ingredient: "orange", recipe: "juice" },
 *   { ingredient: "ants", recipe: "mashed" }
 * ]
 */

That leaves these options:

  • header - Boolean. Interpret first line as column header. Return objects.
  • columns - Array. Explicit column names. Return objects.
@na--
Member

na-- commented May 16, 2019

Thanks for this proposal! Unfortunately, there are some issues with it, most of which stem from the somewhat poor design of the existing k6 APIs that you've chosen to emulate... The rest are due to the data/work streaming and segmentation functionality we want to have. Sorry for the long post in advance, but I'll try to explain what I mean and where we're trying to go with the new file/CSV/JSON/XML/etc. APIs. I'm not sure if I can currently propose an alternative API, since I have to think and research a bit more, but hopefully this summary of the situation will help me as well.

So, to start with, the current open() function is sorely lacking in multiple different ways (a short sketch of the current pattern follows this list):

  • Contrary to normal programming language convention, it not only opens the file, it actually reads all of its contents into memory.
  • Not only that, but despite k6 intentionally not having the functionality to write or modify files, the file contents aren't stored in memory just once; instead, each VU has its own unique copy. So, especially with big files and/or many VUs, RAM usage quickly explodes.
  • It has an optional second argument for returning a binary result, but as I explained in Better support for binary data #1020, its result is not very usable for JS manipulation and use.
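
For reference, here's a minimal sketch of the current pattern described above (the status quo, not a proposal): open() runs in the init context, returns the whole file as a string, and every VU ends up with its own copy because each VU re-runs the init code.

const contents = open("data.csv");     // whole file read into memory, once per VU
const lines = contents.split(/\r?\n/); // any parsing happens on the in-memory copy

export default function () {
  // each VU indexes into its own private copy of `lines`
  const row = lines[__ITER % lines.length].split(",");
  console.log(row[0]);
}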

What we need instead is a new function, or more likely a set of functions, that will satisfy some of our requirements (a purely illustrative sketch follows this list):

  • be able to read a whole file, but keep only a single copy in memory that can be accessed, in a random-access and read-only fashion, by all of the VUs
  • not read the whole file into memory, but instead be able to read from it in a streaming fashion, similar to the io.Reader Go interface
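
Purely as an illustration of those two requirements, a hypothetical shape for such functions could look like this (openShared() and openStream() are made-up names; nothing like this exists in k6 yet):

// Hypothetical sketch only; neither function exists in k6.
const shared = openShared("data.csv"); // requirement 1: one read-only copy for all VUs
const firstLine = shared.readLine(0);  // hypothetical random-access read

const stream = openStream("data.csv"); // requirement 2: io.Reader-style streaming
const chunk = stream.read(4096);       // hypothetical: read up to 4 KiB, never the whole file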

Somewhat orthogonal to the above concepts, in all cases (plain strings, the 2 points above, maybe even binary data), we will also need a way for VUs to read the data bit by bit, in various increments - bytes, characters, lines, null, untilChar(c), etc. And that also has at least 2 different flavors:

  • overlapping, where each VU is able to read a complete copy of all of the data
  • non-overlapping, where each VU reads a unique new character/line/segment from the file

It's in the second case where execution segments (#997) come in, since they offer a way to achieve non-overlapping reads of a single piece of static data without synchronization between VUs on different k6 instances. And even though the execution segments work for the new schedulers (#1007), it's not clear how we should expose them to the JS data traversal APIs. And it's currently an open question how we can allow users to loop over files multiple times. And how do we signal things like "end of file"...

In any case, up until now, I was describing only raw files, but CSV/JSON/XML/HTML/etc. are things that would have to work on top of the above machinery. Or at least on top of the first set of functions for opening a file - CSV parsing should be able to work with a simple string (i.e. the current result of an open() call), with a shared-memory file, and with a streaming file. And the CSV API should offer reading modes similar to the ones I described above (sketched after this list):

  • parse the whole raw file into one huge array of arrays/maps
  • overlapping reads, where each VU can call data.next() and get the same results as the other VUs
  • non-overlapping reads with segmentation, where each VU will get unique results when calling data.next()
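
A purely illustrative sketch of those three modes, using a made-up parseCsv() function and a made-up segment option (neither exists yet):

// Hypothetical sketch; parseCsv() and its options are made up.
const whole = parseCsv(open("data.csv")).all();  // 1) one big array of rows

const shared = parseCsv(open("data.csv"));       // 2) overlapping reads:
shared.next();                                   //    every VU sees row 1, then row 2, ...

const parts = parseCsv(open("data.csv"), { segment: true }); // 3) non-overlapping reads:
parts.next();                                    //    each VU only gets rows from its own
                                                 //    execution segment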

So, to get back to your proposal, I'm against an openCsv() function simply because opening a file shouldn't be tied to actually parsing it as a CSV/JSON/XML/etc., those should be 2 completely separate things.

Similarly, I think that the current html() and json() methods of http.Response aren't the best API design. The HTTP response object has a body property, which can very easily be passed on to an HTML/JSON/CSV/XML/etc. parsing function that can deal with it. That way, that function can also be reused for files, strings, websocket responses and whatever else we introduce. The tight coupling between the different format parsing functions and HTTP responses isn't very helpful and is only forcing us to spaghettify our code. So, in short, despite the unfortunate presence of http.Response.html() and http.Response.json(), I don't think we should continue the trend and add http.Response.csv().
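
To make that decoupling concrete, here's a small sketch; http.get(), res.body and open() are existing k6 APIs, while parseCsv() and the URL are made-up placeholders for whatever the parsing function ends up being:

import http from "k6/http";

// parseCsv() is hypothetical; the point is that it only needs a string.
const fromFile = parseCsv(open("data.csv"), { header: true }); // works on file contents...

export default function () {
  const res = http.get("https://example.com/data.csv");
  const rows = parseCsv(res.body, { header: true });           // ...and on response bodies
}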

I'll finish by saying that I actually like the parts of your proposal that deal with column names and header parsing! They are very intuitive and user-friendly. Some other CSV-specific comments:

  • we'll likely also want to support different delimiters, so that should also be configurable
  • maybe add support for comments, lazy quotes, things like that - see the Go csv.Reader docs for more possible options (a small illustrative sketch follows this list)
  • we need to figure out how we'll deal with error handling and corner cases - do malformed CSV rows raise an exception, etc.
  • this shouldn't be a problem, but we need to check how well goja can support the case where we can return an array of arrays when there aren't column names and an array of objects when there are
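
As a purely illustrative sketch, those options could mirror the fields of Go's csv.Reader, which k6 would presumably use under the hood (the JS option names below are made up; Comma, Comment and LazyQuotes are the actual Go fields):

// Hypothetical options object; none of this is implemented.
const data = parseCsv(open("data.tsv"), {
  delimiter: "\t",                     // csv.Reader.Comma: use tabs instead of commas
  comment: "#",                        // csv.Reader.Comment: skip lines starting with '#'
  lazyQuotes: true,                    // csv.Reader.LazyQuotes: tolerate stray quotes
  columns: [ "ingredient", "recipe" ], // explicit column names, as in the proposal above
});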

And finally, links to some of the connected issues: #532, #592, #992, #997, #1020

@na-- added the evaluation needed and feature labels May 16, 2019
@greyireland

What we need instead is a new function, or more likely - a set of functions, that will satisfy some of our requirements:

be able to read a whole file, but keep only a single copy in memory that can be accessed, in a random-access and read-only fashion, by all of the VUs
not read the whole file in memory, instead be able to read from it in a streaming fashion, similar to the io.Reader Go interface

I want to know how to implement this feature. Can you give some advice?
I have 1 GB of HTTP request data and I want to replay it with 100 VUs. How can I share the request data?

@na--
Member

na-- commented Oct 10, 2019

@greyireland, sorry for the late response. As I explained in my previous comment, implementing this is currently blocked by execution segments (implemented in #1007, which we'll hopefully finish and merge soon) and by us not having a clear picture of how the APIs that deal with streams/shared data/data partitioning/etc. should look.

You can relatively easily implement a streaming CSV reader for your use case, or adapt the one implemented in #612. Unfortunately, I can't guarantee that we'd merge something like that until we have a clearer idea of how everything needs to look.
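
For anyone with the same question, here's a rough sketch of the kind of workaround that was already possible with the existing API (the file name and column layout are made up; every VU still loads the whole file, but each VU only replays its own slice of the rows):

import http from "k6/http";

// Workaround sketch only, not the planned API.
const lines = open("requests.csv").split(/\r?\n/).filter((l) => l.length > 0);
const VUS = 100; // must match the configured number of VUs

export default function () {
  // __VU is 1-based and __ITER is this VU's 0-based iteration counter,
  // so every (VU, iteration) pair maps to a distinct row index.
  const idx = (__VU - 1) + __ITER * VUS;
  if (idx >= lines.length) return;
  const url = lines[idx].split(",")[0]; // assumes the first column holds the URL
  http.get(url);
}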

@sandeepojha930

@greyireland, did you achieve your requirement?

@greyireland

@greyireland, did you achieve your requirement?

Yes, I read the data from a file and share it.

@olegbespalov
Contributor

After an internal discussion, we've decided to close this task since the proposed API needs to be rethought and could be heavily impacted by the new File API (in particular streaming). #2978

@olegbespalov closed this as not planned Dec 4, 2023