Splitting a single logical resource across multiple files (was: Group of resources + remote schema) #661
Replies: 20 comments
-
@paulgirard great question. First some relevant existing issues:
If I understand it correctly, you have what sounds like a single "logical" resource that is split into multiple physical files. The options are then:
-
Thank you! I'll read over the issues to learn more, but I bet chunks might do it. PS: I've tried specifying the remote schema only for the first resource and it went fine, so that's a safe workaround I guess.
-
Dear @roll @rufuspollock, I can read the concept, and even a warning about headers, here: https://frictionlessdata.io/specs/data-resource/. I tried adding dialect.header = true but that's not it. Here is my multipart resource: {
"name": "flows",
"sources": [
{
"title": "Trade statistics from various countries archives and research papers, see data/sources.csv for details"
}
],
"encoding": "utf-8",
"profile": "tabular-data-resource",
"title": "Trade flows transcribed from various international trade statistics volumes",
"format": "csv",
"schema": "flows_schema.json",
"dialect": {
"header": true
},
"path": [
"data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_IndesOrientalesNeerlandaises_Fc2.csv",
"data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_191112.csv",
"data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_OceanieEtAustralie_Fc7.csv",
"data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_1938.csv",
]} |
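For illustration, the behaviour I'm after would look roughly like this minimal standalone Python sketch (not the datapackage lib API; the shortened file names are placeholders for the long paths above): every chunk carries its own header row, and concatenation keeps only the first one.

```python
import csv

def concat_multipart_csv(paths, out_path):
    """Concatenate CSV chunks that each start with a header row,
    keeping the header only from the first chunk."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for index, path in enumerate(paths):
            with open(path, newline="", encoding="utf-8") as chunk:
                rows = csv.reader(chunk)
                header = next(rows)          # every chunk carries a header
                if index == 0:
                    writer.writerow(header)  # emit it only once
                writer.writerows(rows)

# Hypothetical shortened paths standing in for the "path" entries above
concat_multipart_csv(
    ["data/flows/part-1.csv", "data/flows/part-2.csv"],
    "data/flows/flows-all.csv",
)
```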
-
@paulgirard great question 👍 I think this is under-specified in the spec 😉 This was actually discussed at some length in the original speccing here: frictionlessdata/datapackage#228. The recommendation at that time was to do something explicit, but this did not make it into the spec; I think we probably want something. My sense is that:
Here's the thread summarized: frictionlessdata/datapackage#228 (comment)
frictionlessdata/datapackage#228 (comment)
frictionlessdata/datapackage#228 (comment)
Rufus' final summary
...
-
I propose to prepare a PR about this, starting with the Python libs. But let's first agree on a solution. Based on the existing discussion summed up by @rufuspollock, I think we have three solutions:

solution 1 - chunks are CSV after all
Multipart chunks are considered to have a header even if they are chunks of a multipart resource:
"dialect": {
  "header": true
},
Should we just use this to define the chunk concatenation behaviour?

solution 2 - specific header option for multipart
In case the no-header-in-chunks behaviour should stay as in the current spec (i.e. only the first chunk has a header), I would prefer to add a multipart-specific option in the resource, like noHeadersInChunks: boolean.

solution 3 - magic default
Since we have a specification of what a header line must be, we could test whether the first line of every chunk is a header or not (see the sketch at the end of this comment). I generally don't like this kind of magical behaviour, but:

I need this headered-chunks behaviour and I can take the time to implement it, at least in the Python lib. IMHO, having chunks where only the first one has a header doesn't feel right. Let me know; it would be a pleasure to PR any of the solutions mentioned here.
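And to make solution 3 concrete, a rough sketch of the detection idea (the case-insensitive comparison against the schema's field names is only an assumption; the spec's definition of a header would govern):

```python
def first_row_is_header(first_row, field_names):
    """Guess whether a chunk's first row is a header by comparing it,
    case-insensitively, against the schema's field names."""
    if len(first_row) != len(field_names):
        return False
    cells = [cell.strip().lower() for cell in first_row]
    return cells == [name.lower() for name in field_names]
```

A chunk whose first row matches would have that row skipped during concatenation; anything else would be treated as data.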
-
I think before any actions are taken it should be clarified on the specs level. At the moment the multipart resource is part of the Data Resource spec. For:
path:
- picture100kb-1.jpeg
- picture100kb-2.jpeg
an implementation should concatenate the parts on a binary level to get picture200kb.jpeg.
Now let's say we have a tabular resource split into chunks where only the first chunk has a header row. Our implementation is going to do the same as with the picture (binary-level concatenation) and we will get a proper table.
Let's instead have chunks where every chunk carries its own header row. Now we can't concatenate them as binary chunks because of the header rows; we need to parse both chunks as tables. I don't see how an implementation should understand which to apply (binary concatenation vs logical concatenation).
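To illustrate the binary case, a minimal sketch (not the actual library code):

```python
def concat_binary(paths, out_path):
    """Byte-level concatenation: correct for the split picture, but it
    would embed stray header rows if applied to headered CSV chunks."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                out.write(part.read())
```

Applied to the two picture chunks this yields picture200kb.jpeg; applied to CSV chunks that each carry a header row, it leaves the extra header lines inside the data, which is exactly the ambiguity described above.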
-
What about adapting the concatenation implementation when the resource has a tabular-data-resource profile?
-
@paulgirard
-
Yes. tabular-data-resource => logical concatenation (i.e. handling the header row), leaving other cases on the binary level.
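In code, the dispatch could look roughly like this (a hedged sketch reusing the concat_multipart_csv and concat_binary helpers sketched earlier in this thread; the descriptor handling is an assumption, not the actual library API):

```python
def concatenate_parts(descriptor, paths, out_path):
    """Pick the concatenation strategy from the resource profile:
    header-aware joining for tabular resources, raw bytes otherwise."""
    if descriptor.get("profile") == "tabular-data-resource":
        concat_multipart_csv(paths, out_path)  # logical concatenation
    else:
        concat_binary(paths, out_path)         # binary concatenation
```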
-
@paulgirard
-
@roll @rufuspollock Note: I have to say that I am not very happy with this situation since I already switched to multipart in part of my projects. My bad, I should have checked before. I didn't find any info in the tabular data package spec about the header issue.
-
@roll there is no big issue here I think. The existing spec already says that by default you do simple concatenation (see https://frictionlessdata.io/specs/data-resource/). What's missing is the special case for tabular data. @paulgirard I think we go for option 1 for now, and if there is real demand we add an option to specify that files other than the first one are different (e.g. don't have headers). In short, expanding on what I said above (and you expanded in option 1):
The default is that there is a header row, and therefore that every file in parts has a header row.
PS: I wrote this before seeing your latest comment (3m ago!). I think you are good to go on implementing this, and I don't even think it is a change in the specs (though I will add a clarification, as it was simply a bug that this was not specified in tabular data resource!)
-
Ok, I'll work on option 1 and try to find a proper way to distinguish tabular resources. I'll let you know in this thread before issuing a PR.
-
@paulgirard A good thing is that multipart resources are only implemented in Python, as far as I know. Also, cc @pwalsh as one of the leads of the multipart resources feature, and @akariv. I think this thread is important for all core-team members to review.
-
@roll just FYI, I was the lead of the multipart feature (I was pretty much the lead for all of it 😉). I agree this could be a complex change, so let's see how it goes...
-
@paulgirard how did you get on here?
-
Sorry for having disappeared. I started my own company two weeks before the French COVID lockdown... such a mess. I will need a few more weeks before I can finish that up. Feel free to assign some of your team if you want to go faster, no problem for me! I'll let you know.
-
@rufuspollock @roll I finally took the time to finish up this work on multipart. I merged master and added a commit to better handle deprecated header chunks, as requested. Let me know if you can reopen the PR or if I should create a new one. Reopening would have the benefit of keeping the previous discussion in the history.
-
@paulgirard
-
@paulgirard I think this is done, but could you confirm? Then we can accept an answer here and mark it as done.
-
Dear all,
Resources can be grouped to allow splitting data into multiple files that share the same structure. I need this to split my dataset by the archival sources used: first because it's easier to version; then because it's easier to edit when corrections are needed (no need to open a HUGE CSV); also because it localizes validation errors to more precise files; and finally because it makes much more sense for a dataset that was created by transcribing many different archival books.
About resource groups, I think it's generally accepted that for a set of resources gathered in the same group, the schema can be set only on the first resource of the group.
Even if this isn't in the specs (there are no specs for groups as far as I know), I've seen code that loads packages that way.
My current issue is that with a package that has a group of resources, if a schema is indicated for all the resources of the group (which are numerous in my case, > 1000), then the lib will load the same schema as many times as there are resources.
Which is not ideal, above all if the basepath is actually remote...
Thus I see two ways out:
The affected datapackage : https://github.com/medialab/ricardo_data/blob/master/datapackage.json
I am about to try the first path by removing the schema from all but the first grouped resource.
Changing datapackage-js to add a special behaviour for groups should not be that hard.
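For the second path, the kind of change I have in mind would amount to caching the schema per reference; sketched in Python for brevity (datapackage-js would be analogous; load_schema is a hypothetical helper, not an existing API):

```python
import json
from functools import lru_cache
from urllib.request import urlopen

@lru_cache(maxsize=None)
def load_schema(ref):
    """Fetch and parse a schema once per distinct reference,
    whether it's a local path or a remote URL."""
    if ref.startswith(("http://", "https://")):
        with urlopen(ref) as response:
            return json.loads(response.read().decode("utf-8"))
    with open(ref, encoding="utf-8") as f:
        return json.load(f)

# A group of 1000 resources sharing one schema triggers a single fetch
schemas = [load_schema("flows_schema.json") for _ in range(1000)]
```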