Splitting a single logical resource across multiple files (was: Group of resources + remote schema) #661
Replies: 20 comments
-
@paulgirard great question. First some relevant existing issues:
If I understand it correctly, you have what sounds like a single "logical" resource that is split into multiple physical files. The options are then:
-
Thank you! I'll read over the issues to learn more, but I bet chunks might do it. PS: I've tried specifying the remote schema only for the first resource and it went fine, so that's a safe workaround I guess.
-
Dear @roll @rufuspollock, I can read the concept, and even a warning about headers, here: https://frictionlessdata.io/specs/data-resource/. I tried adding dialect.header = true but that's not it. Here is my multipart resource: {
"name": "flows",
"sources": [
{
"title": "Trade statistics from various countries archives and research papers, see data/sources.csv for details"
}
],
"encoding": "utf-8",
"profile": "tabular-data-resource",
"title": "Trade flows transcribed from various international trade statistics volumes",
"format": "csv",
"schema": "flows_schema.json",
"dialect": {
"header": true
},
"path": [
"data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_IndesOrientalesNeerlandaises_Fc2.csv",
"data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_191112.csv",
"data/flows/AnnalesDuCommerceExterieurFaitsCommerciaux3eSerieAvisDivers_OceanieEtAustralie_Fc7.csv",
"data/flows/ForeignCommerceAndNavigationOfTheUnitedStates_1938.csv",
]} |
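For illustration, the behaviour I'm after would look roughly like this minimal standalone Python sketch (not the datapackage lib API; the shortened file names are placeholders for the long paths above): every chunk carries its own header row, and concatenation keeps only the first one.

```python
import csv

def concat_multipart_csv(paths, out_path):
    """Concatenate CSV chunks that each start with a header row,
    keeping the header only from the first chunk."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for index, path in enumerate(paths):
            with open(path, newline="", encoding="utf-8") as chunk:
                rows = csv.reader(chunk)
                header = next(rows)          # every chunk carries a header
                if index == 0:
                    writer.writerow(header)  # emit it only once
                writer.writerows(rows)

# Hypothetical shortened paths standing in for the "path" entries above
concat_multipart_csv(
    ["data/flows/part-1.csv", "data/flows/part-2.csv"],
    "data/flows/flows-all.csv",
)
```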
-
@paulgirard great question 👍 I think this is under-specified in the spec 😉 This was actually discussed at some length in the original speccing here: frictionlessdata/datapackage#228. The recommendation at that time was to do something explicit, but this did not make it into the spec; I think we probably want something. My sense is that:
Here's the thread summarized: frictionlessdata/datapackage#228 (comment)
frictionlessdata/datapackage#228 (comment)
frictionlessdata/datapackage#228 (comment)
Rufus' final summary
...
-
I propose to prepare a PR about this, starting with the Python libs. But let's first agree on a solution. Based on the existing discussion summed up by @rufuspollock, I think we have three solutions:

solution 1 - chunks are CSV after all
Multipart chunks are considered to have a header even if they are chunks of a multipart resource:
"dialect": {
  "header": true
},
Should we just use this to define the chunk concatenation behaviour?

solution 2 - specific header option for multipart
In case the no-header-in-chunks behaviour should stay as in the current spec (i.e. only the first chunk has a header), I would prefer to add a multipart-specific option in the resource, like noHeadersInChunks: boolean.

solution 3 - magic default
Since we have a specification of what a header line must be, we could test whether the first line of every chunk is a header or not (see the sketch at the end of this comment). I generally don't like this kind of magical behaviour, but:

I need this headered-chunks behaviour and I can take the time to implement it, at least in the Python lib. IMHO, having chunks where only the first one has a header doesn't feel right. Let me know; it would be a pleasure to PR any of the solutions mentioned here.
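And to make solution 3 concrete, a rough sketch of the detection idea (the case-insensitive comparison against the schema's field names is only an assumption; the spec's definition of a header would govern):

```python
def first_row_is_header(first_row, field_names):
    """Guess whether a chunk's first row is a header by comparing it,
    case-insensitively, against the schema's field names."""
    if len(first_row) != len(field_names):
        return False
    cells = [cell.strip().lower() for cell in first_row]
    return cells == [name.lower() for name in field_names]
```

A chunk whose first row matches would have that row skipped during concatenation; anything else would be treated as data.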
-
I think before any actions are taken it should be clarified on the specs level. At the moment the multipart resource is part of the Data Resource spec. For:
path:
- picture100kb-1.jpeg
- picture100kb-2.jpeg
an implementation should concatenate the parts on a binary level to get picture200kb.jpeg.
Now let's say we have a tabular resource split into chunks where only the first chunk has a header row. Our implementation is going to do the same as with the picture (binary-level concatenation) and we will get a proper table.
Let's instead have chunks where every chunk carries its own header row. Now we can't concatenate them as binary chunks because of the header rows; we need to parse both chunks as tables. I don't see how an implementation should understand which to apply (binary concatenation vs logical concatenation).
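To illustrate the binary case, a minimal sketch (not the actual library code):

```python
def concat_binary(paths, out_path):
    """Byte-level concatenation: correct for the split picture, but it
    would embed stray header rows if applied to headered CSV chunks."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as part:
                out.write(part.read())
```

Applied to the two picture chunks this yields picture200kb.jpeg; applied to CSV chunks that each carry a header row, it leaves the extra header lines inside the data, which is exactly the ambiguity described above.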
-
What about adapting the concatenation implementation when the resource has a tabular-data-resource profile?
-
@paulgirard
-
Yes. tabular-data-resource => logical concatenation (i.e. handling the header row), leaving other cases on the binary level.
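In code, the dispatch could look roughly like this (a hedged sketch reusing the concat_multipart_csv and concat_binary helpers sketched earlier in this thread; the descriptor handling is an assumption, not the actual library API):

```python
def concatenate_parts(descriptor, paths, out_path):
    """Pick the concatenation strategy from the resource profile:
    header-aware joining for tabular resources, raw bytes otherwise."""
    if descriptor.get("profile") == "tabular-data-resource":
        concat_multipart_csv(paths, out_path)  # logical concatenation
    else:
        concat_binary(paths, out_path)         # binary concatenation
```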
-
@paulgirard
-
@roll @rufuspollock Note: I have to say that I am not very happy with this situation since I already switched to multipart in part of my projects. My bad, I should have checked before. I didn't find any info in the tabular data package spec about the header issue.
-
@roll there is no big issue here I think. The existing spec already says that by default you do simple concatenation (see https://frictionlessdata.io/specs/data-resource/). What's missing is the special case for tabular data. @paulgirard I think we go for option 1 for now, and if there is real demand we add an option to specify that files other than the first one are different (e.g. don't have headers). In short, expanding on what I said above (and you expanded in option 1):
The default is that there is a header row, and therefore that every file in parts has a header row.
PS: I wrote this before seeing your latest comment (3m ago!). I think you are good to go on implementing this, and I don't even think it is a change in the specs (though I will add a clarification, as it was simply a bug that this was not specified in tabular data resource!)
-
Ok, I'll work on option 1 and try to find a proper way to distinguish tabular resources. I'll let you know in this thread before issuing a PR.
-
@paulgirard A good thing is that multipart resources are only implemented in Python, as far as I know. Also, cc @pwalsh as one of the leads of the multipart resources feature, and @akariv. I think this thread is important for all core-team members to review.
-
@roll just FYI, I was the lead of the multipart feature (I was pretty much the lead for all of it 😉). I agree this could be a complex change, so let's see how it goes...
-
@paulgirard how did you get on here?
-
Sorry for having disappeared. I started my own company two weeks before the French COVID lockdown... such a mess. I will need a few more weeks before I can finish that up. Feel free to assign some of your team if you want to go faster, no problem for me! I'll let you know.
-
@rufuspollock @roll I finally took the time to finish up this work on multipart. I merged master and added a commit to better handle deprecated header chunks, as requested. Let me know if you can reopen the PR or if I should create a new one. Reopening would have the benefit of keeping the previous discussion in the history.
-
@paulgirard
-
@paulgirard I think this is done, but could you confirm? Then we can accept an answer here and mark it as done.
-
Dear all,
Resources can be grouped to allow splitting data into multiple files that share the same structure. I need this to split my dataset by the archival sources used: first because it's easier to version; then because it's easier to edit when corrections are needed (no need to open a HUGE CSV); also because it localizes validation errors to more precise files; and finally because it makes much more sense for a dataset that was created by transcribing many different archival books.
About resource groups, I think it's generally accepted that for a set of resources gathered in the same group, the schema can be set only on the first resource of the group.
Even if this isn't in the specs (there are no specs for groups as far as I know), I've seen code that loads packages that way.
My current issue is that with a package that has a group of resources, if a schema is indicated for all the resources of the group (which are numerous in my case, > 1000), then the lib will load the same schema as many times as there are resources.
Which is not ideal, above all if the basepath is actually remote...
Thus I see two ways out:
The affected datapackage : https://github.com/medialab/ricardo_data/blob/master/datapackage.json
I am about to try the first path by removing the schema from all but the first grouped resource.
Changing datapackage-js to add a special behaviour for groups should not be that hard.
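For the second path, the kind of change I have in mind would amount to caching the schema per reference; sketched in Python for brevity (datapackage-js would be analogous; load_schema is a hypothetical helper, not an existing API):

```python
import json
from functools import lru_cache
from urllib.request import urlopen

@lru_cache(maxsize=None)
def load_schema(ref):
    """Fetch and parse a schema once per distinct reference,
    whether it's a local path or a remote URL."""
    if ref.startswith(("http://", "https://")):
        with urlopen(ref) as response:
            return json.loads(response.read().decode("utf-8"))
    with open(ref, encoding="utf-8") as f:
        return json.load(f)

# A group of 1000 resources sharing one schema triggers a single fetch
schemas = [load_schema("flows_schema.json") for _ in range(1000)]
```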