Promote "Compression of resources" recipes to the Data Resource spec #1023

roll · 2024-04-11T13:47:46Z

roll
Apr 11, 2024
Maintainer

Overview

There is a fairly simple recipe (https://datapackage.org/recipes/compression-of-resources/) that adds a new resource.compression property to the Data Resource spec. It's supported by frictionless-py.
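For readers who have not seen the recipe, here is a rough sketch of what a resource descriptor using the property might look like, written as a Python dict so it can be serialized to JSON. The resource name, the URL, and the "gz" value are illustrative assumptions, not text quoted from the recipe.

```python
import json

# Hypothetical resource descriptor, for illustration only.
resource = {
    "name": "large-data-file",
    "path": "http://example.com/large-data-file.csv.gz",
    "format": "csv",
    "mediatype": "text/csv",
    "encoding": "utf-8",
    # The property under discussion; "gz" is an assumed example value.
    "compression": "gz",
}

# Serialize the way it would appear inside datapackage.json.
print(json.dumps(resource, indent=2))
```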

Note

It might make sense to consider frictionless-py's resource.innerPath as well for providing a path inside an archive.

peterdesmet · 2024-04-15T08:18:23Z

peterdesmet
Apr 15, 2024
Collaborator

frictionless-r ignores the resource.compression property. It does support reading compressed files based on the extension found in path: https://docs.ropensci.org/frictionless/reference/read_resource.html#file-compression

So I'm neutral regarding promoting this to the specs: frictionless-r will likely continue to ignore it.
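To make "based on the extension found in path" concrete, here is a minimal Python sketch of extension-based inference. It is not how frictionless-r implements it; the function name and the suffix table are hypothetical, and the two suffixes are placeholders for whatever list an implementation actually supports.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumed suffix table; a real implementation may support more formats.
KNOWN_SUFFIXES = {".gz": "gz", ".zip": "zip"}


def infer_compression(path: str) -> Optional[str]:
    """Return a compression label from the last suffix of `path`, or None."""
    # urlparse leaves a relative local path entirely in `.path` and drops
    # any query string or fragment from a URL.
    suffix = PurePosixPath(urlparse(path).path).suffix.lower()
    return KNOWN_SUFFIXES.get(suffix)
```

This only looks at POSIX-style paths and URLs; Windows drive-letter paths would need separate handling.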


peterdesmet · 2024-04-15T08:19:44Z

peterdesmet
Apr 15, 2024
Collaborator

Regarding resource.innerPath, I'd rather not see an additional property for files, since we already have to deal with data and path. I would express this in path, as follows:

"path": "path/to/my/archive.zip/data.csv"
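A reader would then have to split such a path into the archive and the file inside it. A minimal sketch of that split in Python, assuming ZIP is the only archive format and using a hypothetical helper name:

```python
from typing import Optional, Tuple


def split_archive_path(path: str) -> Tuple[str, Optional[str]]:
    """Split 'a/b/archive.zip/inner.csv' into ('a/b/archive.zip', 'inner.csv').

    Returns (path, None) when no non-final segment ends with '.zip'.
    """
    parts = path.split("/")
    # Only non-final segments can act as a container for an inner path.
    for i, part in enumerate(parts[:-1]):
        if part.lower().endswith(".zip"):
            return "/".join(parts[: i + 1]), "/".join(parts[i + 1:])
    return path, None
```

Note that this is purely textual: it cannot tell whether a segment named archive.zip is really an archive or just a directory with that name.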


khusmann · 2024-04-15T16:22:29Z

khusmann
Apr 15, 2024
Collaborator

> frictionless-r ignores the resource.compression property.
> So I'm neutral regarding promoting this to the specs

Agreed -- inferring compression from the path seems just fine. Do we get some other value from having a "compression" property that I'm missing?

> Regarding resource.innerPath, I'd rather not see an additional property for files, since we already have to deal with data and path. I would express this in path, as follows:

I think this works for local paths, but not as well for remote ones, because it's harder to detect where the zip file is (archive.zip may not be an actual zip file but just a segment of the URL, and that is harder to check remotely than locally).

That said, I'm not keen on innerPath either. Regarding compression I would expect two main scenarios / use cases:

1) Individual resources are compressed (as described in the pattern). This is useful for data packages hosted remotely -- you only need to download the data you need (and it is transferred compressed). This doesn't need innerPath because each compressed file contains only one file.
2) The entire data package (including datapackage.json) is compressed. This is useful for an archival blob of the entire package that can be distributed as a single unit, without dependencies (similar to an OpenDocument spreadsheet with multiple sheets). This doesn't need innerPath either, because everything is already inside the zip file, so paths are already internal to the zip.
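The second scenario can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not frictionless-py's implementation: the function name is hypothetical, datapackage.json is assumed to sit at the root of the ZIP, and only single-string resource paths are handled.

```python
import json
import zipfile


def read_zipped_package(archive) -> dict:
    """Load datapackage.json from a ZIP and read each resource's raw bytes.

    `archive` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        descriptor = json.loads(zf.read("datapackage.json"))
        data = {}
        for resource in descriptor.get("resources", []):
            path = resource.get("path")
            # Resource paths resolve relative to the descriptor, so they are
            # already member names inside the archive.
            if isinstance(path, str):
                data[resource["name"]] = zf.read(path)
    return {"descriptor": descriptor, "data": data}
```

No extra property is needed here: the ordinary resource path doubles as the path inside the archive.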

innerPath is only applicable when a) a data package references multiple resources in a single zip and b) the datapackage.json isn't included in that level of compression. I don't think this is something we want to support / encourage -- or maybe I'm missing a benefit / use case?


peterdesmet · 2024-04-15T16:32:28Z

peterdesmet
Apr 15, 2024
Collaborator

> I don't think this is something we want to support / encourage.

I agree.


roll · 2024-04-16T07:18:50Z

roll
Apr 16, 2024
Maintainer Author

> Agreed -- inferring compression from the path seems just fine. Do we get some other value from having a "compression" property that I'm missing?

I will play a devil's advocate role here, but you know, inferring a Table Schema usually works just fine as well 😃 So in my opinion it is just a question of improving interoperability and documentation quality. I think, currently, the spec doesn't mention compression at all, so the behavior is effectively undefined. I think we at least need to clarify it. On the other hand, since there is already resource.format, resource.compression feels like the same kind of indicator.

> That said, I'm not keen on innerPath either. Regarding compression I would expect two main scenarios / use cases:

Regarding innerPath, I think it's only applicable if a data publisher has to use some artifact (e.g. a ZIP file) that they cannot control, so they map resources from this archive similarly to how Excel sheets are mapped onto resources with Table Dialect.


khusmann · 2024-04-16T17:53:13Z

khusmann
Apr 16, 2024
Collaborator

> I will play a devil's advocate role here, but you know, inferring a Table Schema usually works just fine as well 😃

Haha, to play counter devil's advocate: It's standard for file names to include compression type in their extension (file1.csv.zip, file2.csv.gz), but there's not a similar standard for field names to include frictionless field type information (column1.integer, column2.boolean, column3.number).

To me, the question is, do we want to allow compressed paths without extensions, or compressed paths with extensions that don't match the compression type? Otherwise, resource.compression is redundant and will be largely ignored by implementations (as frictionless-r does right now).
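To show where the two sources of information can collide, here is a small Python sketch. The policy it encodes (an explicit property wins, a mismatch is an error) is only one possible answer and an assumption on my part; the function name and the gz/zip list are hypothetical too.

```python
from typing import Optional


def resolve_compression(resource: dict) -> Optional[str]:
    """Combine a declared `compression` with one inferred from a string `path`."""
    declared = resource.get("compression")
    path = resource.get("path", "").lower()
    # Assumed list of recognized extensions.
    inferred = next((c for c in ("gz", "zip") if path.endswith("." + c)), None)
    if declared and inferred and declared != inferred:
        # Extension and property disagree: the case the spec would have to rule on.
        raise ValueError(f"compression {declared!r} conflicts with path suffix {inferred!r}")
    # Covers extension-less paths (declared only) and plain inference.
    return declared or inferred
```

If the spec disallowed extension-less and mismatched paths, the `declared` branch would never add information, which is the redundancy described above.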

> I think it's only applicable if a data publisher has to use some artifact (e.g. a ZIP file) that they cannot control, so they map resources from this archive

If they don't control the ZIP, then there's all kinds of malformed scenarios we can imagine... The question is where we draw the line.

For example, the ZIP could have other nested ZIPs in it, which in turn hold the table data... That would require nested innerPath properties, which I don't think we should support either.

> similarly to how Excel sheets are mapped onto resources with Table Dialect

Selecting an excel sheet in a workbook or a table in an SQLite db is a lot more well-defined, I think, because unlike a generic multi-file archive they have a lot of guarantees / constraints (e.g. they only hold a specific kind of table data and cannot be nested)

Side note -- Does the sheetName property in Table Dialect also allow you to select particular tables in an SQL db? (are SQLite DBs considered as "spreadsheet" formats?)


khusmann · 2024-04-16T18:22:51Z

khusmann
Apr 16, 2024
Collaborator

> I think, currently, the spec doesn't mention compression at all, so the behavior is effectively undefined. I think we at least need to clarify it.

I agree on this though! I'd suggest something like 1) compression type MAY be specified via path extension (and here's a supported list of formats) and 2) when paths to archives are used, they MUST contain only one file.
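Rule 2 is cheap to check for ZIP archives. A minimal sketch with Python's standard zipfile module, where the function name is hypothetical and directory entries are assumed not to count as files:

```python
import zipfile


def single_member(archive) -> str:
    """Return the name of the only file in a ZIP, or raise ValueError.

    `archive` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Directory entries end with "/" and are not counted as files.
        files = [name for name in zf.namelist() if not name.endswith("/")]
    if len(files) != 1:
        raise ValueError(f"expected exactly one file in archive, found {len(files)}")
    return files[0]
```

A gzip stream needs no such check, since it wraps a single payload.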


roll · 2024-04-17T07:01:40Z

roll
Apr 17, 2024
Maintainer Author

> It's standard for file names to include compression type in their extension (file1.csv.zip, file2.csv.gz)

I think it's more like a convention rather than a standard

> Side note -- Does the sheetName property in Table Dialect also allow you to select particular tables in an SQL db? (are SQLite DBs considered as "spreadsheet" formats?)

No, we had a table property in the draft for SQL, but I removed it for now to wait for an actual user request.

> I agree on this though! I'd suggest something like 1) compression type MAY be specified via path extension (and here's a supported list of formats) and 2) when paths to archives are used, they MUST contain only one file.

I think, currently, we don't define anything regarding the form of resource.path (regarding format or compression). We might consider adding compression information to https://datapackage.org/specifications/data-resource/#path-or-data-required. Personally, I don't have preferences -- requiring one file per archive seems a reasonable approach. My main point here is that as "an implementor" I need some clear definition like "if it is an archived file, resource.path MUST end with a .gz or .zip suffix indicating the compression algorithm" (and reading this sentence I still feel that a dedicated property might be kind of better than parsing a path 😃)


fjuniorr · 2024-04-17T10:28:22Z

fjuniorr
Apr 17, 2024

> No, we had a table property in the draft for SQL, but I removed it for now to wait for an actual user request.

@roll you mean something like this would not be supported? I do use this internally and crafted this gist for a user query in Frictionless Slack.


roll · 2024-04-17T12:28:26Z

roll
Apr 17, 2024
Maintainer Author

@fjuniorr
We can totally add dialect.table if there is a demand for it cc @dafeder


khusmann · 2024-04-17T17:02:24Z

khusmann
Apr 17, 2024
Collaborator

> I think it's more like a convention rather than a standard

I agree, I was being sloppy with my language there :)

To be clear, I'm neutral on the resource.compression property. My only slight preference here is that we choose to use the extension in the path, OR resource.compression, but not both... that way they are not in competition. But I defer to stronger opinions on this.

> My main point here is that as "an implementor" I need some clear definition
> We might consider adding compression information to https://datapackage.org/specifications/data-resource/#path-or-data-required.

Agreed! I think that's a perfect place to put this info.

I do think enforcing one file per archive is a good idea (no innerPath property). This way it stays natural to specify multi-part resources with compression, and we don't need an exception for archives (e.g. "path" = ["file1.csv.gz", "file2.csv.gz"]). We can always relax the one file per archive rule and add an innerPath if there's demand for it later.
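As a rough illustration of why one file per archive keeps multi-part resources simple, here is a Python sketch that decompresses each part and joins the results in order. The function name is hypothetical, and it assumes the parts are meant to be concatenated byte for byte; how header rows in later parts are treated is left out.

```python
import gzip
from typing import Iterable


def read_multipart_gz(paths: Iterable[str]) -> bytes:
    """Decompress each .gz part and concatenate the payloads in list order."""
    chunks = []
    for part in paths:
        # Each part holds exactly one file, so no inner path is needed.
        with gzip.open(part, "rb") as fh:
            chunks.append(fh.read())
    return b"".join(chunks)
```

With multi-file archives, every entry in the path list would need its own inner path, which is the exception this rule avoids.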

I also think it'd be nice to mention somewhere the practice of compressing entire data packages. (frictionless-py already supports this as well).

> We can totally add dialect.table if there is a demand

I would also very much support this... I was surprised to see spreadsheet support & mention of sql databases, but no clear way to select an sql table :)

