Currently this collection of data sets is very chaotic. However, it has reached a critical mass of individual examples, such that it might now be possible to distill the commonalities between them and form a coherent general structure.
There are two main things to deal with: directory structure and file structure.
Directory Structure
Ideally, the directory structure should be traversable by a program that builds an index of all the data sets. To make this straightforward, one solution might be to introduce a fixed hierarchical structure like this:
{{dataSource}}/{{dataCollection}}/{{dataSet}}
where
dataSource represents the organization that originally published the data.
dataCollection represents a category of published data sets.
dataSet represents an individual data table.
File Structure
The dsv-dataset project provides a metadata specification for annotating data sets with column types so they can be automatically parsed. The file structure should leverage dsv-dataset.
Perhaps each data set could have two files, one with the CSV data, and one with the metadata, like this:
{{dataSource}}/{{dataCollection}}/{{dataSet}}/data.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/metadata.json
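For example, metadata.json could carry the column-type annotations that dsv-dataset uses to parse the CSV. The field names below are a hedged sketch of what such a file might contain, not a quotation of the dsv-dataset spec:

```json
{
  "delimiter": ",",
  "columns": [
    { "name": "country",    "type": "string" },
    { "name": "year",       "type": "number" },
    { "name": "population", "type": "number" }
  ]
}
```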
The disadvantage of this is that folks who want to move the .csv file into a different context will need to spend time thinking about what to name it, or just leave it as data.csv, which is rather generic.

Alternatively, the files could take on the name of the data set, like this:
{{dataSource}}/{{dataCollection}}/{{dataSet}}/{{dataSet}}.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/{{dataSet}}.json
As yet another alternative, data sets and their metadata could be combined into a single JSON file, whose contents might look something like this:
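The contents of such a combined file are not pinned down by anything above; as a hedged sketch (the "metadata" and "data" field names are illustrative, not quoted from the dsv-dataset spec), it might look like:

```json
{
  "metadata": {
    "delimiter": ",",
    "columns": [
      { "name": "country",    "type": "string" },
      { "name": "population", "type": "number" }
    ]
  },
  "data": "country,population\nIndia,1380004385\n"
}
```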
This could live in a single file:
{{dataSource}}/{{dataCollection}}/{{dataSet}}/{{dataSet}}.json
The disadvantage of this approach is that anyone who wants to just get the CSV data out will need to write some code, rather than just copy an existing CSV file.
Alternatively, the data sets could live in the data collection directory, like this:
{{dataSource}}/{{dataCollection}}/{{dataSet}}.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}.json
This would make the data collection directories kind of messy. Also, having a README.md at each level might be a good thing, which would make it favorable to have each data set reside in its own directory, like this:
{{dataSource}}/README.md
{{dataSource}}/{{dataCollection}}/README.md
{{dataSource}}/{{dataCollection}}/{{dataSet}}/README.md
Also, it might be nice to have an index at each level, so programs can query for what is there. These files could simply contain arrays of strings. This would make the full file layout look something like this:
{{dataSource}}/README.md
{{dataSource}}/dataCollections.json
{{dataSource}}/{{dataCollection}}/README.md
{{dataSource}}/{{dataCollection}}/dataSets.json
{{dataSource}}/{{dataCollection}}/{{dataSet}}/README.md
{{dataSource}}/{{dataCollection}}/{{dataSet}}/data.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/metadata.json
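With index files at each level, the traversal program becomes simple. As a sketch (assuming dataCollections.json and dataSets.json each contain a JSON array of directory names, as described above; the function name build_index is illustrative), in Python:

```python
import json
from pathlib import Path

def build_index(root):
    """Walk root/{{dataSource}}/... using the per-level index files
    and return a flat list of (dataSource, dataCollection, dataSet)."""
    index = []
    root = Path(root)
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        collections = json.loads(
            (source_dir / "dataCollections.json").read_text()
        )
        for collection in collections:
            data_sets = json.loads(
                (source_dir / collection / "dataSets.json").read_text()
            )
            for data_set in data_sets:
                index.append((source_dir.name, collection, data_set))
    return index
```

Each tuple identifies one data set, whose data.csv and metadata.json a program could then fetch by convention.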
Trees
What about cases where a data cube is partitioned across files, where each file contains a portion of the fact table where a certain dimension equals a certain value? For example, a data set may be partitioned across many files, one file per year. Or the partitioning could use one file per geographic region. Perhaps the directory tree can be built in such a way that it is possible to have a directory full of CSV files within a given data set, and all of them can share the same metadata file.
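Under that scheme, each partition would be a sibling .csv file inside the data set directory (e.g. one per year), all governed by the single shared metadata.json. A hedged Python sketch of a reader for such a directory (the function name is illustrative, and Python's csv module stands in for a real dsv-dataset parser):

```python
import csv
import json
from pathlib import Path

def read_partitioned(data_set_dir):
    """Read every .csv partition in a data set directory,
    all of which share the one metadata.json alongside them."""
    data_set_dir = Path(data_set_dir)
    metadata = json.loads((data_set_dir / "metadata.json").read_text())
    rows = []
    for csv_path in sorted(data_set_dir.glob("*.csv")):
        with csv_path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return metadata, rows
```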