Replace csv storage formats #163
Comments
Just adding to this: the current merge_output.csv is so riddled with git conflicts that it isn't usable by the public anyway.
Very rough draft of a potential JSON schema we could use for datasets. Note that the file records are nested as a property within the JSON object rather than having multiple JSON objects for the same dataset but different files.
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "owner": {
      "type": "string"
    },
    "pageURL": {
      "type": "string"
    },
    "dateCreated": {
      "type": "string"
    },
    "dateUpdated": {
      "type": "string"
    },
    "license": {
      "type": "string"
    },
    "description": {
      "type": "string"
    },
    "tags": {
      "type": "array",
      "description": "Could make an array of objects with specifier for tags from original dataset, ones manually added and ones added by the pipeline",
      "items": {
        "type": "string"
      }
    },
    "resources": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "fileName": {
            "type": "string"
          },
          "fileSize": {
            "type": "string"
          },
          "fileSizeUnit": {
            "type": "string",
            "description": "Could we do away with this prop and just enforce file sizes to be bytes?"
          },
          "fileType": {
            "type": "string"
          },
          "assetUrl": {
            "type": "string"
          },
          "dateCreated": {
            "type": "string"
          },
          "dateUpdated": {
            "type": "string"
          },
          "numRecords": {
            "type": "number"
          }
        },
        "required": [
          "fileName",
          "fileType",
          "assetUrl"
        ]
      }
    }
  },
  "required": [
    "title",
    "owner",
    "pageURL",
    "dateCreated"
  ]
}
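To make the draft concrete, here is a minimal sketch of a record that would fit the schema, plus a hand-rolled check of the two `required` lists using only the Python standard library. The sample values and the helper name `missing_required` are made up for illustration; a real pipeline would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical sample record; all values are invented for illustration.
record = {
    "title": "Example dataset",
    "owner": "Example Council",
    "pageURL": "https://example.org/dataset/example",
    "dateCreated": "2021-01-31",
    "tags": ["transport", "open-data"],
    "resources": [
        {
            "fileName": "example.csv",
            "fileType": "CSV",
            "assetUrl": "https://example.org/files/example.csv",
            "numRecords": 120,
        },
        # Deliberately missing fileType and assetUrl.
        {"fileName": "broken.csv"},
    ],
}

# Copied from the "required" lists in the draft schema.
DATASET_REQUIRED = ["title", "owner", "pageURL", "dateCreated"]
RESOURCE_REQUIRED = ["fileName", "fileType", "assetUrl"]


def missing_required(dataset):
    """Return paths of required keys that are absent from the dataset."""
    problems = [key for key in DATASET_REQUIRED if key not in dataset]
    for index, resource in enumerate(dataset.get("resources", [])):
        for key in RESOURCE_REQUIRED:
            if key not in resource:
                problems.append(f"resources[{index}].{key}")
    return problems


problems = missing_required(record)
print(json.dumps(problems, indent=2))
```

This only checks key presence, not types, so it is a much weaker test than validating against the schema itself.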
Started branch 163-replace-csv-storage-formats for this.
Note: having merged_output.csv doesn't actually lose us any data, because export2jkan was reading the line breaks correctly. It looks like our issues with .csv are purely aesthetic; functionally it's fine. It's still worth changing to .json, because problems are easier to trace back there than in .csv, given the rendering issues.
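As a rough illustration of why the file can look broken while still parsing correctly, the sketch below compares naive line splitting with Python's `csv` module on a field that contains a line break. The sample data is invented, and this says nothing about how export2jkan itself is implemented.

```python
import csv
import io

# Hypothetical two-column CSV where one description contains a line break.
raw = (
    'title,description\r\n'
    '"Dataset A","First line\nsecond line"\r\n'
    '"Dataset B","Single line"\r\n'
)

# Splitting on line endings (roughly what a text viewer or diff shows)
# breaks the quoted field across two lines.
naive_lines = raw.splitlines()

# A CSV parser keeps the quoted line break inside a single field.
rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_lines))  # physical lines in the file
print(len(rows))         # logical rows, header included
print(repr(rows[1][1]))
```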
So far this project has relied on csv files to store the outputs of each process and feed them into the next. That has been working, but as the variety and volume of listings and publishers grow, we're starting to see issues with encoding, line endings, quoting, arrays, etc.
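The array problem in particular is easy to reproduce. In this minimal sketch (field names borrowed from the draft schema, values invented), a list of tags written with Python's `csv` module comes back as a plain string, while the same record survives a JSON round trip unchanged.

```python
import csv
import io
import json

# Hypothetical record with an array-valued field.
record = {"title": "Example dataset", "tags": ["transport", "open data"]}

# CSV: the list is stringified on write and comes back as plain text.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "tags"])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON: the list survives the round trip as a list.
json_row = json.loads(json.dumps(record))

print(type(csv_row["tags"]).__name__, csv_row["tags"])
print(type(json_row["tags"]).__name__, json_row["tags"])
```

Any CSV-based pipeline has to agree on its own convention for encoding and decoding such fields, which is one more place for things to go wrong.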
This issue needs to consider replacing csv files as storage for:
JSON has been suggested but we're not closed to other options.
As a secondary outcome, we'll still want to provide a .csv as output for public users to download, but this should be published output only.
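One possible shape for that published output, sketched under assumptions: if datasets were stored as JSON with nested `resources` as in the draft schema, a public CSV could be produced by emitting one flat row per resource. The column selection, the `;` separator for tags, and the function names `flatten` and `to_csv` are all hypothetical choices, not something the project has decided.

```python
import csv
import io

# Hypothetical records shaped like the draft schema; values are invented.
datasets = [
    {
        "title": "Example dataset",
        "owner": "Example Council",
        "pageURL": "https://example.org/dataset/example",
        "tags": ["transport", "open data"],
        "resources": [
            {"fileName": "a.csv", "fileType": "CSV",
             "assetUrl": "https://example.org/a.csv"},
            {"fileName": "b.json", "fileType": "JSON",
             "assetUrl": "https://example.org/b.json"},
        ],
    },
    {
        "title": "Dataset without files",
        "owner": "Example Council",
        "pageURL": "https://example.org/dataset/empty",
        "resources": [],
    },
]

FIELDS = ["title", "owner", "pageURL", "tags", "fileName", "fileType", "assetUrl"]


def flatten(records):
    """Yield one flat row per resource; datasets with no resources get one row."""
    for dataset in records:
        for resource in dataset.get("resources") or [{}]:
            yield {
                "title": dataset["title"],
                "owner": dataset["owner"],
                "pageURL": dataset["pageURL"],
                # Assumed convention: join tags with ';' for the flat export.
                "tags": ";".join(dataset.get("tags", [])),
                "fileName": resource.get("fileName", ""),
                "fileType": resource.get("fileType", ""),
                "assetUrl": resource.get("assetUrl", ""),
            }


def to_csv(records):
    """Render the flattened rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(flatten(records))
    return buffer.getvalue()


csv_text = to_csv(datasets)
print(csv_text)
```

Because the CSV is generated from the JSON on each publish, it never needs to be merged by hand or read back in by later pipeline steps.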
See process flow for current system and PR #160