Replace csv storage formats #163

KarenJewell · 2022-09-28T23:49:56Z

So far this project has relied on csv files to store the output of each step and feed it into the next, and that's been working. But as the variety and volume of listings and publishers grow, we're starting to see issues with encoding, line endings, quoting, arrays, etc.

This issue needs to consider replacing csv files as storage for:

web scrapers output (extract)
merge_data.py output (aggregate and clean)

JSON has been suggested but we're not closed to other options.

As a secondary outcome, we'll still want to provide a .csv as output for public users to download, but this should be published output only.

See process flow for current system and PR #160
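A minimal sketch of the split proposed above, assuming JSON for internal storage and a flattened csv generated only for public download. The record fields, the `;` separator for arrays, and the function names `to_json_text` and `to_public_csv` are illustrative assumptions, not part of the existing pipeline.

```python
import csv
import io
import json

# Hypothetical scraper records; field names and values are illustrative only.
records = [
    {
        "title": "Bin collection routes",
        "owner": "Example Council",
        "description": "Routes, updated weekly.\nIncludes \"garden\" waste.",
        "tags": ["waste", "recycling"],
    }
]


def to_json_text(rows):
    """Serialise records for internal storage; arrays and newlines are kept intact."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_public_csv(rows):
    """Flatten records into csv text for download; list values are joined with ';'."""
    buffer = io.StringIO(newline="")
    fieldnames = ["title", "owner", "description", "tags"]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        flat["tags"] = ";".join(row.get("tags", []))
        writer.writerow(flat)
    return buffer.getvalue()


stored = to_json_text(records)        # what the pipeline steps would pass along
restored = json.loads(stored)         # what the next step would read back
public_csv = to_public_csv(restored)  # published output only
```

In this sketch the JSON round trip keeps `tags` as a real list, while the csv export has to pick a convention for arrays, which is one reason to keep csv at the publishing edge only.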

KarenJewell · 2022-10-01T18:49:09Z

Just adding to this: the current merge_output.csv is so riddled with git conflicts that it's not currently usable by the public anyway.

JackGilmore · 2022-10-05T15:25:59Z

Very rough draft of a potential JSON schema we could use for datasets. Note that the file records are nested as a property within the JSON object rather than having multiple JSON objects for the same dataset but different files.

{
    "type": "object",
    "properties": {
        "title": {
            "type": "string"
        },
        "owner": {
            "type": "string"
        },
        "pageURL": {
            "type": "string"
        },
        "dateCreated": {
            "type": "string"
        },
        "dateUpdated": {
            "type": "string"
        },
        "license": {
            "type": "string"
        },
        "description": {
            "type": "string"
        },
        "tags": {
            "type": "array",
            "description": "Could make an array of objects with specifier for tags from original dataset, ones manually added and ones added by the pipeline",
            "items": {
                "type": "string"
            }
        },
        "resources": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "fileName": {
                        "type": "string"
                    },
                    "fileSize": {
                        "type": "string"
                    },
                    "fileSizeUnit": {
                        "type": "string",
                        "description": "Could we do away with this prop and just enforce file sizes to be bytes?"
                    },
                    "fileType": {
                        "type": "string"
                    },
                    "assetUrl": {
                        "type": "string"
                    },
                    "dateCreated": {
                        "type": "string"
                    },
                    "dateUpdated": {
                        "type": "string"
                    },
                    "numRecords": {
                        "type": "number"
                    }
                },
                "required": [
                    "fileName",
                    "fileType",
                    "assetUrl"
                ]
            }
        }
    },
    "required": [
        "title",
        "owner",
        "pageURL",
        "dateCreated"
    ]
}
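To make the draft concrete, here is a rough sketch of one record shaped like the schema above, plus a check for the keys the draft marks as required. The example values are made up, and `missing_required` is a hypothetical helper that only looks for absent keys; it is not a full JSON Schema validator.

```python
import json

# Required keys copied from the draft schema above.
REQUIRED_DATASET_KEYS = ["title", "owner", "pageURL", "dateCreated"]
REQUIRED_RESOURCE_KEYS = ["fileName", "fileType", "assetUrl"]


def missing_required(dataset):
    """Return paths of required keys that are absent from a dataset record."""
    missing = [key for key in REQUIRED_DATASET_KEYS if key not in dataset]
    for index, resource in enumerate(dataset.get("resources", [])):
        for key in REQUIRED_RESOURCE_KEYS:
            if key not in resource:
                missing.append(f"resources[{index}].{key}")
    return missing


# Made-up example record; every value here is a placeholder.
example = {
    "title": "Bin collection routes",
    "owner": "Example Council",
    "pageURL": "https://example.org/datasets/bin-routes",
    "dateCreated": "2022-10-05",
    "tags": ["waste", "recycling"],
    "resources": [
        {
            "fileName": "routes.csv",
            "fileType": "CSV",
            "assetUrl": "https://example.org/files/routes.csv",
            "numRecords": 120,
        }
    ],
}

print(json.dumps(example, indent=2))
print(missing_required(example))
```

Because the files are nested under `resources`, one dataset with several files stays a single JSON object, as described above.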

KarenJewell · 2023-01-29T21:25:48Z

KarenJewell · 2023-01-29T21:26:38Z

Note: keeping merged_output.csv doesn't actually lose us any data, because export2jkan was reading linebreaks correctly. It looks like our issues with .csv are aesthetic only; functionally it's fine. It's still worth changing to .json, because problems are easier to trace back there than in .csv, given the rendering issues.
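A small illustration of the point above, using only Python's standard library rather than the project's own export2jkan code. A field containing a linebreak spans two physical lines in the csv text, which is what looks broken in a line-based view, yet a csv parser still reads it as one row. The sample title and description are invented for the example.

```python
import csv
import io
import json

# Invented field value with an embedded linebreak.
description = "First line\nSecond line"

# Write a single csv row; the csv module quotes the field that contains a newline.
buffer = io.StringIO(newline="")
csv.writer(buffer, lineterminator="\n").writerow(["Example dataset", description])
csv_text = buffer.getvalue()

# A line-based view (diffs, plain text rendering) sees two lines for one record.
physical_lines = csv_text.splitlines()

# A csv parser sees one row, with the linebreak preserved inside the field.
parsed_rows = list(csv.reader(io.StringIO(csv_text, newline="")))

# JSON escapes the linebreak, so the same record stays on one physical line.
json_line = json.dumps({"title": "Example dataset", "description": description})
```

Here the data survives either way; the difference is that the JSON form keeps one record per line when serialised compactly, which makes diffs and conflicts easier to follow.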

KarenJewell added the data engineering (Things related to data: scraping, cleaning, labelling, transformation) and back end labels Sep 28, 2022

KarenJewell mentioned this issue Sep 28, 2022

CKAN and statistics.gov.scot bug fixes #160

Merged

KarenJewell mentioned this issue Oct 1, 2022

update tidy_licence function #154

Merged

JackGilmore added this to Open Data Scotland 2024 Oct 13, 2022

JackGilmore moved this to Todo in Open Data Scotland 2024 Oct 13, 2022

This was referenced Jan 7, 2023

Replace the_od_bods/data/ with every pipeline run OpenDataScotland/opendata.scot_pipeline#2

Closed

National Library Scotland Multiple Data Downloads #130 #134

Merged

KarenJewell moved this from Backlog to In Progress in Open Data Scotland 2024 Jan 29, 2023

KarenJewell self-assigned this Jan 29, 2023

KarenJewell linked a pull request Feb 18, 2023 that will close this issue

163 replace csv storage formats #227

Merged

9 tasks

KarenJewell closed this as completed in #227 Feb 22, 2023

github-project-automation bot moved this from In Progress to Done in Open Data Scotland 2024 Feb 22, 2023
