Replace csv storage formats #163

Closed
KarenJewell opened this issue Sep 28, 2022 · 4 comments · Fixed by #227
Labels: data engineering (Things related to data: scraping, cleaning, labelling, transformation)

Comments

@KarenJewell
Member

So far this project has relied on csv files to store outputs and feed them into the following processes. That has been working, but as the variety and volume of listings and publishers grow, we're starting to see issues with encoding, line endings, quoting, arrays, etc.
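To illustrate the "arrays" problem, here is a minimal sketch of what happens to a list-valued field in each format. The record, its field names, and the ";" separator are made up for the example and are not the project's real columns or conventions.

```python
import csv
import io
import json

# Hypothetical listing; field names are illustrative only.
record = {
    "title": "Air quality, 2021",
    "tags": ["air", "environment, health"],
}

# CSV has no list type, so the tags must be flattened into a single string.
# The ";" separator is an assumption; every reader has to know it to undo this.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow({**record, "tags": ";".join(record["tags"])})

buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON keeps the list as a list, so the record round-trips unchanged.
json_row = json.loads(json.dumps(record))

print(type(csv_row["tags"]).__name__)   # str
print(type(json_row["tags"]).__name__)  # list
```

The CSV version still works, but only as long as every step agrees on the separator and no tag ever contains it.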

This issue needs to consider replacing csv files as storage for:

  • web scrapers output (extract)
  • merge_data.py output (aggregate and clean)

JSON has been suggested but we're not closed to other options.

As a secondary outcome, we'll still want to provide a .csv as output for public users to download, but this should be published output only.
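A rough sketch of how the published .csv could be generated from JSON-style records as a final, export-only step. The function name `export_csv`, the default columns, and the ";" separator for lists are all assumptions for illustration; the column names are borrowed from the draft schema further down this thread.

```python
import csv
import io


def export_csv(datasets, out, columns=("title", "owner", "pageURL", "tags")):
    """Write one CSV row per dataset dict to the file-like object `out`.

    List values are joined with ';' (an assumed convention) and
    missing keys become empty cells.
    """
    writer = csv.DictWriter(out, fieldnames=list(columns))
    writer.writeheader()
    for dataset in datasets:
        row = {}
        for column in columns:
            value = dataset.get(column, "")
            row[column] = ";".join(value) if isinstance(value, list) else value
        writer.writerow(row)


# Usage with an in-memory buffer; a real run would open a file with newline="".
datasets = [
    {"title": "Air quality", "owner": "Example Council",
     "pageURL": "https://example.org/air", "tags": ["air", "health"]},
]
out = io.StringIO(newline="")
export_csv(datasets, out)
print(out.getvalue())
```

Because nothing reads this file back into the pipeline, any flattening it does (such as joining tags) cannot corrupt later steps.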

See process flow for current system and PR #160

@KarenJewell added the data engineering and back end labels on Sep 28, 2022
@KarenJewell
Member Author

Just adding to this: the current merge_output.csv is so riddled with git conflicts that it's not currently usable by the public anyway.

@JackGilmore
Member

Here is a very rough draft of a potential JSON schema we could use for datasets. Note that the file records are nested as a property (resources) of the dataset object, rather than having a separate JSON object for each file of the same dataset.

{
    "type": "object",
    "properties": {
        "title": {
            "type": "string"
        },
        "owner": {
            "type": "string"
        },
        "pageURL": {
            "type": "string"
        },
        "dateCreated": {
            "type": "string"
        },
        "dateUpdated": {
            "type": "string"
        },
        "license": {
            "type": "string"
        },
        "description": {
            "type": "string"
        },
        "tags": {
            "type": "array",
            "description": "Could make an array of objects with specifier for tags from original dataset, ones manually added and ones added by the pipeline",
            "items": {
                "type": "string"
            }
        },
        "resources": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "fileName": {
                        "type": "string"
                    },
                    "fileSize": {
                        "type": "string"
                    },
                    "fileSizeUnit": {
                        "type": "string",
                        "description": "Could we do away with this prop and just enforce file sizes to be bytes?"
                    },
                    "fileType": {
                        "type": "string"
                    },
                    "assetUrl": {
                        "type": "string"
                    },
                    "dateCreated": {
                        "type": "string"
                    },
                    "dateUpdated": {
                        "type": "string"
                    },
                    "numRecords": {
                        "type": "number"
                    }
                },
                "required": [
                    "fileName",
                    "fileType",
                    "assetUrl"
                ]
            }
        }
    },
    "required": [
        "title",
        "owner",
        "pageURL",
        "dateCreated"
    ]
}
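As a sanity check on the draft, here is a small hand-rolled checker that covers only the keywords the schema uses (type, properties, required, items). It is a sketch, not a full JSON Schema validator; a dedicated validator such as the third-party jsonschema package would be the safer choice in the pipeline. The schema below is a trimmed copy of the draft, and the example records are invented.

```python
# Trimmed copy of the draft schema, kept short for the example.
STRING = {"type": "string"}
SCHEMA = {
    "type": "object",
    "properties": {
        "title": STRING, "owner": STRING, "pageURL": STRING, "dateCreated": STRING,
        "tags": {"type": "array", "items": STRING},
        "resources": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "fileName": STRING, "fileType": STRING, "assetUrl": STRING,
                    "numRecords": {"type": "number"},
                },
                "required": ["fileName", "fileType", "assetUrl"],
            },
        },
    },
    "required": ["title", "owner", "pageURL", "dateCreated"],
}

PYTHON_TYPES = {"object": dict, "array": list, "string": str, "number": (int, float)}


def check(instance, schema, path="$"):
    """Return a list of problems; an empty list means the instance passed."""
    expected = PYTHON_TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(instance, bool) or not isinstance(instance, expected):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if schema["type"] == "object":
        for name in schema.get("required", []):
            if name not in instance:
                problems.append(f"{path}.{name}: missing required property")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                problems.extend(check(instance[name], subschema, f"{path}.{name}"))
    elif schema["type"] == "array" and "items" in schema:
        for index, item in enumerate(instance):
            problems.extend(check(item, schema["items"], f"{path}[{index}]"))
    return problems


good = {
    "title": "Air quality", "owner": "Example Council",
    "pageURL": "https://example.org/air", "dateCreated": "2022-09-28",
    "resources": [{"fileName": "air.csv", "fileType": "CSV",
                   "assetUrl": "https://example.org/air.csv", "numRecords": 120}],
}
bad = {"title": "Air quality", "resources": [{"fileName": "air.csv", "numRecords": "120"}]}

print(check(good, SCHEMA))  # []
for problem in check(bad, SCHEMA):
    print(problem)
```

Nesting the files under resources means one dataset is checked as a single unit, including every file record it carries.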

@KarenJewell
Member Author

KarenJewell commented Jan 29, 2023

Started branch 163-replace-csv-storage-formats for this.

@KarenJewell
Member Author

Note: keeping merged_output.csv doesn't actually lose us any data, because export2jkan was reading line breaks correctly. It looks like our issues with .csv are just aesthetic; functionally it's fine. It is still worth changing to .json, because problems are easier to trace back there than in .csv, given the rendering issues.
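A small sketch of why a line break inside a field can look broken in a viewer while still parsing cleanly. It uses Python's csv module on an invented two-column file; it is not taken from export2jkan, whose reading code may differ.

```python
import csv
import io

# One record whose quoted description contains a line break.
raw = 'title,description\r\n"Air quality","Hourly readings.\nUpdated monthly."\r\n'

# Counting physical lines, as a text viewer or line-based diff would.
physical_lines = raw.splitlines()

# Parsing with a CSV reader, which keeps the quoted line break inside the field.
records = list(csv.DictReader(io.StringIO(raw, newline="")))

print(len(physical_lines))  # 3 (header plus a record that spans two lines)
print(len(records))         # 1
```

The mismatch between the two counts is the rendering problem: tools that work line by line see a split row, while a CSV parser sees one intact record.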

2 participants