Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace .csv in usmart api #223

Open
4 tasks
KarenJewell opened this issue Feb 18, 2023 · 13 comments · May be fixed by #260
Open
4 tasks

Replace .csv in usmart api #223

KarenJewell opened this issue Feb 18, 2023 · 13 comments · May be fixed by #260
Assignees
Labels
good first issue Good for newcomers

Comments

@KarenJewell
Copy link
Member

Is your feature request related to a problem? Please describe.
Currently, the usmart api script (usmart.py) outputs to a .csv which is then ingested into merge_data.py. This should be converted to a .json output to keep consistent with the rest of the project.

Describe the solution you'd like

  • Change the output of usmart.py to be a .json file
  • Update merge_data.py to ingest the new .json file
  • format code with Black before PR
  • update docs (architecture.md and datapipeline.png)

Describe alternatives you've considered
None

Additional context
Original ticket triggering this change is #163

@paul-bradbeer
Copy link

@KarenJewell please assign

@JackGilmore
Copy link
Member

@paul-bradbeer It's all yours! Don't forget to check out our docs at https://docs.opendata.scot and don't hesitate to give us a shout here or on Slack if you have any questions.

@paul-bradbeer
Copy link

@JackGilmore @KarenJewell Good docs - thanks. Also the walkthrough video crystallised things clearly. I've been preoccupied of late, but I will have more time in November. In the meantime, it strikes me that there are a number of these csv to json type conversions and so it seems likely the first one done will serve as a pattern for imitation to some degree - perhaps some common machinery will benefit. JSON will be slightly bulkier than CSV, but I don't think the data volume is sufficient for this to be a concern - let me know if you disagree. Standardising on JSON will also allow for easier representation of hierarchical structure if needed.

@JackGilmore
Copy link
Member

@paul-bradbeer Good shout. I did create a write_json method for processor.py when I was writing the scraper for the Scottish Parliament portal (scottish_parliament.py) which is probably an OK start but it really just takes the list of dict objects I created and json.dumps them.

Ideally once I get time (or if someone gets there before me!) I'd like to declare some proper classes for the JSON schema we want to use with the aim of saving people from having to write their own all the time.

@paul-bradbeer
Copy link

@JackGilmore JSON schema good if they all look the same or are very similar; just checking that now.

@JackGilmore
Copy link
Member

@paul-bradbeer Can't find your comment anymore but I've updated the Contributor Guide with better guidance for forking and creating a branch. When we wrote the docs we must've been viewing the pages as members of the OpenDataScotland GitHub org meaning we've got slightly more elevated permissions to write directly to the repo compared to contributors that will need to fork instead 🤦‍♂️
https://docs.opendata.scot/contribute/contribution-guide/

@paul-bradbeer
Copy link

@JackGilmore removed it once I figured out that what I was commenting on was out of date. Forking in the usual way is fine.

@JackGilmore
Copy link
Member

@paul-bradbeer Kept going one some work with the common schema stuff and built some common classes for datasets and file resources in the branch I've got at the moment for building a Stagecoach scraper as part of #120

Can see more info here if you fancy a look:

@paul-bradbeer
Copy link

@JackGilmore Today I am submitting my pull request for the completed works, including a dataclass object as a sort of schema.

I can't see how this process works if issues are allocated to people and then others go off and work on the same thing. It seems to be a very inefficient way to do things resulting in duplication.

I will submit my pull request; do with it as you please but my spare time is nearly up.

The solution is a pattern for immitation with a Json and Csv decorator class - so folks can see which are json and which are csv
It allows a mix and match approach.

@paul-bradbeer
Copy link

@JackGilmore Draft pull request #260 created. See what you make of it.

@paul-bradbeer
Copy link

@JackGilmore the pull request doesn't seem to be linked. Are you able to link pull request 260 to this issue; I don't seem to be able to do that.

@JackGilmore JackGilmore linked a pull request Nov 25, 2023 that will close this issue
8 tasks
@JackGilmore
Copy link
Member

@paul-bradbeer Thanks for your PR and apologies for the delay in getting back to this. The last couple of weeks have been a tad busy.

Please also accept my apologies if you thought I was duplicating your work. I had been prototyping a few things for the pipeline and hadn't realised that you were working on a common decorator class as part of your work. Nevertheless, your classes actually look a lot more organised and make better sense than what I was playing about with, so thank you! Your contribution is greatly appreciated and we're always looking for feedback on how we can improve the contribution process so I'll take this on board and see how we can update the way we coordinate issues and PRs.

I'm aiming to try to review this by the end of next week.

@paul-bradbeer
Copy link

@JackGilmore Thanks. I wasn't sure how much overlap there would be and I probably should have checked your PR to see but I was running out of spare time. Hopefully you can make some use of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
Development

Successfully merging a pull request may close this issue.

3 participants