[EN] Convert Yerevan budget documents from PDF to CSV/XML/JSON machine readable data #5

Open
ivbeg opened this issue May 26, 2023 · 10 comments
Labels
parsing: Tasks that require data parsing
topic-finances: Task related to public finances, banking, currencies and e.t.c
topic-government: Tasks dedicated to government openness

Comments

@ivbeg
Contributor

ivbeg commented May 26, 2023

Goal

The goal is to create a dataset of the Yerevan city budget for further analysis and visualization. At the moment the budget is published only as a set of PDF documents.

Tasks

The Yerevan city budget for 2023 and the report on the execution of the 2022 budget are published on the city website https://www.yerevan.am/hy/finance/ as a set of archives with PDF documents inside.

These documents have a text layer that can be processed to extract tables.

The tables look like the one on page 49 of the 2022 budget report: https://www.yerevan.am/uploads/media/default/0002/19/b257858f7a9940c75efc4a98acb88e949dd6e554.pdf

  1. These tables should be extracted and converted to Excel, CSV, or JSON files, one file per table.
  2. It would be great if the table headers were translated from Armenian to English. For example, Եկամտատեսակները should be written as "income" in the Excel or CSV file.
  3. It would be even better if you could convert past budgets too: the city budgets for 2018-2022 are available as sub-pages at the same link.
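
The tasks above (one file per table, English headers) could be sketched roughly as follows. This is only an illustration using the Python standard library: `HEADER_GLOSSARY`, `translate_headers`, and `write_table` are hypothetical names, and the glossary contains only the single header pair given in this issue.

```python
import csv
import json
from pathlib import Path

# Armenian header -> English header. Only this pair comes from the task text;
# the glossary would need to be extended as more headers are translated.
HEADER_GLOSSARY = {
    "Եկամտատեսակները": "income",
}


def translate_headers(headers, glossary=HEADER_GLOSSARY):
    """Translate known headers; leave unknown ones unchanged for later review."""
    return [glossary.get(h.strip(), h.strip()) for h in headers]


def write_table(rows, out_dir, name):
    """Write one table (first row = headers) as <name>.csv and <name>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    headers = translate_headers(rows[0])
    body = rows[1:]

    csv_path = out_dir / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(body)

    json_path = out_dir / f"{name}.json"
    records = [dict(zip(headers, row)) for row in body]
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return csv_path, json_path
```

Headers missing from the glossary pass through untranslated, so an incomplete glossary does not lose information.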

Context

The Yerevan budget was published as a set of Armenian-only text/PDF documents, without any machine-readable version or even an Excel file.

To convert the PDF files to Excel or CSV/JSON, you could use ABBYY FineReader, Tabula, or any other suitable tool.
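
Because the documents have a text layer, a scripted extraction is also an option. The sketch below assumes the third-party `pdfplumber` library, which is not named in this issue and stands in for "any other suitable tool". `clean_row` and `extract_tables` are hypothetical helper names, and how well the table detection works on these particular PDFs is untested.

```python
def clean_row(row):
    """Replace missing cells with "" and collapse line breaks and repeated spaces."""
    return [" ".join((cell or "").split()) for cell in row]


def extract_tables(pdf_path):
    """Yield (page_number, table_index, rows) for each table pdfplumber detects."""
    # Third-party dependency (an assumption, not part of the task description);
    # imported lazily so clean_row can be used without it installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()
            for table_index, table in enumerate(tables, start=1):
                yield page_number, table_index, [clean_row(r) for r in table]
```

Each yielded table is a list of rows of strings, which can then be written out one file per table.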

Requirements

  • Create a public GitHub repository to store the code and data under a free and open license, such as a Creative Commons license or the MIT license.

Wishes

Please write reusable code that someone else can run later, since we may need to update this dataset.

Resources

Prepared by

The Open Data Armenia team prepared this task

@ivbeg ivbeg added parsing Tasks that require data parsing extraction Task that require data extraction (scraping) skills topic-government Tasks dedicated to government openness and removed extraction Task that require data extraction (scraping) skills labels May 26, 2023
@dkagramanyan

I can help with converting the PDF files to CSV/JSON/Excel, but only in about 1.5 weeks. I need to complete several study projects before exams.

@dkagramanyan

dkagramanyan commented Jun 4, 2023

@ivbeg Hi! I processed one PDF document. Is everything okay? If not, please tell me where the mistakes are. The tables are available on Google Drive and in my repository.

@ivbeg
Contributor Author

ivbeg commented Jun 5, 2023

@dkagramanyan Hi! Yes, it looks great!
P.S. The repository looks private, so I've checked only the Google Drive documents.

@ivbeg ivbeg added the topic-finances Task related to public finances, banking, currencies and e.t.c label Jun 5, 2023
@dkagramanyan

Added 30 new tables. About 50 tables are left to process. I uploaded the new data to the same Google Drive folder.

@ansakoy
Collaborator

ansakoy commented Jun 9, 2023

@dkagramanyan thanks a lot! If you could possibly share the code that did the trick it would be just perfect (the repo you referred to above is private, as @ivbeg pointed out).
Meanwhile, best of luck with your exams.

@dkagramanyan

dkagramanyan commented Jun 9, 2023

@ansakoy In fact, I didn't use any code to parse those tables. I converted the PDF files with FineReader and then made corrections manually. But I don't think this method can be used for the remaining 50 tables, as it is very time-consuming. I will have to come up with some kind of automatic method.

P.S. The repository is now public.
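
One way to cut down on manual correction might be to check converted tables automatically and review only the cells that fail. The sketch below is not the method used in this thread. `parse_amount` and `cells_needing_review` are hypothetical names, and the number format (spaces grouping thousands, comma as decimal mark) is an assumption about the source tables.

```python
import re

# A plain number after normalization: optional minus, digits, optional fraction.
_NUMBER = re.compile(r"^-?\d+(\.\d+)?$")


def parse_amount(cell):
    """Return a float for a numeric-looking cell, or None if it cannot be parsed.

    Assumption: spaces (including non-breaking ones) group thousands and a
    comma may be used as the decimal mark.
    """
    text = cell.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return float(text) if _NUMBER.match(text) else None


def cells_needing_review(rows, numeric_columns):
    """List (row_index, column_index, value) for non-empty cells that fail to parse."""
    flagged = []
    for r, row in enumerate(rows):
        for c in numeric_columns:
            if c < len(row) and row[c].strip() and parse_amount(row[c]) is None:
                flagged.append((r, c, row[c]))
    return flagged
```

Typical OCR slips, such as a letter "O" or "l" inside a number, fail the pattern and get flagged. A value that uses commas as thousands separators would also be flagged rather than silently misread.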

@dkagramanyan

Hi! I have finished parsing the Yerevan budget for 2023. The data is available in my repository and on Google Drive. The links are in my previous comments.

@dkagramanyan

@ivbeg Hi! Is everything OK with the data?

@ansakoy
Collaborator

ansakoy commented Jun 30, 2023

@dkagramanyan Thanks a lot, David, this was really useful. The data look fine to me. Ivan has been away on business, so he could not respond promptly, but he will as soon as possible.

@ivbeg
Contributor Author

ivbeg commented Jul 4, 2023

@dkagramanyan Hi David! Sorry, I was on a business trip for some time and was unable to answer. Yes, it looks great! Thanks a lot.
