Merge pull request #184 from georgetown-cset/data-airflow

Airflow pipeline creation updates
georgetown-cset · Jan 18, 2024 · 15c4062
2 parents b6e75da + a6c2d22

Showing 60 changed files with 2,446 additions and 484 deletions.
diff --git a/company_linkage/Dockerfile b/company_linkage/Dockerfile
@@ -0,0 +1,22 @@
+FROM ubuntu:20.04
+
+# Set up system dependencies
+RUN apt -y update
+RUN apt-get -y update
+RUN apt-get install -y build-essential libssl-dev libffi-dev python3-dev python3-pip curl
+
+# Grab files we need to run
+ADD requirements.txt /parat/requirements.txt
+ADD parat_scripts/* /parat/
+
+# install gsutil and put it on the path for airflow to use
+ENV CLOUDSDK_INSTALL_DIR /usr/local/gcloud/
+RUN curl -sSL https://sdk.cloud.google.com | bash
+ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin
+
+# Install python dependencies
+WORKDIR /parat
+ENV AIRFLOW_GPL_UNIDECODE=yes
+RUN pip3 install -r requirements.txt
+# Make sure the above config succeeded
+RUN python3 -m pytest test_aggregate_organizations.py -k test_add_location
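The final `RUN` step of the Dockerfile above uses a single pytest case as a smoke test for the image. A similar check could cover the command-line tools the image is expected to put on the `PATH`. The following is a minimal sketch and not part of this PR: the helper name `missing_tools` and the `EXPECTED_TOOLS` list are assumptions made for illustration, based only on the packages the Dockerfile installs.

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Hypothetical list, inferred from the Dockerfile: apt installs python3,
# pip3 and curl, and the Cloud SDK installer is expected to provide gsutil.
EXPECTED_TOOLS = ["python3", "pip3", "curl", "gsutil"]
```

Inside the built image, `missing_tools(EXPECTED_TOOLS)` should return an empty list if the `PATH` line took effect; a non-empty result names the tools that still need attention.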
diff --git a/company_linkage/README.md b/company_linkage/README.md
@@ -16,37 +16,47 @@ run some of this code as-is.
 
 ## Tasks to build visualization data
 
-1. [creating_organizations_from_airtable_imports.sql](sql/create_organizations_from_airtable_imports.sql)
-2. [selecting_ai_publications.sql](sql/selecting_ai_publications.sql)
-3. `python3 aggregate_organizations.py aggregated_organizations.jsonl`
-4. Replace `high_resolution_entities.aggregated_organizations` with the data from `aggregated_organizations.jsonl` using the [aggregated_organizations_schema](schemas/aggregated_organizations_schema.json)
-5. [selecting_ai_patents.sql](sql/selecting_ai_patents.sql)
-6. `python3 get_ai_counts.py data/ai_company_papers.jsonl data/ai_company_patents.jsonl` 
-7. Upload `ai_company_papers.jsonl` to `ai_companies_visualization.ai_company_pubs` using the [ai_papers_schema](schemas/ai_papers_schema.json)
-8. Upload `ai_company_patents.jsonl` to `ai_companies_visualization.ai_company_patents` using the [ai_patents_schema](schemas/ai_patents_schema.json)
-9. [creating_initial_visualization_data_publications.sql](sql/creating_initial_visualization_data_publications.sql)
-10. [adding_ai_pubs_by_year_to_visualization.sql](sql/adding_ai_pubs_by_year_to_visualization.sql)
-11. [creating_patent_visualization_data.sql](sql/creating_patent_visualization_data.sql)
-12. [adding_ai_patents_by_year_to_visualization.sql](sql/adding_ai_patents_by_year_to_visualization.sql)
-13. [creating_paper_visualization_data.sql](sql/creating_paper_visualization_data.sql)
-14. [adding_top_mag_ai_fields.sql](sql/adding_top_mag_ai_fields.sql)
-15. [adding_top_science_map_clusters.sql](sql/adding_top_science_map_clusters.sql)
-16. [adding_company_references.sql](sql/adding_company_references.sql)
-17. [adding_top_tasks.sql](sql/adding_top_tasks.sql)
-18. [adding_top_methods.sql](sql/adding_top_methods.sql)
-19. [selecting_top_conference_pubs.sql](sql/selecting_top_conference_pubs.sql)
-20. [pulling_publications_in_top_ai_conferences.sql](sql/pulling_publications_in_top_ai_conferences.sql)
-21. `python3 top_papers.py top_paper_counts.jsonl`
-22. Upload `top_paper_counts.jsonl` to `ai_companies_visualization.top_paper_counts` using the [top_papers_schema](schemas/top_papers_schema.json)
-23. [adding_top_paper_counts.sql](sql/adding_top_paper_counts.sql)
-24. [selecting_all_publications.sql](sql/selecting_all_publications.sql)
-25. `python3 all_papers.py all_paper_counts.jsonl`
-26. Upload `all_paper_counts.jsonl` to `ai_companies_visualization.total_paper_counts` using the [all_papers_schema](schemas/all_papers_schema.json)
-27. [adding_all_paper_counts.sql](sql/adding_all_paper_counts.sql)
-28. [creating_workforce_visualization_data.sql](sql/creating_workforce_visualization_data.sql)
-29. [adding_ai_jobs_to_workforce_visualization.sql](sql/adding_ai_jobs_to_workforce_visualization.sql)
-31. [omit_by_rule.sql](sql/omit_by_rule.sql)
-32. [omit_by_rule_papers.sql](sql/omit_by_rule_papers.sql)
-33. [omit_by_rule_patents.sql](sql/omit_by_rule_patents.sql)
-34. [omit_by_rule_workforce.sql](sql/omit_by_rule_workforce.sql)
-35. [adding_crunchbase_company_metadata.sql](sql/adding_crunchbase_company_metadata.sql)
+1. [organizations.sql](sql/organizations.sql)
+2. [ai_publications.sql](sql/ai_publications.sql)
+3. [linked_ai_patents.sql](sql/linked_ai_patents.sql)
+4. [top_conference_pubs.sql](sql/top_conference_pubs.sql)
+5. [pubs_in_top_conferences.sql](sql/pubs_in_top_conferences.sql)
+6. [all_publications.sql](sql/all_publications.sql)
+7. `python3 aggregate_organizations.py aggregated_organizations.jsonl`
+8. Replace `high_resolution_entities.aggregated_organizations` with the data from `aggregated_organizations.jsonl` using the [aggregated_organizations_schema](schemas/aggregated_organizations_schema.json)
+9. `python3 get_ai_counts.py data/ai_company_papers.jsonl data/ai_company_patents.jsonl` 
+10. Upload `ai_company_papers.jsonl` to `ai_companies_visualization.ai_company_pubs` using the [ai_papers_schema](schemas/ai_papers_schema.json)
+11. Upload `ai_company_patents.jsonl` to `ai_companies_visualization.ai_company_patents` using the [ai_patents_schema](schemas/ai_patents_schema.json)
+12. `python3 top_papers.py top_paper_counts.jsonl`
+13. Upload `top_paper_counts.jsonl` to `ai_companies_visualization.top_paper_counts` using the [top_papers_schema](schemas/top_papers_schema.json)
+14. `python3 all_papers.py all_paper_counts.jsonl`
+15. Upload `all_paper_counts.jsonl` to `ai_companies_visualization.total_paper_counts` using the [all_papers_schema](schemas/all_papers_schema.json)
+16. [initial_visualization_data.sql](sql/initial_visualization_data.sql)
+17. [visualization_data_with_by_year.sql](sql/visualization_data_with_by_year.sql)
+18. [visualization_data_with_top_papers.sql](sql/visualization_data_with_top_papers.sql)
+19. [visualization_data_with_all_papers.sql](sql/visualization_data_with_all_papers.sql)
+20. [initial_patent_visualization_data.sql](sql/initial_patent_visualization_data.sql)
+21. [patent_visualization_data_with_by_year.sql](sql/patent_visualization_data_with_by_year.sql)
+22. [initial_paper_visualization_data.sql](sql/initial_paper_visualization_data.sql)
+23. [paper_visualization_data_with_mag.sql](sql/paper_visualization_data_with_mag.sql)
+24. [paper_visualization_data_with_clusters.sql](sql/paper_visualization_data_with_clusters.sql)
+25. [paper_visualization_data_with_company_references.sql](sql/paper_visualization_data_with_company_references.sql)
+26. [paper_visualization_data_with_tasks.sql](sql/paper_visualization_data_with_tasks.sql)
+27. [paper_visualization_data_with_methods.sql](sql/paper_visualization_data_with_methods.sql)
+28. [initial_workforce_visualization_data.sql](sql/initial_workforce_visualization_data.sql)
+29. [workforce_visualization_data_with_ai_jobs.sql](sql/workforce_visualization_data_with_ai_jobs.sql)
+30. [visualization_data_omit_by_rule.sql](sql/visualization_data_omit_by_rule.sql)
+31. [visualization_data.sql](sql/visualization_data.sql)
+32. [patent_visualization_data.sql](sql/patent_visualization_data.sql)
+33. [paper_visualization_data.sql](sql/paper_visualization_data.sql)
+34. [workforce_visualization_data.sql](sql/workforce_visualization_data.sql)
+
+## Deployment
+
+To refresh the Docker image (which you must do whenever you change any of the Python scripts in `parat_scripts/`), run:
+
+```
+docker build -t parat .
+docker tag parat us.gcr.io/gcp-cset-projects/parat
+docker push us.gcr.io/gcp-cset-projects/parat
+```
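The three deployment commands in the README hunk above have to run in order, and a failed build should stop the tag and push. One way to script that is sketched below. This is only an illustration under stated assumptions: the function names `release_commands` and `release` are hypothetical, and the image name and registry path are simply copied from the README text.

```python
import shlex
import subprocess


def release_commands(image="parat", registry="us.gcr.io/gcp-cset-projects"):
    """Build the argv lists for the build, tag, and push steps, in order."""
    remote = f"{registry}/{image}"
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "tag", image, remote],
        ["docker", "push", remote],
    ]


def release(dry_run=True, **kwargs):
    """Print each command; also execute it when dry_run is False."""
    for argv in release_commands(**kwargs):
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises on a non-zero exit, so later steps are skipped
            subprocess.run(argv, check=True)
```

Calling `release()` only prints the commands, which makes it easy to review them first; `release(dry_run=False)` would execute them from the directory containing the Dockerfile.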
diff --git a/company_linkage/data/omit.csv b/company_linkage/data/omit.csv