Merge pull request #184 from georgetown-cset/data-airflow
Airflow pipeline creation updates
jmelot authored Jan 18, 2024
2 parents b6e75da + a6c2d22 commit 15c4062
Showing 60 changed files with 2,446 additions and 484 deletions.
22 changes: 22 additions & 0 deletions company_linkage/Dockerfile
@@ -0,0 +1,22 @@
FROM ubuntu:20.04

# Set up system dependencies
RUN apt -y update
RUN apt-get -y update
RUN apt-get install -y build-essential libssl-dev libffi-dev python3-dev python3-pip curl

# Grab files we need to run
ADD requirements.txt /parat/requirements.txt
ADD parat_scripts/* /parat/

# install gsutil and put it on the path for airflow to use
ENV CLOUDSDK_INSTALL_DIR /usr/local/gcloud/
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

# Install python dependencies
WORKDIR /parat
ENV AIRFLOW_GPL_UNIDECODE=yes
RUN pip3 install -r requirements.txt
# Make sure the above config succeeded
RUN python3 -m pytest test_aggregate_organizations.py -k test_add_location
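The Dockerfile above puts `gsutil` on the `PATH` and ends with a smoke test. A similar check for the `gsutil` install could be sketched as follows. This is a minimal sketch, not part of the commit: `find_tool` and `GCLOUD_BIN` are hypothetical names, and the directory is assumed from the `ENV PATH` line.

```python
import os
import shutil
from typing import Optional, Tuple

# Assumption: install location copied from the ENV PATH line in the Dockerfile.
GCLOUD_BIN = "/usr/local/gcloud/google-cloud-sdk/bin"


def find_tool(name: str, extra_dirs: Tuple[str, ...] = (GCLOUD_BIN,)) -> Optional[str]:
    """Return the full path to `name` if it is on PATH or in extra_dirs, else None."""
    # Search the current PATH first, then the extra directories.
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_tool("gsutil")
    if location is None:
        raise SystemExit("gsutil not found; check the gcloud install step")
    print(f"gsutil found at {location}")
```

Run inside the container, a script like this would fail fast if the SDK install step did not produce a usable `gsutil`.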
78 changes: 44 additions & 34 deletions company_linkage/README.md
@@ -16,37 +16,47 @@ run some of this code as-is.

## Tasks to build visualization data

1. [creating_organizations_from_airtable_imports.sql](sql/create_organizations_from_airtable_imports.sql)
2. [selecting_ai_publications.sql](sql/selecting_ai_publications.sql)
3. `python3 aggregate_organizations.py aggregated_organizations.jsonl`
4. Replace `high_resolution_entities.aggregated_organizations` with the data from `aggregated_organizations.jsonl` using the [aggregated_organizations_schema](schemas/aggregated_organizations_schema.json)
5. [selecting_ai_patents.sql](sql/selecting_ai_patents.sql)
6. `python3 get_ai_counts.py data/ai_company_papers.jsonl data/ai_company_patents.jsonl`
7. Upload `ai_company_papers.jsonl` to `ai_companies_visualization.ai_company_pubs` using the [ai_papers_schema](schemas/ai_papers_schema.json)
8. Upload `ai_company_patents.jsonl` to `ai_companies_visualization.ai_company_patents` using the [ai_patents_schema](schemas/ai_patents_schema.json)
9. [creating_initial_visualization_data_publications.sql](sql/creating_initial_visualization_data_publications.sql)
10. [adding_ai_pubs_by_year_to_visualization.sql](sql/adding_ai_pubs_by_year_to_visualization.sql)
11. [creating_patent_visualization_data.sql](sql/creating_patent_visualization_data.sql)
12. [adding_ai_patents_by_year_to_visualization.sql](sql/adding_ai_patents_by_year_to_visualization.sql)
13. [creating_paper_visualization_data.sql](sql/creating_paper_visualization_data.sql)
14. [adding_top_mag_ai_fields.sql](sql/adding_top_mag_ai_fields.sql)
15. [adding_top_science_map_clusters.sql](sql/adding_top_science_map_clusters.sql)
16. [adding_company_references.sql](sql/adding_company_references.sql)
17. [adding_top_tasks.sql](sql/adding_top_tasks.sql)
18. [adding_top_methods.sql](sql/adding_top_methods.sql)
19. [selecting_top_conference_pubs.sql](sql/selecting_top_conference_pubs.sql)
20. [pulling_publications_in_top_ai_conferences.sql](sql/pulling_publications_in_top_ai_conferences.sql)
21. `python3 top_papers.py top_paper_counts.jsonl`
22. Upload `top_paper_counts.jsonl` to `ai_companies_visualization.top_paper_counts` using the [top_papers_schema](schemas/top_papers_schema.json)
23. [adding_top_paper_counts.sql](sql/adding_top_paper_counts.sql)
24. [selecting_all_publications.sql](sql/selecting_all_publications.sql)
25. `python3 all_papers.py all_paper_counts.jsonl`
26. Upload `all_paper_counts.jsonl` to `ai_companies_visualization.total_paper_counts` using the [all_papers_schema](schemas/all_papers_schema.json)
27. [adding_all_paper_counts.sql](sql/adding_all_paper_counts.sql)
28. [creating_workforce_visualization_data.sql](sql/creating_workforce_visualization_data.sql)
29. [adding_ai_jobs_to_workforce_visualization.sql](sql/adding_ai_jobs_to_workforce_visualization.sql)
31. [omit_by_rule.sql](sql/omit_by_rule.sql)
32. [omit_by_rule_papers.sql](sql/omit_by_rule_papers.sql)
33. [omit_by_rule_patents.sql](sql/omit_by_rule_patents.sql)
34. [omit_by_rule_workforce.sql](sql/omit_by_rule_workforce.sql)
35. [adding_crunchbase_company_metadata.sql](sql/adding_crunchbase_company_metadata.sql)
1. [organizations.sql](sql/organizations.sql)
2. [ai_publications.sql](sql/ai_publications.sql)
3. [linked_ai_patents.sql](sql/linked_ai_patents.sql)
4. [top_conference_pubs.sql](sql/top_conference_pubs.sql)
5. [pubs_in_top_conferences.sql](sql/pubs_in_top_conferences.sql)
6. [all_publications.sql](sql/all_publications.sql)
7. `python3 aggregate_organizations.py aggregated_organizations.jsonl`
8. Replace `high_resolution_entities.aggregated_organizations` with the data from `aggregated_organizations.jsonl` using the [aggregated_organizations_schema](schemas/aggregated_organizations_schema.json)
9. `python3 get_ai_counts.py data/ai_company_papers.jsonl data/ai_company_patents.jsonl`
10. Upload `ai_company_papers.jsonl` to `ai_companies_visualization.ai_company_pubs` using the [ai_papers_schema](schemas/ai_papers_schema.json)
11. Upload `ai_company_patents.jsonl` to `ai_companies_visualization.ai_company_patents` using the [ai_patents_schema](schemas/ai_patents_schema.json)
12. `python3 top_papers.py top_paper_counts.jsonl`
13. Upload `top_paper_counts.jsonl` to `ai_companies_visualization.top_paper_counts` using the [top_papers_schema](schemas/top_papers_schema.json)
14. `python3 all_papers.py all_paper_counts.jsonl`
15. Upload `all_paper_counts.jsonl` to `ai_companies_visualization.total_paper_counts` using the [all_papers_schema](schemas/all_papers_schema.json)
16. [initial_visualization_data.sql](sql/initial_visualization_data.sql)
17. [visualization_data_with_by_year.sql](sql/visualization_data_with_by_year.sql)
18. [visualization_data_with_top_papers.sql](sql/visualization_data_with_top_papers.sql)
19. [visualization_data_with_all_papers.sql](sql/visualization_data_with_all_papers.sql)
20. [initial_patent_visualization_data.sql](sql/initial_patent_visualization_data.sql)
21. [patent_visualization_data_with_by_year.sql](sql/patent_visualization_data_with_by_year.sql)
22. [initial_paper_visualization_data.sql](sql/initial_paper_visualization_data.sql)
23. [paper_visualization_data_with_mag.sql](sql/paper_visualization_data_with_mag.sql)
24. [paper_visualization_data_with_clusters.sql](sql/paper_visualization_data_with_clusters.sql)
25. [paper_visualization_data_with_company_references.sql](sql/paper_visualization_data_with_company_references.sql)
26. [paper_visualization_data_with_tasks.sql](sql/paper_visualization_data_with_tasks.sql)
27. [paper_visualization_data_with_methods.sql](sql/paper_visualization_data_with_methods.sql)
28. [initial_workforce_visualization_data.sql](sql/initial_workforce_visualization_data.sql)
29. [workforce_visualization_data_with_ai_jobs.sql](sql/workforce_visualization_data_with_ai_jobs.sql)
30. [visualization_data_omit_by_rule.sql](sql/visualization_data_omit_by_rule.sql)
31. [visualization_data.sql](sql/visualization_data.sql)
32. [patent_visualization_data.sql](sql/patent_visualization_data.sql)
33. [paper_visualization_data.sql](sql/paper_visualization_data.sql)
34. [workforce_visualization_data.sql](sql/workforce_visualization_data.sql)
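The upload steps in the list above (8, 10, 11, 13 and 15) all pair a JSONL file with a destination table and a schema file. The sketch below collects those pairs in one place and builds a `bq load` command for each. It rests on assumptions not stated in the README: that the uploads are done with the `bq` CLI, that each upload replaces the table (`--replace`), and that the two AI count files are read from the `data/` paths written in step 9. `UPLOADS` and `bq_load_command` are hypothetical names.

```python
import shlex
from typing import Dict, List, Tuple

# JSONL file -> (BigQuery table, schema file), copied from the task list.
UPLOADS: Dict[str, Tuple[str, str]] = {
    "aggregated_organizations.jsonl": (
        "high_resolution_entities.aggregated_organizations",
        "schemas/aggregated_organizations_schema.json",
    ),
    "data/ai_company_papers.jsonl": (
        "ai_companies_visualization.ai_company_pubs",
        "schemas/ai_papers_schema.json",
    ),
    "data/ai_company_patents.jsonl": (
        "ai_companies_visualization.ai_company_patents",
        "schemas/ai_patents_schema.json",
    ),
    "top_paper_counts.jsonl": (
        "ai_companies_visualization.top_paper_counts",
        "schemas/top_papers_schema.json",
    ),
    "all_paper_counts.jsonl": (
        "ai_companies_visualization.total_paper_counts",
        "schemas/all_papers_schema.json",
    ),
}


def bq_load_command(jsonl_path: str) -> List[str]:
    """Build (but do not run) a `bq load` command for one JSONL file."""
    table, schema = UPLOADS[jsonl_path]
    return [
        "bq", "load",
        "--replace",  # assumption: each upload overwrites the existing table
        "--source_format=NEWLINE_DELIMITED_JSON",
        table, jsonl_path, schema,
    ]


if __name__ == "__main__":
    for path in UPLOADS:
        print(shlex.join(bq_load_command(path)))
```

Keeping the file, table and schema together in one mapping makes it harder for a table and its schema to drift apart when a step is renamed.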

# Deployment

To refresh the Docker image (which you must do if you change any of the Python scripts in `parat_scripts/`), run:

```
docker build -t parat .
docker tag parat us.gcr.io/gcp-cset-projects/parat
docker push us.gcr.io/gcp-cset-projects/parat
```
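The three deployment commands above could also be wrapped in a small script that stops at the first failure. This is a sketch only: `deploy_commands` and `run_all` are hypothetical helpers, and the image name and registry are copied from the commands above.

```python
import subprocess
from typing import List


def deploy_commands(
    image: str = "parat",
    registry: str = "us.gcr.io/gcp-cset-projects",
) -> List[List[str]]:
    """Return the build, tag and push commands in the order they must run."""
    remote = f"{registry}/{image}"
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "tag", image, remote],
        ["docker", "push", remote],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or run it and raise if it exits non-zero."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing is built or pushed by accident.
    run_all(deploy_commands())
```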
40 changes: 0 additions & 40 deletions company_linkage/data/omit.csv

This file was deleted.
