Add extracting all patent counts to pipeline; modify other code for extensibility to enable this #198

rggelles
Member

@rggelles rggelles commented Jan 26, 2024

Note: this affects/modifies the grants PR

Closes #125

@rggelles rggelles requested a review from jmelot January 26, 2024 16:58

github-actions bot commented Jan 26, 2024

No need for rebasing 👍
behind_count is 0
ahead_count is 14

github-actions bot commented Jan 26, 2024

JavaScript Coverage

Summary

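| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 67% | 67.87% (374/551) | 59.58% (174/292) | 67.55% (127/188) |

Modified Files • (67%)

| File | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s |
| --- | --- | --- | --- | --- | --- |
| All files | 67.87 | 59.58 | 67.55 | 67.5 | |
| components | 64.4 | 62.11 | 62.01 | 63.84 | |
| DetailViewPublications.jsx | 0 | 0 | 0 | 0 | 15–143 |
| static_data | 70.96 | 44 | 76.66 | 67.85 | |
| data.js | 100 | 100 | 100 | 100 | |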

Base automatically changed from 146-add-patents-granted-metric to version2 January 29, 2024 18:25
@jmelot jmelot force-pushed the 125-add-patentsall_patents-to-data-to-aggregate-patent-data-across-subject-area branch from 11d1e24 to 61d109c Compare February 5, 2024 19:03
Member

@jmelot jmelot left a comment

Thank you! This looks good to me - I did have one substantive question about the use of a test parameter where I didn't expect it, but otherwise these are largely fussy comments.

company_linkage/parat_data_dag.py
company_linkage/parat_scripts/all_patents.py
company_linkage/parat_scripts/all_patents.py
company_linkage/sql/linked_all_patents.sql
@jmelot
Member

jmelot commented Feb 6, 2024

@rggelles I've started integrating this stuff, but I actually don't see all_patents in patent_visualization_data - has the pipeline been run since you made these changes, or does that column still need to get integrated into the final output table (or am I just blind - if so, sorry!)

@rggelles
Member Author

rggelles commented Feb 6, 2024

> @rggelles I've started integrating this stuff, but I actually don't see all_patents in patent_visualization_data - has the pipeline been run since you made these changes, or does that column still need to get integrated into the final output table (or am I just blind - if so, sorry!)

This is because I forgot a change I needed to make in the omit_by_rule query; added this in as well for the update I'm about to push. My bad.

@rggelles rggelles requested a review from jmelot February 7, 2024 14:55
@rggelles
Member Author

rggelles commented Feb 7, 2024

@jmelot I believe all fixes should be in now, including at least two substantive ones. Pipeline is rerunning, although even the substantive reviews shouldn't affect the code significantly. Would appreciate a second review of the changes made when it finishes. Thanks!

@rggelles
Member Author

rggelles commented Feb 7, 2024

(it does appear to be failing some of your tests, but I think those are related to the user interface so it's not clear to me what's going on here or what I need to fix).

Member

@jmelot jmelot left a comment

Reading over this, looks good, thanks! I'll have another try at integration soon (and you're right, the broken tests aren't related to anything you did - I got part of the way through integration and then stopped which left things in a broken state)

@jmelot
Member

jmelot commented Feb 8, 2024

Just checked this again - I think something is messed up: SELECT count(0) FROM ai_companies_visualization.patent_visualization_data where all_patents is not null returns 721, but SELECT count(0) FROM ai_companies_visualization.patent_visualization_data where ai_patents is not null returns 2478.
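
For anyone repeating this check, the two counts can come from a single query, since `COUNT(column)` skips NULLs. The sketch below is only an illustration: it uses an in-memory SQLite table as a stand-in for `patent_visualization_data`, and the three-column schema, the sample rows, and the helper names `non_null_counts` and `build_demo_db` are all assumptions rather than anything from the real pipeline.

```python
import sqlite3


def non_null_counts(conn, table, columns):
    """Return {column: number of rows where that column IS NOT NULL}."""
    # COUNT(col) ignores NULLs, so one scan yields every per-column count.
    select_list = ", ".join(f"COUNT({col}) AS {col}" for col in columns)
    row = conn.execute(f"SELECT {select_list} FROM {table}").fetchone()
    return dict(zip(columns, row))


def build_demo_db():
    """Hypothetical stand-in table; the real schema is not shown in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patent_visualization_data "
        "(cset_id INTEGER, ai_patents INTEGER, all_patents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO patent_visualization_data VALUES (?, ?, ?)",
        [(1, 5, 40), (2, 3, None), (3, 7, None), (4, None, None)],
    )
    return conn


if __name__ == "__main__":
    counts = non_null_counts(
        build_demo_db(), "patent_visualization_data", ["ai_patents", "all_patents"]
    )
    # A large gap between the two counts is the symptom reported above.
    print(counts)
```

Against the real dataset the same `COUNT(column)` select list should work, with the table name qualified by its dataset as in the queries above.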

@rggelles
Member Author

rggelles commented Feb 9, 2024

@jmelot Believe this should be fixed; I failed to recompile the docker container for the last run but it's done now.

@rggelles
Member Author

Just pinging on this again -- is there anything else that needs to be done to close this? This field should be available and usable now.

@brianlove brianlove changed the base branch from version2 to master February 22, 2024 19:08
@brianlove
Contributor

With the merging of version2 into master (#163), I've re-targeted this PR at master.

@jmelot
Member

jmelot commented Feb 29, 2024

@rggelles I was (finally) trying to integrate this, but I noticed the data got much larger. It turns out that the number of rows in patent_visualization_data has drastically increased (compare number of rows in the last few snapshots in ai_companies_visualization_backups for example). I think this may be due to duplicate rows creeping in - for example, SELECT count(0) FROM ai_companies_visualization.patent_visualization_data where cset_id = 796 now returns 36 rows.
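
A quick way to see how widespread this is would be to list every `cset_id` that has more than one row. The sketch below assumes one row per `cset_id` is the expected shape, which the comment above implies but does not state. The table is an in-memory SQLite stand-in with made-up rows, and `rows_per_id` and `build_demo_db` are hypothetical helper names.

```python
import sqlite3


def rows_per_id(conn, table, min_rows=2):
    """List (cset_id, row_count) for ids with at least min_rows rows, largest first."""
    query = (
        f"SELECT cset_id, COUNT(*) AS n FROM {table} "
        "GROUP BY cset_id HAVING COUNT(*) >= ? "
        "ORDER BY n DESC, cset_id"
    )
    return conn.execute(query, (min_rows,)).fetchall()


def build_demo_db():
    """Hypothetical data: id 796 appears three times, the others once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patent_visualization_data (cset_id INTEGER, all_patents INTEGER)"
    )
    rows = [(796, 120)] * 3 + [(1, 10), (2, 20)]
    conn.executemany("INSERT INTO patent_visualization_data VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    for cset_id, n in rows_per_id(build_demo_db(), "patent_visualization_data"):
        print(f"cset_id={cset_id} has {n} rows")
```

Running the same `GROUP BY ... HAVING` query against a backup snapshot and the current table would show whether the extra rows are new.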

@rggelles
Member Author

rggelles commented Mar 4, 2024

> @rggelles I was (finally) trying to integrate this, but I noticed the data got much larger. It turns out that the number of rows in patent_visualization_data has drastically increased (compare number of rows in the last few snapshots in ai_companies_visualization_backups for example). I think this may be due to duplicate rows creeping in - for example, SELECT count(0) FROM ai_companies_visualization.patent_visualization_data where cset_id = 796 now returns 36 rows.

Just pushed a fix for this; this was an issue in my aggregation in the SQL query rather than any problem in the actual data pulls. The pipeline is rerunning from that point (which is well after the expensive point so it shouldn't be an issue to run).
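
The thread does not show the faulty query, so the following is not the actual fix. It is only a generic sketch of one way an aggregation can leave several rows per `cset_id`: grouping by one column too many. The `patent_counts` table, its columns, its rows, and the names `OVER_GROUPED`, `PER_COMPANY`, and `run` are all hypothetical.

```python
import sqlite3

# Groups by an extra column: one output row per (cset_id, year) pair.
OVER_GROUPED = """
SELECT cset_id, SUM(all_patents) AS all_patents
FROM patent_counts
GROUP BY cset_id, year
"""

# Groups by the key only: exactly one output row per cset_id.
PER_COMPANY = """
SELECT cset_id, SUM(all_patents) AS all_patents
FROM patent_counts
GROUP BY cset_id
ORDER BY cset_id
"""


def build_demo_db():
    """Hypothetical per-year counts for two ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patent_counts (cset_id INTEGER, year INTEGER, all_patents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO patent_counts VALUES (?, ?, ?)",
        [(796, 2021, 10), (796, 2022, 12), (796, 2023, 15), (800, 2023, 2)],
    )
    return conn


def run(conn, query):
    """Execute a query and return all result rows."""
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(len(run(conn, OVER_GROUPED)), "rows when grouping by cset_id and year")
    print(run(conn, PER_COMPANY))
```

The underlying data is identical in both queries, which matches the note above that the data pulls themselves were fine.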

@jmelot jmelot force-pushed the 125-add-patentsall_patents-to-data-to-aggregate-patent-data-across-subject-area branch from a670079 to f3316b5 Compare March 5, 2024 20:10
@jmelot
Member

jmelot commented Mar 5, 2024

Ok I think I've successfully integrated this. @brianlove can you please check my commits (7dff783, f6d1873, f3316b5, a2ca760) and then if all is good we can merge!

@jmelot jmelot requested a review from brianlove March 6, 2024 20:38
…unts

Add citation counts for other classifiers
Contributor

@brianlove brianlove left a comment

👍

@brianlove brianlove merged commit 4113a20 into master Mar 8, 2024
4 checks passed
@brianlove brianlove deleted the 125-add-patentsall_patents-to-data-to-aggregate-patent-data-across-subject-area branch March 8, 2024 17:34
Development

Successfully merging this pull request may close these issues.

Add patents.all_patents to data to aggregate patent data across subject area
3 participants