Mc agency homepage searcher #74

Draft
wants to merge 74 commits into
base: main


maxachis
Collaborator

@maxachis maxachis commented Apr 10, 2024

Fixes

#53 - "Alternative way of getting more urls: Automated Search Engine Calls"

Description

This performs several related actions:

  • Obtains all agencies from the PUBLIC.AGENCIES table in the PDAP Digital Ocean DB which do not have a homepage_url and which have not already been searched for (as determined by whether their unique identifier is present in the PUBLIC.AGENCY_URL_SEARCH_CACHE table).
  • Generates a search query for each agency based on information present in the database, and performs an automated Google Search API Call. At the free tier, up to 100 such queries can be made each day.
  • Saves 10 of the results for each agency to a csv.
  • Once all searches have been completed (either 100 or whenever the search quota is reached), these entries are put in a CSV, along with identifying information, and uploaded to the huggingface PDAP/possible_homepage_urls dataset
  • Following this, the agencies which have been searched for are added to PUBLIC.AGENCY_URL_SEARCH_CACHE.
  • Also creates a yaml file which runs this functionality once per day.
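
For reviewers, here is a minimal sketch of the first two steps above: selecting agencies that have no homepage_url and are not yet in the cache, then building a Google Custom Search request for each. It is not the code in this PR. The column names `agency_uid`, `name`, and `state_name`, the query wording, and the helper names are all assumptions, and sqlite3 stands in for Postgres so the snippet runs on its own (psycopg2 would use `%s` instead of `?` for the parameter).

```python
import sqlite3
from urllib.parse import urlencode

# Anti-join: agencies with no homepage_url and no row in the search cache.
# Column names other than homepage_url are hypothetical.
EXAMPLE_SQL_GET_AGENCIES = """
    SELECT a.agency_uid, a.name, a.state_name
    FROM agencies a
    LEFT JOIN agency_url_search_cache c ON c.agency_uid = a.agency_uid
    WHERE a.homepage_url IS NULL
      AND c.agency_uid IS NULL
    ORDER BY a.agency_uid
    LIMIT ?
"""

# Endpoint of the Google Custom Search JSON API.
GOOGLE_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def get_agencies_to_search(conn, limit=100):
    """Return up to `limit` unsearched agencies (100 matches the free-tier quota)."""
    return conn.execute(EXAMPLE_SQL_GET_AGENCIES, (limit,)).fetchall()


def build_search_query(name, state_name):
    """Combine agency details into a search string (wording is an assumption)."""
    return f"{name} {state_name}"


def build_search_url(query, api_key, engine_id, num_results=10):
    """Build the request URL; the API returns at most 10 results per call."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num_results}
    return f"{GOOGLE_SEARCH_ENDPOINT}?{urlencode(params)}"
```

The LEFT JOIN with `c.agency_uid IS NULL` keeps the cache check in the query itself, so an agency that has already been searched never consumes part of the daily quota.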

Testing

  • Run test_agency_homepage_searcher_integration.py and test_agency_homepage_searcher_unit.py using pytest
  • Run main.py to confirm functionality. Check https://huggingface.co/datasets/PDAP/possible_homepage_urls/tree/main/data to confirm the search results were added. Additionally, you can check agency_url_search_cache in the database to confirm the expected agencies are updated.
  • The yaml file agency_homepage_searcher.yaml can be verified via a specialized docker container, but if that's too much effort we can do a "dude just trust me" on this one, as I've verified it myself using said specialized docker container.
  • Note the environment variables referenced in the yaml file -- if these don't exist as Github secrets, this yaml file will not work in the repository.
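
One way to catch a missing secret before the scheduled run fails partway through is a small pre-flight check like the sketch below. The variable names are the ones mentioned elsewhere in this thread; whether the yaml file references exactly this set is an assumption, and `missing_env_vars` is a hypothetical helper, not something in this PR.

```python
import os

# Names taken from this thread; the exact set used by the yaml file is assumed.
REQUIRED_ENV_VARS = (
    "CUSTOM_SEARCH_API_KEY",
    "CUSTOM_SEARCH_ENGINE_ID",
    "HUGGINGFACE_ACCESS_TOKEN",
    "DO_DATABASE_URL",
)


def missing_env_vars(environ=None, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` at the top of main.py and exiting with a clear message when the list is non-empty would make a misconfigured secret obvious in the Actions log.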

- Add psycopg2-binary
- Add huggingface-hub
- Add condition for when there is nothing to return, such as in an INSERT statement
- correct column name in SQL_UPDATE_CACHE
- Update SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS to not return entries which already exist in the cache.
- Bug was causing two newlines to appear on Windows.
- This should help prevent an entire set of searches from being lost if an error occurs in one
The CSV temporary file in the 'write_to_temporary_csv' function now has utf-8 encoding for better compatibility. Exception handling is incorporated to prevent crashes while writing rows to CSV. Print statements were added to log the number of search results obtained.
The STATE_ISO_TO_NAME_DICT dictionary has been removed and replaced by a newly implemented class, USStateReference. This class fetches state names from the database using state ISO codes. It is important to note that the get_state_name method from this class is now employed in the create_agency_info method to retrieve state names.
The pytest-postgresql dependency is removed from the requirements.txt file. This change is a part of ongoing refactoring efforts to simplify the project's dependencies and reduce potential conflict or compatibility issues.
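
The CSV-related commits above (the doubled newlines on Windows, UTF-8 encoding, and not losing a whole batch to one bad row) can be illustrated with a sketch along these lines. This is not the PR's `write_to_temporary_csv`. The signature, the return value, and the choice of caught exceptions are assumptions.

```python
import csv
import tempfile


def write_to_temporary_csv(rows, header):
    """Write rows to a temporary CSV file; return (path, number of skipped rows)."""
    skipped = 0
    # newline="" stops the text layer from translating the csv module's own
    # "\r\n" line endings, which is what produces blank lines on Windows.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, encoding="utf-8", newline=""
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(header)
        for row in rows:
            try:
                writer.writerow(row)
            except (csv.Error, TypeError) as exc:
                # One malformed row should not discard the rest of the batch.
                skipped += 1
                print(f"Skipping row {row!r}: {exc}")
    print(f"Wrote {len(rows) - skipped} of {len(rows)} rows")
    return tmp.name, skipped
```

Because the file is created with `delete=False`, the caller is responsible for removing it once the upload has finished.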
@mbodeantor
Contributor

@maxachis can you dm me the google key and id?

Instructions for obtaining `CUSTOM_SEARCH_API_KEY` and `CUSTOM_SEARCH_ENGINE_ID` have been elaborated. They now include specific directions on how to acquire an API key from the Google Custom Search Engine Overview and access the CSE ID from the Programmable Search Engine control panel.
@maxachis
Collaborator Author

@maxachis can you dm me the google key and id?

I can do that, but my recommendation is to see if, using the README, you're able to obtain your own. I've added additional instructions under the Environment Setup section, so it should only take a few clicks to get your own.

The version of huggingface-hub was updated from 0.20.3 to 0.22.2 in the requirements file of agency_homepage_searcher. This change ensures that the application uses the most recent and secure version of huggingface-hub.
The assignment of iso and name variables in the row mapping for state_names has been changed from accessing by key to accessing by index.
This update revises mock object creation and method calls in the 'test_homepage_searcher' and 'test_create_agency_info_with_valid_agency_row' methods in the 'TestHomepageSearcher' class. The changes refine the process of mocking the USStateReference class and extracting the state name in test scenarios.
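
To make the last two commits easier to follow, here is a rough sketch of a state lookup that reads each row by index, plus a mock of it for unit tests. Only the names `USStateReference` and `get_state_name` come from this PR. The constructor taking pre-fetched rows, the (iso, name) column order, and returning None for unknown codes are assumptions.

```python
from unittest.mock import MagicMock


class USStateReference:
    """Maps state ISO codes to state names (sketch, not the PR's implementation)."""

    def __init__(self, state_rows):
        # Rows are read by position: row[0] is the ISO code, row[1] the name.
        # Plain tuples from a default cursor support this; key access would not.
        self.state_names = {row[0]: row[1] for row in state_rows}

    def get_state_name(self, state_iso):
        """Return the state name, or None if the code is unknown."""
        return self.state_names.get(state_iso)


# In a unit test the database-backed class can be replaced by a mock that
# only answers get_state_name.
mock_reference = MagicMock(spec=USStateReference)
mock_reference.get_state_name.return_value = "Pennsylvania"
```

Using `spec=USStateReference` means a test that calls a method the real class does not define fails instead of silently passing.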
@maxachis
Collaborator Author

Looks like requirements.txt needs some updates

I've cleaned up requirements.txt to resolve some issues, and then ran the main script successfully in a Docker container using only the requirements needed for running this (which are specified in agency_homepage_searcher/requirements_agency_homepage_searcher_action.txt). I tried to install everything in requirements.txt at once within the Docker container, and it took a long time and then bricked the container. That's another reason why my initial thinking about consolidating the requirements files was wrong in the case of this repository.

@mbodeantor
Contributor

@josh-chamberlain do we have a PDAP google project we can create a key for pipeline on?

@josh-chamberlain
Contributor

josh-chamberlain commented Apr 15, 2024

@mbodeantor just added two secrets to this repo:

CUSTOM_SEARCH_ENGINE_ID and CUSTOM_SEARCH_API_KEY

(edit: added the wrong ones at first but fixed it)

to answer your question, yes—it's called "data sources search" and seems to work with those keys

Contributor

@mbodeantor mbodeantor left a comment

Can you refactor the DBManager class in util to take a single argument, DO_DATABASE_URL? Both GitHub and DigitalOcean have the db params stored that way in their environments.

Tests/test_agency_homepage_searcher_integration.py (outdated, resolved)
@mbodeantor
Contributor

@josh-chamberlain we will also need a pipeline HUGGINGFACE_ACCESS_TOKEN in the Github env

Unused imports 'csv' and 'factories' from pytest_postgresql were removed from test_agency_homepage_searcher_integration.py to reduce unnecessary overhead and increase code readability. This is part of an ongoing effort to ensure clean and efficient code base.
The parameters provided to initialize DBManager (user, password, host, port, and db_name) have been replaced with a single parameter - the database URL. This simplifies DBManager usage in the main module 'agency_homepage_searcher', enhancing readability and maintaining the security of database details.
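
As a rough illustration of why a single URL can replace the five separate parameters, the sketch below splits a Postgres connection URL back into its parts using only the standard library. `parse_database_url` is a hypothetical helper, not part of this PR, and the example URL is made up. In practice `psycopg2.connect()` accepts a connection URI string directly, so DBManager may not need to parse it at all.

```python
from urllib.parse import unquote, urlparse


def parse_database_url(db_url):
    """Split a postgresql:// URL into the parameters DBManager used to take."""
    parts = urlparse(db_url)
    return {
        "user": unquote(parts.username) if parts.username else None,
        # Passwords may be percent-encoded inside a URL, so decode them.
        "password": unquote(parts.password) if parts.password else None,
        "host": parts.hostname,
        "port": parts.port,
        "db_name": parts.path.lstrip("/"),
    }


# Example with a made-up URL:
# parse_database_url("postgresql://user:p%40ss@db.example.com:25060/defaultdb")
```

Keeping everything in one value also means only one secret has to be stored and rotated in each environment.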
@maxachis
Collaborator Author

@mbodeantor Made updates per your requests. Give it a shot now and see how it looks.

@mbodeantor
Contributor

@maxachis thanks, looks like just the README section on running the tests and script is missing

Updated the README.md file in the agency_homepage_searcher directory. Replaced old database log-in details with unified database URL and included instructions for running script and tests. This simplifies database configuration and ensures users have proper guidance to run the application and its related tests.
@maxachis
Collaborator Author

@mbodeantor Done. Ready for further review.

@mbodeantor
Contributor

@maxachis Looks like there are still some errors causing the checks to fail

@mbodeantor
Contributor

@josh-chamberlain Just want to confirm we have a HUGGINGFACE_ACCESS_TOKEN in the Github env

@maxachis
Collaborator Author

@maxachis Looks like there are still some errors causing the checks to fail

@mbodeantor This is related to bugs in my implementation of the type_and_docstring checks. I'm working on revising that in #77. Unfortunately, the type_and_docstring_check is not working right now.

@josh-chamberlain
Contributor

@maxachis I hadn't added that HUGGINGFACE_ACCESS_TOKEN in GitHub, by the way. It's there now.

…endpoint rather than direct calls to database.
Projects
Status: Work in progress
3 participants