Mc agency homepage searcher #74
base: main
Conversation
…raise runtime error
- Add psycopg2-binary
- Add huggingface-hub
- Add condition for when there is nothing to return, such as in an INSERT statement
- Correct column name in SQL_UPDATE_CACHE
- Update SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS to not return entries which already exist in the cache.
- Bug was causing two newlines to appear on Windows.
- This should help prevent an entire set of searches from being lost if an error occurs in one
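A sketch of that isolation pattern (the `search_all` and `search_one` names and structure are illustrative, not the module's actual code):

```python
def search_all(agencies, search_one):
    """Run a search for each agency, isolating failures.

    An exception in one search is logged and skipped, so the
    results already gathered for other agencies are not lost.
    """
    results = []
    for agency in agencies:
        try:
            results.append(search_one(agency))
        except Exception as error:  # broad on purpose: keep the batch alive
            print(f"Search failed for {agency}: {error}")
    return results
```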
…archer # Conflicts: # requirements.txt
The CSV temporary file in the 'write_to_temporary_csv' function now has utf-8 encoding for better compatibility. Exception handling is incorporated to prevent crashes while writing rows to CSV. Print statements were added to log the number of search results obtained.
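A minimal sketch of what `write_to_temporary_csv` now does, assuming a list-of-rows input (the exact signature and log messages are guesses; `newline=''` is also the standard fix for the doubled newlines on Windows mentioned above):

```python
import csv
import tempfile

def write_to_temporary_csv(rows):
    """Write rows to a UTF-8 encoded temporary CSV file.

    Rows that fail to serialize are logged and skipped instead of
    crashing the whole write; the file path is returned.
    """
    temp = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", encoding="utf-8",
        newline="",  # prevents doubled newlines on Windows
        delete=False,
    )
    with temp as file:
        writer = csv.writer(file)
        for row in rows:
            try:
                writer.writerow(row)
            except csv.Error as error:
                print(f"Skipping row {row!r}: {error}")
    print(f"Wrote {len(rows)} search results to {temp.name}")
    return temp.name
```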
The STATE_ISO_TO_NAME_DICT dictionary has been removed and replaced by a newly implemented class, USStateReference. This class fetches state names from the database using state ISO codes. It is important to note that the get_state_name method from this class is now employed in the create_agency_info method to retrieve state names.
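A rough sketch of how such a class might look (the SQL, table name, and constructor argument are assumptions; only `USStateReference` and `get_state_name` come from the PR itself):

```python
class USStateReference:
    """Maps state ISO codes (e.g. 'PA') to full state names.

    Loads the mapping once from the database. `db_manager` is assumed
    to expose an `execute(query)` method returning (iso, name) rows.
    """

    def __init__(self, db_manager):
        rows = db_manager.execute(
            "SELECT state_iso, state_name FROM state_names"  # query is assumed
        )
        self._names_by_iso = dict(rows)

    def get_state_name(self, state_iso: str) -> str:
        """Return the full state name for an ISO code."""
        return self._names_by_iso[state_iso]
```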
The pytest-postgresql dependency is removed from the requirements.txt file. This change is a part of ongoing refactoring efforts to simplify the project's dependencies and reduce potential conflict or compatibility issues.
@maxachis can you DM me the Google key and ID?
Instructions for obtaining `CUSTOM_SEARCH_API_KEY` and `CUSTOM_SEARCH_ENGINE_ID` have been elaborated. They now include specific directions on how to acquire an API key from the Google Custom Search Engine Overview and access the CSE ID from the Programmable Search Engine control panel.
I can do that, but my recommendation is to see if, using the README, you're able to obtain your own. I've added additional instructions under the Environment Setup section, so it should only take a few clicks to get your own.
The version of huggingface-hub was updated from 0.20.3 to 0.22.2 in the requirements file of agency_homepage_searcher. This change ensures that the application uses the most recent and secure version of huggingface-hub.
The assignment of iso and name variables in the row mapping for state_names has been changed from accessing by key to accessing by index.
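Illustratively, the change amounts to something like this (sample data; the real query and columns may differ):

```python
# Before: rows were accessed like mappings, keyed by column name:
#     iso, name = row["state_iso"], row["state_name"]
# After: rows are plain tuples, so fields are taken by position,
# matching the column order of the SELECT statement:
rows = [("PA", "Pennsylvania"), ("OH", "Ohio")]  # sample (iso, name) tuples
state_names = {row[0]: row[1] for row in rows}
```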
This update revises mock object creation and method calls in the 'test_homepage_searcher' and 'test_create_agency_info_with_valid_agency_row' methods in the 'TestHomepageSearcher' class. The changes refine the process of mocking the USStateReference class and extracting the state name in test scenarios.
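A sketch of the mocking pattern described, using `unittest.mock` (the call shape is an assumption; only the class and method names come from the PR):

```python
from unittest.mock import MagicMock

def test_state_name_lookup_sketch():
    # Stand-in for USStateReference so no database access is needed.
    mock_state_reference = MagicMock()
    mock_state_reference.get_state_name.return_value = "Pennsylvania"

    # The code under test would resolve the row's ISO code like this.
    state_name = mock_state_reference.get_state_name("PA")

    assert state_name == "Pennsylvania"
    mock_state_reference.get_state_name.assert_called_once_with("PA")
```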
I've cleaned up requirements.txt to resolve some issues, and then ran the main script successfully on a Docker container using only the requirements needed for running this (which are specified in
@josh-chamberlain do we have a PDAP Google project we can create a key for the pipeline on?
@mbodeantor just added two secrets to this repo:
(edit: added the wrong ones at first but fixed it) to answer your question, yes—it's called "data sources search" and seems to work with those keys
Can you refactor the DBManager class in util to take a single argument, DO_DATABASE_URL? Both GitHub and DigitalOcean have the db params stored that way in their environments.
@josh-chamberlain we will also need a pipeline HUGGINGFACE_ACCESS_TOKEN in the GitHub env
Unused imports 'csv' and 'factories' from pytest_postgresql were removed from test_agency_homepage_searcher_integration.py to reduce unnecessary overhead and improve readability. This is part of an ongoing effort to maintain a clean and efficient codebase.
The parameters provided to initialize DBManager (user, password, host, port, and db_name) have been replaced with a single parameter - the database URL. This simplifies DBManager usage in the main module 'agency_homepage_searcher', enhancing readability and maintaining the security of database details.
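A minimal sketch of a `DBManager` taking one connection URL (the lazy-connect property and attribute names are illustrative; `psycopg2` is assumed from the requirements changes above):

```python
class DBManager:
    """Wraps a database connection created from a single URL.

    Replaces the previous user/password/host/port/db_name parameters
    with one DO_DATABASE_URL-style connection string, matching how
    GitHub and DigitalOcean store the credentials.
    """

    def __init__(self, database_url: str):
        self.database_url = database_url
        self._connection = None

    @property
    def connection(self):
        if self._connection is None:
            # Deferred import so the class is loadable without the driver.
            import psycopg2
            self._connection = psycopg2.connect(self.database_url)
        return self._connection

# Usage sketch: DBManager(os.getenv("DO_DATABASE_URL"))
```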
@mbodeantor Made updates per your requests. Give it a shot now and see how it looks.
@maxachis thanks, looks like the README is just missing the section on running the tests and script
Updated the README.md file in the agency_homepage_searcher directory. Replaced old database log-in details with unified database URL and included instructions for running script and tests. This simplifies database configuration and ensures users have proper guidance to run the application and its related tests.
@mbodeantor Done. Ready for further review.
@maxachis Looks like there are still some errors causing the checks to fail
@josh-chamberlain Just want to confirm we have a HUGGINGFACE_ACCESS_TOKEN in the GitHub env
@mbodeantor This is related to bugs in my implementation of the type_and_docstring checks, which aren't working right now. I'm working on revising them in #77.
@maxachis I hadn't added that HUGGINGFACE_ACCESS_TOKEN in GitHub, by the way. It's there now.
…endpoint rather than direct calls to database.
Fixes
#53 - "Alternative way of getting more urls: Automated Search Engine Calls"
Description
This performs several related actions:
- Uploads search results to the `PDAP/possible_homepage_urls` dataset.

Testing

- Ran `test_agency_homepage_searcher_integration.py` and `test_agency_homepage_searcher_unit.py` using `pytest`.
- Ran `main.py` to confirm functionality. Check https://huggingface.co/datasets/PDAP/possible_homepage_urls/tree/main/data to confirm search results were added. Additionally, can check `agency_url_search_cache` in the database to confirm expected agencies are updated.
- `agency_homepage_searcher.yaml` can be verified via a specialized docker container, but if that's too much effort we can do a "dude just trust me" on this one, as I've verified it myself using said specialized docker container.