Use GitHub GraphQL for Metadata fetching (with new metadata fields) #138
base: master
- Created new function add_gh_metadata
- Use the GitHub GraphQL API to get all GitHub metadata
- Get all metadata that the old function already fetched
- Get latest release with tag and date
- Get commit history with commit count (only for the current month)
- Created new function gh_metadata_cleanup
- Clean up old commit history which is older than 12 months

This code is not tested yet, tbd.
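The cleanup step could look roughly like the sketch below. It is not the code from this PR: the function name `prune_commit_history` is hypothetical, and it assumes the commit history is stored as a mapping from `YYYY-MM` month keys to commit counts, which the commit message does not confirm.

```python
from datetime import date


def prune_commit_history(history, today, keep_months=12):
    """Return a copy of `history` without months older than `keep_months`.

    Assumption: `history` maps 'YYYY-MM' strings to commit counts.
    """
    # Turn (year, month) into a single running month number so that
    # comparisons work across year boundaries.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    kept = {}
    for month_key, count in history.items():
        year, month = (int(part) for part in month_key.split("-"))
        if year * 12 + (month - 1) > cutoff:
            kept[month_key] = count
    return kept


history = {"2023-06": 3, "2023-07": 5, "2024-06": 1}
pruned = prune_commit_history(history, today=date(2024, 6, 15))
```

With `today` set to 15 June 2024, the twelve months from `2023-07` through `2024-06` are kept and `2023-06` is dropped. Passing `today` explicitly keeps the function easy to test.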
- Removed old `get_gh_metadata` function and renamed new function to the same name
- Set GitHub GraphQL API batch amount to 60 to avoid API errors
- Fixed issue that `isArchived` field did not exist in the response
- Added simple error handling for the case that the GitHub metadata could not be fetched
- Fixed duplicated values for `stargazers_count` and `updated_at`
- Fixed date syntax for `current_release/published_at` and `commit_history`
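As an illustration of the batching idea, here is a minimal sketch that splits a list of repositories into fixed-size batches and pauses between them. The names `chunked`, `process_in_batches` and `handle_batch` are hypothetical and do not come from this PR; the real implementation may be structured differently.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(items, size, handle_batch, pause_seconds, sleep=time.sleep):
    """Call `handle_batch` on each batch and pause between batches.

    `handle_batch` is a placeholder for whatever performs the API request;
    it must return a list of results for the batch it was given.
    """
    results = []
    batches = list(chunked(items, size))
    for position, batch in enumerate(batches):
        results.extend(handle_batch(batch))
        # No need to wait after the final batch.
        if position < len(batches) - 1:
            sleep(pause_seconds)
    return results
```

The `sleep` parameter is injectable so the pause can be replaced by a stub in tests instead of actually waiting.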
- Re-implement sleep time for GitHub API to avoid rate limit
- Fix gh_metadata_cleanup task
Thank you.
The pending issues have been fixed, and I tested the change by running a full metadata processing run on the awesome-selfhosted-data repository. Some things in the code still have open questions; see comments marked with |
These changes have been tested by running a full metadata processing run on the awesome-selfhosted-data repository and checking that each metadata file received the correct data.

Bug Fixes:
- Metadata is no longer assigned via an index; it is now assigned by matching the `url` field in the GraphQL query result to the `source_code_url`

Logging:
- Added more information about the status of the metadata processing (as this can now take a while to process)
- Added more debug information about rate limit details from the GitHub API

Defaults:
- Added a default wait time between API requests to GitHub to avoid hitting the rate limit (default is now 60 seconds, can be configured in the `hecat.yml` file)
- Added a default batch size for the metadata processing (default is now 30, can be configured in the `hecat.yml` file)

Others:
- Added new function `extract_repo_name` to extract the repo name from the `source_code_url`
- Added try-catch block to catch exceptions when writing the metadata to a file
- Updated documentation to reflect the new batch_size configuration option and new API restrictions from GitHub

Co-authored-by: Le Duc Lischetzke <[email protected]>
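To illustrate why matching on the URL is more robust than relying on list positions, here is a rough sketch. It is not the code from this commit: the return shape of `extract_repo_name` (an owner/name tuple), the helper names `normalize_url` and `match_by_url`, and the `name` key on each entry are all assumptions made for the example.

```python
from urllib.parse import urlparse


def extract_repo_name(source_code_url):
    """Split a GitHub repository URL into (owner, name)."""
    parts = urlparse(source_code_url).path.strip("/").split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("not a repository URL: %r" % source_code_url)
    owner, name = parts[0], parts[1]
    if name.endswith(".git"):
        name = name[:-4]
    return owner, name


def normalize_url(url):
    """Lower-case the URL and drop a trailing slash so comparisons are stable."""
    return url.rstrip("/").lower()


def match_by_url(entries, repositories):
    """Map each entry name to the repository whose `url` matches its `source_code_url`.

    Repositories that failed to load (None) are skipped, so a missing
    result can no longer shift every later entry by one position.
    """
    by_url = {normalize_url(repo["url"]): repo for repo in repositories if repo}
    return {
        entry["name"]: by_url.get(normalize_url(entry["source_code_url"]))
        for entry in entries
    }
```

An entry without a matching repository maps to `None`, which the caller can log and skip rather than writing another project's metadata into its file.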
Just a quick ping to remind you about this. 😊
I am currently trying to implement awesome-selfhosted/awesome-selfhosted-data#84, as well as some other nice-to-have metadata (which I personally would like to see, such as the current release and release date), which means fetching the metadata over the GraphQL API instead of the Python package currently used.
I don't normally do anything with Python, so I am happy enough that the current code seems to work. There will probably still be a lot to rewrite and change. Sorry in advance for the quality of the code.
I couldn't get the batch size up to 100, as GitHub already returns an error with 75 or more (I have no clue why).
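For readers unfamiliar with batching in GraphQL, one common way to fetch several repositories in a single request is to give each `repository(...)` field its own alias. The sketch below only builds the query string and does not send it; the function name `build_batch_query`, the alias scheme `repo0`, `repo1`, ... and the selected fields are assumptions for illustration, not necessarily what this PR sends.

```python
import json

# Fields requested for every repository (an illustrative selection).
REPO_FIELDS = (
    "url stargazerCount isArchived pushedAt "
    "latestRelease { tagName publishedAt }"
)


def build_batch_query(repos):
    """Build one GraphQL query that fetches every (owner, name) pair in `repos`."""
    parts = []
    for index, (owner, name) in enumerate(repos):
        # json.dumps adds the surrounding quotes and escapes special characters.
        parts.append(
            "repo%d: repository(owner: %s, name: %s) { %s }"
            % (index, json.dumps(owner), json.dumps(name), REPO_FIELDS)
        )
    # Ask for rate limit details in the same request so they can be logged.
    parts.append("rateLimit { cost remaining resetAt }")
    return "query { " + " ".join(parts) + " }"


query = build_batch_query([("nodiscc", "hecat"), ("awesome-selfhosted", "awesome-selfhosted-data")])
```

Because every aliased repository adds to the size and cost of the query, keeping the batch size configurable makes it easier to back off when the API starts returning errors.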
New metadata preview:
Edit: There are probably still things that need to be done, like tests. I also created a bug that I can't seem to fix? Currently, the metadata is assigned to the wrong file, but I don't understand why.