Use GitHub GraphQL for Metadata fetching (with new metadata fields) #138

Draft
wants to merge 5 commits into
base: master
from

Conversation

Rabenherz112

@Rabenherz112 Rabenherz112 commented May 4, 2024

I am currently trying to implement awesome-selfhosted/awesome-selfhosted-data#84, along with some other nice-to-have metadata that I would personally like to see (such as the current release and its release date). This means fetching the metadata through the GitHub GraphQL API instead of the Python package currently used.

I don't normally do anything with Python, so I'm happy enough that the current code seems to work. There will probably still be a lot to rewrite and change. Sorry in advance for the quality of the code.

I couldn't get the batch size up to 100, as GitHub already returns an error at 75 or more repositories per request (I have no clue why).

New metadata preview:

name: Paperless-ngx
website_url: https://docs.paperless-ngx.com/
description: Scan, index, and archive all of your paper documents with an improved interface (fork of Paperless).
licenses:
  - GPL-3.0
platforms:
  - Python
  - Docker
tags:
  - Document Management
source_code_url: https://github.com/paperless-ngx/paperless-ngx
demo_url: https://demo.paperless-ngx.com/
stargazers_count: 17058
updated_at: '2024-05-07'
archived: false
current_release:
  tag: v2.8.1
  published_at: '2024-05-07'
commit_history:
  2024-05: 46

Edit: There are probably still things that need to be done, like tests. I also introduced a bug that I can't seem to fix: currently, the metadata is assigned to the wrong file, and I don't understand why.

- Created new function add_gh_metadata
    - Use the GitHub GraphQL API to get all GitHub metadata
    - Get all metadata already fetched by the old function
    - Get latest release with tag and date
    - Get commit history with commit count (only for the current month)
- Created new function gh_metadata_cleanup
    - Clean up old commit history which is older than 12 months

This code is not tested yet, tbd.
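
As an illustration of the 12-month cleanup described above, here is a minimal sketch. It is not the code from this PR: the function name `prune_commit_history` is hypothetical, and it assumes `commit_history` is a mapping of `YYYY-MM` keys to commit counts (as in the metadata preview) and that an entry exactly 12 months old is still kept.

```python
from datetime import date


def prune_commit_history(history, today, keep_months=12):
    """Return a copy of `history` without entries older than `keep_months`.

    `history` maps "YYYY-MM" keys to commit counts (assumed layout).
    """
    # Put every month on a single axis so year boundaries need no special case.
    current = today.year * 12 + today.month
    kept = {}
    for key, count in history.items():
        year, month = (int(part) for part in str(key).split("-"))
        age_in_months = current - (year * 12 + month)
        # "Older than 12 months" is read as age > 12 (an assumption).
        if age_in_months <= keep_months:
            kept[key] = count
    return kept


history = {"2023-04": 12, "2023-06": 30, "2024-05": 46}
print(prune_commit_history(history, today=date(2024, 5, 7)))
# {'2023-06': 30, '2024-05': 46}
```

Passing `today` explicitly keeps the function deterministic, which should make it easier to cover with a test.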
- Removed old `get_gh_metadata` function and renamed new function to the same name
- Set GitHub GraphQL API batch size to 60 to avoid API errors
- Fixed issue that `isArchived` field did not exist in the response
- Added simple error handling for the case that the github metadata could not be fetched
- Fixed duplicated values for `stargazers_count` and `updated_at`
- Fixed date syntax for `current_release/published_at` and `commit_history`
- Re-implement sleep time for GitHub API to avoid rate limit
- Fix gh_metadata_cleanup task
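
For readers unfamiliar with batching in GraphQL, the sketch below shows one way several repositories could be fetched in a single request by giving each `repository` field its own alias. It is an assumption about the approach, not the code from this PR: `build_batch_query` and `REPO_FIELDS` are hypothetical names, and using `pushedAt` as the source for `updated_at` is a guess. The GraphQL fields themselves (`stargazerCount`, `isArchived`, `latestRelease`, `history(since:)`) are part of GitHub's public schema.

```python
import json

# Fields requested for every repository. Which ones the PR actually requests
# is an assumption; the names are taken from GitHub's GraphQL schema.
REPO_FIELDS = """
    url
    stargazerCount
    isArchived
    pushedAt
    latestRelease { tagName publishedAt }
    defaultBranchRef {
      target {
        ... on Commit { history(since: $since) { totalCount } }
      }
    }
"""


def build_batch_query(repos):
    """Build one GraphQL query covering every (owner, name) pair in `repos`."""
    blocks = []
    for index, (owner, name) in enumerate(repos):
        # Aliases (repo0, repo1, ...) allow the same field to appear many times.
        # json.dumps quotes and escapes the values as string literals.
        blocks.append(
            f"  repo{index}: repository(owner: {json.dumps(owner)}, "
            f"name: {json.dumps(name)}) {{{REPO_FIELDS}  }}"
        )
    return "query($since: GitTimestamp!) {\n" + "\n".join(blocks) + "\n}"


query = build_batch_query([("paperless-ngx", "paperless-ngx"), ("nodiscc", "hecat")])
print(query)
```

The resulting string would be sent as the `query` value of a POST to `https://api.github.com/graphql`, with `since` supplied in `variables` (for example, the first day of the current month as an ISO 8601 timestamp).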
@Rabenherz112 Rabenherz112 marked this pull request as draft May 4, 2024 15:09
@Rabenherz112 Rabenherz112 changed the title Use GitHub GraphQL for Metadat fetching (with new metadata fields) Use GitHub GraphQL for Metadata fetching (with new metadata fields) May 4, 2024
@nodiscc nodiscc added enhancement New feature or request help wanted Extra attention is needed labels May 4, 2024
@nodiscc nodiscc added this to the 1.3.0 milestone May 4, 2024
@nodiscc
Owner

nodiscc commented May 5, 2024

Thank you.
I will review this and #133 when I get some time. It might take a while; I will do it eventually, but I don't know when.

@nodiscc nodiscc removed the help wanted Extra attention is needed label May 5, 2024
@Rabenherz112
Author

Rabenherz112 commented May 5, 2024

Pending issues have been fixed, and a test has been done by running a full metadata processing on the awesome-selfhosted-data repository.

Some things in the code still have open questions; see comments marked with TODO:.

These changes have been tested by running a full metadata processing on the awesome-selfhosted-data repository and checking the metadata files for the correct affiliation.

Bug Fixes:
- Metadata is no longer assigned by index; instead, the `url` field returned by the GraphQL query is matched against the `source_code_url`
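
A minimal sketch of why matching on the URL is more robust than relying on position: if a repository is missing from the response, or the response order differs from the request order, an index-based assignment shifts metadata onto the wrong entry, whereas a lookup keyed on the URL does not. The helper names (`normalize_url`, `assign_metadata`) and the dictionary layout are assumptions for illustration, not the PR's code.

```python
def normalize_url(url):
    """Make URLs comparable: trim whitespace, trailing slash, '.git', and case."""
    url = url.strip().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    return url.lower()


def assign_metadata(entries, graphql_data):
    """Copy metadata onto each entry whose source_code_url matches a node's url."""
    by_url = {}
    for node in graphql_data.values():
        if node is None:  # repositories that could not be resolved come back as null
            continue
        by_url[normalize_url(node["url"])] = node

    for entry in entries:
        node = by_url.get(normalize_url(entry["source_code_url"]))
        if node is None:
            continue  # leave the entry untouched rather than guessing
        entry["stargazers_count"] = node["stargazerCount"]
        entry["archived"] = node["isArchived"]
    return entries


entries = [
    {"name": "A", "source_code_url": "https://github.com/example/a"},
    {"name": "B", "source_code_url": "https://github.com/example/b/"},
]
# Response deliberately out of order, with one unresolved repository.
response = {
    "repo0": {"url": "https://github.com/example/b", "stargazerCount": 2, "isArchived": True},
    "repo1": None,
    "repo2": {"url": "https://github.com/example/a", "stargazerCount": 1, "isArchived": False},
}
print(assign_metadata(entries, response))
```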

Logging:
- Added more information about the status of the metadata processing (as this can now take a while to process)
- Added more debug output for rate limit information from the GitHub API

Defaults:
- Added a default wait-time between API requests to GitHub to avoid hitting the rate limit (default is now 60 seconds, can be configured in the `hecat.yml` file)
- Added a default batch-size for the metadata processing (default is now 30, can be configured in the `hecat.yml` file)
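
The interaction between the two defaults above can be sketched as follows: the repository list is split into batches of `batch_size`, and the process waits between consecutive requests. This is a sketch under assumptions, not the PR's code; `process_in_batches`, `fetch`, and the `sleep_time` parameter name are hypothetical (only `batch_size` is named in this PR).

```python
import time


def iter_batches(items, batch_size=30):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(items, fetch, batch_size=30, sleep_time=60, sleep=time.sleep):
    """Call `fetch` once per batch, pausing `sleep_time` seconds between calls."""
    results = []
    batches = list(iter_batches(items, batch_size))
    for number, batch in enumerate(batches, start=1):
        results.extend(fetch(batch))
        if number < len(batches):  # no need to wait after the final batch
            sleep(sleep_time)
    return results


# Demo with a fake fetch and a recording sleep, so nothing actually waits.
waits = []
sizes = process_in_batches(
    list(range(70)),
    fetch=lambda batch: [len(batch)],
    sleep=waits.append,
)
print(sizes, waits)
# [30, 30, 10] [60, 60]
```

With these defaults, the total waiting time grows by one minute for every 30 repositories, which is why a full run can take a while.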

Others:
- Added new function `extract_repo_name` to extract the repo name from the `source_code_url`
- Added try-except block to catch exceptions when writing metadata to a file
- Updated documentation to reflect new batch_size configuration option and new API restrictions from GitHub
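
For illustration, a helper along the lines of `extract_repo_name` could look like the sketch below. The actual implementation in this PR may differ; in particular, returning an `(owner, name)` tuple and raising `ValueError` for non-GitHub URLs are assumptions.

```python
from urllib.parse import urlparse


def extract_repo_name(source_code_url):
    """Return (owner, name) for a github.com repository URL.

    Raises ValueError when the URL does not point at a GitHub repository.
    """
    parsed = urlparse(source_code_url.strip())
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {source_code_url}")
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"no owner/name in URL: {source_code_url}")
    owner, name = parts[0], parts[1]
    if name.endswith(".git"):
        name = name[:-4]
    return owner, name


print(extract_repo_name("https://github.com/paperless-ngx/paperless-ngx"))
# ('paperless-ngx', 'paperless-ngx')
```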

Co-authored-by: Le Duc Lischetzke <[email protected]>
@Rabenherz112
Author

Just a quick ping to remind you about this. 😊
No rush at all, though!

Labels
enhancement New feature or request

2 participants