How do you obtain the latest data from table 'bigquery-public-data.github_repos.languages' #117

Open
junluo-aspecta opened this issue Mar 21, 2024 · 7 comments

@junluo-aspecta

junluo-aspecta commented Mar 21, 2024

The table 'bigquery-public-data.github_repos.languages' was last updated in 2022 and contains just over 3 million rows in total. How do you obtain the latest data and ensure that the data statistics reflect the entire GitHub ecosystem?
[Image: 20240321-110718]

@madnight
Owner

Hi @junluo-aspecta,

Yes, bigquery-public-data.github_repos.languages includes "only" 3 million repositories, but that sample size is statistically safe. An election poll, for instance, usually asks just a few thousand people and scales the result up to the whole population, e.g., the US population of 330 million. With a few thousand respondents the margin of error is fairly large (a few percent); with 3 million it is extremely low.
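To put rough numbers on the polling comparison, here is a minimal sketch using the standard normal-approximation margin of error for a proportion. It assumes simple random sampling, which the BigQuery table may not satisfy, and the function name and the worst-case share `p = 0.5` are illustrative choices, not something taken from githut.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the ~95% normal-approximation interval for a proportion.

    n: sample size, p: assumed true share (0.5 is the worst case),
    z: critical value (1.96 corresponds to roughly 95% confidence).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Compare a poll-sized sample with a 3-million-repository sample.
for n in (3_000, 3_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(n) * 100:.3f} percentage points")
```

Under these assumptions the margin shrinks from roughly 1.8 percentage points at 3,000 samples to roughly 0.06 at 3 million. The formula only covers sampling error; it says nothing about bias from how the sample was selected.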

As for the last update in 2022 (Nov 27, 2022, to be precise): that is an actual issue I was not aware of. It means that I'm currently counting the Events correctly, but repositories created after Nov 27, 2022 are not included. This alters the statistics significantly in the long run, because we are only matching the Events against a sample of 3M repos that were created before Nov 27, 2022. Hence, I have to find a new way to obtain a big enough, up-to-date sample of repository language metadata. Thanks for discovering and reporting this.

@madnight madnight added the bug label Mar 23, 2024
@madnight madnight self-assigned this Mar 23, 2024
@madnight
Owner

Hi @junluo-aspecta,

Okay, I did some research, thought for a while, and came up with a new idea. We can extract language information directly from the GH Archive Events, because it is stored in the PullRequest Events. This gives a large sample (millions of repositories), and it is up-to-date since we can count the language from the PullRequest events of the current quarter. The drawback is that this approach ignores any repository that has not seen a PullRequest over the last quarter (not even from a bot such as Dependabot). I think it is a fair trade-off for now, until we can maybe come up with a better idea.

f8adb52

@vincentdephily

I think it's fair to only count repos that have seen some activity in the given timeframe. But not counting pushes/issues/stars unless they got a pull request seems like a significant bias. Many small projects and even some bigger ones don't use pull requests.

Apparently languish switched to using GitHub's GraphQL API to get the repo language. Would that suit githut's needs?

@madnight
Owner

I agree with you regarding the bias.

Regarding the GitHub GraphQL API, they have pretty tight rate limits. These limits are designed for normal users, not for those who want to fetch information from millions of repositories. I'm not sure how Languish handles it, but they might be using a much smaller sample size. However, reducing the sample size can create statistical challenges, especially for the lower ranks.

@vincentdephily

Languish fetches 500 repos per GQL query, so that's 6000 queries for 3 million repos. You could spread that over 24h and/or use a lot of caching (repo language changes should be rare).
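The arithmetic can be checked with a small sketch. The helper names are hypothetical, and the even pacing is just one way to spread the queries; it does not model GitHub's actual rate-limit accounting.

```python
from typing import Iterator, List, Sequence, Tuple


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def query_plan(total_repos: int, batch_size: int, window_hours: float) -> Tuple[int, float]:
    """Return (number of queries, seconds between queries) for an even spread."""
    queries = -(-total_repos // batch_size)  # ceiling division
    seconds_between = window_hours * 3600 / queries
    return queries, seconds_between


queries, gap = query_plan(total_repos=3_000_000, batch_size=500, window_hours=24)
print(f"{queries} queries, one every {gap:.1f} s")
```

With the figures from the comment this comes to 6000 queries, or one query every 14.4 seconds when spread evenly over 24 hours.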

I couldn't figure out what languish does on that front, maybe @tjpalmer can enlighten us.

@tjpalmer

I run once per quarter, and the GraphQL churns for a long time. Can take hours. But they haven't blocked me yet. And yeah, I do cache results offline. And requests frequently error, so I run repeatedly. And some things never seem to show up. But I still feel like I see recent data better anyway. I may have been caching for too long now, though, since some repos may have changed their primary language since I started caching. Anyway, I just hobble along as well as I can. Maybe we should ask them for better access sometime. Seriously, if they still just dumped repos regularly to BigQuery, that would be awesome.
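The cache-and-rerun workflow described here could look roughly like the sketch below. This is not Languish's code: `fetch_with_cache`, its parameters, and the JSON cache file are hypothetical, and the `fetch` callable stands in for whatever performs the real API request.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def fetch_with_cache(
    repos: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    cache_path: Path,
    max_passes: int = 3,
) -> Dict[str, Optional[str]]:
    """Look up each repo's language, reusing cached answers and retrying failures."""
    cache: Dict[str, Optional[str]] = (
        json.loads(cache_path.read_text()) if cache_path.exists() else {}
    )
    pending = [repo for repo in repos if repo not in cache]
    for _ in range(max_passes):
        if not pending:
            break
        failed = []
        for repo in pending:
            try:
                cache[repo] = fetch(repo)
            except Exception:
                # Request errored; try again on the next pass.
                failed.append(repo)
        pending = failed
        # Persist after every pass so an interrupted run keeps its progress.
        cache_path.write_text(json.dumps(cache))
    return cache
```

Repos that still fail after `max_passes` are left out of the result, which matches the observation that some things never show up. Cached entries never expire in this sketch, so it has the same staleness problem mentioned above; storing a timestamp per entry would be one way to address that.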

@tjpalmer

Oh. I also only look at repos that have at least 10 events in the quarter or some such. My memory is that it substantially reduces the number of repos I query on. Still lots and lots, though.
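As an illustration of that activity filter, a minimal sketch: count events per repository and keep only those at or above a threshold. The function name and the default of 10 are assumptions based on the "10 events or some such" recollection above.

```python
from collections import Counter
from typing import Iterable, Set


def active_repos(event_repo_names: Iterable[str], min_events: int = 10) -> Set[str]:
    """Return the repos that appear in at least `min_events` events."""
    counts = Counter(event_repo_names)
    return {repo for repo, n in counts.items() if n >= min_events}


# Made-up event stream: one repo with 10 events, one with 9.
names = ["busy/repo"] * 10 + ["quiet/repo"] * 9
print(active_repos(names))
```

Only the repositories that pass the filter would then need a language lookup, which is what cuts down the number of API queries.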


4 participants