I've tried using the following packages to interface with Google Scholar:
scholarly
gscholar
mechanize (via a simulated browser)
I can get each of these to return valid information for a small number of queries. However, when I submit many queries (I'm not sure of the precise number: 20? 50? 100?), I start seeing HTTP 429 errors ("too many requests"). It seems that the Google Scholar backend limits the number of queries per day (or possibly the rate?) that can come from a single browser, IP address, or user (I'm not sure how the limit is keyed).
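One thing that might be worth trying for the 429 responses is retrying with an increasing delay. The sketch below is only an illustration: `fetch_with_backoff` and the zero-argument `fetch` callable (returning a status code, a headers dict, and a body) are hypothetical names, not part of scholarly, gscholar, or mechanize. If the limit turns out to be a daily quota rather than a rate, backing off for a few seconds will not help.

```python
import time


def fetch_with_backoff(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429 or the tries run out.

    `fetch` is a hypothetical zero-argument callable that returns
    (status_code, headers_dict, body) for one query.
    """
    for attempt in range(max_tries):
        status, headers, body = fetch()
        if status != 429:
            return status, body
        # Prefer the server's Retry-After hint when it is a whole number of
        # seconds; otherwise double the wait on each attempt.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        # No point sleeping after the final attempt.
        if attempt < max_tries - 1:
            sleep(delay)
    return 429, None
```

The `sleep` parameter is there so the waiting can be swapped out when experimenting; in real use the default `time.sleep` applies.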
This seems to make it impossible (or at least "non-trivial") to use Google Scholar to verify and/or look up bibliographic information.
I've also tried using the semanticscholar package to interface with Semantic Scholar. Unfortunately, the Semantic Scholar API requires knowing the DOI, author ID, or Semantic Scholar code, which I don't have for most papers. The Google Scholar packages do support DOI lookups, but that doesn't help (if I could reliably access Google Scholar, we wouldn't need Semantic Scholar!). I also tried submitting requests to Crossref (using the mechanize package to simulate browser requests, and then regular expressions to parse out DOIs), but the results were highly unreliable (only a very small proportion of queries returned useful information).
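An alternative to scraping Crossref through a simulated browser would be its public REST API, which accepts free-text bibliographic queries and returns JSON. This is a minimal sketch that I have not run against the entries in this project: the helper names (`crossref_query_url`, `extract_candidates`, `lookup_doi`) are made up for illustration, and the top-scoring match is not guaranteed to be the right paper, so any returned DOI would still need checking against the entry's title and authors.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(title, author=None, rows=3):
    """Build a Crossref works query from whatever metadata is available."""
    params = {"query.bibliographic": title, "rows": rows}
    if author:
        params["query.author"] = author
    return CROSSREF_WORKS + "?" + urlencode(params)


def extract_candidates(payload):
    """Return (doi, title, score) tuples from a decoded works response."""
    items = payload.get("message", {}).get("items", [])
    candidates = []
    for item in items:
        # Crossref returns the title as a list of strings (possibly missing).
        titles = item.get("title") or [""]
        candidates.append((item.get("DOI"), titles[0], item.get("score")))
    return candidates


def lookup_doi(title, author=None):
    """Query Crossref over the network and return candidate matches."""
    with urlopen(crossref_query_url(title, author), timeout=30) as resp:
        return extract_candidates(json.load(resp))
```

Because the response is structured JSON, no regular expressions are needed to pull out the DOI.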
So: I'm stumped. Until I can figure out a way forward (e.g., a way around Google Scholar's limits, a way to look up information via Semantic Scholar, and/or another reliable source for bibliographic information), I'm going to remove bibliographic lookups from the BibTeX checker code. My (broken) attempts can be found in the notebook (dev folder) of this commit.
The main issue I was trying to solve was that some of the page numbers are either internally inconsistent or invalid (e.g., the page range runs from a high number to a low one, or mixes alphabetic and numeric characters in ways that seem suspect). I'm going to implement some heuristics for cleaning up those sorts of issues (to the extent that I can reliably detect them), and for now I'll ignore the likelihood that some bibliographic information has been entered incorrectly.
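As a rough sketch of what those heuristics could look like (not the checker's actual implementation; `check_pages` is a hypothetical name), the function below flags the two cases described above: a range that runs from high to low, and page values that are not purely numeric.

```python
import re


def check_pages(pages):
    """Return a list of problems found in a BibTeX `pages` value."""
    # Split on one or more hyphens or en dashes, e.g. "123--130".
    parts = [p for p in re.split(r"\s*[-\u2013]+\s*", pages.strip()) if p]
    if not parts or len(parts) > 2:
        return ["not a single page or start-end range"]

    def is_number(text):
        return re.fullmatch(r"[0-9]+", text) is not None

    problems = []
    for part in parts:
        # Flags values such as "12a" or "e45" that contain non-digits.
        if not is_number(part):
            problems.append(f"non-numeric page: {part!r}")
    if len(parts) == 2 and all(is_number(p) for p in parts):
        if int(parts[0]) > int(parts[1]):
            problems.append("range runs from high to low")
    return problems
```

Two caveats: an abbreviated range such as `123-30` would be reported as running from high to low, and some journals legitimately use article identifiers that contain letters, so these results are better treated as warnings to review than as errors to fix automatically.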