Transfer our bibliography data to zbmath, then replace our bibliography with a link to swmath #343

Open
fingolfin opened this issue Mar 25, 2024 · 4 comments


@fingolfin

The link: https://zbmath.org/software/320

This list provides basically everything we have at https://www.gap-system.org/Doc/Bib/bib.html and even has additional nice features. And unlike MathSciNet, it is free to use.

While it has more publications overall than we do, it does miss some. In some cases a paper might not be indexed by them at all, but so far every case I found where a paper is in our list but not in theirs was a matter of missing metadata on their part, i.e., the "tag" "sw:gap" is missing on some papers for whatever reason.

I have contacted them and in principle I can send them lists of papers that are missing this tag and they'll add it (presumably after some validation, of course).

That leaves the problem as to how we get that list. Of course we can manually check things but there are thousands. So better to automate it. Here is how one could do that:

  1. get our data -- easy, just download https://www.gap-system.org/Doc/Bib/gap-publishednicer.bib
  2. get their data
    • I wrote a script to do so with their help and some manual tweaking, and have that .bib file (it is 1.8 MB so I am not attaching it but instead I'll add the crude script below)
  3. write a tool which parses the bib files (e.g. in Python and using https://bibtexparser.readthedocs.io/en/main/), then lists papers we have but they don't
    • this is easy for papers with a DOI and if both sides have the DOI, so let's drop those first
    • next compare using title, year, author(s)?
    • keep refining but at some point it will be more efficient to just let humans consider the lists...
  4. for the remaining papers, try to get their zbmath ID ... this could use their website, but it seems they have an API for that, with some Python bindings here: https://github.com/zbMATHOpen/zbRestApiClient
    • actually it may make sense to combine 3 with 4: if we can identify one of "our" papers using the zbmath API then it is easy to determine if it is in their list of "papers using GAP" or not...
  5. the final result would be two lists of papers
    • one with papers we have but they don't and which we successfully identified (we probably just need the list of ids here and can then send it to them)
    • papers they don't seem to have in the database at all
      • this will certainly include many theses!
      • how we deal with this we'll have to decide once we have that list.
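
A minimal sketch of the comparison in step 3, assuming both .bib files have already been parsed (e.g. with bibtexparser) and converted into plain dicts with lowercase string fields `doi`, `title` and `year`. The helper names are made up for illustration, and the title normalization is deliberately crude:

```python
import re
import unicodedata


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def normalize_title(title):
    """Reduce a title to lowercase ASCII words separated by single spaces."""
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Dropping everything except letters and digits also removes BibTeX braces.
    return " ".join(re.findall(r"[a-z0-9]+", text))


def match_key(entry):
    """Fallback key for entries without a usable DOI: (title, year)."""
    return (normalize_title(entry.get("title", "")), entry.get("year", "").strip())


def find_missing(ours, theirs):
    """Return entries of `ours` with no DOI or (title, year) match in `theirs`."""
    their_dois = {normalize_doi(e["doi"]) for e in theirs if e.get("doi")}
    their_keys = {match_key(e) for e in theirs}
    missing = []
    for entry in ours:
        doi = entry.get("doi")
        if doi and normalize_doi(doi) in their_dois:
            continue  # easy case: both sides have the same DOI
        key = match_key(entry)
        if key[0] and key in their_keys:
            continue  # same normalized title and year
        missing.append(entry)
    return missing
```

Whatever `find_missing` returns is the list that would still need author-based matching, a zbMATH lookup, or a human looking at it.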

Script for getting zbmath data

#!/bin/sh
# Download the zbMATH BibTeX export for software ID 320 (GAP),
# 200 entries per request, for start offsets 0, 200, ..., 3600.
: > zbmath.bib
start=0
while [ "$start" -le 3600 ]; do
    curl "https://zbmath.org/bibtexoutput/?q=si%3A320&start=${start}&count=200" >> zbmath.bib
    start=$((start + 200))
done
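
If the download is folded into the Python tool from step 3, the request URLs can be generated instead of hard-coded. This is only a sketch: `page_urls` is a made-up helper, the total number of entries has to be supplied by the caller, and nothing is actually fetched here:

```python
from math import ceil
from urllib.parse import urlencode

BASE_URL = "https://zbmath.org/bibtexoutput/"


def page_urls(total, page_size=200, query="si:320"):
    """Build one export URL per page of `page_size` results."""
    pages = ceil(total / page_size)
    urls = []
    for page in range(pages):
        # urlencode percent-encodes the colon, giving q=si%3A320 as in the script.
        params = urlencode({"q": query, "start": page * page_size, "count": page_size})
        urls.append(f"{BASE_URL}?{params}")
    return urls
```

For example, `page_urls(3800)` yields the same 19 URLs as the shell script, from `start=0` to `start=3600`.
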
@fingolfin

Overall, it's still like this: our bib data only goes up to 2021. We list 3377 papers in total which cite GAP. In contrast, https://zbmath.org/software/320 has 3782 citations, which of course is more. That said, if I ask for a list only up to 2021 then it contains just 3171 documents, so they are "missing" about 206. Well, at least that many -- it is quite possible that they also have papers that we don't have, and thus that there are more than 206 publications we list and they don't...
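
To spell out why 206 is only a lower bound: write $A$ for our list and $B$ for the zbMATH list restricted to publications up to 2021 (the set names are just notation introduced here). Then

```latex
|A \setminus B| \;=\; |A| - |A \cap B| \;\ge\; |A| - |B| \;=\; 3377 - 3171 \;=\; 206,
```

with equality exactly when $B \subseteq A$, i.e. when every paper they list up to 2021 is also in our list.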

It would thus still be interesting to write a script as outlined above which downloads the zbMATH list of papers, and which tries to find papers in our list that they don't have...

I didn't do that but just manually looked at the data.

Unfortunately it seems a lot of the "missing" papers are due to publishers having really bad metadata for many older papers. As in, the list of references on the website of a paper may contain many obvious errors caused by bad OCR or whatnot. zbMATH does not want to fix these reference lists manually, saying that the proper way is to contact publishers. So I attempted that for several papers, but even getting a reply was rare (less than half the cases or so), and getting an actual change even rarer. This just doesn't scale.

However, zbMATH has offered to at least add the "GAP" keyword/tag on request. They have already done it in one case in the past upon my request. I have now sent them another email with two dozen items, let's see if they also process those. If yes, then this might be a path forward.

Interestingly, after line 54243 of gap-published.bib the items obtained from MathSciNet end, and the file continues as follows:

% this file contains citations that should appear in MR over time (they then
% should be moved to GapCite.MR)

...

Counting the bib items there I find 210. This matches up quite closely with the 206 missing ones, but I think it's more of a coincidence -- several of the papers below that mark are actually listed in zbMATH (and MathSciNet) and have dates <= 2021. Still, there are a few items in there which will never be on zbMATH or MathSciNet, e.g.

@article{konovalov2005f,
 author = {Konovalov, A.},
 title = {The computer algebra system GAP 4.4.5 on CHIP-CD 9/2005},
 journal = {``CHIP'' Magazine},
 number = {9},
 year = {2005},
 note = {Supplementary article for the GAP 4.4.5 distribution on
   the CD-appendix to the magazine.}
}

But I don't know how much this accounts for the "missing" papers.
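
The count of items below that marker can be reproduced with a few lines of Python. This is a rough sketch that counts lines starting with `@type{` after the marker comment, without really parsing BibTeX; `count_entries_after` is a made-up helper name:

```python
import re

# A line that opens a BibTeX entry, e.g. "@article{konovalov2005f,".
ENTRY_START = re.compile(r"^[ \t]*@(\w+)[ \t]*\{", re.MULTILINE)
# BibTeX directives that look like entries but are not bibliography items.
NON_ENTRY_TYPES = {"comment", "string", "preamble"}


def count_entries_after(text, marker):
    """Count BibTeX entries in `text` that start after the first `marker`."""
    pos = text.find(marker)
    if pos < 0:
        raise ValueError("marker not found")
    tail = text[pos:]
    return sum(
        1
        for match in ENTRY_START.finditer(tail)
        if match.group(1).lower() not in NON_ENTRY_TYPES
    )
```

It would be called with the contents of gap-published.bib and the marker `"% this file contains citations that should appear in MR over time"`.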

@fingolfin

Some of the "missing" items are also "Diplomarbeiten" or bachelor/master theses, which probably will never be listed in zbMATH. It might be a good idea to move those into a separate "database" (= bib file).
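
Splitting those off could look roughly like this, assuming each parsed entry is a plain dict with an `entry_type` key (a hypothetical field name; adapt it to whatever the parser provides) and that the theses in question use the `@mastersthesis` type, which may well not hold for every item:

```python
def split_theses(entries, thesis_types=("mastersthesis",)):
    """Partition entries into (theses, others) by their BibTeX entry type."""
    wanted = {t.lower() for t in thesis_types}
    theses, others = [], []
    for entry in entries:
        kind = entry.get("entry_type", "").lower()
        if kind in wanted:
            theses.append(entry)
        else:
            others.append(entry)
    return theses, others
```

The `theses` list would then go into the separate bib file, and `others` would stay in the list that gets compared against zbMATH.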

@fingolfin

We have mostly removed our bib, but I am leaving this open because I still hope to migrate some of the data to zbMATH, and I am still waiting for them to reply to my email from earlier this week.

@fingolfin

It took a couple of weeks but they replied and added all the corrections I sent them (missing DOIs and missing "GAP" tags on papers). Hence https://zbmath.org/?q=si%3A320+py%3A1980-2021 now lists 3194 documents (up from 3171). So that's still about 183 "missing" papers. But at least it is now plausible for someone to go through, identify the "missing" papers and then submit corrections.
