feat: consolidate author remote_ids and wikidata identifiers #10092

Draft
wants to merge 272 commits into
base: master
dacef92
merge
pidgezero-one Aug 2, 2024
1186696
remove unnecessary print
pidgezero-one Aug 2, 2024
dd82901
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
47b57c8
uncomment imports
pidgezero-one Aug 2, 2024
3c82121
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 2, 2024
5e58373
better template check:
pidgezero-one Aug 2, 2024
2c268b2
publishers?
pidgezero-one Aug 2, 2024
be0d1a8
fix array
pidgezero-one Aug 2, 2024
d52c109
unused import
pidgezero-one Aug 2, 2024
df683a1
different wiki markup strip
pidgezero-one Aug 2, 2024
8e7cb38
reduce image calls
pidgezero-one Aug 2, 2024
66744ef
unstash
pidgezero-one Aug 2, 2024
5c51bcc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
b5e4319
fix newlines
pidgezero-one Aug 2, 2024
34da18d
undo comments
pidgezero-one Aug 2, 2024
76a1724
logger name
pidgezero-one Aug 2, 2024
2ce0e17
fix array typing
pidgezero-one Aug 2, 2024
3645e9e
more cleanup
pidgezero-one Aug 2, 2024
0e9650b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
9663789
dry run outputs to a jsonl file in a gitignored folder
pidgezero-one Aug 6, 2024
8f0810d
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 6, 2024
2a3e617
add this directory
pidgezero-one Aug 6, 2024
0268018
.
pidgezero-one Aug 6, 2024
99f3c93
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Aug 6, 2024
5390b54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
b06b60d
.
pidgezero-one Aug 6, 2024
d7561f8
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 6, 2024
3e10c5d
unicode
pidgezero-one Aug 6, 2024
4bd8e54
remove dry run flag
pidgezero-one Aug 6, 2024
e445276
this produces around 500 records
pidgezero-one Aug 13, 2024
d633c62
wikisource API gives better image results. this script now gets most …
pidgezero-one Aug 13, 2024
4053014
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
e1bd982
undo comments
pidgezero-one Aug 13, 2024
62d1798
clearer comments
pidgezero-one Aug 13, 2024
25d9243
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
bbe242f
formatting
pidgezero-one Aug 13, 2024
15aa2b2
formatting
pidgezero-one Aug 13, 2024
cae28f9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
cb2a959
condense
pidgezero-one Aug 13, 2024
5cb9dc9
more cleanup
pidgezero-one Aug 13, 2024
5370e41
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
b96a07d
more cleanup
pidgezero-one Aug 13, 2024
4d4d091
precommit
pidgezero-one Aug 13, 2024
da00d4e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
d116d6e
aaaaa
pidgezero-one Aug 13, 2024
8d83830
more false positives, letter filter literally does not work for reaso…
pidgezero-one Aug 14, 2024
f7e61c0
this is annoying
pidgezero-one Aug 14, 2024
3018a5f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
878051c
ruff
pidgezero-one Aug 14, 2024
9135ff5
ruff
pidgezero-one Aug 14, 2024
113e6a7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
42c05d0
cmt
pidgezero-one Aug 14, 2024
e9fef40
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 14, 2024
5193267
filters
pidgezero-one Aug 15, 2024
2075f4e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 15, 2024
dbac756
fix
pidgezero-one Aug 16, 2024
916c3ae
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
d1683a5
ruff
pidgezero-one Aug 16, 2024
287bfe0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
4ad37de
aint no way precommit thinks 'pleas' is a typo
pidgezero-one Aug 16, 2024
e8fb019
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 16, 2024
d8452d8
comment clarity
pidgezero-one Aug 16, 2024
0d15c54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
db5c4a9
fix publishers
pidgezero-one Aug 16, 2024
3c46c9a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
31d84a7
fix WS-side category filtering
pidgezero-one Aug 16, 2024
3dbe7b9
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 16, 2024
d6d303e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
aa73b6b
ruff
pidgezero-one Aug 16, 2024
a8fdcfb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
916dabc
clean up some re-request loops
pidgezero-one Aug 16, 2024
bb3c30c
clean up some re-request loops
pidgezero-one Aug 16, 2024
e28c8cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
f6f6c99
addresses most PR comments
pidgezero-one Sep 29, 2024
beaf68c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
e6fe169
precommit
pidgezero-one Sep 29, 2024
408da50
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Sep 29, 2024
e7f714b
fetches more author info, not sure how to format it yet
pidgezero-one Sep 29, 2024
8f01b5b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
77b23d8
brackets in wrong placE
pidgezero-one Sep 29, 2024
2aca912
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Oct 12, 2024
7bfc39b
format that works with /import/api
pidgezero-one Oct 13, 2024
461b02a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
147fab3
wip
pidgezero-one Oct 13, 2024
b73b8d1
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
92065ed
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
0750cb2
?
pidgezero-one Oct 13, 2024
5842e13
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
83a00f8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
bc31426
precommit errors
pidgezero-one Oct 13, 2024
2e99560
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
cb14b14
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
8d4dac0
support author id matching
pidgezero-one Oct 13, 2024
7d85a0f
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
f1b0edd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
5b57217
can't get it to work
pidgezero-one Oct 13, 2024
0bd52e8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
05f3644
unnecessary changes
pidgezero-one Oct 13, 2024
12b96f4
idk
pidgezero-one Oct 13, 2024
6a8234c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
b9c36e4
it works
pidgezero-one Oct 14, 2024
2fd4569
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
5a8ea7a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
268f055
wip: support more identifiers
pidgezero-one Oct 14, 2024
60e5d00
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
b6b68f3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
e02ae78
fix remote ids
pidgezero-one Oct 14, 2024
d7dd818
fix remote ids
pidgezero-one Oct 14, 2024
ee00b9a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
0ecba61
comment
pidgezero-one Oct 14, 2024
f952c4f
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
0e2c510
fix wd condition
pidgezero-one Oct 14, 2024
1265681
remote_ids will never be empty in script
pidgezero-one Oct 14, 2024
216cb50
attempt unit tests
pidgezero-one Oct 14, 2024
eb8d3d4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
d298d39
irrelevant
pidgezero-one Oct 14, 2024
8c17a14
update comment
pidgezero-one Oct 14, 2024
8917ae8
unnecessary change
pidgezero-one Oct 14, 2024
a174609
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
ad263aa
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
ab624f6
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
320be8b
suggested rename
pidgezero-one Oct 16, 2024
f07e5fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 16, 2024
3bfe2d9
why did changing remote_ids to identifiers break tests?
pidgezero-one Oct 16, 2024
a0aaf40
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 16, 2024
d97921d
fix import key
pidgezero-one Oct 16, 2024
565853b
identifiers
pidgezero-one Oct 16, 2024
d84c5fd
identifiers
pidgezero-one Oct 16, 2024
f2fdde1
Merge branch 'master' into backup/author-identifier-imports
pidgezero-one Nov 24, 2024
d9e921d
beginning of work
pidgezero-one Nov 24, 2024
24fbffd
prestash
pidgezero-one Nov 24, 2024
cac80aa
set up import pipeline to leverage wikidata
pidgezero-one Nov 27, 2024
df6273b
first draft
pidgezero-one Aug 1, 2024
ea1284d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
cfd0377
linting
pidgezero-one Aug 1, 2024
2ac22d5
use a class for imports
pidgezero-one Aug 1, 2024
80cad49
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
b6988e2
mypy fixes
pidgezero-one Aug 1, 2024
ca98443
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
59ba078
more linting
pidgezero-one Aug 1, 2024
4d20241
is this deprecated too?
pidgezero-one Aug 1, 2024
eb3e872
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
fbd4b49
is this deprecated too?
pidgezero-one Aug 1, 2024
1f0cd7e
is this deprecated too?
pidgezero-one Aug 1, 2024
6ab7e5e
improved data model
pidgezero-one Aug 2, 2024
9142c67
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
c62caef
reformat name formatter
pidgezero-one Aug 2, 2024
ed15ca9
ruff fix
pidgezero-one Aug 2, 2024
fd98ed8
improve infobox fetching
pidgezero-one Aug 2, 2024
1ba81d8
uncomment
pidgezero-one Aug 2, 2024
f8dcfb3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
61aa478
remove unnecessary print
pidgezero-one Aug 2, 2024
d3439c1
uncomment imports
pidgezero-one Aug 2, 2024
7c141d5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
de56c2a
better template check:
pidgezero-one Aug 2, 2024
7b9b590
publishers?
pidgezero-one Aug 2, 2024
df4d471
fix array
pidgezero-one Aug 2, 2024
e00eb4c
unused import
pidgezero-one Aug 2, 2024
b8fe653
different wiki markup strip
pidgezero-one Aug 2, 2024
81e8282
reduce image calls
pidgezero-one Aug 2, 2024
21bd563
unstash
pidgezero-one Aug 2, 2024
5123a82
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
6d8864b
fix newlines
pidgezero-one Aug 2, 2024
15cd9fa
undo comments
pidgezero-one Aug 2, 2024
c01a17e
logger name
pidgezero-one Aug 2, 2024
a103564
fix array typing
pidgezero-one Aug 2, 2024
cdcb18f
more cleanup
pidgezero-one Aug 2, 2024
1f2c984
dry run outputs to a jsonl file in a gitignored folder
pidgezero-one Aug 6, 2024
0ff56b1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
27b991d
add this directory
pidgezero-one Aug 6, 2024
bd8f0c2
.
pidgezero-one Aug 6, 2024
9592016
.
pidgezero-one Aug 6, 2024
5c61c6b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
4fd1491
unicode
pidgezero-one Aug 6, 2024
357886d
remove dry run flag
pidgezero-one Aug 6, 2024
2a4e7b9
this produces around 500 records
pidgezero-one Aug 13, 2024
e30d440
wikisource API gives better image results. this script now gets most …
pidgezero-one Aug 13, 2024
c161403
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
af0ebf9
undo comments
pidgezero-one Aug 13, 2024
9255be9
clearer comments
pidgezero-one Aug 13, 2024
bb50c86
formatting
pidgezero-one Aug 13, 2024
4669fc3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
8c50a19
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
60fcc1b
condense
pidgezero-one Aug 13, 2024
38e6311
more cleanup
pidgezero-one Aug 13, 2024
fc5f8a3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
cbd0e96
more cleanup
pidgezero-one Aug 13, 2024
79ef6e1
precommit
pidgezero-one Aug 13, 2024
a5889df
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
67736a8
aaaaa
pidgezero-one Aug 13, 2024
0d99753
more false positives, letter filter literally does not work for reaso…
pidgezero-one Aug 14, 2024
77aeed1
this is annoying
pidgezero-one Aug 14, 2024
ec9fe38
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
a4a906d
ruff
pidgezero-one Aug 14, 2024
bb62b15
ruff
pidgezero-one Aug 14, 2024
df24f8d
cmt
pidgezero-one Aug 14, 2024
089dd71
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
3e18b6b
filters
pidgezero-one Aug 15, 2024
9995782
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 15, 2024
479ad66
fix
pidgezero-one Aug 16, 2024
21b83cb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
8b3c798
ruff
pidgezero-one Aug 16, 2024
f018c7c
aint no way precommit thinks 'pleas' is a typo
pidgezero-one Aug 16, 2024
b710116
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
27eb3cc
comment clarity
pidgezero-one Aug 16, 2024
c1edc35
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
e0a7700
fix publishers
pidgezero-one Aug 16, 2024
85da93d
fix WS-side category filtering
pidgezero-one Aug 16, 2024
eb9ddcd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
fc9b86e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
4cecf7b
ruff
pidgezero-one Aug 16, 2024
9f4992a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
2598c30
clean up some re-request loops
pidgezero-one Aug 16, 2024
52e8b8c
clean up some re-request loops
pidgezero-one Aug 16, 2024
a7126ea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
f779dfb
addresses most PR comments
pidgezero-one Sep 29, 2024
eea0c09
precommit
pidgezero-one Sep 29, 2024
1ca2012
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
82d99a8
fetches more author info, not sure how to format it yet
pidgezero-one Sep 29, 2024
933c625
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
ea9931f
brackets in wrong placE
pidgezero-one Sep 29, 2024
21e59f1
format that works with /import/api
pidgezero-one Oct 13, 2024
f32536b
wip
pidgezero-one Oct 13, 2024
65fa8d2
wikisource script goes in other PR
pidgezero-one Nov 27, 2024
6c5006b
merge
pidgezero-one Nov 27, 2024
9f2f6a9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 27, 2024
009749e
unnecessary change
pidgezero-one Nov 27, 2024
db20d32
unnecessary change
pidgezero-one Nov 27, 2024
c2b7908
this code goes in the other PR
pidgezero-one Nov 27, 2024
9aa8835
Revert "this code goes in the other PR"
pidgezero-one Nov 27, 2024
41cc424
?
pidgezero-one Nov 27, 2024
ad3bf1c
requirements.txt doesnt need to change here
pidgezero-one Nov 28, 2024
6916035
merge
pidgezero-one Dec 3, 2024
3fc91f8
merge
pidgezero-one Dec 3, 2024
82c19f1
merge
pidgezero-one Dec 3, 2024
46e22ce
Update openlibrary/catalog/add_book/tests/test_load_book.py
pidgezero-one Dec 3, 2024
eed54d0
Update scripts/backfill_author_identifiers.py
pidgezero-one Dec 3, 2024
033b822
.
pidgezero-one Dec 3, 2024
3f5ea0a
Merge branch '10029/feat/consolidate-remote-ids-and-wikisource-identi…
pidgezero-one Dec 3, 2024
31a0b9a
dont need this
pidgezero-one Dec 3, 2024
51d8618
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 3, 2024
25a3383
these dont need to be in this pr
pidgezero-one Dec 3, 2024
ca286d3
these dont need to be in this pr
pidgezero-one Dec 3, 2024
cfdb166
use min OLID
pidgezero-one Dec 3, 2024
e4389c4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 3, 2024
d3bade0
remove try/catches
pidgezero-one Dec 3, 2024
011e5df
shouldnt be a return val here
pidgezero-one Dec 3, 2024
598f1c7
address import problems
pidgezero-one Dec 3, 2024
03f12b5
ruff fixes
pidgezero-one Dec 3, 2024
0e0fb1d
precommit fixes
pidgezero-one Dec 3, 2024
23 changes: 22 additions & 1 deletion openlibrary/core/models.py
@@ -31,7 +31,7 @@
from openlibrary.core.observations import Observations
from openlibrary.core.ratings import Ratings
from openlibrary.core.vendors import get_amazon_metadata
from openlibrary.core.wikidata import WikidataEntity, get_wikidata_entity
from openlibrary.core.wikidata import REMOTE_IDS, WikidataEntity, get_wikidata_entity
from openlibrary.utils import extract_numeric_id_from_olid
from openlibrary.utils.isbn import canonical, isbn_13_to_isbn_10, to_isbn_13

@@ -807,6 +807,27 @@ def get_edition_count(self):
    def get_lists(self, limit=50, offset=0, sort=True):
        return self._get_lists(limit=limit, offset=offset, sort=sort)

    def merge_remote_ids(
        self, incoming_ids: dict[str, str]
    ) -> tuple[dict[str, str], int]:
        output = {**self.remote_ids}
        if not incoming_ids:
            return output, -1
        matches = 0
        conflicts = 0
        for identifier in REMOTE_IDS:
            if identifier in output and identifier in incoming_ids:
                if output[identifier] != incoming_ids[identifier]:
                    conflicts += 1
                else:
                    output[identifier] = incoming_ids[identifier]
                    matches += 1
        if conflicts > matches:
            # The identifiers already stored for this author conflict too often with the identifiers being merged in.
            # TODO: Raise this to librarians, somehow.
            return self.remote_ids, -1
        return output, matches
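
The conflict check in `merge_remote_ids` can be exercised in isolation. The following is a minimal standalone sketch that uses plain dicts in place of the author model. `compare_remote_ids` and `KNOWN_IDS` are hypothetical names introduced only for illustration and are not part of this PR.

```python
# Hypothetical stand-in for the keys of REMOTE_IDS in openlibrary/core/wikidata.py.
KNOWN_IDS = ("viaf", "isni", "lc_naf")


def compare_remote_ids(
    existing: dict[str, str], incoming: dict[str, str]
) -> tuple[int, int]:
    """Return (matches, conflicts) over identifiers present in both dicts."""
    matches = 0
    conflicts = 0
    for identifier in KNOWN_IDS:
        if identifier in existing and identifier in incoming:
            if existing[identifier] == incoming[identifier]:
                matches += 1
            else:
                conflicts += 1
    return matches, conflicts


# Made-up sample values, not real identifiers.
existing = {"viaf": "1", "isni": "0000 0001", "lc_naf": "n1"}
agreeing = {"viaf": "1", "isni": "0000 0001", "lc_naf": "n2"}
disagreeing = {"viaf": "9", "isni": "0000 0009", "lc_naf": "n1"}

print(compare_remote_ids(existing, agreeing))     # (2, 1): conflicts do not outnumber matches
print(compare_remote_ids(existing, disagreeing))  # (1, 2): conflicts outnumber matches
```

Under the rule in the diff, the second case would be rejected because conflicts exceed matches.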


class User(Thing):
    DEFAULT_PREFERENCES = {
87 changes: 87 additions & 0 deletions openlibrary/core/wikidata.py
@@ -7,13 +7,17 @@

import json
import logging
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Any

import requests
import web

from openlibrary.core import db
from openlibrary.core.helpers import days_since
from openlibrary.utils import extract_numeric_id_from_olid

logger = logging.getLogger("core.wikidata")

@@ -30,6 +34,25 @@
}
]

# The keys in this dict need to match their corresponding names in openlibrary/plugins/openlibrary/config/author/identifiers.yml
# Ordered by what I assume is most (viaf) to least (amazon/youtube) reliable for author matching
Contributor

Unless we plan to do anything with the ordering, it’d probably be better to just do something like alphabetic ordering so it’s easy to add new identifiers to this.

Also, an alternative approach could maybe be to add a wikidata item to identifiers.yml which could be read here? Otherwise this approach means that there are more places to edit when adding/editing identifiers (e.g., #9982 (pending) and #10052 (merged and live on prod, but the identifier is not included here)). This would also mean that we wouldn’t need to maintain and handle separate REMOTE_IDS lists for authors, editions, and works (e.g., musicbrainz and bookbrainz have different Wikidata properties depending on whether it’s an Author, Edition, or Work, which can’t be handled with this current structure).
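
The alternative suggested above could look roughly like the following sketch. It assumes each entry in `identifiers.yml` gains a hypothetical `wikidata_property` field (nothing in this PR defines one), and it uses an inline list in place of the parsed YAML so that the example is self-contained. The `name` and `label` keys are likewise assumptions about the entry shape.

```python
# Inline stand-in for the parsed contents of identifiers.yml.
# The "wikidata_property" key is an assumption made for this sketch.
AUTHOR_IDENTIFIERS = [
    {"name": "viaf", "label": "VIAF", "wikidata_property": "P214"},
    {"name": "isni", "label": "ISNI", "wikidata_property": "P213"},
    {"name": "some_site", "label": "Entry without a Wikidata mapping"},
]


def build_remote_ids(identifiers: list[dict[str, str]]) -> dict[str, str]:
    """Map identifier names to Wikidata property IDs, skipping unmapped entries."""
    return {
        entry["name"]: entry["wikidata_property"]
        for entry in identifiers
        if "wikidata_property" in entry
    }


REMOTE_IDS = build_remote_ids(AUTHOR_IDENTIFIERS)
print(REMOTE_IDS)  # {'viaf': 'P214', 'isni': 'P213'}
```

With this shape, adding an identifier would mean editing only the config entry, and separate author, edition, and work configs could each carry their own property IDs.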

REMOTE_IDS = {
    "viaf": "P214",
    "isni": "P213",
    "lc_naf": "P244",
    "opac_sbn": "P396",
    "project_gutenberg": "P1938",
    "librivox": "P1899",
    "bookbrainz": "P2607",
    "musicbrainz": "P434",
    "librarything": "P7400",
    "goodreads": "P2963",
    "storygraph": "P12430",
    "imdb": "P345",
    "amazon": "P4862",
    "youtube": "P2397",
}


@dataclass
class WikidataEntity:
@@ -98,6 +121,33 @@ def get_external_profiles(self, language: str = 'en') -> list[dict]:
)
return profiles

    def get_remote_ids(self) -> dict[str, list[str]]:
        """
        Get remote IDs like viaf, isni, etc.

        Returns:
            Dict containing identifier names as keys and lists of corresponding values
        """
        remote_ids: dict[str, list[str]] = {}

        for service, property_id in REMOTE_IDS.items():
            values = self._get_statement_values(property_id)
            remote_ids[service] = values
        return remote_ids

    def get_openlibrary_id(self) -> str | None:
        """
        Get the Open Library ID of the WD item. Mostly used to validate the connection between a WD item and an OL thing.

        Returns:
            OL ID, if it exists
        """
        res = self._get_statement_values("P648")
        if res:
            return min(res, key=lambda olid: int(extract_numeric_id_from_olid(olid)))

        return None
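
To illustrate why `min` is keyed on the numeric part of the OLID: compared as plain strings, `OL1000A` sorts before `OL45A`. The sketch below is standalone; `numeric_part` is a hypothetical stand-in for `extract_numeric_id_from_olid` and is assumed to yield an integer for comparison.

```python
import re


def numeric_part(olid: str) -> int:
    """Hypothetical stand-in: return the digits of an OLID such as 'OL45A' as an int."""
    match = re.search(r"\d+", olid)
    if match is None:
        raise ValueError(f"no numeric part in {olid!r}")
    return int(match.group())


olids = ["OL1000A", "OL45A", "OL300A"]
print(min(olids))                    # OL1000A (string comparison)
print(min(olids, key=numeric_part))  # OL45A (numeric comparison)
```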

    def _get_wiki_profiles(self, language: str) -> list[dict]:
        """
        Get formatted Wikipedia and Wikidata profile data for rendering.
@@ -161,6 +211,38 @@ def _get_statement_values(self, property_id: str) -> list[str]:
if "value" in statement and "content" in statement["value"]
]

    # can this move into author def'n?
    def consolidate_remote_author_ids(self) -> None:
        output = {"wikidata": self.id}
        ol_id = self.get_openlibrary_id()
        if ol_id is None or not re.fullmatch(r"^OL\d+A$", ol_id):
            return

        key = "/authors/" + ol_id
        q = {"type": "/type/author", "key~": key}
        reply = list(web.ctx.site.things(q))
        authors = [web.ctx.site.get(k) for k in reply]
        if len(authors) != 1:
            # There should never be len(authors) > 1, because that would imply two OL author entities have the same OL ID.
            return
        author = authors[0]
        if author.wikidata() is not None and author.wikidata().id != self.id:
            # TODO: Flag this to librarians. This means the OL entity identified by the Wikidata JSON has a different Wikidata ID than the JSON expects.
            return
        wd_remote_ids: dict[str, str] = {
            name: values[0] for name, values in self.get_remote_ids().items() if values
        }

        # Verify that the author's IDs are not significantly different.
        output, matches = author.merge_remote_ids(wd_remote_ids)

        # save if there are new identifiers to save
        if matches >= 0:
            author.remote_ids = output
            web.ctx.site.save(
                query={**author, "key": key, "type": {"key": "/type/author"}}
            )
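
The flattening step in `consolidate_remote_author_ids` keeps only the first value Wikidata reports for each identifier and drops identifiers that have no values. A minimal standalone sketch of that step, using made-up sample data; `first_values` is a hypothetical helper name:

```python
def first_values(remote_ids: dict[str, list[str]]) -> dict[str, str]:
    """Keep the first value per identifier; drop identifiers with no values."""
    return {name: values[0] for name, values in remote_ids.items() if values}


# Made-up sample shaped like the output of get_remote_ids().
sample = {"viaf": ["100", "200"], "isni": [], "lc_naf": ["n80"]}
print(first_values(sample))  # {'viaf': '100', 'lc_naf': 'n80'}
```

Any additional values for an identifier (here the second `viaf` entry) are ignored by this step.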


def _cache_expired(entity: WikidataEntity) -> bool:
    return days_since(entity._updated) > WIKIDATA_CACHE_TTL_DAYS
@@ -228,6 +310,11 @@ def _add_to_cache(entity: WikidataEntity) -> None:
    # TODO: after we upgrade to postgres 9.5+ we should use upsert here
    oldb = db.get_db()
    json_data = entity.to_wikidata_api_json_format()
    ol_id = entity.get_openlibrary_id()

    # here, we should write WD data to author remote ids
    if ol_id is not None and re.fullmatch(r"^OL\d+A$", ol_id):
        entity.consolidate_remote_author_ids()

    if _get_from_cache(entity.id):
        return oldb.update(
40 changes: 40 additions & 0 deletions scripts/backfill_author_identifiers.py
@@ -0,0 +1,40 @@
#!/usr/bin/env python
"""
Copies all author identifiers from the author's stored Wikidata info into their remote_ids.
Contributor Author

question for later: should this also scrape wikidata for authors that have an OL ID on their side but we don't have their wikidata json on our side? not sure if any of these actually exist

Contributor

Pretty sure they exist, but I’d suggest leaving that out of this one, and consider it for a later PR. Better to keep commits/PRs as atomic as possible. :)


To Run:

PYTHONPATH=. python ./scripts/backfill_author_identifiers.py /olsystem/etc/openlibrary.yml

(If testing locally, run inside `docker compose exec web bash` and use ./conf/openlibrary.yml)
"""

import web

import infogami
from openlibrary.config import load_config
from openlibrary.core import db
from openlibrary.core.wikidata import get_wikidata_entity
from scripts.solr_builder.solr_builder.fn_to_cli import FnToCLI


def main(ol_config: str):
    """
    :param str ol_config: Path to openlibrary.yml file
    """
    load_config(ol_config)
    infogami._setup()

    # TODO: find a cleaner fix; web.ctx.ip is unset when running from within Docker.
    web.ctx.ip = '127.0.0.1'

    for row in db.query("select id from wikidata"):
        e = get_wikidata_entity(row.id)
        if e is not None:
            e.consolidate_remote_author_ids()


# Get wikidata for authors who don't have it yet?

if __name__ == "__main__":
    FnToCLI(main).run()