Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: consolidate author remote_ids and wikidata identifiers #10092

Closed
Show file tree
Hide file tree
Changes from 250 commits
Commits
Show all changes
272 commits
Select commit Hold shift + click to select a range
dacef92
merge
pidgezero-one Aug 2, 2024
1186696
remove unnecessary print
pidgezero-one Aug 2, 2024
dd82901
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
47b57c8
uncomment imports
pidgezero-one Aug 2, 2024
3c82121
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 2, 2024
5e58373
better template check:
pidgezero-one Aug 2, 2024
2c268b2
publishers?
pidgezero-one Aug 2, 2024
be0d1a8
fix array
pidgezero-one Aug 2, 2024
d52c109
unused import
pidgezero-one Aug 2, 2024
df683a1
different wiki markup strip
pidgezero-one Aug 2, 2024
8e7cb38
reduce image calls
pidgezero-one Aug 2, 2024
66744ef
unstash
pidgezero-one Aug 2, 2024
5c51bcc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
b5e4319
fix newlines
pidgezero-one Aug 2, 2024
34da18d
undo comments
pidgezero-one Aug 2, 2024
76a1724
logger name
pidgezero-one Aug 2, 2024
2ce0e17
fix array typing
pidgezero-one Aug 2, 2024
3645e9e
more cleanup
pidgezero-one Aug 2, 2024
0e9650b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
9663789
dry run outputs to a jsonl file in a gitignored folder
pidgezero-one Aug 6, 2024
8f0810d
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 6, 2024
2a3e617
add this directory
pidgezero-one Aug 6, 2024
0268018
.
pidgezero-one Aug 6, 2024
99f3c93
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Aug 6, 2024
5390b54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
b06b60d
.
pidgezero-one Aug 6, 2024
d7561f8
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 6, 2024
3e10c5d
unicode
pidgezero-one Aug 6, 2024
4bd8e54
remove dry run flag
pidgezero-one Aug 6, 2024
e445276
this produces around 500 records
pidgezero-one Aug 13, 2024
d633c62
wikisource API gives better image results. this script now gets most …
pidgezero-one Aug 13, 2024
4053014
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
e1bd982
undo comments
pidgezero-one Aug 13, 2024
62d1798
clearer comments
pidgezero-one Aug 13, 2024
25d9243
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
bbe242f
formatting
pidgezero-one Aug 13, 2024
15aa2b2
formatting
pidgezero-one Aug 13, 2024
cae28f9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
cb2a959
condense
pidgezero-one Aug 13, 2024
5cb9dc9
more cleanup
pidgezero-one Aug 13, 2024
5370e41
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
b96a07d
more cleanup
pidgezero-one Aug 13, 2024
4d4d091
precommit
pidgezero-one Aug 13, 2024
da00d4e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
d116d6e
aaaaa
pidgezero-one Aug 13, 2024
8d83830
more false positives, letter filter literally does not work for reaso…
pidgezero-one Aug 14, 2024
f7e61c0
this is annoying
pidgezero-one Aug 14, 2024
3018a5f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
878051c
ruff
pidgezero-one Aug 14, 2024
9135ff5
ruff
pidgezero-one Aug 14, 2024
113e6a7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
42c05d0
cmt
pidgezero-one Aug 14, 2024
e9fef40
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 14, 2024
5193267
filters
pidgezero-one Aug 15, 2024
2075f4e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 15, 2024
dbac756
fix
pidgezero-one Aug 16, 2024
916c3ae
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
d1683a5
ruff
pidgezero-one Aug 16, 2024
287bfe0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
4ad37de
aint no way precommit thinks 'pleas' is a typo
pidgezero-one Aug 16, 2024
e8fb019
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 16, 2024
d8452d8
comment clarity
pidgezero-one Aug 16, 2024
0d15c54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
db5c4a9
fix publishers
pidgezero-one Aug 16, 2024
3c46c9a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
31d84a7
fix WS-side category filtering
pidgezero-one Aug 16, 2024
3dbe7b9
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Aug 16, 2024
d6d303e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
aa73b6b
ruff
pidgezero-one Aug 16, 2024
a8fdcfb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
916dabc
clean up some re-request loops
pidgezero-one Aug 16, 2024
bb3c30c
clean up some re-request loops
pidgezero-one Aug 16, 2024
e28c8cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
f6f6c99
addresses most PR comments
pidgezero-one Sep 29, 2024
beaf68c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
e6fe169
precommit
pidgezero-one Sep 29, 2024
408da50
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Sep 29, 2024
e7f714b
fetches more author info, not sure how to format it yet
pidgezero-one Sep 29, 2024
8f01b5b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
77b23d8
brackets in wrong placE
pidgezero-one Sep 29, 2024
2aca912
Merge branch 'master' into 9671/feat/add-wikisource-import-script
pidgezero-one Oct 12, 2024
7bfc39b
format that works with /import/api
pidgezero-one Oct 13, 2024
461b02a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
147fab3
wip
pidgezero-one Oct 13, 2024
b73b8d1
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
92065ed
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
0750cb2
?
pidgezero-one Oct 13, 2024
5842e13
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
83a00f8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
bc31426
precommit errors
pidgezero-one Oct 13, 2024
2e99560
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
cb14b14
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
8d4dac0
support author id matching
pidgezero-one Oct 13, 2024
7d85a0f
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 13, 2024
f1b0edd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
5b57217
can't get it to work
pidgezero-one Oct 13, 2024
0bd52e8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
05f3644
unnecessary changes
pidgezero-one Oct 13, 2024
12b96f4
idk
pidgezero-one Oct 13, 2024
6a8234c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 13, 2024
b9c36e4
it works
pidgezero-one Oct 14, 2024
2fd4569
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
5a8ea7a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
268f055
wip: support more identifiers
pidgezero-one Oct 14, 2024
60e5d00
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
b6b68f3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
e02ae78
fix remote ids
pidgezero-one Oct 14, 2024
d7dd818
fix remote ids
pidgezero-one Oct 14, 2024
ee00b9a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
0ecba61
comment
pidgezero-one Oct 14, 2024
f952c4f
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 14, 2024
0e2c510
fix wd condition
pidgezero-one Oct 14, 2024
1265681
remote_ids will never be empty in script
pidgezero-one Oct 14, 2024
216cb50
attempt unit tests
pidgezero-one Oct 14, 2024
eb8d3d4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 14, 2024
d298d39
irrelevant
pidgezero-one Oct 14, 2024
8c17a14
update comment
pidgezero-one Oct 14, 2024
8917ae8
unnecessary change
pidgezero-one Oct 14, 2024
a174609
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
ad263aa
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
ab624f6
Update openlibrary/components/AuthorIdentifiers.vue
pidgezero-one Oct 15, 2024
320be8b
suggested rename
pidgezero-one Oct 16, 2024
f07e5fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 16, 2024
3bfe2d9
why did changing remote_ids to identifiers break tests?
pidgezero-one Oct 16, 2024
a0aaf40
Merge branch '9671/feat/add-wikisource-import-script' of https://gith…
pidgezero-one Oct 16, 2024
d97921d
fix import key
pidgezero-one Oct 16, 2024
565853b
identifiers
pidgezero-one Oct 16, 2024
d84c5fd
identifiers
pidgezero-one Oct 16, 2024
f2fdde1
Merge branch 'master' into backup/author-identifier-imports
pidgezero-one Nov 24, 2024
d9e921d
beginning of work
pidgezero-one Nov 24, 2024
24fbffd
prestash
pidgezero-one Nov 24, 2024
cac80aa
set up import pipeline to leverage wikidata
pidgezero-one Nov 27, 2024
df6273b
first draft
pidgezero-one Aug 1, 2024
ea1284d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
cfd0377
linting
pidgezero-one Aug 1, 2024
2ac22d5
use a class for imports
pidgezero-one Aug 1, 2024
80cad49
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
b6988e2
mypy fixes
pidgezero-one Aug 1, 2024
ca98443
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
59ba078
more linting
pidgezero-one Aug 1, 2024
4d20241
is this deprecated too?
pidgezero-one Aug 1, 2024
eb3e872
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 1, 2024
fbd4b49
is this deprecated too?
pidgezero-one Aug 1, 2024
1f0cd7e
is this deprecated too?
pidgezero-one Aug 1, 2024
6ab7e5e
improved data model
pidgezero-one Aug 2, 2024
9142c67
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
c62caef
reformat name formatter
pidgezero-one Aug 2, 2024
ed15ca9
ruff fix
pidgezero-one Aug 2, 2024
fd98ed8
improve infobox fetching
pidgezero-one Aug 2, 2024
1ba81d8
uncomment
pidgezero-one Aug 2, 2024
f8dcfb3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
61aa478
remove unnecessary print
pidgezero-one Aug 2, 2024
d3439c1
uncomment imports
pidgezero-one Aug 2, 2024
7c141d5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
de56c2a
better template check:
pidgezero-one Aug 2, 2024
7b9b590
publishers?
pidgezero-one Aug 2, 2024
df4d471
fix array
pidgezero-one Aug 2, 2024
e00eb4c
unused import
pidgezero-one Aug 2, 2024
b8fe653
different wiki markup strip
pidgezero-one Aug 2, 2024
81e8282
reduce image calls
pidgezero-one Aug 2, 2024
21bd563
unstash
pidgezero-one Aug 2, 2024
5123a82
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
6d8864b
fix newlines
pidgezero-one Aug 2, 2024
15cd9fa
undo comments
pidgezero-one Aug 2, 2024
c01a17e
logger name
pidgezero-one Aug 2, 2024
a103564
fix array typing
pidgezero-one Aug 2, 2024
cdcb18f
more cleanup
pidgezero-one Aug 2, 2024
1f2c984
dry run outputs to a jsonl file in a gitignored folder
pidgezero-one Aug 6, 2024
0ff56b1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 2, 2024
27b991d
add this directory
pidgezero-one Aug 6, 2024
bd8f0c2
.
pidgezero-one Aug 6, 2024
9592016
.
pidgezero-one Aug 6, 2024
5c61c6b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2024
4fd1491
unicode
pidgezero-one Aug 6, 2024
357886d
remove dry run flag
pidgezero-one Aug 6, 2024
2a4e7b9
this produces around 500 records
pidgezero-one Aug 13, 2024
e30d440
wikisource API gives better image results. this script now gets most …
pidgezero-one Aug 13, 2024
c161403
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
af0ebf9
undo comments
pidgezero-one Aug 13, 2024
9255be9
clearer comments
pidgezero-one Aug 13, 2024
bb50c86
formatting
pidgezero-one Aug 13, 2024
4669fc3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
8c50a19
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
60fcc1b
condense
pidgezero-one Aug 13, 2024
38e6311
more cleanup
pidgezero-one Aug 13, 2024
fc5f8a3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
cbd0e96
more cleanup
pidgezero-one Aug 13, 2024
79ef6e1
precommit
pidgezero-one Aug 13, 2024
a5889df
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 13, 2024
67736a8
aaaaa
pidgezero-one Aug 13, 2024
0d99753
more false positives, letter filter literally does not work for reaso…
pidgezero-one Aug 14, 2024
77aeed1
this is annoying
pidgezero-one Aug 14, 2024
ec9fe38
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
a4a906d
ruff
pidgezero-one Aug 14, 2024
bb62b15
ruff
pidgezero-one Aug 14, 2024
df24f8d
cmt
pidgezero-one Aug 14, 2024
089dd71
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2024
3e18b6b
filters
pidgezero-one Aug 15, 2024
9995782
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 15, 2024
479ad66
fix
pidgezero-one Aug 16, 2024
21b83cb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
8b3c798
ruff
pidgezero-one Aug 16, 2024
f018c7c
aint no way precommit thinks 'pleas' is a typo
pidgezero-one Aug 16, 2024
b710116
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
27eb3cc
comment clarity
pidgezero-one Aug 16, 2024
c1edc35
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
e0a7700
fix publishers
pidgezero-one Aug 16, 2024
85da93d
fix WS-side category filtering
pidgezero-one Aug 16, 2024
eb9ddcd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
fc9b86e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
4cecf7b
ruff
pidgezero-one Aug 16, 2024
9f4992a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
2598c30
clean up some re-request loops
pidgezero-one Aug 16, 2024
52e8b8c
clean up some re-request loops
pidgezero-one Aug 16, 2024
a7126ea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 16, 2024
f779dfb
addresses most PR comments
pidgezero-one Sep 29, 2024
eea0c09
precommit
pidgezero-one Sep 29, 2024
1ca2012
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
82d99a8
fetches more author info, not sure how to format it yet
pidgezero-one Sep 29, 2024
933c625
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 29, 2024
ea9931f
brackets in wrong placE
pidgezero-one Sep 29, 2024
21e59f1
format that works with /import/api
pidgezero-one Oct 13, 2024
f32536b
wip
pidgezero-one Oct 13, 2024
65fa8d2
wikisource script goes in other PR
pidgezero-one Nov 27, 2024
6c5006b
merge
pidgezero-one Nov 27, 2024
9f2f6a9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 27, 2024
009749e
unnecessary change
pidgezero-one Nov 27, 2024
db20d32
unnecessary change
pidgezero-one Nov 27, 2024
c2b7908
this code goes in the other PR
pidgezero-one Nov 27, 2024
9aa8835
Revert "this code goes in the other PR"
pidgezero-one Nov 27, 2024
41cc424
?
pidgezero-one Nov 27, 2024
ad3bf1c
requirements.txt doesnt need to change here
pidgezero-one Nov 28, 2024
6916035
merge
pidgezero-one Dec 3, 2024
3fc91f8
merge
pidgezero-one Dec 3, 2024
82c19f1
merge
pidgezero-one Dec 3, 2024
46e22ce
Update openlibrary/catalog/add_book/tests/test_load_book.py
pidgezero-one Dec 3, 2024
eed54d0
Update scripts/backfill_author_identifiers.py
pidgezero-one Dec 3, 2024
033b822
.
pidgezero-one Dec 3, 2024
3f5ea0a
Merge branch '10029/feat/consolidate-remote-ids-and-wikisource-identi…
pidgezero-one Dec 3, 2024
31a0b9a
dont need this
pidgezero-one Dec 3, 2024
51d8618
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 3, 2024
25a3383
these dont need to be in this pr
pidgezero-one Dec 3, 2024
ca286d3
these dont need to be in this pr
pidgezero-one Dec 3, 2024
cfdb166
use min OLID
pidgezero-one Dec 3, 2024
e4389c4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 3, 2024
d3bade0
remove try/catches
pidgezero-one Dec 3, 2024
011e5df
shouldnt be a return val here
pidgezero-one Dec 3, 2024
598f1c7
address import problems
pidgezero-one Dec 3, 2024
03f12b5
ruff fixes
pidgezero-one Dec 3, 2024
0e0fb1d
precommit fixes
pidgezero-one Dec 3, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
114 changes: 86 additions & 28 deletions openlibrary/catalog/add_book/load_book.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import web
from openlibrary.catalog.utils import flip_name, author_dates_match, key_int
from openlibrary.core.helpers import extract_year
from openlibrary.utils import uniq

if TYPE_CHECKING:
from openlibrary.plugins.upstream.models import Author
Expand Down Expand Up @@ -142,10 +143,59 @@ def walk_redirects(obj, seen):
seen.add(obj['key'])
return obj

# Try for an 'exact' (case-insensitive) name match, but fall back to alternate_names,
# then last name with identical birth and death dates (that are not themselves `None` or '').
def get_redirected_authors(authors: list["Author"]):
if any(a.type.key != '/type/author' for a in authors):
seen: set[dict] = set()
all_authors = [
walk_redirects(a, seen) for a in authors if a['key'] not in seen
]
return all_authors
return authors

# Look for OL ID first.
if key := author.get("key"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a new feature to use author identifiers for author matching that is being mixed in with other work rather than being put forward as a clear stand alone pre-requisite.

#10029 feels like a planning step, without actually making the feature request because it is presented as a question. #9927 covers the first step, but it focuses on importing identifiers in the first place, which we mostly do.

The desired feature appears to be "Use author identifiers on import to determine existing author record matches, (not just Name and date, which is the current method)"

This would involve extending the import schema to support populating the Author.remote_ids values, which does seem to be missing at the moment, although the UI allows them to be edited after import.

I'm flagging this line in particular because I don't the the author OLID value should be called key in the import schema. At this point it is just another identifier sourced from somewhere other than Open Library. It should be something like identifiers: {openlibrary: OL1234A}

Being clear about which identifiers are suitable for this purpose up front would be good too. I agree that Amazon ids aren't very 'strong' for example. VIAF, ISNI, and Wikidata are the core ones I believe have decent coverage in OL and are likely to be useful right now.

Obviously, when a general mechanism is in place, the list can be extended as new usecases are brought forward.

There was quite a bit of reading between the lines to figure out what the core new feature is here. Changing the import schema significantly in code without a clear driving feature or design had me worried.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move the importing piece out into a separate PR. That code is here because I'd already done a proof of concept for it, but I was wary about trying to push it through when the remote_ids we'd be matching to aren't filled out as much as they could be. So backfilling that info for existing authors and proactively filling it out for future WD fetches going forward would ideally be a prerequisite to incorporating that import change.

I should probably hold off on further work on this at least until an agreement is reached about which identifiers should be used, @Freso mentioned in the issue thread that "library identifiers are, in my experience, often conflated and/or lacking a lot of entries, like OCLC/VIAF/ISNI are ripe with both duplicates and conflated entities and also don’t have information on a lot of items (either reliable/useful information, or just straight no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets—community or institution managed—has its holes/gaps."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hornc please see also my mid-October comments on the related PR #9674

The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about WikiSource importing.

@hornc should be involved and the main use case of MARC import of records containing VIAF, LCCN, etc identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like WikiSource.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The desired feature appears to be "Use author identifiers on import to determine existing author record matches, (not just Name and date, which is the current method)"

#9411
#9448

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for those issue links @Freso , individual PRs that address #9411 and #9448 would be much easier to review and understand. I don't think either are really a pre-requisite to the Wikisource import feature, but #9411 and #9448 are clear and self contained, and seem reasonable to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#7724 from a year and a half ago is also relevant. Since MARC records have the highest volume of identifiers, I'd suggest starting with that.

if reply := list(web.ctx.site.things({"type": "/type/author", "key~": key})):
# Always match on OL ID, even if remote identifiers don't match.
return get_redirected_authors([web.ctx.site.get(k) for k in reply])
# Try other identifiers next.
if identifiers := author.get("identifiers"):
queries = []
matched_authors = []
# Get all the authors that match any incoming identifier.
for identifier, val in identifiers.items():
queries.append({"type": "/type/author", f"remote_ids.{identifier}~": val})
for query in queries:
if reply := list(web.ctx.site.things(query)):
matched_authors.extend(
get_redirected_authors([web.ctx.site.get(k) for k in reply])
)
matched_authors = uniq(matched_authors)
# The match is whichever one has the most identifiers in common AND does not have more conflicts than matched identifiers.
highest_matches = 0
selected_match = None
for a in matched_authors:
try:
_, matches = a.merge_remote_ids(identifiers)
if matches > highest_matches:
selected_match = a
highest_matches = matches
elif matches == highest_matches and matches > 0:
# Prioritize the lower OL ID when matched identifiers are equal
selected_match = (
a
if a.get_key_numeric() < selected_match.get_key_numeric()
else selected_match
)
except:
# Reject if too many conflicts
# TODO: raise a flag to librarians here?
pass
if highest_matches > 0 and selected_match is not None:
return [selected_match]
# Fall back to name/date matching, which we did before introducing identifiers.
name = author["name"].replace("*", r"\*")
queries = [
{"type": "/type/author", "name~": name},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: remove this

{"type": "/type/author", "name~": name},
{"type": "/type/author", "alternate_names~": name},
{
Expand All @@ -155,37 +205,17 @@ def walk_redirects(obj, seen):
"death_date~": f"*{extract_year(author.get('death_date', '')) or -1}*",
}, # Use `-1` to ensure an empty string from extract_year doesn't match empty dates.
]
things = []
for query in queries:
if reply := list(web.ctx.site.things(query)):
break

authors = [web.ctx.site.get(k) for k in reply]
if any(a.type.key != '/type/author' for a in authors):
seen: set[dict] = set()
authors = [walk_redirects(a, seen) for a in authors if a['key'] not in seen]
return authors


def find_entity(author: dict[str, Any]) -> "Author | None":
"""
Looks for an existing Author record in OL
and returns it if found.

:param dict author: Author import dict {"name": "Some One"}
:return: Existing Author record if found, or None.
"""
assert isinstance(author, dict)
things = find_author(author)
if author.get('entity_type', 'person') != 'person':
return things[0] if things else None
things = get_redirected_authors([web.ctx.site.get(k) for k in reply])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: break here

match = []
seen = set()
for a in things:
key = a['key']
if key in seen:
continue
seen.add(key)
orig_key = key
assert a.type.key == '/type/author'
if 'birth_date' in author and 'birth_date' not in a:
continue
Expand All @@ -195,10 +225,27 @@ def find_entity(author: dict[str, Any]) -> "Author | None":
continue
match.append(a)
if not match:
return None
return []
if len(match) == 1:
return match[0]
return pick_from_matches(author, match)
return [match[0]]
return [pick_from_matches(author, match)]


def find_entity(author: dict[str, Any]) -> "Author | None":
"""
Looks for an existing Author record in OL
and returns it if found.

:param dict author: Author import dict {"name": "Some One"}
:return: Existing Author record if found, or None.
"""
assert isinstance(author, dict)
things = find_author(author)
if "identifiers" in author:
for index, t in enumerate(things):
t.remote_ids, _ = t.merge_remote_ids(author["identifiers"])
things[index] = t
return things[0] if things else None


def remove_author_honorifics(name: str) -> str:
Expand Down Expand Up @@ -246,9 +293,20 @@ def import_author(author: dict[str, Any], eastern=False) -> "Author | dict[str,
new['death_date'] = author['death_date']
return new
a = {'type': {'key': '/type/author'}}
for f in 'name', 'title', 'personal_name', 'birth_date', 'death_date', 'date':
for f in (
'name',
'title',
'personal_name',
'birth_date',
'death_date',
'date',
'remote_ids',
):
if f in author:
a[f] = author[f]
# Import record hitting endpoing should list external IDs under "identifiers", but needs to be "remote_ids" when going into the DB.
if "identifiers" in author:
a["remote_ids"] = author["identifiers"]
return a


Expand Down
74 changes: 71 additions & 3 deletions openlibrary/catalog/add_book/tests/test_load_book.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,9 +136,77 @@ def test_author_wildcard_match_with_no_matches_creates_author_with_wildcard(
new_author_name = import_author(author)
assert author["name"] == new_author_name["name"]

def test_first_match_priority_name_and_dates(self, mock_site):
def test_first_match_ol_key(self, mock_site):
"""
Highest priority match is name, birth date, and death date.
Highest priority match is OL key.
"""
self.add_three_existing_authors(mock_site)

# Author with VIAF
author = {
"name": "William H. Brewer",
"key": "/authors/OL3A",
"type": {"key": "/type/author"},
"remote_ids": {"viaf": "12345678"},
}

# Another author with VIAF
author_different_key = {
"name": "William Brewer",
"key": "/authors/OL4A",
"type": {"key": "/type/author"},
"remote_ids": {"viaf": "87654321"},
}

mock_site.save(author)
mock_site.save(author_different_key)

# Look for exact match on OL ID, regardless of other fields.
# We ideally shouldn't ever have a case where different authors have the same VIAF, but this demonstrates priority.
searched_author = {
"name": "William H. Brewer",
"key": "/authors/OL4A",
"identifiers": {"viaf": "12345678"},
}
found = import_author(searched_author)
assert found.key == author_different_key["key"]

def test_second_match_strong_identifier(self, mock_site):
"""
Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
pidgezero-one marked this conversation as resolved.
Show resolved Hide resolved
"""
self.add_three_existing_authors(mock_site)

# Author with VIAF
author = {
"name": "William H. Brewer",
"key": "/authors/OL3A",
"type": {"key": "/type/author"},
"remote_ids": {"viaf": "12345678"},
}

# Another author with VIAF
author_different_viaf = {
"name": "William Brewer",
"key": "/authors/OL4A",
"type": {"key": "/type/author"},
"remote_ids": {"viaf": "87654321"},
}

mock_site.save(author)
mock_site.save(author_different_viaf)

# Look for exact match on VIAF, regardless of name field.
searched_author = {
"name": "William Brewer",
"identifiers": {"viaf": "12345678"},
}
found = import_author(searched_author)
assert found.key == author["key"]

def test_third_match_priority_name_and_dates(self, mock_site):
"""
Next highest priority match is name, birth date, and death date.
"""
self.add_three_existing_authors(mock_site)

Expand Down Expand Up @@ -201,7 +269,7 @@ def test_non_matching_birth_death_creates_new_author(self, mock_site):
assert isinstance(found, dict)
assert found["death_date"] == searched_and_not_found_author["death_date"]

def test_second_match_priority_alternate_names_and_dates(self, mock_site):
def test_match_priority_alternate_names_and_dates(self, mock_site):
"""
Matching, as a unit, alternate name, birth date, and death date, get
second match priority.
Expand Down
2 changes: 2 additions & 0 deletions openlibrary/components/AuthorIdentifiers.vue
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,8 @@ const identifierPatterns = {
lc_naf: /^n[bors]?[0-9]+$/,
amazon: /^B[0-9A-Za-z]{9}$/,
youtube: /^@[A-Za-z0-9_\-.]{3,30}/,
imdb: /^\w{2}\d+$/,
opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this part of this PR?

}
export default {
// Props are for external options; if a subelement of this is modified,
Expand Down
26 changes: 25 additions & 1 deletion openlibrary/core/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
import web
import json
import requests
import re
from typing import Any, TypedDict
from collections import defaultdict
from dataclasses import dataclass, field
Expand All @@ -29,7 +30,7 @@
from openlibrary.core.ratings import Ratings
from openlibrary.utils import extract_numeric_id_from_olid
from openlibrary.utils.isbn import to_isbn_13, isbn_13_to_isbn_10, canonical
from openlibrary.core.wikidata import WikidataEntity, get_wikidata_entity
from openlibrary.core.wikidata import WikidataEntity, get_wikidata_entity, REMOTE_IDS

from . import cache, waitinglist

Expand Down Expand Up @@ -217,6 +218,10 @@ def _get_d(self):
"l": self._get_lists_cached(),
}

def get_key_numeric(self):
"""Returns just the numeric part of the key."""
return int(re.search(r'\d+', self.key))
Copy link
Contributor

@Freso Freso Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we know our own identifiers, we also know that they always start with OL and end with a single letter (currently A, W, or E). This means that, instead of doing an expensive regular expression, we could probably just some .strip()’ing – or even just slicing the string:

Suggested change
return int(re.search(r'\d+', self.key))
return int(self.key[2:-1])

But also, what does extract_numeric_id_from_olid from openlibrary.utils do? Just based on the name, it seems eerily similar to this method, but I didn’t look it up. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used here to choose the lower of two OL IDs when identifiers are otherwise the same: https://github.com/internetarchive/openlibrary/pull/10092/files#diff-8a66754753640315d80bf708c3483a13439e24736f081161b22e2d59cee76314R183

I haven't tried yet if this will work with just a string, which was going to be part of the unit tests I add when I mark this PR as ready for review

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why this method instead of openlibrary.utils.extract_numeric_id_from_olid() for this? What does this method do that the existing function doesn’t do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably nothing, I've just never seen that function before



class ThingReferenceDict(TypedDict):
key: ThingKey
Expand Down Expand Up @@ -806,6 +811,25 @@ def get_edition_count(self):
def get_lists(self, limit=50, offset=0, sort=True):
return self._get_lists(limit=limit, offset=offset, sort=sort)

def merge_remote_ids(
self, incoming_ids: dict[str, str]
) -> tuple[dict[str, str], int]:
output = {**self.remote_ids}
if len(incoming_ids.items()) == 0:
return output, 0
matches = 0
conflicts = 0
for identifier in REMOTE_IDS:
if identifier in output and identifier in incoming_ids:
if output[identifier] != incoming_ids[identifier]:
conflicts = conflicts + 1
else:
output[identifier] = incoming_ids[identifier]
matches = matches + 1
if conflicts > matches:
raise Exception("wikidata json conflicts with existing remote ids")
return output, matches


class User(Thing):
DEFAULT_PREFERENCES = {
Expand Down
Loading
Loading