Missing item records #2536

Closed
paul-butcher opened this issue Jan 26, 2024 · 20 comments

paul-butcher commented Jan 26, 2024

There are six records where Sierra suggests that the item records should be visible online, but they are not (although the linked bib records are; the work IDs are listed in the CSV). This seems to be connected in some way to the 962 field in Sierra.

Slack

@paul-butcher paul-butcher converted this from a draft issue Jan 26, 2024
@paul-butcher paul-butcher self-assigned this Jan 26, 2024

paul-butcher commented Jan 26, 2024

In the original message, two records (EPB/A/57329.v1, i12614166 and EPB/A/57329.v2, i12614178) are stated as being completely absent, both corresponding to b12666579, which is szjj6gqy, which redirects to (i.e. merges with) https://wellcomecollection.org/works/jb4kp3td

(found by searching for the b number in the works-identified index, thus: _search?q=b12666579)


paul-butcher commented Jan 26, 2024

szjj6gqy has a relatively large matcher graph, including b33013950 and L0005170, which are suppressed. Could this be related to #1561?

szjj6gqy.pdf


paul-butcher commented Jan 26, 2024

The second one on the list - mkmgxzfz has nothing else in the graph. It's just L0025074, so it's not likely to be a merger-based problem. However, where has b12695567 gone?


paul-butcher commented Jan 29, 2024

11:15:54.958 [main-actor-system-akka.actor.default-dispatcher-11] ERROR w.p.t.sierra.SierraTransformer - Failed to perform transform to unified item of Work[sierra-system-number/b12695567]

I have a suspicion

@paul-butcher

Suspicion confirmed (for this one at least):
In b12695567, there is a MARC 650 field with a second indicator value of 0 (i.e. LCSH), which contains a MeSH code (D009524).

650 0 Press.|0sh 85106500
650 0 Newspapers|xEnglish.|0D009524.
650 2 Journalism|xhistory.|0D020452Q000266
651 0 Great Britain.|0n 79023147

This causes the transformer to fail because it cannot work out which kind of Library of Congress Identifier D009524 is (because it isn't one).
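
A minimal sketch of this failure mode, written in Python rather than taken from the transformer itself. It assumes that Library of Congress identifiers are recognised by prefix (`sh` for subject headings, `n` for names), which fits the two genuine LoC values in the fields above. The function name and the scheme labels are illustrative, not the pipeline's actual API.

```python
def classify_loc_identifier(value: str) -> str:
    """Guess which Library of Congress scheme an identifier belongs to.

    Assumption: 'sh...' is a subject heading and 'n...' is a name authority.
    Anything else cannot be classified, so we raise rather than guess.
    """
    normalised = value.replace(" ", "").rstrip(".")
    if normalised.startswith("sh"):
        return "lc-subjects"
    if normalised.startswith("n"):
        return "lc-names"
    raise ValueError(f"{value!r} does not look like a Library of Congress identifier")


# The $0 values from the 650/651 fields with second indicator 0, as quoted above.
for raw in ["sh 85106500", "D009524.", "n 79023147"]:
    try:
        print(raw, "->", classify_loc_identifier(raw))
    except ValueError as error:
        print("cannot transform:", error)
```

With a strict classifier like this, one mislabelled identifier is enough to reject the whole record, which would be consistent with the b number disappearing from the pipeline.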


paul-butcher commented Jan 29, 2024

This is only a problem for b12695567. The MARC records behind the other b numbers do not bear such an identifier/scheme mismatch.

  • b32950974: Both 650 fields look fine?
  • b1456029x: No 6xx fields?
  • b12666579: Both 650 and 655 look fine, must be something else?

All three of these appear to go through the transformer without a hitch.


paul-butcher commented Jan 29, 2024

b12666579 corresponds to xwtagnu3, and is mentioned in three MIRO records out of the id minter: szjj6gqy(L0001320), hqj4zaem(L0005172), qjxnq9es(L0005171). These all redirect to https://wellcomecollection.org/works/jb4kp3td

@paul-butcher

I wonder if it relates to #2460


paul-butcher commented Jan 29, 2024

@paul-butcher

Before merging, xwtagnu3 has two items

      "items": [
        {
          "id": {
            "canonicalId": "ck5zktvq",
            "sourceIdentifier": {
              "identifierType": {
                "id": "sierra-system-number"
              },
              "ontologyType": "Item",
              "value": "i12614166"
            },
            "otherIdentifiers": [
              {
                "identifierType": {
                  "id": "sierra-identifier"
                },
                "ontologyType": "Item",
                "value": "1261416"
              }
            ],
            "type": "Identified"
          },
          "title": "Vol. 1",
          "locations": [
            {
              "locationType": {
                "id": "closed-stores"
              },
              "label": "Closed stores",
              "shelfmark": "EPB/A/57329.v1",
              "accessConditions": [
                {
                  "method": {
                    "type": "OnlineRequest"
                  },
                  "status": {
                    "type": "Open"
                  }
                }
              ],
              "type": "PhysicalLocation"
            }
          ]
        },
        {
          "id": {
            "canonicalId": "b6v5kccp",
            "sourceIdentifier": {
              "identifierType": {
                "id": "sierra-system-number"
              },
              "ontologyType": "Item",
              "value": "i12614178"
            },
            "otherIdentifiers": [
              {
                "identifierType": {
                  "id": "sierra-identifier"
                },
                "ontologyType": "Item",
                "value": "1261417"
              }
            ],
            "type": "Identified"
          },
          "title": "Vol. 2",
          "locations": [
            {
              "locationType": {
                "id": "closed-stores"
              },
              "label": "Closed stores",
              "shelfmark": "EPB/A/57329.v2",
              "accessConditions": [
                {
                  "method": {
                    "type": "OnlineRequest"
                  },
                  "status": {
                    "type": "Open"
                  }
                }
              ],
              "type": "PhysicalLocation"
            }
          ]
        }
      ],

jb4kp3td has one

      "items": [
        {
          "id": {
            "canonicalId": "fjy67ayz",
            "sourceIdentifier": {
              "identifierType": {
                "id": "sierra-system-number"
              },
              "ontologyType": "Item",
              "value": "i12020114"
            },
            "otherIdentifiers": [
              {
                "identifierType": {
                  "id": "sierra-identifier"
                },
                "ontologyType": "Item",
                "value": "1202011"
              }
            ],
            "type": "Identified"
          },
          "locations": [
            {
              "locationType": {
                "id": "closed-stores"
              },
              "label": "Closed stores",
              "accessConditions": [
                {
                  "method": {
                    "type": "OnlineRequest"
                  },
                  "status": {
                    "type": "Open"
                  }
                }
              ],
              "type": "PhysicalLocation"
            }
          ]
        }
      ],

After the merge, jb4kp3td has none.


paul-butcher commented Jan 29, 2024

The relevant merge rules all chain with orElse, finishing with just returning the target's item list. This means that one of the earlier rules is being applied, returning Nil.

The jb4kp3td graph is made up of six Miro works and four Sierra works.
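
The chaining described above can be sketched as follows. This is a Python analogy for the Scala `orElse` chain, not the merger's code: a rule returns `None` when it does not apply, and the first rule that returns anything (including an empty list) wins. All function names here are hypothetical.

```python
from typing import Callable, List, Optional

Items = List[str]
Rule = Callable[[Items, Items], Optional[Items]]


def first_applicable(rules: List[Rule], target: Items, source: Items) -> Items:
    """Return the result of the first rule that applies.

    If no rule applies, fall back to the target's own item list,
    mirroring the final step of the chain.
    """
    for rule in rules:
        result = rule(target, source)
        if result is not None:
            return result
    return target


def rule_that_does_not_apply(target: Items, source: Items) -> Optional[Items]:
    return None  # "not my case", so the chain moves on


def rule_that_applies_but_returns_nothing(target: Items, source: Items) -> Optional[Items]:
    return []  # applies, and shadows the fallback with an empty list


print(first_applicable([rule_that_does_not_apply], ["i12020114"], []))
print(first_applicable(
    [rule_that_does_not_apply, rule_that_applies_but_returns_nothing],
    ["i12020114"],
    [],
))
```

The second call shows why an empty result points at an earlier rule: the fallback is never reached once any rule has returned a list, even an empty one.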


paul-butcher commented Jan 29, 2024

Yes. It looks like the culprit is mergeDigitalIntoPhysicalSierraTarget.

jb4kp3td has a single physical item, which means that the rule merges the list of digitised items with the target's list. However, that only happens if the source is a "physical/digitised Sierra work"; otherwise, the target's item list is emptied.


paul-butcher commented Jan 29, 2024

So, I think there are two things at play here:

  1. I suspect that emptying out the list of physical locations when it can't find a linked digitised item is incorrect, and it should probably just return the target's items.
  2. I think it is related to #1561 ("Some works are merged non-deterministically"). All four of those Sierras in the jb4kp3td graph have the same target precedence (they are all Sierras with at least one Physical item), but only xwtagnu3 has two items; the others have one (see also Slack).
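
The first point can be illustrated with a small Python sketch that contrasts the behaviour as described in this thread with the suggested alternative. It is a simplification under stated assumptions: items are plain dictionaries with a hypothetical `kind` key, and "the source is a physical/digitised Sierra work" is reduced to "the source has at least one digitised item". Neither function is the real `mergeDigitalIntoPhysicalSierraTarget`.

```python
from typing import Dict, List

Item = Dict[str, str]


def digitised_items(items: List[Item]) -> List[Item]:
    """Pick out the items that this (simplified) rule is willing to merge."""
    return [item for item in items if item["kind"] == "digitised"]


def merge_as_described(target_items: List[Item], source_items: List[Item]) -> List[Item]:
    """Behaviour as described above: no digitised source item empties the list."""
    found = digitised_items(source_items)
    if not found:
        return []
    return target_items + found


def merge_as_proposed(target_items: List[Item], source_items: List[Item]) -> List[Item]:
    """Suggested alternative: with nothing to add, keep the target's items."""
    return target_items + digitised_items(source_items)


# Canonical IDs from the JSON above; the "kind" values are this sketch's labels.
target = [{"id": "fjy67ayz", "kind": "physical"}]
source = [
    {"id": "ck5zktvq", "kind": "physical"},
    {"id": "b6v5kccp", "kind": "physical"},
]

print(merge_as_described(target, source))  # the target's single item is lost
print(merge_as_proposed(target, source))   # the target's single item survives
```

Whether the two physical items from xwtagnu3 should also appear on the merged work is a separate question, tied to the second point, and this sketch does not try to answer it.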

@paul-butcher

There is only a test for the positive side of this, i.e. does it add Digital items to a Physical one? There are no tests for the rest of the possible scenarios: does it combine Physical ones, and what does it do if there are no digital ones? I suspect that's a bit of an oversight.

it("merges an 856 item from a digitised Sierra work into a physical work") {

@paul-butcher

It looks like the same thing is happening for f55dwbyz and wpznxk63. wpznxk63 has one item, f55dwbyz has three. Everything gets merged into wpznxk63, emptying wpznxk63's item list because none of f55dwbyz's items are of the right kind.

Both ps2phs5n and f55dwbyz have one item each, but neither is of the right kind to merge into the other.


paul-butcher commented Jan 30, 2024

It gets weirder! Why does b11675676 link to M0011992? I can't see it in the MARC. False alarm, it's here:

089 00 M 11992 
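
For reference, a small Python sketch of how an 089 value such as the one above maps onto a Miro ID. It assumes that a Miro ID is one capital letter followed by seven digits and that the 089 field drops the leading zeros, which is inferred from this example only. The function name is hypothetical.

```python
import re


def miro_id_from_089(field_value: str) -> str:
    """Turn an 089 value like 'M 11992' into a Miro ID like 'M0011992'.

    Assumption: one capital letter, optional whitespace, then up to
    seven digits that need zero-padding.
    """
    match = re.fullmatch(r"\s*([A-Z])\s*(\d{1,7})\s*", field_value)
    if match is None:
        raise ValueError(f"unrecognised 089 value: {field_value!r}")
    prefix, digits = match.groups()
    return f"{prefix}{int(digits):07d}"


print(miro_id_from_089("M 11992 "))
```

A plain text search of the MARC for "M0011992" will miss the link, because the field stores the shortened form.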

@paul-butcher

Slack ramble


paul-butcher commented Feb 2, 2024

I've got to the bottom of this, and there are a few different problems.

The simplest one is this:

MeSH IDs marked as LCSH

mkmgxzfz comes from the Miro record only, so the items are missing. Sierra b12695567 is failing in the transform because it has a MeSH ID in a field that should hold a LoC ID: 650 0 Newspapers|xEnglish.|0D009524. There are a few of these scattered around Sierra. It looks like something happened in August last year that set the second indicator on some fields to 0 when it had previously been 2 (b30489313 also has an example of this: 651 2 Latin America.|0D007843). Unfortunately, there was something wrong with reporting around then, so I can't easily create a handy list of them.
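
One way such a list could be built, given a dump of the fields as text, is sketched below in Python. It is not an existing report. It assumes one field per line in the display form used in this thread (tag, second indicator, then content with `|`-delimited subfields) and treats any `$0` value starting with `D` and six digits as MeSH-shaped. The function name and the input format are assumptions.

```python
import re
from typing import List, Tuple

MESH_SHAPED = re.compile(r"D\d{6}")
SUBJECT_TAGS = {"650", "651"}


def mismatched_subject_ids(bnumber: str, marc_lines: List[str]) -> List[Tuple[str, str, str]]:
    """List (bnumber, tag, identifier) for LCSH-flagged fields holding MeSH-shaped IDs."""
    findings = []
    for line in marc_lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue
        tag, second_indicator, content = parts
        # A second indicator of 0 means the field claims to be LCSH.
        if tag not in SUBJECT_TAGS or second_indicator != "0":
            continue
        # The first chunk is the undelimited leading subfield, so skip it.
        for subfield in content.split("|")[1:]:
            value = subfield[1:].strip()
            if subfield.startswith("0") and MESH_SHAPED.match(value):
                findings.append((bnumber, tag, value.rstrip(".")))
    return findings


fields = [
    "650 0 Press.|0sh 85106500",
    "650 0 Newspapers|xEnglish.|0D009524.",
    "650 2 Journalism|xhistory.|0D020452Q000266",
    "651 0 Great Britain.|0n 79023147",
]
print(mismatched_subject_ids("b12695567", fields))
```

Run over a full export, this would only flag candidates for review, since the pattern is a heuristic.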

The next two are related to the merging of records:

Dropping items

We had a specific scenario in which items were simply dropped. I'm fixing this now.

Suspicious merges

Finally, there are some merges happening via links to Miro images that seem suspicious (I don't know whether this is related to the boundwith problem mentioned in the other thread). I'm not sure if this is something that needs to be fixed in the source data, or if it's something I can deal with in the merger.

wpznxk63 is the result of merging these two Sierra records with the corresponding Miro records:

  • b1456029x (i13863034) merges with 9 Miro images, including L0013383:
    962 000:000:URL:b0000000:000000:0:0:0:0:0:0|vn|uhttp:// wellcomeimages.org/indexplus/image/L0013383.html|ehttp:// wellcomeimages.org/ixbin/imageserv?MIRO=L0013383
  • b15260963 (i14614224, i15122633, i15122645) also merges with L0013383:
    089 00 L 13383

Something similar is happening with jb4kp3td (b1180232, b12666579, b11802339, b11802315), where there is a mixture of 089 and 962 fields linking to Miro.

b11675676 and b32950974 show the same phenomenon: b11675676 links to M0011992 via an 089 field, and b32950974 via a 962 field.
Which items do we expect to see on the resulting record on the website? Do we actually expect these two records to merge?
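
To make the mechanism concrete, here is a Python sketch of how two bib records end up in one merge group just because they point at the same Miro image. It is a deliberate simplification of the matcher: it only follows bib-to-Miro links, takes those links as already extracted from the 089 and 962 fields, and includes only the Miro IDs named in this thread. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def merge_groups(links: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group bib numbers that are connected through shared Miro IDs."""
    bibs_by_miro: Dict[str, Set[str]] = defaultdict(set)
    for bib, miro_ids in links.items():
        for miro_id in miro_ids:
            bibs_by_miro[miro_id].add(bib)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in sorted(links):
        if start in seen:
            continue
        group: Set[str] = set()
        queue = [start]
        while queue:
            bib = queue.pop()
            if bib in group:
                continue
            group.add(bib)
            # Follow every Miro link to the other bibs that share it.
            for miro_id in links[bib]:
                queue.extend(bibs_by_miro[miro_id] - group)
        seen |= group
        groups.append(group)
    return groups


links = {
    "b1456029x": {"L0013383"},  # via 962; its other eight Miro links are not listed here
    "b15260963": {"L0013383"},  # via 089
    "b11675676": {"M0011992"},  # via 089
    "b32950974": {"M0011992"},  # via 962
}
for group in merge_groups(links):
    print(sorted(group))
```

Under this model a single shared image is enough to join two otherwise unrelated bibs, so removing or correcting one link at source would split the group again.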


amme2 commented Feb 6, 2024

Thanks @paul-butcher.

We've sorted out the subject headings IDs in mkmgxzfz and the relevant item records are now appearing online. It does look like there's a wider issue here connected to the automated subject authority control process back in August 2023, so Collections Information will pick this up and investigate further.

Similarly the suspicious merges examples are probably best fixed at source rather than in the merger, but that decision is based on there only being 3 examples. Are you able to check whether there are only 3? If it turns out there are thousands of examples of this issue, we might want to take a different tack in resolving it...

@paul-butcher

I don't think we have an efficient way to find matcher relationships by that criterion. If you can fix these three for now, we can have a think about how to report this in the future.

@paul-butcher paul-butcher moved this from Blocked to Done in Digital platform Feb 9, 2024
@pollecuttn pollecuttn moved this from Done to Archive in Digital platform Feb 15, 2024