Data in the reporting cluster does not match data elsewhere #2541

Open
paul-butcher opened this issue Jan 31, 2024 · 4 comments

Comments

@paul-butcher
Contributor

Slack

When investigating #2536, I noticed that one of the problematic records has an invalid MARC 650 field. It declares a Library of Congress id (ind2=0), but contains a MeSH id (subfield 0 starts with D).

The record in question is b1269556, but this is not unique to that record. I have also seen this error occur with D009524Q000266 and D010297 and plenty of other incorrectly marked MeSH ids.

650  0 Newspapers|xEnglish.|0D009524. 

When this record last went through the pipeline, it logged an error:

Could not determine LoC scheme from id 'D009524'
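
The error message suggests that the scheme is inferred from the identifier's prefix. A minimal sketch of that kind of check follows. It is not the pipeline's actual code: the function name and the scheme labels are hypothetical, and it assumes that subject heading ids start with `sh` and name authority ids start with `n`.

```python
def loc_scheme_from_id(identifier: str) -> str:
    """Guess the Library of Congress scheme from an identifier's prefix.

    Assumptions (not taken from the pipeline source): subject heading ids
    start with "sh", name authority ids start with "n", and anything else
    is rejected. The returned labels are illustrative only.
    """
    # Drop surrounding whitespace and the trailing full stop that MARC
    # subfields often carry, e.g. "D009524." becomes "D009524".
    cleaned = identifier.strip().rstrip(".")
    if cleaned.startswith("sh"):
        return "lc-subjects"
    if cleaned.startswith("n"):
        return "lc-names"
    raise ValueError(f"Could not determine LoC scheme from id '{cleaned}'")
```

Under these assumptions a MeSH id such as `D009524` matches neither prefix, which would produce the error above.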

When I looked in VHS for it, the field in question was incorrect, as expected (ind2=0, subfield 0=D009524).

{
  "fieldTag": "d",
  "marcTag": "650",
  "ind1": " ",
  "ind2": "0",
  "subfields": [
    {
      "tag": "a",
      "content": "Newspapers"
    },
    {
      "tag": "x",
      "content": "English."
    },
    {
      "tag": "0",
      "content": "D009524."
    }
  ]
}
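
A check for this kind of mismatch could be sketched as below, working on a varfield shaped like the one above. The function name is hypothetical, and the pattern assumes that a MeSH id is `D` followed by digits, optionally followed by a qualifier such as `Q000266` and a trailing full stop.

```python
import re

# Assumption: a MeSH id is "D" plus digits, optionally followed by a
# qualifier ("Q" plus digits) and a trailing full stop, e.g. "D009524."
# or "D009524Q000266".
MESH_ID = re.compile(r"^D\d+(Q\d+)?\.?$")


def is_mislabelled_mesh(varfield: dict) -> bool:
    """Return True for a 6xx varfield that claims LoC (ind2=0) but whose
    subfield 0 looks like a MeSH id."""
    marc_tag = varfield.get("marcTag") or ""
    if not marc_tag.startswith("6") or varfield.get("ind2") != "0":
        return False
    return any(
        subfield.get("tag") == "0"
        and MESH_ID.match(subfield.get("content", "").strip()) is not None
        for subfield in varfield.get("subfields", [])
    )
```

Applied to the 650 field above this returns `True`. The same field with ind2=2 returns `False`.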

I decided I should make a report on this, to see how widespread the issue is, and facilitate its resolution.

Imagine my surprise when the report found no 6xx varfields with ind2=0 and a MeSH id. I know this to be incorrect, as I am currently looking at one. I searched the reporting cluster for a few other known offenders, and there they all have the correct ind2 value (2).

I cannot work out where this is coming from. How does the reporting cluster end up with different content to everywhere else?

@paul-butcher
Contributor Author

I did recently fix a problem in the adapters that could have caused reporting to miss some changes that the pipeline picked up.

However, that's the wrong way round.

  • The source data error is the kind of error one makes when originally writing the content. I can't imagine someone coming along and changing ind2 to the wrong value.
  • I have looked back through some of the history of this record on VHS. It last changed before that problem arose, and this field seems always to have contained ind2=0.

@paul-butcher
Contributor Author

paul-butcher commented Jan 31, 2024

I have just run a window covering the last known change on that document {"start": "2023-08-18T09:35:00.000000+00:00", "end": "2023-08-18T09:35:36+00:00"}, and ind2 in reporting is now 0.

This is now the only ind2=0 varfield whose code does not start with sh.

Perhaps they were already all correct and then a batch process on Sierra broke them all. This record was correct in 2021.

@paul-butcher
Contributor Author

Right. Very odd.

b1269556 was originally correct, then at 2023-08-10T18:08:07Z it changed ind2 from 2 to 0.

So, we have (had) two problems here

  1. The source data problem. It's wrong.
  2. The reporting hole. Perhaps I was wrong when I estimated it had been broken since September. The data is now going into reporting as expected, but there is still a gap of some kind.

The existence of this source data problem could have been easier to spot and deal with if we had a better way to report dodgy content to collections staff.

@paul-butcher
Contributor Author

The same is true of b30489313. In August last year (2023-08-17 15:53:15Z), it changed from the correct ind2=2 to the incorrect ind2=0.
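
Finding the moment a record went wrong amounts to walking its version history and noting where ind2 flips. A minimal sketch, assuming the history has already been reduced to chronological (timestamp, ind2) pairs; the function name and the input shape are hypothetical and do not reflect how VHS stores versions.

```python
from typing import Iterable, List, Tuple


def ind2_changes(history: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, old_ind2, new_ind2) for every change of ind2.

    `history` is assumed to be ordered oldest first.
    """
    changes = []
    previous = None
    for timestamp, ind2 in history:
        if previous is not None and ind2 != previous:
            changes.append((timestamp, previous, ind2))
        previous = ind2
    return changes


# Only the 2023-08-10 timestamp comes from this thread; the earlier
# snapshot's timestamp is a placeholder for illustration.
example_history = [
    ("2021-01-01T00:00:00Z", "2"),
    ("2023-08-10T18:08:07Z", "0"),
]
print(ind2_changes(example_history))
```

Running something like this across many records would show whether the changes cluster around the same dates, which would support the batch process theory.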
