[MAINTENANCE] Delete failed DataCite and Crossref queue entries #787

Open
adambuttrick opened this issue Nov 12, 2024 · 3 comments
Labels
maintenance Routine or one-off maintenance or clean-up task

Comments

@adambuttrick

Service/repository
EZID

Describe the current state/issue
As identified in #778, failed DataCite and Crossref queue entries do not need to be preserved and thus are in need of cleanup.

Describe the desired state/solution
Write a script to safely identify and delete failed DataCite/Crossref queue entries, taking into account the potential need for timestamp updates and status changes to trigger deletion.

Additional notes
Part of the queue cleanup effort tracked in #723.

@adambuttrick adambuttrick added the maintenance Routine or one-off maintenance or clean-up task label Nov 12, 2024
@jsjiang
Contributor

jsjiang commented Dec 5, 2024

Use the newly developed management command update-async-queue-for-delete.py to delete the failed DataCite and Crossref queue entries listed in ticket #778 (comment).

  • 2024-12-05: deleted failed entries from the Crossref and DataCite queues
  • Rechecked failed lists:
    • Crossref queue: 84, all with an enqueue date/time later than Nov 3. These are newer datasets created after the initial failed lists were generated.
    • DataCite queue: 0
    • Verified that the identifiers' status was set to 'I' in the crossref, datacite and searchindexer queues
    • TO-DO: on Dec 19, after 10:30am (two weeks later), verify that the entries have been deleted from refIdentifier and from the crossref, datacite and searchindexer queues.
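
For reference, the review date in the TO-DO is simply the flagging date plus the two-week window. A minimal sketch of that arithmetic, assuming the entries were flagged at about 10:30am on Dec 5 (the exact time is inferred from the note above, and `review_time` is a hypothetical helper, not part of EZID):

```python
from datetime import datetime, timedelta


def review_time(flagged_at: datetime, retention_days: int = 14) -> datetime:
    """Return the earliest time at which the deletion should be re-checked."""
    return flagged_at + timedelta(days=retention_days)


# Assumed flagging time: 2024-12-05 10:30 (local time on the EZID host).
flagged = datetime(2024, 12, 5, 10, 30)
print(review_time(flagged))  # 2024-12-19 10:30:00
```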

Operation commands:

python manage.py update-async-queue-for-delete -i ~/ezid-ops-data/delete-queue-entries/failed_crossref_dois_tsv.txt > failed_crossref_dois.log

python manage.py update-async-queue-for-delete -i ~/ezid-ops-data/delete-queue-entries/failed_crossref_dois_tsv.txt > failed_crossref_dois.log

python manage.py update-async-queue-for-delete -i ~/ezid-ops-data/delete-queue-entries/failed_searchindexer_dois_tsv.txt > failed_is_dois.log

Datasets:

ezid@uc3-ezidui-prd01:10:03:14:~/ezid-ops-data/delete-queue-entries$ head *tsv.txt
==> failed_crossref_dois_tsv.txt <==
id	identifier	from_unixtime(q.enqueueTime)	from_unixtime(q.submitTime)
38055671	doi:10.15779/Z384T6F439	2022-05-22 14:31:52	NULL
38055730	doi:10.7272/Q6FJ2F2N	2022-05-23 00:33:01	NULL
38055829	doi:10.7297/X2FN152G	2022-05-23 06:52:39	NULL
38055831	doi:10.7297/X2639NMS	2022-05-23 06:53:18	NULL
38055832	doi:10.7297/X22F7M9S	2022-05-23 06:53:37	NULL
38055835	doi:10.7297/X2P55MCR	2022-05-23 06:55:42	NULL
38055837	doi:10.7297/X2DN43W7	2022-05-23 06:56:39	NULL
38055838	doi:10.7297/X28W3C5X	2022-05-23 06:57:02	NULL
38055839	doi:10.7297/X25719W8	2022-05-23 06:57:41	NULL

==> failed_datacite_dois_tsv.txt <==
id	identifier	from_unixtime(q.enqueueTime)	from_unixtime(q.submitTime)
38052259	doi:10.15697/FK27G7V	2022-05-21 10:16:26	2022-05-21 10:16:28
38052260	doi:10.31223/X5JM0K	2022-05-21 10:30:05	2022-05-21 10:30:05
38052265	doi:10.15697/FK23P4H	2022-05-21 11:25:11	2022-05-21 11:25:12
38052266	doi:10.15697/FK20022	2022-05-21 11:31:30	2022-05-21 11:31:31
38098760	doi:10.5070/P538257510	2022-05-27 10:55:48	2022-05-27 10:55:52
38098760	doi:10.5070/P538257510	2022-05-27 11:02:47	2022-05-27 11:02:48
36867411	doi:10.31223/X5HC98	2022-05-28 01:00:08	2022-05-28 01:00:09
38098760	doi:10.5070/P538257510	2022-05-30 19:40:45	2022-05-30 19:40:47
38118071	doi:10.31223/X5C05Q	2022-05-31 11:04:17	NULL

==> failed_searchindexer_dois_tsv.txt <==
id	identifier	from_unixtime(q.enqueueTime)	from_unixtime(q.submitTime)
12517622	ark:/87281/t2fj2f25	2024-02-06 10:40:18	NULL
12517647	ark:/87281/t26w98bm	2024-02-06 10:40:34	NULL
12517648	ark:/87281/t2348hnn	2024-02-06 10:40:35	NULL
12517649	ark:/87281/t2zc8149	2024-02-06 10:40:35	NULL
12517650	ark:/87281/t2tm78c4	2024-02-06 10:40:36	NULL
12517651	ark:/87281/t2pv6hmz	2024-02-06 10:40:37	NULL
12517652	ark:/87281/t2k35rwj	2024-02-06 10:40:37	NULL
12517653	ark:/87281/t2fb5162	2024-02-06 10:40:38	NULL
12517654	ark:/87281/t29p2zw8	2024-02-06 10:40:39	NULL
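
Note that the same id can appear on several rows of an input file (38098760 is listed three times in the DataCite sample above). A rough sketch for counting rows per id before feeding a file to the command, assuming the files are tab-separated with the header shown; `count_ids` is a hypothetical helper and the sample text is copied from four of the rows above:

```python
import csv
import io
from collections import Counter


def count_ids(tsv_text: str) -> Counter:
    """Count how many rows each refIdentifier id has in a TSV export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(int(row["id"]) for row in reader)


# Four rows taken from the failed_datacite_dois_tsv.txt listing.
sample = (
    "id\tidentifier\tfrom_unixtime(q.enqueueTime)\tfrom_unixtime(q.submitTime)\n"
    "38098760\tdoi:10.5070/P538257510\t2022-05-27 10:55:48\t2022-05-27 10:55:52\n"
    "38098760\tdoi:10.5070/P538257510\t2022-05-27 11:02:47\t2022-05-27 11:02:48\n"
    "36867411\tdoi:10.31223/X5HC98\t2022-05-28 01:00:08\t2022-05-28 01:00:09\n"
    "38098760\tdoi:10.5070/P538257510\t2022-05-30 19:40:45\t2022-05-30 19:40:47\n"
)

counts = count_ids(sample)
repeated = sorted(i for i, n in counts.items() if n > 1)
print(repeated)  # [38098760]
```

Whether repeated ids matter depends on how update-async-queue-for-delete handles them, which this thread does not state.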

Database queries:

select ref.id, ref.identifier as datacite_doi, from_unixtime(q.enqueueTime), from_unixtime(q.submitTime)
from ezidapp_datacitequeue q
join ezidapp_refidentifier ref on ref.id = q.refIdentifier_id
where q.status = 'F' order by q.enqueueTime;

select ref.id, ref.identifier as crossref_doi, from_unixtime(q.enqueueTime), from_unixtime(q.submitTime)
from ezidapp_crossrefqueue q
join ezidapp_refIdentifier ref on ref.id = q.refIdentifier_id
where q.status = 'F' order by q.enqueueTime;

select ref.id, ref.identifier as identifier, from_unixtime(q.enqueueTime), from_unixtime(q.submitTime)
from ezidapp_searchindexerqueue q
join ezidapp_refIdentifier ref on ref.id = q.refIdentifier_id
where q.status = 'F' order by q.enqueueTime;

Verified entry status:

select * from ezidapp_refIdentifier
where id in 
(38055671, 38055730, 38055829, 38055831,
38052259, 38052260, 38098760, 38118071
);

select * from ezidapp_crossrefqueue
where refidentifier_id in 
(38055671, 38055730, 38055829, 38055831,
38052259, 38052260, 38098760, 38118071
);

select * from ezidapp_datacitequeue
where refidentifier_id in 
(38055671, 38055730, 38055829, 38055831,
38052259, 38052260, 38098760, 38118071
);

select * from ezidapp_searchindexerqueue
where refidentifier_id in 
(38055671, 38055730, 38055829, 38055831,
38052259, 38052260, 38098760, 38118071,
12517622, 12517647, 12517648, 12517649
);
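
The spot checks above could also be scripted. The following is only a sketch: it uses an in-memory SQLite table as a stand-in for the real database, the statuses in the demo rows are made up for illustration, and `ids_not_flagged` is a hypothetical helper. Only the table names and the `refIdentifier_id` and `status` columns come from the queries in this comment.

```python
import sqlite3

# Queue tables named in the queries above; anything else is rejected.
QUEUE_TABLES = {
    "ezidapp_crossrefqueue",
    "ezidapp_datacitequeue",
    "ezidapp_searchindexerqueue",
}


def ids_not_flagged(conn, table, ref_ids, expected="I"):
    """Return the ids that still have a queue row whose status is not `expected`."""
    if table not in QUEUE_TABLES:
        raise ValueError(f"unexpected table: {table}")
    ref_ids = list(ref_ids)
    if not ref_ids:
        return []
    # One placeholder per id so the values are bound, not string-formatted.
    placeholders = ", ".join("?" for _ in ref_ids)
    sql = (
        f"select distinct refIdentifier_id from {table} "
        f"where refIdentifier_id in ({placeholders}) and status <> ? "
        "order by refIdentifier_id"
    )
    return [row[0] for row in conn.execute(sql, ref_ids + [expected])]


# Stand-in data: the statuses here are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table ezidapp_crossrefqueue (refIdentifier_id integer, status text)"
)
conn.executemany(
    "insert into ezidapp_crossrefqueue values (?, ?)",
    [(38055671, "I"), (38055730, "I"), (38055829, "F")],
)
print(ids_not_flagged(conn, "ezidapp_crossrefqueue", [38055671, 38055730, 38055829]))
```

An empty result would mean every checked entry carries the expected status in that queue.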

Failed entries in the search indexer queue:
failed_searchindexer_dois_tsv.txt

@jsjiang
Contributor

jsjiang commented Dec 5, 2024

TO-DO:

  • review entry status on Dec 19 after 10:30am
  • review newer failed DOIs from Crossref (84 entries)

failed_crossref_dois_after_20241103_tsv.txt

@adambuttrick Hi Adam, there are 84 failed entries in the Crossref queue since Nov 3. Please take a look. Thank you!

@jsjiang
Contributor

jsjiang commented Dec 5, 2024

Queue status after deleting failed entries Dec 5:

ezid@uc3-ezidui-prd01:14:53:17:~/ezid$ python manage.py diag-queue-stats
{
  "download": {},
  "binder": {},
  "datacite": {
    "I": 1688,
    "O": 54127
  },
  "crossref": {
    "F": 84,
    "I": 58136,
    "O": 9,
    "U": 5,
    "W": 81
  },
  "searchindexer": {
    "I": 1688,
    "O": 54697,
    "U": 1
  }
}
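
To pick out the queues that still hold failed entries from this output, something like the following sketch could be used. It assumes the JSON printed by diag-queue-stats has been captured as text; `failed_counts` is a hypothetical helper, and the numbers are copied from the output above.

```python
import json


def failed_counts(stats: dict) -> dict:
    """Map each queue name to its number of 'F' (failed) entries, omitting zeros."""
    return {
        queue: counts["F"]
        for queue, counts in stats.items()
        if counts.get("F", 0) > 0
    }


# Output of `python manage.py diag-queue-stats` as shown above.
raw = """
{
  "download": {},
  "binder": {},
  "datacite": {"I": 1688, "O": 54127},
  "crossref": {"F": 84, "I": 58136, "O": 9, "U": 5, "W": 81},
  "searchindexer": {"I": 1688, "O": 54697, "U": 1}
}
"""

print(failed_counts(json.loads(raw)))  # {'crossref': 84}
```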
