Retry search at current cursor on total:None #548

Open
wants to merge 2 commits into base: master

Conversation

@hornc (Contributor) commented Aug 22, 2022

This is more of a workaround for an issue I've run into with search result paging. I'm not sure why the client sporadically receives an unexpected response; it seems like a problem with the search API that might need investigation.

I'm occasionally getting the following error part way through an ia search over largish result sets (e.g. 92,669 results):

Traceback (most recent call last):
  File "/home/charles/.local/bin/ia", line 8, in <module>
    sys.exit(main())
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/cli/ia.py", line 171, in main
    sys.exit(ia_module.main(argv, session))
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/cli/ia_search.py", line 100, in main
    for result in search:
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/search.py", line 279, in __next__
    return self.iterator.__next__()
  File "/home/charles/.local/lib/python3.8/site-packages/internetarchive/search.py", line 153, in _scrape
    num_found = int(j['total'])
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Example API JSON response (from adding some debug code to capture the unexpected response):

UNEXPECTED RESPONSE: {'items': [], 'count': 0, 'previous': 'W3siaWRlbn....REDACTEDjustincase...X0tONjgzIn1d', 'total': None}

This was occurring sporadically for me over the weekend, at multiples of 10K results, on various queries returning between 30K and 90K num_results.

The query I was using is admittedly a bit complex:

ia search "mediatype:texts AND boxid:IA* AND scribe3_search_catalog:isbn AND identifier:t*" -f "scribe3_search_id" -p scope:all

But I have in the past run 'worse' ones with more clauses, fields, and expected results without problems :)

I haven't tested yet whether scope:all contributes to the buggy behaviour. I started filtering on the first letter of the identifier to reduce the size of the queries.

The code change here simply checks total before assuming it is an int, and retries the 10K-limit request at the current cursor if total is None, which seems to be an unexpected result; the current code expects either 0 or a positive integer.

This was enough to let me complete a number of large search tasks.
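
Roughly, the retry-at-cursor idea looks like the sketch below. This is illustrative only, not the actual diff in this PR: fetch_page is a hypothetical callable standing in for one scrape-API request that returns the decoded JSON, and the parameter and key names are assumptions based on the responses shown above.

def iter_scrape_results(fetch_page, params, page_size=10_000):
    """Sketch of the workaround: retry the same cursor when 'total' is None."""
    params = dict(params, count=page_size)
    while True:
        page = fetch_page(params)
        if page.get('total') is None:
            # Unexpected response (like the captured JSON above): re-request
            # the current cursor instead of crashing on int(None).
            continue
        num_found = int(page['total'])  # the call that raised TypeError above
        yield from page.get('items', [])
        cursor = page.get('cursor')
        if not cursor:
            break
        params['cursor'] = cursor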

Possible issues:

  • This could get stuck in a loop: there's no give-up clause if the API decides to consistently return total: None.
  • I only checked very briefly whether 0 search results avoid this retry code. Searching for non-existent categories with 0 results does not get stuck in a loop, but there may be other cases that need checking.
  • total:None looks like an API bug that would be masked by this workaround without understanding the root cause.

@hornc marked this pull request as ready for review August 23, 2022 08:29
Review thread on internetarchive/search.py (outdated, resolved)
@maxz (Contributor) commented Aug 27, 2022

I'm not a huge fan of this band-aid fix, especially when considering the possibility of the infinite None loop.
But without taking a deeper look at the pagination code I can't suggest a better solution either.

@jjjake (Owner) commented Aug 30, 2022

Thanks for the PR @hornc, and thanks for taking a look, maxz!

I think we should retry a few times when we run into total: None, and then raise a ReadTimeout exception, as the existing code does elsewhere, if all retries fail or keep returning None for total. Based on feedback from our search team, I think we should back off for 1 second on each retry. We should probably also expose these values (retry count, backoff delay) via optional parameters.

@hornc I'm happy to make these changes if you'd like. I have a few other projects that need tending to, but hopefully I can get to it soon! Thanks again.
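
A rough sketch of what that could look like; the names, defaults, and the fetch_page stand-in below are placeholders, not the library's actual API:

import time
from requests.exceptions import ReadTimeout

def fetch_page_with_retries(fetch_page, params, max_retries=5, backoff=1.0):
    """Hypothetical wrapper: re-request the current cursor a few times with a
    fixed delay, raising ReadTimeout if 'total' is still None after all retries."""
    for _ in range(max_retries):
        page = fetch_page(params)
        if page.get('total') is not None:
            return page
        time.sleep(backoff)  # 1-second pause per the search team's feedback
    raise ReadTimeout(
        f"scrape API kept returning total: None after {max_retries} retries")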

@hornc (Contributor, Author) commented Aug 31, 2022

Thanks @jjjake , that sounds great!

@maxz, thanks for the review. I'll remove my comment line, but adding the backoff etc. and doing it properly will be the better solution. Happy to close this in favour of a complete solution from @jjjake.

@cclauss (Contributor) commented Jan 7, 2023

@mekarpeles This was the behavior we were seeing on Friday.

@mekarpeles (Contributor) commented

First off, I want to +1 @jjjake's proposal of having some sort of retry with exponential backoff.

However, today the existing behaviour is worse than the PR that @hornc offers, in that it throws exceptions upstream.

I would support merging @hornc's changes to unbreak code that relies on this functionality, and then also support an effort that adds retries + backoff and warnings.

@mekarpeles (Contributor) left a comment

Would prefer exponential backoff as well, though this would solve problems purl is seeing (we are currently swallowing the exceptions on our side).
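
For context, swallowing the exception on the caller's side looks roughly like the sketch below; search_items is the library's public entry point, while safe_identifiers and the rest are illustrative only:

from internetarchive import search_items

def safe_identifiers(query, **kwargs):
    """Yield result identifiers, swallowing the sporadic total: None crash.

    Illustrative only: results after the failing page are silently lost,
    which is why a library-level retry/backoff fix is preferable.
    """
    try:
        for result in search_items(query, **kwargs):
            yield result['identifier']
    except TypeError:
        # int(None) raised deep inside the pagination code (see the traceback above)
        return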

@hornc (Contributor, Author) commented May 16, 2023

This problem can still occur on v3.2.0. I had been using an older patched version with the fix in this PR daily (without exponential backoff, and without problems), but recently upgraded.

It's a sporadic issue, but I've seen it under v3.2.0. I've now upgraded to v3.5.0. If it occurs again I'll rebase this PR and see about adding a backoff.
