HTTP Error with Volume() #48

Open
melaniewalsh opened this issue Apr 19, 2023 · 6 comments

@melaniewalsh

I'm trying to fetch HathiTrust metadata for books in a spreadsheet via their HathiTrust IDs and Volume()

[screenshot: notebook code calling Volume() on the spreadsheet's HathiTrust IDs]
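The code looks roughly like this (the same function that appears in the traceback further down, lightly tidied so each metadata field maps to one column):

```python
from htrc_features import Volume

def add_metadata_from_hathi(row):
    # Look up one volume by its HathiTrust ID and pull metadata off the Volume object.
    volume = Volume(row["hathi_id"])
    return (volume.title, volume.isbn, volume.oclc, volume.lccn,
            volume.pub_date, volume.pub_place, volume.publisher)

# hathi_df is the spreadsheet loaded into a DataFrame; each hathi_id becomes one Volume() call.
hathi_df[["title", "isbn", "oclc", "lccn", "pub_date", "pub_place", "publisher"]] = (
    hathi_df[["hathi_id"]]
    .apply(add_metadata_from_hathi, axis="columns", result_type="expand")
)
```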

But I'm getting a lot of HTTP errors like the one below, even though this URL does exist and contains HathiTrust data:

ERROR:root:HTTP Error accessing http://data.analytics.hathitrust.org/features-2020.03/mdp/31532/mdp.39015054033520.json.bz2

[screenshot: the notebook output showing repeated HTTP errors]

This issue seems similar to issue #45, but I'm using a Mac, not a Windows computer. Also, this doesn't happen with all HathiTrust IDs, only some of them.

Any thoughts about what might be going wrong?

@bmschmidt
Contributor

Are specific IDs always broken, or just sometimes? That one works fine for me in a Colab notebook:

```python
!pip install htrc-feature-reader

from htrc_features import Volume
v = Volume('mdp.39015054033520')
v.tokenlist()
```

Could you also include the end of the error trace? It's hard to tell from here what the HTTP error code is.

@organisciak
Collaborator

Hmm, odd. I second Ben's question: when you say 'this doesn't happen with all HathiTrust IDs, only some of them', do the failing IDs fail consistently, or will the same ID sometimes fail and sometimes succeed?

That download uses rsync in a subprocess, which is why the error catching is so poor. I suspect the file is failing to download, but Python isn't catching that and is still trying to open the volume.
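Something like this (just a sketch, not library code) would show whether a given ID fails consistently or only intermittently:

```python
import time
from htrc_features import Volume

# IDs taken from this thread; swap in whichever ones fail for you.
problem_ids = ["mdp.39015018932429", "mdp.39015054033520"]

for htid in problem_ids:
    outcomes = []
    for attempt in range(2):
        try:
            Volume(htid)                      # constructing the Volume triggers the download/parse
            outcomes.append("ok")
        except Exception as err:              # the 404 surfaces here as a chained exception
            outcomes.append(type(err).__name__)
        time.sleep(1)                         # be gentle with the server between attempts
    print(htid, outcomes)
```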

By the way, if you're just loading metadata, there are also the Hathifiles (https://www.hathitrust.org/hathifiles) and the HathiTrust Bib API (https://www.hathitrust.org/bib_api).
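A minimal sketch of a Bib API lookup, in case that route is useful (the URL pattern and field names here are my reading of the Bib API docs linked above, not something verified in this thread, so double-check them there):

```python
import requests

def bib_api_brief(htid):
    # Brief catalog record for a single HathiTrust ID via the Bib API.
    url = f"https://catalog.hathitrust.org/api/volumes/brief/htid/{htid}.json"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    # 'records' is keyed by catalog record number; take the first record, if any.
    record = next(iter(data.get("records", {}).values()), {})
    return {
        "titles": record.get("titles", []),
        "isbns": record.get("isbns", []),
        "oclcs": record.get("oclcs", []),
        "lccns": record.get("lccns", []),
    }

print(bib_api_brief("mdp.39015054033520"))
```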

@bmschmidt
Contributor

Oh yeah, that's right, this is kind of a painful way to get metadata. There is some data that Hathi only distributes through here and not through the Hathifiles (e.g. LC classification) -- @melaniewalsh, send me an e-mail if this is what you're looking for; I believe I have some stuff about parsing this sitting in my e-mail somewhere.

@melaniewalsh
Author

melaniewalsh commented Apr 19, 2023

Thanks @bmschmidt @organisciak! It's good to know about the HathiTrust Bib API.

There are a few reasons that I'm trying to get metadata from the Hathi IDs. We specifically included Hathi IDs with all book data in the Post45 Data Collective (e.g. NYT bestsellers) to enable people to work with the full texts/bags of words in HathiTrust. But I recently realized that the Hathi IDs are basically also our only consistent unique identifier for books, so now I'm trying to retroactively add ISBN and OCLC numbers, so we can make the datasets interoperable with other data about the same books. Similarly, I want to add ISBN/OCLC numbers to some of the Hathi derived datasets, like the Geographic Locations data, to make them interoperable with data like the Seattle Public Library's collection or circulation data.

Anyway, that's a long-winded way of saying that the HathiTrust Bib API sounds like it might be better for my metadata needs. But I would still like to create some notebooks and resources that demonstrate how you can take the Post45 Data Collective data and connect it with HathiTrust text data.

I'm including the full error message below (it's long). I'm calling Volume() on about 5,000 rows in a spreadsheet by applying a function to a column (I also tried looping through the data with Volume()), so I was wondering if the requests are happening too quickly and the timing is the problem?

Error message 👇
```
Traceback (most recent call last):
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py", line 73, in open
    fout = super().open(id, **kwargs)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py", line 121, in open
    uncompressed = self._open(id = id, suffix = suffix, mode = mode, format = format,
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py", line 203, in _open
    return Path(dir, filename).open(mode = mode)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/pathlib.py", line 1221, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/pathlib.py", line 1077, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/t1/1xbnlp5j163cd9mt_ht253cw0000gp/T/mdp.39015018932429.json.bz2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py", line 182, in _open
    byt = _urlopen(path_or_url).read()
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py in open(self, id, fallback_kwargs, **kwargs)
     72             try:
---> 73                 fout = super().open(id, **kwargs)
     74                 logging.debug("Successfully returning from cache")

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in open(self, id, suffix, format, mode, skip_compression, compression, **kwargs)
    120 
--> 121         uncompressed = self._open(id = id, suffix = suffix, mode = mode, format = format,
    122                                   compression=compression,  **kwargs)

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in _open(self, id, format, mode, compression, dir, suffix, **kwargs)
    202         filename = self.fname(id, format = format, compression = compression, suffix = suffix)
--> 203         return Path(dir, filename).open(mode = mode)
    204 

~/opt/anaconda3/lib/python3.8/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
   1220             self._raise_closed()
-> 1221         return io.open(self, mode, buffering, encoding, errors, newline,
   1222                        opener=self._opener)

~/opt/anaconda3/lib/python3.8/pathlib.py in _opener(self, name, flags, mode)
   1076         # A stub for the opener argument to built-in open()
-> 1077         return self._accessor.open(self, flags, mode)
   1078 

FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/t1/1xbnlp5j163cd9mt_ht253cw0000gp/T/mdp.39015018932429.json.bz2'

During handling of the above exception, another exception occurred:

HTTPError                                 Traceback (most recent call last)
<ipython-input-117-7706fdd79865> in <module>
----> 1 hathi_df[["title", "isbn", "oclc", "lccn", "pub_date", "title", "pub_place", "publisher"]] = hathi_df[["hathi_id"]].apply(add_metadata_from_hathi, axis = "columns", result_type = "expand" )

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   8734             kwargs=kwargs,
   8735         )
-> 8736         return op.apply()
   8737 
   8738     def applymap(

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply(self)
    686             return self.apply_raw()
    687 
--> 688         return self.apply_standard()
    689 
    690     def agg(self):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply_standard(self)
    810 
    811     def apply_standard(self):
--> 812         results, res_index = self.apply_series_generator()
    813 
    814         # wrap results

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply_series_generator(self)
    826             for i, v in enumerate(series_gen):
    827                 # ignore SettingWithCopy here in case the user mutates
--> 828                 results[i] = self.f(v)
    829                 if isinstance(results[i], ABCSeries):
    830                     # If we have a view on v, we need to make a copy because

<ipython-input-116-e0bd80aa6927> in add_metadata_from_hathi(row)
      3     input_id = row["hathi_id"]
      4 
----> 5     volume = Volume(input_id)
      6 
      7     return volume.title, volume.isbn, volume.oclc, volume.lccn, volume.pub_date, volume.title, volume.pub_place, volume.publisher

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in __init__(self, id, format, id_resolver, default_page_section, path, compression, dir, file_handler, **kwargs)
    462                                 "but requested {} files".format(id_resolver.format, format))
    463 
--> 464         self.parser = retrieve_parser(id, format, id_resolver, compression, dir, 
    465                                       file_handler=file_handler, **kwargs)
    466 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in retrieve_parser(id, format, id_resolver, compression, dir, file_handler, **kwargs)
    338         raise NotImplementedError("Must pass a format. Currently 'json' and 'parquet' are supported.")
    339 
--> 340     return Handler(id, id_resolver = id_resolver, dir = dir,
    341                    compression = compression, **kwargs)
    342 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, compression, **kwargs)
    188 
    189         # parsing and reading are called here.
--> 190         super().__init__(id, id_resolver = id_resolver, compression = compression, **kwargs)
    191 
    192     def parse(self, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, mode, **kwargs)
     77             return
     78 
---> 79         self.parse(**kwargs)
     80 
     81     def __init_resolver(self, id_resolver, format=None, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in parse(self, **kwargs)
    192     def parse(self, **kwargs):
    193 
--> 194         obj = self._parse_json()
    195 
    196         self._schema = obj['features']['schemaVersion']

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in _parse_json(self, compression, **kwargs)
    289             if not k in kwargs:
    290                 kwargs[k] = self.args[k]
--> 291         with resolver.open(id, **kwargs) as fin:
    292             rawjson = fin.read()
    293 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py in open(self, id, fallback_kwargs, **kwargs)
     82                     return input.parser.open(id, **fallback_kwargs)
     83                 else:
---> 84                     copy_between_resolvers(id, self.fallback, self.super)
     85                     fout = super().open(id, **kwargs)
     86                     logging.debug("Successfully returning from cache")

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py in copy_between_resolvers(id, resolver1, resolver2)
      8 def copy_between_resolvers(id, resolver1, resolver2):
      9 #    print (resolver1, "--->", resolver2)
---> 10     input = Volume(id, id_resolver=resolver1)
     11     output = Volume(id, id_resolver=resolver2, mode = 'wb')
     12     output.write(input)

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in __init__(self, id, format, id_resolver, default_page_section, path, compression, dir, file_handler, **kwargs)
    462                                 "but requested {} files".format(id_resolver.format, format))
    463 
--> 464         self.parser = retrieve_parser(id, format, id_resolver, compression, dir, 
    465                                       file_handler=file_handler, **kwargs)
    466 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in retrieve_parser(id, format, id_resolver, compression, dir, file_handler, **kwargs)
    338         raise NotImplementedError("Must pass a format. Currently 'json' and 'parquet' are supported.")
    339 
--> 340     return Handler(id, id_resolver = id_resolver, dir = dir,
    341                    compression = compression, **kwargs)
    342 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, compression, **kwargs)
    188 
    189         # parsing and reading are called here.
--> 190         super().__init__(id, id_resolver = id_resolver, compression = compression, **kwargs)
    191 
    192     def parse(self, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, mode, **kwargs)
     77             return
     78 
---> 79         self.parse(**kwargs)
     80 
     81     def __init_resolver(self, id_resolver, format=None, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in parse(self, **kwargs)
    192     def parse(self, **kwargs):
    193 
--> 194         obj = self._parse_json()
    195 
    196         self._schema = obj['features']['schemaVersion']

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in _parse_json(self, compression, **kwargs)
    289             if not k in kwargs:
    290                 kwargs[k] = self.args[k]
--> 291         with resolver.open(id, **kwargs) as fin:
    292             rawjson = fin.read()
    293 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in open(self, id, suffix, format, mode, skip_compression, compression, **kwargs)
    119             skip_compression = True
    120 
--> 121         uncompressed = self._open(id = id, suffix = suffix, mode = mode, format = format,
    122                                   compression=compression,  **kwargs)
    123 

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in _open(self, id, mode, compression, **kwargs)
    180 
    181         try:
--> 182             byt = _urlopen(path_or_url).read()
    183             req = BytesIO(byt)
    184         except HTTPError:

~/opt/anaconda3/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

~/opt/anaconda3/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

~/opt/anaconda3/lib/python3.8/urllib/request.py in http_response(self, request, response)
    638         # request was successfully received, understood, and accepted.
    639         if not (200 <= code < 300):
--> 640             response = self.parent.error(
    641                 'http', request, response, code, msg, hdrs)
    642 

~/opt/anaconda3/lib/python3.8/urllib/request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

~/opt/anaconda3/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    500         for handler in handlers:
    501             func = getattr(handler, meth_name)
--> 502             result = func(*args)
    503             if result is not None:
    504                 return result

~/opt/anaconda3/lib/python3.8/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found
```

@bmschmidt
Contributor

For adding ISBN/OCLC/LCCN identifiers I would probably use the Hathifiles. You can just download the file and parse the data in; the Bib API can be slow, IIRC. The Hathifiles include ISBN, OCLC, and LCCN columns.
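Something along these lines, for example (a sketch under a couple of assumptions: the downloaded Hathifile is gzipped, tab-separated, and headerless, and htid/oclc_num/isbn/lccn sit at the column positions shown; check the column list on the Hathifiles page before trusting either):

```python
import pandas as pd

# "hathi_full_20230401.txt.gz" is a hypothetical filename; use whichever
# monthly Hathifile you downloaded from https://www.hathitrust.org/hathifiles.
hathifile = pd.read_csv(
    "hathi_full_20230401.txt.gz",
    sep="\t", header=None, dtype=str,
    usecols=[0, 7, 8, 10],   # assumed positions of htid, oclc_num, isbn, lccn
)
hathifile.columns = ["htid", "oclc_num", "isbn", "lccn"]

# hathi_df is the ~5,000-row spreadsheet with a "hathi_id" column.
merged = hathi_df.merge(hathifile, left_on="hathi_id", right_on="htid", how="left")
```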

But 5k isn't that many, so the Bib API would also be fine.

I'd also just write ht-help -- I don't know if anyone there monitors this repo, but when I've had this kind of issue it has tended to be because some of their servers are on the blink -- I think there's load balancing across several servers or something like that.

@melaniewalsh
Author

melaniewalsh commented Apr 20, 2023

Thanks @bmschmidt. That's a good call about reaching out to ht-help (edit: I'm not actually getting the same error with the Bib API — I'm getting a different error). But I will try out the Hathifiles — thanks for the tip!
