
Access denied via Volume() #45

Open
Ori-Pixel opened this issue Feb 24, 2022 · 6 comments


@Ori-Pixel

Ori-Pixel commented Feb 24, 2022

from htrc_features import Volume

print(Volume('mdp.39015028036104').tokenlist())

ERROR:root:HTTP Error accessing http://data.analytics.hathitrust.org/features-2020.03/mdp\31230\mdp.39015028036104.json.bz2

I can open the URL in a browser, but it shows a 'no online access' page. I can download the JSON file, but I cannot load it with Volume().

If I try path = r""

I get: OSError: Invalid data stream

@bmschmidt
Contributor

Thank you for the clear report. What is your system setup (especially OS and character encoding)? This code works fine on my Mac and on a clean Google Colab instance, so I suspect it must be some kind of system-specific escaping problem.

@bmschmidt
Contributor

Also, is this a problem with all htids, or just the one here?

Given the backslashes where there should be slashes in that URL, it seems possible that this is pathlib overcompensating for being on Windows, but I'm not sure why we wouldn't have seen this before.

"mdp\31230\mdp"

@Ori-Pixel
Author

Ori-Pixel commented Feb 24, 2022

Setup:

Windows 10 Pro
IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Western European (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252

The first/top error when passing only the ID, as in print(Volume('mdp.39015029970129').tokenlist()):

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\<name>\\AppData\\Local\\Temp\\mdp.39015029970129.json.bz2'

It doesn't work for any IDs for me. It was working on Mac/Manjaro.

@bmschmidt
Contributor

Thanks. Let me tag in @organisciak since I think he develops this on Windows.

@Ori-Pixel
Author

@bmschmidt The error seems to only exist in 2.0.7. If you downgrade the package to 2.0.6, it works.

@bmschmidt
Contributor

Thanks for looking into this further. It looks like 2.0.7 is a version number that only appears on the MassiveTexts branch. My best guess is that this problem was introduced here. It calls os.path.join here to build a URL, which leaves the slashes facing the wrong way on Windows. The most elegant fix would probably be to switch from strings to pathlib.Path instances for the return values there, but I'm not sure if it would work.

@organisciak, do you still have a Windows machine to fix/test this? I've proposed massivetexts#29 as a really stupid way to fix what appears to be the problem, but I don't know if it's a full fix.
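
For anyone who wants to reproduce the suspected cause without a Windows machine, here is a minimal sketch. It is not the library's actual code: `BASE_URL` and `parts` are illustrative values copied from the URL in the error message, and `ntpath` stands in for what `os.path` resolves to on Windows. It also shows that a pathlib object only yields forward slashes on Windows when `as_posix()` is called, which may matter for the pathlib idea.

```python
import ntpath      # the module os.path resolves to on Windows
import posixpath   # the module os.path resolves to on Linux/macOS
from pathlib import PureWindowsPath

# Illustrative values only, copied from the URL in the error message.
BASE_URL = "http://data.analytics.hathitrust.org/features-2020.03/"
parts = ("mdp", "31230", "mdp.39015028036104.json.bz2")

# What os.path.join produces on Windows: backslash separators.
windows_style = ntpath.join(*parts)

# posixpath.join always uses forward slashes, whatever the host OS.
url_safe = posixpath.join(*parts)

# A Windows path object renders with backslashes via str(),
# and with forward slashes only via as_posix().
win_path = PureWindowsPath(*parts)
via_pathlib = win_path.as_posix()

print(BASE_URL + windows_style)  # malformed URL with backslashes
print(BASE_URL + url_safe)       # well-formed URL
print(BASE_URL + via_pathlib)    # well-formed URL
```

If this is indeed the cause, building the URL portion with `posixpath.join` (or plain string joining on "/") and keeping `os.path.join` for local file paths would be one way to separate the two cases.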
