feature: download only if content-type matches #9

Open
blahah opened this issue Aug 8, 2015 · 0 comments


blahah commented Aug 8, 2015

A common problem when scraping scientific journal articles arises when access to the PDF or another file download is not granted, but the server still returns a 200 OK status and sends an HTML document telling the user they don't have access. In this case, a scraperJSON client simply downloads the HTML page and may rename it to the user's specified filename, leaving an HTML document mislabelled as some other filetype.

A solution is to allow a download to specify one or more permitted content-types, or perhaps a regex that the content-type must match. If the content-type does not match, the download is skipped.
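
As a rough illustration of the matching rule, here is a minimal Python sketch. The function names and the `allowed` / `pattern` parameters are hypothetical and not part of scraperJSON; it assumes the check runs against the bare media type, with parameters such as `; charset=utf-8` stripped first.

```python
import re
from typing import Iterable, Optional


def media_type(header_value: str) -> str:
    """Return the bare media type, e.g. 'text/html' from 'text/html; charset=utf-8'."""
    return header_value.split(";", 1)[0].strip().lower()


def content_type_permitted(
    header_value: str,
    allowed: Iterable[str] = (),
    pattern: Optional[str] = None,
) -> bool:
    """Hypothetical check: permit if the media type is in the allowed list
    or matches the optional regex."""
    mt = media_type(header_value)
    if mt in {a.lower() for a in allowed}:
        return True
    return pattern is not None and re.search(pattern, mt) is not None
```

With this sketch, `content_type_permitted("text/html; charset=utf-8", allowed=["application/pdf"])` is false, so the HTML "no access" page would be skipped rather than saved under a PDF filename.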

The client would implement this by first performing a HEAD request to the download URL, then evaluating the Content-Type HTTP header, and then deciding whether to proceed with the full download.
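
The HEAD-then-decide flow might look something like the following Python sketch, which uses only the standard library. The names `head_content_type` and `should_download` are assumptions made for illustration, not an existing client API, and the comparison here is a plain exact match on the media type.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def head_content_type(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the bare media type, or None if the header is absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.headers.get("Content-Type") is None:
            return None
        # get_content_type() drops parameters such as "; charset=utf-8" and lowercases.
        return response.headers.get_content_type()


def should_download(
    url: str,
    allowed: Iterable[str],
    fetch: Callable[[str], Optional[str]] = head_content_type,
) -> bool:
    """Hypothetical decision step: proceed only if the reported type is permitted."""
    content_type = fetch(url)
    if content_type is None:
        return False  # assumption: treat a missing header as a mismatch
    return content_type in {a.lower() for a in allowed}
```

The `fetch` parameter exists only so the decision logic can be exercised without a network. A real client would probably also need a fallback, since some servers may reject HEAD requests or report different headers than they would for a GET.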
