feature: download only if content-type matches #9

blahah · 2015-08-08T04:24:33Z

A common problem when scraping scientific journal articles arises when access to the PDF or another file download is not granted, yet the server returns a 200 OK status and sends an HTML document telling the user they don't have access. In this case, a scraperJSON client will simply download the HTML page and may rename it to the user's specified filename, leaving an HTML document mislabelled as some other filetype.

A solution is to allow a download to specify one or more permitted content-types, or perhaps a regex that the content-type must match. If the content-type does not match, the download is skipped.
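To make the matching rule concrete, here is a minimal Python sketch of how a client might check a Content-Type value against either a list of permitted types or a regex. The function name `content_type_allowed` and its parameters `allowed_types` and `pattern` are hypothetical; nothing here is part of the scraperJSON spec.

```python
import re


def content_type_allowed(content_type, allowed_types=None, pattern=None):
    """Return True if a Content-Type header value passes the filter.

    `allowed_types` (list of media types) and `pattern` (regex string)
    are illustrative names, not scraperJSON keys.
    """
    # No filter configured: accept everything, as clients do today.
    if allowed_types is None and pattern is None:
        return True
    # A missing header cannot satisfy a configured filter.
    if content_type is None:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if allowed_types is not None:
        if media_type in {t.lower() for t in allowed_types}:
            return True
    if pattern is not None:
        if re.search(pattern, media_type, re.IGNORECASE):
            return True
    return False
```

With this sketch, `content_type_allowed("text/html; charset=utf-8", ["application/pdf"])` is false, so the "no access" HTML page from the scenario above would be skipped rather than saved under a PDF filename.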

The client would implement this by first performing a HEAD request to the download URL, then evaluating the Content-Type HTTP header, and then deciding whether to proceed with the full download.
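The HEAD-then-decide flow could look roughly like the following Python sketch, which uses only the standard library. The names `fetch_content_type` and `should_download` are hypothetical, and the `head` parameter exists only so the decision logic can be exercised without a network. Some servers reject HEAD requests or report a different Content-Type than they would for GET, so a real client would likely need a fallback.

```python
import urllib.request


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def should_download(url, allowed_types, head=fetch_content_type):
    """Decide whether to proceed to the full download.

    `head` is any callable mapping a URL to a Content-Type string;
    it defaults to a real HEAD request.
    """
    content_type = head(url)
    if content_type is None:
        # No header to evaluate, so treat it as a mismatch and skip.
        return False
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in {t.lower() for t in allowed_types}
```

A client would call `should_download(url, ["application/pdf"])` before fetching the body, and skip the download when it returns false.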
