Check why certain pages were filtered out by the blacklist #24

Open
jogli5er opened this issue May 31, 2018 · 2 comments

Comments

@jogli5er

The latest counts show that we filter out a large share of the content, either because it has a "wrong" MIME type (we should not even download those, see #23) or because the parser finds something that resembles base64. We have to go through a few of those pages to see whether the parser works correctly or whether there is a bug.

@jogli5er jogli5er added the bug Something isn't working label May 31, 2018
@jogli5er jogli5er self-assigned this May 31, 2018

jogli5er commented Jun 4, 2018

So, we found the issue: the filter also matches data URLs (URLs that carry the data inline). Such URLs often seem to hold tiny image elements. We therefore suggest switching to removing only the image data elements. This way, we would hold the images in volatile memory for a very short time, but without giving anybody access to them. Such behaviour is similar to that of a node routing traffic on the internet: it has to store the content briefly, but it cannot be held responsible for any illegal content it may forward.
Furthermore, we can whitelist things such as SVG images.
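
A minimal sketch of what "just removing image data elements" could look like, assuming the page is available as an HTML string. The names `ALLOWED_IMAGE_TYPES`, `DATA_IMAGE_URL` and `strip_image_data_urls` are hypothetical and not taken from the project's code, and the regular expression is only an approximation of the data URL syntax:

```python
import re

# Hypothetical whitelist: SVG is kept, as suggested above.
ALLOWED_IMAGE_TYPES = {"image/svg+xml"}

# Approximate pattern for image data URLs such as data:image/png;base64,....
# Group 1 captures the MIME type; the payload is assumed to end at a quote,
# whitespace, closing parenthesis or angle bracket.
DATA_IMAGE_URL = re.compile(
    r"data:(image/[a-z0-9.+-]+)[^,\"'\s)<>]*,[^\"'\s)<>]*",
    re.IGNORECASE,
)


def strip_image_data_urls(html: str) -> str:
    """Drop inline image data URLs, keeping whitelisted types untouched."""

    def replace(match: re.Match) -> str:
        mime = match.group(1).lower()
        if mime in ALLOWED_IMAGE_TYPES:
            return match.group(0)  # whitelisted, leave as is
        return ""  # remove the image data, keep the rest of the page

    return DATA_IMAGE_URL.sub(replace, html)
```

With something like this, a page would no longer be rejected as a whole just because it embeds a small inline image; only the image payload is dropped before the content is stored. Non-image data URLs are left alone here and would still be subject to the existing filter.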

@dionyziz

dionyziz commented Jun 4, 2018 via email
