The latest counts showed that we filter out a large part of the contents, either because they have a "wrong" mimetype (we should not even download those, see #23) or because the parser finds something that resembles base64. We have to crawl through a few of those pages to see whether the parser works correctly or whether there is some sort of bug.
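For reference, here is a minimal sketch of what "resembles base64" could mean in practice. The actual detection logic in our parser may differ; the function name and the length threshold below are made-up examples:

```python
import re

# Heuristic: a long unbroken run of base64-alphabet characters,
# optionally padded with "=". The 200-character threshold is an
# illustrative guess, not the parser's real setting.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{200,}={0,2}")

def looks_like_base64(text: str) -> bool:
    """Return True if the text contains a long base64-like run."""
    return bool(BASE64_RUN.search(text))
```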
So, we found the issue: we also filter data URLs (URLs containing inline data). Such URLs often seem to include tiny image elements. We therefore suggest switching to just removing image data elements, as sketched below. This way, we would hold the images in volatile memory for a very short time, but without giving anybody access to them. This behaviour is similar to a node on the internet routing traffic: it does have to store the content briefly, but it cannot be held responsible for any illegal content it may forward.
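A rough sketch of what that removal could look like, assuming an HTML pipeline with BeautifulSoup (our actual parser may work differently; `strip_data_url_images` is just an illustrative name):

```python
from bs4 import BeautifulSoup

def strip_data_url_images(html: str) -> str:
    """Drop <img> elements whose src is a data: URL before the page
    is stored; the image data only lives in memory during parsing."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        if img.get("src", "").startswith("data:"):
            img.decompose()  # remove the element entirely
    return str(soup)
```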
Further, we could whitelist certain formats, such as SVG images; see the sketch below.
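Something along these lines, extending the sketch above (the allow-list contents are just an example and up for discussion):

```python
# Illustrative allow-list: keep data: URLs with these mimetype prefixes.
WHITELISTED_DATA_PREFIXES = ("data:image/svg+xml",)

def is_whitelisted_data_url(src: str) -> bool:
    """Keep data: URLs whose mimetype is on the allow-list (e.g. SVG)."""
    return src.startswith(WHITELISTED_DATA_PREFIXES)
```

In `strip_data_url_images`, the removal condition would then become `src.startswith("data:") and not is_whitelisted_data_url(src)`.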