Reader doesn't extract any content from this page even though it's quite simple? #105

oscar-o-oneill · 2024-08-16T20:04:23Z

Hi, I love reader! It's so useful. I am playing around with it, and I noticed it isn't able to extract any content from this URL.

https://www.canada.ca/en/women-gender-equality/gender-based-violence/gender-based-violence-glossary.html

On navigating to the reader page for it, I just get this response:

Title: 

URL Source: https://www.canada.ca/en/women-gender-equality/gender-based-violence/gender-based-violence-glossary.html

Markdown Content:

What's going on? It's a fairly simple page.
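For anyone trying to spot this failure programmatically, here is a minimal sketch that parses the plain-text response shown above and flags an empty extraction. It assumes the response always uses the `Title:`, `URL Source:` and `Markdown Content:` labels seen in this report; the function names `parse_reader_response` and `is_empty_extraction` are hypothetical helpers, not part of Reader.

```python
from typing import Dict


def parse_reader_response(text: str) -> Dict[str, str]:
    """Split a Reader plain-text response into title, source URL and markdown.

    Assumes the 'Title:', 'URL Source:' and 'Markdown Content:' labels
    seen in the report above; other response layouts are not handled.
    """
    fields = {"title": "", "url_source": "", "markdown": ""}
    # Everything after the 'Markdown Content:' label is the extracted body.
    head, label, body = text.partition("Markdown Content:")
    for line in head.splitlines():
        if line.startswith("Title:"):
            fields["title"] = line[len("Title:"):].strip()
        elif line.startswith("URL Source:"):
            fields["url_source"] = line[len("URL Source:"):].strip()
    if label:
        fields["markdown"] = body.strip()
    return fields


def is_empty_extraction(text: str) -> bool:
    """Return True when both the title and the markdown body are blank."""
    parsed = parse_reader_response(text)
    return not parsed["title"] and not parsed["markdown"]
```

Fed the response from this report, `is_empty_extraction` returns `True`, since both the title and the markdown body are blank.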

hanxiao · 2024-08-16T21:08:33Z

okay this is weird, i get the same empty result; however if i use pageshot mode it does return the full webpage

could u look at it? @nomagick
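A rough sketch of how the two modes could be requested side by side. It assumes the public `https://r.jina.ai/` prefix and that pageshot mode is selected with an `X-Respond-With: pageshot` request header; the header name is an assumption here and may differ between Reader versions, so check the Reader README before relying on it. `build_reader_request` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Assumption: the hosted Reader endpoint is used by prefixing the target URL.
READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(
    target_url: str, respond_with: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return the (url, headers) pair for a Reader request.

    `respond_with` is sent as an `X-Respond-With` header (assumed name);
    leaving it as None requests the default markdown output.
    """
    headers: Dict[str, str] = {}
    if respond_with is not None:
        headers["X-Respond-With"] = respond_with
    return READER_PREFIX + target_url, headers


# Usage (performs a network call, so it is left commented out):
# import urllib.request
# url, headers = build_reader_request(page, respond_with="pageshot")
# with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as r:
#     body = r.read()
```

Comparing the default response with the pageshot response for the same URL is a quick way to tell an extraction problem from a page that failed to render at all.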

oscar-o-oneill · 2024-08-16T21:48:00Z

Thanks, @hanxiao. Just wanted to bring this to your attention! I will keep following the thread and help out if I can.

mapleeit · 2024-08-19T09:13:57Z

Hi @oscar-o-oneill, did you have the same issue on other pages?

It seems there is some trick in this specific webpage that makes the browser treat the page as not fully loaded until it hits the timeout, which is 30s by default in this case. I'm still trying to identify what in the page causes this.

It would be helpful if you have more failing cases, so that I can find the common pattern.
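To make the described behavior concrete, here is a toy illustration (not Reader's actual code) of waiting for a "page loaded" signal with a deadline. When the signal never fires, the wait only ends at the timeout, which matches the 30s delay mentioned above. The names `wait_for_page_load` and `simulate` are hypothetical.

```python
import asyncio
from typing import Optional


async def wait_for_page_load(load_event: asyncio.Event, timeout_s: float = 30.0) -> str:
    """Wait for the load signal, giving up after `timeout_s` seconds."""
    try:
        await asyncio.wait_for(load_event.wait(), timeout=timeout_s)
        return "loaded"
    except asyncio.TimeoutError:
        # The page never reported itself as loaded; the caller has to work
        # with whatever state the page is in at this point.
        return "timed_out"


async def simulate(load_fires_after_s: Optional[float], timeout_s: float) -> str:
    """Simulate a page whose load signal fires after a delay, or never (None)."""
    load_event = asyncio.Event()
    if load_fires_after_s is not None:
        asyncio.get_running_loop().call_later(load_fires_after_s, load_event.set)
    return await wait_for_page_load(load_event, timeout_s)
```

A normal page corresponds to `simulate(0.01, 1.0)`, which finishes as soon as the signal fires; the page in this issue behaves like `simulate(None, ...)`, where the full timeout always elapses.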

oscar-o-oneill · 2024-08-20T18:17:31Z

Hi @mapleeit, no, I have not found this issue on many other pages. Reader usually works really well!

I will definitely report any issues I may find with other web pages in the future.

Thank you for making Jina AI Reader.

nomagick · 2024-08-27T07:47:45Z

It looks like some kind of bot-prevention mechanism from "edgesuite". It seems to replace the DOM contents within a fraction of a second, which makes Reader capture its warning messages instead of the page content.

mapleeit self-assigned this Aug 19, 2024
