Reader doesn't extract any content from this page even though it's quite simple? #105

Open
oscar-o-oneill opened this issue Aug 16, 2024 · 5 comments

@oscar-o-oneill

Hi, I love reader! It's so useful. I am playing around with it, and I noticed it isn't able to extract any content from this URL.

https://www.canada.ca/en/women-gender-equality/gender-based-violence/gender-based-violence-glossary.html

On navigating to the reader page for it, I just get this response:

Title: 

URL Source: https://www.canada.ca/en/women-gender-equality/gender-based-violence/gender-based-violence-glossary.html

Markdown Content:

What's going on? It's a fairly simple page.
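
For anyone who wants to detect this empty result automatically, here is a minimal sketch. It assumes the plain-text layout shown above (`Title:`, `URL Source:`, `Markdown Content:`). The helper names `parse_reader_response` and `is_empty_extraction` are hypothetical and not part of Reader.

```python
def parse_reader_response(text: str) -> dict:
    """Split a Reader-style plain-text response into title, url, and content."""
    result = {"title": "", "url": "", "content": ""}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("Title:"):
            result["title"] = line[len("Title:"):].strip()
        elif line.startswith("URL Source:"):
            result["url"] = line[len("URL Source:"):].strip()
        elif line.startswith("Markdown Content:"):
            # Everything after this marker is treated as the extracted body.
            rest = [line[len("Markdown Content:"):]] + lines[i + 1:]
            result["content"] = "\n".join(rest).strip()
            break
    return result


def is_empty_extraction(parsed: dict) -> bool:
    """True when neither a title nor any body text was extracted."""
    return not parsed["title"] and not parsed["content"]
```

Running `is_empty_extraction(parse_reader_response(text))` on the response quoted above would flag it as empty, since both the title and the Markdown body are blank.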

@hanxiao
Member

hanxiao commented Aug 16, 2024

okay this is weird, i get the same empty result; however if i use pageshot mode it does return the full webpage

could u look at it? @nomagick
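
To compare the default mode with pageshot mode, a request could be built roughly like this. This is a sketch under assumptions: the `https://r.jina.ai/` prefix and the `X-Respond-With` header name are my understanding of how Reader is called and are not stated in this thread, and `build_reader_request` is a hypothetical helper. The function only builds the URL and headers; it sends nothing.

```python
from urllib.parse import urlsplit

READER_PREFIX = "https://r.jina.ai/"  # assumed public Reader endpoint


def build_reader_request(target_url, respond_with=None):
    """Return (request_url, headers) for a Reader call without sending it."""
    if urlsplit(target_url).scheme not in ("http", "https"):
        raise ValueError("target_url must be an absolute http(s) URL")
    headers = {}
    if respond_with is not None:
        # Assumed header name; "pageshot" is the mode mentioned above.
        headers["X-Respond-With"] = respond_with
    return READER_PREFIX + target_url, headers
```

The returned pair can be handed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.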

@oscar-o-oneill
Author

Thanks, @hanxiao. Just wanted to bring this to your attention! I will keep following the thread and help out if I can.

@mapleeit
Contributor

Hi @oscar-o-oneill, did you see the same issue on other pages?

It seems that something in this specific page makes the browser treat it as not fully loaded until it hits the timeout, which is 30s by default in this case. I'm still trying to identify what in the page causes this.

It would be helpful if you have more failing cases, so that I can find a common pattern.
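
To illustrate the behavior described above, here is a generic sketch of "wait for the load signal, but fall back to whatever is available when the timeout hits". It is not Reader's actual code. `wait_for_load_or_snapshot`, `load_event`, and `snapshot` are hypothetical names, and the `asyncio.Event` stands in for the browser's load signal.

```python
import asyncio


async def wait_for_load_or_snapshot(load_event, snapshot, timeout_s):
    """Wait for load_event up to timeout_s, then return (snapshot(), timed_out)."""
    try:
        await asyncio.wait_for(load_event.wait(), timeout=timeout_s)
        timed_out = False
    except asyncio.TimeoutError:
        # The page never signaled "loaded"; use the current state anyway.
        timed_out = True
    return snapshot(), timed_out
```

With a page that never fires its load signal, every request would spend the full timeout before returning, which matches the 30s delay observed here.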

mapleeit self-assigned this Aug 19, 2024

@oscar-o-oneill
Author

Hi @mapleeit, no, I have not found this issue on many other pages. Reader usually works really well!

I will definitely report any issues I may find with other web pages in the future.

Thank you for making Jina AI Reader.

@nomagick
Member

It looks like some kind of bot-prevention mechanism from "edgesuite". It seems to replace the DOM contents within a fraction of a second, which makes Reader capture its warning messages instead of the page.
