Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add content scrape #34

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

Areskiko
Copy link

Adds the ability to scrape the entry link for additional content, using the crate developed by News Flash.

Unfortunately there is no sync version of it, and it demands the usage of reqwest, so some additional dependencies have had to be added.

Currently pressing s while focused on an entry with a link will cause the content of the entry to be temporarily altered with the content scraped from the source.

Adds the ability to scrape the source of an entry for additional content.
Uses the crate developed by news_flash to do so.
@Areskiko
Copy link
Author

If there is a large enough interest, I could look into rewriting the article_scraper crate, to avoid the extra dependencies

@ckampfe ckampfe mentioned this pull request Jun 1, 2024
1 task
@ckampfe
Copy link
Owner

ckampfe commented Jun 1, 2024

Thank you for this contribution @Areskiko. I appreciate it. This is definitely interesting. It's one of the weird quirks of RSS/Atom that sites will sometimes include a brief summary or even nothing at all in the content/description of the feed, forcing us to go to the site directly.

It seems like this a feature that people would appreciate!

There are some quirks to this problem/feature that I want to think about a bit as we investigate this further.

The first is that this article-scraper crate makes use of an external scraping configuration that we (russ) do not control. In terms of thinking about the security and overall safety of russ's users, I'm suspicious of this kind of implicit/transitive dependency on community-contributed configuration. I will have to investigate further exactly how this community contributed scraping configuration interacts both with the article-scraper library and how the article-scraper library itself is managed.

Secondly, and you touched on this in your comments above, article-scraper and reqwest require the use of tokio. This is an interesting issue, as I have played with converting russ to use async in the past, but never really gotten around to it in a way I found satisfactory. In fact, russ actually did use tokio at one point, and I ripped it out because it felt unnecessary for what russ was trying to be. I have nothing against tokio and reqwest in principle and in fact think they are high quality, but we do not currently use them and using 1. an async runtime in a threaded application and 2. using two different HTTP clients in the same application feels a bit weird. Again, this is not a dealbreaker necessarily, but there are a few things to evaluate in this regard: are there other "extractor"-style crates that simple can look at a given string of HTML without needing to pull in a network client? Does it make sense to try to rewrite some/all of russ to use tokio and async? I don't know the answer to these questions right now, but I will work to evaluate them.

Feel free to respond if you like, I'm happy to continue discussing this to move it forward, with the understanding that I don't feel comfortable moving forward with this specific work at this moment in time given my above concerns.

Again, thank you very much for the excellent contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants