Add content scrape #34

Areskiko · 2024-03-16T13:27:52Z

Adds the ability to scrape the entry link for additional content, using the crate developed by News Flash.

Unfortunately, there is no sync version of it, and it requires reqwest, so some additional dependencies had to be added.

Currently, pressing s while focused on an entry with a link temporarily replaces the entry's content with the content scraped from the source.
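A rough sketch of the behavior described above, assuming "temporarily" means the feed-provided content is left untouched and the scraped text is only layered over it for display. This is not the code in this PR: `Entry`, its fields, `displayed`, and `on_scrape_key` are hypothetical names, and the `scrape` closure stands in for whatever actually fetches and extracts the article.

```rust
/// Hypothetical entry model; the field names are illustrative only.
pub struct Entry {
    /// Link to the original article, if the feed provided one.
    pub link: Option<String>,
    /// Content as delivered by the feed; never overwritten here.
    pub content: String,
    /// Scraped content held in memory for display, if any.
    pub scraped: Option<String>,
}

impl Entry {
    /// What the UI should display: scraped content when present,
    /// otherwise the feed's own content.
    pub fn displayed(&self) -> &str {
        self.scraped.as_deref().unwrap_or(&self.content)
    }

    /// Handle the `s` key. `scrape` is a stand-in for the real scraper;
    /// returns true if the displayed content was replaced.
    pub fn on_scrape_key<F>(&mut self, scrape: F) -> bool
    where
        F: FnOnce(&str) -> Option<String>,
    {
        let Some(link) = self.link.as_deref() else {
            return false; // no link, nothing to scrape
        };
        match scrape(link) {
            Some(text) => {
                self.scraped = Some(text);
                true
            }
            None => false, // scraping failed; keep showing feed content
        }
    }
}
```

Keeping the scraped text in a separate `Option` field means a failed scrape or a missing link simply falls back to the feed content, and clearing the field restores the original view.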

Adds the ability to scrape the source of an entry for additional content. Uses the crate developed by news_flash to do so.

Areskiko · 2024-03-18T10:10:50Z

If there is enough interest, I could look into rewriting the article_scraper crate to avoid the extra dependencies.

ckampfe · 2024-06-01T22:45:51Z

Thank you for this contribution @Areskiko. I appreciate it. This is definitely interesting. It's one of the weird quirks of RSS/Atom that sites will sometimes include a brief summary or even nothing at all in the content/description of the feed, forcing us to go to the site directly.

It seems like this is a feature that people would appreciate!

There are some quirks to this problem/feature that I want to think about a bit as we investigate this further.

The first is that this article-scraper crate makes use of an external scraping configuration that we (russ) do not control. In terms of thinking about the security and overall safety of russ's users, I'm suspicious of this kind of implicit/transitive dependency on community-contributed configuration. I will have to investigate further exactly how this community-contributed scraping configuration interacts with the article-scraper library, and how the article-scraper library itself is managed.

Secondly, and you touched on this in your comments above, article-scraper and reqwest require the use of tokio. This is an interesting issue, as I have played with converting russ to use async in the past, but never really got around to it in a way I found satisfactory. In fact, russ actually did use tokio at one point, and I ripped it out because it felt unnecessary for what russ was trying to be. I have nothing against tokio and reqwest in principle, and in fact think they are high quality, but we do not currently use them, and 1. using an async runtime in a threaded application and 2. using two different HTTP clients in the same application feels a bit weird. Again, this is not necessarily a dealbreaker, but there are a few things to evaluate in this regard: are there other "extractor"-style crates that can simply look at a given string of HTML without needing to pull in a network client? Does it make sense to try to rewrite some/all of russ to use tokio and async? I don't know the answer to these questions right now, but I will work to evaluate them.
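To make the "extractor"-style idea concrete, here is a minimal sketch of the shape such an interface could take: it receives HTML that the caller has already fetched with whatever HTTP client the application uses, so it needs neither a network client nor an async runtime. The `Extractor` trait and `StripTags` type are hypothetical names for illustration, not the API of article-scraper or any other crate, and the tag stripping is deliberately naive.

```rust
/// Hypothetical trait: turns an already-fetched HTML string into text.
/// No HTTP client and no async runtime are involved.
pub trait Extractor {
    fn extract(&self, html: &str) -> String;
}

/// Deliberately naive implementation: drops everything between '<' and
/// '>' and collapses whitespace. It does not decode entities, skip
/// <script> bodies, or locate the main article as a real extractor would.
pub struct StripTags;

impl Extractor for StripTags {
    fn extract(&self, html: &str) -> String {
        let mut text = String::with_capacity(html.len());
        let mut in_tag = false;
        for ch in html.chars() {
            match ch {
                '<' => in_tag = true,
                '>' if in_tag => {
                    in_tag = false;
                    // Treat a tag boundary as a word break, e.g. "</p><p>".
                    text.push(' ');
                }
                _ if !in_tag => text.push(ch),
                _ => {} // character inside a tag: skip it
            }
        }
        // Collapse runs of whitespace into single spaces.
        text.split_whitespace().collect::<Vec<_>>().join(" ")
    }
}
```

With this split, the existing synchronous HTTP client would fetch the page and hand the body to `extract`, which keeps the fetching and extraction concerns independent of each other.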

Feel free to respond if you like; I'm happy to continue discussing this to move it forward, with the understanding that I don't feel comfortable moving forward with this specific work at this moment in time given my above concerns.

Again, thank you very much for the excellent contribution.

Add scraping capabilities

f8a0a55

Adds the ability to scrape the source of an entry for additional content. Uses the crate developed by news_flash to do so.

Areskiko force-pushed the content-scrape branch from 512c858 to f8a0a55 Compare March 18, 2024 09:51

ckampfe mentioned this pull request Jun 1, 2024

HTML text extraction #28

Open

1 task
