Handle redirects and gone feeds gracefully #246

Open
lemon24 opened this issue Jul 16, 2021 · 3 comments

Comments

Owner

lemon24 commented Jul 16, 2021

https://feedparser.readthedocs.io/en/latest/http-redirect.html

If you are polling a feed on a regular basis, it is very important to check the status code (d.status) every time you download. If the feed has been permanently redirected, you should update your database or configuration file with the new address (d.href). Repeatedly requesting the original address of a feed that has been permanently redirected is very rude, and may get you banned from the server.

Repeatedly requesting a feed that has been marked as “gone” is very rude, and may get you banned from the server.
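A minimal sketch of the status check described in the quote, assuming only the standard HTTP meanings of 301/308 (permanent redirect) and 410 (gone). `classify_feed_status` and its return labels are hypothetical names for illustration; they are not part of feedparser or reader.

```python
# Hypothetical helper; not part of feedparser or reader.
PERMANENT_REDIRECTS = {301, 308}  # Moved Permanently, Permanent Redirect
GONE = 410


def classify_feed_status(status):
    """Map an HTTP status code to the action a polite poller should take."""
    if status in PERMANENT_REDIRECTS:
        return "update-url"    # store the new address, stop using the old one
    if status == GONE:
        return "stop-polling"  # the feed is gone; do not request it again
    return "keep"              # temporary redirects (302, 307) and everything else
```

With feedparser, the input would presumably be `d.status`, and the new address for the "update-url" case would come from `d.href`, as in the quote above.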

Owner Author

lemon24 commented Jun 17, 2023

Related comment:

# if feed_for_update.url != parsed_feed.feed.url, the feed was redirected.
# TODO: Maybe handle redirects somehow else (e.g. change URL if permanent).

Misc thoughts:

  • Ideally, this should be a plugin.
  • To allow plugins to handle this, we likely need to expose additional info to after_feed_update_hooks – old feed, new feed, status code or its meaning.
  • For redirects, we need the status code of the initial request – an update can have a redirect and succeed.
  • The plugin that changes the URL must run after all other ones that use the old one (e.g. after_entry_update_hooks).
  • Should the UpdateResult/UpdatedFeed returned by update_feeds_iter()/update_feed() have the new or the old URL?
    • Likely the new one.
  • Assuming an after_feed_update_hooks plugin that runs after the one that changes the URL:


zifot commented Jun 18, 2023

Just a thought.

Consider API semantics that let a plugin only mark a feed URL for change. Then, after all of the plugins have run, you check whether any plugin requested a change (and maybe make sure only one did), and make the change itself as part of the processing mechanism that runs outside of the plugins.

This is a subtle change, but that way you can (probably) drop the requirement that such a plugin must run last. It also seems to simplify the issues you mention in the last point, and lets you control whether such a request makes sense in the context of other plugins or other external factors.
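To make the proposal concrete, here is a rough sketch of the "request, then resolve" step, including the "make sure only one did it" check. `UrlChangeRequests` and its methods are hypothetical names for illustration, not an existing reader API.

```python
class UrlChangeRequests:
    """Collects URL change requests made by plugins (hypothetical)."""

    def __init__(self):
        self._requests = {}  # old URL -> set of requested new URLs

    def request(self, old_url, new_url):
        # Plugins call this instead of changing the URL themselves.
        self._requests.setdefault(old_url, set()).add(new_url)

    def resolve(self):
        """Return ({old: new} for unambiguous requests, [old URLs in conflict])."""
        changes, conflicts = {}, []
        for old_url, new_urls in self._requests.items():
            if len(new_urls) == 1:
                changes[old_url] = next(iter(new_urls))
            else:
                # More than one distinct target was requested for this feed.
                conflicts.append(old_url)
        return changes, conflicts
```

The processing mechanism would call `resolve()` once, after all plugins have run, and apply only the unambiguous changes; what to do with conflicts is left open here.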

EDIT: typo

Owner Author

lemon24 commented Jun 18, 2023

@zifot, that's actually a great idea, thank you!

I think it's doable right now with tags:

def after_feed_update(reader, feed, ...):
    # runs for each feed; only marks the feed, does not change it
    new_url = is_permanent_redirect(feed, ...)
    if new_url:
        reader.set_tag(feed, '.url-change-needed', new_url)

def after_feeds_update(reader):
    # runs after all the feeds; applies the changes marked above
    # list() so we do not change feeds while still iterating over them
    for feed in list(reader.get_feeds(tags=['.url-change-needed'])):
        new_url = reader.get_tag(feed, '.url-change-needed')
        # for later: how do we deal with InvalidFeedURLError?
        reader.change_feed_url(feed, new_url)
        # the feed (and its tags) now live under new_url
        reader.delete_tag(new_url, '.url-change-needed')

Note to self: This seems like a very useful pattern, mention it in the docs for plugin authors (when we have them). The way we're handling .reader.dedupe.once for entry_dedupe is vaguely similar (mark, then change).
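For the record, a self-contained sketch of the mark-then-change pattern that can be run without reader installed. `FakeReader` is a tiny in-memory stand-in that only mimics the handful of methods used above (it is not the real API), and the `redirects` lookup is a hypothetical replacement for `is_permanent_redirect()`.

```python
class FakeReader:
    """In-memory stand-in for the reader methods used above (not the real API)."""

    def __init__(self, urls):
        self.tags = {url: {} for url in urls}  # feed URL -> {tag key: value}

    def set_tag(self, url, key, value):
        self.tags[url][key] = value

    def get_tag(self, url, key):
        return self.tags[url][key]

    def delete_tag(self, url, key):
        del self.tags[url][key]

    def get_feeds(self, tags):
        # Returns a new list, so callers may change feeds while looping over it.
        return [url for url, t in self.tags.items() if all(k in t for k in tags)]

    def change_feed_url(self, old, new):
        self.tags[new] = self.tags.pop(old)  # tags move with the feed


MARK = '.url-change-needed'


def after_feed_update(reader, url, redirects):
    # Mark only; `redirects` is a hypothetical {old URL: new URL} lookup.
    if url in redirects:
        reader.set_tag(url, MARK, redirects[url])


def after_feeds_update(reader):
    # Change: runs once, after every feed has been updated.
    for url in reader.get_feeds(tags=[MARK]):
        new_url = reader.get_tag(url, MARK)
        reader.change_feed_url(url, new_url)
        reader.delete_tag(new_url, MARK)  # the tag moved to new_url with the feed
```

Because the per-feed hook only sets a tag, it does not matter in which order it runs relative to other per-feed plugins; the URL itself changes only in the final pass.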
