
Entry number unbounded, no way of removing old entries #96

Open
lemon24 opened this issue Nov 26, 2018 · 4 comments


lemon24 commented Nov 26, 2018

A database with ~3000 entries takes about 21 MB, which is perfectly acceptable. However, at the moment there is no way to remove old entries, so the database can grow arbitrarily.


lemon24 commented Feb 3, 2019

How other people handle this:

Akregator has 4 archive settings (can be configured globally, or per feed) (update: unchanged as of 2022):

  • keep all articles forever (what reader does now)
  • limit archive size to X articles; the oldest articles are deleted; flagged articles are ignored when counting the number of articles (reader doesn't have a concept of flagged/important article)
  • delete articles older than X days, unless they are flagged as important; done at startup, and then once per hour
  • disable archiving; all articles are discarded when quitting

Also, the "do not delete important articles" behavior can be turned off.
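
Akregator's "limit archive size to X articles" behavior can be sketched as follows (a minimal illustration; the `Article` shape and function name are made up for this sketch, not reader's API):

```python
from dataclasses import dataclass

@dataclass
class Article:
    id: str
    date: float          # import/arrival timestamp
    flagged: bool = False

def apply_archive_limit(articles, limit):
    """Keep only the newest `limit` articles, never deleting flagged ones
    and ignoring them when counting toward the limit."""
    unflagged = sorted(
        (a for a in articles if not a.flagged),
        key=lambda a: a.date,
        reverse=True,
    )
    doomed = {a.id for a in unflagged[limit:]}
    return [a for a in articles if a.id not in doomed]
```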

Tiny Tiny RSS can purge articles after X days (can be configured globally, or per feed); some details:

  • starred articles are not purged
  • purging is based on the import date; the import date is bumped every time an article is seen in the feed, since otherwise the article would be purged and then re-imported on every feed update
  • purging happens after update, so disabled updates means no purging (reader doesn't allow disabling updates for feeds)

An interesting (but somewhat unrelated feature) is the Archived feed, which keeps starred articles from deleted feeds and share-anything articles (you can add articles that have no feed). Articles in the Archived feed are not purged.
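
The import-date-bump mechanic above can be sketched like this (a hypothetical `db` mapping of entry id to import timestamp stands in for real storage; none of these names are reader's or TT-RSS's API):

```python
import time

DAY = 24 * 60 * 60

def update_and_purge(db, feed_entry_ids, purge_after_days, starred=(), now=None):
    """Tiny Tiny RSS-style purging sketch: every entry currently in the
    feed has its import date bumped, so it won't be purged and then
    re-imported on the next update; starred entries are never purged."""
    now = time.time() if now is None else now
    for entry_id in feed_entry_ids:
        db[entry_id] = now  # bump (or set) the import date
    cutoff = now - purge_after_days * DAY
    for entry_id, imported in list(db.items()):
        if imported < cutoff and entry_id not in starred:
            del db[entry_id]
```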


lemon24 commented May 14, 2020

Presumably, it would also be nice to mark a whole feed as important ("don't delete"). This could also be implemented as a plug-in that marks each new entry as important, but that would pollute the set of individually marked important entries.
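
A minimal sketch of the plug-in idea. The hook shape and `set_entry_important()` are illustrative assumptions (not reader's actual API), and `FakeReader` is a stand-in that only shows how such a hook would be wired up:

```python
def mark_new_entries_important(reader, entry, status):
    """Plug-in hook: flag every newly added entry as important."""
    if status == "new":
        reader.set_entry_important(entry, True)

class FakeReader:
    """Stand-in object demonstrating the hook wiring."""

    def __init__(self):
        self.important = set()
        self.after_entry_update_hooks = [mark_new_entries_important]

    def set_entry_important(self, entry, important):
        if important:
            self.important.add(entry)

    def simulate_update(self, entry, status):
        for hook in self.after_entry_update_hooks:
            hook(self, entry, status)
```

As the comment above notes, this marks entries individually, so "important because the feed is important" becomes indistinguishable from "important because the user said so".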


lemon24 commented Sep 4, 2022

Requirements:

  • strategies

    • initially:
      • keep all entries forever
      • delete entries older than X days
    • (later) it should be possible to have other strategies, and have multiple strategies act at the same time (delete if either matches)
    • likely configured through a .reader. reserved tag
  • levels:

    • global
      • default must be "keep all entries forever", for backwards compatibility
    • (later) per-feed
      • fully overrides global strategy
      • (later) flag to append to the global strategy (delete if either matches), not override
  • important entries are never deleted

    • (later) deleting important entries can be turned off (update: on?)
    • can mark an entire feed as important
      • how is this different from a per-feed strategy? it allows (temporarily) overriding the strategy, without deleting it
        • should this be possible for the global strategy too?
  • unread entries are never deleted

  • happens after the feed is updated

    • ... in update_feeds(), update_feeds_iter(), update_feed()
      • (later) can be skipped
    • (later) can be triggered independently (without update)
  • must happen at Python level, can't select + delete in a single query

    • why? for logging, and so that to-be-deleted entries can be intercepted by plugins
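The "must happen at Python level" requirement might look like the sketch below: select the candidates, give plug-ins a chance to intercept each one, log, then delete individually. The schema and the veto-style `hooks` are assumptions for illustration, not reader's actual storage or plug-in API:

```python
import logging
import sqlite3

log = logging.getLogger("reader.delete")

def delete_old_entries(db, cutoff, hooks=()):
    """Select-then-delete at Python level, rather than a single DELETE
    query, so each to-be-deleted entry can be logged and intercepted."""
    rows = db.execute(
        "SELECT id FROM entries "
        "WHERE last_updated < ? AND NOT important AND is_read",
        (cutoff,),
    ).fetchall()
    deleted = []
    for (entry_id,) in rows:
        # a hook returning False vetoes the deletion of this entry
        if not all(hook(entry_id) for hook in hooks):
            continue
        log.info("deleting entry %s", entry_id)
        db.execute("DELETE FROM entries WHERE id = ?", (entry_id,))
        deleted.append(entry_id)
    return deleted
```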

Open questions:

  • "delete entries older than X days"

    • what does "older" mean?
      • we need to handle "older" entries that are still in the feed, so we don't import them again on every update
      • if it's "last appeared in feed" (we don't store this at the moment), then we don't need special handling
        • but if a feed contains all the entries since the beginning of time, entries from e.g. 10 years ago will never get deleted
      • if it's something else (added, last updated, published/updated), then we need some kind of tombstone
        • how long do we keep the tombstones around? we can't let them accumulate indefinitely
  • entry_dedupe needs the old entry in order to dedupe; it cannot work if the entry has already been deleted

    • after_{entry,feed}_update_hooks must run before entries are deleted
    • if "older" means "last appeared in feed", when entry ids change and the number of entries in the feed increases, the older ones still won't be deduped
    • if we keep tombstones and we switch to MinHash, then we can store the serialized MinHash with the tombstone
  • We still need to keep EntryCounts.averages accurate after entries are deleted (but not for duplicates).
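
The tombstone idea from the open questions above might look like this (all structures hypothetical: `storage` maps entry id to entry, `tombstones` maps deleted entry id to deletion time):

```python
import time

DAY = 24 * 60 * 60

def update_with_tombstones(storage, feed_entries, tombstones, keep_days, now=None):
    """An id with a tombstone is skipped on update instead of being
    re-imported; tombstones expire after `keep_days` so they don't
    accumulate indefinitely."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * DAY
    for entry_id, deleted_at in list(tombstones.items()):
        if deleted_at < cutoff:
            del tombstones[entry_id]  # expired tombstone
    for entry_id, entry in feed_entries.items():
        if entry_id in tombstones:
            continue  # deleted before, don't import it again
        storage[entry_id] = entry
```

Note the open question stays visible here: if a tombstone expires while the entry is still in the feed, the entry is re-imported on the next update.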


lemon24 commented Mar 23, 2023

TODO: kinds of duplicates (broadly) × deduplication mechanisms matrix

Kinds of duplicates:

  • different title, same text
  • same title, same text
  • same id
  • same link
  • same title (don't have text)

Deduplication mechanisms:

  • Jaccard similarity of text
  • MinHash of text, k hash functions (any size doc, slow, takes k * 4 bytes in storage, k=2500 for error to be within .02, k=625 for error to be within .04)
  • MinHash of text, 1 hash function (doc size >= k, faster, storage and error idem)
  • id
  • link
  • title
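
The k-hash-function MinHash variant above can be sketched as follows (k salted hashes simulate k hash functions; each minimum fits in 4 bytes as a 32-bit value, and the standard error of the estimate is on the order of 1/√k, which gives the k=2500 → .02 and k=625 → .04 figures). A minimal sketch, not a production implementation:

```python
import hashlib

def minhash(tokens, k=128):
    """MinHash signature using k (salted) hash functions; stored as k
    minima, i.e. k * 4 bytes if each is kept as a 32-bit integer."""
    signature = []
    for i in range(k):
        salt = i.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + token.encode(), digest_size=4).digest(),
                "big",
            )
            for token in tokens
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # the fraction of matching minima estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Because a signature is just k integers, it can be serialized and kept with a tombstone, as suggested earlier in the thread.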
