
Towards a usable discovery algo #64

Open · wants to merge 1 commit into main
Conversation

@sneakers-the-rat (Collaborator) commented Sep 15, 2024

Step 1: as background for making the /about page friendlier, I'm trying to surface recent discussions from the instance. First I wanted to make the discover page usable at all; then I'll add a filter to only include posts from this instance.

Currently the discovery page is sort of sad. It doesn't really work for smaller instances because the values are all tuned for mastodon.social:

  • posts decay too quickly to be seen: with a 1-hour half-life, most posts sit beneath the 5-score threshold most of the time.
  • we want to surface posts that generate good discussion, I think, but replies are not accounted for in the score.
  • likes are weighted the same as boosts, even though most people think of favorites as 'read receipts', with a boost being more meaningful.
  • there is a problem in the way posts are picked that causes a discontinuity around midnight UTC, thanks to the use of beginning_of_day rather than days_ago(1).
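To make the first point concrete, here's a minimal sketch of exponential decay — hypothetical code, not Mastodon's actual trends implementation — showing how quickly a 1-hour half-life pushes posts under the threshold:

```ruby
# Hypothetical decay model: a post's score halves every
# `half_life_hours`. An illustration, not the real trends code.
def decayed_score(initial_score, age_hours, half_life_hours)
  initial_score * 2.0**(-age_hours.to_f / half_life_hours)
end

threshold = 5.0

# With a 1-hour half-life, a post starting at score 40 drops below
# the threshold after just 4 hours (40 -> 20 -> 10 -> 5 -> 2.5):
hours_visible = (0..48).find { |h| decayed_score(40, h, 1.0) < threshold }
# hours_visible == 4

# The same post with a 12-hour half-life stays in consideration
# for about a day and a half:
hours_visible_12h = (0..96).find { |h| decayed_score(40, h, 12.0) < threshold }
# hours_visible_12h == 37
```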

So this PR

  • adds a descendents_count method to Status that counts all replies in a tree, rather than just immediate replies
  • adds a parameter to the update method so that statuses that haven't been used in the redis cache in 3 days are still considered for the explore feed (I couldn't find documentation for the used key in redis, but I'm assuming it's the last time the key was accessed)
  • increases the half-life from 1 hour to 12 hours; this spreads out the window for consideration and, importantly, lets us account for timezone differences
  • adds weights for the different kinds of interactions
  • moves the interaction_exponent out of the function into the options. I feel like this should be an exponent < 1, so there is some ceiling on how much interaction counts matter, rather than making them matter more at higher values, but I didn't want to change too much without simulating
  • adds a local_weight value that amplifies posts from within the instance
  • allows replies to be shown in the explore feed, since some of the best discussion happens in replies. I kept the requirement that they be public visibility: the docs for the unlisted setting explicitly say unlisted posts won't be included in algorithmic feeds, and public-visibility replies to someone else's post don't clog the local feed, so using public visibility on replies to other people is fine.
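Putting those pieces together, here's a hypothetical sketch of what the reworked score could look like. The option names and values here are illustrative stand-ins, not the exact ones from the PR:

```ruby
# Hypothetical sketch of the reworked scoring; option names and
# values are illustrative, not the PR's actual ones.
DISCOVERY_OPTIONS = {
  half_life_hours: 12.0,
  interaction_exponent: 0.8, # < 1: big counts saturate instead of dominating
  weights: { favourites: 1.0, reblogs: 2.0, replies: 3.0 },
  local_weight: 2.0,         # amplify posts from within the instance
}.freeze

def discovery_score(status, age_hours, opts = DISCOVERY_OPTIONS)
  # Weighted interactions: boosts count more than likes, replies most of
  # all, and replies use the whole tree (descendents_count), not just
  # direct replies.
  interactions =
    opts[:weights][:favourites] * status[:favourites_count] +
    opts[:weights][:reblogs]    * status[:reblogs_count] +
    opts[:weights][:replies]    * status[:descendents_count]

  score = interactions**opts[:interaction_exponent]
  score *= opts[:local_weight] if status[:local]
  score * 2.0**(-age_hours / opts[:half_life_hours])
end
```

With these toy numbers, a local post with 10 favourites, 5 boosts, and 4 replies in its tree has 32 weighted interactions; 32^0.8 = 16, doubled for being local and halved at 12 hours of age, gives a score of 16.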

Here's the worst matplotlib plot you've ever seen, showing some values of favorites, replies, and boosts over a week and how those scores decay with time. I split it into 3 subplots by number of favorites because there were too many lines for a single plot; the legends say how many boosts and replies each post has. Another way of reading this is to "slide each line across time" to see when a post with fewer interactions would rank higher than an older post with more interactions; that's sort of a proxy for discovery churn. The vertical black bars indicate when a post (with the matching initial y-value) would fall below the consideration threshold. So a post with 25 likes, 25 boosts, and 5 replies would fall out of consideration in about a week (though it would probably be very low in the list by that point), and all the higher values follow shortly after.

Screenshot 2024-09-14 at 10 54 15 PM
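The "slide each line across time" reading has a neat closed form. Since every post decays at the same exponential rate, the ratio between two posts' scores is constant over time, so a newer post outranks one posted gap_hours earlier exactly when its raw score exceeds a fixed fraction of the older post's. A sketch, with a hypothetical helper name, assuming the pure exponential-decay model above:

```ruby
# Fraction of an older post's raw score that a post published
# `gap_hours` later needs in order to outrank it. Because both posts
# decay at the same exponential rate, this ratio never changes.
def required_score_ratio(gap_hours, half_life_hours)
  2.0**(-gap_hours / half_life_hours)
end

required_score_ratio(12.0, 12.0) # => 0.5: half the raw score beats a
                                 #    post 12 hours older
required_score_ratio(12.0, 1.0)  # => ~0.00024: with a 1h half-life,
                                 #    nearly anything outranks a
                                 #    12h-old post (high churn)
```

So the longer half-life doesn't just keep posts in consideration longer; it also slows the rate at which brand-new posts leapfrog older, well-discussed ones.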

@thesamovar wdyt

edit: looks like I'll need to adjust some values in the tests, but nothing seems fundamentally broken. This is a pretty simple PR, after all.

@sneakers-the-rat (Collaborator, Author)

So I'm going to back off the setting that pulls up to 4 days of redis data a bit. But the reason I'm not all that concerned about the extra compute time here is that we are very rarely CPU bound after splitting the sidekiq queues into more processes. This is a snapshot from a period when it was basically just me posting monsterdon stuff, with a few boosts etc. from others in the meantime:
Screenshot 2024-09-15 at 8 13 43 PM.

We have all that area under the maximum curve as processing overhead; our CPU usage regularly averages 12.5-25%. The longer half-life directly (and probably significantly) increases the maximum number of statuses considered, as does any increase in these values. The scheduler should accommodate this: it fires an update event every 5 minutes (trends_refresh_scheduler), but those runs shouldn't stack, i.e. if one isn't done, it won't run twice. If we get into a situation where the update takes forever, we can probably speed it up by batching the update, or by profiling the calls here more directly.
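That non-stacking behavior can be sketched generically — this is an illustration using a simple try-lock, not Mastodon's actual scheduler code:

```ruby
# Generic sketch of a non-stacking periodic job: if a run is still in
# progress when the next tick fires, the new run is skipped rather
# than queued up behind it.
class NonStackingRefresh
  def initialize
    @lock = Mutex.new
  end

  # Returns true if the refresh ran, false if it was skipped because
  # a previous run is still in progress.
  def perform
    return false unless @lock.try_lock
    begin
      refresh_trends
      true
    ensure
      @lock.unlock
    end
  end

  def refresh_trends
    # placeholder for the actual score recomputation
  end
end
```

The upshot: if the refresh ever starts taking longer than the 5-minute tick, runs just get skipped instead of piling up, which is why batching or profiling is the right fix rather than worrying about a queue backlog.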

Basically what I'm saying is that we have a lot of headroom to experiment with, and we should use it. After this PR, if the basic approach seems to be working, I'll adjust the values more directly (I might split them out into the .env file) until we get decent performance and the algo seems to be working. Currently that page is basically useless: it only ever has 5 posts on it, and they seem effectively random to me.
