
Towards a usable discovery algo #64

Open · wants to merge 1 commit into main
Conversation

@sneakers-the-rat (Collaborator) commented Sep 15, 2024

Step 1: as background for making the /about page friendlier, I'm trying to surface recent discussions from the instance. First I wanted to make the discover page usable at all; then I'll add a filter to only include posts from this instance.

Currently the discovery page is sort of sad. It doesn't really work for smaller instances because the values are all tuned for mastodon.social:

  • posts decay too quickly to be seen: with a 1-hour half-life, most posts sit beneath the 5-score threshold most of the time.
  • we want to surface posts that generate good discussion, I think, but replies are not accounted for in the score.
  • likes are weighted the same as boosts, even though most people think of favorites as 'read receipts', with a boost being more meaningful.
  • there is a problem in the way posts are picked that causes a discontinuity around midnight UTC, thanks to the use of beginning_of_day rather than days_ago(1).
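To make the first point concrete, here's a minimal sketch of exponential decay — hypothetical code, not Mastodon's actual trends implementation — showing how quickly a 1-hour half-life pushes posts under the threshold:

```ruby
# Hypothetical decay model: a post's score halves every
# `half_life_hours`. An illustration, not the real trends code.
def decayed_score(initial_score, age_hours, half_life_hours)
  initial_score * 2.0**(-age_hours.to_f / half_life_hours)
end

threshold = 5.0

# With a 1-hour half-life, a post starting at score 40 drops below
# the threshold after just 4 hours (40 -> 20 -> 10 -> 5 -> 2.5):
hours_visible = (0..48).find { |h| decayed_score(40, h, 1.0) < threshold }
# hours_visible == 4

# The same post with a 12-hour half-life stays in consideration
# for about a day and a half:
hours_visible_12h = (0..96).find { |h| decayed_score(40, h, 12.0) < threshold }
# hours_visible_12h == 37
```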

So this PR

  • adds a descendents_count method to Status that counts all replies in a tree, rather than just immediate replies
  • adds a parameter to the update method so that statuses that haven't been used in the redis cache in 3 days are still considered for the explore feed (I couldn't find documentation for the used key in redis, but I'm assuming it's the last time the key was accessed)
  • increases the half-life from 1 hour to 12 hours; this spreads out the window for consideration and, importantly, lets us account for timezone differences
  • adds weights for the different kinds of interactions
  • moves the interaction_exponent out of the function into the options. I feel like this should be an exponent < 1, so there is some ceiling on how much interaction counts matter, rather than making them matter more at higher values, but I didn't want to change too much without simulating
  • adds a local_weight value that amplifies posts from within the instance
  • allows replies to be shown in the explore feed, since some of the best discussion happens in replies. I kept the requirement that they be public visibility: the docs for the unlisted setting explicitly say unlisted posts won't be included in algorithmic feeds, and public-visibility replies to someone else's post don't clog the local feed, so using public visibility on replies to other people is fine.
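Putting those pieces together, here's a hypothetical sketch of what the reworked score could look like. The option names and values here are illustrative stand-ins, not the exact ones from the PR:

```ruby
# Hypothetical sketch of the reworked scoring; option names and
# values are illustrative, not the PR's actual ones.
DISCOVERY_OPTIONS = {
  half_life_hours: 12.0,
  interaction_exponent: 0.8, # < 1: big counts saturate instead of dominating
  weights: { favourites: 1.0, reblogs: 2.0, replies: 3.0 },
  local_weight: 2.0,         # amplify posts from within the instance
}.freeze

def discovery_score(status, age_hours, opts = DISCOVERY_OPTIONS)
  # Weighted interactions: boosts count more than likes, replies most of
  # all, and replies use the whole tree (descendents_count), not just
  # direct replies.
  interactions =
    opts[:weights][:favourites] * status[:favourites_count] +
    opts[:weights][:reblogs]    * status[:reblogs_count] +
    opts[:weights][:replies]    * status[:descendents_count]

  score = interactions**opts[:interaction_exponent]
  score *= opts[:local_weight] if status[:local]
  score * 2.0**(-age_hours / opts[:half_life_hours])
end
```

With these toy numbers, a local post with 10 favourites, 5 boosts, and 4 replies in its tree has 32 weighted interactions; 32^0.8 = 16, doubled for being local and halved at 12 hours of age, gives a score of 16.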

Here's the worst matplotlib plot you've ever seen, showing some values of favorites, replies, and boosts over a week and how those scores decay with time. I split it into 3 subplots by number of favorites because there were too many lines for a single plot; the legends say how many boosts and replies each post has. Another way of reading this is to "slide each line across time" to see when a post with fewer interactions would rank higher than an older post with more interactions; that's sort of a proxy for discovery churn. The vertical black bars indicate when a post (with the matching initial y-value) would fall below the consideration threshold. So a post with 25 likes, 25 boosts, and 5 replies would fall out of consideration in about a week (though it would probably be very low in the list by that point), and all the higher values follow shortly after.

Screenshot 2024-09-14 at 10 54 15 PM
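The "slide each line across time" reading has a neat closed form. Since every post decays at the same exponential rate, the ratio between two posts' scores is constant over time, so a newer post outranks one posted gap_hours earlier exactly when its raw score exceeds a fixed fraction of the older post's. A sketch, with a hypothetical helper name, assuming the pure exponential-decay model above:

```ruby
# Fraction of an older post's raw score that a post published
# `gap_hours` later needs in order to outrank it. Because both posts
# decay at the same exponential rate, this ratio never changes.
def required_score_ratio(gap_hours, half_life_hours)
  2.0**(-gap_hours / half_life_hours)
end

required_score_ratio(12.0, 12.0) # => 0.5: half the raw score beats a
                                 #    post 12 hours older
required_score_ratio(12.0, 1.0)  # => ~0.00024: with a 1h half-life,
                                 #    nearly anything outranks a
                                 #    12h-old post (high churn)
```

So the longer half-life doesn't just keep posts in consideration longer; it also slows the rate at which brand-new posts leapfrog older, well-discussed ones.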

@thesamovar wdyt

edit: looks like I'll need to adjust some values in the tests, but nothing seems fundamentally broken. This is a pretty simple PR, after all.

@sneakers-the-rat (Collaborator, Author)

So I'm going to back off the setting that pulls up to 4 days of redis data a bit. But the reason I'm not all that concerned about the extra compute time here is that we are very rarely CPU bound after splitting the sidekiq queues into more processes. This is a snapshot from a period when it was basically just me posting monsterdon stuff, with a few boosts etc. from others in the meantime:
Screenshot 2024-09-15 at 8 13 43 PM.

We have all that area under the maximum curve as processing overhead; our CPU usage regularly averages 12.5-25%. The longer half-life directly (and probably significantly) increases the maximum number of statuses considered, as does any increase in these values. The scheduler should accommodate this: it fires an update event every 5 minutes (trends_refresh_scheduler), but those runs shouldn't stack, i.e. if one isn't done, it won't run twice. If we get into a situation where the update takes forever, we can probably speed it up by batching the update, or by profiling the calls here more directly.
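That non-stacking behavior can be sketched generically — this is an illustration using a simple try-lock, not Mastodon's actual scheduler code:

```ruby
# Generic sketch of a non-stacking periodic job: if a run is still in
# progress when the next tick fires, the new run is skipped rather
# than queued up behind it.
class NonStackingRefresh
  def initialize
    @lock = Mutex.new
  end

  # Returns true if the refresh ran, false if it was skipped because
  # a previous run is still in progress.
  def perform
    return false unless @lock.try_lock
    begin
      refresh_trends
      true
    ensure
      @lock.unlock
    end
  end

  def refresh_trends
    # placeholder for the actual score recomputation
  end
end
```

The upshot: if the refresh ever starts taking longer than the 5-minute tick, runs just get skipped instead of piling up, which is why batching or profiling is the right fix rather than worrying about a queue backlog.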

Basically what I'm saying is that we have a lot of headroom to experiment with, and we should use it. After this PR, if the basic approach seems to be working, I'll adjust the values more directly (I might split them out into the .env file) until we get decent performance and the algo seems to be working. Currently that page is basically useless: it only ever has 5 posts on it, and they seem effectively random to me.
