Add trending score to solr #7429

Open
cdrini opened this issue Jan 18, 2023 · 5 comments · May be fixed by #10057
Labels
Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Needs: Help Issues, typically substantial ones, that need a dedicated developer to take them on. [managed] Priority: 2 Important, as time permits. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@cdrini
Collaborator

cdrini commented Jan 18, 2023

We would love to be able to see trending items in a given collection (eg what's trending today in subject:fantasy ?). This would be a useful sorting parameter for many pages (eg Library Explorer, subject pages, trending carousel), that would provide interesting books while also providing a good variety of books.

To do this, we need a score stored in solr. We can compute it using a z-score algorithm like the one described here.

We'd need a few new solr fields:

# The last 24 hours of trending actions
trending_actions_hourly: [0,0,0,0,1,1,0, ...] # len: 24
# The last 30 days of trending actions as a json blob
trending_actions_daily: [0,1,4,3,2,0,0,12, ...] # len: 30
# The trending zscore as a number
trending_score: 0.645

Then the rough formula is:

trending_score = (views_today - average_views_per_day) / standard_dev_of_views_per_day
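
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `trending_z_score`, the use of the population standard deviation, and the choice to return 0 for a flat history are all assumptions for illustration, not something decided in this issue.

```python
from statistics import mean, pstdev


def trending_z_score(views_today: float, daily_history: list) -> float:
    """Z-score of today's count against the historical daily counts.

    Hypothetical helper: name and edge-case handling are assumptions.
    """
    if not daily_history:
        return 0.0
    avg = mean(daily_history)
    std = pstdev(daily_history)  # population std dev (an assumption)
    if std == 0:
        # Flat history: avoid dividing by zero and report "no signal".
        return 0.0
    return (views_today - avg) / std


if __name__ == "__main__":
    history = [0, 1, 4, 3, 2, 0, 0, 12]
    print(trending_z_score(12, history))
```

A work whose count today sits far above its own average gets a high score regardless of its absolute popularity, which is what should give the variety mentioned above.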

We would then need:

  • a script that runs every day and updates trending_actions_daily and trending_score for all works
  • a script that runs every hour and updates trending_actions_hourly and trending_score
  • whenever a trending action occurs (eg a want to read, etc), update trending_actions_hourly, trending_actions_daily, and trending_score
  • Use solr partial document updates for everything to keep this performant
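
For reference, a Solr partial (atomic) update is a JSON document that carries the unique key plus a `{"set": ...}` modifier for each field to change. The sketch below only builds that payload and sends nothing. The helper name `build_trending_update`, the assumption that the unique key field is called `key`, and storing the daily counts as a JSON string are all hypothetical.

```python
import json


def build_trending_update(work_key, hourly, daily, score):
    """Build one atomic-update document for a work (hypothetical helper)."""
    return {
        # Assumption: the core's unique key field is named "key".
        "key": work_key,
        "trending_actions_hourly": {"set": hourly},
        # Assumption: daily counts are stored as a JSON blob string.
        "trending_actions_daily": {"set": json.dumps(daily)},
        "trending_score": {"set": score},
    }


if __name__ == "__main__":
    doc = build_trending_update("/works/OL1W", [0] * 24, [0] * 30, 0.645)
    # A list of such documents would be POSTed to the core's update handler.
    print(json.dumps([doc], indent=2))
```

Only the listed fields change; Solr leaves the rest of the stored document as it was, which is why this is cheaper than reindexing the whole work.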
@cdrini cdrini added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Jan 18, 2023
@mekarpeles mekarpeles added Priority: 3 Issues that we can consider at our leisure. [managed] and removed Priority: 2 Important, as time permits. [managed] labels Nov 28, 2023
@mekarpeles
Member

Ideas for better trending on homepage

  1. remove trending from the homepage
  2. more aggressively randomize the trending carousel
  3. exclude books which were trending last month (or [4] where we have a trending score based on the delta of last month & this month)
  4. use different trending criteria (I know this is limited to what data is in solr + re-indexing rate)

@mekarpeles
Member

Based on a conversation yesterday between @cdrini and myself, there seems to be an approach forward:

  • At the end of each day, select all solr records for works that have a trending_score and evict (i.e. set to 0) the oldest day of reading log counts in each work's trending_events_daily field. This field is an array with 7 rolling slots for the 7 days in a week, where the indices of the array correspond to the day of the week:
    • 0: Sunday, 1: Monday, ..., 6: Saturday
    • If today is Monday, we would evict Tuesday (the oldest slot) and set its count to 0
  • Directly after eviction completes, query for all books that have had reading log events in the last 24h (this number should be around 50k)
    • (Screenshot, 2024-08-23 at 5:32:14 AM)
  • For each work with reading log events, update the trending_events_daily field of the work's solr record (the 7-slot array described above). The cumulative reading log event count for that work for that day is written into the trending_events_daily array at the index corresponding to the day of the week (in a rolling fashion), e.g. if it's Sunday, write the count into position 0; if it's Saturday, write it into the final position, 6.
  • Recompute the work's trending_score based on its updated trending_events_daily array and add this score to the work's solr record
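
The rolling 7-slot array described in the steps above could be sketched like this in Python. The helper names (`slot_for`, `evict_oldest`, `record_day`) are hypothetical; the only fixed points taken from the comment are the Sunday = 0 indexing and the rule that the slot after today's is the oldest.

```python
from datetime import date


def slot_for(day: date) -> int:
    """Array index for a date, with 0 = Sunday ... 6 = Saturday.

    Python's date.weekday() uses 0 = Monday, hence the shift.
    """
    return (day.weekday() + 1) % 7


def evict_oldest(slots: list, today: date) -> list:
    """Return a copy with the oldest day (the slot after today's) set to 0."""
    updated = list(slots)
    updated[(slot_for(today) + 1) % 7] = 0
    return updated


def record_day(slots: list, day: date, count: int) -> list:
    """Return a copy with that day's cumulative count written to its slot."""
    updated = list(slots)
    updated[slot_for(day)] = count
    return updated


if __name__ == "__main__":
    slots = [1, 2, 3, 4, 5, 6, 7]
    monday = date(2024, 8, 26)
    print(evict_oldest(slots, monday))   # Tuesday's slot (index 2) is cleared
    print(record_day(slots, monday, 9))  # Monday's slot (index 1) is set
```

Because the index is derived from the date, no shifting of the array is needed; each daily run touches at most two slots per work.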

@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] Needs: Help Issues, typically substantial ones, that need a dedicated developer to take them on. [managed] and removed Priority: 3 Issues that we can consider at our leisure. [managed] labels Aug 23, 2024
@mekarpeles mekarpeles added this to the Sprint 2024-09 milestone Aug 23, 2024
@benbdeitch
Collaborator

Hi, mind assigning me to this? I'd love to take a shot at it.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Sep 11, 2024
@cdrini
Collaborator Author

cdrini commented Sep 11, 2024

Ben and I chatted and I presented this rough approach, which I believe should be the least headache-inducing and also give us up-to-the-hour trending score data:

Every day:

/export
?fields=key,trending_events_daily_*,trending_events_hourly_*
&q=trending_events_daily_X:[1 TO *] OR trending_events_hourly_sum:[1 TO *]

Solr in place update:
set trending_events_daily_X to sum of last 24h

Every hour (H:05):

/export
?fields=key,trending_events_hourly_H,trending_events_hourly_sum
&q=trending_events_hourly_(H-1):[1 TO *]

From DB: get all events in hour H-1

Solr in place update:
trending_events_hourly_(H-1): db_value or 0
trending_events_hourly_sum: - prev_val + new_val
trending_score: (hourly_sum - daily_mean) / daily_stddev
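
A minimal Python sketch of the hourly step, operating on a plain dict that stands in for one exported Solr document. The function name `apply_hourly_update` is hypothetical, and the score uses the z-score form from the issue description, (count - mean) / stddev, which is an assumption about the final formula rather than a settled choice.

```python
from statistics import mean, pstdev
from typing import Optional


def apply_hourly_update(doc: dict, hour: int, db_value: Optional[int]) -> dict:
    """Return the fields to write back for one work (hypothetical helper).

    doc      - exported Solr fields for the work
    hour     - the hour slot being refreshed (H-1 in the outline above)
    db_value - event count from the DB for that hour, or None if absent
    """
    field = f"trending_events_hourly_{hour}"
    prev_val = doc.get(field, 0)
    new_val = db_value or 0
    # Keep the running 24h sum in step without re-reading all 24 slots.
    new_sum = doc.get("trending_events_hourly_sum", 0) - prev_val + new_val

    daily = [doc.get(f"trending_events_daily_{d}", 0) for d in range(7)]
    std = pstdev(daily)
    # Assumption: a flat daily history yields a score of 0.
    score = (new_sum - mean(daily)) / std if std > 0 else 0.0

    return {
        field: new_val,
        "trending_events_hourly_sum": new_sum,
        "trending_score": score,
    }


if __name__ == "__main__":
    doc = {"trending_events_hourly_4": 2, "trending_events_hourly_sum": 10}
    for d, count in enumerate([8, 2, 7, 3, 6, 4, 5]):
        doc[f"trending_events_daily_{d}"] = count
    print(apply_hourly_update(doc, 4, 5))
```

The returned dict holds only the three changed fields, so it maps directly onto a per-document in-place update.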

@cdrini cdrini assigned benbdeitch and unassigned mekarpeles and cdrini Sep 11, 2024
@cdrini cdrini removed the Needs: Response Issues which require feedback from lead label Sep 13, 2024
@mekarpeles mekarpeles modified the milestones: Sprint 2024-09, 2024-11 Oct 25, 2024
@benbdeitch benbdeitch linked a pull request Nov 20, 2024 that will close this issue
@mekarpeles
Member

Status update from @benbdeitch:

Further programming notes for Trending:

  1. Currently trending uses the 'date updated' column for events within the reading log to track user interest. This could lead to unexpected behavior, so I need to change it to 'date added' before it's implemented.

  2. There's the ongoing caveat that the current numerical formula for the z-score may need to be tweaked, depending on whether we like the results we get. This is a minor change that will only affect a small part of the code, but it may be needed if we're getting weird outliers.
