Reduce duplicate calculations by caching metrics to disk (replacement for `load_from_csv`) #342

jucor · 2023-07-26T16:06:04Z

Hi!

Thanks for a great pacakage.

I notice that StatsManager.load_from_csv now has a deprectation warning. What is now the recommended way to avoid re-running the costly (20 minutes on my 1h20 x265 video) scene detection, please?

I have a stats file created with the CLI:

scenedetect --input myfile.mkv --stats adaptive-stats.csv detect-adaptive

and I would like to load that adaptative-stats.csv file into a SceneManager to do the actual scene detection. load_from_csv sounded promising :)

Thanks for any help, and a great tool!

Environment:

[PySceneDetect] PySceneDetect 0.6.2

System Info
------------------------------------------------------------
OS           macOS-13.4.1-arm64-arm-64bit
Python       3.11.4

Packages
------------------------------------------------------------
av           Not Installed
click        8.1.6
cv2          4.8.0
moviepy      Not Installed
numpy        1.25.1
platformdirs 3.9.1
scenedetect  0.6.2
tqdm         4.65.0

Tools
------------------------------------------------------------
ffmpeg       6.0
mkvmerge     v77.0 ('Elemental') 64-bit

The text was updated successfully, but these errors were encountered:

jucor · 2023-07-26T16:10:29Z

I am seeing in the deprecation commit the notes:

Equivalent functionality with much better performance could be obtained for most use cases by seeking instead.

Is this seeking alternative something that I can already use, please (and how?), or am I better (in the current moment) to use load_from_csv ?
Thanks for any pointer!

jucor · 2023-07-26T16:34:55Z

Mm, reading AdaptiveDetector.process_frame I now understand that the stats CSV is not used to speed things up anymore anyway.
But the commit introducing online calculation (smart!) mentions the ability to skip/seek, as did the deprecation commit mentioned above.

How could I use that new skip/seek trick, based on saved CSV of the stats, please?

jucor · 2023-07-26T16:38:45Z

(and yes, I think I'm absolutely spending an hour diving into the code just to avoid waiting 20 minutes again to parse my video 😅 )

Breakthrough · 2023-07-26T23:28:19Z

@jucor ~~unfortunately I don't have a replacement planned for this at this time.~~ thanks for bringing this up, I'll repurpose this issue to track adding replacement functionality.

The main reason for removing this feature was that it just wasn't well thought out before I introduced it - it became very difficult to scale the CSV format as a cache for detector calculations as the number of dimensions/parameters per detector rose. This led to a bunch of weird bugs and edge cases that for a long time went unnoticed, so felt it was time to rework how this caching happens.

The latest version made a lot of performance improvements to scanning - are you disabling downscaling in your case, or is most of the time processing the video spent just decoding each frame?

I do apologize for the current status quo regarding loading/caching frame metrics. Removing it was necessary to allow customizing the calculation of each frame's score from the component weights. In the long term, I want to look into having each detector itself be responsible for caching calculations, and reserve the statsfile purely for statistical/human readable analysis. That was the intended purpose, but it kind of unravelled when attempting to re-use the same format as a cache for detection calculations.

Breakthrough · 2023-07-26T23:35:08Z

Regarding seeking/skipping, I don't think that applies for your particular use case since you want to re-process the entire video. Instead it might be worth just hacking a ContentDetector to load frame scores from a specified CSV file instead of calculating them. I do apologize for this making things slower for you in the interim - if you end up making any progress on that, feel free to submit a PR for it, we did something similar for using the output of list-scenes recently by creating a pseudo-detector (SceneLoader).

You could do this pretty easily by creating a new detector that derives from ContentDetector, and override the _calculate_frame_score method to return the score for that frame loaded from disk instead of re-calculating it. This should also resolve some of the shortcomings of the existing solution (e.g. can guarantee a user doesn't override the parameters if we load the scores from disk), so I'm more than open to including this in the next release as an official workaround.

In the long run, I think you raise some great points. I want to look into having ContentDetector allow specifying a file to use as a cache, just not sure what the correct format would be. The statsfile would effectively be derived from the cache, instead of being the cache.

Thanks for filing this!

jucor · 2023-07-27T00:01:49Z

Please don't apologize! It's a terrific tool :) :)
Most of the time is spent processing the frames, in this example. But I've just bit the bullet and waited 20 minutes, and it was good enough for what I needed to do, so no problem on my end any more!

Breakthrough · 2024-01-29T01:42:32Z

Taking a look at this with a fresh set of eyes it's probably unwise to pursue re-adding this until a well thought out API is in place. The issue is how the current interface for a SceneDetector is defined. If this continues to become a burden for some workflows it can be re-prioritized. The v0.7 API definitely needs to take this into account, so that's where I would ideally like to target this for.

Edit: thanks for your the feedback as well on this @jucor, much appreciated.

jucor · 2024-01-30T07:54:49Z

Pleasure! Thanks for the hard work and for keeping thinking about it. Makes sense to prioritize as you did :)

jucor changed the title ~~Replacement for load_from_csv?~~ Save statistics and reload them to avoid repeating the processing time? [Was: Replacement for load_from_csv?] Jul 26, 2023

Breakthrough added the feature label Jul 26, 2023

Breakthrough changed the title ~~Save statistics and reload them to avoid repeating the processing time? [Was: Replacement for load_from_csv?]~~ Reduce duplicate calculations by caching metrics to disk (replacement for load_from_csv) Jul 26, 2023

Breakthrough added this to the v0.6.3 milestone Jul 26, 2023

Breakthrough modified the milestones: v0.6.3, v0.7 Jan 29, 2024

Breakthrough modified the milestones: v0.6.4, v0.7 Apr 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce duplicate calculations by caching metrics to disk (replacement for `load_from_csv`) #342

Reduce duplicate calculations by caching metrics to disk (replacement for `load_from_csv`) #342

jucor commented Jul 26, 2023

jucor commented Jul 26, 2023

jucor commented Jul 26, 2023

jucor commented Jul 26, 2023 •

edited

Loading

Breakthrough commented Jul 26, 2023 •

edited

Loading

Breakthrough commented Jul 26, 2023 •

edited

Loading

jucor commented Jul 27, 2023

Breakthrough commented Jan 29, 2024 •

edited

Loading

jucor commented Jan 30, 2024

Reduce duplicate calculations by caching metrics to disk (replacement for load_from_csv) #342

Reduce duplicate calculations by caching metrics to disk (replacement for load_from_csv) #342

Comments

jucor commented Jul 26, 2023

jucor commented Jul 26, 2023

jucor commented Jul 26, 2023

jucor commented Jul 26, 2023 • edited Loading

Breakthrough commented Jul 26, 2023 • edited Loading

Breakthrough commented Jul 26, 2023 • edited Loading

jucor commented Jul 27, 2023

Breakthrough commented Jan 29, 2024 • edited Loading

jucor commented Jan 30, 2024

Reduce duplicate calculations by caching metrics to disk (replacement for `load_from_csv`) #342

Reduce duplicate calculations by caching metrics to disk (replacement for `load_from_csv`) #342

jucor commented Jul 26, 2023 •

edited

Loading

Breakthrough commented Jul 26, 2023 •

edited

Loading

Breakthrough commented Jul 26, 2023 •

edited

Loading

Breakthrough commented Jan 29, 2024 •

edited

Loading