
LPMS bottleneck fix #141

Merged · 4 commits · Jan 10, 2020

Conversation

@j0sh (Collaborator) commented Aug 3, 2019

The commits in this PR are rebased and squashed from the following WIP branch: https://github.com/livepeer/lpms/compare/ja/old-bottleneck

Background

There is a substantial overhead cost to starting new transcode sessions on a GPU. Adding more streams and GPUs exacerbates the problem [1]. Up until now, a new GPU session has been started for each segment that's received. This has led to a massive slowdown in transcoding latency, and becomes worse with each additional concurrent stream.

To mitigate this setup overhead, we make transcode sessions persistent for each stream, so segments after the first can re-use the same session.

[1] There may be a global lock somewhere within FFmpeg or the Nvidia drivers, or it may be intrinsic.

LPMS API changes:

Transcoding becomes stateful: a transcode session is persisted on the GPU in between segments. The caller becomes responsible for terminating the session. Working with the transcoder now involves allocating the transcoder, transcoding, and stopping the transcoder. The API is as follows:

func NewTranscoder() *Transcoder
func (t *Transcoder) StopTranscoder()
func (t *Transcoder) Transcode(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeResults, error)

The Transcoder struct is currently opaque to outside packages - nothing within it needs to be exposed at the moment.

The signature of the Transcode function matches the existing Transcode3 function signature. Existing transcoding APIs are re-defined in terms of this new API and will continue to work unmodified in older programs.

Note that this doubles down on the "one segment at a time" restriction. In order to process multiple segments for the same job in parallel, a new session needs to be created. However, creating a session may end up being very slow, so sessions should be reused wherever possible.

The older Transcode3 API is preserved for single-shot transcoding, and is implemented as a wrapper around the three-step API.

Code Flow and Changes

  • Muxer, Filtergraphs and Audio: No change from what we have now. Always recreated in between sessions. Enough code may have been moved around to obscure this fact, but there should be no net effect on the muxer, filtergraphs and audio as a result of this PR.

  • Decoding: Reused in between sessions. The number of streams and the indices of the audio / video components are also cached in between segments in order to avoid a latency-incurring probing step. If the stream ordering changes in between segments, things will likely break. (This shouldn't happen under most circumstances, but isn't unheard of, e.g. if there is a reconfiguration somewhere at the source; in that case the best thing to do would be to create a new session rather than reuse an existing one.)

Decoder buffering is a problem: sometimes we have finished feeding in all packets from a segment but haven't received all pending frames from the decoder. To flush the decoder buffer, we send a dummy packet with a sentinel timestamp and keep feeding the decoder until we receive a decoded frame back with that sentinel timestamp, discarding any frames that carry it. The dummy packet is taken from the first packet of each segment (and reset for subsequent segments).

  • Encoding: The encoder session is kept alive in between segments if a hardware encoder is used. If a software encoder is used, then the encoder is reset. This fact alone makes the set-up and clean-up flows conditional and rather confusing.

For hardware encoding, we currently don't do anything to explicitly flush the encoder in between segments. This seems OK for now - we receive all the frames we're supposed to, according to our tests - but it makes me nervous that certain inputs or configurations will leave us with buffered frames (eg, possibly B-frames). (There is a solution that might work for flushing the Nvidia encoder, but it involves modifying FFmpeg in questionable ways.)

API Usability

Some usability shortcomings of the current API should be mentioned. (We can opt to fix these before merging, but I haven't gotten around to it. Happy to do this sooner since it'd be an API break.)

Currently, the output renditions are fixed for the duration of the transcode session, but the user can pass in whatever they want for subsequent segments [1]. We attempt to "check" this with a rudimentary length check of the number of outputs, but this doesn't actually enforce that the configurations of those outputs match. A better API would look something like:

NewTranscoder(InputOptions, OutputOptions)
Transcode(InputFilenames, OutputFilenames)

That would mean defining new structs for NewTranscoder and Transcoder since we shouldn't re-use or modify the old parameter sets.

However, the in/out filenames might not be the only thing we legitimately want to pass in to the transcoder on a per-segment basis, so the parameters should be a struct.

[1] Example of poor API usage:

tr := NewTranscoder()
out1 := []TranscodeOptionsOut{ ... one set of renditions ... }
tr.Transcode(in1, out1)
// Most of this is basically ignored, apart from the output filename (which *should* vary)
// Different muxer options will be respected as well, but probably shouldn't for consistency
out2 := []TranscodeOptionsOut{ ... another set of renditions ... }
tr.Transcode(in2, out2)
tr.StopTranscoder()

Fixes livepeer/go-livepeer#997
Fixes livepeer/go-livepeer#951

@@ -194,7 +209,7 @@ func Transcode3(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeRes
 	filters := fmt.Sprintf("fps=%d/%d,%s='w=if(gte(iw,ih),%d,-2):h=if(lt(iw,ih),%d,-2)'", param.Framerate, 1, scale_filter, w, h)
 	if input.Accel != Software && p.Accel == Software {
 		// needed for hw dec -> hw rescale -> sw enc
-		filters = filters + ":format=yuv420p,hwdownload"
+		filters = filters + ",hwdownload,format=nv12"
Member:
Why the change from converting to yuv420p before hwdownload to converting to nv12 after hwdownload?

@j0sh (author):
The old way was actually broken - it only worked with the npp scaler, which we phased out in favor of CUDA before the original GPU PR was even merged. See livepeer/go-livepeer#951

This seems to work better now that we're using cuvid as a decoder, but I have to check that it actually fixes livepeer/go-livepeer#951

@@ -203,6 +218,13 @@ func Transcode3(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeRes
 		muxOpts.name = C.CString(p.Muxer.Name)
 		defer C.free(unsafe.Pointer(muxOpts.name))
 	}
+	// Set some default encoding options
+	if len(p.VideoEncoder.Name) <= 0 && len(p.VideoEncoder.Opts) <= 0 {
+		p.VideoEncoder.Opts = map[string]string{
Member:
Out of curiosity, in what scenarios wouldn't we want to enforce IDR frames or closed GOPs?

@j0sh (author):

We don't set it if the user passes in their own options. Practical scenarios for open GOP and non-intra I frames include non-segmented VOD or real time streaming where you want to take advantage of a larger range of pictures for the bitrate savings.

On checking again, nvenc doesn't actually support the cgop option, so it's useless for us now. Will remove. It was a leftover from trying to kludge x264 into doing persistent transcode sessions. We aren't doing that anymore, so this option is no longer needed.

The forced-idr option might not be needed either...

Member:

So forced-idr is in fact needed here?

@j0sh (author):

Yes, the forced-idr option is required - without it, we get decoder errors at the beginning of playback because SPS/PPS extradata isn't always emitted for the first frame of new segments, which we signal via an explicit keyframe. This option forces that.

Ideally this option wouldn't be needed at all; ffmpeg itself should be smart enough to emit extradata for the first frame after a flush. I may add that in.

Also just checked and it seems that disabling this option doesn't lead to any unit test failures - we really should capture this behavior in tests. Will make a note to correct that.

@yondonfu (Member):
In theory, if an existing handle was initialized with a certain set of output renditions, that handle could be re-used for any stream as long as the stream uses that set of output renditions, right?

Updating the transcode API so that the user can only pass in parameters that can change on a per-segment basis sounds like a good idea to me.

> However, the in/out filenames might not be the only thing we legitimately want to pass in to the transcoder on a per-segment basis, so the parameters should be a struct.

Do you have anything in particular in mind or would using a struct just be for future-proofing?

@darkdarkdragon (Contributor) left a comment:

Can't comment in code, so commenting here:

        // we hit this case when decoder is flushing; will be no input packet
        // (we don't need decoded frames since this stream is doing a copy)
        if (ipkt.pts == AV_NOPTS_VALUE) continue;

from av_read_frame documentation:

pkt->pts can be AV_NOPTS_VALUE if the video format has B-frames

So we will skip B-frames in case of stream copy?

@j0sh mentioned this pull request on Nov 21, 2019
@j0sh (author) commented Nov 21, 2019

So we will skip B-frames in case of stream copy?

@darkdarkdragon Damn! Have to test for this, but you're probably right. Good catch!

@j0sh force-pushed the ja/bottleneck branch 5 times, most recently from 4c3c91c to c423f33 (December 20, 2019)
@j0sh (author) commented Dec 20, 2019

Updating the transcode API so that the user can only pass in parameters that can change on a per-segment basis sounds like a good idea to me.

Opted not to do this for now - both because the actual API itself requires some more thought and discussion that I'd like to have some more time for, and because any such API would likely have knock-on effects on the go-livepeer integration that would require even more work there.

Since there isn't really an appetite for prolonging this PR even further, we can keep the API as-is for now.

cc @yondonfu

@darkdarkdragon (Contributor):
@j0sh what about my comment about skipping B-frames?

@darkdarkdragon (Contributor):

@j0sh Am I missing something, or is this
https://github.com/livepeer/lpms/pull/141/files#diff-5e694afe073853397a89ba4797aad0d6L662-L664

  av_init_packet(pkt);
  // loop until a new frame has been decoded, or EAGAIN
  while (1) {

not a loop anymore? It has an unconditional break at the end.

@j0sh (author) commented Jan 10, 2020

@j0sh what about my comment about skipping B-frames?

Not in this PR, since that's an unrelated issue.

this is not a loop anymore? It has unconditional break at the end?

Ah yeah - this PR didn't change anything there, but stream copy changed the behavior, so the comment had become outdated. Updated the comment, but avoided code changes in e34a473

@darkdarkdragon (Contributor):

Not in this PR, since that's an unrelated issue.

do we have an issue created in GitHub for this?

otherwise, LGTM, 🚢 !

@yondonfu (Member) left a comment:
Just had one clarifying question

So forced-idr is in fact needed here?

@j0sh (author) commented Jan 10, 2020

do we have issue created in Github for this?

Tracking here: #164

we really should capture this behavior in tests. Will make a note to correct that.

Tracking here: #165

@j0sh deleted the ja/bottleneck branch May 15, 2020
Successfully merging this pull request may close these issues:

  • LPMS Bottleneck Tasks
  • scale_cuda broken on GPU decode + resize, SW encode