
LPMS bottleneck fix #141

Merged · 4 commits · Jan 10, 2020

Conversation

@j0sh (Collaborator) commented Aug 3, 2019

The commits in this PR are rebased and squashed from the following WIP branch: https://github.com/livepeer/lpms/compare/ja/old-bottleneck

Background

There is a substantial overhead cost to starting new transcode sessions on a GPU. Adding more streams and GPUs exacerbates the problem [1]. Up until now, a new GPU session has been started for each segment that's received. This has led to a massive slowdown in transcoding latency, and becomes worse with each additional concurrent stream.

To mitigate this setup overhead, we make transcode sessions persistent for each stream, so segments after the first can re-use the same session.

[1] There may be a global lock somewhere within FFmpeg or the Nvidia drivers, or it may be intrinsic.

LPMS API changes:

Transcoding becomes stateful: a transcode session is persisted on the GPU in between segments. The caller becomes responsible for terminating the session. Working with the transcoder now involves allocating the transcoder, transcoding, and stopping the transcoder. The API is as follows:

func NewTranscoder() *Transcoder
func (t *Transcoder) StopTranscoder()
func (t *Transcoder) Transcode(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeResults, error)

The Transcoder struct is currently opaque to outside packages - nothing within it needs to be exposed at the moment.

The signature of the Transcode function matches the existing Transcode3 function signature. Existing transcoding APIs are re-defined in terms of this new API and will continue to work unmodified in older programs.

Note that this doubles down on the "one segment at a time" restriction. In order to process multiple segments for the same job in parallel, a new session needs to be created. However, creating a session may end up being very slow, so sessions should be reused wherever possible.

The older Transcode3 API is preserved for single-shot transcoding, and is implemented as a wrapper around the three-step API.

Code Flow and Changes

  • Muxer, Filtergraphs and Audio: No change from what we have now. Always recreated in between sessions. Enough code may have been moved around to obscure this fact, but there should be no net effect on the muxer, filtergraphs and audio as a result of this PR.

  • Decoding: Reused in between sessions. The number of streams and the indices of the audio / video components are also cached in between segments in order to avoid a latency-incurring probing step. If the stream ordering changes in between segments, things will likely break. (This shouldn't happen under most circumstances, but isn't unheard of, e.g. if there is a reconfiguration somewhere at the source; in that case the best thing to do would be to create a new session rather than reuse an existing one.)

Decoder buffering is a problem: sometimes we have finished feeding in all packets from a segment but haven't received all pending frames from the decoder. To flush the decoder buffer, we send a dummy packet with a sentinel timestamp and keep feeding the decoder until we receive a decoded frame back with that sentinel timestamp, discarding any frames that carry it. The dummy packet is taken from the first packet of each segment (and reset for subsequent segments).

  • Encoding: The encoder session is kept alive in between segments if a hardware encoder is used. If a software encoder is used, then the encoder is reset. This fact alone makes the set-up and clean-up flows conditional and rather confusing.

For hardware encoding, we currently don't do anything to explicitly flush the encoder in between segments. This seems OK for now - we receive all the frames we're supposed to, according to our tests - but it makes me nervous that certain inputs or configurations will leave us with buffered frames (eg, possibly B-frames). (There is a solution that might work for flushing the Nvidia encoder, but it involves modifying FFmpeg in questionable ways.)

API Usability

Some usability shortcomings of the current API should be mentioned. (We can opt to fix these before merging, but I haven't gotten around to it. Happy to do this sooner since it'd be an API break.)

Currently, the output renditions are fixed for the duration of the transcode session, but the user can pass in whatever they want for subsequent segments [1]. We attempt to "check" this with a rudimentary length check of the number of outputs, but this doesn't actually enforce that the configurations of those outputs match. A better API would look something like:

NewTranscoder(InputOptions, OutputOptions)
Transcode(InputFilenames, OutputFilenames)

That would mean defining new structs for NewTranscoder and Transcoder since we shouldn't re-use or modify the old parameter sets.

However, the in/out filenames might not be the only thing we legitimately want to pass in to the transcoder on a per-segment basis, so the parameters should be a struct.

[1] Example of poor API usage:

tr := NewTranscoder()
out1 := []TranscodeOptionsOut{ ... one set of renditions ... }
tr.Transcode(in1, out1)
// Most of this is basically ignored, apart from the output filename (which *should* vary)
// Different muxer options will be respected as well, but probably shouldn't for consistency
out2 := []TranscodeOptionsOut{ ... another set of renditions ... }
tr.Transcode(in2, out2)
tr.StopTranscoder()

Fixes livepeer/go-livepeer#997
Fixes livepeer/go-livepeer#951

@@ -194,7 +209,7 @@ func Transcode3(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeRes
 	filters := fmt.Sprintf("fps=%d/%d,%s='w=if(gte(iw,ih),%d,-2):h=if(lt(iw,ih),%d,-2)'", param.Framerate, 1, scale_filter, w, h)
 	if input.Accel != Software && p.Accel == Software {
 		// needed for hw dec -> hw rescale -> sw enc
-		filters = filters + ":format=yuv420p,hwdownload"
+		filters = filters + ",hwdownload,format=nv12"
Member:
Why the change from converting to yuv420p before hwdownload to converting to nv12 after hwdownload?

@j0sh (author):
The old way was actually broken - it only worked with the npp scaler, which we phased out in favor of CUDA before the original GPU PR was even merged. See livepeer/go-livepeer#951

This seems to work better now that we're using cuvid as a decoder, but I have to check that it actually fixes livepeer/go-livepeer#951

@@ -203,6 +218,13 @@ func Transcode3(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeRes
 		muxOpts.name = C.CString(p.Muxer.Name)
 		defer C.free(unsafe.Pointer(muxOpts.name))
 	}
+	// Set some default encoding options
+	if len(p.VideoEncoder.Name) <= 0 && len(p.VideoEncoder.Opts) <= 0 {
+		p.VideoEncoder.Opts = map[string]string{
Member:
Out of curiosity, in what scenarios wouldn't we want to enforce IDR frames or closed GOPs?

@j0sh (author):

We don't set it if the user passes in their own options. Practical scenarios for open GOP and non-intra I frames include non-segmented VOD or real time streaming where you want to take advantage of a larger range of pictures for the bitrate savings.

On checking again, nvenc doesn't actually support the cgop option, so it's useless for us now. Will remove. It was a leftover from trying to kludge x264 into doing persistent transcode sessions. We aren't doing that anymore, so this option is no longer needed.

The forced-idr option might not be needed either...

Member:

So forced-idr is in fact needed here?

@j0sh (author):

Yes, the forced-idr option is required - without it, we get decoder errors at the beginning of playback because SPS/PPS extradata isn't always emitted for the first frame of new segments, which we signal via an explicit keyframe. This option forces that.

Ideally this option wouldn't be needed at all; ffmpeg itself should be smart enough to emit extradata for the first frame after a flush. I may add that in.

Also just checked and it seems that disabling this option doesn't lead to any unit test failures - we really should capture this behavior in tests. Will make a note to correct that.

@yondonfu (Member):
In theory, if an existing handle was initialized with a certain set of output renditions, that handle could be re-used for any stream as long as the stream uses that set of output renditions, right?

Updating the transcode API so that the user can only pass in parameters that can change on a per-segment basis sounds like a good idea to me.

> However, the in/out filenames might not be the only thing we legitimately want to pass in to the transcoder on a per-segment basis, so the parameters should be a struct.

Do you have anything in particular in mind or would using a struct just be for future-proofing?

@darkdarkdragon (Contributor) left a comment:

Can't comment in code, so commenting here:

        // we hit this case when decoder is flushing; will be no input packet
        // (we don't need decoded frames since this stream is doing a copy)
        if (ipkt.pts == AV_NOPTS_VALUE) continue;

from av_read_frame documentation:

pkt->pts can be AV_NOPTS_VALUE if the video format has B-frames

So we will skip B-frames in case of stream copy?

@j0sh mentioned this pull request on Nov 21, 2019
@j0sh (author) commented Nov 21, 2019

So we will skip B-frames in case of stream copy?

@darkdarkdragon Damn! Have to test for this, but you're probably right. Good catch!

@j0sh force-pushed the ja/bottleneck branch 5 times, most recently from 4c3c91c to c423f33 (December 20, 2019)
@j0sh (author) commented Dec 20, 2019

Updating the transcode API so that the user can only pass in parameters that can change on a per-segment basis sounds like a good idea to me.

Opted not to do this for now - both because the actual API itself requires some more thought and discussion that I'd like to have some more time for, and because any such API would likely have knock-on effects on the go-livepeer integration that would require even more work there.

Since there isn't really an appetite for prolonging this PR even further, we can keep the API as-is for now.

cc @yondonfu

@darkdarkdragon (Contributor):
@j0sh what about my comment about skipping B-frames?

@darkdarkdragon (Contributor):

@j0sh Am I missing something, or is this
https://github.com/livepeer/lpms/pull/141/files#diff-5e694afe073853397a89ba4797aad0d6L662-L664

  av_init_packet(pkt);
  // loop until a new frame has been decoded, or EAGAIN
  while (1) {

not a loop anymore? It has an unconditional break at the end.

@j0sh (author) commented Jan 10, 2020

@j0sh what about my comment about skipping B-frames?

Not in this PR, since that's an unrelated issue.

this is not a loop anymore? It has unconditional break at the end?

Ah yeah - this PR didn't change anything there, but stream copy changed the behavior, so the comment had become outdated. Updated the comment, but avoided code changes in e34a473

@darkdarkdragon (Contributor):

Not in this PR, since that's an unrelated issue.

do we have an issue created in GitHub for this?

otherwise, LGTM, 🚢 !

@yondonfu (Member) left a comment:
Just had one clarifying question

So forced-idr is in fact needed here?

@j0sh (author) commented Jan 10, 2020

do we have issue created in Github for this?

Tracking here: #164

we really should capture this behavior in tests. Will make a note to correct that.

Tracking here: #165

@j0sh deleted the ja/bottleneck branch May 15, 2020
Successfully merging this pull request may close these issues:

  • LPMS Bottleneck Tasks
  • scale_cuda broken on GPU decode + resize, SW encode