
PLI dilution #3422

Closed
wants to merge 0 commits into from

Conversation

natikaltura

@natikaltura natikaltura commented Sep 8, 2024

Fix: overflow of PLI requests to Publishers (especially with simulcast)

This is a PR for the v0.14.3 branch (see PR# for the multistream)

We observed a significant issue while using the Janus Media Server on both the v0.14.3 branch and the v1.2.3 multistream branch. When VP8 simulcast is enabled with a group of viewers for a Publisher, the Publisher sometimes receives dozens of Picture Loss Indication (PLI) requests per second. Although these PLI requests don't directly lead to full frame retransmissions or additional outgoing networking overhead from the Publisher, they do trigger a lot of processing, resulting in CPU spikes on the Publisher.

Additional Context
While this PR addresses the overflow of PLI requests, a related issue remains unresolved. Specifically, when Chrome is used for desktop sharing, the generated video streams often operate at a very low frame rate (0.2–0.5 FPS). This low frame rate causes Janus to generate excessive PLIs. The issue arises primarily from the janus_rtp_simulcasting_context_process_rtp() function in rtp.c, which flags that the Publisher requires a PLI (->need_pli) whenever no packets have been received on a stream for more than 250,000 microseconds (250ms).
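To illustrate the mechanism described above, here is a minimal sketch of that time-based check (names and types are illustrative, not the actual rtp.c code): with a ~0.2–0.5 FPS screen share, the gap between packets is far longer than the threshold, so the flag keeps getting set.

```c
#include <stdint.h>

/* Illustrative threshold matching the 250,000us described above */
#define SIMULCAST_PLI_THRESHOLD_US 250000

/* Hypothetical stand-in for the simulcast context */
typedef struct sim_context {
	int64_t last_relayed_us;	/* when the last packet was relayed */
	int need_pli;				/* set when a keyframe request is due */
} sim_context;

/* Flag a PLI if the current substream looks stalled */
static void check_substream_timeout(sim_context *ctx, int64_t now_us) {
	if(ctx->last_relayed_us > 0 &&
			now_us - ctx->last_relayed_us > SIMULCAST_PLI_THRESHOLD_US) {
		/* With a very low FPS source this fires over and over,
		 * producing the flood of PLIs described above */
		ctx->need_pli = 1;
	}
}
```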

As a temporary fix, we applied a patch that identifies desktop sharing by checking for the “_desktop” string constant we append to the Publisher's display property. However, for a correct solution it might be beneficial to add a lowFps property for Publishers. This would allow for more precise logic to handle such cases effectively.

Thank you in advance for your input.
Nati Baratz

@tmatth
Contributor

tmatth commented Sep 8, 2024

@natikaltura you should change the target branch to 0.x

@natikaltura natikaltura changed the base branch from master to 0.x September 8, 2024 17:17
@natikaltura natikaltura marked this pull request as ready for review September 8, 2024 17:19
@lminiero
Member

lminiero commented Sep 9, 2024

@natikaltura thanks for the PRs! Answering here even though some considerations will apply to #3423 as well (where there's more changes due to the multiple streams).

I think the patches can be simplified, since you're adding a new volatile that's probably not needed. The main problem is that the fence we currently have doesn't seem to be working well with the multistream nature of Janus (the same function is invoked by multiple threads), and mostly because we only update fir_latest at the end of the function instead of the very beginning. Moving it earlier would IMO do the same thing that the new sending_pli thing you added tries to do.
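A minimal sketch of the fence idea being discussed, with illustrative names rather than the actual janus_videoroom code: recording fir_latest at the very start, before doing any work, means concurrent callers that race past the check won't each send their own PLI.

```c
#include <stdint.h>

/* Illustrative throttle matching the current 0.x 100ms threshold */
#define PLI_MIN_INTERVAL_US 100000

/* Hypothetical stand-in for the publisher state */
typedef struct publisher {
	int64_t fir_latest;	/* time of the last keyframe request */
} publisher;

/* Returns 1 if a PLI should actually be sent now, 0 if suppressed */
static int try_send_pli(publisher *p, int64_t now_us) {
	if(p->fir_latest > 0 && now_us - p->fir_latest < PLI_MIN_INTERVAL_US)
		return 0;			/* too soon since the last request */
	p->fir_latest = now_us;	/* update *before* sending, to narrow the race */
	/* ... actually send the PLI here ... */
	return 1;
}
```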

I don't like the change of the threshold from 100ms to 1 second. It will make responsiveness for new users (and simulcast changes) awful in some cases. While we can agree that 100ms may be too short (and probably the cause of your ~10 PLIs per second to the publisher), I don't see 1s as the answer, especially with the fence fix from the paragraph above.

Coming to multistream, I don't see the point of adding a new function. I'd just add a new property to the existing function to dictate a potential different behaviour (e.g., -1 does what it did before, a positive integer only pokes the specific stream). Not even sure why we need two? If we're starting to do drill down PLIs, why not only do that?

While this PR addresses the overflow of PLI requests, a related issue remains unresolved. Specifically, when Chrome is used for desktop sharing, the generated video streams often operate at a very low frame rate (0.2–0.5 FPS). This low frame rate causes Janus to generate excessive PLIs.

As you noticed, that's because the simulcast code defaults to 250ms, but that's a (reasonable) default. You can override that with the fallback property when subscribing or using configure. As such, I don't see this as an issue, unless by lowFps you mean give a way to the publisher to specify the fallback that should be used as default for that specific stream, so that subscribers don't have to do it themselves.
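For reference, this is roughly how a subscriber can override that default via the fallback property mentioned above (value in microseconds; the 2-second figure here is only an illustration):

```json
{
	"request": "configure",
	"fallback": 2000000
}
```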

@natikaltura
Author

natikaltura commented Sep 9, 2024

@natikaltura thanks for the PRs! Answering here even though some considerations will apply to #3423 as well (where there's more changes due to the multiple streams).

@lminiero Thanks a lot for the comments!

I’ll fix both PRs after getting your inputs about the following:

I think the patches can be simplified, since you're adding a new volatile that's probably not needed. The main problem is that the fence we currently have doesn't seem to be working well with the multistream nature of Janus (the same function is invoked by multiple threads), and mostly because we only update fir_latest at the end of the function instead of the very beginning. Moving it earlier would IMO do the same thing that the new sending_pli thing you added tries to do.

Thanks! I'll remove the sending_pli volatile from the 0.x stream since it seems unnecessary with the other two PR changes:

  1. Moving update_fir_latest to the beginning
  2. Calling janus_videoroom_reqpli() in janus_videoroom_relay_rtp_packet() instead of directly calling gateway->send_pli()

I originally copied the sending_pli volatile from the multistream implementation.
Do you think it’s needed in the main branch? If so I'd be happy to include the removal fix in the second PR.

I don't like the change of the threshold from 100ms to 1 second. It will make responsiveness for new users (and simulcast changes) awful in some cases. While we can agree that 100ms may be too short (and probably the cause of your ~10 PLIs per second to the publisher), I don't see 1s as the answer, especially with the fence fix from the paragraph above.

Just to emphasize the need for this fix, I have logs showing 200 or more PLIs per second (for each layer), especially with screen sharing on challenging networks :)

  1. I took the 1-second threshold from the main (multistream) branch, used in janus_videoroom.c -> janus_videoroom_rtcp_pli_send() function. Do you think this could be an issue?
  2. Would 500ms make more sense, or is that still too high? And let's keep in mind that it's sent for each layer.

Coming to multistream, I don't see the point of adding a new function. I'd just add a new property to the existing function to dictate a potential different behaviour (e.g., -1 does what it did before, a positive integer only pokes the specific stream). Not even sure why we need two? If we're starting to do drill down PLIs, why not only do that?

Sure, I'll merge them into a single function.
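The unified helper being discussed could look roughly like this (hypothetical names and types, not the actual multistream code): stream_index == -1 keeps the old behaviour of poking every video stream, while a non-negative index only pokes that specific stream.

```c
#define ALL_STREAMS -1

/* Hypothetical stand-in for a media stream in the multistream code */
typedef struct stream {
	int video;		/* is this a video stream? */
	int pli_sent;	/* counter standing in for the actual PLI send */
} stream;

/* Request a PLI on all video streams, or only on stream_index */
static void reqpli(stream *streams, int nstreams, int stream_index) {
	int i;
	for(i = 0; i < nstreams; i++) {
		if(!streams[i].video)
			continue;
		if(stream_index != ALL_STREAMS && stream_index != i)
			continue;	/* drill-down: only the requested stream */
		streams[i].pli_sent++;
	}
}
```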

While this PR addresses the overflow of PLI requests, a related issue remains unresolved. Specifically, when Chrome is used for desktop sharing, the generated video streams often operate at a very low frame rate (0.2–0.5 FPS). This low frame rate causes Janus to generate excessive PLIs.

As you noticed, that's because the simulcast code defaults to 250ms, but that's a (reasonable) default. You can override that with the fallback property when subscribing or using configure. As such, I don't see this as an issue, unless by lowFps you mean give a way to the publisher to specify the fallback that should be used as default for that specific stream, so that subscribers don't have to do it themselves.

Thanks! I realize I missed the full picture...
That said, since we implemented our patch on the Publisher side, maybe the best practice would be to add the fallback property for Publishers as well, when publishing or using configure.
If you find it useful for other users, we will try to implement it in the PRs.

@natikaltura natikaltura closed this Sep 9, 2024
@natikaltura natikaltura reopened this Sep 9, 2024
@DenisSicunKaltura

Hi @lminiero, have you had a chance to review @natikaltura's questions?
I wanted to let you know that we are running these changes in production and they seem to make a big difference.
We would love your input on the multistream changes as well, so we can fix both PRs.

Thanks!

@lminiero
Member

lminiero commented Oct 1, 2024

@DenisSicunKaltura thanks for the heads up and sorry for missing this. I remember reading the mail notification but being unable to answer right away, and then I forgot about that. Answering the relevant points below.

I originally copied the sending_pli volatile from the multistream implementation.
Do you think it’s needed in the main branch? If so I'd be happy to include the removal fix in the second PR.

IIRC we added it to multistream because there you can have multiple video m-lines, all of them potentially asking for a PLI at the same time, but I may be wrong. It's not a bad idea to keep it in 0.x as well then, if only to make maintenance easier, rather than removing it.

  1. I took the 1-second threshold from the main (multistream) branch

I should really remember my own code better 🤣
My guess is that the reason is the same as the above, but if we had it there and it works it does indeed make sense for 0.x as well.

2. Would 500ms make more sense, or is that still too high? And let's keep in mind that it's sent for each layer.

I don't have a specific suggestion on how long that should be, just reasonings on response times and what would work best with many new subscribers coming in, or many subscribers to the same sources switching feeds. If you can make some experiments with different thresholds and evaluate what looks like a good trade-off in your environments (since you have big deployments to test this with) it could indeed help. We could even make it configurable, I guess, even though that would be something to do in a separate effort, in case, as it wouldn't be clear what the API would need to be.

If you find it useful for other users, we will try to implement it in the PR's

While it makes sense as an enhancement, I'd rather keep the "default simulcast threshold for publishers" effort to a completely different PR. It's best not to cram too much stuff in the same PR, especially when they touch different things and for different requirements. This PR is good to address the "too many PLIs" issue, but for simulcast switching we have an API already, so a separate effort may provide a different approach to the existing one.

@DenisSicunKaltura

Thanks for the quick answer @lminiero !

I don't have a specific suggestion on how long that should be, just reasonings on response times and what would work best with many new subscribers coming in, or many subscribers to the same sources switching feeds. If you can make some experiments with different thresholds and evaluate what looks like a good trade-off in your environments (since you have big deployments to test this with) it could indeed help. We could even make it configurable, I guess, even though that would be something to do in a separate effort, in case, as it wouldn't be clear what the API would need to be.

As I mentioned, we deployed it to our production. Over the last few weeks, 1 second seems to work really well, also for large sessions with hundreds of simultaneous subscribers (joining/switching feeds). I think it's good to keep it at a fairly high value for users coming from challenging networks or over long geographical distances.
If we find anything new about this, we will update.
I agree that making it configurable might be overkill in this case, as it's for a very specific use case. Maybe as part of the create/configure room requests? Anyway, not now :)

After those clarifications, I think that @natikaltura will complete the rest of the PR comments soon.

@lminiero
Member

lminiero commented Oct 1, 2024

Thanks for the feedback! Waiting for the updates then ✌️

@lminiero lminiero added the legacy Related to Janus 0.x label Oct 8, 2024
natikaltura added a commit to natikaltura/janus-gateway that referenced this pull request Nov 7, 2024
natikaltura added a commit to natikaltura/janus-gateway that referenced this pull request Nov 7, 2024
@natikaltura natikaltura closed this Nov 7, 2024
@natikaltura natikaltura deleted the PLI_dilution branch November 7, 2024 16:11
@natikaltura
Author

Hi @lminiero. I've completed the two PRs, see #3476 (instead of this one) and #3423 (for the multistream)
