Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extended receive-timeout from 1ms to 200ms. #24

Open
wants to merge 1 commit into
base: 4.3
Choose a base branch
from

Conversation

bradfordben
Copy link

This value was too small resulting in messages being dropped by mod_kazoo's erlang message
receiver during high load. 200 is the default value set in mod_kazoo if the value is not
defined in the config file.

resulting in messages being dropped by mod_kazoo's erlang message
receiver.
@bradfordben
Copy link
Author

bradfordben commented Nov 12, 2020

@jamesaimonetti This was the issue to the problem i was talking to you about last week.
@lazedo do you know why it was set to 1 in the configs, im trying to understand if there is a reason for this to be 1?

@dschreiber
Copy link
Member

This should not be merged IMHO.

My understanding on the above is that you want that number as low as possible. 1 ships by default because most people run all their stuff in the same datacenter/LAN. Remember that you are setting a TIMEOUT value - i.e. "how long do I wait for a response" and this is for receive from ecallmgr. Well, that inherently means that if ecallmgr isn't in the process of sending anything to FS it's going to sit there waiting for 200ms. But if FS is wanting to send back stuff to ecallmgr you now have a block, well that is my understanding anyway, or something similar. The reason we even made that variable is once upon a time those blocks added up, to the point where they actually delayed call handling and caused a myriad of other issues. So 1 is best if you can use it.

One strategy to decide what value to use there would be to ping from your FS server to your apps server, cause really the app itself should be responding near-instantly to requests. So the latency is really your timeout issue - meaning if the message arrives after the timeout value due to latency it will be lost. (edited)

So if you have 50ms latency because your stuff is not in the same DC you could maybe set it to 70 or 80 to give some buffer.

200 as a default seems like a bad idea to me but maybe it will work depending on your volume/load. Just sounds really high, latency of that amount would be like, I dunno, pinging Australia, I hope your stuff is not that slow! Something else may be wrong if you really need a latency / receive timeout that's that high.

So while you have indeed probably found a "fix" for your particular setup I am not sure I would call that a bug or go around encouraging everyone to change theirs, as you may actually cause other people's setups to get worse.

It is a value that should be tuned for your particular setup.

@bradfordben
Copy link
Author

@dschreiber Thanks for your reply, You make a good point that 200 might be too high but i feel that 1 is too low, we picked 200 as it was hard coded as the default value but maybe that is not a good value if you have seen the messages build up on the FS side.

The problem is that if ei_xreceive_msg_tmo is only given 1ms to receive the full message that it can timeout during the receive and buffering of a message and then the message is lost. The timeout is not just how long to wait for a message before moving on but also how long it can give to the receive and buffering of a message before it gives up and return an error. 1ms just seems like a very small time window to have as a default value.

I can conform that we had this issue in our production env where our FS and ecallmgr are on the same LAN (Ive just checked pings and they range from 0.186ms to 0.118ms)

If you dont want to increase it then that is ok with me as i understand your reasoning, i just wanted to make sure you were aware of the impact of 1ms.

@bradfordben bradfordben force-pushed the extend_mod_kazoo_receive_timout branch from 966c45b to 043847a Compare November 13, 2020 22:51
@bradfordben
Copy link
Author

Changed to 5ms as per discussion with @dschreiber

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants