Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket communication error #423

Open
batzkass opened this issue Dec 4, 2020 · 7 comments
Open

Websocket communication error #423

batzkass opened this issue Dec 4, 2020 · 7 comments

Comments

@batzkass
Copy link

batzkass commented Dec 4, 2020

Hi Davide,

I finally upgraded Remi from 2019.11 to 2020.11.20 some days ago, so that I can benefit from the bugfix of the connection drop we worked on. It seems that many things changed in Remi since one year and it took me less than 10 minutes to adapt my server code (which is kind of huge now), so congratulations for your efforts to make it easy to maintain.

I saw that you worked on the websocket transmissions on both server and client side, and unfortunately I'm facing a new websocket bug that wasn't on 2019.11. It took me hours to reproduce, isolate and finally create a minimal (not-)working example.

Conditions to reproduce:

  • The client/server must be linked by a "not too fast" connection. I mean this issue doesn't appear when they are on the same machine, wired network, or even very fast wifi. The server on which I'm running Remi usually for production is 500km from me, it is behind some proxies, and I'm connected through wifi to my router at home : this is enough to reproduce the issue. The best way I found to reproduce finally 100% of the time is to run remi at home, open the remi listening port from the Internet, and connect though 4G.
  • The server must send multiple successive websocket messages on a single UI action. The file attached is a basic python script that reproduces well the issue : there is a column (VBox) of 100 Labels, and at the top a Button. The button triggers a function that successively updates the Labels text, hence sending 100 websocket messages in a row. If you don't succeed to reproduce it, increasing from 100 to 1000 the number of Labels should increase the probability of occurrence.
  • Browser: I tried firefox, chrome, and brave (on my android device), all of them fail.
  • I tried ubuntu 18.04 and 19.10, both with remi 2020.11.20.

Symptoms:

  • When you initially load the page everything works well, as the whole page is sent in a single message. Then when you press the button, the first messages are ok, then randomly at a certain point the websocket connexion fails. As far as I understood the client browser triggers _ws.onerror, making the server to close the connection. On the client side, we see for a short time Remi's "loading folding square". After that, a new websocket is automatically established and the page is fully re-sent to the client.
  • Log on the server side
remi.request     DEBUG    App.onpageshow event occurred
remi.server.ws   DEBUG    send_message: 3... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    on_message: callback
remi.request     DEBUG    App.onresize event occurred. Width:400 Height:584
remi.server.ws   DEBUG    send_message: 3... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    on_message: callback
remi.server.ws   DEBUG    send_message: 1139974414727696,%3Cp%20id%3D%22139974414727696%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974414727632,%3Cp%20id%3D%22139974414727632%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974415166160,%3Cp%20id%3D%22139974415166160%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974415164176,%3Cp%20id%3D%22139974415164176%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974415164944,%3Cp%20id%3D%22139974415164944%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974414919440,%3Cp%20id%3D%22139974414919440%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974414921104,%3Cp%20id%3D%22139974414921104%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
[...]
remi.server.ws   DEBUG    send_message: 1139974396841040,%3Cp%20id%3D%22139974396841040%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974396363472,%3Cp%20id%3D%22139974396363472%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974396365200,%3Cp%20id%3D%22139974396365200%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974396366800,%3Cp%20id%3D%22139974396366800%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.server.ws   DEBUG    send_message: 1139974396397264,%3Cp%20id%3D%22139974396397264%22%20class%3D%22Label%22%20data-parent-widget%3D%221... -> ('37.170.119.51', 2518)
remi.request     DEBUG    removing websocket instance, communication error with client
remi.server.ws   ERROR    Error managing incoming websocket message
Traceback (most recent call last):
  File "/home/francois/.local/lib/python3.7/site-packages/remi/server.py", line 165, in read_next_message
    length = self.bytetonum(length[1]) & 127
IndexError: index out of range
remi.server.ws   DEBUG    ws ending websocket service
remi.server.ws   INFO     connection established: ('37.170.119.51', 2524)
remi.server.ws   DEBUG    handle
remi.server.ws   DEBUG    handshake
remi.server.ws   INFO     handshake complete
remi.server.ws   DEBUG    send_message: 0139974414943632,%3Cdiv%20id%3D%22loading%22%20class%3D%22Container%22%20data-parent-widget%3D%22139... -> ('37.170.119.51', 2524)

  • Log on the client side:
opening websocket
Connection is closed... event code: 1006, reason: 
failed connections=1 queued messages=0
opening websocket

Misc. notes
This issue affects my code as, when I do a refresh, I also sometimes send some "user-defined" javascript code to the browser. If the issue happens before my manual javascript sending, so it's not executed and the page doesn't display fully. Apart from that, the client only sees the "waiting rectangle" then a new connection initiate. Also, I'm a bit confused as this bug was not present in Remi 2019.11.

NB: in the log above I increased the number of chars displayed by the sever in server.py send_message to see a little bit more.

Could you please have a look and try to reproduce this ?
Thanks,
François

@dddomodossola
Copy link
Collaborator

Hello @batzkass, wonderful exhaustive bug report, thank you.
There is a server parameter to correct the websocket communication timeout. In your case a longer timeout should fix the problem. The parameter is websocket_timeout_timer_ms. It have to be passed to the start function call. Default value is 1000 (1 second).
Please let me know if this solves your problem.
Have a nice day.

@batzkass
Copy link
Author

batzkass commented Dec 4, 2020

Unfortunately this doesn't change anything, I went up to 10000. In addition, when the issue occurs it's almost instantaneous (subsequent to my click on the button), way less than 1 second. It looks like a sending buffer that is filled faster than it is emptied (by actual sending). Is there somewhere this kind of buffer size in the websocket lib ?

@batzkass
Copy link
Author

batzkass commented Dec 5, 2020

I think about something else : to solve the last issue we had, you added a test, whether the socket is writable or not. If the previous message is still not fully received or acknowledged, the socket may not be writable for the current one. Perhaps we should wait a bit an retry in this case before closing the connexion ?

@batzkass
Copy link
Author

batzkass commented Dec 5, 2020

Well, the following code seems to fix the issue, at least for the minimal example I gave on 4G connection and 5000 (!) labels.

    def send_message(self, message):
        if not self.handshake_done:
            self._log.warning("ignoring message %s (handshake not done)" % message[:10])
            return False

        self._log.debug('send_message: %s... -> %s' % (message[:100], self.client_address))
        out = bytearray()
        out.append(129)
        length = len(message)
        if length <= 125:
            out.append(length)
        elif 126 <= length <= 65535:
            out.append(126)
            out += struct.pack('>H', length)
        else:
            out.append(127)
            out += struct.pack('>Q', length)
        if not pyLessThan3:
            message = message.encode('utf-8')
        out = out + message
        for i in range(10):
            readable, writable, errors = select.select([], [self.request,], [], 0) #last parameter is timeout, when 0 is non blocking
            #self._log.debug('socket status readable=%s writable=%s errors=%s'%((self.request in readable), (self.request in writable), (self.request in error$
            writable = self.request in writable
            if writable:
                self.request.sendall(out)
                return True
            time.sleep(0.1)
        return False

EDIT: ... but I get again bad results on connection drop (server not reacting to clicks for example...). I hope all these information will help you.

@dddomodossola
Copy link
Collaborator

dddomodossola commented Dec 6, 2020

@batzkass thank you very much for all this debug and support. Instead of creating a for loop it could be possible that a timeout other than 0 would solve the problem in this function call:
readable, writable, errors = select.select([], [self.request,], [], 0)

Or maybe we have to check which kind of error is returned.
I will be able to test this on Wednesday ;-)

@batzkass
Copy link
Author

batzkass commented Dec 7, 2020

Well, a timeout of 2s seems to fix the issue (without the loop). But let me a few days to confirm this on real conditions.
EDIT: it works 100% even with brutal connection drops. Thanks for the fix, this will drastically improve the user experience of our app. I'm just waiting for the pypi release to update the server instances I have.

dddomodossola added a commit that referenced this issue Feb 15, 2021
@dddomodossola
Copy link
Collaborator

@batzkass This should be fixed now (with a very long delay, excuse me). If you can give it a try it would be great, if you cannot than no worries. ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants