Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In case of HARDWARE_ERROR event serial.disconnect() never returns #197

Open
mvalla opened this issue Nov 23, 2020 · 18 comments · Fixed by #210
Open

In case of HARDWARE_ERROR event serial.disconnect() never returns #197

mvalla opened this issue Nov 23, 2020 · 18 comments · Fixed by #210
Milestone

Comments

@mvalla
Copy link

mvalla commented Nov 23, 2020

Hallo,
I am using 5.2.1 on Win 10 x64 and java 8 (1.8.0_172)
When I run:
https://github.com/NeuronRobotics/nrjavaserial/blob/master/test/src/test/ReadTest.java
in case of HARDWARE_ERROR event, serial.disconnect() never returns.
In fact if add:

System.out.println("Clean exit of hardware");
serial.disconnect();
System.out.println("after disconnect");

"after disconnect" is never printed.
Thread [RXTXPortMonitor(//./COM3)] is still running, and if I suspend it I get this stack:

RXTXPort.interruptEventLoop() line: not available [native method]	
RXTXPort.removeEventListener() line: 867	
RXTXPort.close() line: 1091	
NRSerialPort.disconnect() line: 153	
MySerialDisconnectTest.lambda$0(DataInputStream, NRSerialPort, SerialPortEvent) line: 42	
1740189450.serialEvent(SerialPortEvent) line: not available	
RXTXPort.sendEvent(int, boolean) line: 787	
RXTXPort$MonitorThread.run() line: 1628	
@MrDOS
Copy link
Contributor

MrDOS commented Nov 23, 2020

Interesting. A quick glance at RXTXPort.interruptEventLoop() doesn't reveal anything that should get “stuck”. The linked-list search is the obvious suspect at a glance, but it's not possible to open a port without populating that list, and nothing every removes old entries from that list, so that shouldn't be it. But the only code after that is Windows-specific, none of which looks like it should block. So it must be getting hung up in that search loop.

How are you reproducing the HARDWARE_ERROR event? What serial port device are you using, a USB-to-serial dongle? Can you tell us what chipset it's using, or if you don't know, can you provide the exact part number of the device, or a link to a webpage for it?

@mvalla
Copy link
Author

mvalla commented Nov 24, 2020

I create the HARDWARE_ERROR by removing my dongle from the USB port.
The dongle is a ZigBee dongle by Legrand/BTicino based on EM3582 chipset from Silicon Labs.
Another hint, maybe related: in another test class I use with JUnit, after HARDWARE_ERROR and when the test ends these logs are printed on std out:

Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2611): Access is denied.

which test can I do to help indentify the problem?

@d5smith
Copy link

d5smith commented Jan 27, 2021

I'm reproducing this constantly on macos (Catalina and BigSur) with Java 15 except it causes a SIGSEGV and crashes the JVM :(

It appears to be a timing issue on the destruction of RXTXPort and RXTXPort.MonitorThread.
When the native layer detects the device is gone it throws
java.io.IOException: Device not configured in nativeavailable
If I break there it forestalls the crash

I modified the ReadTest to add a loop that just check available() on the input stream and I trigger the HARDWARE_ERROR by disconnecting the instrument ffrom USB.

Is there anything I can add to help troubleshoot?

@MrDOS
Copy link
Contributor

MrDOS commented Jan 27, 2021

Thanks for the data points. From your notes, I've written this reproduction test case. Unfortunately, I cannot actually reproduce the issue on macOS 10.14 (Mojave) with Java 11 or 15 and the only serial device I have near to hand, an FTDI FT2232D-based adapter. I'll try again on a Windows box, but given the spread of OSes and JVM versions already at play, I think the cause of the issue is probably more related to driver-specific timings. I can certainly try to identify a solution without having a local reproduction, but it'll be harder to nail down.

What hardware are you using for this? The hardware @mvalla mentioned seemed prohibitively expensive to get for a test case. I'm crossing my fingers that yours is something more common.

@d5smith
Copy link

d5smith commented Jan 28, 2021

Thanks for the reply.

I'm using RBR CTDs, ;)
Trying to upgrade Ruskin to use the latest nrjavaserial but the JVM crashing on Macs on instrument disconnect is blocking that.
I'm also able to consistently reproduce it with just one of these .

I'm able to reproduce the issue with your test case (which is simpler than mine) when I debug it in eclispe (with no breakpoints set). If I run it however it does not error.

My initial test case simply added a loop to ReadTest.java before the catch block so as to keep the thread alive while running the test. This fails consistantly
and the offending frame is always:
C [libNRJavaSerial.jnilib+0x4574] drain_loop+0x34

I'll see if I can attach a debugger to the native code and find out anything more.

@mvalla
Copy link
Author

mvalla commented Jan 28, 2021

Just tested your DisconnectTest.java using Aeontec Z-Stick Gen5 ZW090-C, nrjavaserial 5.2.1 on Win 10 x64 and java 11.0.8.
From command line I do not experience any JVM crash, but I get this logger lines and some system outs:

22:33:51.585 [INFO ] [main        ] [org.serial.test.DisconnectTest:19   ] - Opened port COM6 and beginning to read. Trigger a hardware error now.
22:33:57.300 [INFO ] [main        ] [org.serial.test.DisconnectTest:26   ] - Caught an exception
java.io.IOException: Result too large in nativeavailable
        at gnu.io.RXTXPort.nativeavailable(Native Method)
        at gnu.io.RXTXPort$SerialInputStream.available(RXTXPort.java:1626)
        at org.serial.test.DisconnectTest.main(DisconnectTest.java:23)
22:33:57.300 [INFO ] [main        ] [org.serial.test.DisconnectTest:29   ] - Going to disconnect the port...
22:33:57.500 [INFO ] [main        ] [org.serial.test.DisconnectTest:31   ] - Disconnected. If you see this, crash reproduction failed.
Error 0x3e3 at src/windows/termios.c(512): The I/O operation has been aborted bError 0x3e3 at src/windows/termios.c(2725): The I/O operation has been aborted Error 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(512): The device does not recognize the comError 0x16 at src/windows/termios.c(2725): The device does not recognize the coError 0x16 at src/windows/termios.c(2611): The device does not recognize the co

@d5smith
Copy link

d5smith commented Jan 28, 2021

Here's the ReadTest variant I am using.
Crashes every time on OSX

@MrDOS
Copy link
Contributor

MrDOS commented Jan 29, 2021

I'm using RBR CTDs, ;)

Ah, right – that David Smith! 🙂

Thanks for sharing your test code. I can reliably reproduce the SIGSEGV crash under both Java 8 and 15 on Mojave with that. Turns out my reproduction didn't catch the segfault because it bailed too soon: as soon as the hardware error was thrown, it exited. Your test loops on the thread being interrupted on line 64, but nothing ever interrupts it, so it continues to live on even after the serial port has gone away – and lives on long enough for the monitor thread to crash badly.

I've updated my test case to include some wait time after disconnect for the monitor thread to go down in flames. I've also retooled it to use the more conventional JCA-based API. I can reproduce the segfault with this on NRJavaSerial v5.1.0, but not the previous version, v5.0.2. The big change between those two versions was #172, which introduced the HARDWARE_ERROR event. This crash is likely due to an oversight in some of the post-disconnect cleanup.

Aeontec Z-Stick Gen5 ZW090-C

Well, that's cheaper (and easier to get) than the previous adapter at least, but I'd still like to see if I can get this fixed without buying hardware to reproduce. I'll hunt down the segfault first because I now have a good reproduction case for that, and hopefully getting that fixed will resolve the hang too. I'm getting less sure that the two symptoms are really caused by the same bug and not by two separate bugs that both happen to be triggered by hardware disconnects. Regardless, thanks for chiming in with reproduction notes so quickly.

@mvalla
Copy link
Author

mvalla commented Jan 29, 2021

I've updated my test case

If I run the test case on windows no crash, but the following logs are returned.

Opened port COM3 and beginning to read. Trigger a hardware error now.
Caught an exception: Going to disconnect the port...
No error in nativeavailable
Disconnected. Now going to sleep for 2.0s to give the monitor thread a chance to SIGSEGV.
If you're seeing this, that means the monitor thread didn't segfault, and bug reproduction has failed.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(2611): Access is denied.

I can get this fixed without buying hardware to reproduce

You can count on me for testing on Windows and MacOS, with the 2 USB serial hw I have.

@MrDOS
Copy link
Contributor

MrDOS commented Jan 29, 2021

Thanks. I forgot to mention that I did try all three (well, now four) tests on Windows and couldn't reproduce a hang on disconnect() or JVM crash, but I did also get a lot of those “Access is denied” errors. They're annoying but, I think, unrelated.

To date, have you encountered the hang on disconnect() with any hardware other than the original Legrand Zigbee dongle?

@mvalla
Copy link
Author

mvalla commented Jan 30, 2021

To date, have you encountered the hang on disconnect() with any hardware other than the original Legrand Zigbee dongle?

The hang on disconnect() happens in Eclipse with both the USB devices I have (Legrand and Aeotec).
This is the code I use
If I debug this code in eclipse, then remove the stick, it prints "Clean exit of hardware" but then line 43 is never reached.

@MrDOS MrDOS closed this as completed in #210 Feb 1, 2021
MrDOS added a commit to MrDOS/nrjavaserial that referenced this issue Feb 1, 2021
The old `MonitorThreadLock` boolean field was only checked at a very
slow interval (5s!), and, itself not being synchronized, was prone to
race conditions. More importantly, it wasn't set during the monitor
thread's self-cleanup after hardware failure, so under typical access
patterns, the monitor thread and the application thread would both try
to clean up the monitor thread simultaneously. This race condition could
occasionally lead to a segfault (only reproduced on macOS, but I've no
doubt it could happen elsewhere).

As a side effect of preventing the concurrent cleanup behaviour, I
revealed an infinite loop in `RXTXPort.interruptEventLoop()`. This may
be the source of the hang-on-`disconnect()` behaviour discussed in NeuronRobotics#197:
on a slow or single-core machine, the application _wouldn't_ call
`disconnect()`/`close()` concurrently to the monitor thread's own
cleanup. When it eventually got around to it, the monitor thread would
have already disposed of the event info struct, so the loop to look for
it would never exit. I've resolved this by just bailing out of the
linked list search loop after the first pass. I can't see any situation
where looping further would be useful.
MrDOS added a commit to MrDOS/nrjavaserial that referenced this issue Feb 1, 2021
The old `MonitorThreadLock` boolean field was only checked at a very
slow interval (5s!), and, itself not being synchronized, was prone to
race conditions. More importantly, it wasn't set during the monitor
thread's self-cleanup after hardware failure, so under typical access
patterns, the monitor thread and the application thread would both try
to clean up the monitor thread simultaneously. This race condition could
occasionally lead to a segfault (only reproduced on macOS, but I've no
doubt it could happen elsewhere).

As a side effect of preventing the concurrent cleanup behaviour, I
revealed an infinite loop in `RXTXPort.interruptEventLoop()`. This may
be the source of the hang-on-`disconnect()` behaviour discussed in NeuronRobotics#197:
on a slow or single-core machine, the application _wouldn't_ call
`disconnect()`/`close()` concurrently to the monitor thread's own
cleanup. When it eventually got around to it, the monitor thread would
have already disposed of the event info struct, so the loop to look for
it would never exit. I've resolved this by just bailing out of the
linked list search loop after the first pass. I can't see any situation
where looping further would be useful.
@fwolter fwolter reopened this Feb 1, 2021
@MrDOS
Copy link
Contributor

MrDOS commented Feb 1, 2021

Oooops – gotta be real careful with those command words! Thanks for reopening.

@MrDOS
Copy link
Contributor

MrDOS commented Feb 1, 2021

I think #211 might fix the JVM crash, but it probably won't have any bearing on the hang-on-close()/disconnect(). At this point, I'm pretty sure they're unrelated issues. Can you guys try the library from this build, nrjavaserial-fd5fadf?

@mvalla
Copy link
Author

mvalla commented Feb 1, 2021

Just tried with both USB sticks. Actually with this fixed version upon stick removal I now do experience a JVM crash on Windows , with this log:

ERROR: JDWP Unable to get JNI 1.2 environment, jvm->GetEnv() return code = -2
JDWP exit error AGENT_ERROR_NO_JNI_ENV(183):  [util.c:840]
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.

@MrDOS
Copy link
Contributor

MrDOS commented Feb 1, 2021

Fascinating. I'll try on my Windows box as soon as I get a chance, but can you confirm you get that every time? What if you just run the test without attaching a debugger? Not that I want to deflect, but it looks like it could be a very old JVM bug. util.c is certainly not part of our source at the very least; neither is JDWP.

@mvalla
Copy link
Author

mvalla commented Feb 2, 2021

That was Java 8.
But actually with your modified lib the JVM crash happens all times, with both USB sticks, with both Java versions (8 and 11) and with both Eclipse debugger / command line.
This is the output using your modified lib, Java 11, from command line (the JVM crashed after removing the USB stick):

java -cp ".\target\classes;C:\Users\myuser\.m2\repository\org\slf4j\slf4j-log4j12\1.7.28\slf4j-log4j12-1.7.28.jar;C:\Users\myuser\.m2\repository\org\slf4j\slf4j-api\1.7.28\slf4j-api-1.7.28.jar;C:\Users\myuser\.m2\repository\log4j\log4j\1.2.17\log4j-1.2.17.jar;.\lib\nrjavaserial-5.2.1-alfa.jar" org.serial.test.MySerialDisconnectTest
Java version: 11.0.8
Availible port: COM3
Connected
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.
Error 0x5 at src/windows/termios.c(512): Access is denied.
Error 0x5 at src/windows/termios.c(2725): Access is denied.

@claui
Copy link

claui commented Feb 9, 2024

Interesting. A quick glance at RXTXPort.interruptEventLoop() doesn't reveal anything that should get “stuck”. The linked-list search is the obvious suspect at a glance, but it's not possible to open a port without populating that list, and nothing every removes old entries from that list, so that shouldn't be it.

If a monitor thread happens to catch some signal and dies without its event_loop getting to call finalize_event_struct, then the stale struct will remain in the linked list even though the stack frame or whole stack is freed. If the stack gets reused by a new monitor thread later, its stack frames might align exactly with those of the old thread. In that case, the pointer to the struct would end up in the list twice, causing a cycle while navigating the list.

@MrDOS
Copy link
Contributor

MrDOS commented Feb 9, 2024

Holy smokes, you're really peeling back the covers on a couple of really old bugs here. Thank you so much for your efforts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants