
BC_FREE_BUFFER not getting executed after processing oneway RIL indication #62

Open
NotKit opened this issue Feb 21, 2021 · 10 comments

@NotKit
Contributor

NotKit commented Feb 21, 2021

At some point (hours or days after boot) ofono gets stuck on Ubuntu Touch Android 9 adaptations and stops processing incoming async transactions. As a result, more and more buffers are allocated for incoming indication transactions from RIL that never reach ofono, until no buffers are left and binder transactions fail with code -28 in dmesg:

binder: 2989:3012 transaction failed 29201/-28, size 6572-1304 line 3137

It seems to be caused by the BC_FREE_BUFFER command not being issued under rare circumstances which I have not been able to figure out yet. Given the way the async_todo queue is handled in the binder kernel driver, the next oneway transaction is dequeued only after the BC_FREE_BUFFER command for the previous one has been processed.
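For reference, here is a minimal sketch of the userspace side, assuming the uapi binder header; it is not libgbinder's actual code, and the function name and error handling are made up. It shows how a client hands a received buffer back to the driver with BC_FREE_BUFFER through the BINDER_WRITE_READ ioctl; if that command is never written for a oneway transaction, the kernel keeps the buffer and leaves the next oneway transaction queued.

```c
/*
 * Minimal sketch (not libgbinder's actual code): returning a received
 * transaction buffer to the binder driver with BC_FREE_BUFFER. Struct
 * layouts come from the uapi <linux/android/binder.h> header; the function
 * name and error handling are hypothetical.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/android/binder.h>

static int free_transaction_buffer(int binder_fd, binder_uintptr_t buffer)
{
    /* BC_FREE_BUFFER is a 32-bit command code followed by the buffer
     * pointer previously delivered with BR_TRANSACTION. */
    struct {
        uint32_t cmd;
        binder_uintptr_t ptr;
    } __attribute__((packed)) wr = { BC_FREE_BUFFER, buffer };

    struct binder_write_read bwr;
    memset(&bwr, 0, sizeof(bwr));
    bwr.write_size = sizeof(wr);
    bwr.write_buffer = (binder_uintptr_t)(uintptr_t)&wr;

    /* Until the driver consumes this command, the kernel keeps the buffer
     * allocated and, for a oneway transaction, does not dequeue the next
     * entry from the async_todo queue. */
    return ioctl(binder_fd, BINDER_WRITE_READ, &bwr);
}
```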

I have not seen this issue on SailfishOS, but it is common on Ubuntu Touch devices based on Android 9, even though Ubuntu Touch uses the SailfishOS fork of ofono for those devices and I tested with libgbinder 1.1.5 on UT.

In the attached binder_trace.log, the last oneway transaction processed by the /dev/hwbinder looper is the following, and it has no matching binder_transaction_buffer_release:
/dev/hwbinder#-13934 [005] ...1 68522.162937: binder_transaction_received: transaction=45804

It corresponds to 2021-02-19 17:21:25 [gbinder] < BC_TRANSACTION 0x00000002 0x00000082 in ofono.log (attached in ofono_binder_error_28.tar.gz).

From my understanding of the libgbinder code, the buffers for incoming one-way transactions are supposed to be freed after the remote request is unreferenced here: https://github.com/mer-hybris/libgbinder/blob/master/src/gbinder_driver.c#L498. Going through the libgbinder-radio and ofono-ril-binder-plugin code, I have not encountered anything that could increase the reference count for it. Would you have any idea what else this could be related to?
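To illustrate the pattern I mean, a hedged sketch follows with made-up names (not the actual GBinderRemoteRequest internals): the buffer is handed back to the driver only when the last reference to the incoming request is dropped, so one stray reference anywhere is enough for BC_FREE_BUFFER to never be sent.

```c
/*
 * Illustrative sketch only; type and function names are hypothetical and
 * do not correspond to libgbinder's real structures.
 */
#include <glib.h>

typedef struct remote_request {
    gint refcount;
    gpointer buffer;                    /* transaction buffer from the driver */
    void (*free_buffer)(gpointer buf);  /* would issue BC_FREE_BUFFER */
} RemoteRequest;

static RemoteRequest *remote_request_ref(RemoteRequest *req)
{
    g_atomic_int_inc(&req->refcount);
    return req;
}

static void remote_request_unref(RemoteRequest *req)
{
    /* Only the last unref returns the buffer to the driver. A reference
     * taken but never released keeps the buffer alive, and the kernel then
     * stops delivering further oneway transactions for that node. */
    if (g_atomic_int_dec_and_test(&req->refcount)) {
        req->free_buffer(req->buffer);
        g_free(req);
    }
}
```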

@monich
Contributor

monich commented Mar 2, 2021

Please check if it's still happening with libgbinder 1.1.6, which touched quite a few things related to buffer management.

@fredldotme

It took around 22 hours on the Volla phone to trigger the issue with libgbinder 1.1.6.
The other test subject with both updated libgbinder and libgbinder-radio is still running though.
Both were started at roughly the same time.

@mariogrip

As an additional note, we use glib2 version 2.48.2; maybe there is a bug there?

@mariogrip

Mine also failed with both libgbinder and libgbinder-radio updated, so 1.1.6 sadly didn't fix the issue :(

@monich Maybe the glib2 version has something to do with it? What version does Sailfish use?

@monich
Contributor

monich commented Mar 10, 2021

I very much doubt that this has anything to do with glib2, we have tried many different versions and never experienced this sort of problem. Most recently it's been 2.66.2. My feeling is that it's more likely to be related to the binder driver.

@mariogrip

> I very much doubt that this has anything to do with glib2, we have tried many different versions and never experienced this sort of problem. Most recently it's been 2.66.2. My feeling is that it's more likely to be related to the binder driver.

The reason I ask is that we have tried many different phones and chipsets and it happens on all of them. We even tried the same phones with the same kernel and HAL on Sailfish, but it does not happen there. Everything from the kernel and HAL to the ofono stack is the same.

@monich
Contributor

monich commented Mar 12, 2021

Another way to try to pin down the root cause would be to try running ofono under valgrind (which is something I periodically do). If it's indeed something not being deallocated (and BC_FREE_BUFFER not being sent as a consequence) it must show up as a leak in valgrind. It would be nice to confirm that it's actually a refcount screw-up somewhere in ofono binder plugin.

@mariogrip

> Another way to try to pin down the root cause would be to try running ofono under valgrind (which is something I periodically do). If it's indeed something not being deallocated (and BC_FREE_BUFFER not being sent as a consequence) it must show up as a leak in valgrind. It would be nice to confirm that it's actually a refcount screw-up somewhere in ofono binder plugin.

I have run this under valgrind for over 6 days now... no issues at all. So it does not seem to appear when it's run with valgrind... This device fails within 24h when running without valgrind.

@monich
Contributor

monich commented Mar 27, 2021

> I have run this under valgrind for over 6 days now... no issues at all. So it does not seem to appear when it's run with valgrind... This device fails within 24h when running without valgrind.

Hmm, this suggests some sort of a race condition (valgrind slowing things down, shifting timings sufficiently for the race to never occur). That may be a tough one to debug.

@monich
Contributor

monich commented Mar 27, 2021

The next step that I suggest is to reproduce it like @NotKit originally did, but this time with GUTIL_LOG_TID=1 (and libglibutil >= 1.0.51) on the ofono side. That will prefix each line of debug output with the thread id. I added this GUTIL_LOG_TID thing relatively recently: https://git.sailfishos.org/mer-core/libglibutil/commit/c707a15
