wip: integrate binary data into payloads #283
Conversation
Hi @Totodore, here's what I have so far. I hate PRs that have so many changed lines, but the changes to the library itself are actually not that large. The first commit is all added lines: the addition of …
socketioxide/src/handler/extract.rs
```diff
     _: Option<i64>,
 ) -> Result<Self, Infallible> {
-    Ok(Bin(bin))
+    Ok(Bin(v.extract_binary_payloads()))
```
I've left in the `Bin` extractor, and preserved its behavior. Is it worth keeping?

- There is an example in a doc comment where the `Bin` extractor is used, and then the binary payloads are emitted along with three different messages, separate from the data in two of them. Maybe that is a use-case that is useful to preserve?
- A downside of keeping it is that `Data` would be much more efficient if it implemented `FromMessage` and not `FromMessageParts`, since with the latter we have to clone the entire thing.
- If we do keep it, maybe `Bin` should instead implement `FromMessageParts` and clone the binary payloads rather than extracting them from the `PayloadValue`. Then `Data` could implement `FromMessage`, if we assume that the most common use case is that people will use `Data` and not `Bin`; that would make the common case more efficient, removing the clone of the data.

I'm leaning toward keeping it and implementing that last point above.
One wrinkle: if I convert `Data` to `FromMessage`, I'm thinking about `TryData`, and maybe leaving it as `FromMessageParts`. Maybe someone wants to do a `TryData`, but if it fails, still get the data in another form in order to do something else with it. Like:
```rust
socket.on("test", |TryData(data): TryData<MyData>, Data(value): Data<PayloadValue>| async {
    if let Ok(data) = data {
        println!("Got data as expected: {:?}", data);
    } else {
        println!("Data deser failed, payload received was: {:?}", value);
    }
})
```
Or even implement a handler that can have the data deserialize into multiple possible known types:

```rust
|TryData(my_data): TryData<MyData>, TryData(other_data): TryData<OtherData>, Data(last_chance): Data<LastChanceData>|
```
Thoughts?
```rust
fn on_handle(socket: SocketRef, ack: AckSender, Bin(bin): Bin, Data(data): Data<PayloadValue>) {
    info!("Received message: {:?}", data);
    ack.send(data.clone()).ok();
    socket.emit("message-back", data.clone()).ok();
}
```

The picture is the printed data, and the data received in `on_handle` is decoded from binary to string.
Yep, I know things don't really work yet ;) I've already fixed this locally (it seems `String`'s default serde impl handles bytes, sigh), but I'm working on some other fixes as well before I push it.
Thank you, you've worked hard.
> I've left in the `Bin` extractor, and preserved its behavior. Is it worth keeping? […] I'm leaning toward keeping it and implementing that last point above.
I think the best would be to remove the `Bin` extractor and the `bin()` operator, and have `Data` and `TryData` implement `FromMessage`, at least for `Data`. I still don't know what we should do for `TryData`; maybe we could start by only implementing `FromMessage`, and later add a custom implementation of `FromMessageParts` that clones the data. But we still need to think about this.
Ok, removed `Bin`, and made `Data` implement `FromMessage`. I left `TryData` as `FromMessageParts` for now, but we can revisit that.
Did some benchmarking and compared runs with critcmp:

The encoding numbers look really good, improved in a few cases by 75%-110%. But decoding is a bit more mixed; a few large improvements of around 50%, but quite a few 10-20% regressions. Since the decoding regressions are in the cases without binary payloads, my guess is that my …
This will act as a replacement for serde_json::Value that can also hold binary payloads.

Previously, the non-binary part of a message and the binary payloads in a message were represented separately: the non-binary portion was represented by a serde_json::Value, and could be converted to an arbitrary data structure. That data structure would not include the binary data or any indication that there is any binary data at all. The binary data would be provided in a Vec<Vec<u8>>.

There were a few problems with this:

1. The original code only supported cases where the payload was a flat array with some binary payloads in the root of the array, or a flat object where the root of the object was a binary payload. Objects with more complicated structure and binary data embedded in various places in the structure were not supported.
2. Adding support for the above turned out to not be possible in a useful way, because the ordering of the Vec<Vec<u8>> matters, and it could never be clear where exactly in the possibly-complex structure each binary payload belonged.
3. One of the useful features of the socket.io protocol is that it lets users efficiently transmit binary data in addition to textual/numeric data, and have that handled transparently by the protocol, with either end of the connection believing that they just sent or received a single mixed textual/numeric/binary payload. Separating the non-binary from the binary negates that benefit.

This introduces a new type, PayloadValue, that behaves similarly to serde_json::Value. The main difference is that it has a Binary variant, which holds a numeric index and a Vec<u8>. This allows us to include the binary data where the sender of that data intended it to be.

There is currently one wrinkle: serde_json does not appear to consistently handle binary data; when serializing a struct with a Vec<u8>, I believe it will serialize it as an array of numbers, rather than recognize that it's binary data.
For now, I've included a Binary struct that wraps a Vec<u8>, which can be included as the type of a binary member, instead of using a Vec<u8> directly. Hopefully I'll be able to figure out a better way to do this.

Unfinished tasks:

* Testing: I have no idea if this even works yet. All I've done is get it to compile.
* Benchmarking: I've tried to ensure that I don't copy data any more than the existing library does, but it's possible I've introduced some performance regressions, so I'll have to look into that.
* Documentation: the documentation still references the old way of doing things and needs to be updated.

Closes Totodore#276.
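A minimal sketch of the `PayloadValue` shape described above. This is an assumption about the rough structure only; the PR's actual type surely has more variants (e.g. separate integer/float numbers) and may use different field names.

```rust
// Sketch only: not the PR's actual PayloadValue definition. The key idea
// is the Binary variant, which can live at any depth in the value tree.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum PayloadValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    // Binary data stays in place in the structure, tagged with the
    // placeholder index used to match wire-level binary attachments.
    Binary { num: usize, data: Vec<u8> },
    Array(Vec<PayloadValue>),
    Object(HashMap<String, PayloadValue>),
}

fn main() {
    // Unlike the old Vec<Vec<u8>> side-channel, a binary payload can be
    // nested wherever the sender put it.
    let payload = PayloadValue::Array(vec![
        PayloadValue::String("event".to_string()),
        PayloadValue::Binary { num: 0, data: vec![0xde, 0xad, 0xbe, 0xef] },
    ]);
    println!("{:?}", payload);
}
```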
So far just the doctests that were failing.
Did some profiling, looked at flamegraphs, made a few changes, but the performance regressions are still there. This is very strange, though. Decoding of …

I'm getting really confused by this, actually. Different runs of the benchmark compare very differently. For example, I just did a run where the …
I have approximately the same weird results (but on a GitHub Codespaces container, so it is not stable). Either the benchmark code is not working properly (though as far as I can tell it is correct) or it is the machine. Globally the difference is about +10% for the connect packet decoding. But because the scale is really low, and the variance for connect packet decoding is quite high, I think it is still ok and that we can move on...
Ok, I quit X, ran a baseline against main, then ran it against this branch, comparing with main, 3 times. Numbers are the average change from main:

With one exception (Encode packet connect on /custom_nsp), it's consistent across all 3 runs. But there are still some oddities: "Decode packet connect on /" is somehow ~19% faster, even though that code hasn't changed at all from what's on main. And even stranger, "Decode packet connect on /custom_nsp" is 5% slower, even though a) it does almost the exact same thing as without the custom nsp, and b) that code also hasn't changed from what's on main. Unfortunately I'm at a loss as to what to do, though. One thing that's notable, maybe: …
Oh, it looks like … We're not going to get all the way down to 32, but I'm surprised it's not down to 40 bytes: the … At any rate, 48 bytes is still better than 56. I'll try to rerun the benchmarks later tonight and see if that changes anything. Feels like a bit of a long shot, though.
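Enum sizes like the 56/48/40/32-byte figures above can be checked directly with `std::mem::size_of`. The `Toy` enum here is only a guess at the rough shape of the variant being discussed, not the PR's actual type:

```rust
// Checking in-memory sizes while iterating on enum layout.
use std::mem::size_of;

// Toy stand-in for a Binary-style variant: an index plus a byte buffer.
#[allow(dead_code)]
enum Toy {
    Binary(usize, Vec<u8>),
    Text(String),
}

fn main() {
    // Vec<u8> is three usizes (pointer, capacity, length): 24 bytes on
    // 64-bit targets. The enum adds the variant payloads plus whatever
    // discriminant/padding the compiler's layout (including niche
    // optimizations) ends up needing.
    println!("Vec<u8>:  {} bytes", size_of::<Vec<u8>>());
    println!("Toy enum: {} bytes", size_of::<Toy>());
}
```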
Ok, here are the same numbers as before (against the baseline of the main branch), but now with three more runs with …

I'm so confused. Somehow the performance when encoding non-binary events is much much worse, despite now using the same map implementation …
I don't understand what you changed exactly for this benchmark. It would not surprise me that your custom implementation is slower than … Apart from that, I also discovered that …
I wonder if it would be possible to use a custom …
I changed …

Possibly... the downside there is that the public API would then be a little inconsistent. E.g. …

Oh neat, I didn't know about that either. I'm not sure that would work, though, since you'd need some kind of heterogeneous array/list implementation that is serde-compatible; that is, the example in the …

I assume you're talking about something that socketioxide would provide, but that users would add to their structs with a serde attribute? I don't think that would work, since the deserializer would also need access to the list of binary payloads, and there's no way to get extra data to a deserializer specified that way. Another thing I'm thinking about is sticking with …
That could be a good thing to explore. For the moment I'm not fond of the entire replacement of serde_json, because it is a lot of code and it will probably introduce a lot of complexity to the lib. That's why I won't merge all of this immediately. What we could do is either improve this PR by working on it, or if you find completely different possibilities, don't hesitate to create a new PR :). Thanks a lot for all of your work and your time :)!
Yes, I totally agree. This got pretty large and more complicated than I expected or hoped.
For now I'll probably continue to work here on this PR. Even if I do something completely different I may just squash it all down. I've mostly written ser/deser wrappers for … There is one major downside to this approach: since … But overall I think this is a better way to go than …

You're welcome! Got into this because I needed it for a personal project of mine, and even though it's been a diversion from that project, it's been fun and educational!
Decided to do a new PR after all; see #284.