Holochain has specific requirements for serialization that need to be enforced consistently in a "fool proof" way.
- Support arbitrary binary data
- Consistent bytes for canonical representations of things for cryptographic use
- Support multiple languages
- Wide industry usage and tooling across many languages including Rust
- Reasonably fast and compact
- Human readable for debugging
- Able to preserve semantics for debugging
- Able to derive a sane serialization for the 99% of use cases
- Able to manually implement serialization for edge-cases (e.g. Merkle Trees)
- Able to form a base infrastructure layer for both wasm and networking
- Able to leverage the Rust compiler as much as possible to avoid bugs and help refactors
There are some additional goals of this crate that I'd like to make explicit that informed some of the code design choices.
- Canonical representations of bytes of data should be enforced as canonical by the compiler at the type level
- For two systems sharing bytes and a shared type crate on the same version, round tripping bytes should be guaranteed safe by the compiler
- The design of
SerializedBytes
should be such that end-user-happ-devs should not need to be aware of it for BAU tasks - Internal dependencies (e.g. serde messagepack handling) of this crate should not bleed into downstream concerns due to macros or whatever
- Everything internal to serialization should be upgradeable and refactorable without any external interface changes for consumers
The usage is very simple. Most of this README is dedicated to discussing design decisions as the crate is fairly opinionated to minimise "foot guns" and maximise how much the compiler can do for us, given that serialized data de facto erases type information across systems (e.g. across wasm guest/host boundary).
A single SerializedBytes
new type that includes a Vec<u8>
of bytes.
These bytes are by default to be MessagePack serialized binary data.
There isn't a constructor for it e.g. there is no SerializedBytes::new()
.
Implement TryFrom<SerializedBytes> for Foo
and TryFrom<Foo> for SerializedBytes
.
There is one macro holochain_serial!
that accepts a list of types.
Each type passed to holochain_serial!
will get default TryFrom
using rust messagepack.
There is also a #[derive(SerializedBytes)]
that uses holochain_serial!
internally
and is re-exported by holochain_serialized_bytes::prelude::*
.
https://github.com/3Hren/msgpack-rust
Definitely use holochain_serial!
or #[derive(SerializedBytes)]
for all your types if you can.
You also need to derive or implement Serialize
and Deserialize
for your types.
It looks like this:
/// struct with a utf8 string in it
#[derive(Serialize, Deserialize, SerializedBytes)]
struct Foo {
inner: String,
}
/// struct with raw bytes in it
#[derive(Serialize, Deserialize, SerializedBytes)]
struct Bar {
whatever: Vec<u8>,
}
let foo = Foo { inner: "foo".into() };
let serialized_bytes: SerializedBytes = foo.try_into().unwrap();
println!("{:?}", &serialized_bytes); // debugs to json: {"inner":"foo"}
println!("{:?}", serialized_bytes.bytes()); // messagepack bytes [129_u8, 165_u8, 105_u8, 110_u8, 110_u8, 101_u8, 114_u8, 163_u8, 102_u8, 111_u8, 111_u8,]
let deserialized_foo: Foo = serialized_bytes.try_into().unwrap();
For debugging, the internal messagepack serialized bytes are transcoded to JSON
using serde-transcode
. This means that you will see JSON output from "{:?}"
which is much easier to read than binary from messagepack.
If you want a read only view of the actual messagepack bytes call the .bytes()
method.
You can fuzz this repository as:
docker run --rm --env FUZZ_TARGET="<some fuzz target>" -it holochain/fuzzbox:holochain-serialization
You may need to pull the tag before fuzzing to get the latest code as it is built on CI against main.
For more information on fuzzbox see https://github.com/holochain/fuzzbox.
These design limitations all exist to keep things simple.
I acknowledge that some of these limitations are quite strict and may even feel quite
restrictive in a "local optimisation" kind of way, but in the context of where and
how SerializedBytes
is intended to be employed, it should all be for the greater good ;)
All limitations from messagepack: https://github.com/msgpack/msgpack/blob/master/spec.md#limitation
Any/all bugs from the rust messagepack implementation https://github.com/3Hren/msgpack-rust
The SerializedBytes
is intented to be immutable because it is a canonical
representation of something else.
Moving from SerializedBytes
to SerializedBytes
is a no-op.
Even though it is binary data that messagepack could represent as a nested binary message, it won't because Rust won't trigger the serialization logic.
If you nest SerializedBytes
inside another struct it WILL be double serialized.
E.g. this (JSON representation):
// Foo { inner: String }
// foo = Foo { inner: "foo".into() }
{"inner":"foo"}
becomes this (JSON representation) when converted into SerializedBytes
then nested in another struct:
// Bar { inner: SerializedBytes }
// bar = Bar { inner: foo.into() }
{"inner":[129,165,105,110,110,101,114,163,102,111,111]}
We intentionally do not support moving from Rust primitives to/from SerializedBytes
.
The only exception is ()
which maps to nil
in messagepack.
I.e. SerializedBytes::try_from(())
is valid.
This is because everything other than nothing (nil) has ambiguous meaning when used across systems.
For example, consider Ok(())
vs. ValidationResult::Ok
.
The former quickly becomes confusing when shared e.g. across a wasm host/guest system boundary.
It gets nastier when these things start to nest like Ok(Err("Some string"))
where the different levels of nesting originate from different systems.
It gets nastier when dealing with systems (like wasm) that reference linear memory directly and we're dealing with lengths/offsets to other data as integers.
So maybe some integer is a reference to memory of some data that is an Ok(Err("Some string"))
,
and maybe that integer is serialized somehow to be sent somewhere else, like across
the wasm host/guest boundary.
It gets nastier again when serialization is represented as strings (e.g. JSON) and we're trying to accept data serialized by other systems in a nested way that can cross multiple nested function calls that can hit other arbitrary systems, that all represent their data like the above...
Very quickly we end up with something the compiler can't help with because it is all essentially "stringly typed" full of backslashes to escape it all.
Things like Option
and Result
aren't representations of domain specific data
anyway, they are conceits of compiler type systems. They don't need to be
serialized because the type information needs to be given to the compiler by the
developer for rust to be able to deserialize anyway.
https://www.youtube.com/watch?v=YR5WdGrpoug&feature=youtu.be
To put it another way, Result
tells the compiler something about the runtime
behaviour of a function, it doesn't represent "something" in the real world or
domain. Option
also tells the compiler to allow the absence of something at
runtime, which doesn't need to be serialized, it can simply be absent in
serialized data.
Of course, we may need to represent a closed set of possible result types, like
ValidationResult
or CallbackResult
and this is something that we can use an
enum for. In Holochain world a ValidationResult
is something in the real world,
that is worth serializing, it feeds into cryptographically signed claims about
the validity of other things according to a set of rules, so the serialized
representation feeds into a cryptographic proof, which can't be said about a
generic Rust Result
that simply means "some function may fail".
The closest we get to "needing" a Result
is to represent the return value of
imported/exported functions between a wasm host/guest, but even this would work
and could even benefit from WasmHostResult
and WasmGuestResult
enums to
track the provenence of any errors.
Other primitives like strings and integers are also NOT supported to directly
move between SerializedBytes
and this is by design.
A single serialized integer or string floating around outside of the compilation context that produced it is mostly useless in a different compilation context. This is especially true for strings when serialized data is also a string, forgetting to serialize or double serializing strings is a huge problem in lots of contexts, including security sensitive ones.
Numbers have problems where the serialization format doesn't map 1:1 with the
compiler types, e.g. when "1" exists in some serialized format it could be signed
or unsigned of any size, whereas rust treats u8
and i8
and u32
as completely
different things. This is more or less of an issue depending on where you sit on
the scale between serialzation formats that are tightly coupled to the language
you are currently using vs. general purpose formats that can't assume anything
about language support.
In addition to these issues on the philosophical/domain-modelling side of things
it is also really messy to "correctly" handle primitives the way we want.
Rust hasn't implemented "type specialization" yet, which makes handling things
like Result<Result<SerializedBytes, String>, String>
lead to hundreds of lines
of (still buggy) code. See the legacy JsonString
implementation and issues for
examples of how edge-cases in the type system can introduce subtle bugs that
break serialization round-trips.
The current setup that avoids primatives does the heavy lifting in under 50 LOC.
- specialization RFC: rust-lang/rust#31844
- example messy code: https://github.com/holochain/holochain-serialization/pull/15/files#diff-634e1fc4ddb3416a36776dac7cfaa965R187
Important note: all of these types, including Result
and Option
and even
SerializedBytes
itself, implement both Serialize
and Deserialize
which
means they can be used within your custom struct/enum type; but please be mindful
of the above when representing domain data in a serialized format.
Important note: New types/tuples serialize to the same bytes as the primitive they wrap in messagepack, so don't worry about bloating serialized data by creating custom types. On the other hand, we are using the messagepack configuration that keeps field names, so creating structs with long fields relative to the inner data may add some overhead.
- Put types that are shared across systems in a shared crate that all systems can include as a direct dependency
- Make thoughtful use of the New Type idiom https://doc.rust-lang.org/rust-by-example/generics/new_types.html to serialize primitives
- Be mindful that serialized data should be domain specific "stuff" (what a thing is) not meta/logic level stuff
- Put things in shared type crates that will bloat or break wasm code
- Do weird nested things with
Result
andOption
to try and avoid the above - Nest
SerializedBytes
in other things unless you really mean to serialize already-serialized data
For any struct/enum you want to move into SerializedBytes
you need to add the
two basic Serde traits. We can't magic this boilerplate away with macros (yet).
Rust guidelines state to impl Serialize
and Deserialize
anyway.
I chose to implement holochain_serial!(FooType, BarType, ...)
as a proc macro in addition to a derive.
I think this is a little non-standard but has a few advantages around dependency management and overall boilerplate.
If you run into dependency issues with the derive, try the holochain_serial!
macro.
Derive macros require a separate *_derive
crate, which means I can't do $crate::SerializedBytes
which means I can't lock down fully qualified paths to things, which introduces room for mistakes and additional
boilerplate/maintenance of dependencies.
Another advantage is that "macros by example" (proc macros) are simply more straightfoward to write and maintain.
Another advantage is that a proc macro gives us more future extensibility than a simple derive, meaning potential for more deep integrations with e.g. the HDK toolkit.
There is no Serialized::new()
or SerializedBytes::from_bytes()
or whatever.
This is by design.
You MUST do this (or equivalent) every time:
let serialized_bytes: SerializedBytes = foo.try_into()?;
and
let foo: Foo = serialized_bytes.try_into()?;
Which of course means you need to define a Foo
for everything that needs to be
formally serialized into bytes, and you need to share that Foo
type in a crate
that everywhere that will round trip Foo
data can use as a depenency.
In the previous iteration of serialization (which was JSON based) we allowed things like this:
// we expected you to create you're own json
let foo: String = json!({ ... });
let json_string = JsonString::from_json(foo);
// we also allowed a regular From to do the same thing
let foo: String = "{...}".into();
let json_string = JsonString::from(foo);
// RawString was a hack to "undo" the above by wrapping String in a newtype that
// is then serialized into json to allow json strings of json strings
let foo: RawString = String::from("bar").into();
let json_string = JsonString::from(foo); // internally as "\"bar\""
Which led to serious issues:
- Sometimes
foo
didn't contain valid JSON data at all, butfrom_json()
still accepted it (or users would have to eat runtime CPU costs to validate and potentially error) - Having
JsonString::from
andJsonString::from_json
andRawString
just muddied the waters with subtleties that maybe a half dozen people ever understood or cared about - There was no guarantee that
foo
would always contain the same bytes even if the serialized data was equivalent (e.g. whitespace, field ordering, etc.) so this breaks cryptography - It's not immediately obvious what the "correct" behaviour of nested
JsonString
structs should be (e.g. no-op vs. double serialize) as ad-hoc data never has structural boundaries in the first place - There's no canonical way to round-trip data if it never had a canonical type/form in the first place
- Wherever the lack of canonical round-tripping (or potential for) touched the wasm host/guest boundary, the
JsonString
abstraction leaked out of the HDK into a happ level concern - It's impossible to modify the serialization approach later (e.g. moving to messagepack) without heroic efforts (vs. simply updating the
holochain_serial!
macro in this crate)
I do understand that for some use-cases, (e.g. merkle trees), you may need exact bytes (i.e. not messagepack).
There is an escape hatch for directly importing u8
bytes into SerializedBytes
.
The above issues have already burned hundreds of development hours, with 105 outstanding uses of from_json()
across 48 files, so please avoid it!
To move bytes into SerializedBytes
use the UnsafeBytes
struct.
UnsafeBytes
does implement From<Vec<u8>>
and round trips through SerializedBytes
.
The round trip between UnsafeBytes
and SerializedBytes
is zero-copy.
Importantly, the intent is that you use UnsafeBytes
as an implementation detail inside a TryFrom.
E.g.
impl TryFrom<Foo> for SerializedBytes {
type Error = SerializedBytesError;
fn try_from(f: Foo) -> Result<SerializedBytes, SerializedBytesError> {
let bytes: Vec<u8> = foo.calculate_bytes_for_foo();
Ok(SerializedBytes::from(UnsafeBytes::from(bytes)))
}
}
This allows us to maintain the rule that we always use TryFrom
to round trip Foo
through SerializedBytes
.
Among other things this rule allows us to write proc macros that completely hide the
SerializedBytes
struct from the end-user-happ-developer.
If our HDK can safely assume the existence of TryFrom<SerializedBytes>
for everything
that crosses the wasm boundary we can achieve an "almost native"
(sans-primitive types, see above) working experience for typed zome functions.
Important note: If you use UnsafeBytes
the expectation is that you are NOT using messagepack any more.
Therefore, if we move away from messagepack (e.g. to BSON or something) then don't expect any compatibility with UnsafeBytes
based code.
This is a key difference with the legacy JsonString::from_json()
approach that assumed valid JSON, here we assume invalid messagepack.
We used JSON for a long time. It certainly has many benefits:
- Very good javascript support
- "lowest common denominator" in many ways
- Very human readable
- Arbitrary length data for all types (c.f. messagepack
i32
size limit on objects) - Speed comparable to all other serialization formats
- Gzipped size comparable or better than other formats
Ultimately though, JSON is not a binary format and a lot of people want a binary format.
Forcing everything through verbose UTF-8 introduces messy base64 encoding, gzipping etc. that leads to overhead and mistakes.
JSON format also suffers the need for complex escaping (backslashes) that is hard to debug by hand.
No particular reason. We had to pick something reasonable, BSON would probably be fine too.
There is a rust crate: https://github.com/mongodb/bson-rust
It didn't show up in benchmarks though: https://github.com/erickt/rust-serialization-benchmarks
And the crate is tied to mongodb c.f. messagepack being more broadly owned/starred/maintained.
If there is a strong pull for BSON we could swap out or augment holochain_serial!
fairly easily.
Using something tied to Rust has technical benefits:
- Compact byte representation
- Fast (sometimes crazy fast) round tripping of data
But there are deal-breaking tradeoffs for us:
- Incompatible with other languages, e.g. JavaScript over the network
- Risk of instability in byte representation across compiler versions that breaks cryptography
- Impossible to represent in a human debuggable way, can't even transcode because not self-describing