On Encodings
Status: Proposal
A perhaps curious feature of the Willow data model is that its specifications rarely talk about encodings. We11Let’s be honest: Aljoscha strongly believe that specifications should concern themselves with purely logical data types as long as possible, treating encodings as a minor and ultimately interchangeable detail. When specifications define concepts in terms of their encodings, results usually end up miserably underspecified (see JSON) or full of incidental complexity (see XML).
Nevertheless, protocols that deal with persistent storage and network transmissions eventually have to serialise data. In this document, we give both some generic definitions around arbitrary encodings, and some specific encodings that recur throughout the Willow family of specifications.
Encodings, In Detail
The Willow protocols are generic over user-supplied parameters. Invariably, some values of the user-supplied types need to be encoded, so there also have to be user-supplied definitions of how to encode things. But not every function that turns values into byte strings defines an encoding.
Hence, we formally define what we mean by encodings. We first give a succinct, rather mathematical definition, followed by a more accessible English explanation.
An encoding function encode_s for a set S is a function from S to the set of bytestrings, such that there exists a function decode_s such that:
- for every s in S and every bytestring b that starts with
encode_s(s)
, we have and - for every s in S and every bytestring b that does not start with
encode_s(s)
, we have
In plain language:
- encode_s must deterministically map any value of type S to exactly one bytestring.
Input Output Correctness “turtle”
“turtle” | “tortoise”
Desired behaviour for encode_animal. - Further, there must be a corresponding decoding function decode_s. This function must map valid encodings back to values of type S, and it must report a decoding error for any input string that does not start with a valid encoding of any value of type S.
Input Output Correctness “turtle”
“tortoise”
“turtle”
“tortoise”
Desired behaviour for decode_animal. - And finally, appending arbitrary bytes to a valid encoding must not cause any change in the return value of decode_s.This requirement makes it so we can encode sequences of multiple values by simply concatenating their encodings, without running the risk that the resulting string would be a valid encoding of a different value as well.
Input Output Correctness “turtle”
“turtleneck”
“turtle”
“turtleneck”
“turtle”
“turtleneck”
Desired behaviour for decode_animal.
Specific Encodings
In this section, we propose some specific encodings for several datatypes of Willow.
When designing encodings, there rarely exists a single best solution. Different scenarios might warrant different trade-offs between encoding and decoding time, encoding size, or simplicity. Furthermore, additional context can often enable more efficient encodings than general-purpose solutions. For example, a database that groups Entries by namespace_id would not need to encode the namespace_id of every individual Entry.
For these reasons, we encourage you to take the following suggested encodings with a grain of salt, and to look for contextual information that you could leverage to obtain even more efficient encodings for your specific use cases.
To encode small numbers with fewer bytes than large numbers, we define for any U64 n the value compact_width(n)
as
- if or
- if or
- if or
- otherwise.
Data Model Encodings
When encoding Paths, we want to use fixed-width unsigned integers of minimal width. Hence, we define:
path_length_power is the least natural number such that We can represent the length of any Component in path_length_power bytes. UPathLengthPower denotes the type of numbers between zero (inclusive) and (exclusive).
path_count_power is the least natural number such that We can represent the number of Components of any Path in path_count_power bytes. UPathCountPower denotes the type of numbers between zero (inclusive) and (exclusive).
encode_path
To encode a Path p, we define encode_path(p)
as the concatenation of
- the number of Components of p, encoded as a big-endian UPathCountPower,
- for each Component of p,
- the length of the Component, encoded as a big-endian UPathLengthPower, followed by the bytes of the Component.
encode_entry
To encode an Entry e, let encode_namespace_id be an encoding function for NamespaceId, let encode_subspace_id be an encoding function for SubspaceId, and let encode_payload_digest be an encoding function for PayloadDigest. We then define encode_entry(e)
as the concatenation of:
encode_namespace_id(e.namespace_id) | |
encode_subspace_id(e.subspace_id) | |
encode_path(e.path) | |
e.timestamp, encoded as a big-endian U64 | |
e.payload_length, encoded as a big-endian U64 | |
encode_payload_digest(e.payload_digest) |
Relativity
When encoding Willow objects, we can often achieve smaller encoding sizes by encoding how some object differs from another. In this section, we define several such relative encodings.
In all subsequent definitions, whenever the value open is part of a numeric computation or comparison, it should be treated as a very large number, say, The definitions all ensure that the resulting values never have to be encoded.
encode_path_relative
To encode a Path p relative to a reference Path ref, we define encode_path_relative(p, ref)
as the concatenation of:
The length of the longest common prefix of p and ref, encoded as a UPathCountPower. | |
encode_path of the Path obtained by removing that longest common prefix from p. |
encode_entry_relative_entry
To encode an Entry e relative to a reference Entry ref, we first define time_diff as the absolute value of e.timestamp - ref.timestamp
. We then define encode_entry_relative_entry(e, ref)
as the concatenation of:
Bit | Big-endian bitfield |
---|---|
0 | Encode e.namespace_id?1 iff e.namespace_id != ref.namespace_id |
1 | Encode e.subspace_id?1 iff e.subspace_id != ref.subspace_id |
2 | Add or subtract time_diff from ref.timestamp?1 iff e.timestamp - ref.timestamp > 0 |
3 | always 0 |
4, 5 | 2-bit integer n such that 2^n gives compact_width(time_diff) |
6, 7 | 2-bit integer n such that 2^n gives compact_width(e.payload_length) |
encode_namespace_id(e.namespace_id) , or the empty string, if e.namespace_id == ref.namespace_id | |
encode_subspace_id(e.subspace_id) , or the empty string, if e.subspace_id == ref.subspace_id | |
encode_path_relative(e.path, ref.path) | |
time_diff, encoded as an unsigned, big-endian compact_width(time_diff) -byte integer | |
e.payload_length, encoded as an unsigned, big-endian compact_width(e.payload_length) -byte integer | |
encode_payload_digest(e.payload_digest) |
encode_entry_in_namespace_area
To encode an Entry e that is included in an outer Area out in a namespace of NamespaceId namespace_id, we first define time_diff as the minimum of e.timestamp - out.times.start
and out.times.end - e.timestamp
. We then define encode_entry_in_namespace_area(e, out, namespace_id)
as the concatenation of:
Bit | Big-endian bitfield |
---|---|
0 | Encode e.subspace_id?1 iff out.subspace_id == any |
1 | Add time_diff to out.times.start, or subtract from out.times.end?1 iff e.timestamp - out.times.start <= out.times.end - e.timestamp |
2, 3 | 2-bit integer n such that 2^n gives compact_width(time_diff) |
4, 5 | 2-bit integer n such that 2^n gives compact_width(e.payload_length) |
6, 7 | always 0 |
encode_subspace_id(e.subspace_id) , or the empty string, if out.subspace_id != any | |
encode_path_relative(e.path, out.path) | |
time_diff, encoded as an unsigned, big-endian compact_width(time_diff) -byte integer | |
e.payload_length, encoded as an unsigned, big-endian compact_width(e.payload_length) -byte integer | |
encode_payload_digest(e.payload_digest) |
encode_entry_in_namespace_3drange
To encode an Entry e that is included in a 3dRange out in a namespace of NamespaceId namespace_id, we first define time_diff as the minimum absolute value of e.timestamp - out.times.start
and e.timestamp - out.times.end
. We then define encode_entry_in_namespace_3drange(e, out, namespace_id)
as the concatenation of:
Bit | Big-endian bitfield |
---|---|
0 | Encode e.subspace_id?1 iff e.subspace_id == out.subspaces.start |
1 | Encode e.path relative to out.paths.start or to out.paths.end?1 iff the longest common prefix of e.path and out.paths.start is at least as long as the longest common prefix of e.path and out.paths.end |
2 | Combine time_diff with out.times.start, or with out.times.end?1 iff time_diff == abs(e.timestamp - out.times.start) |
3 | Add or subtract time_diff?1 iff bit 2 is 1 and e.timestamp >= out.times.start , or bit 2 is 0 and e.timestamp <= out.times.end . |
4, 5 | 2-bit integer n such that 2^n gives compact_width(time_diff) |
6, 7 | 2-bit integer n such that 2^n gives compact_width(e.payload_length) |
encode_subspace_id(e.subspace_id) , or the empty string, if e.subspace_id == out.subspaces.start | |
encode_path_relative(e.path, out.paths.start) if the longest common prefix of e.path and out.paths.start is at least as long as the longest common prefix of e.path and out.paths.end, otherwise encode_path_relative(e.path, out.paths.end) | |
time_diff, encoded as an unsigned, big-endian compact_width(time_diff) -byte integer | |
e.payload_length, encoded as an unsigned, big-endian compact_width(e.payload_length) -byte integer | |
encode_payload_digest(e.payload_digest) |
encode_area_in_area
To encode an Area a that is included in an outer Area out, we first define start_diff as the minimum of a.times.start - out.times.start
and out.times.end - a.times.start
, and we define end_diff as the minimum of a.times.end - out.times.start
and out.times.end - a.times.end
. We then define encode_area_in_area(a, out)
as the concatenation of:
Bit | Big-endian bitfield |
---|---|
0 | Encode a.subspace_id?1 iff a.subspace_id != out.subspace_id |
1 | Encode end value of a.times?1 iff a.times.end == open |
2 | Add start_diff to out.times.start, or subtract from out.times.end?1 iff start_diff == a.times.start - out.times.start |
3 | Add end_diff to out.times.start, or subtract from out.times.end?1 iff end_diff == a.times.end - a.times.start |
4, 5 | 2-bit integer n such that 2^n gives compact_width(start_diff) |
6, 7 | 2-bit integer n such that 2^n gives compact_width(end_diff) |
encode_subspace_id(a.subspace_id) , or the empty string, if a.subspace_id == out.subspace_id | |
encode_path_relative(a.path, out.path) | |
start_diff, encoded as an unsigned, big-endian compact_width(start_diff) -byte integer | |
end_diff, encoded as an unsigned, big-endian compact_width(end_diff) -byte integer, or the empty string, if a.times.end == open |
encode_3drange_relative_3drange
To encode a 3dRange r relative to a reference 3dRange ref, we first define
- start_to_start as the absolute value of
r.times.start - ref.times.start
, - start_to_end as the absolute value of
r.times.start - ref.times.end
, - end_to_start as the absolute value of
r.times.end - ref.times.start
, - end_to_end as the absolute value of
r.times.end - ref.times.end
, - start_time_diff as the minimum of start_to_start and start_to_end, and
- end_time_diff as the minimum of end_to_start and end_to_end.
encode_3drange_relative_3drange(r, ref)
as the concatenation of:Bit | Big-endian bitfield |
---|---|
0, 1 | Encode r.subspaces.start?11 otherwise. |
2, 3 | Encode r.subspaces.end?10 otherwise. |
4 | Encode r.paths.start relative to ref.paths.start or to ref.paths.end?1 iff the longest common prefix of r.paths.start and ref.paths.start is at least as long as the longest common prefix of r.paths.start and ref.paths.end |
5 | 1 iff r.paths.end == open |
6 | Encode r.paths.end relative to ref.paths.start or to ref.paths.end (if at all)? |
7 | 1 iff r.times.end == open |
8 | Encode r.times.start relative to ref.times.start or ref.times.end?1 iff start_to_start <= start_to_end |
9 | Add or subtract start_time_diff?1 iff bit 8 is 1 and r.times.start >= ref.times.start , or bit 8 is 0 and r.times.start >= ref.times.end . |
10, 11 | 2-bit integer n such that 2^n gives compact_width(start_time_diff) |
12 | Encode r.times.end relative to ref.times.start or ref.times.end (if at all)? |
13 | Add or subtract end_time_diff (if encoding it at all)? |
14, 15 | ignored, or 2-bit integer n such that 2^n gives compact_width(end_time_diff) |
encode_subspace_id(r.subspaces.start) , or the empty string if r.subspaces.start == ref.subspaces.start or r.subspaces.start == ref.subspaces.end | |
encode_subspace_id(r.subspaces.end) , or the empty string if r.subspaces.end == open , r.subspaces.end == ref.subspaces.start or r.subspaces.end == ref.subspaces.end | |
encode_path_relative(r.paths.start, ref.paths.start) if the longest common prefix of r.paths.start and ref.paths.start is at least as long as the longest common prefix of r.paths.start and ref.paths.end, otherwise encode_path_relative(r.paths.start, ref.paths.end) | |
start_time_diff, encoded as an unsigned, big-endian compact_width(start_time_diff) -byte integer | |
end_time_diff, encoded as an unsigned, big-endian compact_width(end_time_diff) -byte integer, or the empty string, if end_time_diff.times.end == open |
Capabilities
Encodings for Meadowcap and McSubspaceCapabilities.
encode_subspace_capability
To encode a McSubspaceCapability c, we define encode_mc_subspace_capability(c)
as the concatenation of:
encode_namespace_pk(c.namespace_key) | |
encode_user_pk(c.user_key) | |
encode_namespace_sig(c.initial_authorisation) | |
for each pair (pk, sig) in c.delegations the concatenation of: |
encode_mc_capability
To encode a McCapability c whose granted area is know to be included in some22If no smaller containing Area is known from context, use the full area. Area out, we define encode_mc_capability(c)
depending on whether c.inner is a CommunalCapability or an OwnedCapability.
If c.inner is a CommunalCapability, then encode_mc_capability(c)
is the concatenation of:
byte 0x00 , if c.inner.access_mode == read ,byte 0x01 otherwise. | |
encode_namespace_pk(c.inner.namespace_key) | |
encode_user_pk(c.inner.user_key) | |
for each triplet (area, pk, sig) in c.inner.delegations the concatenation of:
|
If c.inner is an OwnedCapability, then encode_mc_capability(c)
is the concatenation of:
byte 0x02 , if c.inner.access_mode == read ,byte 0x03 otherwise. | |
encode_namespace_pk(c.inner.namespace_key) | |
encode_user_pk(c.inner.user_key) | |
encode_namespace_sig(c.inner.initial_authorisation) | |
for each triplet (area, pk, sig) in c.inner.delegations the concatenation of:
|