Encoding and Evolution
Nothing stays still; change is a constant. To cope with it, it's important to talk about backward and forward compatibility.
- Backward compatibility
- Newer code can read data that was written by older code.
- Forward compatibility
- Older code can read data that was written by newer code.
Let’s look at some formats that allow us to achieve that.
Formats for Encoding Data
There are normally two representations:
- In-memory (objects, structs, lists, arrays, etc.)
- Persistent (a self-contained sequence of bytes representing the object, on disk or sent over the network)
We need to translate between these two forms. The in-memory → persistent translation is called encoding (also serialization or marshalling), and the reverse is called decoding (also parsing, deserialization, or unmarshalling).
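As a minimal illustration of the round trip, here is the same idea with Python's built-in json module; the dict is the in-memory form, and the string returned by json.dumps is the encoded form:

import json

# In-memory representation: a plain Python dict of native types.
person = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

encoded = json.dumps(person)    # encoding (serialization): dict -> JSON text
decoded = json.loads(encoded)   # decoding (deserialization): JSON text -> dict

assert decoded == person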
Language-Specific Formats
Built-in, language-specific formats are generally bad (the sketch after this list shows why):
- They tie you to a specific language
- They can open you up to security concerns
- Versioning data is poor
- Efficiency is poor
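A minimal sketch of the first two problems using Python's built-in pickle module: the output is an opaque, Python-only byte string tied to this exact class definition, and unpickling untrusted bytes can execute arbitrary code.

import pickle

class Person:
    def __init__(self, user_name, favorite_number):
        self.user_name = user_name
        self.favorite_number = favorite_number

blob = pickle.dumps(Person("Martin", 1337))
print(blob[:20])  # opaque bytes that only Python (with this class available) can decode

# Decoding needs the same class importable under the same name, and
# pickle.loads() on untrusted input can run arbitrary code.
restored = pickle.loads(blob)
assert restored.user_name == "Martin"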
Instead, let’s talk about cross-language variants:
JSON, XML, and Binary Variants
JSON, XML, and CSV are widely used but subpar.
XML has known security issues (for example, external entity attacks).
JSON is poor at handling integers above 2^53, because many decoders represent numbers as IEEE 754 double-precision floats.
CSV has no schema, so it's up to the application to decide whether each value is a number or a string.
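A quick check of the 2^53 problem mentioned above: Python's int is arbitrary-precision, but a decoder that parses JSON numbers as IEEE 754 doubles (as JavaScript does) cannot tell 2^53 and 2^53 + 1 apart.

n = 2**53 + 1

# Simulate a decoder that stores JSON numbers as IEEE 754 doubles.
as_double = float(n)

print(n)                          # 9007199254740993
print(int(as_double))             # 9007199254740992 -- the +1 is silently lost
print(as_double == float(2**53))  # True: the two values collide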
Binary Encoding
JSON is the most popular format, so binary encodings of it have been developed to make it more compact over the wire, such as MessagePack, BSON, BJSON, and UBJSON.
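For instance, with the third-party msgpack package (an assumption here, not something the text prescribes), the example record from the next section encodes to noticeably fewer bytes than its JSON text, while still carrying every field name:

import json
import msgpack  # third-party: pip install msgpack

person = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

json_bytes = json.dumps(person).encode("utf-8")
packed = msgpack.packb(person)           # binary MessagePack encoding
print(len(json_bytes), len(packed))      # the binary form is smaller
assert msgpack.unpackb(packed, raw=False) == person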
Thrift and Protocol Buffers
Thrift and Protocol Buffers (protobuf) are binary encoding libraries that solve a few of the issues that JSON, XML, and CSV have.
Let's take this JSON data and express its schema in Thrift's interface definition language (IDL):
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
In protobuf:
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}
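Assuming protoc has been run on the message above to generate a Python module (named person_pb2 here purely for this sketch), encoding and decoding look like this:

# Hypothetical module generated by: protoc --python_out=. person.proto
import person_pb2

p = person_pb2.Person(user_name="Martin", favorite_number=1337,
                      interests=["daydreaming", "hacking"])

data = p.SerializeToString()     # compact binary: tag numbers and values, no field names
restored = person_pb2.Person.FromString(data)
assert restored.user_name == "Martin"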
Field tags and schema evolution
Both formats can maintain backward and forward compatibility. Each field is identified by its tag number, not its name, so you can add new fields to the schema simply by giving them new tag numbers.
Old code can simply ignore fields with tag numbers it doesn't recognize; because each field carries a datatype annotation, the parser knows how many bytes to skip.
However, you cannot add a new field and mark it required: old data won't contain it, so new code reading old data would fail the required check, breaking backward compatibility. Every field added after the initial deployment must be optional or have a default value.
Removing a field is similar: you can remove an optional field but never a required one, and you may never reuse the removed field's tag number, because data written with the old tag may still exist and would be misinterpreted.
You may be able to change the datatype of a field, but there are caveats: values can lose precision or be truncated, for example when old code reads a 64-bit value into a 32-bit field.
Avro
Avro is another binary encoding format, but it doesn't use tag numbers.
It has two schema languages: Avro IDL, intended for human editing, and a JSON-based one that is more machine-readable.
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
The same schema in JSON:
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
With Avro, you go through the fields in the order they appear in the schema, using the schema to tell the datatype of each field.
When transferring data, Avro distinguishes two schemas: the writer's schema, whichever version the application used when it encoded the data, and the reader's schema, the version the decoding application expects.
The writer's schema doesn't have to be the same as the reader's. When data is decoded, the Avro library resolves the differences by translating data from the writer's schema into the reader's schema. Schema resolution matches up fields by field name, so field order doesn't matter.
If there’s a field that appears in the writer’s schema but not the reader’s schema, it’s ignored. (The reader doesn’t need it).
If the reader expects a field that the writer's schema lacks, it is filled in with the default value declared in the reader's schema.
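A sketch of schema resolution using the third-party fastavro library (an assumption; any Avro implementation behaves the same way). The record is written with one schema and decoded with a reader's schema that has dropped favoriteNumber and added a defaulted country field:

import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
}

reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin", "favoriteNumber": 1337})
buf.seek(0)

decoded = schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)  # {'userName': 'Martin', 'country': 'unknown'}: matched by name, default filled in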
Schema Evolution Rules
To maintain compatibility both forwards and backwards, you may only add or remove a field that has a default value.
Also, nothing is nullable by default; to allow null you have to use a union type that includes null.
Avro has no optional and required markers like Protocol Buffers and Thrift do.
Changing a data type is possible as long as the data types are convertible.
But what is the writer's schema?
How does the reader know which writer's schema a given piece of data was encoded with? It depends on the context:
- Large file with lots of records
- Avro specifies a file format (object container files) in which the writer's schema is stored once at the start of the file (a sketch follows this list).
- Database with individually written records
- include a version number at the beginning of every record, and keep a table with a version → schema mapping in it.
- Sending records over a network connection
- Negotiate the schema version when the connection is set up, and use it for the lifetime of the connection.
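A minimal sketch of the first case, again assuming the third-party fastavro package: the writer stores its schema once in the file header, and any reader can recover it from the file itself.

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

with open("people.avro", "wb") as out:
    writer(out, schema, [{"userName": "Martin"}, {"userName": "Lisa"}])

with open("people.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)   # the writer's schema, read back from the file header
    records = list(avro_reader)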
Dynamically generated schemas
Because fields are matched by name rather than by tag number, Avro is friendlier to dynamically generated schemas. For example, when dumping a database to a file, you can generate an Avro schema directly from the table definition; if the database schema changes, you just generate a new Avro schema, and schema resolution takes care of matching fields between old and new.
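As an illustration (hypothetical column list, plain Python, no database driver): generating an Avro schema from whatever columns a table currently has takes only a few lines, so re-exporting after a schema change is cheap.

# Hypothetical table definition, e.g. read from the database's catalog.
columns = [("userName", "string"), ("favoriteNumber", "long"), ("signupDate", "string")]

def avro_schema_for(table_name, columns):
    # Give every field a null default so older records, written before a
    # column existed, can still be decoded with the newer schema.
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            {"name": name, "type": ["null", avro_type], "default": None}
            for name, avro_type in columns
        ],
    }

print(avro_schema_for("Person", columns))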
Code generation
Code generation is optional with Avro: dynamically typed languages can simply open an Avro file, use the schema embedded in it, and work with the data much as they would with JSON, without generating any classes.
The Merits of Schemas
Binary, schema-based encodings are very nice: they are compact, the schema doubles as documentation, and they allow for backward and forward compatibility.
Modes of Dataflow
How does data flow through processes?
Dataflow Through Databases
Forward compatibility is often required for databases, because a value may be written by a newer version of the code and later read by an older version that is still running. There is also a subtler concern: if an old version of the application updates data previously written by a newer version, the fields it doesn't know about can be lost if you're not careful to preserve them.
Different values written at different times
A database allows any value to be updated at any time. A single database may contain values that were written recently or a long time ago. Databases may allow for simple migrations, such as adding a new column with a default value.
Dataflow Through Services: REST and RPC
The web works with a client and server model; the servers expose an API over the network, and the clients connect to servers to make requests to that API.
Generally, servers expose an application-specific API that only permits the inputs and outputs their business logic deems valid.
Building applications out of services that communicate in this way is commonly called a service-oriented architecture (SOA) or, more recently, a microservices architecture.
The Problems with RPCs
RPCs (Remote Procedure Calls) try to make executing code on another machine look like calling a local function. However, network calls are problematic for the following reasons:
- Networks are unpredictable: the request or response may be lost due to a network problem, the remote machine may be slow or unresponsive, or DNS resolution may have failed.
- A network call may time out and return without a result, in which case you don't know whether the remote call actually executed.
- Your request may in fact be getting through, with only the responses being lost; retrying then performs the action multiple times unless the operation is idempotent.
- A network request’s latency is wildly variable, ranging from milliseconds to seconds.
- Large objects are extremely expensive to send over the network, as they must be passed by value.
- The client and server may be using different encodings for numbers or strings, causing errors in translation.
REST vs RPCs
REST is typically more popular for public-facing APIs, while RPC frameworks are more popular for requests between services within the same organization.
Message-Passing Dataflow
Message-passing dataflow commonly means asynchronous message-passing systems: a client's request (a message) is delivered to another process with low latency, but it goes via an intermediary called a message broker, which stores the message temporarily.
Message brokers have the following advantages over direct RPC:
- They can act as a buffer, improving system reliability
- They can redeliver messages to a process that has crashed and restarted, preventing messages from being lost.
- The sender doesn't need to know the recipient's IP address or port number; it only needs to know the broker.
- One message can be sent to several recipients
- The sender and the recipient can be decoupled. The sender does not need to know about who receives its messages.
Message brokers
Popular message brokers are RabbitMQ, ActiveMQ, and Kafka. Normally, a process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers or subscribers to the topic.
This allows for asynchronous processing. A consumer may respond to a producer by publishing a message in another topic, which the producer may read (flipping the roles for the response).
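A minimal sketch using the third-party pika client against a local RabbitMQ broker (both are assumptions, not something the text prescribes). The producer only needs to know the broker and the queue name, never the consumer's address; in practice the producer and consumer would be separate processes.

import json
import pika  # third-party RabbitMQ client: pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="person_updates", durable=True)

# Producer: publish an encoded message to a named queue via the broker.
channel.basic_publish(
    exchange="",
    routing_key="person_updates",
    body=json.dumps({"userName": "Martin", "favoriteNumber": 1337}),
)

# Consumer: the broker delivers each message to a subscriber of the queue.
def handle(ch, method, properties, body):
    print("received:", json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="person_updates", on_message_callback=handle)
channel.start_consuming()  # blocks, handling messages as they arrive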
Distributed Actor frameworks
The actor model is a programming model for concurrency within a single process. Each actor has its own local state, not shared with any other actor, and communicates with other actors by sending asynchronous messages. A distributed actor framework scales this model across multiple nodes.
Some notable implementations:
- Akka: for the JVM
- Orleans: for the CLR
- OTP: for the BEAM