Encoding and Evolution
Nothing stays still; change is a constant. To cope with it, it's important to talk about backward and forward compatibility.
- Backward compatibility
- Newer code can read data that was written by older code.
- Forward compatibility
- Older code can read data that was written by newer code.
Let’s look at some formats that allow us to achieve that.
Formats for Encoding Data
There are normally two representations:
- In-memory (objects, structs, lists, arrays, etc.)
- Persistent (a self-contained sequence of bytes representing the object, on disk or sent over the network)
We need to translate between these two forms. The in-memory → persistent translation is called encoding (also serialization or marshalling), and the reverse is called decoding (also parsing, deserialization, or unmarshalling).
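As a minimal illustration of the round trip, here is the same idea with Python's built-in json module; the dict is the in-memory form, and the string returned by json.dumps is the encoded form:

import json

# In-memory representation: a plain Python dict of native types.
person = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

encoded = json.dumps(person)    # encoding (serialization): dict -> JSON text
decoded = json.loads(encoded)   # decoding (deserialization): JSON text -> dict

assert decoded == person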
Language-Specific Formats
Built-in, language-specific formats are generally bad (the sketch after this list shows why):
- They tie you to a specific language
- They can open you up to security concerns
- Versioning data is poor
- Efficiency is poor
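A minimal sketch of the first two problems using Python's built-in pickle module: the output is an opaque, Python-only byte string tied to this exact class definition, and unpickling untrusted bytes can execute arbitrary code.

import pickle

class Person:
    def __init__(self, user_name, favorite_number):
        self.user_name = user_name
        self.favorite_number = favorite_number

blob = pickle.dumps(Person("Martin", 1337))
print(blob[:20])  # opaque bytes that only Python (with this class available) can decode

# Decoding needs the same class importable under the same name, and
# pickle.loads() on untrusted input can run arbitrary code.
restored = pickle.loads(blob)
assert restored.user_name == "Martin"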
Instead, let’s talk about cross-language variants:
JSON, XML, and Binary Variants
JSON, XML, and CSV are widely used but subpar.
XML has known security issues (for example, external entity attacks).
JSON is poor at handling integers above 2^53, because many decoders represent numbers as IEEE 754 double-precision floats.
CSV has no schema, so it's up to the application to decide whether each value is a number or a string.
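A quick check of the 2^53 problem mentioned above: Python's int is arbitrary-precision, but a decoder that parses JSON numbers as IEEE 754 doubles (as JavaScript does) cannot tell 2^53 and 2^53 + 1 apart.

n = 2**53 + 1

# Simulate a decoder that stores JSON numbers as IEEE 754 doubles.
as_double = float(n)

print(n)                          # 9007199254740993
print(int(as_double))             # 9007199254740992 -- the +1 is silently lost
print(as_double == float(2**53))  # True: the two values collide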
Binary Encoding
JSON is the most popular format, so binary encodings of it have been developed to make it more compact over the wire, such as MessagePack, BSON, BJSON, and UBJSON.
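For instance, with the third-party msgpack package (an assumption here, not something the text prescribes), the example record from the next section encodes to noticeably fewer bytes than its JSON text, while still carrying every field name:

import json
import msgpack  # third-party: pip install msgpack

person = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

json_bytes = json.dumps(person).encode("utf-8")
packed = msgpack.packb(person)           # binary MessagePack encoding
print(len(json_bytes), len(packed))      # the binary form is smaller
assert msgpack.unpackb(packed, raw=False) == person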
Thrift and Protocol Buffers
Thrift and Protocol Buffers (protobuf) are binary encoding libraries that solve a few of the issues that JSON, XML, and CSV have.
Let's take this JSON data and express its schema in Thrift's interface definition language (IDL):
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
In protobuf:
message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}
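Assuming protoc has been run on the message above to generate a Python module (named person_pb2 here purely for this sketch), encoding and decoding look like this:

# Hypothetical module generated by: protoc --python_out=. person.proto
import person_pb2

p = person_pb2.Person(user_name="Martin", favorite_number=1337,
                      interests=["daydreaming", "hacking"])

data = p.SerializeToString()     # compact binary: tag numbers and values, no field names
restored = person_pb2.Person.FromString(data)
assert restored.user_name == "Martin"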
Field tags and schema evolution
Both formats can maintain backward and forward compatibility. Each field is identified by its tag number, not its name, so you can add new fields to the schema simply by giving them new tag numbers.
Old code can simply ignore fields with tag numbers it doesn't recognize; because each field carries a datatype annotation, the parser knows how many bytes to skip.
However, you cannot add a new field and mark it required: old data won't contain it, so new code reading old data would fail the required check, breaking backward compatibility. Every field added after the initial deployment must be optional or have a default value.
Removing a field is similar: you can remove an optional field but never a required one, and you may never reuse the removed field's tag number, because data written with the old tag may still exist and would be misinterpreted.
You may be able to change the datatype of a field, but there are caveats: values can lose precision or be truncated, for example when old code reads a 64-bit value into a 32-bit field.
Avro
Avro is another binary encoding format, but it doesn't use tag numbers.
It has two schema languages: Avro IDL, intended for human editing, and a JSON-based one that is more machine-readable.
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
The same schema in JSON:
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
With Avro, you go through the fields in the order they appear in the schema, using the schema to tell the datatype of each field.
When transferring data, Avro distinguishes two schemas: the writer's schema, whichever version the application used when it encoded the data, and the reader's schema, the version the decoding application expects.
The writer's schema doesn't have to be the same as the reader's. When data is decoded, the Avro library resolves the differences by translating data from the writer's schema into the reader's schema. Schema resolution matches up fields by field name, so field order doesn't matter.
If there’s a field that appears in the writer’s schema but not the reader’s schema, it’s ignored. (The reader doesn’t need it).
If the reader expects a field that the writer's schema lacks, it is filled in with the default value declared in the reader's schema.
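A sketch of schema resolution using the third-party fastavro library (an assumption; any Avro implementation behaves the same way). The record is written with one schema and decoded with a reader's schema that has dropped favoriteNumber and added a defaulted country field:

import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
}

reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin", "favoriteNumber": 1337})
buf.seek(0)

decoded = schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)  # {'userName': 'Martin', 'country': 'unknown'}: matched by name, default filled in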
Schema Evolution Rules
To maintain compatibility both forwards and backwards, you may only add or remove a field that has a default value.
Also, nothing is nullable by default; to allow null you have to use a union type that includes null.
Avro has no optional and required markers like Protocol Buffers and Thrift do.
Changing a data type is possible as long as the data types are convertible.
But what is the writer's schema?
How does the reader know which writer's schema a given piece of data was encoded with? It depends on the context:
- Large file with lots of records
- Avro specifies a file format (object container files) in which the writer's schema is stored once at the start of the file (a sketch follows this list).
- Database with individually written records
- include a version number at the beginning of every record, and keep a table with a version → schema mapping in it.
- Sending records over a network connection
- Negotiate the schema version when the connection is set up, and use it for the lifetime of the connection.
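A minimal sketch of the first case, again assuming the third-party fastavro package: the writer stores its schema once in the file header, and any reader can recover it from the file itself.

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

with open("people.avro", "wb") as out:
    writer(out, schema, [{"userName": "Martin"}, {"userName": "Lisa"}])

with open("people.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)   # the writer's schema, read back from the file header
    records = list(avro_reader)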
Dynamically generated schemas
Because fields are matched by name rather than by tag number, Avro is friendlier to dynamically generated schemas. For example, when dumping a database to a file, you can generate an Avro schema directly from the table definition; if the database schema changes, you just generate a new Avro schema, and schema resolution takes care of matching fields between old and new.
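As an illustration (hypothetical column list, plain Python, no database driver): generating an Avro schema from whatever columns a table currently has takes only a few lines, so re-exporting after a schema change is cheap.

# Hypothetical table definition, e.g. read from the database's catalog.
columns = [("userName", "string"), ("favoriteNumber", "long"), ("signupDate", "string")]

def avro_schema_for(table_name, columns):
    # Give every field a null default so older records, written before a
    # column existed, can still be decoded with the newer schema.
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            {"name": name, "type": ["null", avro_type], "default": None}
            for name, avro_type in columns
        ],
    }

print(avro_schema_for("Person", columns))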
Code generation
Code generation is optional with Avro: dynamically typed languages can simply open an Avro file, use the schema embedded in it, and work with the data much as they would with JSON, without generating any classes.
The Merits of Schemas
Binary, schema-based encodings are very nice: they are compact, the schema doubles as documentation, and they allow for backward and forward compatibility.
Modes of Dataflow
How does data flow through processes?
Dataflow Through Databases
Forward compatibility is often required for databases, because a value may be written by a newer version of the code and later read by an older version that is still running. There is also a subtler concern: if an old version of the application updates data previously written by a newer version, the fields it doesn't know about can be lost if you're not careful to preserve them.
Different values written at different times
A database allows any value to be updated at any time. A single database may contain values that were written recently or a long time ago. Databases may allow for simple migrations, such as adding a new column with a default value.
Dataflow Through Services: REST and RPC
The web works with a client and server model; the servers expose an API over the network, and the clients connect to servers to make requests to that API.
Generally, servers expose an application-specific API that only permits the inputs and outputs their business logic deems valid.
Building applications out of services that communicate in this way is commonly called a service-oriented architecture (SOA) or, more recently, a microservices architecture.
The Problems with RPCs
RPCs (Remote Procedure Calls) try to make executing code on another machine look like calling a local function. However, network calls are problematic for the following reasons:
- Networks are unpredictable: the request or response may be lost due to a network problem, the remote machine may be slow or unresponsive, or DNS resolution may have failed.
- A network call may time out and return without a result, in which case you don't know whether the remote call actually executed.
- Your request may in fact be getting through, with only the responses being lost; retrying then performs the action multiple times unless the operation is idempotent.
- A network request’s latency is wildly variable, ranging from milliseconds to seconds.
- Large objects are extremely expensive to send over the network, as they must be passed by value.
- The client and server may be using different encodings for numbers or strings, causing errors in translation.
REST vs RPCs
REST is typically more popular for public-facing APIs, while RPC frameworks are more popular for requests between services within the same organization.
Message-Passing Dataflow
Message-passing dataflow commonly means asynchronous message-passing systems: a client's request (a message) is delivered to another process with low latency, but it goes via an intermediary called a message broker, which stores the message temporarily.
Message brokers have the following advantages over direct RPC:
- They can act as a buffer, improving system reliability
- They can redeliver messages to a process that has crashed and restarted, preventing messages from being lost.
- The sender doesn't need to know the recipient's IP address or port number; it only needs to know the broker.
- One message can be sent to several recipients
- The sender and the recipient can be decoupled. The sender does not need to know about who receives its messages.
Message brokers
Popular message brokers are RabbitMQ, ActiveMQ, and Kafka. Normally, a process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers or subscribers to the topic.
This allows for asynchronous processing. A consumer may respond to a producer by publishing a message in another topic, which the producer may read (flipping the roles for the response).
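A minimal sketch using the third-party pika client against a local RabbitMQ broker (both are assumptions, not something the text prescribes). The producer only needs to know the broker and the queue name, never the consumer's address; in practice the producer and consumer would be separate processes.

import json
import pika  # third-party RabbitMQ client: pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="person_updates", durable=True)

# Producer: publish an encoded message to a named queue via the broker.
channel.basic_publish(
    exchange="",
    routing_key="person_updates",
    body=json.dumps({"userName": "Martin", "favoriteNumber": 1337}),
)

# Consumer: the broker delivers each message to a subscriber of the queue.
def handle(ch, method, properties, body):
    print("received:", json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="person_updates", on_message_callback=handle)
channel.start_consuming()  # blocks, handling messages as they arrive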
Distributed Actor frameworks
The actor model is a programming model for concurrency within a single process. Each actor has its own local state, not shared with any other actor, and communicates with other actors by sending asynchronous messages. A distributed actor framework scales this model across multiple nodes.
Some notable implementations:
- Akka: for the JVM
- Orleans: for the CLR
- OTP: for the BEAM