Replication and Consistency over Distributed Architecture: No Silver Bullet

Mainak Saha
Jan 23, 2021 · 6 min read

Replication is one of the key concepts we should learn before implementing various design patterns over distributed systems. For a Solution Architect, it is crucial to define the RTO and RPO (Recovery Time Objective and Recovery Point Objective) of a system. Both are directly tied to how your data is replicated and to the methodologies used to move data from disk to disk, system to system, or data center to data center.

We cannot ignore these concepts even in the era of the Cloud, where we tend to dump everything on the cloud as if it will solve all our problems. Trust me, there is still no silver bullet for a few of the core problems, and replication is one of them.

In this article, I’ll cover a few basic concepts and techniques of replication, along with their limitations. This is an evolving field, and a lot will change as networks and computing power advance.

Leaders, Followers, and Consistency

Each node of the datastore that holds a copy of the data is called a replica. The node where data is written is called the leader, and from the leader the data is copied to all other nodes, known as followers.


Synchronous vs. Asynchronous Replication

I will not spend time on synchronous replication: in a practical design, you will never achieve scalability if every write has to block the writer until the data is written to all nodes.

All the fun starts with asynchronous replication. To explain it better, I’ll borrow one of the diagrams from my previous article on CQRS.

fig 1: Replication Lag

Here we see the consistency lag between the leader and a follower. The follower takes time to catch up to a state consistent with the leader. This is called eventual consistency: a read immediately after a write may not yet reflect that write.
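To make the lag concrete, here is a minimal Python sketch of asynchronous replication (the Leader and Follower classes are purely illustrative, not any real database API): the leader acknowledges the write immediately, while the follower only sees it after a delay.

```python
import threading
import time

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)       # may be stale during the lag window

class Leader:
    """Toy leader: acknowledges writes immediately, replicates asynchronously."""

    def __init__(self, followers, lag_seconds=0.5):
        self.data = {}
        self.followers = followers
        self.lag_seconds = lag_seconds

    def write(self, key, value):
        self.data[key] = value          # write acknowledged right away
        for f in self.followers:        # replication happens in the background
            threading.Timer(self.lag_seconds, f.apply, args=(key, value)).start()

follower = Follower()
leader = Leader([follower])
leader.write("balance", 8)
print(follower.read("balance"))   # likely None: replication has not caught up yet
time.sleep(1)
print(follower.read("balance"))   # eventually 8: the follower is now consistent
```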

Handling eventual consistency is very important when designing a replicated system across multiple zones or regions on the cloud. To be very clear, you have to keep eventual consistency in mind while using the databases mentioned below.

  • AWS RDS
  • AWS DynamoDB
  • Azure Cosmos DB
  • GCP Cloud SQL
  • Any other DB with Zonal / Regional replication

Here is a diagram that describes how availability and scalability relate to consistency.

fig 2: Consistency vs. Performance

It is an inverse relationship. If you are looking for higher availability, you have to compromise on consistency; it is a trade-off between performance and consistency.

Before moving on to some solution approaches for consistency issues, let me describe a classic “time travel” scenario caused by replication lag.

In this scenario, we have two writers and one observer, and data is being written and replicated across multiple regions to multiple nodes.

fig 3: Sequencing issue

Here you can see that for Mr. Z, who is reading from a replica, it seems Mr. Y can truly see the future: the answer appears even before the question is asked.

How to deal with Replication Lag

As the title of the article says, there is no silver bullet for replication lag, but some design considerations can solve some of the issues.

  • If a user performs data changes specific to their own profile, then writes and reads for that profile should go to the same node. This approach doesn’t scale if the user can change a huge set of data; you end up with traffic going to only a few specific nodes.
  • For transactional data, where the application is making a decision based on some value, it should always use the leader. The other replicated nodes can be used for view/report use cases. For example, a balance check before spending money should be done on the leader, whereas the balance shown on the screen can come from a follower. A screen refresh will display whatever data the follower has, so momentarily it may show $10 in your wallet when it is actually $8, but it will not let you spend $9, because the balance check runs on the leader (refer to fig 1).
  • With a single leader and a single partition, replication lag can be handled with sequence numbers: the client remembers the sequence number of its last write and only reads from replicas that have caught up to it (see the sketch below).
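As a rough illustration of the sequence-number idea, here is a minimal Python sketch (the Replica class and helper are hypothetical, not any real database API): reads are accepted only from replicas that have applied at least the client’s last write.

```python
# Read-your-writes using sequence numbers (illustrative only).

class Replica:
    def __init__(self, applied_seq=0):
        self.data = {}
        self.applied_seq = applied_seq   # highest write sequence number applied

    def read(self, key):
        return self.data.get(key)

def read_your_writes(replicas, key, last_write_seq):
    """Read only from a replica that has caught up to the client's last write."""
    for replica in replicas:
        if replica.applied_seq >= last_write_seq:
            return replica.read(key)
    return None   # in practice, fall back to the leader here

# The client wrote at sequence 42; only the second replica has caught up.
stale, fresh = Replica(applied_seq=40), Replica(applied_seq=45)
fresh.data["profile"] = "updated bio"
print(read_your_writes([stale, fresh], "profile", last_write_seq=42))  # "updated bio"
```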

Multiple Leaders

So far we have been banging our heads against a single leader. Now think about a truly global system, where an application is deployed across multiple regions and countries, and each region maintains its own leader.

A typical multi-leader topology looks like this, and conflict resolution is one of the major parts of it.

fig 4: Multi Leader

A simple way to describe a conflict is when newer data gets overwritten by older data. Conflicts cannot be fully prevented without transactions: a writer would need to take a lock on the record before updating it, which is not practical in a distributed, multi-leader architecture. You have to embrace conflicts and build robust conflict resolution strategies. I don’t want to discuss conflict resolution algorithms in this article; I’ll cover them in the future.
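To make the conflict concrete, here is a toy Python sketch (the RegionalLeader class is hypothetical): two leaders accept concurrent writes to the same record and then naively exchange updates, ending up with different values.

```python
# How a multi-leader conflict arises (illustrative only).

class RegionalLeader:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def local_write(self, key, value):
        # Accepted locally, with no coordination with the other leader.
        self.data[key] = value

us = RegionalLeader("us-east")
eu = RegionalLeader("eu-west")

us.local_write("title", "Draft A")   # both regions update the same record
eu.local_write("title", "Draft B")   # concurrently

# Each leader now asynchronously ships its update to the other.
# Naively applying the peer's value produces divergent replicas:
us_update, eu_update = dict(us.data), dict(eu.data)
us.data.update(eu_update)            # us-east now holds "Draft B"
eu.data.update(us_update)            # eu-west now holds "Draft A"

print(us.data["title"], eu.data["title"])  # the two leaders disagree: a conflict
```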

fig 5: Topology

There are three common topologies for multi-leader replication:

  • Circular
  • Star
  • All to all

All three topologies are extensively used in distributed systems, but each has its shortfalls.

In circular and star topologies, data needs to travel through multiple nodes before reaching all of them; if one node goes down, there is a high possibility of breaking the chain.

All-to-all resolves the above-mentioned problem, but it has its own issues. Due to network latencies between nodes, a later update can arrive before an earlier update from another node, corrupting the data.

fig 6: Issue with All to all topology
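Here is a rough Python sketch of that hazard (all names and values are made up): blindly applying updates in arrival order would overwrite the newer value with the older one; a simple per-record version check is one common guard, though it is not a full conflict resolution scheme.

```python
# Out-of-order delivery in an all-to-all topology (illustrative only).

record = {"value": "initial", "version": 0}

def apply_update(record, value, version):
    # Ignore updates that are older than what the record already holds.
    if version > record["version"]:
        record["value"] = value
        record["version"] = version

# Due to network latency, the version-2 update arrives before the version-1 one.
apply_update(record, "update from node B", 2)
apply_update(record, "update from node A", 1)   # ignored: it is older

print(record)   # {'value': 'update from node B', 'version': 2}
```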

Leaderless Replication: Quorums

This is a slightly different technique. In everything we have discussed so far, once data is written to a leader, it is the leader’s responsibility to replicate it to the others. Now we hand that responsibility off to the client (or to a coordinator sitting one step above the nodes). In this style of replication there are no leaders: it is the client’s responsibility to write to multiple nodes, and similarly to read from multiple nodes and determine the data’s correct state.

This is called quorum reading and writing. If n is the number of replicas, w is the number of nodes that must acknowledge a write, and r is the number of nodes that must respond to a read, then as long as w + r > n, every read overlaps with at least one node that has the latest write. r and w are called the read and write quorums. If you are familiar with ZooKeeper, you will find similar quorum settings in its configuration.

fig 7: Quorums

Here you can see how a quorum-based system can detect a stale replica during a read and update it with the latest value; this is called read repair.
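Here is a minimal Python sketch of quorum writes and reads with read repair, assuming n = 3, w = 2, r = 2 (so w + r > n). The Node class and helpers are illustrative, not any real client library.

```python
# Quorum read/write with read repair (illustrative only).

class Node:
    def __init__(self):
        self.value, self.version = None, 0

    def write(self, value, version):
        # Accept the write only if it is at least as new as what we hold.
        if version >= self.version:
            self.value, self.version = value, version
        return True                     # acknowledge the write

    def read(self):
        return self.value, self.version

def quorum_write(nodes, value, version, w=2):
    acks = sum(1 for node in nodes if node.write(value, version))
    return acks >= w                    # succeed only with a write quorum

def quorum_read(nodes, r=2):
    responses = [(node, *node.read()) for node in nodes[:r]]
    _, value, version = max(responses, key=lambda t: t[2])   # newest version wins
    for node, _, node_version in responses:                  # read repair:
        if node_version < version:                           # fix stale replicas
            node.write(value, version)
    return value

replicas = [Node(), Node(), Node()]
quorum_write(replicas[:2], "v1", version=1)        # the write misses replicas[2]
print(quorum_read([replicas[2], replicas[0]]))     # "v1", and replicas[2] is repaired
```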

Is there any easy way to sync? Timestamp?

If you have designed this kind of system before, you may be wondering why we don’t simply use timestamps to order the updates; that should resolve all conflicts.

I agree life would be much simpler if we could build systems based only on timestamps, but the reality, once you go down to the sub-second level, is:

  • Across servers, clocks are not synced.
  • Across multiple zones, clocks are out of sync.
  • Across multiple regions .. what do you expect?

Hence, when data flows from region to region, we cannot reliably resolve conflicts or order updates based on their timestamps.
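A tiny Python sketch of why this breaks (all values are made up): a last-write-wins rule that orders updates by wall-clock timestamps silently drops the genuinely newer write when the other node’s clock runs fast.

```python
# Last-write-wins with skewed clocks (illustrative only).

def last_write_wins(updates):
    """Pick the update with the highest timestamp (a common but risky rule)."""
    return max(updates, key=lambda u: u["timestamp"])

updates = [
    # Node A's clock is 200 ms fast; it actually wrote first in real time.
    {"value": "older write", "timestamp": 1611369600.200},
    # Node B's clock is accurate; it wrote 100 ms later in real time.
    {"value": "newer write", "timestamp": 1611369600.100},
]

print(last_write_wins(updates)["value"])  # "older write" — the newer write is lost
```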

Cloud Spanner: GCP

To answer many of these problems, GCP came up with Cloud Spanner, considered by some as the closest thing to a silver bullet in this space.

Cloud Spanner uses Google’s TrueTime, which bounds clock uncertainty across data centers to a few milliseconds, to order and synchronize writes. It promises strong global transactional consistency without compromising much on performance.

I have tried to squeeze a lot of information into a short read; each of these items is a topic in itself and can be discussed for hours. I have referred to a few books and articles if you want to dig in further.

Further Reads

