Creating a website is the first step when setting up your presence on the Internet. To thrive long-term, you must also ensure your site can scale to accommodate growth. And one of the first steps is to implement a database that can scale with you. Otherwise, you risk experiencing slow query performance and database outages.
This post will discuss how you can use database sharding to achieve high scalability and availability for your data. We will also touch on the drawbacks of sharding and the different sharding architectures you can use.
What Is Database Sharding?
Sharding is an optimization technique that distributes tables across other database servers. It’s like partitioning in the sense that both involve breaking up data into smaller subsets. The difference is that sharding distributes these subsets to different servers while partitioning stores )them in one database. These servers use the same database engine and hardware type to achieve a similar performance level for all shards.
Sharding aims to accomplish a share-nothing architecture, eliminating processing bottlenecks and single points of failure.
You can implement sharding in two ways — horizontally and vertically. Horizontal sharding divides the table based on rows, while vertical sharding divides the tables based on columns.
In this regard, sharding is like partitioning, which divides large tables into smaller ones.
Horizontal sharding is effective for databases where most queries return a subset of rows, such as a customer database that returns data (like name, address, email, and so on) at once.
Vertical sharding is effective for databases whose queries return single columns. For example, if the customer database returned the customer’s name or email separately, you could separate the name and email into different clusters.
Benefits of Database Sharding
Below are some of the advantages of database sharding.
Improved Horizontal Scaling
You can scale your database vertically or horizontally. Vertical scaling refers to adding more central processing units (CPU) and random access memory (RAM) to the server to improve performance. Vertical scaling is a helpful solution for small to medium databases. However, as your data grows, vertical scaling becomes infeasible. There’s only so much power you can add to a single server.
Horizontal scaling is more flexible. It enables you to scale your database as needed by adding more servers to your system. Each of these servers provides resources to different database shards. This distributes the workload and improves the system’s capability to handle more requests.
Faster Query Response Times
Shards have only a few rows and columns. Because of this, it takes less time to process database queries. In contrast, a query of a non-sharded database might require a search through hundreds — or even thousands — of rows.
Increased Reliability in Outage Situations
Database outages happen for various reasons, including accidental data deletion, connection errors, and cybersecurity attacks. Sharding minimizes the effects of outages. Since each shard is autonomous, only the affected shard faces downtime. For example, if you have four shards and experience an outage in one of them, only 25 percent of operations will be affected.
Drawbacks of Sharding
Although sharding improves a database’s reliability and availability, implementing it is complex. Using the wrong sharding architecture can slow down performance and lead to data loss.
Be sure to choose a sharding technique that allows a balanced data distribution across all shards. Without this balance, you risk creating database hotspots, which happen when one shard stores most of the data while other shards remain practically empty. This reduces the write throughput to the single shard.
To solve this, you could partition the unbalanced shard even further, but that process is challenging and may take down your database while you migrate data.
Another drawback of sharding is that SQL joins involving multiple tables in different shards can become too slow and degrade performance. However, with the right architecture, you can avoid this problem.
You can implement sharding using three architectures:
- Key-based sharding
- Range-based sharding
- Directory-based sharding
The architecture you choose depends on your use case.
In a key- or hashed-based sharding architecture, a database application uses a shard key to locate a shard. A hashing function hashes the sharding key value, and the output maps data to a particular shard. A simple hashing function can be the modulus of the key and the number of shards.
The hash function can take more than one sharding key. Because of this, key-based sharding is suitable for data records that may have shared keys. Algorithmically distributing the data minimizes the possibility of creating database hotspots where one shard contains more data than the other.
However, since distribution relies only on the hashing function, it’s impossible to logically group data together. Therefore, database operations that require data from multiple shards may be inefficient as they require reading data from each shard.
Range-based sharding involves sharding a database depending on a specified range of values.
It uses a sharding key to determine which shard to assign a value to. The database application checks the shard that corresponds to the sharding key in a lookup table and stores the data. Because of this, range-based sharding is easy to design and implement.
For example, you could use the user ID value in a user database as the sharding key. You could store users with IDs from 0-2,000 on one shard, those between 2,000 and 4,000 on another shard, and so on.
Range-based sharding can cause database hotspots. Consider a user database in which most of your user IDs lie between 2,001 and 4,000. The process assigns them to a single shard, creating an imbalance over time. Range-based sharding, therefore, works best for evenly distributed data.
Directory-based sharding groups logically related data in the same shard. It uses a lookup table containing a list of mappings for each entity in the database. Each mapping corresponds to a database shard.
Directory-based sharding is more flexible than range-based or key-based sharding because you can add data to shards dynamically. There’s no sharding function to follow or range values to stay within. This flexibility increases the database efficiency: You can store related data in one shard, which means executing common queries takes less time.
For example, if you used directory-based sharding and grouped users according to their location, retrieving users from a particular place, you only query a single shard.
Database Sharding with Kinsta
Most modern database engines provide database sharding support. One of these database engines is MariaDB, a commercially supported fork of MySQL. It’s a high-performing open-source database system adopted by companies like IBM, GitHub, and Wikimedia. It is also part of the high-performance server stack at Kinsta.
MariaDB offers built-in sharding features through the spider storage engine. The spider storage engine is a cluster formation engine that supports partitioning and extended architecture (XA) transactions. It allows you to treat remote tables from different instances as if they are in the same instance. Once you create a table in the spider storage engine, the table links to another table in the remote MariaDB server. Once establishing the connection, the storage engine shares the link with all tables that are part of the same transaction.
Database sharding is a scaling technique that partitions tables into smaller subsets and distributes them to different servers called shards. You can implement sharding through various means, like key-based sharding, range-based sharding, and directory-based sharding.
While sharding improves a database’s scalability, reliability, and availability, it’s very complex to implement. Furthermore, once you create a shard, it isn’t easy to revert the database to its unsharded state. Because of this, use sharding for optimization only when you are sure other scalability options won’t work.