TL;DR: How efficiently your MongoDB database runs depends on which document field you designate as the shard key. Because you have to select the shard key up front and can’t change it later, the choice deserves careful consideration. For query-focused apps, choose a key that confines each query to one or a few shards; for write-heavy apps that need to spread load across the cluster, choose a key that distributes writes evenly.
The outlook is rosy for MongoDB, the most popular NoSQL DBMS. Research and Markets’ March 2014 report entitled Global NoSQL Market 2014-2018 predicts that the overall NoSQL market will grow at a compound annual rate of 53 percent between 2013 and 2018. Much of the increase will be driven by increased use of big data in organizations of all sizes, according to the report.
Topping the list of MongoDB’s advantages over relational databases are efficiency, easy scalability, and “deep query-ability,” as Tutorialspoint’s MongoDB Tutorial describes it. As usual, there’s a catch: MongoDB’s efficient data storage, scaling, and querying depend on sharding, and sharding depends on the careful selection of a shard key.
As the MongoDB Manual explains, every sharded collection has a shard key: an indexed field or compound indexed field that exists in every document and determines how the collection’s documents are distributed among a cluster’s shards. Sharding allows the database to scale horizontally across commodity servers, which generally costs less than scaling vertically by adding processors, memory, and storage to a single server.
A mini-shard-key-selection vocabulary
MongoDB partitions a sharded collection’s documents into chunks based on ranges of values in the shard key, and it splits a chunk once the chunk grows beyond the configured chunk size. Keep in mind that once you choose a shard key, you’re stuck with it: you can’t change it later.
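To make range-based chunking concrete, here is a minimal Python sketch of a toy model. It is not MongoDB’s implementation: the class ChunkedCollection, the constant MAX_DOCS_PER_CHUNK, and the split-at-the-median rule are all assumptions made for illustration, and the toy limit counts documents rather than bytes.

```python
from bisect import bisect_right

# Hypothetical limit: a chunk holding more keys than this gets split.
MAX_DOCS_PER_CHUNK = 4


class ChunkedCollection:
    """Toy model: chunk i owns the key range [bounds[i-1], bounds[i])."""

    def __init__(self):
        self.bounds = []    # sorted split points between neighboring chunks
        self.chunks = [[]]  # chunks[i] holds the sorted keys of chunk i

    def insert(self, key):
        i = bisect_right(self.bounds, key)  # index of the chunk owning key
        chunk = self.chunks[i]
        chunk.append(key)
        chunk.sort()
        if len(chunk) > MAX_DOCS_PER_CHUNK:
            self._split(i)

    def _split(self, i):
        chunk = self.chunks[i]
        mid = chunk[len(chunk) // 2]  # the median key becomes the split point
        left = [k for k in chunk if k < mid]
        right = [k for k in chunk if k >= mid]
        if not left:
            # No key sorts below the median, so this toy model gives up.
            # This is what happens when the key has too few distinct values.
            return
        self.chunks[i:i + 1] = [left, right]
        self.bounds.insert(i, mid)


coll = ChunkedCollection()
for key in range(10):
    coll.insert(key)
print(coll.bounds)  # -> [2, 4, 6]
print(coll.chunks)  # -> [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]
```

If every document carried the same key value, the single chunk in this model could never be split, no matter how large it grew.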
The characteristic that makes a chunk easy to divide is cardinality: the more distinct values the shard key has, the more finely chunks can be split. The MongoDB Manual recommends that your shard keys have a high degree of randomness to ensure the cluster’s write operations are distributed evenly, which is referred to as write scaling. The drawback is that a highly random key makes it a challenge to target specific shards. When the shard key lets a query be routed to a single shard, the query runs much more efficiently; this is called query isolation.
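The effect of randomness on write distribution can be shown with a short simulation. This is only a sketch under stated assumptions: four shards, each owning an equal contiguous slice of a made-up integer key space, with NUM_SHARDS, KEY_SPACE, shard_for, and write_load all being hypothetical names rather than anything MongoDB provides.

```python
import random
from collections import Counter

NUM_SHARDS = 4
KEY_SPACE = 1_000_000  # hypothetical key range, split evenly across shards


def shard_for(key):
    """Range-based routing: each shard owns one contiguous slice of keys."""
    return key * NUM_SHARDS // KEY_SPACE


def write_load(keys):
    """Count how many writes each shard receives for a batch of keys."""
    return Counter(shard_for(k) for k in keys)


# 1,000 consecutive writes with an ever-increasing key, such as a timestamp.
monotonic_keys = range(990_000, 991_000)

# 1,000 writes whose keys are scattered randomly over the key space.
rng = random.Random(0)
random_keys = [rng.randrange(KEY_SPACE) for _ in range(1_000)]

print(write_load(monotonic_keys))  # all 1,000 writes hit shard 3
print(write_load(random_keys))     # writes are spread over the shards
```

The same property that spreads the random keys also means a reader cannot predict which shard holds a given document without knowing its exact key, which is the query-isolation cost described above.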
When a collection doesn’t have a field suitable to use as a shard key, a compound shard key can be used, or a field can be added to serve as the key.
Choice of shard key depends on the nature of the collection
How do you know which field to use as the shard key? A post by Goran Zugic from May 2014 explains the three types of sharding MongoDB supports: range-based, hash-based, and tag-aware.
The primary consideration when deciding which shard key to designate is how the collection will be used. Zugic presents it as a balancing act between query isolation and write scaling: the former is preferred when queries should be routed to one shard or a small number of shards; the latter when distributing writes evenly across the cluster’s shards is paramount.
MongoDB’s balancer works to keep roughly the same number of chunks on every shard (each shard is typically a replica set), as Conrad Irwin describes in a March 2014 post on the BugSnag site. Irwin lists three factors that determine choice of shard key:
Irwin provides two examples. The simplest approach is to use a hash of the _id of your documents:
In addition to distributing reads and writes evenly, the technique gives virtually every document its own shard key value, which makes chunks as easy to split as possible.
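To see why hashing the _id spreads documents evenly, consider the following Python sketch. It assumes a 64-bit hash space divided into equal quarters among four shards, and it uses md5 purely as a convenient stand-in hash; the names hashed_key and shard_for are hypothetical and this is not a reproduction of MongoDB’s own hashing or of Irwin’s code.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4
HASH_SPACE = 2 ** 64


def hashed_key(doc_id):
    """Map an _id to a 64-bit integer (md5 is only a stand-in hash here)."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(doc_id):
    """Each shard owns one contiguous quarter of the hashed key space."""
    return hashed_key(doc_id) * NUM_SHARDS // HASH_SPACE


# Sequential ids: as raw range keys, they would cluster on a single shard.
load = Counter(shard_for(i) for i in range(10_000))
print(sorted(load.items()))  # roughly 2,500 documents per shard
```

Because consecutive ids hash to unrelated values, a burst of inserts lands on all shards at once. The price is that a range query over _id can no longer be served by a single shard.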
The other example groups related documents in the index by project while also applying a hash to distinguish shard keys:
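As a rough illustration of that idea (not Irwin’s actual code), the Python sketch below models a compound key of the form (project_id, hashed _id). The field name project_id, the sample data, the chunk size of 100 documents, and the md5 stand-in hash are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_right


def hashed(doc_id):
    """64-bit stand-in hash of an _id (md5 is an assumption, as above)."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_key(doc):
    """Compound key: project first (groups related docs), then hashed _id."""
    return (doc["project_id"], hashed(doc["_id"]))


# Hypothetical data: 3 projects with 400 documents each.
docs = [{"_id": i, "project_id": f"proj-{i % 3}"} for i in range(1200)]
keys = sorted(shard_key(d) for d in docs)

# Cut the sorted key space into chunks of 100 documents.
# Chunk i owns the key range [bounds[i-1], bounds[i]).
CHUNK_SIZE = 100
bounds = [keys[i] for i in range(CHUNK_SIZE, len(keys), CHUNK_SIZE)]


def chunks_for_project(project_id):
    """Indexes of the chunks whose range can hold one project's documents."""
    lo = bisect_right(bounds, (project_id, -1))       # below any real hash
    hi = bisect_right(bounds, (project_id, 2 ** 64))  # above any real hash
    return list(range(lo, hi + 1))


print(len(bounds) + 1)               # -> 12 chunks in total
print(chunks_for_project("proj-0"))  # -> [0, 1, 2, 3]
print(chunks_for_project("proj-1"))  # -> [3, 4, 5, 6, 7]
```

In this model a query filtered by project touches 4 or 5 of the 12 chunks, whereas with a plain hashed _id the same project’s documents would very likely be scattered across all of them. The hashed second component still keeps values within a project distinct, so a large project’s chunks remain splittable.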
A mini-decision tree for shard-key selection might look like this:
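One possible reading of the guidance in this article, written as a small Python helper. The function suggest_shard_key, its parameters, and the two workload labels are hypothetical; it is a sketch of the trade-offs discussed above, not an official selection procedure.

```python
def suggest_shard_key(has_suitable_field, workload):
    """Return a rough shard-key strategy.

    has_suitable_field: True if some existing field could serve as the key.
    workload: "query" for query-focused apps, "write" for write-heavy apps.
    """
    if not has_suitable_field:
        # No single field works: combine fields or add one for the purpose.
        return "use a compound shard key or add a dedicated field"
    if workload == "query":
        # Query isolation: route each query to one shard or a few shards.
        return "favor query isolation: key on the field your queries filter by"
    if workload == "write":
        # Write scaling: spread inserts evenly across all shards.
        return "favor write scaling: use a high-randomness key such as a hashed _id"
    raise ValueError(f"unknown workload: {workload!r}")


print(suggest_shard_key(True, "query"))
print(suggest_shard_key(True, "write"))
print(suggest_shard_key(False, "write"))
```

Real collections often need both properties at once, which is where a compound key that groups by one field and randomizes by another comes in.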
This and other aspects of optimizing MongoDB databases can be handled through a single dashboard via the Morpheus database-as-a-service (DBaaS). Morpheus lets you provision, deploy, and host heterogeneous MySQL, MongoDB, Redis, and Elasticsearch databases. It is the first and only DBaaS that supports SQL, NoSQL, and in-memory databases. Visit the Morpheus site to sign up for a free account!