How Is Google Analytics So Damn Fast?

By: Morpheus Data

Sep 2014

TL;DR: Google Analytics stores a massive amount of statistical data from websites across the globe. Retrieving reports quickly from that much data requires Google to use a custom solution that scales easily whenever more data needs to be stored.

At Google, any number of applications may need to be added to their infrastructure at any time, and each of these could potentially have extremely heavy workloads. Resource demands such as these can be difficult to meet, especially when there is a limited amount of time to get the required updates implemented.

If Google were to use a typical relational database on a single server node, they would need to upgrade their hardware each time capacity was reached. Given the number of applications being created and the amount of data being used by Google, this type of upgrade could quite possibly be necessary on a daily basis!

The load could also be shared across multiple server nodes, but once more than a few additional nodes are required, the system becomes extremely complex and difficult to maintain.

With these things in mind, a standard relational database setup would not be a particularly attractive option due to the difficulty of upgrading and maintaining the system on such a large scale.

Finding a Scalable Solution

To maintain speed without needing such frequent hardware upgrades, Google uses its own data storage solution called Bigtable. Rather than storing data relationally in tables, it stores data as a multi-dimensional sorted map.
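To make "multi-dimensional sorted map" a little more concrete, here is a minimal Python sketch. It is a toy model, not Google's implementation: the class name SortedMap, its methods, and the sample row and column names are all made up for illustration. It assumes each value is indexed by a (row, column, timestamp) key and that keys are kept in sorted order, so rows sharing a prefix sit next to each other.

```python
import bisect


class SortedMap:
    """Toy sorted map keyed by (row, column, timestamp) tuples (hypothetical)."""

    def __init__(self):
        self._keys = []    # every key, kept in sorted order
        self._values = {}  # key -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while preserving sort order
        self._values[key] = value

    def get(self, row, column, timestamp):
        return self._values.get((row, column, timestamp))

    def scan_row_prefix(self, prefix):
        """Yield (key, value) pairs whose row key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, (prefix,))
        for key in self._keys[start:]:
            if not key[0].startswith(prefix):
                break  # sorted order means no later key can match
            yield key, self._values[key]


table = SortedMap()
table.put("com.example/home", "stats:visits", 1, 10)
table.put("com.example/home", "stats:visits", 2, 12)
table.put("com.example/about", "stats:visits", 1, 3)
table.put("org.sample/index", "stats:visits", 1, 7)

# All entries for one site come back from a single contiguous scan.
example_rows = [key for key, _ in table.scan_row_prefix("com.example")]
```

Because the keys are sorted, reading everything for one site is a single range scan rather than a search across the whole data set, which is a large part of why this layout can be split across many machines.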

This type of implementation falls under a broader category of data storage called key/value stores. This method of storage can provide some performance benefits and make the process of scaling much easier.

Information Storage in a Relational Database

Relational databases store each piece of information in a single location, which is typically a column within a table. For a relational database, it is important to normalize the data, a process that removes duplicate copies of the same data from other tables or columns.

For example, customer last names should always be stored in a particular column in a particular table. If a customer last name is found in another column or table within the database, then it should be removed and the original column and table should be referenced to retrieve the information.

The downside to this structure is that the database can become quite complex internally. Even a relatively simple query can have a large number of possible execution paths, and the query optimizer has to weigh these paths at run time to find the most efficient one. The more complex the database becomes, the more resources must be devoted to determining query paths at run time.
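As a rough illustration of the normalized layout described above, the following sketch uses Python's built-in sqlite3 module. The table names, columns, and sample customers are invented for the example. The last name lives only in the customers table, so any report that needs it has to join back to that table, and the database has to plan how to run that join.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        last_name TEXT NOT NULL          -- the only place a last name is stored
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers (id, last_name) VALUES (1, 'Nguyen'), (2, 'Okafor');
    INSERT INTO orders (id, customer_id, total)
        VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

query = """
    SELECT orders.id, customers.last_name, orders.total
    FROM orders
    JOIN customers ON customers.id = orders.customer_id
    ORDER BY orders.id
"""

# The join looks up each order's last name in its single home table.
rows = conn.execute(query).fetchall()

# Ask SQLite which execution path it chose; the exact text varies by version.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
```

With two small tables the planning cost is negligible. The point of the paragraph above is that this cost grows as more tables, indexes, and joins enter the picture.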

Information Storage in a Key/Value Store

With a key/value store, duplicate data is acceptable. The idea is to make use of disk space, which can easily and cost-effectively be upgraded (especially in the cloud), rather than other hardware resources that are more expensive to bring up to speed.

This data duplication is beneficial when it comes to simplifying queries, since related information can be stored together to avoid having numerous potential paths that a query could take to access the needed data.
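A small sketch of that trade-off, using a plain Python dictionary as a stand-in for a key/value store. The key format, field names, and helper functions are hypothetical. Each order record carries its own copy of the customer's last name, so a read is one key lookup, while a change to the name has to touch every copy.

```python
# Each record duplicates the customer's last name so no join is needed to read it.
orders_by_key = {
    "order:10": {"customer_id": 1, "customer_last_name": "Nguyen", "total": 25.0},
    "order:11": {"customer_id": 1, "customer_last_name": "Nguyen", "total": 40.0},
    "order:12": {"customer_id": 2, "customer_last_name": "Okafor", "total": 15.5},
}


def get_order(order_id):
    """Fetch everything about an order with a single key lookup."""
    return orders_by_key[f"order:{order_id}"]


def rename_customer(customer_id, new_last_name):
    """The cost of duplication: every copy of the name must be rewritten."""
    updated = 0
    for record in orders_by_key.values():
        if record["customer_id"] == customer_id:
            record["customer_last_name"] = new_last_name
            updated += 1
    return updated
```

Calling get_order(10) returns the full record in one step. Calling rename_customer(1, "Tran") has to rewrite two records, which is the price paid for the simpler read path.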

Instead of using tables like a relational database, key/value stores use domains. A domain is a storage area where data can be placed, but one that does not require a predefined schema. Pieces of data within a domain are identified by keys, and these keys can have any number of attributes attached to them.

The attributes can simply be string values, but can also be something even more powerful: data types that match up with those of popular programming languages. These could include arrays, objects, integers, floats, Booleans, and other essential data types used in programming.
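Here is one way to picture a domain in code. This is a toy sketch, not the API of any particular product: the Domain class, its method names, and the sample keys are assumptions made for illustration. Notice that the two items carry different attributes and that the values are ordinary Python types.

```python
from collections import defaultdict


class Domain:
    """Toy schemaless storage area: each key maps to a free-form set of attributes."""

    def __init__(self, name):
        self.name = name
        self._items = defaultdict(dict)

    def put_attributes(self, key, **attributes):
        # No schema check: any attribute name and any value type is accepted.
        self._items[key].update(attributes)

    def get_attributes(self, key):
        # Return a copy so callers cannot modify stored data by accident.
        return dict(self._items.get(key, {}))


pages = Domain("pages")
pages.put_attributes("page:/home", title="Home", visits=1200, bounce_rate=0.42)
pages.put_attributes("page:/pricing", title="Pricing", tags=["sales", "plans"], published=True)

home = pages.get_attributes("page:/home")
pricing = pages.get_attributes("page:/pricing")
```

Nothing had to be declared up front for the pricing page to gain a tags list or a published flag, which is what "no predefined schema" means in practice.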

With key/value stores, data integrity and logic are handled by the application code (through the use of one or more APIs) rather than by a schema within the database itself. As a result, data retrieval becomes a matter of using the correct programming logic rather than relying on the database optimizer to determine the query path from a large number of possibilities based on the relation it needs to access.
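To show what "handled by the application code" can look like, here is a minimal sketch in Python. The field names, the validation rules, and the put_order helper are all hypothetical, and a plain dictionary stands in for the store. The checks a relational schema would normally enforce are written as ordinary code that runs before anything is saved.

```python
# Rules the application enforces, since the store itself has no schema.
REQUIRED_FIELDS = {"customer_id": int, "customer_last_name": str, "total": float}


def validate_order(record):
    """Return a list of problems found in the record (empty if it is valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} must be {expected_type.__name__}")
    if not errors and record["total"] < 0:
        errors.append("total must not be negative")
    return errors


def put_order(store, key, record):
    """Write the record only if it passes the application's own checks."""
    errors = validate_order(record)
    if errors:
        raise ValueError("; ".join(errors))
    store[key] = dict(record)  # store a copy, not the caller's object
```

The upside is flexibility, since the rules can change without a database migration. The downside is that every application writing to the store has to apply the same rules, because the store will not catch a mistake on its own.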

Data Access

How data access differs between a relational database and a key/value database. Source: readwrite

Getting Results

Google needs to store and retrieve copious amounts of data for many applications, including Google Analytics, Google Maps, Gmail, and their popular web index for searching. In addition, more applications and data stores could be added at any time, making their Bigtable key/value store an ideal solution for scalability.

Bigtable is Google’s own custom solution, so how can a business obtain a similar performance and scalability boost to give its users a better experience? The good news is that there are other key/value store options available, and some can be run as a service from a cloud. This type of service scales well, since more data storage can be purchased as needed on the cloud.

A Key/Value Store Option

There are several options for key/value stores. One of these is MongoDB (often shortened to Mongo), which is designed as a document database that stores information as JSON-like documents. This format is ideal for web applications, since JSON makes it easy to pass data around in a standard format among the various parts of an application that need it.

For example, Mongo is part of the MEAN stack (Mongo, Express, AngularJS, and Node.js), a popular setup for programmers developing applications. Each of these pieces of the puzzle will send data to and receive data from one or more of the other pieces. Since everything, including the database, can use the JSON format, passing the data around among the various parts becomes much easier and more standardized.
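The appeal of a shared format is easy to see in a few lines. The sketch below uses Python's standard json module rather than the JavaScript of the MEAN stack, and the record itself is made up, but the idea is the same in any language: one part of the application serializes a record to JSON text, and another part parses it back into native data without any custom translation layer.

```python
import json

# A record as one part of the application might hold it in memory.
page_stats = {
    "path": "/home",
    "visits": 1200,
    "bounce_rate": 0.42,
    "tags": ["landing", "promo"],
    "published": True,
}

# Serialize to JSON text for sending to another part of the application.
payload = json.dumps(page_stats)

# The receiving side parses the text back into native data types.
received = json.loads(payload)
```

After the round trip, received holds the same strings, numbers, list, and Boolean that page_stats started with.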

MySQL vs. MongoDB

How MySQL and Mongo perform the same tasks. Source: Rick Osborne

How to Make Use of Mongo

Mongo can be installed and used on various operating systems, including Windows, Linux, and OS X. In this case, the scalability of the database would need to be maintained by adding storage space to the server on which it is installed.

Another option is to use Mongo as a service on the cloud. This allows for easy scalability, since a request can be made to the service provider to increase the available storage space at any time. In this way, new applications or additional data storage needs can be handled quickly and efficiently.

Morpheus is a great option for this service, offering Mongo as a highly scalable service in the cloud. Users of Morpheus get three shared nodes and full replica sets, and can seamlessly provision MongoDB instances. In addition, all of this runs on a high-performance solid-state drive (SSD) infrastructure, making it a very reliable data storage medium. Using Morpheus, a highly scalable database as a service can be running in no time!