Merge Databases with Different Schema and Duplicate Entries

By: Morpheus Data

Removing duplicate entries from merged database tables can be anything but routine — and the source of performance woes.

TL;DR: Combining tables frequently results in duplicate entries that can be removed in several ways. The trick is knowing which way is best for a given situation. Often the only way to determine the best approach is by testing several and comparing their effect on database performance.

One of the most common operations in database management is merging two tables that use different schemas while also removing duplicate entries. Yet there are as many approaches to this problem as there are types of database tables, and just as many potential glitches.

Here’s a look at three ways to address the situation in SQL and MySQL.

All the news that’s fit to merge

Combining multiple tables with similar values often creates duplicate entries. Several methods are available for eliminating duplicate values in SQL, but it can be tricky to determine which is best in a given situation.

In a StackOverflow post from October 2012, a number of approaches were proposed for removing duplicates from joined tables. The first was to convert an inner query to a common table expression (CTE):

A common table expression for an inner join often has a lower impact on performance than using the DISTINCT keyword to eliminate duplicates. Source: StackOverflow
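A minimal sketch of that approach, using hypothetical customers and orders tables, might look like the following. The CTE reduces the orders table to distinct customer IDs before the join, so the outer query never produces duplicate rows; the DISTINCT version lets the join multiply rows and removes the duplicates afterward.

-- Deduplicate inside a CTE before joining:
WITH unique_orders AS (
    SELECT DISTINCT customer_id
    FROM   orders
)
SELECT c.customer_id, c.name
FROM   customers c
INNER JOIN unique_orders u ON u.customer_id = c.customer_id;

-- The DISTINCT alternative removes duplicates after the join has fanned out:
SELECT DISTINCT c.customer_id, c.name
FROM   customers c
INNER JOIN orders o ON o.customer_id = c.customer_id;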

The second approach was to use the DISTINCT keyword, which one poster claimed performs better in some cases. The string_agg function and the GROUP BY clause were also suggested.
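For comparison, here is a rough sketch of those alternatives against the same hypothetical customers and orders tables (string_agg is available in PostgreSQL and newer SQL Server versions; MySQL's counterpart is GROUP_CONCAT):

-- Collapse duplicate rows by grouping instead of using DISTINCT:
SELECT c.customer_id, c.name
FROM   customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

-- Or keep one row per customer and fold the joined values into a list:
SELECT c.customer_id,
       c.name,
       string_agg(CAST(o.order_id AS varchar), ', ') AS order_ids
FROM   customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;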

Getting up close and personal with the UNION clause

One of the basic elements in the SQL toolbox is the UNION operator, which combines the results of two queries and checks for duplicates, so the merged result holds the data from both tables with only distinct rows:

Insert rows from a second table when their values don’t match those of the joined table, or create a new table that doesn’t affect either of the original tables. Source: StackOverflow
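As a rough sketch, assuming two hypothetical tables, customers_2013 and legacy_customers, that store the same data under different column names:

-- Copy rows from the second table only when they are not already present:
INSERT INTO customers_2013 (customer_id, name, email)
SELECT l.cust_id, l.cust_name, l.cust_email
FROM   legacy_customers l
WHERE  NOT EXISTS (
         SELECT 1
         FROM   customers_2013 c
         WHERE  c.customer_id = l.cust_id
       );

-- Or combine both tables into one duplicate-free result without touching either source:
SELECT customer_id, name, email FROM customers_2013
UNION
SELECT cust_id, cust_name, cust_email FROM legacy_customers;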

Alternatively, you can use the SELECT INTO command to create a new table from the contents of two separate tables in a way that removes duplicates:

The SELECT INTO command creates a new table from the content of two others, removing any duplicates, while leaving the original tables unchanged. Source: StackOverflow
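A sketch of that idea, reusing the same hypothetical tables (SELECT INTO is SQL Server syntax; MySQL uses CREATE TABLE ... AS SELECT instead):

-- SQL Server: materialize the de-duplicated union into a brand-new table.
SELECT customer_id, name, email
INTO   merged_customers
FROM   (
         SELECT customer_id, name, email FROM customers_2013
         UNION
         SELECT cust_id, cust_name, cust_email FROM legacy_customers
       ) AS combined;

-- MySQL equivalent:
CREATE TABLE merged_customers AS
SELECT customer_id, name, email FROM customers_2013
UNION
SELECT cust_id, cust_name, cust_email FROM legacy_customers;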

Combining multiple gigabyte-size tables without a performance hit

It isn’t unusual for database tables to become massive. Imagine merging a dozen tables holding a total of nearly 10 million records and more than 3GB of data. The first suggestion on StackOverflow was to create a new table with a unique constraint on the set of columns that establish a row’s uniqueness, then to use INSERT IGNORE INTO … SELECT FROM to move rows from the old tables into the new one, and finally to truncate the old tables and use INSERT INTO … SELECT FROM to return the de-duplicated rows to the original table.
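A sketch of that sequence in MySQL, with hypothetical table names (big_2012, big_2013, and so on) and a hypothetical column list standing in for the real schema:

-- 1. New table with a unique key on the columns that define a duplicate.
CREATE TABLE big_merged LIKE big_2012;
ALTER TABLE big_merged
  ADD UNIQUE KEY uq_row (account_id, recorded_at);

-- 2. Move the rows; INSERT IGNORE silently skips any row that would
--    violate the unique key, which is what removes the duplicates.
INSERT IGNORE INTO big_merged (account_id, recorded_at, amount)
  SELECT account_id, recorded_at, amount FROM big_2012;
INSERT IGNORE INTO big_merged (account_id, recorded_at, amount)
  SELECT account_id, recorded_at, amount FROM big_2013;
-- ...and so on for the remaining source tables.

-- 3. Empty the original table and copy the de-duplicated rows back.
TRUNCATE TABLE big_2012;
INSERT INTO big_2012 (account_id, recorded_at, amount)
  SELECT account_id, recorded_at, amount FROM big_merged;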

Another proposed solution was to create a view that combines the results of the 12 tables, and then to filter the results by querying the view rather than the underlying tables.
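A sketch of that view-based alternative, again with hypothetical names; UNION (rather than UNION ALL) is what removes duplicate rows across the source tables:

CREATE VIEW all_records AS
      SELECT account_id, recorded_at, amount FROM big_2012
UNION SELECT account_id, recorded_at, amount FROM big_2013
-- ...UNION the remaining tables the same way...
UNION SELECT account_id, recorded_at, amount FROM big_2014;

-- Queries then filter the combined, de-duplicated data through the view:
SELECT account_id, SUM(amount)
FROM   all_records
WHERE  recorded_at >= '2014-01-01'
GROUP BY account_id;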

The Morpheus database-as-a-service (DBaaS) makes analyzing and optimizing databases much more efficient. Morpheus lets you provision, host, and deploy MySQL, MongoDB, Redis, and ElasticSearch databases via a single dashboard. It is the only DBaaS to support SQL, NoSQL, and in-memory databases.

In addition to automatic daily backups, each database instance is deployed with a full replica set for failover and fault tolerance. Morpheus’s solid-state disk infrastructure, direct patches into EC2, and colocation with fast peering points ensure peak database performance.

Visit the Morpheus site for pricing information and to create a free account. Few things in the database universe are as straightforward as Morpheus’s DBaaS.