Synchronize a large MySQL table when mysqldump doesn't (really) work

Micha(el) Bladowski
6 min read · Apr 23, 2020
https://www.pexels.com/de-de/@olly

And before you ask, “what is large?”, well, let's say 200k entries inside an InnoDB table. Don't laugh, for many of you this is nothing and a well-configured MySQL server should handle that much data while sleeping, but we don't live in a perfect world. And I am happy with that, otherwise there wouldn't be any reason to become creative 🤓

So, let's start:

I am pretty sure every DevOps engineer has been in this situation, where he or she had to sync a very large table from MySQL server A to MySQL server B.

We can all agree that, usually, this is something very simple:

mysqldump my_database my_large_table | mysql -h other-host other_database

Did I try it this way? Hell, yes, of course! 😂
Did it work? Hell, no, of course! 😂

So here's why everything went a little bit awkward:

  • A piece of software on the other host is reading from that table (the one I have to sync) like crazy, so re-creating the whole table is not an option
  • Usually, I monitor these re-created tables via CREATE_TIME from information_schema.TABLES (not possible here)

So after recognizing that I am not able to do it the regular way, I had to find a way to keep the table alive (without re-creating it) and only add entries that are missing and delete entries that no longer exist. What do you think? EASY! 😁 So let's create a little picture for you that makes it easier to follow my path:

Here's the table that needs to be up to date:

the table on the other host that needs to be up 2 date

You might ask: What's the primary key of this table?
Funny: there is none!

And even better: art_num, oxartnum or arz_id may be the same, while her_id, mod_id and mer_id might have changed.

At this point you might mention that the whole table should look completely different, yeah… but let's keep it that way, otherwise I wouldn't be able to show you some interesting stuff 😉

So let's get back to our sync. I dump the whole table from the source host (A) to the target host (B) as table_source, while the table table_target already exists on host B.

Keep in mind, both tables have exactly the same structure.

Let's add items that are missing:

INSERT INTO table_target
SELECT
s.*
FROM table_source s
LEFT OUTER JOIN table_target t USING (arz_id)
WHERE t.arz_id IS NULL;

Let's delete items that no longer exist:

DELETE t.* FROM table_target t
LEFT OUTER JOIN table_source s USING (arz_id)
WHERE s.arz_id IS NULL;

This looks nice and easy and would work IF there were a primary key on arz_id, but there isn't. I wanted to show these statements anyway because I really like JOINs. Sadly, this does not work, because as mentioned before: arz_id may stay the same, while other IDs might change.
In the worst case, I would add missing and delete old entries, but I wouldn't catch changed entries.

Okay, we need a unique identifier for each row!

Photo by Yogi Purnama on Unsplash

MD5 to the rescue!

So when creating the table on the source host, let's add a compare_hash column to it,
which we can use later to find items that have been added, deleted or changed.

So we change this:

CREATE TABLE table_source 
SELECT a.*
FROM view_source a;

to

CREATE TABLE table_source
SELECT
a.*,
MD5(CONCAT(a.art_num,a.oxartnum,a.arz_id,a.her_id,a.mod_id,a.mer_id)) AS compare_hash
FROM view_source a;
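
A small side note (not part of the original statement, just a general MySQL detail to watch out for): CONCAT() returns NULL as soon as one of its arguments is NULL, which would leave those rows with a NULL compare_hash. If any of these ID columns can be NULL, CONCAT_WS() is the safer choice, because it skips NULL values, and the separator also prevents values like ('12','3') and ('1','23') from ending up as the same string:

MD5(CONCAT_WS('|',a.art_num,a.oxartnum,a.arz_id,a.her_id,a.mod_id,a.mer_id)) AS compare_hash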

Can't you add the MD5 to the view?
Sure, the result would be the same, but let's pretend you are not able to change any views.

So, we have added a unique hash that represents a whole row with its individual IDs. The table_source now looks like this:

table on source host with a unique md5 for each row: compare_hash (last column)

Don't forget to add an index on this new column 💡

ALTER TABLE table_source
ADD INDEX IDX_table_source_hash (compare_hash);

We now mysqldump table_source from host A to our host B, while our table_target (on host B) still exists without compare_hash. So we add this column (and the index!) there, too. Easy.
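
In case you wonder what that part looks like, here is a minimal sketch (assuming you are allowed to run an ALTER and an UPDATE on the live table; the column list is the one from this example):

ALTER TABLE table_target
ADD COLUMN compare_hash CHAR(32) NULL,
ADD INDEX IDX_table_target_hash (compare_hash);

UPDATE table_target
SET compare_hash = MD5(CONCAT(art_num,oxartnum,arz_id,her_id,mod_id,mer_id));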

So let's remember our INSERT and DELETE JOINs from earlier and change these statements to make use of our new compare_hash:

INSERT INTO table_target
SELECT
s.*
FROM table_source s
LEFT OUTER JOIN table_target t USING (compare_hash)
WHERE t.compare_hash IS NULL;
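
And, for completeness (only the INSERT is shown above, but the DELETE follows the same pattern as before, just joined on the hash):

DELETE t.* FROM table_target t
LEFT OUTER JOIN table_source s USING (compare_hash)
WHERE s.compare_hash IS NULL;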

Should this work? Yes 😄
Did it work for me? NO 🙈

Why? Really? I don't know. After waiting for more than 30 minutes I decided this takes way too long; even though it should work, I will do it in a different way.

I don't have control over the target host, maybe the MySQL settings are not well tuned, maybe there is no join buffer, maybe, maybe, maybe… no time for
finding problems on a host I don't have access to. Even if that SQL took 5 minutes, that would be too long. However….

Photo by bruce mars on Unsplash

Here's my script that does the job in a few seconds:

https://gist.github.com/michabbb/84f79e290fe4b628754499689daaa0fc

The key to success “in this situation” for me has been:

  • create temporary tables in memory
  • compare only the hash (reduce data)
  • join only on relevant data (the hash)

So, shortly explained, here is what the script does (a simplified SQL sketch follows the list):

  • we create a MEMORY table with all hashes of our table_source
  • we create a MEMORY table with all hashes of our table_target
  • we compare both tables to find hashes we have to add → table to_add
  • we compare both tables to find hashes we have to delete → table to_delete
  • finally, we are able to do the regular INSERT INTO … JOIN and DELETE … JOIN we wanted to do in the first place.
    The difference: we join our MEMORY tables, which is super fast
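
If you don't want to open the gist right away, here is a simplified sketch of that idea (the table names hashes_source, hashes_target, to_add and to_delete are just illustrative, the real thing is in the gist above):

CREATE TABLE hashes_source ENGINE=MEMORY
SELECT compare_hash FROM table_source;
ALTER TABLE hashes_source ADD INDEX (compare_hash);

CREATE TABLE hashes_target ENGINE=MEMORY
SELECT compare_hash FROM table_target;
ALTER TABLE hashes_target ADD INDEX (compare_hash);

-- hashes that only exist on the source: rows we have to add
CREATE TABLE to_add ENGINE=MEMORY
SELECT s.compare_hash
FROM hashes_source s
LEFT OUTER JOIN hashes_target t USING (compare_hash)
WHERE t.compare_hash IS NULL;

-- hashes that only exist on the target: rows we have to delete
CREATE TABLE to_delete ENGINE=MEMORY
SELECT t.compare_hash
FROM hashes_target t
LEFT OUTER JOIN hashes_source s USING (compare_hash)
WHERE s.compare_hash IS NULL;

-- the final INSERT and DELETE only join against the small MEMORY tables
INSERT INTO table_target
SELECT s.*
FROM table_source s
JOIN to_add a USING (compare_hash);

DELETE t.* FROM table_target t
JOIN to_delete d USING (compare_hash);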

Someone might mention at this point: “large tables” in MEMORY are not a nice combination, by definition 😉 I say: it depends. If your host has enough RAM, you should take advantage of MEMORY tables. I cannot remember how often this engine helped me to speed up complex queries.
Btw, in your my.cnf there is a setting called max_heap_table_size
(docu), check its value and compare it to your free amount of RAM; maybe you can give the MEMORY engine more RAM to work with.
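
For example, to check the current limit and raise it for the current session only (the 1 GB value is just an example, adjust it to the RAM you actually have to spare):

SHOW VARIABLES LIKE 'max_heap_table_size';
SET SESSION max_heap_table_size = 1024 * 1024 * 1024; -- 1 GB, only for this session

If a MEMORY table hits this limit, MySQL aborts with a “table is full” error, so it is worth checking before you load a few hundred thousand hashes into it.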

Photo by Joshua Earle on Unsplash

Anything else?

Yes. I should mention that when searching for tools to solve this “table sync problem”, you might stumble over pt-table-sync by Percona.
This tool uses hashes, too, of course. 😉

Why didn't I use it? I don't know if that tool would do the job here, I never tested it for this situation. I just wanted a native solution, to try things out and maybe learn something. If you always go the easy way, there is no space for developing your own skills 🤓

To give you an idea of how pt-table-sync works, here's an example call (that has nothing to do with my story):

pt-table-sync --print \
  --ignore-tables=table1,table2,table3 \
  --databases=mydb1,mydb2,mydb3 \
  --nocheck-triggers \
  h=mysql_source,P=3306,u=root,p=mypwd \
  h=mysql_target,P=3306,u=root,p=mypwd > sql_sync_statements.sql

You can compare whole databases. sql_sync_statements.sql will show you all SQL queries that need to be made to keep the tables (or databases) in sync.
You can run them directly or, as shown here, only print them to a file to manually check what will happen.

Because I love Docker, I should also mention that it's easy to run the MySQL client or pt-table-sync with Docker, so you are not forced to install any packages on your host and risk getting into trouble with version conflicts.
Here's an old Dockerfile I used in the past that gives you an idea.
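
For a quick impression, without any custom Dockerfile at all, something like this already works (using the public mysql and percona/percona-toolkit images, not my old setup):

docker run --rm -it mysql:8.0 mysql -h other-host -u myuser -p other_database
docker run --rm percona/percona-toolkit pt-table-sync --help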

And, if you have to deal with millions of records, give crate.io a chance.
Simply said: it's an Elasticsearch cluster with SQL-like syntax, you don't have to learn the ES DSL, and Crate's SQL is similar to MySQL query syntax, which enables you to get started quickly. In one of my projects, I am running a cluster with 12.5 million records and queries are incredibly fast. 🚀

echo "Greetings from bash! 😉"
