Monday, January 23, 2012

Replication stopped with duplicate entry upon starting up of MySQL from cold backup

A few days ago, we faced a Duplicate entry problem in the mysqld server on one of our customer’s slave after cold backup. We do regular data sync checks for all of our customer's as daily reports and check for data discrepancies between master and slave server's, if any got picked up by our checks to investigate, and there was no issues for the particular server and with the report. When checked the entry details, there is a row with the same data in the table that was mentioned by Slave status Last_error entity. Quick check on the error log showed that the slave was started after cold back's and immediately stopped with a duplicate entry error and the mysqld server version is 5.0.77-log Source distribution. From the analysis of the below statements, we found that the slave SQL thread was stopped with the error at a position backwards than the SQL thread initialized after cold backup's.

[perl]
[pythian@dol5 ~]$ tail -20 /var/log/mysql/dol5-error.log
111221 7:42:33 [Note] Slave I/O thread: connected to master 'sslrepl@dol50.dollive.com:3306', replication started in log 'dol50-binlog.000763' at position 533765390
111222 7:05:55 [Note] /usr/libexec/mysqld: Normal shutdown
111222 7:05:55 [Note] Slave SQL thread exiting, replication stopped in log 'dol50-binlog.000767' at position 357198241
111222 7:05:55 [Note] Slave I/O thread killed while reading event
111222 7:05:55 [Note] Slave I/O thread exiting, read up to log 'dol50-binlog.000767', position 357511554 111222 7:05:55 InnoDB: Starting shutdown... 111222 7:06:01 InnoDB: Shutdown completed; log sequence number 372 128185461
111222 7:06:01 [Note] /usr/libexec/mysqld: Shutdown complete
111222 07:06:02 mysqld ended 111222 07:42:09 mysqld started
111222 7:42:18 InnoDB: Started; log sequence number 372 128185461
111222 7:42:20 [Note] /usr/libexec/mysqld: ready for connections. Version: '5.0.77-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
111222 7:42:20 [Note] Slave SQL thread initialized, starting replication in log 'dol50-binlog.000767' at position 357198241, relay log '/var/lib/mysql/dol50-relay.000066' position: 357198386
111222 7:42:20 [Note] Slave I/O thread: connected to master 'sslrepl@dol50.dollive.com:3306', replication started in log 'dol50-binlog.000767' at position 357511554
111222 7:42:21 [ERROR] Slave: Error 'Duplicate entry '94335' for key 1' on query. Default database: 'live'. Query: 'insert into charges (amount, key, id) values (0.37, 'dlive', 23892)', Error_code: 1062
111222 7:42:21 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'dol50-binlog.000767' position 357198241

The slave status output on the server that which error occurred is as below.

mysql> show slave status\G
*************************** 1. row ***************************
...... ...........
Last_Errno: 1062
Last_Error: Error 'Duplicate entry '94335' for key 1' on query. Default database: 'live'. Query: 'insert into charges (amount, key, id) values (0.37, 'dlive', 23892)'
...... ..........
1 row in set (0.00 sec)
[/perl]

     Another quick check of the transaction details in master binary logs revealed that the transaction involved is the mixture of storage engines InnoDB and MyISAM. At this point, we have understood that the Duplicate entry error was due to the usage of different storage engines for tables involved in a transaction but how the rollback was performed is still a Question and it was not from application used by Customer. Here is the simple example below that shows the rows inserted into MyISAM table couldn't rolled back though we rollback the transaction.

[perl]
* Creating a table with MyISAM engine.
mysql> CREATE TABLE myisam_tbl ( item int, val int ) engine=MyISAM ;
Query OK, 0 rows affected (0.06 sec)

* Creating a table with Innodb engine.

mysql> CREATE TABLE innodb_tbl ( item int, val int ) engine=Innodb ;
Query OK, 0 rows affected (0.02 sec)

* Starting transaction, inserting rows into both tables and issues rollback.

mysql> START TRANSACTION ;
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO myisam_tbl (item, val) VALUES (1, 1), (2, 2);
Query OK, 2 rows affected (0.02 sec) Records: 2 Duplicates: 0 Warnings: 0

mysql> INSERT INTO innodb_tbl (item, val) VALUES (1, 1), (2, 2);
Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0

mysql> ROLLBACK ;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show warnings ; +---------+------+---------------------------------------------------------------+ | Level | Code | Message | +---------+------+---------------------------------------------------------------+ | Warning | 1196 | Some non-transactional changed tables couldn't be rolled back | +---------+------+---------------------------------------------------------------+ 1 row in set (0.00 sec)

* This shows the rows inserted into MyISAM engine table exist and commited.

mysql> select * from myisam_tbl ;
+------+------+ | item | val | +------+------+
| 1 | 1 | | 2 | 2 |
+------+------+
2 rows in set (0.00 sec)

* This show the rows inserted into Innodb engine table were not commit-ed. mysql> select * from innodb_tbl ; Empty set (0.00 sec) [/perl]


    We did a check and collected the MyISAM and Innodb tables involved SQL statements in the transaction from binary logs at the position where the slave was stopped with an error. We have deleted all the inserted rows of the MyISAM engine involved in the tables of the transaction manually all at once on the slave server ( which caused the duplicate entry errors ) and started slave, it worked fine and no other error's found after it. If we delete one of the inserted rows of MyISAM involved tables one at a time and start the slave, then the replication thread will start executing the whole transaction again from the starting position where replay of transaction inserts the previously deleted entry and will stop at the second entry and goes on, it continues in round-robin fashion and the transaction never completes as it executes from the start position of the transaction again when we issue START SLAVE unless we skip the error which skips the complete transaction.

        After a detailed investigation of the incident by another DBA, we found that the cause was determined to be a transaction involving both MyISAM and InnoDB tables, which was rolled backed during MySQL shutdown and was applied again after start-up (this is a normal MySQL behavior). More details on the MySQL server shutdown process can be found here and see point 4 for rollback of SQL thread during shutdown if it is in the middle of transaction. The STOP SLAVE transaction behavior has been fixed in the version 5.1.35 as per change log to avoid the problems encountered in replication for transactional engine stated here, but it was impossible to roll back a mixed-engines transaction when one of the engines was non-transactional, which meant that the slave could not safely re-apply any transaction that had been interrupted by STOP SLAVE. While changes to InnoDB table were rolled back and then applied normally, changes to MyISAM table couldn't be rolled-back and were applied twice, as MyISAM is non-transactional engine where every single SQL statement will be committed right after the execution.

We made some recommendations to the Customer as below in which the option 3 was chosen as the slave was dedicated for backup's and a quick solution applicable for their data set and timeline for resolution of the issue.

1) Change queries or table types to ensure that transactional and non-transactional tables are not mixed in any transaction
2) Upgrade to MySQL 5.1.35 or later in which the stop slave behavior has been changed to avoid this particular issue ( transaction rollback during MySQL shutdown ).
3) Consider moving to hot backups which does not require MySQL to be shutdown and always ends at a completed transaction.