Monday, October 5, 2015

Multiple ways to find duplicate rows and delete them safely from a large number of tables

Here is the scenario: a bug in our app introduced duplicate rows into a table that has no unique key. How do we find and safely delete the duplicate rows, while keeping a copy of the rows we delete?

Preparing a test setup for the scenario: create the table, insert some duplicates as below, and select the rows...


CREATE TABLE `test` (
  `bid` int(11) NOT NULL AUTO_INCREMENT,
  `addtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `refid` varchar(255) NOT NULL,
  PRIMARY KEY (`bid`),
  KEY `addtime` (`addtime`),
  KEY `refid` (`refid`)
) ENGINE=InnoDB AUTO_INCREMENT=128 DEFAULT CHARSET=utf8

MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

MariaDB [test]> insert into test(refid) values ('suresh'),('kumar'),('kuna') ;
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

MariaDB [test]> select * from test ;
+-----+---------------------+--------+
| bid | addtime             | refid  |
+-----+---------------------+--------+
|  84 | 2015-10-05 14:31:36 | test   |
|  85 | 2015-10-05 14:31:36 | test1  |
| 113 | 2015-10-05 14:31:50 | suresh |
| 114 | 2015-10-05 14:31:50 | kumar  |
| 115 | 2015-10-05 14:31:50 | kuna   |
| 116 | 2015-10-05 15:35:39 | suresh |
| 117 | 2015-10-05 15:35:39 | kumar  |
| 118 | 2015-10-05 15:35:39 | kuna   |
| 119 | 2015-10-05 15:35:40 | suresh |
| 120 | 2015-10-05 15:35:40 | kumar  |
| 121 | 2015-10-05 15:35:40 | kuna   |
| 122 | 2015-10-05 15:35:40 | suresh |
| 123 | 2015-10-05 15:35:40 | kumar  |
| 124 | 2015-10-05 15:35:40 | kuna   |
| 125 | 2015-10-05 15:35:41 | suresh |
| 126 | 2015-10-05 15:35:41 | kumar  |
| 127 | 2015-10-05 15:35:41 | kuna   |
+-----+---------------------+--------+
17 rows in set (0.00 sec)

Now we have a table called test containing duplicate rows.

We can delete the duplicates in several ways. The easiest, when there is only one table with a few rows to delete, is a simple SELECT with a WHERE condition as below, deleting via a subquery. Run the DELETE statement in a loop until the HAVING condition returns zero rows.



MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid  | count(1) |
+--------+----------+
| kumar  |        5 |
| kuna   |        5 |
| suresh |        5 |
+--------+----------+
3 rows in set (0.00 sec)


MariaDB [test]>  delete from test where refid in (select * from (select refid  from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) a)  limit 1 ;
Query OK, 1 row affected, 2 warnings (0.00 sec)

MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid  | count(1) |
+--------+----------+
| kumar  |        5 |
| kuna   |        5 |
| suresh |        4 |
+--------+----------+
3 rows in set (0.00 sec)

MariaDB [test]> delete from test where refid in (select * from (select refid  from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) a)  limit 1 ;
Query OK, 1 row affected, 2 warnings (0.00 sec)

MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid  | count(1) |
+--------+----------+
| kumar  |        4 |
| kuna   |        5 |
| suresh |        4 |
+--------+----------+
3 rows in set (0.00 sec)

MariaDB [test]> delete from test where refid in (select * from (select refid  from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1 ) a)  limit 1 ;
Query OK, 1 row affected, 2 warnings (0.01 sec)

MariaDB [test]> select refid, count(1) from test where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1;
+--------+----------+
| refid  | count(1) |
+--------+----------+
| kumar  |        4 |
| kuna   |        4 |
| suresh |        4 |
+--------+----------+
3 rows in set (0.00 sec)
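Deleting one row per pass works, but if every extra copy should go in a single statement, a self-join DELETE that keeps the lowest bid per refid is a common alternative. This is a sketch against the example table above, not part of the original session; test it on a copy first.

```sql
-- Keep the row with the smallest bid for each refid and delete the rest.
-- Multi-table DELETE syntax (MySQL/MariaDB): rows are removed from alias t2 only.
DELETE t2
FROM test t1
INNER JOIN test t2
  ON t1.refid = t2.refid
 AND t1.bid   < t2.bid;
```

Because the join matches every pair where a row with a smaller bid and the same refid exists, all but the oldest copy of each refid are deleted in one statement.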

Now, the interesting part. Suppose we have thousands of tables across several databases and have to delete all the duplicates with as little effort as above, or less.

For that, we can use the Percona tool "pt-archiver". Download the tool on the server hosting the databases and tables that have duplicates, and make the file executable.

Create a new table with the same structure as the table that has duplicates, to store the deleted rows (in case we need the data later for comparison or review). Then alter the new table to drop the AUTO_INCREMENT attribute and the primary key, since we will be inserting duplicates from multiple tables.

Finally, use pt-archiver to copy each duplicate row into the separate table while deleting it from the source tables, then verify the deleted data in the test_duplicates table.

NOTE : Please run pt-archiver with --dry-run first to check the statements it will execute.
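One way to preview what will run is to print the per-table command first. This is a hypothetical sketch (the hard-coded TABLES list and echoed flags stand in for the real loop shown further below, where the list comes from `show tables like 'test%'`):

```shell
# Sketch: print the pt-archiver invocation for each table before executing anything.
# TABLES would normally come from:
#   mysql -u root -pxxxx test -s -N -e "show tables like 'test%';"
TABLES="test test2"
for t in $TABLES; do
  echo pt-archiver --dry-run \
    "--source h=localhost,D=test,t=$t,u=root,p=xxxx" \
    "--dest D=test,t=test_duplicates" \
    --limit=1 --commit-each
done
```

Once the printed statements look right, drop the echo and --dry-run and let the tool archive for real.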







wget percona.com/get/pt-archiver
chmod +x pt-archiver


create table test_duplicates like test ;
alter table test_duplicates modify bid int(11) NOT NULL ;


alter table test_duplicates drop primary key ;

* Inserted a bunch of rows into the table again before executing pt-archiver.

for i in `mysql -u root -pxxxx test -s -N -e "show tables like 'test%';"` ; do
  /root/pt-archiver \
    --source h=localhost,D=test,t=$i,u=root,p=xxxx \
    --dest D=test,t=test_duplicates \
    --where "refid in (select max(refid) from $i where addtime between '2015-09-23 00:00:00' and '2015-10-30 00:00:00' group by refid HAVING count(*) > 1)" \
    --purge --no-check-columns --no-check-charset --limit=1 --commit-each --statistics --progress=1
done

TIME                ELAPSED   COUNT
2015-10-05T14:35:35       0       0
2015-10-05T14:35:35       0       1
2015-10-05T14:35:35       0       2
2015-10-05T14:35:35       0       3
2015-10-05T14:35:35       0       4
2015-10-05T14:35:35       0       5
2015-10-05T14:35:35       0       6
2015-10-05T14:35:35       0       7
2015-10-05T14:35:35       0       8
2015-10-05T14:35:35       0       9
2015-10-05T14:35:35       0      10
2015-10-05T14:35:35       0      11
2015-10-05T14:35:35       0      12
2015-10-05T14:35:35       0      13
2015-10-05T14:35:35       0      14
2015-10-05T14:35:35       0      15
2015-10-05T14:35:35       0      16
2015-10-05T14:35:35       0      17
2015-10-05T14:35:35       0      18
2015-10-05T14:35:35       0      19
2015-10-05T14:35:35       0      20
2015-10-05T14:35:35       0      21
2015-10-05T14:35:35       0      22
2015-10-05T14:35:35       0      23
2015-10-05T14:35:35       0      24
2015-10-05T14:35:35       0      25
2015-10-05T14:35:35       0      26
2015-10-05T14:35:35       0      27
2015-10-05T14:35:35       0      28
2015-10-05T14:35:35       0      29
2015-10-05T14:35:35       0      30
2015-10-05T14:35:35       0      31
2015-10-05T14:35:35       0      32
2015-10-05T14:35:35       0      32
Started at 2015-10-05T14:35:35, ended at 2015-10-05T14:35:35
Source: D=test,h=localhost,p=...,t=test,u=root
Dest:   D=test,h=localhost,p=...,t=test5,u=root
SELECT 32
INSERT 32
DELETE 32
Action         Count       Time        Pct
select            33     0.0150      37.79
commit            66     0.0075      18.95
deleting          32     0.0058      14.56
inserting         32     0.0057      14.32
other              0     0.0057      14.37


MariaDB [test]> select * from test_duplicates limit 32 ;
+-----+---------------------+--------+
| bid | addtime             | refid  |
+-----+---------------------+--------+
|   1 | 2015-10-05 12:13:25 | suresh |
|   2 | 2015-10-05 12:13:25 | kumar  |
|   3 | 2015-10-05 12:13:25 | kuna   |
|   4 | 2015-10-05 12:13:28 | suresh |
|   5 | 2015-10-05 12:13:28 | kumar  |
|   6 | 2015-10-05 12:13:28 | kuna   |
|   7 | 2015-10-05 12:13:29 | suresh |
|   8 | 2015-10-05 12:13:29 | kumar  |
|   9 | 2015-10-05 12:13:29 | kuna   |
|  10 | 2015-10-05 12:13:29 | suresh |
|  11 | 2015-10-05 12:13:29 | kumar  |
|  12 | 2015-10-05 12:13:29 | kuna   |
|  13 | 2015-10-05 12:13:30 | suresh |
|  14 | 2015-10-05 12:13:30 | kumar  |
|  15 | 2015-10-05 12:13:30 | kuna   |
|  16 | 2015-10-05 12:13:31 | suresh |
|  17 | 2015-10-05 12:13:31 | kumar  |
|  18 | 2015-10-05 12:13:31 | kuna   |
|  19 | 2015-10-05 12:13:31 | suresh |
|  20 | 2015-10-05 12:13:31 | kumar  |
|  21 | 2015-10-05 12:13:31 | kuna   |
|  22 | 2015-10-05 12:13:32 | suresh |
|  23 | 2015-10-05 12:13:32 | kumar  |
|  24 | 2015-10-05 12:13:32 | kuna   |
|  25 | 2015-10-05 12:13:32 | suresh |
|  26 | 2015-10-05 12:13:32 | kumar  |
|  27 | 2015-10-05 12:13:32 | kuna   |
|  28 | 2015-10-05 12:13:33 | suresh |
|  29 | 2015-10-05 12:13:33 | kumar  |
|  30 | 2015-10-05 12:13:33 | kuna   |
|  31 | 2015-10-05 12:13:33 | suresh |
|  32 | 2015-10-05 12:13:33 | kumar  |
+-----+---------------------+--------+
32 rows in set (0.00 sec)

MariaDB [test]> select * from test;
+-----+---------------------+--------+
| bid | addtime             | refid  |
+-----+---------------------+--------+
|  33 | 2015-10-05 XX:XX:XX | suresh |
|  34 | 2015-10-05 XX:XX:XX | kumar  |
|  35 | 2015-10-05 XX:XX:XX | kuna   |
+-----+---------------------+--------+
3 rows in set (0.00 sec)


Now we have all the duplicates deleted from all the tables. The process takes time proportional to your number of tables and duplicates, and it can take a really long time...

Due to the time constraint, and with some options available in hand, I started looking for an alternative: select the duplicates from all the tables, insert them into a second table along with the table name, and then do a delete with a join.

Create a table with an additional tablename column, without primary or unique keys.

create table test5 (
  `bid` int(11) NOT NULL,
  `addtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `refid` varchar(255) NOT NULL,
  `tablename` varchar(255) NOT NULL,
  KEY `addtime` (`addtime`),
  KEY `refid` (`refid`)
) ENGINE=InnoDB ;

For a bunch of tables with duplicate rows, let's use a simple for loop to select and insert the duplicates into the second table, then delete with a join.

for i in `mysql -u root -pxxxx test -s -N -e "show tables like 'test%';"` ; do
  mysql -u root -pxxxx test -e "insert into test1.test5(bid,addtime,refid,tablename) select bid,addtime,refid,'$i' from $i where addtime between '2015-09-23 00:00:00' and '2015-10-11 00:00:00' group by refid HAVING count(*) > 1 ; select sleep(1) ; delete $i from $i inner join test1.test5 b on $i.refid=b.refid and $i.addtime=b.addtime and $i.bid=b.bid where 1=1 ;"
done

This one worked pretty well, and we keep a backup of the deleted rows. Go ahead and try it...
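Whichever approach you use, a quick check with the same GROUP BY/HAVING predicate as earlier should come back empty once deduplication is complete. A sketch against the example table:

```sql
-- Should return no rows after the duplicates have been removed.
SELECT refid, COUNT(*) AS cnt
FROM test
GROUP BY refid
HAVING COUNT(*) > 1;
```

Running this per table (for example inside the same shell loop over `show tables like 'test%'`) confirms nothing was missed.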