Monday, June 20, 2011

Investigating Sort aborted issues

A few days ago, I and few of my fellow members had a discussion about the "Sort aborted" error in the error log should be monitored or it can be ignored because it doesn't bring any value by alerting the oncall DBA who checks the machine and mysqld server status when the error appears.

We have identified the possible or known causes of this error were

1) It could be generated when
a) users cancelling or killing queries in the middle of a filesort
b) web servers rebooting while some queries were sorting i.e when the connection to the database drops suddenly
2) Another is lack of sufficient temp disk space in the tmpdir prevented tmpfile from being created for queries to execute.

As a DBA, we can't do much with the first point apart from informing customer to check at the application side for connection drop outs.
The answer for the second point is to change tmpdir in the mysql to point to a directory (a new one, just for this purpose) in the biggest file system.

3) a transaction got rolled back or aborted due to lock wait timeout or deadlock.

In a database, a deadlock is a situation in which two or more transactions are waiting for one another to give up locks.
Even if a transaction is not involved in a deadlock, it might have to wait a considerable amount of time to obtain a lock because of a long-running transaction or transactions holding locks on the tables it needs. In such a situation, you might not want a transaction to wait indefinitely. Instead, you might want the waiting transaction to abort, or time out, after a reasonable amount of time, called a lock wait timeout.

In such cases, we have to increase the innodb_lock_wait_timeout value in the database server if we want the transaction to wait for more time and complete. We can find the deadlock details in "LATEST DETECTED DEADLOCK" section of the show engine innodb status\G output and therefore needs an investigation of the whole sql statements executed by both transactions with the application dev team to identify the root cause and it's resolution.

4) Unexpected errors, such as source table or even tmp table was corrupt processing of a subquery failed which was also sorting.

In this case, we have to repair the corrupted source table using repair table table_name or mysqlcheck.

5) insufficient memory for sort_buffer_size allocated or Out of sort memory; increase server sort buffer size.

Check the sort buffer size allocated on the server and increase a little at a time in amount of memory because it is a per connection variable and can cause out of memory on the server itself.

[perl]
mysql> show global variables like '%sort%';
+---------------------------+--------------+
| Variable_name | Value |
+---------------------------+--------------+
| max_length_for_sort_data | 1024 |
| max_sort_length | 1024 |
| myisam_max_sort_file_size | 107374182400 |
| myisam_sort_buffer_size | 36700160 |
| sort_buffer_size | 262144 |
+---------------------------+--------------+
5 rows in set (0.14 sec)
[/perl]

Apart from the above causes, Few more things needs to be checked when we see this type of messages in the error logs are :

1) How big is your biggest single table (data + index) ?
2) Do you have any JOINs that involve a huge amount of data being scanned or returned ?
3) What file-system do you have on your machine ?
4) Can you identify the queries that were running at that time ? How long does it run ?

Let's go through the above things in detail.

To find out the biggest table size in the database, execute the below sql query

[perl]
SELECT SQL_CALC_FOUND_ROWS TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, TABLE_ROWS,
TRUNCATE(DATA_LENGTH/pow(1024,2),2) as 'DATA (M)',
TRUNCATE(INDEX_LENGTH/pow(1024,2),2) as 'INDEX (M)',
TRUNCATE(((DATA_LENGTH+INDEX_LENGTH))/pow(1024,2),2) AS 'TOTAL(M)'
FROM information_schema.tables
WHERE TABLE_SCHEMA not in ( 'information_schema', 'mysql')
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY 8 DESC
LIMIT 1;
[/perl]

The easiest way to find the query with JOINs that involve a huge amount of data being scanned or returned is to enable slow query log on the database server and run the below command on your shell prompt by replacing the actual slow query log, which gives the maximum rows scanned list in the slow query log.

[perl]
$less slowquery.log | grep -i "# Query_time:" | awk '{print $9}' | sort -rn | more
123949125
123947793
123858573
123855366
76087924
75933311
75776390
75691301
75671196
75556710
[/perl]

After finding out the top few large scan number details, then run the below by replacing the highest number in the grep condition to see the query details.

[perl]
$less slowquery.log | grep 123949125 -A 2 B4 | more
[/perl]

Once we got the list of queries which are going for large scan, we can tune it by using Explain plan, creating appropriate indexes for the columns involved in the joins and the sort order etc.

Coming to different filesystems and it's limitations, it can be find here
The filesystems and it's type on a linux machine can be found by running the below commands in the shell prompt.

[perl]
[suresh@home]$ df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/mapper/sysvg-root
ext3 470000608 24019012 421721808 6% /
/dev/mapper/mysqlvg-mysql
ext3 945112464 289180564 607148540 33% /mysql
/dev/sda1 ext3 505604 17557 461943 4% /boot
tmpfs tmpfs 24724440 0 24724440 0% /dev/shm

[suresh@home]$ mount
/dev/mapper/sysvg-root on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/mapper/mysqlvg-mysql on /mysql type ext3 (rw)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
[/perl]

If we have reached the file system size limits for a give table then we have go for better filesystems.

To find the queries that are running at that time and how long is to log the processlist into a log file periodically using the below command.

[perl]
mysqladmin --user=xxxx --password=xxxxx --sleep=300 processlist
[/perl]

The above are the possible ways of finding the possible causes and it's resolution methods for the error "Sort aborted".

Previously, earlier to the version 5.5.10, for queries that were aborted due to sort problem, the server wrote the message Sort aborted to the error log.
In the latest versions, developers realize that they need to add more details to the error log when ever the error occurs, mysqld server writes more information to provide a more specific message which is useful, such as:

[perl]
[ERROR] mysqld: Sort aborted: Out of memory (Needed 24 bytes)

[ERROR] mysqld: Sort aborted: Out of sort memory,
consider increasing server sort buffer size

[ERROR] mysqld: Sort aborted: Incorrect number of arguments for
FUNCTION test.f1; expected 0, got 1
In addition, if the server was started with --log-warnings=2,
the server write information about the host, user, and query.
[/perl]

In addition, if the server was started with --log-warnings=2, the server write information about the host, user, and query.