We had a cron job running at 3am that synchronized inventory that started failing a couple of weeks ago. We recognized the issue only this last week. I determined the cause and the solution and wanted to share to help others in the community.
Every time the Magento cron runs, it checks to see if a cron job is already running and exits quietly if so (Oddly, this only behaves this way if shell execution is enabled in PHP. The php code version of this does not follow the same logic). This means that a long running cron job blocks scheduling and execution of other cron jobs. I will write a more complete post on Magento cron job behavior in general later.
Log cleaning of customer records gets exponentially slower as the amount of data increases. At some point, unless MySQL timeout settings are extemely high, this causes the script to fail with a timeout, causing the logs to never get cleaned, meaning they just keep growing.
This is the problem query:
SELECT `log_customer_main`.`log_id` FROM `log_customer` AS `log_customer_main` LEFT JOIN `log_customer` ON log_customer_main.customer_id = log_customer.customer_id AND log_customer_main.log_id < log_customer.log_id WHERE (log_customer.customer_id IS NULL) AND (log_customer_main.log_id < 553985)
It comes from line 147 of Mage_Log_Model_Resource_Log in the _cleanCustomers() method.
The purpose of this query seems to be to get the latest log id for each customer. It is doing this by finding the negative results of join queries of each row to similar rows (by customer_id) that have a higher log_id entry.
A much simpler way to accomplish this is with a group by statement and a MAX expression.
SELECT MAX(log_id) as log_id FROM log_customer GROUP BY customer_id
In my testing, this always produces the same results. The difference is that the time for the latter query is relatively constant regardless of the amount of data. The time for the former grows exponentially as the data grows.
For 90 records, the first query took 18 seconds in my local environment. The second query took 0.4 seconds. In our production environment, the first query won’t even run because the timeout is set to less than four hours. The second query executes in less than 0.4 seconds – with 320,000 records.
The query looks like this in Magento code:
$select = $readAdapter->select() ->from($this->getTable('log/customer'), array('log_id' => new Zend_Db_Expr('MAX(log_id)'))) ->group('customer_id') ->where('log_id < ?', $lastLogId + 1);
If anyone sees a reason why the first query must be used over the second, please let me know. They both seem to accomplish the same goal and produce the same results every time for me.
Here is a quick sample file on how to fix this: