Magento Customer and Visitor Logging

Related to the issue in my previous post about log cleaning, I would like to take some time to discuss customer and visitor logging in a little more detail.

The log_url_info table can bloat your database quickly. It logs every single unique visited URL. I have commonly seen this table grow to 5GB or more, even when logs are only kept for a short period. As a general rule, database size has a direct impact on performance: the larger the tables, the slower the queries and the backups.

So what is this data really used for and is there a safe way to reduce it without affecting site functionality?

Magento’s primary purpose for this data seems to be analytics. In my opinion, there are much more robust platforms for this, like Google Analytics or Omniture, and those should be used instead. If anyone sees analytics value in Magento’s data over other platforms, please share your experience. Also note that if you are using a proxy cache like Squid or Varnish, a frontend analytics platform like Google Analytics will track data more accurately, since requests are not always sent to the server: Varnish will serve a cached page directly, the Magento application will not run, and the customer visit will never be logged.

These logs also drive the “Online Customers” page in the admin. When you load that page, the log_visitor table is read and the log_visitor_online table is populated with the users who have visited within the threshold you have set (by default, 15 minutes).

/**
 * Mage_Log_Model_Resource_Visitor_Online 
 * 
 * This is the method that prepares log_visitor_online for viewing 
 * on the "Online Customers" page
 */
public function prepare(Mage_Log_Model_Visitor_Online $object)
{
    // Check if log_visitor_online was recently updated. If so, exit
    if (($object->getUpdateFrequency() + 
        $object->getPrepareAt()) > time()) {
        return $this;
    }
    
    //...
    // Delete existing data from log_visitor_online
    $writeAdapter->delete($this->getMainTable());
    
    //...
    // Build online visitors based on online interval    
    $lastDate = Mage::getModel('core/date')->gmtTimestamp() 
        - $object->getOnlineInterval() * 60;

    $select = $readAdapter->select()
        ->from(
            $this->getTable('log/visitor'),
            array('visitor_id', 'first_visit_at', 
                  'last_visit_at', 'last_url_id'))
        ->where('last_visit_at >= ?', 
            $readAdapter->formatDate($lastDate));

    // Add additional visitor data    
    // ...
    
    // Save the time this data was prepared (stored in cache; see setPrepareAt() below)
    $object->setPrepareAt();
    
    return $this;
}

The time this was last calculated is stored in cache and prevents the data from being recalculated until a specified amount of time has passed. This threshold defaults to 60 seconds and is dictated by the “log/visitor/online_update_frequency” config setting. Clearing the cache resets this and causes log_visitor_online to rebuild on the next page load. Oddly, Magento does not expose this setting in the admin via a system.xml file. You can set it in a config file, change it in the database directly, or add your own system.xml definition (see the example after the next snippet).

/**
 * Mage_Log_Model_Visitor_Online
 * 
 * The time log_visitor_online was built is stored in cache.
 */
public function getPrepareAt()
{
    return Mage::app()->loadCache('log_visitor_online_prepare_at');
}

public function setPrepareAt($time = null)
{
    if (is_null($time)) {
        $time = time();
    }
    Mage::app()->saveCache($time, 'log_visitor_online_prepare_at');
    return $this;
}
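
If you want to change that update frequency without building out a full system.xml definition, the value can also be written with Magento’s config model from a small standalone script. This is only a sketch: the file name and the 300-second value are my own examples, not anything Magento ships.

<?php
// set_online_update_frequency.php - example only; run from the Magento root.
require_once 'app/Mage.php';

Mage::app('admin');

// Write the value into core_config_data at the default scope.
// 300 means log_visitor_online will be rebuilt at most once every 5 minutes.
Mage::getConfig()->saveConfig('log/visitor/online_update_frequency', 300, 'default', 0);

// Flush the config cache so the new value is actually picked up.
Mage::app()->getCacheInstance()->cleanType('config');

echo "Done\n";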

I have not found any other direct uses of this data within Magento. From my review, it appears safe to disable this logging entirely, if you are OK with losing the “Online Customers” functionality. However, I don’t recommend that unless you have similar functionality in another analytics platform; seeing online customers is pretty useful. Should you want to do so, here is a good post on how to disable customer and visitor logging. It might even yield a performance improvement, since Magento would no longer write to the log tables on every request.

Recommendation

My recommendation is to reduce the number of days logs are saved to one. This keeps the “Online Customers” list functional while keeping the table sizes relatively small. You can do this at System > Configuration > Advanced > System > Log Cleaning. If you have a high-traffic site, you should also apply the patch mentioned in my previous post that speeds up log cleaning; otherwise, cleaning can become slow, fail to run, and the tables will grow larger every day.

Log Cleaning Config Screenshot
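
The same setting can also be pushed programmatically if you manage configuration in code rather than through the admin. A minimal sketch, assuming the “Save Log, Days” field maps to the system/log/clean_after_day config path:

<?php
// set_log_clean_days.php - example only; run from the Magento root.
require_once 'app/Mage.php';

Mage::app('admin');

// Keep only one day of customer/visitor logs.
// Assumes "Save Log, Days" maps to system/log/clean_after_day.
Mage::getConfig()->saveConfig('system/log/clean_after_day', 1, 'default', 0);

// Flush the config cache so the new value takes effect.
Mage::app()->getCacheInstance()->cleanType('config');

echo "Done\n";

Make sure “Enable Log Cleaning” is set to Yes in the same admin section, or the value has no effect.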

If you are reading this, you probably already have huge log tables. You can feel relatively safe about nuking this data, unless you have custom modifications that depend on it. You were only planning on keeping it for 180 days anyway. Always do your own research, though, and always test before you do anything in Production.

How to clear this data:

TRUNCATE log_customer;
TRUNCATE log_quote;
TRUNCATE log_summary;
TRUNCATE log_url;
TRUNCATE log_url_info;
TRUNCATE log_visitor;
TRUNCATE log_visitor_info;
TRUNCATE log_visitor_online;

Magento Log Cleaning Blocking Other Cron Jobs

We had a cron job, running at 3am, that synchronizes inventory; it started failing a couple of weeks ago, but we only recognized the issue this past week. I determined the cause and the solution and wanted to share them to help others in the community.

Every time the Magento cron runs, it checks whether a cron job is already running and exits quietly if so (oddly, it only behaves this way when shell execution is enabled in PHP; the pure PHP version of the cron entry point does not follow the same logic). This means a long-running cron job blocks the scheduling and execution of every other cron job. I will write a more complete post on Magento cron job behavior in general later.

Log cleaning of customer records gets exponentially slower as the amount of data increases. At some point, unless your MySQL timeout settings are extremely high, the query fails with a timeout, the logs never get cleaned, and they just keep growing.

This is the problem query:

SELECT `log_customer_main`.`log_id`
FROM `log_customer` AS `log_customer_main`
LEFT JOIN `log_customer`
    ON log_customer_main.customer_id = log_customer.customer_id
    AND log_customer_main.log_id < log_customer.log_id
WHERE (log_customer.customer_id IS NULL)
    AND (log_customer_main.log_id < 553985)

It comes from line 147 of Mage_Log_Model_Resource_Log in the _cleanCustomers() method.

The purpose of this query seems to be to get the latest log id for each customer. It does this by left-joining each row to the other rows with the same customer_id that have a higher log_id, then keeping only the rows with no such match (where the joined columns come back NULL), i.e. the highest log_id per customer.

A much simpler way to accomplish this is with a GROUP BY clause and a MAX() expression.

SELECT MAX(log_id) as log_id
FROM log_customer
GROUP BY customer_id

In my testing, this always produces the same results. The difference is that the time for the latter query is relatively constant regardless of the amount of data. The time for the former grows exponentially as the data grows.

For 90 records, the first query took 18 seconds in my local environment. The second query took 0.4 seconds. In our production environment, the first query won’t even run because the timeout is set to less than four hours. The second query executes in less than 0.4 seconds – with 320,000 records.

The query looks like this in Magento code:

$select = $readAdapter->select()
    ->from(
        $this->getTable('log/customer'),
        array('log_id' => new Zend_Db_Expr('MAX(log_id)')))
    ->group('customer_id')
    ->where('log_id < ?', $lastLogId + 1);

If anyone sees a reason why the first query must be used over the second, please let me know. They both seem to accomplish the same goal and produce the same results every time for me.
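
If you want to check this against your own data before patching anything, one option is to run both queries and diff the results. Keep in mind that the first query is the slow one, so do this against a staging copy or a subset of the table. A rough sketch (the script name is mine, and the cutoff should be taken from your own environment):

<?php
// compare_log_cleaning_queries.php - example only; run from the Magento root.
require_once 'app/Mage.php';

Mage::app('admin');

$resource = Mage::getSingleton('core/resource');
$read     = $resource->getConnection('core_read');
$table    = $resource->getTableName('log/customer');

// Use the $lastLogId value from your own environment.
$cutoff = 553985;

// Original self-join query from _cleanCustomers()
$slow = $read->fetchCol(
    "SELECT log_customer_main.log_id
     FROM {$table} AS log_customer_main
     LEFT JOIN {$table} AS log_customer
         ON log_customer_main.customer_id = log_customer.customer_id
         AND log_customer_main.log_id < log_customer.log_id
     WHERE log_customer.customer_id IS NULL
         AND log_customer_main.log_id < " . (int)$cutoff
);

// MAX()/GROUP BY replacement, with the same cutoff applied so the
// comparison is apples-to-apples.
$fast = $read->fetchCol(
    "SELECT MAX(log_id) AS log_id
     FROM {$table}
     WHERE log_id < " . (int)$cutoff . "
     GROUP BY customer_id"
);

sort($slow);
sort($fast);
echo ($slow === $fast) ? "Results match\n" : "Results differ\n";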

Here is a quick sample file on how to fix this:
My_Module_Model_Resource_Log
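
For reference, here is roughly what such an override can look like. This is only a structural sketch of my own, not the sample file itself: the module name is a placeholder, you still need a config.xml rewrite of the log/log resource model pointing at this class, and the body of _cleanCustomers() should be copied from the core method with only the select swapped out.

<?php
/**
 * Structural sketch only. Copy Mage_Log_Model_Resource_Log::_cleanCustomers()
 * into this class and replace the slow self-join select with the
 * MAX()/GROUP BY version shown above.
 */
class My_Module_Model_Resource_Log extends Mage_Log_Model_Resource_Log
{
    protected function _cleanCustomers($time)
    {
        // ... identical to the parent method up to the point where the
        // "latest log_id per customer" select is built, which becomes:
        //
        // $select = $readAdapter->select()
        //     ->from(
        //         $this->getTable('log/customer'),
        //         array('log_id' => new Zend_Db_Expr('MAX(log_id)')))
        //     ->group('customer_id')
        //     ->where('log_id < ?', $lastLogId + 1);
        //
        // ... remainder of the parent method unchanged ...

        return $this;
    }
}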