fromJune 2014
Feature:

MySQL Query Optimization

Leaves, Trees, and Bad Nature Metaphors
0

A large part of MySQL optimization lies in improving poorly performing SQL queries. While tuning is important, it often has nowhere near the impact of actually fixing a poorly performing query. Fixing queries is also a lot more fun. Obviously query optimization is a large subject, and can’t possibly be covered in full in a single article. I highly recommend that you get a book on this subject; for any Drupal developer, it is well worth learning.

As a web developer using a CMS, you are only slightly removed from the SQL layer. Not completely knowing how to use this layer and how to optimize it is very limiting. To get you started, we will cover some very basic optimization, index usage, and join optimization techniques.

Index Basics

Even though indexes are very important for database performance, they are not completely understood by many developers, which often leads to easily-avoidable problems. The main issue is the mystical belief that the MySQL optimizer should be able to quickly run a query if an index so much as touches the columns in question. Sadly, indexes are not magical.

It is best to think of an index as a tree, largely because they are trees in most DB systems. (B+Trees, specifically; for more information, see http://wdog.it/4/1/btree.)

Thus, if you have an example index test that covers (columnA, columnB), you literally have a tree of columnA values, with columnB values in the leaves. If you have a query that has a WHERE condition on these two columns, MySQL will go through this tree looking for the correct columnA value first, and then go into the leaves of that object, and find the correct columnB value.

Due to this tree organization, there are some limitations. For example, if you have a query with a WHERE condition on columnB, and not columnA, it can’t use the index on (columnA, columnB). It makes sense. How would you get to the columnB values in this index? You must have a way to traverse the columnA branches to get to the columnB values. This leads to some interesting decisions. For example, if you have five queries that run against a table and three of them have WHERE conditions on columnB alone, that column should come first in the index so that it can be used.

Base Tables and Join Order

Now that you have a basic understanding of how indexes work, you need an equally basic understanding of joins. In keeping with our nature metaphor, we are going to visualize indexes as trees and tables as pools of water: specifically, you can think of a table as a large pool of water at the top of a hill, with smaller pools of water under it as you progress down the hill. Each pool has a little waterfall that flows into the pool below. The largest pool at the top of the hill is the base table for the join. The goal is to limit the size of the waterfalls and ensure that the resulting pool doesn’t need water treatment (temporary tables used to service a GROUP BY, ORDER BY, or DISTINCT).

Let’s consider an example. Our biggest pool is the Drupal node table. We then join against a taxonomy table (pool) and filter on a taxonomy type. We have no filter on the node table, and our only real filter – the one that defines our dataset – is on the taxonomy table. How do we limit the water flowing from the node pool into the taxonomy pool? We don’t. This is going to be a big waterfall.

The node table being the base table, and all the filtering/conditions being on sub-tables, is a very common problem in Drupal. The issue here seems conceptually simple: to fulfill this query, we need to join every node against every taxonomy term. But in reality, it would be far better to start with the taxonomy table and only join the filtered rows that match our condition. You always want to have your major conditions (the ones that limit your returned data the most) in the base table. Drupal – in particular Views – sometimes makes this very difficult.

Let’s consider another example. Again, we have node as our base pool, and we are going to join against taxonomy. This time our only filter is on node; that’s good, but then we are going to GROUP BY a column in taxonomy. Unfortunately, we now have a problem. We can’t execute this GROUP BY via an index when it’s not in the base pool, and thus the resulting pool, after the join, will need some water treatment – a temporary table sort. Again, this is a very common problem with Drupal Views queries. It all goes back to ensuring that you actually have the right table as the base for your query.

The next question is: How do you determine the base table for a join? For Drupal, the answer usually comes down to making sure your relationships in Drupal Views are set correctly. It is quite easy to have a Views query with all LEFT JOINs. A query like that forces MySQL to treat the node table as the base table, which is almost always a problem. Other than that issue, you often have to just run EXPLAIN on the query in question, and figure out why the base table is what it is, then edit your View or Views handlers to swap the tables or set the conditions differently.

Common Issues

Let’s cover some examples of common problems that arise with Drupal Views, indexing, and joins.

First, we’ll discuss the debugging options available to you with Views. In the Views settings, there are a few very important options for tracking Views and Views performance:

  • "Show the SQL query" (live preview section) This allows you to see the SQL query Views is building for execution. Obviously, this is quite important.
  • "Show performance statistics" (live preview section) This allows you to easily track build and execution time.
  • "Show other queries run during render during live preview" (live preview section) It is quite possible that other queries are being called during the processing of a View, besides the main view query. This is a common point of confusion and this debug setting will help immensely.
  • "Add Views signature to all SQL queries" (advanced tab) This is arguably the most important option on that page, as it allows you to actually tell where a query is coming from. Back in “the good old days,” you could just search for a query in an application to find it. These days, with the advent of Views and other dynamic query builders, it’s basically impossible. Having the Views signature built into the query is hugely valuable.

Now that you have Views set up to assist you, or at least to not insist on making your job impossible, it’s time to look at some common issues.

Range Queries

Range queries are a special case. A range query is one that is going to return a range of values in an index. A good example is a datetime query, such as SELECT * FROM table WHERE date > "SOMEDATE";.
Sometimes you may want to put a particular column at the beginning of an index (henceforth called the prefix of the index), but if that column’s condition is a range, you have to reconsider. MySQL cannot use any column of an index after a range query. So if your index is (columnA, columnB) and your query is SELECT * FROM table WHERE columnA > 5 AND columnB = 2;, you will not be using the index to satisfy the second condition (columnB = 2). In that situation, you must have columnB in the prefix of the index for it to be used.

The ORDER BY on an Unrelated Table

This truly is a classic for Drupal Views. Because so much of the data in Drupal is heavily normalized (split into different tables), it’s quite common to have your dataset defined by tableA (i.e., all your WHERE conditions are on this table) and then ORDER BY your data on a column in a totally different table. This is a problem, as the ORDER BY will have to be serviced by a temp table or filesort. The tracker query that ships with Views is a great example of this:

SELECT node.type AS node_type, node.title AS node_title, node.nid AS nid,
 users_node.name AS users_node_name, users_node.uid AS users_node_uid,
 node_comment_statistics.comment_count AS node_comment_statistics_comment_count,
 node_comment_statistics.last_comment_timestamp AS
 node_comment_statistics_last_comment_timestamp, history.timestamp AS
 history_timestamp, node.created AS node_created, node.changed AS node_changed,
 'tracker:default' AS view_name
 FROM
 {node} node
 LEFT JOIN {users} users_node ON node.uid = users_node.uid
 INNER JOIN {node_comment_statistics} node_comment_statistics ON
 node.nid = node_comment_statistics.nid
 LEFT JOIN {history} history ON node.nid = history.nid AND history.uid = '1'
 WHERE (( (node.status = '1') ))
 ORDER BY node_comment_statistics_last_comment_timestamp DESC
 LIMIT 25 OFFSET 0

This query has a single filter on node.status, and then ORDER BYs on node_comment_statistics. However, this query is better than most because at least node_comment_statistics is a required relationship in this View, which means it’s an INNER JOIN and not a LEFT JOIN, which means that if the optimizer decides that node.status as a filter is not very useful (i.e., all your nodes have the same status), it can “flip the join” and start with node_comment_statistics, making it possible to use an index for the ORDER BY. (You can only use an index for this when the table containing the column or columns in question is the base table.)

The Useless DISTINCT (“In Case of Accidents!”)

Views allows you to just “throw in” a DISTINCT in the Query Settings of the View. The ease with which you can add a DISTINCT to a query to ensure it doesn’t start returning duplicates is a big problem, much more so than you might think. Many times performance can be significantly improved by just removing these safety-net DISTINCTs. However, it takes a lot of work to differentiate between a useless DISTINCT and one that is really needed. It’s better to think about this from the beginning.

Tip: It is a good idea to always EXPLAIN a query after you add a DISTINCT. It’s good to know if you should be thinking about structuring a query a different way, or if the DISTINCT is fine when you are still in the development stage. Restructuring a query is far more difficult to do after the testing stage.

Starfish Syndrome (All LEFT JOINS)

When you create a query, Drupal Views has you define relationships between the data. When a relationship is not required, it’s translated into a LEFT JOIN in most cases. However, when it is required, it’s just an INNER JOIN. The problem with this is the amount of importance placed on a somewhat poorly worded checkbox. Quite often, this checkbox is left unchecked even when a relationship is required.

For example, if you have a user table, and another table that holds e-mail addresses for those users, there will always be at least one row per user, and this is a required relationship. However, it’s tempting to not check that box, because it doesn’t really impact functionality and it seems more complicated. The impact of this decision will be to greatly limit MySQL’s optimizer, as it will only ever be able to use the user table as the base table for the query.

We call this starfish syndrome, because if you diagram queries like this, you have the base table in the center and all the joined tables around it, like the legs of a starfish. The optimizer has no choice in JOINs. This is almost always bad for Views queries in particular, because often the default base table is node. Not only is this a huge table in most installations, but it often doesn’t define the returned data very well, since it doesn’t contain columns that define custom data.

Going back to our waterfall metaphor, you could think of this as a big pool in the middle with all the smaller pools flowing into it. The amount of water never gets smaller; the central pool just gets bigger and bigger.

Node Access

Node access is a problem. Having fine-grained permissions on a per-node basis is a nice idea. However, from a query performance perspective, it is terrible. It adds a set of filters that are totally separate from anything defining the dataset for your query, and that use tables that have nothing to do with your dataset. While it is possible, if you labor to make this work quickly and have good queries, you can, in general, either have node access or good queries. It’s an OR, not an AND, question.

There is work underway on refactoring node_access to work in a different way, with better performance on new versions of MySQL. Hopefully, that work will resolve this long-standing issue.

This article text is excerpted from High Performance Drupal published by O’Reilly Media, Inc., 2013, ISBN: 978-1-4493-9261-1. You can purchase the book from your local retailer, or directly from O’Reilly Media’s online store: http://wdog.it/3/2/hpd

Image: ©iStockphoto.com/chimpyk