MySQL (InnoDB) clustered and non-clustered indexes
We all remember the classical explanation about the indexes in the database and how they make the task easier to find the right lines. I'm sure most of you visualizes something like this:
It becomes clear right away that it takes much less effort to find two or three right lines throughout the data. It is brilliant, easy, and clear.
Personally, I always thought that there is no room for any improvement regarding this method until I got familiar with the clustered indexes. It turned out that the non-clustered indexes are not that perfect as I though.
So here are some questions: What is exactly a clustered index? Why is it better than the non-clustered one? What’s going on with it in MySQL?
In order to avoid some confusion, so far we will take a look at the non-clustered index step by step. A simplified non-clustered index can be represented as a separate table, and each line of the table refers to one or more lines in this table with data. The lines in the index table are ordered and grouped by the values of key fields. Let’s imagine a simple query:
SELECT * FROM `t1` WHERE `fld1` = 12;
There will be read and checked every line without any indexation, and the lines that will not satisfy a set condition just never get in the result, but they all will be read.
When using the non-clustered index, the finding task is strongly accelerated. First, the index table is much smaller than the tables with data, and therefore it can be read faster. Second, the DBMS most often try to cache the indexes in RAM, which is faster than the hard drive. Third, the indexes do not duplicated lines, so as soon as we found the first value, the search can be stopped. Fourth, the data is sorted in the index. Third and fourth together allow using a binary search algorithm (it is also known as a bisection method), which efficiency surpasses a simple search.
If resources allow, the data table should be cached in RAM. In addition, it is accepted for obvious reasons to pay more attention to the indices and space for them in RAM.
Indexing is a great power, but if all arrows of the index table will be connected to the lines in a data table simultaneously, we get quite complicated web:
This web with many intersecting arrows brings us to the problem, which creates the non-clustered index.
MySQL optimizer might not use indexes to search by the small tables at all. Why is that? It is because the search reads data sequentially and a pointer refers to the separate sections of the data in the index. Eventually, these jumps all over references from index may not be worth of that.
So what we have at this stage of the indexing evolution. Let’s imagine a large fragmented table in terms of indexing. For example, the data came chaotic and unsorted, so it was stored like that. Now imagine an index table and our good old query:
SELECT * FROM `t1` WHERE `fld1` = 12;
What is happening? There is the value in the index. The lines are read from the data table on which this index refers to. Naturally, if there is a big fragmentation of table, then the burden of reading from its different parts becomes visible.
Here it will be useful to us ...
The clustered indexes are different from the non-clustered in the same way as the table of contents differs from an alphabetical index. This index (non-clustered index) gives the exact number of pages (lines in the database). Table of contents specifies a range of pages corresponding to a particular chapter, where is already a search word. By the way each chapter can contain its own table of contents, if it is large enough.
The clustered index is a tree data structure, where index values are stored with the data relevant to them. The indexes and data are in order. When a new line is added in the table, it is written in the right branch of the tree structure, corresponding to sorting.
There is one of the most powerful and productive engines for MySQL - InnoDB. There are many reasons for that, and one of them is the clustered indexes. The easiest way to understand the structure of clustered indexes is to present them in the dynamics, such as how they grow as the data is add, and how the table starts to branch out.
The first stage: a flat list
The data in InnoDB is stored by pages in the size of 16 Kb. The size of one page is the size limit of node of the tree structure, which determines at what point to start branching. If the entire table is fit into one page, then it is stored as a flat list, sorted by the key field without a separate index table.
All our data will be presented in the form of same small tables in the future, and the chain index pages will connect them to the tree.
The second stage: the tree
When data is no longer fits into one page, the list becomes a tree. A page with the data is divided into two, though in the node where the data was before now is the index that covers both new pages. A specific node of the tree must include the indices of all the child nodes or the resulting data, if the node is last one. Nodes can refer to each other only in one direction: from a parent to the child.
As the data will be added more and more, the tree will become more complicated and deeper. The more it will grow and branch out, the better will be system of data storage.
The gray pages are identical to the page of the first stage; it is just a sorted data, leaves (ending nodes) of our tree. Blue pages are intermediate nodes of the tree, and they contain only the index and do not contain data. The arrows are marked the search ways of specific key values.
Let’s recall our query (green arrow):
SELECT * FROM `t1` WHERE `fld1` = 12;
Referring to the table, the query gets on the first page and receives the index, and immediately sends it to the ending page with the data where are the lines that satisfy the search criteria. The page is already read during the search, all data is collected, and the database can return a response.
However, the index that points to another page does not necessarily lead directly to a page with the data. The index may point to a page with an intermediate index. It is possible if there are large sizes of tables, and the database will have to make more iterations of search, but each such iteration involves a minimum size of data, so the search still goes faster.
This simple rule is valid, and it is relevant for any type of index: the more various data, the more efficient is to use the index to find the specific values.
Since the data is part of the index that sorted and deliberately fragmented, it is obvious that only one cluster key can be used for a single table. Also, there is one more important consequence, namely they are the write operations, and especially changing the key fields in existing data. It is extremely resource-intensive process. Try to use rarely changed fields for the clustered indexes.
As for the complex (composite) clustered keys, they use exactly the same system, only the data sorting is done by two fields. The very same index is not very different from the non-clustered composite key.
Clustered keys in InnoDB
It's simple. Every InnoDB table has a clustered key. No exceptions.
It is much more interesting what fields are selected for this purpose.
• If in the table is set PRIMARY KEY – this is it.
• Otherwise, if the table has UNIQUE indexes - this is the first of them.
• Otherwise, InnoDB itself creates a hidden field with the surrogate ID in size of 6 bytes.
In my opinion, it is better if you will not let your server get to the third item, so you should add the ID on your own.
Do not forget that InnoDB holds in the secondary keys a complete field value set of clustered key as a reference to the ending line in the table. The bigger is the primary key, the greater are the secondary keys.
|Vote for this post
Bring it to the Main Page