Query Optimization
Picking Up Where You Left Off……
I started this as a response to Keith Murphy's post at http://www.paragon-cs.com/wordpress/?p=54, but it got long, so it deserves its own post. The basic context is figuring out how not to cause duplicate information if a large INSERT statement fails before finishing.
Firstly, the surefire way to make sure there are no duplicates if you have a unique (or primary) key is to use INSERT IGNORE INTO.
Secondly, I just experimented with adding an index to an InnoDB table that had 1 million rows, and here's what I got (please note, this is one experience only, the plural of "anecdote" is *not* "data"; also I did this in this particular order, so there may have been caching taking place):
Way #1:
- ALTER the table to add the new index. This was the slowest method, taking over 13 minutes.
Way #2:
- CREATE a new table with the same schema as the old except for adding the new index
- INSERT INTO newtable SELECT * FROM oldtable;
- ALTER TABLE oldtable RENAME somethingdifferent;
- ALTER TABLE newtable RENAME oldtable;
The ALTER TABLEs happen instantly. This was faster by a few seconds, which is statistically negligible given the 13+ minutes total time.
Way #3:
- mysqldump the table schema only (--no-data) into a file (tableschema.sql).
- mysqldump the table data only (-t) into another file (tabledata.sql).
- optionally pipe into awk to replace "^INSERT INTO" with "INSERT IGNORE INTO"
- edit the table schema file, adding the new index into the table definition
- optionally change the name of the table to something like newtable, making sure to change the DROP TABLE *and* CREATE TABLE statements.
- mysql < tableschema.sql (this will drop the old table unless you changed the name)
- mysql < tabledata.sql ()
- If you changed the table name in the DROP and CREATE statements, run - ALTER TABLE oldtable RENAME somethingdifferent; and ALTER TABLE newtable RENAME oldtable;
- Delete the "somethingdifferent" table
This way took just over 10 minutes, 3 minutes faster than the other 2 ways, for a time savings of 25%.
CAVEAT: MySQL helpfully moves references on a table to the new table name when you ALTER TABLE...RENAME. You will have to adjust your foreign keys, stored procedures, functions and triggers if you use anything other than Way #1.
CAVEAT #2: Make sure that the character set of the MySQL server is supported by the MySQL client and the operating system where you're dumping the file to, otherwise special characters can end up falling victim to mojibake.
Top 10 MySQL Best Practices
So, O'Reilly's ONLamp.com has published the "Top 10 MySQL Best Practices" at http://www.onlamp.com/pub/a/onlamp/2002/07/11/MySQLtips.html. Sadly, I find most "best practice" list do not thoroughly explain the "why" enough so that people can make their own decisions.
For instance, #3 is "Protect the MySQL installation directory from access by other users." I was intrigued at what they would consider the "installation" directory. By reading the tip, they actually mean the data directory. They say nothing of the log directory, nor that innodb data files may be in different places than the standard myisam data directories.
They perpetuate a myth in #4, "Don't store binary data in MySQL." What they really mean is "don't store large data in MySQL", which they go into in the tip. While it's true that there is very little benefit to having binary data in a database, they don't go into what those benefits are. This means that people can't make informed decisions, just "the best practice is this so I'm doing it."
The benefit of putting binary data in MySQL is to be able to associate metadata and other data. For instance, "user 200 owns file 483". If user 200 is gone from the system, how can you make sure file 483 is as well? There's no referential integrity unless it's in the database. While it's true that in most cases people would rather sacrifice the referential integrity for things like faster database backups and easier partitioning of large data objects, I believe in giving people full disclosure so they can make their own informed decision.
#5 is my biggest pet peeve. "Stick to ANSI SQL," with the goal being to be able to migrate to a different platform without having to rewrite the code. Does anyone tell Oracle folks not to use pl/sql like collections? Nobody says "SQL is a declarative language, pl/sql is procedural therefore you should never use it". How about SQL Server folks not to use transact-sql statements like WAITFOR? MATCH... AGAINST is not standard SQL, so I should never use it?
Now, of course, if you're selling a product to be run on different database platforms, then sure, you want to be platform agnostic. But you'd know that from the start. And if you have to migrate platforms you're going to have to do lots of work anyway, because there are third-party additions to all the software any way.
And why would *anyone* choose a specific database, and then *not* use those features? I think that it's a good tip to stick to ANSI SQL if you *know* you want to, or if you have no idea about the DBMS you're using.
If you want to see how this cripples MySQL, check out Visibone's SQL chart at: http://www.visibone.com/sql/chart_1200.jpg -- you can buy it here:
I would amend #6 by explaining how the industry standard is to use sequences, as well as data tables such as calendar tables, and that stored procedures should be built to deal with sequences. The problem with a coded solution is that it is difficult do the proper locking needed for sequences if more than one access is needed at a time. The reason I say stored procedures is that I believe that the database should handle the locking of tables, not the code, just as I believe referential integrity should be enforced at the database level. Everything else is pretty good. I had a list a while back of "SQL Best Practices" here: http://sheeri.com/archives/104. I may post later on about my own personal MySQL Best Practices....
Making Queries 45-90 Times Faster!!
aka....."when good queries go bad!"
So, today the developers were debugging why a script was running much longer than expected. They were doing text database inserts, and got to the point where they realized that double the amount of text meant the queries took double the amount of time.
You see, they were doing similar text inserts over and over, instead of using connection pooling and/or batching them. Apparently the other DBA explained that it was a limitation of MySQL, but either the developers didn't convey what they were doing well, or the DBA didn't think to mention batching.
I ran a simple test on a test server. I used the commandline to connect to a db server on the same machine (even though in qa and production the db machine is on a different machine) just to make a point:
| Type | Connects | Queries | Length of data transmitted | Time |
|---|---|---|---|---|
| One-off | 1000 | 1 | 619 bytes | 12.232s |
| Single Connection | 1 | 1000 | 604 kilobytes | 0.268s |
| Batch | 1 | 1 | 517 kilobytes | 0.135s |
So 1000 INSERTs using 1 connection is over 45 times faster than 1000 INSERTs using 1000 connections.
Using 1 batch INSERT statement is over 1.75 times faster than using 1 connection.
Using 1 batch INSERT statement is over 90 times faster than 1000 INSERTs using 1000 connections.
Note that while it's faster to send a batch, if you don't support sending 517 kilobytes to your database at once, you'll want to break it up. That's a small coding price to pay for 90x the database performance!!!
For reference, the formats used:
One-off:
INSERT INTO foo (col1, col2...) VALUES (val1, val2...);
Single Connection:
INSERT INTO foo (col1, col2...) VALUES (val1, val2...);
INSERT INTO foo (col1, col2...) VALUES (val1a, val2a...);
Batch: INSERT INTO foo (col1, col2...) VALUES (val1, val2...), (val1a, val2a);
OurSQL Episode 22: Things To Avoid With MySQL Queries
Feature:
Things to Avoid in Queries
Subqueries and Correlated subqueries
http://dev.mysql.com/doc/refman/4.1/en/correlated-subqueries.html
Jan Kneschke's post on Groupwise Maximum:
http://jan.kneschke.de/projects/mysql/groupwise-max
Calculated comparisons do not use indexes
INSERT IGNORE
REPLACE
INSERT...ON DUPLICATE KEY
Feedback:
Email podcast@technocation.org
call the comment line at +1 617-674-2369
use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri
or use the Technocation forums:
http://technocation.org/forum
Direct play this episode at:
http://technocation.org/content/oursql-episode-22%3A-things-avoid-mysql-queries
OurSQL Episode 19: MySQL Proxy
Direct play the episode at:
http://technocation.org/content/oursql-episode-19%3A-mysql-proxy-0
Feedback:
http://odeo.com/sendmeamessage/Sheeri
Call the comment line at +1 617-674-2369.
E-mail podcast@technocation.org
News:
MySQL Focuses on Japan
http://www.mysql.com/news-and-events/news/article_1368.html
MySQL Associate Certification Now Available
http://www.mysql.com/certification/
Learning resource:
Pythian Group's Carnival of the Vanities for the DBA community, published weekly on Fridays.
http://www.pythian.com/blogs/category/log-buffer/
Feature:
MySQL Proxy
Giuseppe Maxia's Blog:
http://datacharmer.blogspot.com
Getting Started with MySQL Proxy article plus tutorials:
http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html
public Subversion tree:
http://forge.mysql.com/wiki/mysql_proxy
Tutorials:
Intercept and dump queries (part 1): http://forge.mysql.com/snippets/view.php?id=75
Make macros to map "cd" to "use" and "ls" to "show tables" (part 2): http://forge.mysql.com/snippets/view.php?id=76
Injection Queries (part 3): http://forge.mysql.com/snippets/view.php?id=77
Lua interpreted language:
http://www.lua.org/
OurSQL Episode 18: De-myth-tifying Indexes
Direct play this episode at:
http://technocation.org/content/oursql-episode-18%3A-de-myth-tifying-indexes-0
Feedback:
Email podcast@technocation.org
call the comment line at +1 617-674-2369
use Odeo to leave a voice mail through your computer:
http://odeo.com/sendmeamessage/Sheeri
Or use the Technocation forums:
http://technocation.org/forum
Episode 18: De-Myth-tifying Indexes
Where I have been:
Wedding video (short) and photos:
http://www.sheeri.com/wedding
Honeymoon (and wedding) photos:
http://www.sheeri.com/photos
News:
FREE training for Oracle DBAs who want to learn MySQL
http://www.planetmysql.org/robin/?p=51
http://www.mysql.com/oracle/
mysqlnd (PHP native driver) needs testers and benchmarkers:
http://www.planetmysql.org/kaj/?p=117
Learning Resource:
The MySQL category at howtoforge.com:
http://www.howtoforge.com/taxonomy_menu/1/7
Podcast Promo:
http://www.themysqlguy.com
Feature:
Big O notation:
http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node8.html
BTREE Index podcast:
http://technocation.org/content/oursql-episode-13%3A-nitty-gritty-indexes-0
HASH Index podcast:
http://technocation.org/content/oursql-episode-17%3A-hashing-it-out-0
Quote:
http://www.helpothers.org/story.php?sid=6784
Music:
The main theme used is Angry Red Dwarf's "I Dream About You"
http://tinyurl.com/26hbg6
Smallfish's "The Thank you song"
http://www.smallfishadventures.com/Home.html
OurSQL Episode 10: How About Some Cache?
This week I talk about the MySQL Query Cache.
Direct play the podcast here:
http://technocation.org/content/oursql-episode-10%3A-how-about-some-cache%3F-0
Subscribe to the podcast by clicking:
http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=206806301
You can Direct download all the oursql podcasts at:
http://technocation.org/podcasts/oursql/
Show notes:
Listener Feedback:
http://dev.mysql.com/doc/refman/4.1/en/mysql-commands.html
News:
Daylight Savings Time and how to check your system:
http://sheeri.net/archives/188
There's not much more time left to register for the MySQL Users Conference & Expo before the $200 early bird discount disappears!
http://www.mysqlconf.com
Learning Resource:
Check out the 2006 MySQL conference presentation slides by the speakers! http://mysqlconf.com/pub/w/45/presentations.html
Feature: How about some cache?
The MySQL Manual has a short, very readable chapter on the Query Cache, which starts here:
MySQL Queues, part II — groups of queues
I believe this is a huge optimization for a heavily implemented Web 2.0 idea.
This article makes simple work of groups of queues. An example of this would be "the most recent 10 people to view an article," so each article has a queue of up to 10 items in it. This method eliminates the need for multiple SQL statements or using TRIGGERS to check to see if the queue is full.
I bow down to Baron Schwartz, aka Xarpb, for his article on how to implement a queue in SQL:
http://www.xaprb.com/blog/2007/01/11/how-to-implement-a-queue-in-sql/
I am very excited because this also works for groups of objects, and we're about to implement something at work that needs this idea. The idea of "the most recent x things" or "the top x things" is huge, especially in social networking, and probably one of the most often sought after features.
The biggest issue is that in order to display, say, the most recent posts, a query has to find the time of all the posts and only get the most recent 10. This can be made easy by the logic that the 10 most recent posts are the last 10 rows in the table. Any logic is also added, as in "the last 10 rows in the table viewable and for this guest/login."
What if you want to track the last 10 people to view the post? Aha, this gets trickier. Convention would say that when a person views a post, have an SQL transaction that adds the information (person x viewed post y at time z and anyo other info, such as browser type, IP, etc) and if there are more than 10 entries for that post, delete the oldest ones until you have 10 entries. This transaction could be done via the application code or via triggers in MySQL 5.0 and up.
However, both those methods use multiple SQL queries, and in the case that an article has been viewed fewer than 10 times, the queries are unnecessary. And given each article has a different popularity -- some are viewed lots more than others -- running multiple queries ends up being a waste of cycles for articles whose last 10 viewers change infrequently.
These commands were tested on MySQL 4.1.19-standard-log. I use REPLACE INTO because it's shorter than SELECT...ON DUPLICATE KEY UPDATE, and yes, those aren't
Let's say you have a New Year's Resolution to eat 5 servings of fruits and 5 servings of vegetables per day. The only thing that changes from Baron's example is that we add a group field (called 'kind'). The "fruit" field was changed to "edible" and will still contain the name of the edible.
As Baron does, I will use a MySQL-specific command. However, he used SELECT...ON DUPLICATE KEY and I will use REPLACE, as it is smaller in syntax.
use test;
CREATE TABLE q (
id int NOT NULL,
modulo int NOT NULL,
kind char(1) NOT NULL,
food varchar(10) NOT NULL,
PRIMARY KEY(id,kind),
UNIQUE KEY(modulo,kind)
);
The basic statement is below -- I've added AS clauses to make the variables more clear. The modulus is, in this case, 5, but in the article case above would be 10. The "kind" is either "f" or "v", these are your groups of queues. In this case they stand for "fruits" and "vegetables" but they might be numbers referring to articles. The "food" stands for the type of food eaten, but in the article scenario would represent the username or user id of the customer viewing the article.
REPLACE INTO q (id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1) AS id,
(COALESCE(MAX(id), -1) + 1) MOD 5 AS modulo,
'f' AS kind,
'apple' AS food
FROM q WHERE kind='f';
mysql> SELECT * FROM q order by kind,id;
| id | modulo | kind | food |
| 0 | 0 | f | apple |
As expected, 1 "fruit" row.
mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'f',
-> 'orange'
-> FROM q WHERE kind='f';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM q order by kind,id;
| id | modulo | kind | food |
| 0 | 0 | f | apple |
| 1 | 1 | f | orange |
As expected, 2 "fruit" rows.
mysql> REPLACE INTO q(id, modulo, kind, food)
-> SELECT
-> (COALESCE(MAX(id), -1) + 1),
-> (COALESCE(MAX(id), -1) + 1) MOD 5,
-> 'v',
-> 'okra'
-> FROM q WHERE kind='v';
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM q order by kind,id;
| id | modulo | kind | food |
| 0 | 0 | f | apple |
| 1 | 1 | f | orange |
| 0 | 0 | v | okra |
As expected, 2 "fruit" rows and 1 "vegetable" row. Now, let's quickly populate the fields so the "fruit" group reaches it's maximum of 5.
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'squash'
FROM q WHERE kind='v';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'peach'
FROM q WHERE kind='f';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'cherries'
FROM q WHERE kind='f';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'pear'
FROM q WHERE kind='f';
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'v',
'celery'
FROM q WHERE kind='v';
SELECT * FROM q order by kind,id;
| id | modulo | kind | food |
| 0 | 0 | f | apple |
| 1 | 1 | f | orange |
| 2 | 2 | f | peach |
| 3 | 3 | f | cherries |
| 4 | 4 | f | pear |
| 0 | 0 | v | okra |
| 1 | 1 | v | squash |
| 2 | 2 | v | celery |
We have 5 values in the "fruit" group and 3 values in the "veggie" group. Now let's see what happens when another fruit is added:
REPLACE INTO q(id, modulo, kind, food)
SELECT
(COALESCE(MAX(id), -1) + 1),
(COALESCE(MAX(id), -1) + 1) MOD 5,
'f',
'banana'
FROM q WHERE kind='f';
Query OK, 2 rows affected (0.00 sec)
Records: 1 Duplicates: 1 Warnings: 0
Note that a duplicate has been found! This is because the modulo wrapped around. The id of "banana" is 5, and 5 modulo 5 = 0 - the same as 0 modulo 5, which was the modulo value previously taken by "apple". So "apple" is pushed off the end of the queue.
SELECT * FROM q order by kind,id;
| id | modulo | kind | food |
| 1 | 1 | f | orange |
| 2 | 2 | f | peach |
| 3 | 3 | f | cherries |
| 4 | 4 | f | pear |
| 0 | 5 | f | banana |
| 0 | 0 | v | okra |
| 1 | 1 | v | squash |
| 2 | 2 | v | celery |
To find the current list of all fruits, with the most recent fruit first, run:
SELECT * FROM q WHERE kind='f' ORDER BY id DESC;
| id | modulo | kind | food |
| 1 | 1 | f | orange |
| 2 | 2 | f | peach |
| 3 | 3 | f | cherries |
| 4 | 4 | f | pear |
| 0 | 5 | f | banana |
Let's get back to the example of page views, though. We probably care about when the pages were viewed, so let's add a timestamp:
ALTER TABLE q ADD COLUMN fed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
I ran the queries again, with some delays, so the timestamps wouldn't all be the same.
SELECT * FROM q order by id,kind;
| id | modulo | kind | food | fed |
| 1 | 1 | f | orange | 2007-01-15 14:48:25 |
| 2 | 2 | f | peach | 2007-01-15 14:48:28 |
| 3 | 3 | f | cherries | 2007-01-15 14:48:28 |
| 4 | 4 | f | pear | 2007-01-15 14:48:31 |
| 5 | 0 | f | banana | 2007-01-15 14:48:34 |
| 1 | 1 | v | squash | 2007-01-15 14:48:28 |
| 2 | 2 | v | celery | 2007-01-15 14:48:31 |
| 3 | 3 | v | beet | 2007-01-15 14:48:31 |
| 4 | 4 | v | spinach | 2007-01-15 14:48:34 |
| 5 | 0 | v | cucumber | 2007-01-15 14:48:34 |
Or, what the query would be in a real system -- find all fruits eaten and sort by time, most recent first:
SELECT food,fed FROM q WHERE kind='f' ORDER BY fed DESC;
| banana | 2007-01-15 14:48:34 |
| pear | 2007-01-15 14:48:31 |
| peach | 2007-01-15 14:48:28 |
| cherries | 2007-01-15 14:48:28 |
| orange | 2007-01-15 14:48:25 |
edit:
One edge case to be aware of -- reaching the limit of the id field.
If your application does 100 of these per second, an UNSIGNED INT will last 1.36 years -- not so great. You should use an UNSIGNED INT because you're never going to have a negative number, and a signed int would only last just over 8 months if there were 100 REPLACEs per second.
(60 seconds/min, 60 min/hr, 24 hrs/day, 365 days/yr)
4294967295/60/60/24/365/100=1.3619251950
So, for scaling/known high traffic, use a BIGINT. However, in this case, do NOT use UNSIGNED, as all MySQL arithmetic is done with signed bigints or doubles. Not that it matters in this case; at 100 REPLACEs per second, you will wrap at 2.9 billion years:
9223372036854775807/60/60/24/365/100=2,924,712,086.7753601074
Let's say your system does 10,000 of these REPLACEs per second, for eternity (our main database system, where we're about to use this, average 6,000 qps, not all writes, but it's a good figure to use for our own numbers) -- move the decimal places a few spots over and you're down to running out of numbers in 29 million years.
That's an OK limit for us. :)
One TRIGGER Fact, Optimizer Rewriting Stuff
| BEFORE AFTER |
INSERT UPDATE DELETE |
That was brought up at the User Group meeting.
In other news, I have been reading the MySQL Internals Manual. The most interesting chapter yet is Chapter 3, the optimizer. This can be read out of order, and actually contains lots of information that folks can use. For instance:
WHERE col1=col2 AND col2='x'
transforms to
WHERE col1='x' AND col2='x'
because constant matching is faster.
Tautologies and always-false statements are usually optimized out of the query so that something like WHERE 0=0 does not waste time by actually being tested.
A table with 0 or 1 rows is a constant value. Another constant table scenario is WHERE col1=[constant] for all the columns of any UNIQUE or PRIMARY KEYS, assuming they are defined as NOT NULL. This is a good reason to use NOT NULL for column definitions unless you really need NULL.
A constant table is what is used by the optimizer when you see "const" as the "join type" in the result of an EXPLAIN.
WHERE col1 in (1,2,3) is the exact same expression as WHERE col1=1 OR col1=2 OR col1=3
WHERE col1 BETWEEN 1 AND 3 is the exact same expression as WHERE col1>=1 AND col1< =3
The latter 2 examples are good to know, because the first of the two statements in each case are more clear.
Semi-Dynamic Data
http://www.artfulsoftware.com has a "Common Queries" page, which I find rather useful. I hadn't realized its history, as described on the homepage:
Our collection of common MySQL queries outgrew Chapter 9, and is still growing, so we turned it into a PHP page driven from a MySQL table.
One day, I clicked on the page and got the dreaded "blank" PHP page.
This gets into one of the fundamental flaws I find with "semi-dynamic data" (my terminology for it) -- it is not completely dynamic data, because it gets updated by humans, and it is deterministic*, so it does not need a completely dynamic page.
Part of the updating process could be a "generate the web page" script, that runs what the actual page is now, but stores the result as an HTML page. In this way, if 1000 users want the same page, there are *no* database queries done. After all, it only needs to change when content is uploaded, which isn't very often at all.
The "generation" script could easily be a part of a web form that uploads content, or it could be a separate form/script run after a batch of changes is done, so multiple changes do not require generating pages that will just be written over after the next update in a minute or so. As well, it could write to a temporary file, and the very last script action would move the temporary file to the right place. In this way, a generation script that takes a long time to finish would not be partially overwritten by another, simultaneous generation script.
I have used this technique in content management systems -- particularly with templates, as I've found you can separate different content items (such as menus) and "break apart" a template into pieces, and with about an hour you can support a new template into an existing system, and have a user compare templates to see which they'd rather use, given their own content.
I have also used this technique with a listing of organizations around the world. All the listings (3,000) were stored in a database. From this, I ran a weekly (but it would be easy to run it hourly or daily) script that made "browse by" pages, categorizing all of the resources by first letter of their name as well as by their area, province/state and country. The script, which took a full 10 minutes due to poor optimization, made an overall browsing page, 26 "by letter" pages, one page for each country, and a directory with one page for each state/province and area for each country. It also generated the page and compared it to the existing page, and only overwrote the page when they differed (and then put a "last updated on:" message at the end).
Folks searching could still get truly dynamic pages, but I cut down on needless direct database calls to find out which organizations were in "England", and more needless database calls to find out which organizations were in the "Greater London" area, when those pages changes rather rarely.
This could also be useful for large sites, such as photo sharing galleries. Sites that allow comments may or may not see a performance gain -- for each new comment, generating the page again may not be the best solution. However, if there's at least one page hit for each database write, then using this method will have better performance.
* an example of a nondeterministic page is one that changes based on the time, such as "show me all activity that has happened today [up until this moment]" because it may change from second to second.
Selectivity and Index Performance
Let's look into selectivity, as this is an important topics when looking at index performance. (Oooh, I said "performance", watch everyone's ears perk up!).
This will probably answer the questions "Why isn't MySQL using my index?" or "Why is my query so slow when I have an index on it?"
Selectivity describes how different values of a field are. It is a number from 0-1, although you can also think of it as a percentage. A value of 1, or 100%, means that each value in the field is unique. This happens with UNIQUE and PRIMARY keys, although non-unique fields may have a selectivity of 1 -- for example, a timestamp value in a not-often-used table.
To calculate this, you take the total number of DISTINCT records and divide by the total number of records.
My company has a large Users table, so I grabbed some statistics off of that:
+----------+
| count(*) |
+----------+
| 817666 |
+----------+
1 row in set (0.63 sec)
+--------------------------+
| count(distinct username) |
+--------------------------+
| 817666 |
+--------------------------+
1 row in set (1.63 sec)
So the selectivity is 81766/81766, or 1. If this were not a UNIQUE KEY already, it's a good candidate for one.
the "created" field is a timestamp for when the user record was created
+-------------------------+
| count(distinct created) |
+-------------------------+
| 811227 |
+-------------------------+
1 row in set (2.04 sec)
As I expect, there are *some* duplicates, but for the most part, everyone has a different creation time. 811227/817666 = 0.99.
+--------------------+
| count(distinct IP) |
+--------------------+
| 544694 |
+--------------------+
1 row in set (2.35 sec)
This is interesting -- lots of people logon from public places, or use their friends' computers, so there are duplicate IP's (the last IP used to login is associated with a user, so it's a 1-to-1 relationship). The selectivity here is 0.67.
+-------------------------+
| count(distinct browser) |
+-------------------------+
| 25699 |
+-------------------------+
1 row in set (1.70 sec)
This is what the server reports the user's browser is. It records the last browser used by the user. This gives us about a 0.03 for selectivity.
+---------------------+
| count(distinct age) |
+---------------------+
| 83 |
+---------------------+
1 row in set (0.63 sec)
There are only 83 different reported ages on our site. That makes the selectivity of age 0.000101508. That is very low, effectively zero.
So why is this important? I'm glad you asked....
MySQL has a cost-based optimizer. This means that MySQL calculates the costs of different ways of performing a query and then chooses the cheapest one. Sounds reasonable, right? Well, calculating the costs is an inexact science. In order to calculate the exact cost, the optimizer would actually have to run the query. So an estimate is taken, and the estimate is wrong sometimes. Most of the time the estimate is correct.
In contrast, some database systems allow a rule-based optimizer. This means that no matter what the data state, the database uses rules to figure out the "optimal" path to the query. In most enterprise-level database systems, a cost-based optimizer performs better than a rule-based optimizer. In other words, there are so many exceptions to the rules that the calculation overhead is worth it.
(Just to clarify, in both systems, the correct result set will be generated. The optimizer determines the path to the information.)
This cost-based optimizer uses selectivity information when it decides whether or not to use an index.
But what does this mean for me?
Well, in the example above, this means if you want to query folks by age or age group, it's useless to put an index on it. It's a waste of cpu time and disk I/O to have an index for something with such a low selectivity. The optimizer will NEVER use it.
I've heard that the optimizer will do a full table scan if it calculates it will return more than 30% of the table. Why is that number so low, why not more like 50 or 75%? Well, first the server has to go to the index, search the index, and find if the index record matches. Then it needs to follow the index record's pointer to the real record on disk. And the MySQL gurus have decided that around 30% is the place where using the index is slower than just doing a full table scan.
(Note: I'm not exactly sure what the # is but I've heard it's around 30%. As well, at the user conference I saw graphs that showed that for the most part this was true. I thought it was in Jay Pipes' Performance Tuning presentation, but the graphs are not in the slides. Pointers are appreciated.)
So in this case, should we put an index on browser? Well, this is one of those cases where I'd think about how often we'd be doing queries and how much we care about server performance doing a report versus server performance while inserting or updating a record. If we really care one way or another, go that way. And document!
Another thing to consider is the nature of the data. For something like age, that's not going to change. Sure, we might have some 120 year olds eventually, but there's not going to be that much variance. For browsers, there will only be more and more types put out, considering different version numbers and OS configurations of standard browsers as well as mobile phone browsers.
However, if it does not matter, or if it's too difficult to decide which is more important (your boss says "we can't be slow during normal usage OR during reporting!") I default to MySQL -- it's better at optimization than I am. I'd probably put an index so MySQL could decide whether to do a full table scan or use an index, particularly given that I expect the number of browsers to keep increasing.
For something like IP, that has a selectivity of 0.67, so putting an index on it is worth it if we query on IPs a lot.
I hope this article has been helpful!
Jay Pipes has a great article on using the INFORMATION_SCHEMA to find out selectivity:
http://tinyurl.com/kfffp
The Care and Feeding of MySQL Tables
Our site went from weekly crashes during our two busiest nights to not even peeping this week (during the two busiest nights), and the only thing we changed was that we did some table maintenance. We hadn't done table maintenance at least as long as I've been around, which is 6 months. We are a site with high volumes of both reads and writes. This article will talk about the care and feeding of tables; feel free to use this for justification to have a maintenance window, or even permission to run table maintenance statements.
MySQL uses a cost-based optimizer to best translate the written query into what actually happens. This means when you write:
SELECT foo FROM t1 INNER JOIN t2 USING (commonField);
The optimizer looks at the statistics for tables t1 and t2 and decides which is better:
1) To go through each item in t1, looking for a matching "commonField" in t2
or
2) To go through each item in t2, looking for a matching "commonField" t1
If t1 is very large and t2 is very small, it makes sense to follow plan #2. This is a simplified example, of course.
From the documentation:
ANALYZE TABLE analyzes and stores the key distribution for a table. During the analysis, the table is locked with a read lock. This statement works with MyISAM, BDB, and InnoDB tables.
If the key distribution is off, the optimizer will be using incorrect (out-of-date) information. Therefore, the optimizations it makes will not be...well, optimal.
ANALYZE TABLE takes a very short amount of time -- less than a second for even a million rows. I tested with InnoDB and MyISAM, but I'd guess that BDB is the same. Our database of 14G took less than a minute to analyze all 112 tables in 3 datases.
Documentation: http://dev.mysql.com/doc/refman/5.0/en/analyze-table.html
CHECK TABLE check tables and views for incorrectly closed tables, incorrect or deleted links, and verified checksums for the rows. It can also check for full consistency.
This takes a bit -- checking our tables for everything (but not checking for full consistency, as it takes longer) took 11 minutes (14G, 112 tables in 3 databases). Next month I will run a CHECK EXTENDED and see how long this takes.
Documentation: http://dev.mysql.com/doc/refman/5.0/en/check-table.html
And the daddy of them all:
OPTIMIZE TABLE can be used on MyISAM, BDB and InnoDB tables. In MyISAM tables, it repairs deleted or split rows, updates index statistics, and sorts the index pages. For InnoDB and BDB, OPTIMIZE TABLE maps to ALTER TABLE and just rebuilds the index, thereby getting rid of defragmentation, corruption and incorrect statistics.
Documentation:
http://dev.mysql.com/doc/refman/5.0/en/optimize-table.html
This took 36 minutes on our (14G, 112 tables in 3 databases) server.
From the documentation:
In most setups, you need not run OPTIMIZE TABLE at all. Even if you do a lot of updates to variable-length rows, it is not likely that you need to do this more than once a week or month and only on certain tables.
This morning, we backed up our data. Then we ran ANALYZE TABLE on all of our tables first. Sure, OPTIMIZE TABLE performs the same function, but OPTIMIZE TABLE takes a long time, and we wanted the benefit of ANALYZE TABLE right away. Plus, if anything failed, at least the table's index statistics are up-to-date. Then we ran CHECK TABLE, and then OPTIMIZE TABLE on each table. We did this while running live, and as I said, the entire process took less than an hour.
Actually, it took 40 minutes, because once the script I had running to CHECK TABLE was halfway through, I started the OPTIMIZE TABLE script. I specifically set the tables to run through in alphabetical order, so there was no chance of the scripts trying to run on the same table. I will not do that again, I will just run them serially for safety's sake.
One thing MERGE Tables Are Good For
Many people have some kind of reporting or auditing on their database. The problem is that the data grows very large, and lots of times there is data that can be purged. Sure, theoretically one never needs to purge data, but sometimes a "delete" flag just won't work -- when you search on the delete flag, a full table scan may be the most efficient way to go.
Of course, that's not acceptable. And in many cases, say when you have users who no longer use the site but did in the past (and perhaps have billing data associated with them), you never want to get rid of them.
So what to do? Make a special reporting database, that gathers information from the production database(s). Use MyISAM tables, because a reporting server can afford to be behind the master, and MyISAM is better for reporting -- better metadata. For something like a "Users" table, make 2 more tables:
1) DeletedUsers
2) AllUsers
Where Deleted Users is where you put information about the users you delete (something like INSERT INTO DeletedUsers SELECT * FROM Users WHERE [parameters of deletion] and then run DELETE FROM Users WHERE [parameters of deletion] on the master. On the reporting slave, make a MERGE table called "AllUsers" and run your reports from that, when you might need to gather historical table.
Thoughts?
Working Smarter, Not Harder (SET and ENUM)
So, in a previous post I talked about smart code. Today I put myself square into a discussion about ENUM and SET. ENUM is an enumerated list of values; similar to a pull-down menu, where the only values allowed in that field are the ones defined, with the option of also having a null column.
The ENUM field takes the idea of normalizing the data and eliminates the need for a join on that data. It also makes data integrity easy -- if the value you're trying to enter is not in the ENUM column definition, MySQL throws an error.
ENUM is not a standard SQL data type. It is MySQL specific.
As an example, in the real world I run a database for an international not-for-profit. Whenever a donation comes in, it is associated with some form of solicitation -- either "October 2005 mailing" or "website donation" or "2005 Los Angeles House Party", etc.
To normalize this data in a pure relational database, we'd have to have a separate table, with 2 fields -- an ID, and the name of the solitication. We would add to this table whenever we have an event, or do a mailing. In the "donation" table, we would reference the solicitation via the solicitation ID, and set up referential integrity so that an ID that does not exist could not be entered into the table.
Instead, we take advantage of MySQL's ENUM column type. I have a script that will alter the table when we need to add a new solicitation, so even non-technical folk can do it (they go to a web page, add the name of the field, and click "submit"). Pull-down menus are also not difficult with a similar script to get all the possible values in the ENUM column.
The reason ENUM is a great field is that it stores all the row values in 2 bytes; there is a limit of 65,535 elements in the ENUM column. This uses a lot less space than the hack of just having a char or varchar field, which is what folks usually do when they do not want to add a join to a query by having another table.
Only one value at a time can be stored in an ENUM field; if you want to store more than one value in a column (which makes your database not relational, by the way), you can use the SET datatype.
One example of where I use the SET datatype in the not-for-profit database is for a person's role in the organization. For instance, a person can be a donor and a volunteer; or a major donor and have grantwriting skills and be a staff member. Or they can be none of these.
Because of the combinations that can be made, SET is limited to 64 values. Each value is stored in a maximum of 8 bytes -- not impressive for one value, but rather small when it could be holding up to 64 values. This is more than enough for an application that needs "roles within a not-for-profit organization," but not necessarily enough for an application that needs "albums in a photo database."
Most folks will either normalize their data and have a separate table, or make their own SET field by using a text field and matching. The latter has the benefit of utilizing the relevance of a FULLTEXT search, however, the tables are likely larger because they need to store all the text. And there is no data integrity; values can be spelled wrong.
It would be great if there were a LARGE SET value; although it may well be that the computation of joining is faster than querying a table with a large field (if LARGE SET had a limit of 128 items, instead of SET's 64, it would be stored in 16 bytes, which is pretty large if most rows only have 1 or 2 values).
I'd be interested to know where folks are using SET and ENUM, and where they are deliberately not using them.
Smart code
So, the other day I was asked by a developer to come up with a table to tally votes. Basically, there are 6 choices, and customers can vote once per day. I asked if there were any other constraints, and there were none. I specifically asked if they wanted once per calendar date, or 'it has to be 24 hours since the last vote'; they wanted 'once per calendar date'. And updating the current day's vote is not allowed. Once the vote is in, it cannot be changed.
So I came up with a simple table:
CREATE TABLE `ManOfMonth` (
`uid` int(10) unsigned NOT NULL default '0',
`voteDate` date NOT NULL default '0000-00-00',
`uidVoteFor` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`uid`,`voteDate`),
KEY `countVotes` (`uidVoteFor`)
)
There's no need for a timestamp, and you can select now() into the voteDate part, and MySQL truncates the data for you (we're using 4.1.12).
The smart part of this is that the primary key is the constraint on the database -- that is, a userid cannot vote more than once per day. So I told the developer, "When you're writing the code, don't bother running a select to see if they've already voted; just do the insert query, and check if there ends up being an error for a duplicate key."
Using the MySQL error messages in this way prevents unnecessary selects. This kind of coding should be common practice; I have a feeling it is not, particularly since I had to explain to the developer why the select was unnecessary. Then again, the developer also had wanted a table with uidVoteFor and the tally as 2 columns. This would be a smaller table, sure, and getting the result would be easy, but then there's still the accounting for whether or not the person had voted that day. The index on uidVoteFor helps make the tally query faster.

The She-BA