So, in a previous post I talked about smart code. Today I put myself square into a discussion about ENUM and SET. ENUM is an enumerated list of values; similar to a pull-down menu, where the only values allowed in that field are the ones defined, with the option of also having a null column.
The ENUM field takes the idea of normalizing the data and eliminates the need for a join on that data. It also makes data integrity easy -- if the value you're trying to enter is not in the ENUM column definition, MySQL throws an error.
ENUM is not a standard SQL data type. It is MySQL specific.
As an example, in the real world I run a database for an international not-for-profit. Whenever a donation comes in, it is associated with some form of solicitation -- either "October 2005 mailing" or "website donation" or "2005 Los Angeles House Party", etc.
To normalize this data in a pure relational database, we'd have to have a separate table, with 2 fields -- an ID, and the name of the solitication. We would add to this table whenever we have an event, or do a mailing. In the "donation" table, we would reference the solicitation via the solicitation ID, and set up referential integrity so that an ID that does not exist could not be entered into the table.
Instead, we take advantage of MySQL's ENUM column type. I have a script that will alter the table when we need to add a new solicitation, so even non-technical folk can do it (they go to a web page, add the name of the field, and click "submit"). Pull-down menus are also not difficult with a similar script to get all the possible values in the ENUM column.
The reason ENUM is a great field is that it stores all the row values in 2 bytes; there is a limit of 65,535 elements in the ENUM column. This uses a lot less space than the hack of just having a char or varchar field, which is what folks usually do when they do not want to add a join to a query by having another table.
Only one value at a time can be stored in an ENUM field; if you want to store more than one value in a column (which makes your database not relational, by the way), you can use the SET datatype.
One example of where I use the SET datatype in the not-for-profit database is for a person's role in the organization. For instance, a person can be a donor and a volunteer; or a major donor and have grantwriting skills and be a staff member. Or they can be none of these.
Because of the combinations that can be made, SET is limited to 64 values. Each value is stored in a maximum of 8 bytes -- not impressive for one value, but rather small when it could be holding up to 64 values. This is more than enough for an application that needs "roles within a not-for-profit organization," but not necessarily enough for an application that needs "albums in a photo database."
Most folks will either normalize their data and have a separate table, or make their own SET field by using a text field and matching. The latter has the benefit of utilizing the relevance of a FULLTEXT search, however, the tables are likely larger because they need to store all the text. And there is no data integrity; values can be spelled wrong.
It would be great if there were a LARGE SET value; although it may well be that the computation of joining is faster than querying a table with a large field (if LARGE SET had a limit of 128 items, instead of SET's 64, it would be stored in 16 bytes, which is pretty large if most rows only have 1 or 2 values).
I'd be interested to know where folks are using SET and ENUM, and where they are deliberately not using them.
You mentioned that you 'have
You mentioned that you 'have a script' to add new ENUM values. Is this a pure SQL statement, or are you extracting the current ENUM values and then spitting them back in to ALTER TABLE? I've been struggling to work out a pure SQL way to do this - it seems that one can only wipe out and replace all ENUM values, as opposed to just adding one to the exisiting list. Anyone?
Your remark about using ENUM
Your remark about using ENUM as integrity checking is incorrect (at least by default). Infact, here's an excerpt from the MySQL documentation.
The value may also be the empty string ("") or NULL under certain circumstances:
* If you insert an invalid value into an ENUM (that is, a string not present in the list of allowed values), the empty string is inserted instead as a special error value. This string can be distinguished from a 'normal' empty string by the fact that this string has the numerical value 0. More about this later.
* If an ENUM is declared NULL, NULL is also a legal value for the column, and the default value is NULL. If an ENUM is declared NOT NULL, the default value is the first element of the list of allowed values.
Here's an example to prove this:
mysql> show create table foo;
+-------+-------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+-------------------------------------------------------------------------------------------+
| foo | CREATE TABLE `foo` (
`some_enum` enum('one','two') NOT NULL default 'one'
) TYPE=MyISAM |
+-------+-------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select * from foo;
Empty set (0.00 sec)
mysql> insert into foo set some_enum='three';
Query OK, 1 row affected (0.00 sec)
mysql> select * from foo;
+-----------+
| some_enum |
+-----------+
| |
+-----------+
1 row in set (0.00 sec)
This works as intended and as reflected by the MySQL documentation.
It could be worse. We
It could be worse. We inherited some legacy (from back in 1999 when MySQL was first used in our system) databases that use ENUM. They were used correctly in that the values are fairly static, but the same ENUMS repeat in several tables. Rather than having normalized data that require a data change in one place, we have to locate all of the places that contain the ENUM and issue multiple ALTER statements to make a value change.
We've moved much of the structure off of the ENUMs, but not everything.
If it takes only a few
If it takes only a few seconds to do an ALTER TABLE on your table, the potential performance savings realized by avoiding a join in this case are certainly not enough to justify this abuse of ENUM. Use some indexes and create a new table. Then you'll be working Smarter, not Harder.
ALTER TABLE is a lot "harder" --on sensibility, maintenance, portability, and resources-- than an INSERT and a million joins.
What Kolbe says... you don't
What Kolbe says... you don't want to stick data into structure except for very specific (and limited) things. That's ENUM.
Regarding SET, I reckon it's always evil. There are functions like FIND_IN_SET() you can use in a WHERE clause, however it won't be able to use an index. It's very inefficient if your dataset grows. Handling SET columns is not really nice.
Essentially, joins are not as expensive as people often perceive, that 'denormalise for speed' thing is old dogma really. RDBMS were made to do joins! Normalisation, up to 3NF, is generally a good thing for your apps.
Great to know! I knew that
Great to know! I knew that that particular use was less than perfect, but I think it's usually obvious that static lists (like states of the US) can be done with Enum. I was going more for data that's *mostly* static, because with the right authentication, it can save a lot of querying.
You are absolutely right that the alter table makes the table not available; however, the alter table takes about 10 seconds, and though the database is production, it's used in spikes, so there's never been a complaint about that database being slow. For heavier usage sites, I'd agree.
I agree with most of what
I agree with most of what you say about ENUM and SET, but for the specific example you give us I strongly agree with Kolbe.
I use the ENUM type quite frequently, but only for static sets of possible values. For the SET type on the other hand almost all possible applications I found would have been better off with a normalized approach. You can still find two examples where I found a SET very useful in my blog post at http://www.futhark.ch/mysql/109.html
This is *not* a very good
This is *not* a very good use for ENUM. Having to alter table every time you add a new value is *not* a good idea. This requires a user with elevated privileges to be able to access the database on behalf of an unprivileged user, which is *not* a good idea.
Your example is a typical example of when ENUM should *not* be used. Doing ALTER table each time you need to add or change a solicitation-type makes the table unavailable for other use.
ENUM is suited for applications where the possible values rarely or never change. For example... gender, truth, smoking preference, etc. Building your application in such a way that tables need to regularly be altered is not fantastically extensible, as that can take inordinately long on large tables.
This seems an ideal case of when *not* to use ENUM.