MySQL UTF-8 charsets and collations explained
Which MySQL UTF-8 character set and collation should you choose for your database or table? When you run SHOW COLLATION in MySQL or MariaDB, you will see a long list of available collations, such as:
utf8_general_ci
utf8_general_mysql500_ci
utf8_unicode_ci
utf8_unicode_520_ci
utf8_bin
utf8mb4_general_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci
That is confusing. This article will attempt to provide the most important answers so that you will be able to make an informed and rational decision on which MySQL charset and collation you should use for your particular use cases.
What are the differences between the utf8 and utf8mb4 character sets?
utf8 uses at most three bytes per character, while utf8mb4 uses at most four bytes per character. The utf8 charset can store any character in the Basic Multilingual Plane (BMP), which includes the Chinese, Japanese, and Korean characters in common use, but it may still not be able to store all the characters that you want. For example, with a utf8 charset, it is not possible to insert the Unicode character 'MUSICAL SYMBOL G CLEF' (U+1D11E) 𝄞, but this is possible with utf8mb4 charsets. (A BMP character such as 'SNOWMAN' (U+2603) ☃ needs only three bytes and can be stored in both.) More importantly, emojis (such as 'ROBOT FACE' (U+1F916) 🤖) are outside the Basic Multilingual Plane, so to store them, you will need to use a utf8mb4 charset.
Choosing utf8 vs. utf8mb4 will depend on what characters you want to store, but if in doubt, use a utf8mb4 charset because it is able to store all the characters in the utf8 charset and more.
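If you are unsure whether your data needs utf8mb4, you can check it before it reaches the database. The following is a minimal Python sketch, not part of MySQL itself: the function names needs_utf8mb4 and utf8_byte_lengths are made up for this illustration. It relies on the fact that any code point above U+FFFF lies outside the Basic Multilingual Plane and takes four bytes in UTF-8.

```python
def needs_utf8mb4(text):
    """Return True if any character in text needs four bytes in UTF-8."""
    # Code points above U+FFFF are outside the Basic Multilingual Plane,
    # so a three-byte utf8 column cannot hold them.
    return any(ord(ch) > 0xFFFF for ch in text)


def utf8_byte_lengths(text):
    """Map each distinct character to its encoded length in UTF-8 bytes."""
    return {ch: len(ch.encode("utf-8")) for ch in text}


# Hypothetical sample values, chosen only for illustration.
for sample in ["hello", "中文", "robot 🤖"]:
    print(repr(sample), needs_utf8mb4(sample), utf8_byte_lengths(sample))
```

A check like this can help when auditing existing data before a migration, but it does not replace choosing utf8mb4 when in doubt.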
What are the differences between the utf8_general_ci, utf8_unicode_ci, and utf8_bin collations?
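The difference between them is how they compare characters. The binary (bin) collation compares the binary values of characters, so it is both case sensitive and accent sensitive. The unicode collation follows the Unicode Collation Algorithm for comparisons. The general collation is faster but less correct than the unicode collation, because it compares one character at a time and does not support expansions (one character that should compare equal to a sequence of several). For example, 'ß' compares equal to 's' under utf8_general_ci but equal to 'ss' under utf8_unicode_ci. If you need this kind of correctness when comparing and sorting, use the unicode collation. Otherwise, you may be satisfied with the general collation.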
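To get a feel for how a binary comparison differs from a case-insensitive (ci) one, here is a rough Python analogy. It is not MySQL's actual algorithm: real collations use weight tables, while this sketch strips accents through Unicode normalization and then folds case. The function names binary_equal, fold, and insensitive_equal are made up for this illustration.

```python
import unicodedata


def binary_equal(a, b):
    """Compare the raw UTF-8 bytes, in the spirit of a bin collation."""
    return a.encode("utf-8") == b.encode("utf-8")


def fold(text):
    """Remove accents and case differences (an approximation only)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def insensitive_equal(a, b):
    """Compare after folding, in the spirit of a ci collation."""
    return fold(a) == fold(b)


print(binary_equal("a", "A"))       # False: the bytes differ
print(insensitive_equal("a", "A"))  # True: case is ignored
print(insensitive_equal("C", "Ç"))  # True: the accent is ignored
```

The sketch only illustrates the idea of comparing bytes versus comparing folded text. It says nothing about how the general and unicode collations differ from each other.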
What are the differences between the unicode and unicode_520 collations?
The unicode collation follows an older version of the Unicode Collation Algorithm (UCA 4.0.0), while the unicode_520 collation follows a newer (but arguably still outdated) version (UCA 5.2.0).
What is the utf8_general_mysql500_ci collation?
The utf8_general_mysql500_ci collation was introduced for backward compatibility with old tables that used utf8_general_ci as it behaved before MySQL 5.1.24. There is no good reason to use it in new tables.
Conclusion
Even knowing all this, it may still be difficult to choose a charset and a collation. My suggestion is that you should always use utf8mb4 charsets over utf8 charsets, and when in doubt, use utf8mb4_unicode_520_ci, as it lets you store the widest range of characters and, of the collations discussed here, sorts them most correctly.
If you feel that some of the points were unclear, please share what you think can be improved. If you need a more in-depth explanation of the differences between the charsets and the collations, do consult the MySQL or MariaDB documentation.