This Place is Taken: The UTF-8-Everywhere Manifesto

Sunday, May 6, 2012

The UTF-8-Everywhere Manifesto

Text made easy once again

Purpose of this document

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.
To promote usage and support of the UTF-8 encoding, and to convince the reader that it should be the default choice for encoding text strings in memory or on disk, for communication, and for all other uses. We believe that all other encodings of Unicode (or of text in general) belong to rare edge cases of optimization and should be avoided by mainstream users.
In particular, we believe the very popular UTF-16 encoding (mistakenly used as a synonym for ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs (except for specialized libraries dealing with text). If, at this point, you already think we are crazy, please skip straight to the FAQ section.
This document recommends choosing UTF-8 for string storage in Windows applications, where this standard is less popular for historical reasons and because of the lack of native UTF-8 support in the API. Yet we believe that, even on this platform, the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.
We recommend avoiding C++ application code that depends on the _UNICODE define. This includes the TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to reach the goals of these APIs.
We also believe that if an application does not specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for beginners, but it lacks the most important part: how a programmer should proceed if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
However, it was soon discovered that 16 bits per character would not do. In 1996, the UTF-16 encoding was created so that existing systems would be able to work with characters beyond the 16-bit range. This effectively nullified the rationale behind choosing a 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109,449 characters, with about 74,500 of them being CJK ideographs.
Microsoft has ever since mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for the narrow-string WinAPI, one must compile code with _UNICODE rather than _MBCS. This teaches Windows programmers that Unicode must be done with ‘widechars’. As a result of this mess, Windows C++ programmers are now among the most confused about the right thing to do about text.
In the Linux and Web worlds, however, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on planet Earth. Even though it gives a strong preference to English, and therefore to computer languages (such as C++, HTML, XML, etc.), over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

  • In both UTF-8 and UTF-16 encodings, characters may take up to 4 bytes (contrary to what Joel says).
  • UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for different byte orders, respectively). Here we name them collectively as UTF-16.
  • Widechar is 2 bytes in size on some platforms, 4 on others.
  • UTF-8 and UTF-32 strings yield the same order when sorted lexicographically. UTF-16 strings do not.
  • UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
  • In the Linux world, narrow strings are considered UTF-8 by default almost everywhere. This way, for example, a file copy utility would not need to care about encodings. Once tested on ASCII strings for file name arguments, it would certainly work correctly for arguments in any language, as arguments are treated as cookies. The code of the file copy utility would not need to change a bit to support foreign languages. fopen() would accept Unicode seamlessly, and so would argv.
  • On Microsoft Windows, however, making a file copy utility that can accept file names in a mix of several different Unicode blocks requires advanced trickery. First, the application must be compiled as Unicode-aware. In this case, it cannot have a main() function with standard-C parameters. It will then receive a UTF-16 encoded argv. To convert a Windows program written with narrow text in mind to support Unicode, one has to refactor deeply and keep track of every string variable.
  • On Windows, the SetCodePage() API enables receiving non-ASCII characters, but only from one ANSI codepage at a time. An unimplemented parameter value, CP_UTF8, would enable doing the above on Windows.

  • The standard library shipped with MSVC is poorly implemented. It forwards narrow-string parameters directly to the OS ANSI API. There’s no way to override this. Changing std::locale doesn’t work. It’s impossible to open a file with a Unicode name on MSVC using standard features of C++. The standard way to open a file is:
    std::fstream fout("abc.txt");
    
    The proper way around this is Microsoft’s own non-standard STL extension that accepts a wide-string parameter.
  • There is no way to return Unicode from std::exception::what() but to use UTF-8.
  • UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves: in the plain Windows edit control (until Vista), it takes two backspaces to delete a character that takes 4 bytes in UTF-16. On Windows 7, the console displays such a character as two invalid characters, regardless of the font used.
  • Many 3rd-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API, sometimes even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done on Windows for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. This is not possible if the library is supposed to create a non-existing file, if the path is so long that its 8.3 form exceeds MAX_PATH, or if short-name generation is disabled.
  • UTF-16 is very popular today, even outside the Windows world. Qt, Java, C#, Python and ICU all use UTF-16 for internal string representation.

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.
Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs. There is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.
Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.
In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from implementations, though, is that narrow strings be capable of storing any Unicode data. Then every std::string or char* parameter would be Unicode-compatible. ‘If it accepts text, it should be Unicode-compatible’, and with UTF-8 this is also easy to achieve.
The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:
  • decimal_point() and thousands_sep() should return a string rather than a single code unit. (By the way, C locales do support this, albeit without customization.)
  • toupper() and tolower() shall not be phrased in terms of code units, as that does not work in Unicode. For example, ß shall be converted to SS and the ligature ﬄ to FFL.

How to do text on Windows

The following is what we recommend to everyone for compile-time-checked Unicode correctness, ease of use and better multi-platformness of code. This substantially differs from what is usually recommended as the proper way of using Unicode on Windows, yet our in-depth research led to the same conclusion. So here goes:
  • Do not use wchar_t or std::wstring anywhere other than at points adjacent to APIs accepting UTF-16.
  • Don’t use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
  • Don’t use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, keep _UNICODE always defined, so that passing narrow strings to WinAPI does not compile silently.
  • std::string and char* anywhere in the program are considered UTF-8 (unless said otherwise).

  • Only use Win32 functions that accept widechars (LPWSTR). Never those which accept LPTSTR or LPSTR. Pass parameters this way:
    ::SetWindowTextW(convert(someStdString or "string literal").c_str())
    
    (The policy uses conversion functions described below.)

  • With MFC strings:
    CString someoneElse; // something that arrived from MFC.
    
    // Converted as soon as possible, before passing any further away from the API call:
    std::string s = str(boost::format("Hello %s\n") % convert(someoneElse));
    ::MessageBoxW(NULL, convert(s).c_str(), L"Error", MB_OK);
    

Working with files, filenames and fstreams on Windows

  • Never produce text output files with non-UTF-8 content.
  • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.
  • Never pass std::string or const char* filename arguments to fstream family. MSVC CRT does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:

  • Convert std::string arguments to std::wstring with convert:
    std::ifstream ifs(convert("hello"), std::ios_base::binary);
    
    We’ll have to manually remove the convert, when MSVC’s attitude to fstream changes.
  • This code is not multi-platform and may have to be changed manually in the future.
  • Alternatively use a set of wrappers that hide the conversions.
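For illustration, such a wrapper might look like the sketch below. The name open_input and the convert prototype are hypothetical; on Windows the wide-name branch relies on the MSVC fstream extension mentioned above.

```cpp
#include <fstream>
#include <string>

#ifdef _WIN32
std::wstring convert(const std::string& utf8); // from the conversion library below
#endif

// Opens a file whose name is UTF-8 everywhere; hides the Windows-only widening.
std::ifstream open_input(const std::string& utf8_name,
                         std::ios_base::openmode mode = std::ios_base::in)
{
#ifdef _WIN32
    return std::ifstream(convert(utf8_name), mode); // MSVC wide-name extension
#else
    return std::ifstream(utf8_name, mode);          // narrow names are UTF-8 already
#endif
}
```

Similar wrappers can be written for std::ofstream, fopen() and the rest of the file APIs.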

Conversion functions

The policy uses the conversion functions from the CppCMS booster::nowide library, which can be downloaded as a separate package:
std::string convert(const wchar_t *s);
std::wstring convert(const char *s);
std::string convert(const std::wstring &s);
std::wstring convert(const std::string &s);
The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files.
These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.

FAQ


  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am a Windows fan. I believe they made the wrong choice in the text domain because they made it earlier than others.

  2. Q: Are you an anglo-saxon? Do you secretly think English alphabet and culture are superior to any other?

    A: No, and my country’s language is non-ASCII. I do not think that using an encoding which spends a single byte on ASCII characters is anglo-centrism, or has anything to do with human interaction. Even though one can argue that the source code of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers.

  3. Q: Why do you guys care? I use C# and don’t have to know anything about encodings.

    A: You must still care when writing your text to files or communication channels. And the fact that you don’t usually have to care about the internally stored encoding is an achievement, one which C++ programmers deserve to enjoy where possible.

  4. Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?

    A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts: for some it means ‘ANSI codepage’; for others, ‘this code is broken and does not support non-English text’; in our programs, it means a Unicode-aware UTF-8 string. This diversity is a source of many bugs and much misery: the additional complexity is something the world does not really need, and the result is much Unicode-broken software industry-wide.

  5. Q: Why do you turn on the _UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?

    A: It is a precaution against plugging a UTF-8 char* string into an ANSI-expecting API; we want that to generate a compiler error. It is the same kind of hard-to-find bug as passing an argv[] string to fopen() on Windows: that assumes the user will never pass filenames outside the current codepage. You are unlikely to find this bug in your QA, and it is broken program behavior. Thanks to the _UNICODE define, you get an error for it.

  6. Q: Isn’t it quite naïve to think that Microsoft will stop using widechars one day?

    A: Let’s first see when they start supporting CP_UTF8. Then, I see no reason why anybody would continue using widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some of existing unicode-broken programs and libraries.

  7. Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% the memory per character?

    A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate any other. This includes XML, HTTP, filesystem paths and configuration files—all use almost only ASCII characters, and in fact UTF-8 is used just as often in those countries.
    For a dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. Anyway, if storage is at premium, a lossless compression will be used. In such case UTF-8 and UTF-16 will take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).
    Here are the results of a simple experiment. The space used by HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012–01–01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.
                        HTML source (Δ UTF-8)    Dense text (Δ UTF-8)
    UTF-8               767 KB   (0%)            222 KB  (0%)
    UTF-16              1 186 KB (+55%)          176 KB  (−21%)
    UTF-8 zipped        179 KB   (−77%)          83 KB   (−63%)
    UTF-16LE zipped     192 KB   (−75%)          76 KB   (−66%)
    UTF-16BE zipped     194 KB   (−75%)          77 KB   (−65%)
    As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, saves just 20% for dense Asian text, and hardly competes with general purpose compression algorithms.

  8. Q: What do you think about BOMs?

    A: They are another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding; it merely manifests that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files omit the BOM today.

  9. Q: What do you think about line endings?

    A: All files shall be read and written in binary mode, since this guarantees interoperability: a program will always give the same output on any system. Since the C and C++ standards use \n as the in-memory line ending, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows; however, any decent text viewer understands such line endings. One such viewer, bundled with every Windows installation, is IE.

  10. Q: But what about performance of text processing algorithms, byte alignment, etc?

    A: Is it really better with UTF-16? Maybe so. ICU uses UTF-16 for historical reasons, so this is quite hard to measure. However, most of the time strings are treated as cookies, not sorted or reversed at every second use. A smaller encoding is then favorable.

  11. Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?

    A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.

  12. Q: Is it really a fault of UTF-16 that people misuse it, assuming it is 16 bits per character?

    A: Not really. But yes, safety is an important feature of every design.

  13. Q: If std::string means UTF-8, wouldn’t that confuse with code that stores plain text in std::strings?

    A: There’s no such thing as plain text. There is no reason for storing codepage-ANSI or ASCII-only text in a class named ‘string’.

  14. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way: either when calling the system, or when interacting with the rest of the world. Even if your interaction with the system is more frequent in your application, here is a little experiment.
    A typical use of the OS is to open files. This function executes in (184 ± 3)μs on my machine:
    void f(const wchar_t* name)
    {
        HANDLE f = CreateFile(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    
    While this runs in (186 ± 0.7)μs:
    void f(const char* name)
    {
        HANDLE f = CreateFile(convert(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    
    (Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. Averaged over 5 runs. Used an optimized convert that relies on std::string contiguous storage guarantee given by C++11.)
    This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.

  15. Q: How do I write UTF-8 string literal in my code?

    A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it's not a problem.
    If you still want to embed a special character you can do it as follows. In C++11 you can do it as:
    u8"∃y ∀x ¬(x ≺ y)"
    With compilers that don't support ‘u8’ you can hard-code the UTF-8 code units as follows:
    "\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"
    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:
    "∃y ∀x ¬(x ≺ y)"
    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without a BOM. MSVC will then assume it is in the correct codepage and will not touch your string. However, this makes it impossible to use Unicode identifiers and wide string literals (which you will not be using anyway).

  16. Q: I already use this approach and I want to make our vision come true. What can I do?

    A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.

Myths


  1. Indexing the nth character is O(n) in UTF-8. Not so for UTF-16/UTF-32.

    So it is in UTF-16, which is a variable-length encoding, and it is a ϴ(n) operation even in UTF-32, because characters are not code points. For example, both
    naïve (U+006E, U+0061, U+00EF, U+0076, U+0065)
    and
    naı̈ve (U+006E, U+0061, U+0131, U+0308, U+0076, U+0065)
    are 5-character words, while the former consists of five code points and the latter of six.
    Code points are meaningful only in the context of Unicode algorithms, which by their nature always scan the string sequentially. Looking up the nth character, on the other hand, has little meaning in Unicode (see below).

  2. Finding the length of a string is O(n) in UTF-8. Not so for UTF-16/UTF-32.

    The word ‘length’ is ambiguous, so the comparison as stated is meaningless.
    • Length in code units: always the size of the memory occupied by the string divided by the size of one code unit. This is the most frequently used operation when working with strings.
    • Length in code points: it’s ϴ(n) in both UTF-8 and UTF-16, and it’s indeed constant-time in UTF-32. It’s needed only when you convert to UTF-32, probably inside some Unicode algorithm.
    • Length in characters: it’s ϴ(n) in any encoding since one character may be encoded with multiple code points and some code points may not correspond to characters.
    What you usually do care about is…
    • The size of the string as it appears on the screen: you’ll need to communicate with the rendering engine for this, which has to scan the whole string no matter what encoding is used.

  3. UTF-16 is good for processing.

    So says ‘Unicode Technical Note #12, UTF-16 for Processing’. Unfortunately, the document fails to justify the points it makes. See the detailed analysis of this paper (link, to be written).

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as a result of much experience and research into real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes that make Unicode-aware programming easier, ultimately improving the experience of users of programs written by human engineers. None of us is involved in the Unicode consortium.

Much of the text is inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there.
