The UTF-8-Everywhere Manifesto

Text made easy once again

Purpose of this document

To promote usage and support of the UTF-8 encoding, and to convince the reader that it should be the default choice for storing text strings in memory or on disk, for communication, and for all other uses. We believe that all other encodings of Unicode (or of text in general) belong to rare edge cases of optimization and should be avoided by mainstream users. In particular, we believe that the very popular UTF-16 encoding (mistakenly used as a synonym for ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs, except in specialized libraries dealing with text. If, at this point, you already think we are crazy, please skip straight to the FAQ section.

This document recommends choosing UTF-8 for string storage in Windows applications, where this standard is less popular for historical reasons and because of the lack of native UTF-8 support in the API. Yet we believe that, even on this platform, the following arguments outweigh the lack of native support. We also recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.

We recommend avoiding C++ application code that depends on the _UNICODE define. This includes the TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to reach the goals of these APIs.

We further believe that if an application is not supposed to specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for beginners, but it lacks the most important part: how a programmer should proceed if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, such as the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).

However, it was soon discovered that 16 bits per character would not do. In 1996, the UTF-16 encoding was created so that existing systems would be able to work with characters beyond the 16-bit range. This effectively nullified the rationale behind choosing a 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109,449 characters, with about 74,500 of them being CJK ideographs.

Microsoft has ever since mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for the narrow-string WinAPI, code must be compiled with _UNICODE rather than _MBCS. This educates Windows programmers that Unicode must be done with ‘widechars’, and as a result of the mess, Windows C++ programmers are now among the most confused about the right way to handle text.

In the Linux and Web worlds, however, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on the planet Earth.
Even though it gives a strong preference to English, and therefore to computer languages (such as C++, HTML, XML, etc.), over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts
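One fact worth internalizing is that both encodings are variable-width: characters outside the Basic Multilingual Plane take two 16-bit code units (a surrogate pair) in UTF-16 and four bytes in UTF-8, while an ASCII character takes one byte in UTF-8 but two in UTF-16. A self-contained C++11 sketch (the character is an arbitrary example):

```cpp
#include <cstdio>

int main() {
    // U+1F600 (an emoji outside the BMP): four bytes in UTF-8, and
    // two 16-bit units in UTF-16 -- a surrogate pair, four bytes as well.
    const char     u8s[]  = u8"\U0001F600";
    const char16_t u16s[] = u"\U0001F600";

    // Subtract 1 for the terminating null in each array.
    std::printf("UTF-8 code units (bytes): %u\n",
                unsigned(sizeof u8s / sizeof u8s[0] - 1));   // prints 4
    std::printf("UTF-16 code units:        %u\n",
                unsigned(sizeof u16s / sizeof u16s[0] - 1)); // prints 2
}
```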
Our Conclusions

UTF-16 is the worst of both worlds: variable-width and too wide. It exists for historical reasons, adds a lot of confusion, and will hopefully die out.

Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and to convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs. There is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.

Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless, such communication is nearly always done in English, giving UTF-8 an advantage there. Using different encodings for different kinds of strings significantly increases complexity and the number of consequent bugs.

In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from implementations, though, is that narrow strings be capable of storing any Unicode data. Then every std::string or char* parameter would be Unicode-compatible. ‘If it accepts text, it should be Unicode-compatible’, and with UTF-8 this is also easy to do.

The standard facets have many design flaws. These include std::numpunct, std::moneypunct and std::ctype not supporting variable-width encodings (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed.
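To see why per-code-unit classification cannot work, consider byte-wise uppercasing of a UTF-8 string, which is essentially all that a per-char facet can express; a minimal sketch (the string is an arbitrary example):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

int main() {
    // "naïve" in UTF-8: 'ï' (U+00EF) is the two-byte sequence 0xC3 0xAF.
    std::string s = "na\xC3\xAFve";

    // Byte-at-a-time case conversion. In the "C" locale the two bytes of
    // 'ï' pass through untouched; under a single-byte locale they may be
    // remapped independently, yielding an invalid UTF-8 sequence. Either
    // way, correct Unicode case rules are out of reach.
    for (std::size_t i = 0; i < s.size(); ++i)
        s[i] = char(std::toupper(static_cast<unsigned char>(s[i])));

    std::printf("%s\n", s.c_str()); // prints "NAïVE": the ASCII letters
                                    // change, the non-ASCII one cannot
}
```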
How to do text on Windows

The following is what we recommend to everyone for compile-time checked Unicode correctness, ease of use and better cross-platform portability of the code. This substantially differs from what is usually recommended as the proper way of using Unicode on Windows; yet in-depth research into those recommendations led us to the same conclusion. So here goes:
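The essence of the approach, restating the conclusion above in code: keep std::string and char* in UTF-8 throughout the application, and convert only at the boundary when calling wide WinAPI functions. A minimal sketch; widen and narrow are hypothetical helper names (one possible implementation is given under ‘Conversion functions’ below):

```cpp
#include <string>
#include <windows.h>

// Hypothetical conversion helpers, UTF-8 <-> UTF-16; see the
// 'Conversion functions' section for one possible implementation.
std::wstring widen(const std::string& utf8);
std::string  narrow(const std::wstring& utf16);

void show_greeting(const std::string& text_utf8) {
    // The string stays UTF-8 inside the application; UTF-16 appears
    // only here, at the call into the wide API.
    ::MessageBoxW(nullptr, widen(text_utf8).c_str(),
                  widen("Greeting").c_str(), MB_OK);
}
```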
Working with files, filenames and fstreams on Windows
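The same boundary-conversion pattern handles file names. A sketch of a UTF-8-aware fopen (fopen_utf8 is an illustrative name; the booster::nowide library described in the next section ships ready-made wrappers of this kind):

```cpp
#include <cstdio>
#include <string>
#include <windows.h>

std::wstring widen(const std::string& utf8); // hypothetical helper, as above

// std::fopen on Windows interprets the file name in the current ANSI
// codepage, so a UTF-8 name must be converted and passed to the wide
// variant instead.
std::FILE* fopen_utf8(const std::string& name_utf8, const char* mode) {
    return ::_wfopen(widen(name_utf8).c_str(), widen(mode).c_str());
}
```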
Conversion functions

The policy uses the conversion functions from the CppCMS booster::nowide library, which can be downloaded as a separate package. The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files. These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.
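One possible implementation along those lines (a sketch only: error handling is reduced to returning an empty string, whereas a real library such as nowide reports failures properly):

```cpp
#include <string>
#include <windows.h>

std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call computes the required length, second call converts.
    int n = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  int(utf8.size()), nullptr, 0);
    if (n <= 0) return std::wstring(); // invalid input
    std::wstring out(n, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(), int(utf8.size()),
                          &out[0], n);
    return out;
}

std::string narrow(const std::wstring& utf16) {
    if (utf16.empty()) return std::string();
    int n = ::WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                  int(utf16.size()), nullptr, 0,
                                  nullptr, nullptr);
    if (n <= 0) return std::string(); // invalid input
    std::string out(n, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, utf16.data(), int(utf16.size()),
                          &out[0], n, nullptr, nullptr);
    return out;
}
```

With these two functions, the earlier sketches in this document link and run unchanged.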
FAQ

Myths
About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as the result of much experience and research into real-world Unicode issues and the mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes to make Unicode-aware programming easier, ultimately improving the experience of the users of programs written by human engineers. None of us is involved in the Unicode consortium.

Much of the text is inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale, where you can leave comments and feedback.