This Place is Taken

Sunday, May 6, 2012

The UTF-8-Everywhere Manifesto



Text made easy once again

Purpose of this document

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.
The purpose of this document is to promote usage and support of the UTF-8 encoding, and to convince readers that it should be the default choice for storing text strings in memory or on disk, for communication, and for all other uses. We believe that all other encodings of Unicode (or of text in general) belong to rare edge cases of optimization and should be avoided by mainstream users.
In particular, we believe the very popular UTF-16 encoding (mistakenly used as a synonym for ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs, except in specialized libraries dealing with text. If, at this point, you already think we are crazy, please skip straight to the FAQ section.
This document recommends choosing UTF-8 for string storage in Windows applications, where this standard is less popular for historical reasons and because the API lacks native UTF-8 support. Yet, we believe that even on this platform the following arguments outweigh the lack of native support. We also recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.
We recommend avoiding C++ application code that depends on the _UNICODE define. This includes the TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to achieve what these APIs do.
We also believe that if an application does not specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. A file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for beginners, but it lacks the most important part: how a programmer should proceed if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. His design was based on the assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
However, it was soon discovered that 16 bits per character would not do. In 1996, the UTF-16 encoding was created so that existing systems could work with characters beyond 16 bits. This effectively nullified the rationale for choosing a 16-bit encoding in the first place, namely that it would be fixed-width. Currently Unicode spans 109,449 characters, with about 74,500 of them being CJK ideographs.
Microsoft has ever since mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for the narrow-string WinAPI, you must compile your code with _UNICODE rather than _MBCS. This teaches Windows programmers that Unicode must be done with ‘widechars’. As a result of the mess, Windows C++ programmers are now among the most confused about the right way to handle text.
In the Linux and Web worlds, however, there is a tacit agreement that UTF-8 is the most correct encoding for Unicode on planet Earth. Even though it gives a strong preference to English, and therefore to computer languages (such as C++, HTML, XML, etc.) over all other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

  • In both UTF-8 and UTF-16 encodings, characters may take up to 4 bytes (contrary to what Joel says).
  • UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for different byte orders, respectively). Here we name them collectively as UTF-16.
  • Widechar is 2 bytes in size on some platforms, 4 on others.
  • UTF-8 and UTF-32 yield the same order when sorted lexicographically: byte-wise comparison of UTF-8 matches code-point order. UTF-16 does not, because supplementary code points are encoded as surrogates, which sort below some BMP characters.
  • UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
  • In the Linux world, narrow strings are considered UTF-8 by default almost everywhere. This way, for example, a file copy utility does not need to care about encodings. Once tested on ASCII strings for file name arguments, it will work correctly for arguments in any language, as arguments are treated as cookies. The code of the file copy utility would not need to change a bit to support foreign languages (see the sketch after this list). fopen() accepts Unicode seamlessly, and so does argv.
  • On Microsoft Windows, however, making a file copy utility that can accept file names in a mix of several different Unicode blocks requires advanced trickery. First, the application must be compiled as Unicode-aware. It then cannot have a main() function with standard-C parameters; it receives a UTF-16 encoded argv instead. To convert a Windows program written with narrow text in mind to support Unicode, one has to refactor deeply, keeping track of every string variable.
  • On Windows, the SetCodePage() API enables receiving non-ASCII characters, but only from one ANSI codepage at a time. An unimplemented parameter value, CP_UTF8, would enable doing the above on Windows.

  • The standard library shipped with MSVC is poorly implemented with respect to encodings. It forwards narrow-string parameters directly to the OS ANSI API, and there is no way to override this: changing std::locale does not work. It is impossible to open a file with a Unicode name on MSVC using standard features of C++. The standard way to open a file is:
    std::fstream fout("abc.txt");
    
    The way to get around this is to use Microsoft’s own hack: a constructor overload that accepts a wide-string parameter, which is a non-standard extension of the STL.
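    For example, a minimal sketch using that extension (MSVC-specific; it will not compile on other platforms):
    std::fstream fout(L"abc.txt"); // non-standard wide-string overload, MSVC only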
  • There is no way to return Unicode from std::exception::what() other than to use UTF-8.
  • UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves: in the plain Windows edit control (until Vista), it takes two backspaces to delete a character that occupies 4 bytes in UTF-16 (a surrogate pair). On Windows 7, the console displays such a character as two invalid characters, regardless of the font used.
  • Many 3rd-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API. Sometimes even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any single ANSI codepage (if it contains characters from a mix of Unicode blocks). What is normally done on Windows for file names is to get an 8.3 path to the file (if it already exists) and feed it to such a library. This is not possible if the library is supposed to create a non-existing file, if the path is so long that the 8.3 form exceeds MAX_PATH, or if short-name generation is disabled.
  • UTF-16 is very popular today, even outside the Windows world. Qt, Java, C#, Python and ICU all use UTF-16 for internal string representation.
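As an illustration of the ‘arguments are treated as cookies’ point above, here is a minimal sketch of an encoding-agnostic file copy for POSIX systems, where argv bytes flow through untouched:
    #include <fstream>

    // A complete file copy utility: file names are opaque byte strings
    // ("cookies"), so names in any language work without encoding logic.
    int main(int argc, char *argv[])
    {
        if (argc != 3) return 1;
        std::ifstream in(argv[1], std::ios_base::binary);
        std::ofstream out(argv[2], std::ios_base::binary);
        out << in.rdbuf(); // bytes in, bytes out
        return (in && out) ? 0 : 1;
    }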

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.
Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs. There is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.
Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.
In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from implementations, though, is that narrow strings be capable of storing any Unicode data. Then every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode compatible’—and with UTF-8 it is also easy to achieve.
The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:
  • decimal_point() and thousands_sep() should return a string rather than a single code unit. (C locales, by the way, do support this, albeit not customizably.)
  • toupper() and tolower() should not be phrased in terms of code units, as that does not work in Unicode. For example, ß should uppercase to SS and the ffl ligature to FFL (see the sketch below).
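As an illustration of the code-unit problem, here is a minimal sketch (the function name is ours) of why a per-code-unit transform cannot be correct for UTF-8:
    #include <cctype>
    #include <string>

    // A deliberately naive, per-code-unit uppercasing. It operates on bytes,
    // not characters, so it cannot implement ß -> SS: a correct uppercasing
    // may change the string's length, which no one-unit transform can do.
    std::string broken_toupper(std::string s)
    {
        for (char &c : s)
            c = (char)std::toupper((unsigned char)c);
        return s;
    }

    // broken_toupper("stra\xC3\x9F" "e") leaves the two bytes of "ß"
    // untouched in the "C" locale; the correct result "STRASSE" is one
    // byte longer than the input.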

How to do text on Windows

The following is what we recommend to everyone for compile-time-checked Unicode correctness, ease of use and better cross-platform portability of code. This differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research into these recommendations led us to the same conclusion. So here goes:
  • Do not use wchar_t or std::wstring anywhere other than at points adjacent to APIs accepting UTF-16.
  • Don’t use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
  • Don’t use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, always define _UNICODE, so that passing narrow strings to WinAPI functions fails to compile instead of being silently accepted.
  • std::string and char*, anywhere in the program, are considered UTF-8 (unless said otherwise).

  • Only use Win32 functions that accept widechars (LPWSTR). Never those which accept LPTSTR or LPSTR. Pass parameters this way:
    ::SetWindowTextW(convert(someStdString or "string literal").c_str())
    
    (The policy uses conversion functions described below.)

  • With MFC strings:
    CString someoneElse; // something that arrived from MFC.
    
    // Converted as soon as possible, before passing any further away from the API call:
    std::string s = str(boost::format("Hello %s\n") % convert(someoneElse));
    ::MessageBoxW(NULL, convert(s).c_str(), L"Error", MB_OK);
    

Working with files, filenames and fstreams on Windows

  • Never produce text output files with non-UTF-8 content.
  • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.
  • Never pass std::string or const char* filename arguments to the fstream family. The MSVC CRT does not support UTF-8 arguments, but it has a non-standard extension, which should be used as follows:

  • Convert std::string arguments to std::wstring with convert:
    std::ifstream ifs(convert("hello"), std::ios_base::binary);
    
    We will have to remove the convert manually when MSVC’s attitude to fstream changes.
  • This code is not multi-platform and may have to be changed manually in the future.
  • Alternatively use a set of wrappers that hide the conversions.

Conversion functions

The policy uses the conversion functions from the CppCMS booster::nowide library, which can be downloaded as a separate package:
std::string convert(const wchar_t *s);
std::wstring convert(const char *s);
std::string convert(const std::wstring &s);
std::wstring convert(const std::string &s);
The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files.
These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.
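For illustration, here is a minimal sketch of the two object overloads (error handling omitted; the pointer overloads can simply delegate to these). It relies on the contiguous-storage guarantee mentioned later in this document:
#include <string>
#include <windows.h>

// UTF-8 to UTF-16, via the Win32 CP_UTF8 conversion.
std::wstring convert(const std::string &s)
{
    if (s.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
    return w;
}

// UTF-16 to UTF-8, the reverse direction.
std::string convert(const std::wstring &s)
{
    if (s.empty()) return std::string();
    int n = WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0, NULL, NULL);
    std::string a(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &a[0], n, NULL, NULL);
    return a;
}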

FAQ


  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am a Windows fan. I believe they made the wrong choice in the text domain, because they made it earlier than others did.

  2. Q: Are you an anglo-saxon? Do you secretly think English alphabet and culture are superior to any other?

    A: No, and my country’s language is non-ASCII. I do not think that using an encoding in which ASCII characters take a single byte is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source code of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers.

  3. Q: Why do you guys care? I use C# and don’t have to know anything about encodings.

    A: You must still care when writing your text to files or communication channels. And the fact that you usually need not care about the internally stored encoding is an achievement, one that C++ programmers deserve to enjoy where possible.

  4. Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?

    A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts. For some it means ‘ANSI codepage’ strings; for others it means ‘this code is broken and does not support non-English text’; in our programs, it means Unicode-aware UTF-8 strings. This diversity is a source of many bugs and much misery: the additional complexity is something the world does not really need, and the result is much Unicode-broken software industry-wide.

  5. Q: Why do you turn on the _UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?

    A: This is a precaution against plugging a UTF-8 char* string into an ANSI-expecting API. We want that to generate a compiler error. It is the same kind of hard-to-find bug as passing an argv[] string to fopen() on Windows: such code assumes the user will never pass filenames outside the current codepage. You are unlikely to find this bug in your QA, and it results in broken program behavior. Thanks to the _UNICODE define, you get a compile-time error for it, as sketched below.
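    For illustration, a minimal sketch of the effect (hypothetical demo function; assumes <windows.h> with UNICODE/_UNICODE defined):
    #include <windows.h>

    void demo()
    {
        // The macro MessageBox expands to MessageBoxW here, which expects
        // wide strings, so the narrow call below fails to compile instead
        // of silently going through the ANSI codepage:
        // MessageBox(NULL, "hello", "title", MB_OK); // compile error

        MessageBoxW(NULL, L"hello", L"title", MB_OK); // explicit wide call
    }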

  6. Q: Isn’t it quite naïve to think that Microsoft will stop using widechars one day?

    A: Let’s first see when they start supporting CP_UTF8. Then, I see no reason why anybody would keep using the widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some existing Unicode-broken programs and libraries.

  7. Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% the memory per character?

    A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate all others. This includes XML, HTTP, filesystem paths and configuration files, all of which use almost exclusively ASCII characters, and in fact UTF-8 is used just as often in those countries.
    For dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. Anyway, if storage is at a premium, lossless compression will be used, in which case UTF-8 and UTF-16 take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).
    Here are the results of a simple experiment. The space used by HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012–01–01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.
    Encoding         HTML source (Δ UTF-8)   Dense text (Δ UTF-8)
    UTF-8            767 KB (0%)             222 KB (0%)
    UTF-16           1 186 KB (+55%)         176 KB (−21%)
    UTF-8 zipped     179 KB (−77%)           83 KB (−63%)
    UTF-16LE zipped  192 KB (−75%)           76 KB (−66%)
    UTF-16BE zipped  194 KB (−75%)           77 KB (−65%)
    As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, saves just 20% for dense Asian text, and hardly competes with general purpose compression algorithms.

  8. Q: What do you think about BOMs?

    A: They are another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding; its only purpose is to announce that the stream is UTF-8. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files already omit the BOM today, so robust code tolerates an optional BOM, as sketched below.
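    Where tolerating an optional BOM matters, a minimal sketch (the helper name is ours) that skips it on a seekable stream opened in binary mode:
    #include <istream>

    // Skip the UTF-8 BOM (bytes EF BB BF) if present; otherwise rewind.
    void skip_utf8_bom(std::istream &in)
    {
        char bom[3];
        in.read(bom, 3);
        if (in.gcount() != 3 ||
            bom[0] != '\xEF' || bom[1] != '\xBB' || bom[2] != '\xBF')
        {
            in.clear();  // the read may have hit EOF on a short file
            in.seekg(0); // not a BOM: start over from the beginning
        }
    }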

  9. Q: What do you think about line endings?

    A: All files shall be read and written in binary mode, since this guarantees interoperability: a program will always give the same output on any system. Since the C and C++ standards use \n as the in-memory line ending, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows; however, any decent text viewer understands such line endings. One example of such a viewer, bundled with every Windows installation, is IE.

  10. Q: But what about performance of text processing algorithms, byte alignment, etc?

    A: Is it really better with UTF-16? Maybe. ICU uses UTF-16 for historical reasons, so this is quite hard to measure. However, most of the time strings are treated as cookies, not sorted or reversed at every second use; the smaller encoding is then favorable.

  11. Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?

    A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.

  12. Q: Is it really a fault of UTF-16 that people misuse it, assuming it is 16 bits per character?

    A: Not really. But yes, safety is an important feature of every design.

  13. Q: If std::string means UTF-8, wouldn’t that cause confusion with code that stores plain text in std::string?

    A: There is no such thing as plain text. There is no reason to store codepage-ANSI or ASCII-only text in a class named ‘string’.

  14. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way: either when calling the system, or when interacting with the rest of the world. Even if your interaction with the system is the more frequent one in your application, here is a little experiment.
    A typical use of the OS is to open files. This function executes in (184 ± 3)μs on my machine:
    void f(const wchar_t* name)
    {
        HANDLE f = CreateFile(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    
    While this runs in (186 ± 0.7)μs:
    void f(const char* name)
    {
        HANDLE f = CreateFile(convert(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    
    (Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. Averaged over 5 runs. Used an optimized convert that relies on std::string contiguous storage guarantee given by C++11.)
    This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.

  15. Q: How do I write a UTF-8 string literal in my code?

    A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it's not a problem.
    If you still want to embed a special character, you can do it in several ways. In C++11:
    u8"∃y ∀x ¬(x ≺ y)"
    With compilers that don't support ‘u8’ you can hard-code the UTF-8 code units as follows:
    "\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"
    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:
    "∃y ∀x ¬(x ≺ y)"
    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without a BOM. MSVC will then assume it is in the correct codepage and will not touch your string. However, this makes it impossible to use Unicode identifiers and wide string literals (which you will not be using anyway).

  16. Q: I already use this approach and I want to make our vision come true. What can I do?

    A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.

Myths


  1. Indexing the nth character is O(n) in UTF-8. Not so for UTF-16/UTF-32.

    It is ϴ(n) in UTF-16 as well, since UTF-16 is also a variable-length encoding. Even in UTF-32 it is a ϴ(n) operation, because characters are not code points. For example, both
    naïve (U+006E, U+0061, U+00EF, U+0076, U+0065)
    and
    naı̈ve (U+006E, U+0061, U+0131, U+0308, U+0076, U+0065)
    are 5-character words, while the former consists of five code points and the latter of six.
    Code points are meaningful only in the context of Unicode algorithms, which by their nature always scan the string sequentially. Looking up the nth character, on the other hand, has little meaning in Unicode (see below).

  2. Finding the length of a string is O(n) in UTF-8. Not so for UTF-16/UTF-32.

    The word ‘length’ is ambiguous, so the comparison as stated is meaningless.
    • Length in code units: always the size of the memory occupied by the string divided by the size of one code unit. This is the most frequently used operation when you work with strings.
    • Length in code points: it’s ϴ(n) in both UTF-8 and UTF-16, and it’s indeed constant-time in UTF-32. It’s needed only when you convert to UTF-32, probably inside some Unicode algorithm.
    • Length in characters: it’s ϴ(n) in any encoding since one character may be encoded with multiple code points and some code points may not correspond to characters.
    What you usually do care about is…
    • The size of the string as it appears on the screen: you’ll need to communicate with the rendering engine for this, which has to scan the whole string no matter what encoding is used.
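    A minimal sketch of the first two measures for UTF-8 strings (function names are ours; input assumed to be well-formed UTF-8):
    #include <cstddef>
    #include <string>

    // Length in code units: O(1), simply the byte count.
    std::size_t length_in_code_units(const std::string &s)
    {
        return s.size();
    }

    // Length in code points: ϴ(n), count the bytes that start a code point,
    // i.e. skip continuation bytes of the form 10xxxxxx.
    std::size_t length_in_code_points(const std::string &s)
    {
        std::size_t n = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }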

  3. UTF-16 is good for processing.

    So said the ‘Unicode Technical Note #12, UTF-16 for Processing’. Unfortunately, the document fails to justify the points it makes. See detailed analysis of this paper (link, to be written).

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as a result of much experience and research into real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes to make Unicode-aware programming easier, ultimately improving the experience of users of programs written by human engineers. None of us is involved in the Unicode consortium.

Much of the text is inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there.


Saturday, April 28, 2012

Should I Check E-Mail?

Why I'm Sticking With Dropbox (Over Google Drive)

There has been a massive build-up to the release of Google Drive, and while this new offering from the search giant was always going to be a big one, I firmly believe that there’s a really convincing argument why Dropbox is a better choice for storing your stuff online: privacy, and retaining rights over your content. I’m no lawyer, but you don’t have to be to understand why the implications of Google’s privacy policy are probably something you want to avoid.

What you’re giving Google when using Drive

Take a look at Google’s Terms of Service:
"Google Drive
Notice the highlighted portion that reads:
When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.
Do you really want to sign over a worldwide license to use, modify, create derivative works, and publicly display or distribute for every document you upload to Google? My guess is your answer is no.

Dropbox FTW

Now let’s see what Dropbox’s terms say.
"The Dropbox Terms"
Their stance essentially is the complete opposite of Google’s. Notice the highlighted portion in the above image which reads:
You retain full ownership to your stuff. We don’t claim any ownership to any of it. These Terms do not grant us any rights to your stuff or intellectual property except for the limited rights that are needed to run the Services, as explained below.
Bravo, Dropbox! Well done for choosing a stance that supports my rights and privacy. That’s the kind of attitude more businesses should take.

Conclusion

Make up your mind yourself, but for me I know I’ll be sticking with Dropbox unless something radical changes.

RSS will never die

This was supposed to be a post about translating a URL into a [good] RSS feed. After reading The War on RSS and some of the passionate debate it kicked off on HackerNews I decided to write something else.
In short: RSS will never die.

The War on RSS part un

[Image: Propaganda]
In May 2009 Steve Gillmor wrote on Techcrunch
It’s time to get completely off RSS and switch to Twitter. RSS just doesn’t cut it anymore. The River of News has become the East River of news, which means it’s not worth swimming in if you get my drift.
~ Rest in Peace RSS, Steve Gillmor on Techcrunch, May 2009
It sparked a meme. Suddenly everyone and their dog was convinced RSS was dead and we should all move on. Twitter will save us from something as horrible as a fourteen-year-old idea. That’s much too old for us web people.
In early 2011, RSS still wasn’t quite dead. “If RSS is dead, what’s next?”, a guy asked on Quora. This time, a very diplomatic answer came from Robert Scoble (when I met him, he said my startup idea was a fail just because it revolved around RSS):
First off, let’s define what dead means.
To me, anytime someone says a tech is dead it usually means that tech is not very interesting to discuss anymore, or isn’t seeing the most innovative companies doing new things with it
Essentially Scoble thinks RSS is dead because Google Reader stopped working out for him and nobody is innovating in the RSS space anymore.
Bummer.
Five months later he wrote about Feedly – an RSS reader for the iPad – saying “don’t miss out and get Feedly on your iPad”. He had called the idea of an RSS reader for the iPad stupid just 7 months prior.
Guess RSS isn’t that bad after all :)

The War on RSS part deux

[Image: Hogarth, The Idle ’Prentice Executed at Tyburn]
This week – April 2012 – RSS still wasn’t quite dead. The War on RSS got a lot of passionate attention on HackerNews.
There’s a veritable explosion of companies removing RSS from their products … for whatever reason. Usually because it doesn’t directly benefit the bottom line – they prefer proprietary formats.
The next Mac OS – Mountain Lion – will likely ship without native RSS support. Gone from Safari (in favor of their proprietary Reader/Read Later thingy). Gone from Mail.
Somewhere in the last few versions, Firefox removed the RSS icon from its usual place in the URL bar.
Twitter removed public support for RSS feeds of user accounts. The feeds still exist – discovering them just takes a bit of trickery since they aren’t even mentioned in the HTML anymore.
Once upon a time even Facebook had support for profile RSS feeds. These have long been gone, so long in fact I don’t remember ever having seen them.
And there has never been native RSS support in Chrome. So much for that.
This time RSS is well and truly busted, right? Took an arrow to the knee, never to be heard from again.

RSS Will Never Die

[Image: Evolution of the Cylon]
For a piece of tech that was declared dead and boring almost three years ago, RSS can stir up a surprisingly strong debate … mostly among passionate users clinging on for dear life.
I asked Twitter whether anyone still uses RSS as a human. The replies started flying in as soon as I pressed the submit button: 11 yes, 1 no-ish, 1 sort-of no and 1 resounding no.
The data is skewed, yes. Only people passionate enough to care replied, and I am well aware that Normal Humans ™ don’t knowingly use RSS. Still, that’s quite a few responses for a random question posted to Twitter by some random guy.
It shows RSS will never die because of a simple reality: power users.
There is something called the 90-9-1 rule of online participation. At its core is the idea that 90% of content comes from the top 1% of contributors.
Saying those top contributors are your power users is a pretty safe bet. And that’s why RSS is here to stay for at least a while longer – all those people doing most of the sharing? A lot of their stuff comes from RSS.

Why do people still use RSS anyway?

[Image: Old desk]
Ok, so the top 1% of that top 1% may have moved away from RSS and onto social media. Or at least that’s what everyone was claiming back in 2009 when Twitter was still something fresh, new and exciting. And most of all, much, much slower.
Twitter is not a replacement for RSS. Not by a long shot. It’s too busy!
My Twitter stream gets about 30 new messages every minute or two. This isn’t an environment to follow important-ish updates. Certainly not a place to look for 500+ word chunks of text that take ten minutes to read.
And god forbid anyone writes their blog only once a week – I’d miss 99% of their updates!
That’s where RSS comes in.
Not only does it take an hour for ten new posts to reach my Google Reader, but when something does vanish from the stream, there is a sidebar full of subscriptions where I can see that, hey, there’s a bunch of stuff I want to read … eventually. No pressure. It’s all going to be there tomorrow, a week from now … even a month.
By the way, anything older than a week or two stops existing on Twitter.
When I want to read The Art of Manliness, I can just waltz over to Google Reader and check out the last few posts. No rush. The content is long, but it’s informative and it waits for me. There’s also no interruption or conversation. Just the curated best of what they have to say.
None of that on their Twitter, though. Even though they only post every couple of hours, most of it is still reposts of old stuff and answers to questions. I think there’s actually less than one new Actual Post ™ per day.
It gets worse for people who, like me, use Twitter as a person. Most of it is just random chitchat you don’t care about, sharing cool links from the web, and generally everything but an RSS replacement for my personal blog.
Consequently, RSS offers bigger exposure to your content.
Looking at a recent personal post: tweeting three times created 67 clickthroughs, while posting to RSS reached 145 readers, however Feedburner calculates that.
That’s a big difference!
RSS may have flopped for the regular user. It’s complex and kind of weird; but for that most important of readers – a fan – it will never really die.
And that’s before we even consider computers needing a simple and open way to follow websites’ updates.



GMail: designer arrogance and the cult of minimalism


It looks like Google has finally pulled the plug on the old GMail UI. There’s no more “revert to the old look temporarily” button, so I guess they’re finally forcing us laggards onto the new theme. I’ve been a mostly happy GMail user since the very early days, but I strongly dislike the new UI.
As far as I can tell, this redesign is just change for the sake of change. I can’t see a single improvement! But I can spot three distinct un-provements*:

  1. The featureless white void: the old interface had colored borders and variations in background color which served to delineate navigation from content and provide visual landmarks that helped me find my way around the page. It had visual ‘texture’. The new interface lacks that visual texture. Without borders or landmarks, everything blends together into a featureless sea of white and light grey. It requires more work for me to parse visually, to figure out what I’m looking at or to find the link I want to click.
    [Screenshot: the old Gmail UI]
    [Screenshot: the new Gmail UI]
    This is what happens when the cult of “minimalism” goes too far.
  2. The “importance” marker is now right next to the stars. I find the (algorithmically-applied) importance marker completely useless and would remove it if I could, but I use the stars quite heavily. In the old interface the importance marker was to the right, so I could ignore that column and scan the left column for stars. In the new interface, the two markers — being the same size, color, and location — blend together visually. I can no longer scan for stars; I have to look closely at each line to tell stars apart from importance markers.
  3. The new icons are inferior to the old text buttons. The text buttons were self-describing. The new icons are not. I’m not usually a fan of toolbar icons; they’re never as self-explanatory as their designers think they are, so they usually need text labels to be decipherable. At that point, why not cut out the middleman and just show the text label instead of the icon?
    [Screenshot: comparison of the old and new Gmail toolbars]
    But these icons are particularly bad. Again with the cult of minimalism: the icons are so streamlined and featureless that they all look the same: a row of meaningless, square, grey objects. When I want to mark something as spam, I used to be able to click the “spam” button. Now I have to mouse over each square grey object one at a time, looking for the one that pops up a “Report Spam” tooltip. (It’s the stop sign. Why a stop sign? I don’t know. Years of using GUIs have trained me to interpret a stop sign as an error message.)
Why were these changes made? I don’t know. According to the Gmail blog, the goals of the redesign included: to put mugshots of people into conversation view, to make the density adjustable, to make themes fancier, to make the left sidebar customizable, and to add an advanced search panel.
Assuming for the moment that these features were actually needed (which I think is arguable), the fact is that any of these features could have been added without making the interface a featureless white void or replacing helpful labels with cryptic icons.
Just today I read this blog post from a Google UX designer about “Change Aversion”, or the supposedly irrational tendency of users to fear change. The underlying attitude here is that users will like the new UI just fine once they try it, but they don’t want to give it a chance because they’re stubborn, like toddlers refusing to try an unfamiliar food.
I’ve certainly encountered this attitude before. Mozilla UX designers like to use the example of tabs-on-top: when we moved the tabs above the navigation bar in Firefox 4, many users balked at the change. But nobody could give a reason why tabs-on-top was worse — they just didn’t like it because it was unfamiliar.
The problem with this attitude is that sometimes the users may just be stubborn, but other times the users are encountering a real serious problem with the design; something they can feel is wrong, but can’t quite articulate precisely. Your users aren’t trained as designers, so they may not be able to argue their case convincingly in the language of design. If you dismiss all negative user feedback as mere stubbornness, you’ll miss important warning signs when you’re about to make a mistake. People have certainly been telling Google that they don’t like the new GMail interface, but it doesn’t seem like Google has been listening.
Change aversion might be a real thing, but designer arrogance is a real thing too.
* – “un-provements”: a word that I just made up because English lacks a word for discrete ways in which something has gotten worse. What would you say here? “three degradations”? “three backslides”? “three worsenings”?
