This Place is Taken

Saturday, May 19, 2012

Rude food: God’s own dish





Whenever people ask me if I have a favourite cuisine or a favourite dish, I always say that I don’t because, frankly, there is so much good food out there that it is hard to choose. I’m reluctant also to say if I have a favourite among the many regional cuisines of India. But if you hold a gun to my head, then I will probably concede that – at least in my opinion – the greatest Indian cuisine is the food of Kerala.
Why Kerala? Well, partly because it is a personal preference. I just love the food. But there are also good rational reasons for my choice.
First of all, Malayali food is one of the few Indian cuisines that covers everything. There is terrific fish as you would expect from a coastal state. But the meat and poultry dishes are great too. How many other states can boast a good recipe for duck along with a brilliant one for crab? And then, there’s the outstanding vegetarian food. (Though frankly, Malayalis tend to get a little carried away with their drumsticks… And you have to be one of God’s own people to love tapioca as much as they do).
Secondly, Kerala is a synthesis of three of the greatest religions in the subcontinent – Hinduism, Christianity and Islam – and the cuisine reflects that. The Moplah cuisine of the Malabar Muslims has an Arab flavour, borrowed from the traders who regularly visited the region. The Syrian Christians are among the world’s oldest Christians (legend has it that they were converted by St Thomas, the apostle who was the Doubting Thomas of the Bible). So there were Christians in Kerala when many people in Europe were still living on trees. Syrian Christian cuisine is rich and varied and uses pork, beef and other ingredients you don’t always find in other Indian cuisines.
And the Hindu Malayali food is delicious, full of great vegetarian dishes that are distinguished by their lightness and subtlety.
Thirdly, it is the spices. Indian food is not only about the quality of the ingredients (as Western cuisine is); it is about combining the flavours of the spices. And Kerala is the spice garden of India. It has the most wonderful, fragrant spices and the food of all three communities – Muslim, Hindu and Christian – is distinguished by the skill with which spices are used.
When I first got into Malayali food in the 1980s, it was only available in South India and especially in Kerala itself. Because the cuisine is so complex, it revealed its secrets slowly. Even now, each time I try a new dish from Kerala, I am surprised by the dexterity with which meats, vegetables and spices have been cooked.
The mark of a great cuisine, I always think, is that the food that ordinary people eat – not just the banquet and party food made by great cooks – is interesting and memorable. My favourite Kerala dishes have always been the simplest ones.
Take Egg Roast. It is not, as you might imagine, a roasted egg. It is, in fact, a dish in which hard-boiled eggs are cooked in a wet masala. I first tried it for breakfast in the 1980s in Cochin. And now, I am addicted to it.
Why is it called a roast when it is actually a dry-ish curry? Nobody I spoke to seemed to know. One theory is that the egg is meant to be roasted, which makes no sense to me.
And where is it from? I asked Arun Kumar TR, the filmmaker-turned-chef who has now reinvigorated the excellent Zambar chain. Arun said that every little dhaba and restaurant in the Fort Cochin area served it. Perhaps, it was a Christian dish.

I checked Lathika George’s The Suriani Kitchen, a wonderful cookbook, full of Syrian Christian recipes and found that she had claimed it for her community. Like Arun, George also regards it as a dhaba dish, and provides a recipe.
She writes: “Perfectly browned onions are the base of this slowly stir-fried dish. This is a favourite in little tea shops and truck drivers’ haunts all over Kerala, usually served with parota or appams. Bakeries sometimes stuff the spiced eggs into triangles of flaky pastry and sell them as egg puffs…”
So is it really a Syrian Christian dish? I asked the great Ananda Solomon, who is from Mangalore but whose house falls on the border with Kerala. Ananda denied that the Egg Roast was a Syrian Christian dish and insisted that it had Muslim origins. But he also said that it had now spread all over Kerala and was common to all communities. Ananda sometimes puts it on the menu of his Konkan CafĂ© in Bombay in the winter, but his version has a slight Mangalorean touch because he uses a little kokum for sourness. Ananda says that Moplah food has Mangalorean influences and insists that the name ‘Roast’ was imposed by the Brits who called anything that wasn’t a curry a roast.
Finally, I decided to check with Dinesh Nair, the foodie who is MD of the Leela chain. Though the Leela goes on and on about Megu, Le Cirque etc. (with justification), the chain’s real strength has always been its South Indian food. Dinesh’s father Captain Nair opened his first hotel in Bombay convinced that even if all else failed, it would still be a success solely on the strength of his wife Leela’s Malayali cooking. The original cooks and chefs were all chosen by Mrs Nair herself and her family continues to be obsessed with food. For instance, the idlis at the Bangalore Leela have been justly praised for their excellence. So, Dinesh insists that the rice for idlis at every Leela property is flown in from Bangalore to ensure the same level of perfection.
Dinesh directed me to the Leela’s legendary South Indian chef (legendary to other chefs – he keeps a low profile, otherwise) Purshotham. Like Ananda, Purshotham says that the dish is now too ubiquitous to be associated with any one community but agrees that its origins lie with the cuisine of the Muslims of the Malabar coast.
Though all Leela hotels do a fabulous Egg Roast, on par with anything I have eaten in the dhabas of Kerala, I thought I should also find out how the home-cooked version is made. So I asked Hindu friends in Malabar how they make the dish in their own kitchens.
Their recipe was broadly the same except that, because my friends are from plantation families, they relied on spices more than the dhaba cooks did. Also, their version kept the onions slightly more solid at the end of the dish (not unlike the Lathika George recipe) while the Leela Egg Roast depended on the onions melting into the masala.
Broadly, the differences in all the recipes were to do with a few ingredients. Ananda used kokum, Lathika George used no garam masala. My Malayali friends used cinnamon bark and cloves to make a more fragrant Egg Roast. Purshotham used fennel, which nobody else did.
And they treated their eggs differently. Usually people put the whole hard-boiled eggs in. My friends halved the eggs so that they could soak in the flavours. Purshotham kept the eggs whole, but made deep gashes in the whites to let the masala permeate the inside. My friends poured the masalas over the eggs. Purshotham cooked the hard-boiled eggs in the masala for a few minutes.
I’ve included Purshotham’s recipe because so far, it has been secret and available only to Leela chefs. And I’ve included the recipe from my Malabar friends because it is a home-style method. Lathika George’s recipe is in her book and therefore, in the public domain.
But whichever recipe you select, you should end up with a mound of onion-rich masala, delicious, dark and fragrant, combining the teekha flavours of the Malabar coast, the spicy aromas of the plantations of Kerala and the earthy simplicity of small roadside dhabas. The eggs, yellowed with spice stains, should be poking shyly through the masala, imploring you to eat them.

You should scoop it all up with a freshly-made appam or a parota and let the taste of Kerala fill your mouth.
It is God’s own dish.
Chef Purshotham’s Egg Roast
4 portions (should serve 8)
Ingredients
16 hard boiled eggs (remove shell, make gashes)
800 gm onions, sliced
50 gm garlic pods, peeled and chopped
15 gm ginger, julienned
15 gm green chillies, slit
10 gm curry leaves
200 gm tomatoes, sliced
20 gm coriander powder
8 gm red chilli powder
3 gm turmeric powder
2 gm fennel seeds
3 gm black pepper, crushed
2 gm cumin powder, roasted
2 gm garam masala powder
Salt to taste
10 gm coriander leaves, chopped
60 ml vegetable oil or coconut oil
Method
* Heat oil in pan, add fennel seeds, garlic, ginger, onions, green chillies and curry leaves.
* Stir-fry for a few minutes on a low flame until the onions become translucent.
* Add red chilli powder, coriander powder, turmeric powder, cumin powder and salt and fry for a few minutes.
* Add tomatoes and saute until tomatoes are mashed.
* Now add eggs and stir fry without breaking the eggs.
* Add crushed black pepper, garam masala powder and chopped coriander leaves and cook on low flame until the oil separates from the sides of the pan and the eggs are slightly crisp.
* Adjust the seasoning and serve with appams.
The Home-Style Recipe
1 portion (serves 2)
Ingredients
4 hard boiled eggs
6-7 big onions, finely sliced (keep a fistful aside for later)
3 small tomatoes, diced (don’t dice them too small, it lets out too much moisture)
3-4 garlic cloves, finely chopped 
Small piece of ginger, finely chopped
2-3 green chillies, slit into two (add more if you like it hot)
2-3 tsp cooking oil
3-4 cloves
2 pieces cinnamon bark
2 bay leaves
1/2 tsp turmeric powder
2 tsp coriander powder
1 tsp red chilli powder (less, if you don’t like it so hot)
1/2 tsp garam masala
Coriander leaves and curry leaves 
Salt to taste
Method
* Heat the oil in a pan. Add garlic, ginger and onion (in that order) and fry them slowly over a low fire till they are a little more than golden brown. Stir continuously so that the onions do not burn or caramelise.
* Now roughly grind the cloves, cinnamon and bay leaves with a mortar and pestle and add to the pan. Keep the fire low. Once you sense the aroma of the cloves, stir in the turmeric powder, coriander powder and red chilli powder. Let the spices roast with the onions for a minute or so (be careful not to burn them).
* Throw in the green chilli slices and tomatoes. Stir till the tomatoes are cooked, making sure the gravy is not too squishy. Stir in the garam masala.
* Your gravy is ready.
* Now, halve the boiled eggs, place them on your serving dish, and pour your gravy over them. Throw in some chopped coriander and curry leaves. You can fry the onions you kept aside and mix them in just before serving for added texture.
* The dish tastes better if you let it rest for an hour or so before serving as the eggs will take a while to absorb the flavours.

Sunday, May 13, 2012

The Sharp Dropoff In Worker Happiness


A friend of mine resigned his long-time bank management job this week to take early retirement. I learned about it on Facebook.
As I began reading his announcement, I fully expected it to be an animated recounting of all the new hobbies he planned to pursue and exotic trips he intended to take. But it quickly became clear that this was no ordinary farewell note. He was truly upset about ending his career prematurely and wanted everyone close to him to understand why.
It was painful to discover that my former colleague had grown profoundly disheartened by the way his organization’s leadership had been treating him. With over two decades of service behind him, he called it quits simply because he couldn’t take it anymore.
“I felt like no one cared about me as a person there, and finally decided to extricate myself from the grind. I know many of you feel the same way now in your jobs…trapped and unappreciated.”
There was a sense of relief in his words, as if I was reading about someone who had been imprisoned, found an escape route, and wanted to show others the way to freedom.
“You may not be able to retire quite yet like me, but please do yourself a favor and look for something more satisfying. It might take a while (it took me eight months once I made the decision), but it’s been so worth it. If you're old like me, then think about early retirement. If you're young, look for a more satisfying, fulfilling career path. Don't let these companies drain off your sense of worth, pride, health, energy, honesty and ethics. Are you listening [XYZ Bank]*? Of course you're not.”
I share his words as another illustration that our common approach to workplace leadership is failing. And experts have been trying to tell us this for years.
New York’s Conference Board, a century-old research firm, began studying employee satisfaction and engagement 25 years ago. Their work shows that worker happiness has fallen every year since--in good economic times and bad. Today, over half of American workers effectively hate their jobs.
But it’s the past four years that have brought employee discontent to new and highly charged levels.
"People were already unhappy, but the recession years have made things much worse," says John Gibbons, formerly of the Conference Board and now Vice President of Research and Development at the Institute For Corporate Productivity. "Whether we realize it or not, workers have been under constant duress. Because of scarce resources, few opportunities for development and promotions--not to mention the fact that people often have been required to do the work of more than one person--a lot of our workforce is burnt out. Employees across the country feel overworked, under-rewarded and greatly unappreciated."
The recession has been hard on managers too, no doubt. Delivering great customer service and achieving KPIs and revenue goals have all been a tremendous challenge during this extended period of limited means.
But it’s clear that many leaders have lost sight of what matters most to people at work. Appreciation. Support. Recognition. Respect. And when people feel disillusioned and virtually convinced things have to be better somewhere else, they do what my friend did. They quit.
According to the U.S. Labor Department, 2.1 million people resigned their jobs in February, the most in any month since the start of the Great Recession.
Dating back to mid-2011, numerous studies have reported that at least one-third of the American workforce planned to jump ship in 2012. Since little action has yet been taken on that threat, however, those predictions have come to be seen as mere “Chicken Little exaggerations.” Business leaders, therefore, have grown less concerned.
But the government’s new “Job Opening And Labor Turnover Survey” (JOLTS) holds a reminder of why more employees haven’t (yet) departed. Jobs have remained scarce: 12.7 million people remain unemployed in the U.S. today, while only 3.5 million job openings exist. That translates into nearly four people chasing every job--not including already employed workers seeking greener, and more respectful, pastures.
Precisely because 2.1 million people were able to find new jobs, February’s mass exodus may prove to be the watershed moment when turnover becomes the problem it was predicted to be.
However, there still may be time for managers to re-recruit their employees before they leave. This won’t be easy and it will most definitely require a significant change in leadership practices. Here are three things leaders should learn quickly and never forget:
1. What makes people happiest in their jobs is profoundly personal. “Do I work for an organization whose mission and methods I respect?” “Does my boss authentically advocate for me?” “Is the work I do meaningful?” “Am I afforded sufficient variety in my day?” “Do I feel valued and appreciated for all the work that I do?”
We know that all these matter more to people than their compensation--and workers generally don’t quit jobs when these basic needs are met. According to a worldwide Towers Watson study, the single highest driver of employee engagement is whether or not workers feel their managers are genuinely interested in their well-being. Today, only 40% of workers believe that.
2. People only thrive when they feel recognized and appreciated. In a recent Harvard Business Review article, "Why Appreciation Matters So Much," Tony Schwartz reminds us that all employees need to be praised, honored, and routinely acknowledged for their efforts and achievements. Consequently, leaders must allow themselves to manage more from their hearts.
Our brains are great at building strategies, managing capital, and analyzing data. But it’s the heart that connects us as human beings, and it’s what’s greatly lacking in American leadership today. This is what now must change.
3. Your employees will stay if you tell them directly you need them, care about them, and sincerely plan to support them. Any time someone quits a job for a reason other than money, they’re leaving in hope that things will be better somewhere else. So, everyone who works for you must be made to feel that they matter. Plan one-on-one meetings and re-discover the dreams each person has at work. Tell people directly how valuable they are to you. To be successful, all your future behavior must demonstrate to your employees that their best career move is to remain working for you.
Being human and treating one another with dignity and respect is something the heart already knows to do. Leaders would all do well to follow it.
*His former employer, one of the U.S.’s largest financial institutions.
Mark C. Crowley is a former National Sales Manager for WaMu Investments, where he was named its Leader of the Year. He’s the author of Lead From The Heart: Transformational Leadership For The 21st Century; follow him on Twitter at @MarkCCrowley.

Saturday, May 12, 2012

Dropquest 2012 is Live. Get 1 GB free space


Dropquest 2012 Walkthrough


Begin
https://www.dropbox.com/dropquest2012

Prologue

Chapter 1
46637
64529
38645

Chapter 2
Go to https://www.dropbox.com/about
With each hint, move to a new person the way a chess knight moves.
ChenLi Wang -> Ramsey H. -> Allison Louie -> Naveen Agrawal -> Emily Zhao
https://www.dropbox.com/dropquest2012/crane

Chapter 3
SMUDGES

Chapter 4
SOMA

Chapter 5
MADLIB

Chapter 6
Invert the colours and look at the ones that don't match their text descriptions. Take the date in M/D/Y format and use it on https://www.dropbox.com/events

Should be one of the following:
https://www.dropbox.com/events?ns=false&n=0&d=1-7-2003
https://www.dropbox.com/events?ns=false&n=0&d=3-21-2005
https://www.dropbox.com/events?ns=fa...0&d=11-28-2001

Chapter 7
The letters are in the soundcloud comments, use the legend to figure out the order by converting it to numbers.
https://www.dropbox.com/dropquest2012/LEADING
https://www.dropbox.com/dropquest2012/DEALING
https://www.dropbox.com/dropquest2012/ALIGNED

Chapter 8
TRIUMPHANT

Chapter 9
Rearrange file names to say, “Your next destination is the last page of the tour.”
Go to https://www.dropbox.com/tour/6
Send an invite to boxer@dropbox.com, savior@dropbox.com or flash@dropbox.com via https://www.dropbox.com/referrals

Chapter 10
MEXICO, ARGENTINA, KOREA

Chapter 11
SOUTHPOLE

Chapter 12
Restore Chapter12.txt to its previous version via https://www.dropbox.com/revisions/Dropquest 2012/Captain's Logs/Chapter 12.txt
Then go to https://www.dropbox.com/help and click on the shield icon of "Security and Privacy".

Chapter 14
Use this to help solve the sudoku puzzle: http://www.solvemysudoku.com
Re-arrange the slider puzzle according to the highlighted part of the sudoku puzzle (once you solve it)
Use this to help solve slider puzzle: http://analogbit.com/software/puzzletools

Chapter 15
Share a folder with the email listed on the page (should be the same as in the one you invited in Chapter 9).
Then go to https://www.dropbox.com/share and click on the larger rainbow icon next to "Sharing".

Chapter 17
SHANGHAI

Chapter 18
Go to https://www.dropbox.com/home/Dropquest 2012/Spring Cleaning
Put 1, 3, 6, 8, 9 (.jpg) into Category 1
Put 2, 4, 5, 7, 10 into Category 2
If you did this via the desktop app, use the Repair Dropquest Folder link on the lower left corner of the clue page to start over and do it properly from the web interface.

Chapter 19
For each column, subtract the smaller card value from the larger one. The answer should be one of the following links:

http://db.tt/94J964
http://db.tt/72j933
http://db.tt/Q2J9J4

Chapter 20
Ocean's Eleven
X-Men: First Class
Twenty-One
Super 8
28 Days Later
Sixth Sense

https://www.dropbox.com/dropquest2012/apollo13

Chapter 21
ABUSIVELY

Chapter 22
FACED, MACAU, BADGE, BASED, or ADAGE

Chapter 23
MACHU PICCHU

Endgame
COLOSSEUM

Sunday, May 6, 2012

The UTF-8-Everywhere Manifesto



Text made easy once again

Purpose of this document

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.
To promote usage and support of the UTF-8 encoding and to convince readers that this should be the default choice of encoding for storing text strings in memory or on disk, for communication and all other uses. We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.
In particular, we believe the very popular UTF-16 encoding (mistakenly used as a synonym for ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs (except for specialized libraries dealing with text). If, at this point, you already think we are crazy, please skip straight to the FAQ section.
This document recommends choosing UTF-8 for string storage in Windows applications, where this standard is less popular due to historical reasons and the lack of native UTF-8 support in the API. Yet, we believe that even on this platform the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.
We recommend avoiding C++ application code that depends on the _UNICODE define. This includes TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to reach the goals of these APIs.
We also believe that if an application is not supposed to specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. A file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for beginners, but it lacks the most important part: how a programmer should proceed if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
However, it was soon discovered that 16 bits per character would not do. In 1996, the UTF-16 encoding was created so that existing systems would be able to work with non-16-bit characters. This effectively nullified the rationale behind choosing a 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109,449 characters, with about 74,500 of them being CJK ideographs.
Microsoft has ever since mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for the narrow-string WinAPI, code must be compiled with _UNICODE rather than _MBCS. This educates Windows programmers that Unicode must be done with ‘widechars’. As a result of the mess, Windows C++ programmers are now among the most confused about what is the right thing to do about text.
In the Linux and Web worlds, however, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on planet Earth. Even though it gives a strong preference to English, and therefore to computer languages (such as C++, HTML, XML, etc.), over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

  • In both UTF-8 and UTF-16 encodings, characters may take up to 4 bytes (contrary to what Joel says).
  • UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for different byte orders, respectively). Here we name them collectively as UTF-16.
  • Widechar is 2 bytes in size on some platforms, 4 on others.
  • UTF-8 and UTF-32 yield the same order when sorted lexicographically. UTF-16 does not.
  • UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
  • In the Linux world, narrow strings are considered UTF-8 by default almost everywhere. This way, for example, a file copy utility would not need to care about encodings. Once tested on ASCII strings for file-name arguments, it would certainly work correctly for arguments in any language, as arguments are treated as cookies (see the sketch after this list). The code of the file copy utility would not need to change a bit to support foreign languages. fopen() would accept Unicode seamlessly, and so would argv.
  • On Microsoft Windows, however, making a file copy utility that can accept file names in a mix of several different Unicode blocks requires advanced trickery. First, the application must be compiled as Unicode-aware. In this case, it cannot have a main() function with standard-C parameters. It will then accept UTF-16 encoded argv. To convert a Windows program written with narrow text in mind to support Unicode, one has to refactor deeply and keep track of all string variables.
  • On Windows, the SetCodePage() API enables receiving non-ASCII characters, but only from one ANSI codepage. An unimplemented parameter CP_UTF8 would enable doing the above on Windows.

  • The standard library shipped with MSVC is poorly implemented. It forwards narrow-string parameters directly to the OS ANSI API. There’s no way to override this. Changing std::locale doesn’t work. It’s impossible to open a file with a Unicode name on MSVC using standard features of C++. The standard way to open a file is:
    std::fstream fout("abc.txt");
    
    The proper way to get around this is to use Microsoft’s own hack that accepts a wide-string parameter, which is a non-standard extension of the STL.
  • There is no way to return Unicode from std::exception::what() other than to use UTF-8.
  • UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves: in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16. On Windows 7 the console displays that character as two invalid characters, regardless of the font used.
  • Many 3rd-party libraries for Windows do not support Unicode: they accept narrow-string parameters and pass them to the ANSI API. Sometimes, even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done on Windows for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. It is not possible if the library is supposed to create a non-existing file. It is not possible if the path is so long that the 8.3 form is longer than MAX_PATH. It is not possible if short-name generation is disabled.
  • UTF-16 is very popular today, even outside the Windows world. Qt, Java, C#, Python, the ICU—all use UTF-16 for internal string representation.
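To make the ‘strings as cookies’ point concrete, here is a minimal sketch of such an encoding-agnostic file copy (our illustration, not part of the original text). It opens both files in binary mode and never inspects its file-name arguments, so on systems where narrow strings are UTF-8 it handles any language without a single encoding-aware line; error handling is kept minimal:

    #include <cstdio>
    #include <fstream>

    int main(int argc, char* argv[])
    {
        if (argc != 3) {
            std::fprintf(stderr, "usage: copy <from> <to>\n");
            return 1;
        }
        // The file names are opaque cookies: passed through, never decoded.
        std::ifstream in(argv[1], std::ios_base::binary);
        std::ofstream out(argv[2], std::ios_base::binary);
        out << in.rdbuf(); // copy the raw bytes
        return (in && out) ? 0 : 1;
    }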

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.
Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs. There is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.
Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.
In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from implementations, though, is that narrow strings be capable of storing any Unicode data. Then every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode-compatible’—and with UTF-8 it is also easy to achieve.
The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:
  • decimal_point() and thousands_sep() should return a string rather than a single code unit. (By the way C locales do support this, albeit not customizable.)
  • toupper() and tolower() shall not be phrased in terms of code units, as that does not work in Unicode. For example, ß shall be converted to SS and the ligature ﬄ to FFL.
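A small sketch (ours) of why per-code-unit case mapping cannot work: a char-to-char toupper() can neither grow the string (ß to SS) nor even see a whole multi-byte UTF-8 sequence, so the best it can safely do is leave non-ASCII bytes alone:

    #include <cctype>
    #include <string>

    // Correct only for ASCII: multi-byte UTF-8 sequences pass through
    // untouched, and length-changing mappings such as ß -> SS cannot be
    // expressed one code unit at a time.
    std::string ascii_toupper(std::string s)
    {
        for (char& c : s)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return s;
    }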

How to do text on Windows

The following is what we recommend to everyone for compile-time-checked Unicode correctness, ease of use and better multi-platformness of the code. This substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research into these recommendations led us to the same conclusion. So here goes:
  • Do not use wchar_t or std::wstring anywhere other than at the point adjacent to APIs accepting UTF-16.
  • Don’t use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
  • Don’t use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, _UNICODE should always be defined, so that passing a narrow string to a WinAPI function fails to compile instead of being silently accepted.
  • std::string and char* anywhere in the program are considered UTF-8 (if not said otherwise).

  • Only use Win32 functions that accept widechars (LPWSTR). Never those which accept LPTSTR or LPSTR. Pass parameters this way:
    ::SetWindowTextW(convert(someStdString or "string literal").c_str())
    
    (The policy uses conversion functions described below.)

  • With MFC strings:
    CString someoneElse; // something that arrived from MFC.
    
    // Converted as soon as possible, before passing any further away from the API call:
    std::string s = str(boost::format("Hello %s\n") % convert(someoneElse));
    AfxMessageBox(convert(s).c_str(), L"Error", MB_OK);
    
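As noted in the Facts section, a Unicode-aware Windows program cannot receive UTF-16 arguments through a standard-C main(). One way to keep a narrow main() and still get UTF-8 arguments is to ask the OS for the wide command line and convert it once, at the boundary. A minimal sketch, assuming the convert() functions described under ‘Conversion functions’ below (Boost.Nowide-style libraries provide the same service ready-made):

    #include <windows.h>
    #include <shellapi.h>   // CommandLineToArgvW; link with Shell32.lib
    #include <string>
    #include <vector>

    std::string convert(const wchar_t* s); // UTF-16 -> UTF-8, declared below

    std::vector<std::string> utf8_args()
    {
        int argc = 0;
        wchar_t** wargv = ::CommandLineToArgvW(::GetCommandLineW(), &argc);
        std::vector<std::string> args;
        for (int i = 0; i < argc; ++i)
            args.push_back(convert(wargv[i])); // each argument becomes UTF-8
        ::LocalFree(wargv);                    // error handling omitted
        return args;
    }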

Working with files, filenames and fstreams on Windows

  • Never produce text output files with non-UTF-8 content
  • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.
  • Never pass std::string or const char* filename arguments to the fstream family. The MSVC CRT does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:

  • Convert std::string arguments to std::wstring with convert:
    std::ifstream ifs(convert("hello"), std::ios_base::binary);
    
    We’ll have to remove the convert manually when MSVC’s attitude to fstream changes.
  • This code is not multi-platform and may have to be changed manually in the future.
  • Alternatively, use a set of wrappers that hide the conversions, as in the sketch below.
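A sketch of one such wrapper (ours, under the assumptions above): a hypothetical open_ifstream() that takes a UTF-8 name everywhere and hides the MSVC-specific conversion behind a single #ifdef. It relies on movable streams (C++11) and on MSVC’s wide-string fstream extension:

    #include <fstream>
    #include <string>

    std::wstring convert(const std::string& s); // UTF-8 -> UTF-16, see below

    std::ifstream open_ifstream(const std::string& utf8_name,
                                std::ios_base::openmode mode = std::ios_base::in)
    {
    #ifdef _WIN32
        // MSVC's non-standard wide overload preserves the file name exactly.
        return std::ifstream(convert(utf8_name), mode);
    #else
        // Elsewhere narrow strings are UTF-8 already.
        return std::ifstream(utf8_name, mode);
    #endif
    }

The call sites then stay multi-platform; only the wrapper has to change when MSVC’s attitude to fstream changes.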

Conversion functions

The policy uses the conversion functions from the CppCMS booster::nowide library, which can be downloaded as a separate package:
std::string convert(const wchar_t *s);
std::wstring convert(const char *s);
std::string convert(const std::wstring &s);
std::wstring convert(const std::string &s);
The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files.
These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.
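For illustration, here is one way the two string overloads might look on Windows (a sketch of ours, not the actual booster::nowide source; error handling omitted). The pointer overloads can simply forward to these:

#include <windows.h>
#include <string>

std::wstring convert(const std::string& s)
{
    if (s.empty()) return std::wstring();
    // First call computes the required length; second call converts.
    int n = ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring result(n, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &result[0], n);
    return result;
}

std::string convert(const std::wstring& s)
{
    if (s.empty()) return std::string();
    int n = ::WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0, NULL, NULL);
    std::string result(n, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &result[0], n, NULL, NULL);
    return result;
}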

FAQ


  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am a Windows fan. I believe they made the wrong choice in the text domain, because they made it earlier than others.

  2. Q: Are you an anglo-saxon? Do you secretly think English alphabet and culture are superior to any other?

    A: No, and my country is non-ASCII speaking. I do not think that an encoding which uses a single byte for ASCII characters is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source code of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist—text is not only for human readers.

  3. Q: Why do you guys care? I use C# and don’t have to know anything about encodings.

    A: You must still care when writing your text to files or communication channels. And the fact that you don’t usually have to care about internally stored encodings is an achievement, one that C++ programmers deserve to enjoy where possible.

  4. Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?

    A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts. For some it means ‘ANSI codepage’; for others it means ‘this code is broken and does not support non-English text’. In our programs, it means a Unicode-aware UTF-8 string. This diversity is a source of many bugs and much misery: this additional complexity is something the world does not really need, and the result is much Unicode-broken software industry-wide.

  5. Q: Why do you turn on the _UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?

    A: This is a precaution against plugging a UTF-8 char* string into an ANSI-expecting API. We want it to generate a compiler error. It is the same kind of hard-to-find bug as passing an argv[] string to fopen() on Windows: it assumes the user will never pass filenames outside the current codepage. You are unlikely to find such a bug in your QA, yet it is broken program behavior. Thanks to the _UNICODE define, you get an error for it.

  6. Q: Isn’t it quite naĂŻve to think that Microsoft will stop using widechars one day?

    A: Let’s first see when they start supporting CP_UTF8. Then, I see no reason why anybody would continue using the widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some existing Unicode-broken programs and libraries.

  7. Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% the memory per character?

    A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate everything else. This includes XML, HTTP, filesystem paths and configuration files—all use almost only ASCII characters, and in fact UTF-8 is used just as often in those countries.
    For a dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. Anyway, if storage is at premium, a lossless compression will be used. In such case UTF-8 and UTF-16 will take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).
    Here are the results of a simple experiment. The space used by HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012–01–01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.
                        HTML Source (Δ UTF-8)    Dense text (Δ UTF-8)
    UTF-8               767 KB    (0%)           222 KB   (0%)
    UTF-16              1 186 KB  (+55%)         176 KB   (−21%)
    UTF-8 zipped        179 KB    (−77%)         83 KB    (−63%)
    UTF-16LE zipped     192 KB    (−75%)         76 KB    (−66%)
    UTF-16BE zipped     194 KB    (−75%)         77 KB    (−65%)
    As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, saves just 20% for dense Asian text, and hardly competes with general purpose compression algorithms.

  8. Q: What do you think about BOMs?

    A: Another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding. This is to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files omit BOMs today.

  9. Q: What do you think about line endings?

    A: All files shall be read and written in binary mode since this guarantees interoperability—a program will always give the same output on any system. Since the C and C++ standards use \n as in-memory line endings, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows, however any decent text viewer understands such line endings. An example of such text viewer that comes bundled with all Windows installations is IE.

  10. Q: But what about performance of text processing algorithms, byte alignment, etc?

    A: Is it really better for UTF-16? Maybe so. ICU uses UTF-16 for historical reasons, so it is quite hard to measure. However, most of the time strings are treated as cookies, not sorted or reversed every second use. A smaller encoding is then favorable.

  11. Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?

    A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.

  12. Q: Is it really a fault of UTF-16 that people misuse it, assuming it is 16 bits per character?

    A: Not really. But yes, safety is an important feature of every design.

  13. Q: If std::string means UTF-8, wouldn’t that confuse with code that stores plain text in std::strings?

    A: There’s no such thing as plain text. There is no reason to store codepage-ANSI or ASCII-only text in a class named ‘string’.

  14. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way. It’s either when calling the system, or when interacting with the rest of the world. Even if your interaction with the system is more frequent in your application, here is a little experiment.
    A typical use of the OS is to open files. This function executes in (184 ± 3)ÎĽs on my machine:
    void f(const wchar_t* name)
    {
        HANDLE f = CreateFile(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    
    While this runs in (186 ± 0.7)ÎĽs:
    void f(const char* name)
    {
        HANDLE f = CreateFile(convert(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    
    (Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. Averaged over 5 runs. Used an optimized convert that relies on std::string contiguous storage guarantee given by C++11.)
    This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.

  15. Q: How do I write UTF-8 string literal in my code?

    A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it's not a problem.
    If you still want to embed a special character, you can do it as follows. In C++11:
    u8"∃y ∀x ¬(x ≺ y)"
    With compilers that don't support ‘u8’ you can hard-code the UTF-8 code units as follows:
    "\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"
    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:
    "∃y ∀x ¬(x ≺ y)"
    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without a BOM. MSVC will then assume it is in the correct codepage and won’t touch your string. However, this makes it impossible to use Unicode identifiers and wide string literals (which you will not be using anyway).

  16. Q: I already use this approach and I want to make our vision come true. What can I do?

    A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.

Myths


  1. Indexing the nth character is O(n) in UTF-8. Not so for UTF-16/UTF-32.

    So it is in UTF-16, which is a variable-length encoding. It is a ϴ(n) operation even in UTF-32. Characters are not code points. For example, both
    naĂŻve (U+006E, U+0061, U+00EF, U+0076, U+0065)
    and
    naı̈ve (U+006E, U+0061, U+0131, U+0308, U+0076, U+0065)
    are 5-character words, while the former consists of five code points and the latter of six.
    Code points are meaningful only in the context of Unicode algorithms, which by their nature always scan the string sequentially. Looking up the nth character, on the other hand, has little meaning in Unicode (see below).

  2. Finding the length of a string is O(n) in UTF-8. Not so for UTF-16/UTF-32.

    The word ‘length’ is ambiguous, so the comparison as stated is meaningless.
    • Length in code units: this is always the size of the memory occupied by the string divided by the size of one code unit. It is the most frequently used operation when you work with strings.
    • Length in code points: it’s Ď´(n) in both UTF-8 and UTF-16, and it’s indeed constant-time in UTF-32. It’s needed only when you convert to UTF-32, probably inside some Unicode algorithm.
    • Length in characters: it’s Ď´(n) in any encoding since one character may be encoded with multiple code points and some code points may not correspond to characters.
    What you usually do care about is…
    • The size of the string as it appears on the screen: you’ll need to communicate with the rendering engine for this, which has to scan the whole string no matter what encoding is used.
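    To make the code-unit/code-point distinction concrete, a short sketch (ours, not from the original): length in code units is just the byte count, while counting code points requires a scan.

    #include <cstddef>
    #include <string>

    // Every UTF-8 code point starts with a byte that is not of the
    // continuation form 10xxxxxx, so counting those counts code points.
    std::size_t code_points(const std::string& utf8)
    {
        std::size_t n = 0;
        for (unsigned char c : utf8)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }
    // "na\xC3\xAFve" ("naïve", with ï as U+00EF): .size() == 6, code_points() == 5.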

  3. UTF-16 is good for processing.

    So said the ‘Unicode Technical Note #12, UTF-16 for Processing’. Unfortunately, the document fails to justify the points it makes. See detailed analysis of this paper (link, to be written).

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as a result of much experience and research into real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes to make Unicode-aware programming easier, ultimately improving the experience of users of programs written by human engineers. None of us is involved with the Unicode consortium.

Much of the text is inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there.
