Names are complicated
What’s in a name? The answer turns out to vary quite widely around
the world. When an English-language form, either electronic or paper,
asks for a person’s name, it usually provides separate fields for
first and last name, and sometimes middle name or middle initial.
Aristotle Pagaltzis linked to a post by Jim Clark on Thai names, demonstrating that this approach, or even the alternative “given name, family name”, falls down pretty quickly outside the English-speaking world.
Thai names consist of:
- A given name, similar to the English first name, except that it must come from a list of government-approved names;
- A family name, which is also government-regulated; all people with the same family name are related, and new Thai citizens must select an unused name. Like all non-namespaced identifiers (domain names, instant messenger handles, user names on popular web services), the good short ones are taken; and
- A chue len, which is typically translated as nickname, but according to Mr. Clark is more like an informal given name; it’s selected by one’s parents or close relatives early in life (though not necessarily at birth).
The obvious mapping of Thai name components onto English, (given name,
family name, chue len) → (first name, last name, nickname),
doesn’t work very well. Consider the Thai name Thaksin Shinawatra,
chue len Meow, the former prime minister. His (romanized; more on
that later) legal name is Thaksin Shinawatra. If addressing him
politely, I would refer to him as Khun Thaksin.1 Note that this
is {honorific} {given name}, not {honorific} {family name}; in other
words, Mr. Matthew as opposed to Mr. Sachs. His friends and
family will call him Meow, not Thaksin or Shinawatra.
A further wrinkle is that when sorting a list of Thai names, the given
name, not the family name, should be the sort key. Then there’s also
the matter that Thaksin Shinawatra, aka Meow, isn’t really the
gentleman’s name at all; it’s ทักษิณ ชินวัตร, aka แม้ว. There are several standard
romanizations for Thai, and whichever one the named individual prefers
is considered canonical. There are also other quirks involved in the
Thai script form of a name, like the lack of whitespace between the
honorific and the given name.
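The sorting and addressing rules above can be sketched in a few lines. This is only an illustration, not a real library: the `PersonName` record, the `convention` tag, and both helper functions are hypothetical names, and it works on romanized forms only, with the honorifics hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonName:
    """Hypothetical record; field names are assumptions for illustration."""
    given: str
    family: str
    informal: str = ""      # e.g. a Thai chue len
    convention: str = "en"  # assumed tag: "en" (English-style) or "th" (Thai)


def sort_key(name: PersonName) -> tuple[str, str]:
    """Thai names sort by given name first; English-style names by family name."""
    if name.convention == "th":
        return (name.given.casefold(), name.family.casefold())
    return (name.family.casefold(), name.given.casefold())


def polite_address(name: PersonName) -> str:
    """{honorific} {given name} for Thai, {honorific} {family name} otherwise."""
    if name.convention == "th":
        return f"Khun {name.given}"
    return f"Mr. {name.family}"  # honorific hard-coded to keep the sketch short


thaksin = PersonName("Thaksin", "Shinawatra", informal="Meow", convention="th")
matthew = PersonName("Matthew", "Sachs")

print(polite_address(thaksin))  # Khun Thaksin
print(polite_address(matthew))  # Mr. Sachs
print(sort_key(thaksin))        # ('thaksin', 'shinawatra')
```

Note that the convention travels with the name itself rather than with the application's locale, which is what lets a single list hold both kinds of name.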
Non-Thai complications
Then there are whole sets of different requirements for other kinds of names. The comments on Jim Clark’s blog entry, and this post by Richard Ishida, who’s in charge of i18n issues for the W3C, give some other good examples.
- Russian and Icelandic have gender suffixes on the family name (Fuzaylova for a woman, Fuzaylov for a man; Fjalar Jónsson vs. Katrín Jónsdóttir.)
- Russian has nicknames that, like Thai “nicknames”, are much more widely used than English nicknames, and that are usually (always?) systematically derivable from the given name; Vladimir → Vova.
- Scandinavian given names typically include spaces, and convention varies as to how acceptable it is to refer to Hans Christian Andersen as Hans vs. Hans Christian. This isn’t unheard of in the southern United States, either (Billy Jean, &c.). In some parts of Europe, these multipart given names are hyphenated, as in the Austrian Hans-Christian or the French Jean-Claude.
- In France and Italy, names can have a comma which essentially divides a series of first names from a series of middle names; in France, the middle names are rarely used outside of legal contexts, while in Italy, the middle names aren’t used in legal contexts. A Mario, Alberto Giovanni Rossi would have a legal name of Mario Rossi in Italy, whereas a French Jean, Christophe Dupond would be commonly known as Jean Dupond but legally Jean, Christophe Dupond.
- Many countries use patronymics instead of stable family names, so a set of related people won’t have the same family name.
- Many Chinese take arbitrary western nicknames for ease of communicating with westerners.
- Chinese names also have generational markers, so a set of siblings will all have the same “middle” name, and names are written {family}{generational}{given} in Chinese script.
So what?
How much of this do we really need to worry about? When I say that
Thai names should be sorted by given name, should, of course, is a
horribly loaded term. If an American border control agent pulls
up a list of people who have entered the country at a particular
point, they probably want the sort key to be Shinawatra, not
Thaksin. Mapping (given, family) → (first, last) is also
probably fine for this application. So when, exactly, does the extra
information need to be preserved?
Some reasons that a system might be interested in a name, or parts of
a name, are:
- Correlating records with other systems
- Displaying people’s names
- Addressing people in writing (“Dear Mr. Sachs,”, “Welcome, Matthew!”) or on the phone
- Identifying people (“To look up your records, enter your name”)
- Searching for people (on, say, a social networking site)
- Sorting a list of people
For most English applications that don’t cater to a large
international audience, it might be “good enough” to simply
have a flat name field where users can enter arbitrary names, or
at least their romanizations.2 A flat name field is much more
flexible than separate fields. Since you probably need to support substring searches
anyway, it doesn’t lose anything as far as searching’s concerned.
If you want to sort by last name, or communicate with other systems
that take a (first name, last name) tuple, it might be good enough to
just split off the last whitespace-separated token and treat that as
the last name.3 If that’s not good enough, a pair of (first names,
last name) or (given names, family name) inputs may be called for, but
characters such as spaces and apostrophes (O’Flannagan) should be
valid. If your application wants to try to automatically derive a
secondary form of address from the name entered, maybe it shouldn’t.
Is the ability to have form letters say Mr. Sachs as opposed to
Matthew Sachs really worth the faux pas of Mr. Shinawatra? I
guess it depends on how international your audience is; you could always ask for multiple forms of address.4
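As a rough sketch of the “good enough” approach, the two helpers below split a flat name on its last whitespace-separated token and do a case-insensitive substring search. Both function names are made up for this example. Returning a single-token name as the last name is an arbitrary choice, and the split will be wrong for any name whose family name contains a space.

```python
def split_flat_name(full_name: str) -> tuple[str, str]:
    """Split a flat name into (first names, last name) on the last token.

    A heuristic only: multi-word family names are split incorrectly.
    """
    parts = full_name.split()  # splits on any run of whitespace
    if not parts:
        return ("", "")
    if len(parts) == 1:
        # Assumption: a lone token is treated as the last name.
        return ("", parts[0])
    return (" ".join(parts[:-1]), parts[-1])


def name_matches(full_name: str, query: str) -> bool:
    """Case-insensitive substring search over the flat name."""
    return query.casefold() in full_name.casefold()


print(split_flat_name("Matthew Sachs"))            # ('Matthew', 'Sachs')
print(split_flat_name("Hans Christian Andersen"))  # ('Hans Christian', 'Andersen')
print(name_matches("Thaksin Shinawatra", "shina"))  # True
```

Because the search runs over the whole flat string, it finds a person by any part of their name without needing to know which part is which.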
For applications that want to really get localized names right, like a
system-wide address book or a global social networking site, a more
complex approach is called for. For instance, the Mac OS X address
book framework knows about the address formats for various countries;
it could extend that functionality to support different name formats.
It has some rudimentary support for this, in that an individual
address book entry can have a set of name ordering flags associated
with it, either first name first or last name first (sic); name fields
are fixed at title, first name, middle name, last name,
suffix, nickname, maiden name, and phonetic (first, middle,
last) name.
Per-country address format support doesn’t change which fields exist,
but it changes the order they’re displayed in. Per-country name
format would need to be more complicated. A Name (a person might have several, each with a different NameFormat) might consist of:
- A NameFormat, defining the (country, language) associated with the name (e.g. en.US) and the set of available NameComponents
- A list of (NameComponent, Value, (optional) PhoneticValue)
The system could provide functions like:
- int Name.compareWith(Name)
- String Name.representation(NAME_REPRESENTATION), where NAME_REPRESENTATION is one of:
  - LEGAL_NAME
  - FORMAL_NAME
  - SHORT_FORMAL_NAME
  - INFORMAL_NAME
  - VERY_INFORMAL_NAME
- Name Name.convertTo(NameFormat), which would try to convert to a different name representation using automated rules for things like romanization.
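Here is one way the Name and NameFormat idea might look in code. It is a minimal sketch under several assumptions that aren't part of the proposal above: representations are stored as per-format string templates, each format lists which components form its sort key, the `th.TH` locale tag simply mimics `en.US`, and method names follow Python conventions (`compare_with` rather than `compareWith`). Conversion between formats is left out, since romanization rules are well beyond a short example.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class NameRepresentation(Enum):
    LEGAL_NAME = auto()
    FORMAL_NAME = auto()
    SHORT_FORMAL_NAME = auto()
    INFORMAL_NAME = auto()
    VERY_INFORMAL_NAME = auto()


@dataclass
class NameFormat:
    locale: str                   # e.g. "en.US"
    components: tuple[str, ...]   # the available NameComponents
    sort_order: tuple[str, ...]   # assumption: components used as the sort key
    templates: dict[NameRepresentation, str]  # assumption: str.format templates


@dataclass
class Name:
    name_format: NameFormat
    values: dict[str, str]                          # NameComponent -> Value
    phonetic: dict[str, str] = field(default_factory=dict)  # optional

    def representation(self, kind: NameRepresentation) -> str:
        """Fill in this format's template for the requested representation."""
        return self.name_format.templates[kind].format(**self.values)

    def _sort_key(self) -> tuple[str, ...]:
        order = self.name_format.sort_order
        return tuple(self.values.get(c, "").casefold() for c in order)

    def compare_with(self, other: "Name") -> int:
        """Return -1, 0 or 1; each name uses its own format's sort key."""
        a, b = self._sort_key(), other._sort_key()
        return (a > b) - (a < b)


R = NameRepresentation
TH = NameFormat("th.TH", ("given", "family", "chue_len"), ("given", "family"),
                {R.LEGAL_NAME: "{given} {family}",
                 R.FORMAL_NAME: "Khun {given}",
                 R.INFORMAL_NAME: "{chue_len}"})
EN = NameFormat("en.US", ("first", "last"), ("last", "first"),
                {R.LEGAL_NAME: "{first} {last}",
                 R.FORMAL_NAME: "Mr. {last}",  # honorific hard-coded for brevity
                 R.INFORMAL_NAME: "{first}"})

thaksin = Name(TH, {"given": "Thaksin", "family": "Shinawatra",
                    "chue_len": "Meow"})
matthew = Name(EN, {"first": "Matthew", "last": "Sachs"})

print(thaksin.representation(R.FORMAL_NAME))    # Khun Thaksin
print(matthew.representation(R.FORMAL_NAME))    # Mr. Sachs
print(thaksin.representation(R.INFORMAL_NAME))  # Meow
```

Keeping the templates and sort order on the NameFormat rather than on the Name means that supporting a new naming convention is a matter of adding data, not code.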