Dropping diacritics

By Filip Salo; published on November 02, 2006.

For URLs, text indexing and similar tasks, you may need to drop diacritics from a string.

Let's open up the toolbox of the unicodedata module.

unicodedata.normalize("NFD", s) turns a Unicode string s into "normal form D" (NFD), also known as canonical decomposition. In NFD, diacritics appear as separate combining characters, each modifying the character before it. "Ö" becomes LATIN CAPITAL LETTER O followed by COMBINING DIAERESIS instead of the single character LATIN CAPITAL LETTER O WITH DIAERESIS.
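To make the decomposition concrete, here is a small sketch that normalizes "Ö" and lists the name of each resulting code point. It is written for Python 3, where every str is already Unicode; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00d6"  # "Ö" as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)

# Look up the official Unicode name of each code point after decomposition.
names = [unicodedata.name(c) for c in decomposed]

print(len(composed), len(decomposed))  # 1 2
print(names)  # ['LATIN CAPITAL LETTER O', 'COMBINING DIAERESIS']
```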

unicodedata.combining(c) returns the canonical combining class of the character c as an integer: nonzero for combining characters, and 0 for non-combining ones.
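As a quick illustration (again assuming Python 3), the sketch below compares a combining mark with an ordinary letter. The result can be used directly as a truth value, which is all the filtering step needs.

```python
from unicodedata import combining

diaeresis_class = combining("\u0308")  # COMBINING DIAERESIS
letter_class = combining("O")          # an ordinary base letter

# Nonzero (truthy) for the combining mark, 0 (falsy) for the letter.
print(diaeresis_class, letter_class)  # 230 0
```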

Using these two, dropping the diacritics is a piece of cake.

>>> from unicodedata import normalize, combining
>>> s = u"Çéñtùrÿ öf ïñtërnâtiônàlîzætĩøn."
>>> print u"".join(c for c in normalize("NFD", s) if not combining(c))
Century of internationalizætiøn.
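The session above uses Python 2 syntax. A rough Python 3 equivalent, wrapped in a helper, might look like the sketch below; the name drop_diacritics is made up for this example and is not part of unicodedata.

```python
from unicodedata import combining, normalize


def drop_diacritics(text):
    """Decompose text to NFD, then discard every combining character."""
    return "".join(c for c in normalize("NFD", text) if not combining(c))


result = drop_diacritics("Çéñtùrÿ öf ïñtërnâtiônàlîzætiøn.")
print(result)  # Century of internationalizætiøn.
```

Note that "æ" and "ø" come through unchanged. They are letters in their own right with no canonical decomposition, so NFD leaves them alone and there is no combining character to remove. If you need plain ASCII, you would have to map such letters yourself.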

(Somewhat) related links

Skip Montanaro's latscii codec
Diacritics - all you need to design a font with correct accents
Heavy metal umlaut (röck döts)
The unicodedata module documentation