Dropping diacritics
For URLs, text indexing, and similar uses, you may need to drop diacritics from a string.
Let's open up the toolbox of the unicodedata module.
unicodedata.normalize("NFD", s) turns a Unicode string s into "normal form D" (NFD), also known as canonical decomposition. In NFD, each diacritic appears as a separate combining character that follows the base character it modifies. "Ö" becomes LATIN CAPITAL LETTER O (U+004F) followed by COMBINING DIAERESIS (U+0308) instead of the single code point LATIN CAPITAL LETTER O WITH DIAERESIS (U+00D6).
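To see the decomposition concretely, here is a small sketch, written for Python 3, that lists the code point names before and after normalization. The variable names are only illustrative.

```python
import unicodedata

composed = "\u00d6"  # "Ö" stored as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# Look up the official Unicode name of every code point in each string
composed_names = [unicodedata.name(c) for c in composed]
decomposed_names = [unicodedata.name(c) for c in decomposed]

print(composed_names)
print(decomposed_names)
```

The first list should hold one name and the second two: the base letter followed by its combining mark.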
unicodedata.combining(c) returns the canonical combining class of the character c as an integer. The value is nonzero for combining characters and 0 for non-combining ones.
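As a quick illustration (Python 3, with an illustrative dictionary name), the combining classes of a base letter and two common combining marks can be inspected like this:

```python
from unicodedata import combining

# Canonical combining class for a base letter and two combining marks
classes = {
    "O": combining("O"),            # plain base letter
    "\u0308": combining("\u0308"),  # COMBINING DIAERESIS
    "\u0327": combining("\u0327"),  # COMBINING CEDILLA
}

print(classes)
```

Only the base letter maps to 0. The exact nonzero values differ per mark, but for filtering purposes all that matters is whether the result is truthy.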
Using these two, dropping the diacritics is a piece of cake.
>>> from unicodedata import normalize, combining
>>> s = "Çéñtùrÿ öf ïñtërnâtiônàlîzætíøn."
>>> print("".join(c for c in normalize("NFD", s) if not combining(c)))
Century of internationalizætiøn.
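For repeated use, the one-liner can be wrapped in a function. The following is a minimal sketch for Python 3, not a definitive implementation: the name strip_diacritics is hypothetical, and the final NFC step is an addition that the original snippet does not perform.

```python
from unicodedata import combining, normalize


def strip_diacritics(text):
    """Return text with all combining marks removed (hypothetical helper)."""
    # Split every character into its base character plus combining marks
    decomposed = normalize("NFD", text)
    # Keep only characters whose combining class is 0
    stripped = "".join(c for c in decomposed if not combining(c))
    # Recompose what is left, e.g. Hangul syllables that NFD split into jamo
    return normalize("NFC", stripped)


print(strip_diacritics("Çéñtùrÿ öf ïñtërnâtiônàlîzætíøn."))
```

Note that æ and ø survive in the output. They are letters in their own right with no canonical decomposition, so there is no combining mark to remove. If they need to become plain ASCII, for a URL for example, that would take a separate replacement table.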