For URLs, text indexing, and similar tasks, you may need to drop diacritics from a string.
Let's open up the toolbox of the unicodedata module:
unicodedata.normalize("NFD", s) turns a unicode string s into "normal form D" (NFD), also known as canonical decomposition. In NFD, each diacritic appears as a separate combining character that modifies the preceding character. "Ö" becomes LATIN CAPITAL LETTER O followed by COMBINING DIAERESIS instead of the single character LATIN CAPITAL LETTER O WITH DIAERESIS.
unicodedata.combining(c) returns the canonical combining class of the character c: a nonzero value for combining characters, and 0 for non-combining ones.
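To make the two functions concrete, here is a minimal sketch that decomposes "Ö" and prints the code point, name, and combining class of each resulting character. It is written for Python 3, where every str is already Unicode; the variable names are only illustrative.

```python
import unicodedata

# "Ö" as a single precomposed code point (U+00D6).
s = "\u00d6"

# Canonical decomposition splits it into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", s)

for c in decomposed:
    # combining() returns 0 for the base letter and a nonzero class for the mark.
    print("U+%04X" % ord(c), unicodedata.name(c), unicodedata.combining(c))

# Prints:
# U+004F LATIN CAPITAL LETTER O 0
# U+0308 COMBINING DIAERESIS 230
```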
Using these two, dropping the diacritics is a piece of cake.
>>> from unicodedata import normalize, combining
>>> s = u"Çéñtùrÿ öf ïñtërnâtiônàlîzætĩøn."
>>> print u"".join(c for c in normalize("NFD", s) if not combining(c))
Century of internationalizætiøn.
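For reuse, the same idea can be wrapped in a small function. This is a sketch for Python 3, and the name strip_diacritics is my own choice, not something unicodedata provides.

```python
from unicodedata import combining, normalize


def strip_diacritics(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # Split precomposed characters into base letter + combining marks.
    decomposed = normalize("NFD", text)
    # Keep only characters whose canonical combining class is 0.
    return "".join(c for c in decomposed if not combining(c))


print(strip_diacritics("Çéñtùrÿ öf ïñtërnâtiônàlîzætiøn."))
# Prints: Century of internationalizætiøn.
```

Note that "æ" and "ø" come through unchanged. Unicode treats them as letters in their own right with no canonical decomposition, so NFD never produces a combining mark to strip. If you need them mapped to ASCII, that takes an explicit replacement table on top of this.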