What can I do to prevent the slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2.)
cnprog.com has Chinese characters
I am interested in allowing only ASCII characters in the slug, which is why I benchmarked some of the available tools on the same string:
Unicode Slugify:
In [5]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o', only_ascii=True)
37.8 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
'paizo-trekho-kai-glo-la-fdo'
Django Uuslug:
In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
35.3 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
'paizo-trekho-kai-g-lo-la-fd-o'
Awesome Slugify:
In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
47.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
'Paizo-trekho-kai-g-lo-la-fd-o'
Python Slugify:
In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
24.6 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
'paizo-trekho-kai-g-lo-la-fd-o'
django.utils.text.slugify with Unidecode:
In [15]: %timeit slugify(unidecode('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'))
36.5 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
'paizo-trekho-kai-glo-la-fdo'
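For anyone who wants to repeat these timings outside IPython, here is a minimal sketch using only the standard `timeit` module. The `bench` helper is a hypothetical name, not part of any of the libraries above, and it reports the best run rather than the mean ± std. dev. that `%timeit` prints, so the numbers will not match exactly. The lambda wrapper also adds a small constant overhead to every call.

```python
import timeit


def bench(func, text, number=10_000, repeat=7):
    """Hypothetical helper: best per-call time of func(text), in microseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition.
    timings = timeit.repeat(lambda: func(text), number=number, repeat=repeat)
    return min(timings) / number * 1e6


# str.lower is only a stand-in; pass whichever slugify function you are comparing.
sample = 'Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'
print(f"{bench(str.lower, sample):.2f} µs per call")
```

Swap `str.lower` for each candidate `slugify` to compare them on the same input.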
There is a Python package called unidecode that I've adopted for the askbot Q&A forum. It works well for Latin-based alphabets and even looks reasonable for Greek:
>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'
It does something weird with Asian languages: each character comes out as a capitalized syllable followed by a space:
>>> unidecode(u'影師嗎')
'Ying Shi Ma '
Does this make sense?
In askbot we compute slugs like so:
from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))
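If you would rather keep the non-ASCII characters in the slug, as the question asks, here is a minimal sketch that uses only the standard library. `unicode_slugify` is a hypothetical helper, not Django's or askbot's implementation, and it assumes Python 3, where `\w` matches Unicode letters and digits in `str` patterns. (Django 1.9 and later also accept `slugify(value, allow_unicode=True)`, which may be all you need on a newer version.)

```python
import re
import unicodedata


def unicode_slugify(value):
    """Hypothetical helper: build a slug but keep non-ASCII letters and digits."""
    # Normalize so that visually identical strings produce the same slug.
    value = unicodedata.normalize("NFKC", value)
    # Drop everything except word characters, whitespace and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


print(unicode_slugify("Hello, World!"))  # hello-world
print(unicode_slugify("影師嗎 test"))      # 影師嗎-test
```

Slugs produced this way need to be percent-encoded in URLs, which browsers and Django's URL helpers normally handle for you.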