Fast transliteration for Arabic Text with Python

前端 未结 5 1060
难免孤独
难免孤独 2021-02-06 09:14

I always work on Arabic text files and to avoid problems with encoding I transliterate Arabic characters into English according to Buckwalter\'s scheme (http://www.qamus.org/tra

5条回答
  •  时光取名叫无心
    2021-02-06 09:38

    Whenever I use str.translate on unicode objects it returns the same exact object. Perhaps this is due to the change in behavior alluded to by Martijn Peters.

    If anyone else out there is struggling to transliterate unicode such as arabic to ascii, I've found that mapping ordinals to unicode literals works well.

    >>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
    >>> ordbuckArab = {ord(v.decode('utf8')): unicode(k) for (k, v) in buckArab.iteritems()}
    >>> ordbuckArab
    {1569: u"'", 1570: u'|', 1571: u'?', 1572: u'&', 1573: u'<', 1574: u'}', 1575: u'A', 1576: u'b', 1577: u'p', 1578: u't', 1579: u'v', 1580: u'g', 1581: u'H', 1582: u'x', 1583: u'd', 1584: u'*', 1585: u'r', 1586: u'z', 1587: u's', 1588: u'$', 1589: u'S', 1590: u'D', 1591: u'T', 1592: u'Z', 1593: u'E', 1594: u'G', 1600: u'_', 1601: u'f', 1602: u'q', 1603: u'k', 1604: u'l', 1605: u'm', 1606: u'n', 1607: u'h', 1608: u'w', 1609: u'Y', 1610: u'y', 1611: u'F', 1612: u'N', 1613: u'K', 1614: u'a', 1615: u'u', 1616: u'i', 1617: u'~', 1618: u'o'}
    >>> u'طعصط'.translate(ordbuckArab)
    u'TEST'
    

提交回复
热议问题