Thai line breaking: how to break Thai text effectively

后端 未结 2 548
时光取名叫无心
时光取名叫无心 2021-02-02 09:04

Situation with Thai text on a client site is that we can\'t control where exactly particular words/sentences are going to break between the lines (how web b

相关标签:
2条回答
  • 2021-02-02 09:33

    I know this thread was quite some time but I have something to say as a native Thai. I read lots of Thai web pages everyday and I feel the quality of Thai line breaking by the modern web browsers nowadays is perfectly acceptable.

    As I know, Google Chrome browser uses ICU4C, Internet Explorer uses Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For Thai people I know, how these web browsers handle line breaks in Thai is perfectly acceptable for them. (actually we used to have this problem with very early version of Firefox (1.x) but that is resolved now.)

    Thai line breaking and word breaking, unlike western languages, is still considered an unsolved problem and is still actively tackled by many linguistics researchers. Currently there is no implementation that could perfectly break a sentence to Thai words. IBM ICU Boundary Analysis page contains some analysis on this problem.

    Many times, it has something to do with the context. For example, the phrase "ตากลม" can be correctly broken to "ตา","กลม" or "ตาก","ลม". Each way says totally different thing but Thai readers can still perfectly understand the intended meaning, given the context.

    Given that your local reviewers are already familiar with reading Thai websites, I think maybe they are too pushy on you to resolve this problem. This is common unsolvable problem for all Thai websites, web browsers, and even Microsoft Word.

    It is best to wait (or contribute to IBM ICU) until Thai sentence breaking implementation gets better. Let the web browsers handle this. I don't think trying to workaround this problem worth your valuable time. As as I know, even Thai website publishers here just don't care to get this one right.

    Should you need to publish a document with a perfect line/word breaking, you may consider other medium, such as PDF document in which you should have more control over the line breaks.

    Hope this helps :)

    0 讨论(0)
  • 2021-02-02 09:34

    The ICU and ICU4J libraries have a dictionary based word break iterator for Thai that you could use on the server side to inject breaking zero width spaces where appropriate.

    Or, you could use this to build a utility that could run at build time or on delivery of translations, if you knew the spacing requirements that far in advance.

    see ICU Boundary Analysis for more info. These libraries are available for C, C++, and Java.

    0 讨论(0)
提交回复
热议问题