Searching for all Unicode variations of hyphens in Python

梦谈多话 2021-01-18 03:54

I have been trying to extract certain text from PDFs converted into text files. The PDFs came from various sources and I don't know how they were generated.

The patte

1 answer
  • 2021-01-18 04:37

    The solution you ask for in the question title implies a whitelisting approach and means that you need to find the chars that you think are similar to hyphens.

    You may refer to the Punctuation, Dash category (Pd): that Unicode category lists the Unicode dash and hyphen characters.

    You may use the PyPI regex module and its \p{Pd} pattern to match any Unicode hyphen.

    Or, if you can only work with re, use

    [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
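As a rough illustration of the re-only route, the sketch below compiles the character class above and uses it to replace every listed dash with an ASCII hyphen-minus. The helper name normalize_dashes and the sample strings are made up for this example and do not come from the question.

```python
import re

# Character class copied from the answer above. In a normal (non-raw) string
# Python expands the \uXXXX escapes itself, so re sees the literal characters.
DASHES = re.compile(
    "[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B"
    "\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]"
)


def normalize_dashes(text):
    """Replace each dash-like character in the class with an ASCII '-'."""
    return DASHES.sub("-", text)


# Hypothetical input: an en dash and a fullwidth hyphen-minus between digits.
sample = "12\u201334 and 56\uFF0D78"
print(normalize_dashes(sample))
```

Normalizing first keeps any later pattern simple, since it then only has to deal with the plain ASCII hyphen.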
    

    You may expand this list with other Unicode chars that contain minus in their Unicode names, see this list.
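If the linked list is not at hand, candidates can probably be found with the standard unicodedata module. The sketch below scans every code point and keeps those whose official name contains a keyword; the function name chars_named is hypothetical, and the result depends on the Unicode version bundled with your Python build.

```python
import sys
import unicodedata


def chars_named(keyword):
    """Return every character whose Unicode name contains `keyword`."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(ch, "")
        if keyword in name:
            found.append(ch)
    return found


minus_like = chars_named("MINUS")
for ch in minus_like:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Review the output before adding it to the character class: a name search also returns characters such as PLUS-MINUS SIGN (U+00B1), which you most likely do not want to treat as a hyphen.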

    A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
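To make the difference between the two blacklisting patterns concrete, here is a small sketch. Since the original pattern is cut off in the question, it assumes a target of two digits, one separator character, and two more digits; that shape and the names below are assumptions for illustration only.

```python
import re

# Assumed shape: two digits, one separator character, two digits.
ANY_NON_SPACE = re.compile(r"\d{2}\S\d{2}")
PUNCT_OR_SYMBOL = re.compile(r"\d{2}(?:[^\w\s]|_)\d{2}")


def is_match(pattern, text):
    """Return True when the whole string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


for text in ["12\u201334", "12_34", "12a34", "12 34"]:
    print(repr(text), is_match(ANY_NON_SPACE, text), is_match(PUNCT_OR_SYMBOL, text))
```

Note that \S also accepts letters and digits as the separator, so it is the looser of the two; (?:[^\w\s]|_) restricts the separator to punctuation and symbols, with the underscore added back because \w would otherwise exclude it.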
