Select strange characters on text, not working with LIKE operator

左心房为你撑大大i submitted on 2020-06-29 06:44:32

Question


I tried this solution and this one (for str_eval()), but the text seems to use a different encoding or a different UTF-8 normalization form, perhaps with combining diacritical marks...

select distinct  logradouro, str_eval(logradouro)  
from logradouro where logradouro like '%CECi%';
--         logradouro         |          str_eval
------------------------------+----------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES

PROBLEM: how to select all rows of the table where the problem exists?
That is, where \u occurs?

  • it does not work with like '%CECi\u%' nor with like '%CECi\\u%'
  • it works with like E'%CECi\u008D%', but that is not generic
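
A possible reading of the two bullets above: since the E'...' pattern matches, the column probably stores the single code point U+008D (a C1 control character), and the "\u008D" in the output is only how it is displayed. There would then be no literal backslash or "u" for LIKE to find. Under that assumption, and assuming a UTF-8 database with standard_conforming_strings = on, a more generic sketch is a regular expression over the whole C1 range:

```sql
-- Assumption: the stored text holds real control code points (e.g. U+008D),
-- not the six literal characters \ u 0 0 8 D.
-- Assumption: UTF-8 database, standard_conforming_strings = on, so the
-- backslashes reach the regex engine unchanged.
SELECT DISTINCT logradouro
FROM logradouro
WHERE logradouro ~ '[\u0080-\u009F]';  -- any C1 control character
```

This only covers the C1 block; other unexpected characters need a wider range.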

Note for searchers, edited after the question was solved: this is a typical XY problem. In the original question (above) I started from a wrong hypothesis. All the solutions below answer the following (objective) question:

How to select only printable ASCII text?

"Printable ASCII" is a subset of UTF-8: it is "all ASCII that is not a 'control character'".

The "non-printable" control characters are Unicode hexadecimal 00 to 1F and 7F
(HTML entities &#x00; to &#x1F; plus &#x7F;, or decimal 0 to 31 plus 127).
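
As a quick check of these boundaries, the sketch below classifies a few sample code points with chr(). The sample values and the column names are chosen here for illustration only, and the pattern assumes standard_conforming_strings = on.

```sql
-- Classify sample code points as printable ASCII (U+0020 to U+007E) or not.
-- chr(0) is left out on purpose: PostgreSQL rejects it.
SELECT code,
       chr(code) ~ '^[\u0020-\u007E]$' AS printable
FROM (VALUES (9), (31), (32), (65), (126), (127)) AS t(code);
-- expected: 9 -> false, 31 -> false, 32 -> true,
--           65 -> true, 126 -> true, 127 -> false
```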

PS1: PostgreSQL text values cannot contain the zero character (&#x00;), so it does not need to be checked, but including it in the range does no harm.

PS2: about the secondary question "how to convert a word with an encoding bug to a valid word?",
see the heuristic in my answer.


Answer 1:


This condition is true for any string that does not consist entirely of printable ASCII characters, that is, any string containing at least one character outside the range U+0020 to U+007E:

logradouro ~ '[^\u0020-\u007E]'
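
A minimal sketch of how the condition might be used in both directions, reusing the table and column from the question and assuming standard_conforming_strings = on so that the backslash escapes reach the regex engine:

```sql
-- Rows that contain at least one character outside printable ASCII
SELECT DISTINCT logradouro
FROM logradouro
WHERE logradouro ~ '[^\u0020-\u007E]';

-- Rows that consist only of printable ASCII (the negated condition)
SELECT DISTINCT logradouro
FROM logradouro
WHERE logradouro !~ '[^\u0020-\u007E]';
```

Note that the first query also flags legitimate accented letters such as "í", because they lie outside the ASCII range.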



Answer 2:


Solving with workaround

select distinct  logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');

There is a systematic bug in the encoding, and no way to convert the text to correct UTF-8. Even after converting, the problem remains that "CECi\u008DLIA" is not "CECíLIA".

The solution is to use a kind of "heuristic spell corrector" on

regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')

Example: the i! of "Ceci!lia" is corrected to í.
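
A minimal sketch of that idea, not a general corrector. The single 'i!' to 'í' rule is the only correction taken from the example; the 'g' flag (replace every bad run, not just the first) and the inline sample row are assumptions added here, and the E'\u008D' literal assumes a UTF-8 database.

```sql
-- Step 1: mark every run of unexpected characters with '!'.
-- Step 2: apply one hypothetical correction rule, 'i!' -> 'í'.
SELECT s AS original,
       replace(
         regexp_replace(s, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!', 'g'),
         'i!', 'í'
       ) AS corrected
FROM (VALUES (E'AV CECi\u008DLIA MEIRELLES')) AS t(s);
-- corrected: AV CECíLIA MEIRELLES
```

A real dataset would need one such rule per systematic error, for example kept in a lookup table.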


NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.



Source: https://stackoverflow.com/questions/62416541/select-strange-characters-on-text-not-working-with-like-operator
