Select strange characters on text, not working with LIKE operator

Question

I tried this solution and this one (for str_eval()), but the data seems to use a different encoding or another UTF-8 normalization form, perhaps with combining diacritical marks...

select distinct  logradouro, str_eval(logradouro)  
from logradouro where logradouro like '%CECi%';
--         logradouro         |          str_eval
------------------------------+----------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES

PROBLEM: how to select all rows of the table where the problem exists?
That is, where \u occurs?

It does not work with like '%CECi\u%' nor with like '%CECi\\u%'.
It works with like E'%CECi\u008D%', but that is not generic.

For Google, edited after the question was solved: this is a typical XY problem. In the original question (above) I started from a wrong hypothesis. All the solutions below are answers to the following (objective) question:

How to select only printable ASCII text?

"Printable ASCII" is a subset of UTF-8: it is "all ASCII that is not a 'control character'".

The "non-printable" control characters are Unicode hexadecimal 00 to 1F plus 7F
(the same code points written as HTML numeric entities, or decimal 0 to 31 plus 127).

PS1: the zero character (hexadecimal 00) cannot be stored in a PostgreSQL text value, so it does not need to be checked, but there is no harm in including it in the range.

PS2: for the secondary question, "how to convert a word with an encoding bug into a valid word?",
see the heuristic in my answer.

Answer 1:

This condition is true for any string that does not consist entirely of printable ASCII characters, so it selects the problem rows (negate it with !~ to exclude them instead):

logradouro ~ '[^\u0020-\u007E]'
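To see which strings this character class flags, here is a minimal Python sketch of the same test outside the database. The function name has_non_printable and the sample rows are assumptions made for illustration; only the character class comes from the condition above.

```python
import re

# Same character class as the SQL condition: any character outside
# the printable ASCII range U+0020..U+007E.
NON_PRINTABLE_ASCII = re.compile(r"[^\u0020-\u007E]")


def has_non_printable(text: str) -> bool:
    """Return True if text contains a character outside printable ASCII."""
    return NON_PRINTABLE_ASCII.search(text) is not None


# Hypothetical sample rows: one clean, one with U+008D, one with a tab.
rows = ["AV CECILIA MEIRELLES", "AV CECi\u008DLIA MEIRELLES", "RUA A\tB"]

# Keep only the rows where the problem exists.
problem_rows = [r for r in rows if has_non_printable(r)]
```

Note that the check flags legitimate accented letters too (for example "í"), since they are outside ASCII. That is fine for this table only if valid values are expected to be plain ASCII.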

Answer 2:

Solving with a workaround

select distinct  logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');

There is a systematic encoding bug, and no way to convert the text to correct UTF-8... Even after converting, the problem remains that "CECi\u008DLIA" is not "CECíLIA".

The solution is to use a kind of "heuristic spell corrector" on the output of

regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')

Example: the i! of "Ceci!lia" is corrected to í.
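The two steps (mark the unexpected characters, then correct the marked word) can be sketched in Python as below. This is only an illustration under stated assumptions: the CORRECTIONS table and the function names are hypothetical, and the single rule "i!" to "í" is taken from the example above rather than from a real trained corrector. Unlike regexp_replace without the 'g' flag, re.sub marks every run, not just the first one.

```python
import re

# Whitelist taken from the regexp_replace call; any run of characters
# outside it is collapsed into a single "!" marker.
UNEXPECTED = re.compile(r"""[^a-zA-Z0-9_,;\- .()/"'*]+""")

# Hypothetical correction table for ONE systematic error source.
# A real table would have to be built from the affected dataset.
CORRECTIONS = {"i!": "í"}


def mark_unexpected(text: str) -> str:
    """Replace each run of non-whitelisted characters with '!'."""
    return UNEXPECTED.sub("!", text)


def correct(text: str) -> str:
    """Mark unexpected characters, then apply the known corrections."""
    marked = mark_unexpected(text)
    for wrong, right in CORRECTIONS.items():
        marked = marked.replace(wrong, right)
    return marked
```

For instance, correct("AV CECi\u008DLIA MEIRELLES") first yields the marked form "AV CECi!LIA MEIRELLES" and then applies the "i!" rule. Any marker not covered by the table stays as "!" and remains visible for manual review.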

NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.

Source: https://stackoverflow.com/questions/62416541/select-strange-characters-on-text-not-working-with-like-operator

Tags

postgresql

encode

detection