问题
I try to use this solution and this (for str_eval()
) but seems other encode or other UTF8's Normalization Form, perhaps combining diacritical marks...
select distinct logradouro, str_eval(logradouro)
from logradouro where logradouro like '%CECi%';
-- logradouro | str_eval
------------------------------+----------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES
PROBLEM: how to select all rows of the table where the problem exists?
That is, where \u
occurs?
- not works with
like '%CECi\u%'
neitherlike '%CECi\\u%'
- works with
like E'%CECi\u008D%'
but is not generic
For Google, edited after solved question: this is a typical XY problem. In the original question (above) I used ~wrong hypothesis. All the solutions bellow are answers to the following (objective) question:
How to select only printable ASCII text?
"Printable ASCII" is a subset of UTF8, it is "all ASCII that is not a 'control character'".
The "non-printable" control characters are UNICODE hexadecimal 00 to 1F and 7F
(HTML entity �
to 
+ 
or decimal 0 to 31 + 127).
PS1: the zero (�
) is the "end of text" mark of PostgreSQL text datatype internal representation, so not need to be checked, but no problems to include it in the range.
PS2: about the secondary question "how to convert a word with encode bug to a valid word?",
see an heuristic at my answer.
回答1:
This condition will exclude any strings that do not entirely consist of printable ASCII characters:
logradouro ~ '[^\u0020-\u007E]'
回答2:
Solving with workaround
select distinct logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');
There is a systematic bug on encode, no way to convert to correct UTF8... Even converting, the problem is that "CECi\u008DLIA" is not "CECíLIA".
The solution is to use a kind of "heuristic spell corrector" on
regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')
Example: the i!
of "Ceci!lia" is corrected to í
.
NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.
来源:https://stackoverflow.com/questions/62416541/select-strange-characters-on-text-not-working-with-like-operator