Filter strings with regex before casting to numeric

倖福魔咒の 提交于 2019-12-23 23:08:22

问题


I have this code (was already there, isn't mine):

SELECT
    a.id_original_contrato AS contrato,
    ( CASE WHEN d.value~'^\\d+$' THEN d.value::integer ELSE 0 END ) AS monto,
    EXTRACT(YEAR FROM b.value)::integer AS anoinicio,
    EXTRACT(YEAR FROM c.value)::integer AS anofin

... etc (some JOIN's and WHERE's)

Let me explain: d.value comes from a table where value is character varying (200). The code will insert later the d.value (now called 'monto') in another table as a integer. Someone coded that regex in order to extract some chars or in other case (ELSE), define it as 0. Those values works when they are integer only. If I use a d.value like 76.44 it doesn't work due that regex, it always define it as 0.

Well, I have to change that code, because:

  • I need to store the d.value in the new table as numeric, not as integer anymore (In my new table the data type is numeric now)
  • But first, I need to correct that regex, because is messing my numeric numbers like 76.44, or 66,56 (dot or coma).

I'm not sure what that regex is doing. And how could I accomplish what need with a better or new regex?


回答1:


You should declare your version of Postgres and also the version that was used when the code was written (if you know that). The double backslash in \\d suggests an old version with standard_conforming_strings = off. The manual:

Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).

In modern versions with standard_conforming_strings = on, this string makes little sense as regular expression: '^\\d+$'. To detect strings consisting of one or more digits use either E'^\\d+$' (prefixed with E) or '^\d+$'. Details:

  • Insert text with single quotes in PostgreSQL

Integer literals also allow an optional leading sign for negative / positive numbers. And leading / dangling white space is also allowed (trimmed automatically) in Postgres.
So, this is the complete regular expression for integer:

CASE WHEN d.value ~ '^\s*[-+]?\d+\s*$' THEN d.value::int ELSE 0 END

The regular expression explained:

^ .. start of string
\s .. class shorthand for [[:space:]] (white space)
* .. quantifier for 0 or more times
[+-] .. character class consisting of + and -
? .. quantifier for 0 or 1 times
\d .. class shorthand for [[:digit:]] (digits)
+ .. quantifier for 1 or more times
\s* .. same as above
$ .. end of string

Now we know the basics. Read more in the manual I linked to. Consider the syntax rules for numeric string literals. And, while the states about legal numeric constants:

There cannot be any spaces or other characters embedded in the constant

That's because a numeric constant is not quoted, hence white space is not possible. Not applicable for casting strings. White space is tolerated: Leading, trailing and right after the exponent char.

So these are all legal strings for the cast to numeric:

'^\s*[-+]?\d*\.?\d+(?:[eE]\s*[-+]?\d+)?\s*$'

The only new element are parentheses (()) to denote the contained regular expression as atom. Since we are not interested in back references, use "non-capturing": (?:...) and append a question mark (?:[eE]\s*[-+]?\d+)? to mean: the "exponential" part can be added or not, as a whole.

Assuming dot (.) as decimal separator. You might use comma instead (,) or [,\.] to allow either. But only dot is legal for the cast.

Test:

SELECT '|' || txt || '|' As text_with_delim
     , txt ~ '^\s*[-+]?\d*\.?\d+([eE]\s*[-+]?\d+)?\s*$' As test
     , txt::numeric AS number
FROM   unnest ('{1, 123, 000, "  -1     ", +2, 1.2, .34, 5e6, " .5e   -6  "}'::text[]) txt;

Result:

 text_with_delim | test |  number
-----------------+------+-----------
 |1|             | t    |         1
 |123|           | t    |       123
 |000|           | t    |         0
 |  -1     |     | t    |        -1
 |+2|            | t    |         2
 |1.2|           | t    |       1.2
 |.34|           | t    |      0.34
 |5e6|           | t    |   5000000
 | .5e   -6  |   | t    | 0.0000005

Or you might have use for to_number() to convert strings of arbitrary given format.




回答2:


Choose a variant:

with v(value) as (
    values
    ('12,3'),
    ('12.3'),
    ('123'),
    ('123.'),
    ('.123'),
    ('1.2.3')
    )

select 
    value, 
    value ~ '^(\d+[,\.]\d+|\d+)$' as variant_a,
    value ~ '^(\d*[,\.]\d*|\d+)$' as variant_b,
    value ~ '^\d+[,\.]\d+$' as variant_c
from v;

 value | variant_a | variant_b | variant_c 
-------+-----------+-----------+-----------
 12,3  | t         | t         | t
 12.3  | t         | t         | t
 123   | t         | t         | f
 123.  | f         | t         | f
 .123  | f         | t         | f
 1.2.3 | f         | f         | f
(6 rows)

To convert a string with a dot or a comma to numeric use replace():

select replace(value, ',', '.')::numeric;   


来源:https://stackoverflow.com/questions/42214084/filter-strings-with-regex-before-casting-to-numeric

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!