hive regexp_extract weirdness

清歌不尽 2020-11-29 06:32

I am having some problems with regexp_extract:

I am querying a tab-delimited file; the column I'm checking has strings that look like this:

abc.d         


        
2 Answers
  • 2020-11-29 06:37

    From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), it appears that regexp_extract() is applied to each record/line and returns the part of the string that the pattern matches.

    It seems to stop at the first match rather than matching globally. The index therefore refers to a capture group within that first match, not to the nth match.

    0 = the entire match
    1 = capture group 1
    2 = capture group 2, etc ...

    Paraphrased from the manual:

    regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                      ^    ^   
                   groups             1    2
    
    This returns 'bar'.
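
The group numbering above can be checked outside Hive. This is a minimal Java sketch, assuming regexp_extract hands the pattern to java.util.regex and uses the first match only; the class GroupIndexDemo and the helper regexpExtract are hypothetical names used for illustration, not Hive's API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Hypothetical stand-in for regexp_extract(subject, pattern, index):
    // find the first match only and return the requested capture group.
    // Returning null when nothing matches is this sketch's choice,
    // not a statement about what Hive returns.
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String re = "foo(.*?)(bar)";
        System.out.println(regexpExtract("foothebar", re, 0)); // entire match
        System.out.println(regexpExtract("foothebar", re, 1)); // capture group 1
        System.out.println(regexpExtract("foothebar", re, 2)); // capture group 2
    }
}
```

Run as written, this prints foothebar, the, and bar on separate lines.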
    

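    So, in your case, to get the text after the dot, something like this might work (the backslash is doubled because a Hive string literal treats a single backslash as an escape character):
    regexp_extract(name, '\\.([^.]+)', 1)
    or this, which avoids the backslash altogether:
    regexp_extract(name, '[.]([^.]+)', 1)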
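
A small Java sketch of the two patterns, again assuming Java-style regex behind regexp_extract. It also shows what the engine does if the backslash is lost before the pattern reaches it, which can happen in Hive because string literals process backslash escapes (the Hive manual warns that '\s' ends up matching the letter s). DotEscapeDemo and firstGroup are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Return capture group 1 of the first match, or null if there is no match.
    static String firstGroup(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Java source also needs "\\." to produce the regex \.
        System.out.println(firstGroup("abc.d", "\\.([^.]+)")); // literal dot, then capture
        System.out.println(firstGroup("abc.d", "[.]([^.]+)")); // same idea, no backslash needed
        // If the backslash is dropped, the leading . matches any character.
        System.out.println(firstGroup("abc.d", ".([^.]+)"));
    }
}
```

The first two calls print d. The third prints bc, because the unescaped dot consumes the a and the group then captures up to the real dot.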

    Edit:

    I got interested in this again. Just FYI, there could be a shortcut/workaround for you.

    It looks like you want a particular dot-separated segment, which is almost like a split.
    It's more than likely that the regex engine overwrites a capture group each time a quantifier repeats it.
    You can take advantage of that with something like this:

    (Backslashes are doubled here because they sit inside a Hive string literal.)

    Returns the first segment (abc) of abc.def.ghi:
    regexp_extract(name, '^(?:([^.]+)\\.?){1}', 1)

    Returns the second segment (def) of abc.def.ghi:
    regexp_extract(name, '^(?:([^.]+)\\.?){2}', 1)

    Returns the third segment (ghi) of abc.def.ghi:
    regexp_extract(name, '^(?:([^.]+)\\.?){3}', 1)

    The index doesn't change (it still refers to capture group 1); only the repetition count in the regex changes.
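
The repetition trick can be tried directly in Java, which overwrites a capture group on each iteration of a quantified group. This is a sketch under the assumption that Hive behaves the same way; SegmentDemo and nthSegment are hypothetical names, and n is assumed to be at least 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentDemo {
    // Build ^(?:([^.]+)\.?){n} and return capture group 1, which holds
    // whatever the last (nth) iteration captured. Returns null on no match.
    static String nthSegment(String name, int n) {
        String regex = "^(?:([^.]+)\\.?){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " -> " + nthSegment("abc.def.ghi", n));
        }
    }
}
```

For abc.def.ghi this prints 1 -> abc, 2 -> def, and 3 -> ghi.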

    Some notes:

    • This regex, ^(?:([^.]+)\.?){n}, has problems though.
      It requires at least one character between the dots of each segment, or the regex won't match.

    • It could be ^(?:([^.]*)\.?){n} instead, but this will match even if there are fewer than n-1 dots,
      including the empty string. This is probably not desirable.

    There is a way to do it where it doesn't require text between the dots, but still requires at least n-1 dots.
    This uses a lookahead assertion and capture buffer 2 as a flag.

    ^(?:(?!\2)([^.]*)(?:\.|$())){2}

    Everything else is the same. So, if Hive uses Java-style regex, this should work (backslashes doubled for the Hive string literal):
    regexp_extract(name, '^(?:(?!\\2)([^.]*)(?:\\.|$())){2}', 1)

    Change {2} to whichever segment is needed (this one returns segment 2). It still returns capture buffer 1, as set by the Nth iteration.

    Here it is broken down:

    ^                # Beginning of string
     (?:             # Grouping
        (?!\2)            # Assertion: Capture buffer 2 is UNDEFINED
        ( [^.]*)          # Capture buffer 1, optional non-dot chars, many times
        (?:               # Grouping
            \.                # Dot character
          |                 # or,
            $ ()              # End of string, set capture buffer 2 DEFINED (prevents recursion when end of string)
        )                 # End grouping
     ){3}            # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)
    

    If the engine doesn't support lookahead assertions, this won't work!
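
The lookahead variant can be exercised in Java, where a backreference to a group that has not yet participated fails to match, so (?!\2) succeeds until the empty group 2 is set at end of string. This is a sketch assuming Hive uses the same engine; LookaheadSegmentDemo and nthSegment are hypothetical names, and n is assumed to be at least 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSegmentDemo {
    // Build ^(?:(?!\2)([^.]*)(?:\.|$())){n} and return capture group 1.
    // Group 2 acts as a flag: once it is set at end of string,
    // the lookahead blocks any further iteration.
    // Returns null when the pattern does not match.
    static String nthSegment(String name, int n) {
        String regex = "^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nthSegment("abc.def.ghi", 2)); // second segment
        System.out.println(nthSegment("abc.def.ghi", 3)); // third segment
        System.out.println("[" + nthSegment("a..c", 2) + "]"); // empty segment between dots
        System.out.println(nthSegment("abc.def.ghi", 4)); // only 3 segments exist
    }
}
```

This prints def, ghi, [] and null: an empty segment between two dots is accepted, while asking for a fourth segment of a three-segment string does not match.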

  • 2020-11-29 06:44

    I think you have to use capture groups, no?

    select distinct regexp_extract(name, '([^.]+)', 1) from dummy;
    

    (untested)

    I think it behaves like the java library and this should work, let me know though.
