hive regexp_extract weirdness


I am having some problems with regexp_extract:

I am querying a tab-delimited file. The column I'm checking has strings that look like this:

abc.d


                      
2 answers

傲寒, 2020-11-29 06:37

From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), it appears that regexp_extract() works per record/line: for each value it extracts the part of the string that the pattern matches.

It seems to stop at the first match rather than matching globally, so the index refers to a capture group, not to the Nth match.

0 = the entire match

1 = capture group 1

2 = capture group 2, etc ...  

Paraphrased from the manual: 

regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                  ^    ^   
               groups             1    2

This returns 'bar'.
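
To see the group indexing concretely, here is a small Java sketch. It assumes that regexp_extract behaves like java.util.regex: find the first match, then return the requested group. The class and method names (GroupIndexSketch, extract) are made up for illustration and are not part of Hive, and returning null when nothing matches is a choice of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Hive code: "find the first match, return one group".
class GroupIndexSketch {
    // Returns the requested group of the first match, or null if nothing matches.
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String re = "foo(.*?)(bar)";
        System.out.println(extract("foothebar", re, 0)); // foothebar (entire match)
        System.out.println(extract("foothebar", re, 1)); // the (capture group 1)
        System.out.println(extract("foothebar", re, 2)); // bar (capture group 2)
    }
}
```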


So, in your case, to get the text after the dot, something like this might work:

regexp_extract(name, '\.([^.]+)', 1)

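or this, which avoids the backslash (inside a Hive string literal the backslash may itself need escaping, i.e. '\\.'):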

regexp_extract(name, '[.]([^.]+)', 1)  
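
A quick way to check these two patterns outside Hive is a Java sketch like the one below, assuming the engine is Java-style. The sample value abc.def.ghi and the names AfterDotSketch and extract are illustrative only. The third call shows what the pattern does if the backslash is stripped before it reaches the regex engine, which is one possible source of surprising results.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Hive code.
class AfterDotSketch {
    // Returns the requested group of the first match, or null if nothing matches.
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        // Backslashes are doubled for Java source; the regex itself is \.([^.]+)
        System.out.println(extract(name, "\\.([^.]+)", 1)); // def
        System.out.println(extract(name, "[.]([^.]+)", 1)); // def
        // Without the backslash, "." matches any character ('a' here):
        System.out.println(extract(name, ".([^.]+)", 1));   // bc
    }
}
```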

Edit:

I got interested in this again. Just FYI, there may be a shortcut/workaround for you.

It looks like you want a particular dot-separated segment, which is almost like a split.

More than likely, the regex engine overwrites a capture group each time that group matches again inside a repeated (quantified) construct.

You can take advantage of that with something like this:

Returns the first segment of abc.def.ghi (abc):

regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)

Returns the second segment of abc.def.ghi (def):

regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)

Returns the third segment of abc.def.ghi (ghi):

regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)

The index doesn't change (it still refers to capture group 1); only the repetition count in the regex changes.
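
The overwrite-on-repeat behaviour can be tried directly in Java, assuming Hive's engine is Java-style. In this sketch the names SegmentSketch and segment are hypothetical; the method only builds the regex above with the repetition count filled in and returns capture group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Hive code.
class SegmentSketch {
    // Builds ^(?:([^.]+)\.?){n} and returns group 1 of the first match, or null if none.
    static String segment(String subject, int n) {
        String regex = "^(?:([^.]+)\\.?){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        // Group 1 holds whatever it captured on the last (n-th) iteration.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " -> " + segment("abc.def.ghi", n));
        }
        // 1 -> abc
        // 2 -> def
        // 3 -> ghi
    }
}
```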

Some notes:


The regex ^(?:([^.]+)\.?){n} has problems, though.

It requires some text between the dots in every segment, or the regex won't match as intended.
It could be changed to ^(?:([^.]*)\.?){n}, but that will match even if there are fewer than n-1 dots,

including the empty string. This is probably not desirable.
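
These problems can be reproduced with a short Java sketch, again assuming a Java-style engine. The names SegmentPitfallsSketch and segment are hypothetical, and the inputs are made up. Note the second call: with a backtracking engine, too few segments does not always mean "no match", because the engine can split an earlier segment to satisfy the repetition count.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Hive code: compares the "+" and "*" variants of the segment regex.
class SegmentPitfallsSketch {
    // Builds ^(?:([^.]<quant>)\.?){n} and returns group 1 of the first match, or null if none.
    static String segment(String subject, String quant, int n) {
        String regex = "^(?:([^.]" + quant + ")\\.?){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(segment("a..b", "+", 2));        // null: empty segment, no match
        System.out.println(segment("abc", "+", 2));         // c: backtracking splits "abc"
        System.out.println(segment("abc", "*", 3) != null); // true: matches with no dots at all
        System.out.println(segment("", "*", 3) != null);    // true: even the empty string matches
    }
}
```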


There is a way to do it that doesn't require text between the dots but still requires at least n-1 dots.

It uses a lookahead assertion, with capture buffer 2 acting as a flag:

^(?:(?!\2)([^.]*)(?:\.|$())){2}

Everything else is the same.

So, if Hive uses Java-style regex, this should work:

regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)

Change {2} to whichever segment is needed (this one returns segment 2). It still returns capture buffer 1 as set by the Nth iteration.

Here it is broken down

^                # Beginning of string
 (?:             # Grouping
    (?!\2)            # Assertion: capture buffer 2 is UNDEFINED
    ( [^.]*)          # Capture buffer 1: zero or more non-dot chars
    (?:               # Grouping
        \.                # Dot character
      |                 # or,
        $ ()              # End of string; sets capture buffer 2 to DEFINED (stops further iterations once the end of string is reached)
    )                 # End grouping
 ){2}            # End grouping; repeat exactly 2 (or N) times (overwrites capture buffer 1 each time)


If the engine doesn't support lookahead assertions, this won't work!
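
As a rough check of the flag technique, here is a Java sketch under the assumption that Hive's engine behaves like java.util.regex (where a backreference to a group that has not matched yet simply fails, so the negative lookahead succeeds). The names FlagSegmentSketch and segment are hypothetical, the inputs are made up, and returning null on no match is a choice of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Hive code.
class FlagSegmentSketch {
    // Builds ^(?:(?!\2)([^.]*)(?:\.|$())){n} and returns group 1, or null if there is no match.
    static String segment(String subject, int n) {
        // Backslashes are doubled for Java source.
        String regex = "^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(segment("abc.def.ghi", 2)); // def
        System.out.println(segment("abc.def.ghi", 3)); // ghi
        System.out.println(segment("abc.def.ghi", 4)); // null: only three segments
        System.out.println("[" + segment("a..c", 2) + "]"); // []: empty second segment is allowed
    }
}
```

Only these simple inputs were considered here; edge cases such as a trailing dot may behave differently between regex engines.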
                                                                        
                                                        
            
            
              
                
离开以前, 2020-11-29 06:44

I think you have to use capture groups, no?

select distinct regexp_extract(name, '([^.]+)', 1) from dummy;


(untested)

I think it behaves like the Java regex library, so this should work. Let me know, though.
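
If it does behave like the Java library, the pattern above can be checked with a sketch like this one. The names FirstSegmentSketch and extract and the sample value are illustrative only. Because only the first match is used, this pattern yields the first dot-separated segment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Hive code.
class FirstSegmentSketch {
    // Returns the requested group of the first match, or null if nothing matches.
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("abc.def.ghi", "([^.]+)", 1)); // abc
    }
}
```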
                                                                        
                                                        
            
            
              
                