Transform bag of key-value tuples to map in Apache Pig


I am new to Pig and I want to convert a bag of tuples to a map, using a specific value in each tuple as the key. Basically I want to change:

{(id1, value1),(id2, value


                      
2 answers

情话喂你, 2021-01-02 12:08

I ran into the same situation so I submitted a patch that just got accepted: https://issues.apache.org/jira/browse/PIG-4638

This means that what you want is part of core Pig starting with version 0.16.
                                                                        
                                                        
            
            
              
                
天命终不由人, 2021-01-02 12:18

TOMAP takes a series of key/value pairs and converts them into a map, so it is meant to be used like:

-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data:     (John,          27,      Joe,            30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data:     (John#27,Joe#30)


So, as you can see, the syntax does not support converting a bag to a map. As far as I know, there is no way to convert a bag in the format you have into a map in pure Pig. However, you can definitely write a Java UDF to do this.

NOTE: I'm not too experienced with Java, so this UDF can easily be improved (adding exception handling, deciding what happens if a key is added twice, etc.). However, it does accomplish what you need.

package myudfs;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Converts a bag of (key, value) tuples into a Pig map.
public class ConvertToMap extends EvalFunc<Map>
{
    @Override
    public Map exec(Tuple input) throws IOException {
        // Return null for missing input instead of failing the task.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag values = (DataBag) input.get(0);
        Map<Object, Object> m = new HashMap<Object, Object>();
        // Each tuple is (key, value); a repeated key overwrites the earlier value.
        for (Tuple t : values) {
            m.put(t.get(0), t.get(1));
        }
        return m;
    }
}


Once you compile the class into a jar, it can be used like:

REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;


Contents of foo.in:

{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}


Output from B:

([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])




Another approach is to use Python to create the UDF:

myudfs.py

#!/usr/bin/python

@outputSchema("foo:map[]")
def BagtoMap(bag):
    d = {}
    for key, value in bag:
        d[key] = value
    return d


Which is used like this:

Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;


And produces the same output as the Java UDF.
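
The note above mentions duplicate keys and error handling as open points. Here is a minimal sketch of one way to handle them in the Python UDF. BagtoMapStrict is a hypothetical name, not part of the original answer, and raising on a duplicate key is just one possible policy. The fallback decorator is an assumption made so the file can also be imported outside Pig for a quick local check; it assumes Pig's Jython engine already provides outputSchema, in which case the fallback is skipped.

```python
# Fallback so the file also imports outside Pig (assumption: under Pig's
# Jython engine, outputSchema is already defined and this branch is skipped).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("foo:map[]")
def BagtoMapStrict(bag):
    # Hypothetical variant of BagtoMap: returns None for a null bag,
    # skips tuples with a null key, and rejects duplicate keys.
    if bag is None:
        return None
    d = {}
    for key, value in bag:
        if key is None:
            continue
        if key in d:
            raise ValueError("duplicate key: %r" % (key,))
        d[key] = value
    return d


if __name__ == "__main__":
    # Local check with the first row of foo.in, written as Python tuples.
    print(BagtoMapStrict([("open", "apache"), ("apache", "hadoop")]))
```

If you would rather keep the last value for a repeated key, drop the duplicate check; that matches what the original BagtoMap does.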



BONUS:
Since I don't like maps very much, note that the functionality of a map can be replicated with just key-value pairs. Since your key-value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:

-- A's schema contains kv_pairs, a bag in the form {(key, value)}
B = FOREACH A {
    temp = FOREACH kv_pairs GENERATE (key=='foo'?value:NULL) ;
    -- Output is like: ({(),(thevalue),(),()})

    -- MAX will pull the maximum value from the filtered bag, which is 
    -- value (the chararray) if the key matched. Otherwise it will return NULL.
    GENERATE MAX(temp) as kv_pairs_filtered ;
}
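
To make the logic of the nested FOREACH easier to follow, here is a rough plain-Python emulation of what it computes for a single row. The names lookup and wanted are hypothetical, and the sketch assumes that MAX ignores nulls and returns null when every element is null.

```python
def lookup(kv_pairs, wanted):
    # Mirrors: temp = FOREACH kv_pairs GENERATE (key == wanted ? value : NULL)
    temp = [value if key == wanted else None for key, value in kv_pairs]
    # Mirrors MAX(temp): nulls are ignored; an all-null bag gives null.
    matches = [v for v in temp if v is not None]
    return max(matches) if matches else None


if __name__ == "__main__":
    row = [("foo", "bar"), ("bar", "foo"), ("open", "what")]
    print(lookup(row, "foo"))
    print(lookup(row, "missing"))
```

One consequence of this design: if the same key appears more than once in the bag, the lookup returns the largest of its values, not the first or the last one.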

                                                                        
                                                        
            
            
              
                