I am trying to generate a string kernel that feeds a support vector classifier. I tried it with a function that calculates the kernel, something like this (where string_similarity stands in for whatever similarity measure I use between two strings):

def string_kernel(X, Y):
    R = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            R[i, j] = string_similarity(x, y)
    return R

but when I pass the raw strings as samples, clf.fit fails because scikit-learn tries to convert the strings to floats.
This is a limitation in scikit-learn that has proved hard to get rid of. You can try this workaround: represent the strings as feature vectors with only one feature, which is really just an index into the table of strings.
>>> import numpy as np
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
Redefine the string kernel function to work on this representation:
>>> def string_kernel(X, Y):
...     # X and Y hold indices into data, not the strings themselves
...     R = np.zeros((len(X), len(Y)))
...     for i, x in enumerate(X):
...         for j, y in enumerate(Y):
...             # look up the actual strings by their index
...             a = data[int(x[0])]
...             b = data[int(y[0])]
...             # simplest kernel ever: do the first characters match?
...             R[i, j] = a[0] == b[0]
...     return R
...
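As a quick sanity check (illustrative output from a typical NumPy session), evaluating the kernel on the training set gives the Gram matrix: "foo" matches only itself on the first character, while "bar" and "baz" match each other.
>>> string_kernel(X, X)
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  1.],
       [ 0.,  1.,  1.]])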
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)
The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.
>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'],
      dtype='|S3')
(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)
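For concreteness, here's one way that variant might look. This is just a sketch, assuming data still holds only the three training strings; test_data, lookup and string_kernel2 are names I'm making up here:

test_data = ["bla", "fool"]  # separate table for unseen samples

def lookup(i):
    # indices below len(data) are training strings; the rest live in test_data
    return data[i] if i < len(data) else test_data[i - len(data)]

def string_kernel2(X, Y):
    # same first-character kernel, but it can see strings that were
    # never appended to the training table
    R = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            R[i, j] = lookup(int(x[0]))[0] == lookup(int(y[0]))[0]
    return R

The k-th test string then gets the pseudo-feature vector [len(data) + k].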
This is an ugly hack, but it works (it's slightly less ugly for clustering, because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I'd say a patch to fix this properly would be welcome.
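To illustrate the clustering remark: since every sample is known up front, you can just precompute the Gram matrix once and never touch the string table again. A sketch using scikit-learn's SpectralClustering (affinity='precomputed' is a real option; the rest reuses the names from above):
>>> from sklearn.cluster import SpectralClustering
>>> K = string_kernel(X, X)  # Gram matrix over the full, fixed dataset
>>> labels = SpectralClustering(n_clusters=2, affinity='precomputed').fit_predict(K)
Given the block structure of K above, "bar" and "baz" should end up in the same cluster, with "foo" on its own.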
I think the Shogun library could be the solution; it is also free and open source. I suggest reviewing the string kernels in this directory: https://github.com/shogun-toolbox/shogun/tree/develop/src/shogun/kernel/string