How to convert a mixed-type Matrix to DataFrame in Julia recognising the column types

前端未结


One nice feature of DataFrames is that it can store columns of different types and "auto-recognise" them, e.g.:

using DataFrames, DataStructures

df1


                      
4 Answers

死守一世寂寞
2021-01-26 03:08

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))


Seems to work and is faster than @dan-getz's answer (at least for this data matrix) :)
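
To see why the one-liner recovers the column types, here is a minimal sketch in Base Julia only (the small `mat` below is made-up illustration data, not the matrix from the question). Splatting a slice into a vector literal rebuilds the vector from its individual elements, so Julia picks the element type from the values rather than from the `Any` container:

```julia
# A mixed String/Float64 literal is stored as a Matrix{Any}.
mat = ["name" "value";
       "a"    1.5;
       "b"    2.5]

raw = mat[2:end, 2]      # slicing keeps the container's element type
narrowed = [raw...]      # same as writing [1.5, 2.5]

println(eltype(raw))       # Any
println(eltype(narrowed))  # Float64
```

Two caveats: a column mixing integers and floats gets promoted to floats by the vector literal, and splatting a very long column can be slow, so this is best suited to small tables like the one here.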

using DataFrames, BenchmarkTools

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))

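function mat2dfDan(mat)
    # Serialize the matrix to tab-separated text, one line per row.
    s = join([join([mat[i,j] for j in indices(mat, 2)], '\t') 
                for i in indices(mat, 1)],'\n')

    # Let DataFrames parse the text and guess the column types.
    DataFrames.inlinetable(s; separator='\t', header=true)
end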



julia> @benchmark mat2df(dataMatrix)

BenchmarkTools.Trial: 
  memory estimate:  5.05 KiB
  allocs estimate:  75
  --------------
  minimum time:     18.601 μs (0.00% GC)
  median time:      21.318 μs (0.00% GC)
  mean time:        31.773 μs (2.50% GC)
  maximum time:     4.287 ms (95.32% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mat2dfDan(dataMatrix)

BenchmarkTools.Trial: 
  memory estimate:  17.55 KiB
  allocs estimate:  318
  --------------
  minimum time:     69.183 μs (0.00% GC)
  median time:      81.326 μs (0.00% GC)
  mean time:        90.284 μs (2.97% GC)
  maximum time:     5.565 ms (93.72% GC)
  --------------
  samples:          10000
  evals/sample:     1

                                                                        
                                                        
            
            
              
                
心在旅途
2021-01-26 03:09

While I didn't find a complete solution, a partial one is to try to convert the individual columns ex-post:

"""
    convertDf!(df)

Try to convert each column of the converted df from Any to Int64, Float64 or String (in that order).
"""
function convertDf!(df)
    for c in names(df)
        try
            df[c] = convert(DataArrays.DataArray{Int64,1}, df[c])
        catch
            try
                df[c] = convert(DataArrays.DataArray{Float64,1}, df[c])
            catch
                try
                    df[c] = convert(DataArrays.DataArray{String,1}, df[c])
                catch
                    # no conversion applies: leave the column as Any
                end
            end
        end
    end
end


While surely incomplete, it is enough for my needs.
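
For readers on a recent DataFrames release (where `DataArrays` is no longer used), a rough equivalent of the same ex-post idea might look like the sketch below. The helper name `narrow_columns!` and the sample data are hypothetical; the sketch assumes the `df[!, col]` indexing syntax and relies on `identity.(v)` rebuilding a vector with the tightest element type that fits its values.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames: re-collect every column
# so that a Vector{Any} holding only Strings becomes a Vector{String}, etc.
function narrow_columns!(df::AbstractDataFrame)
    for name in names(df)
        df[!, name] = identity.(df[!, name])
    end
    return df
end

# Illustration data: both columns start out as Vector{Any}.
df = DataFrame(region = Any["AL", "AQ"], value = Any[3.3, 5.8])
narrow_columns!(df)

println(eltype(df.region))  # String
println(eltype(df.value))   # Float64
```

Unlike `convertDf!`, this never changes values: a column holding `1.0` and `2.0` stays Float64 rather than being turned into Int64.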
                                                                        
                                                        
            
            
              
                
忘掉有多难
2021-01-26 03:10

While I think there may be a better way to go about the whole thing, this should do what you want.

df = DataFrame()
for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them.
    # check all types the same else assign to Any
    T = typeof(dataMatrix[2,ind])
    T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any
    # convert to type of second element then add to data frame
    df[s] = T.(dataMatrix[2:end,ind])
end

                                                                        
                                                        
            
            
              
                
你的背包
2021-01-26 03:27

Another method would be to reuse the working solution, i.e. convert the matrix into a string that DataFrames can parse. In code, this is:

using DataFrames

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

s = join(
  [join([dataMatrix[i,j] for j in indices(dataMatrix, 2)]
  , '\t') for i in indices(dataMatrix, 1)], '\n')

df = DataFrames.inlinetable(s; separator='\t', header=true)


The resulting df has its column types guessed by DataFrames.
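
The code above targets older versions: `indices` was renamed `axes` in Julia 1.0, and as far as I know recent DataFrames releases no longer ship `inlinetable`. A sketch of the same string round-trip on a current setup, assuming the separate CSV.jl package is installed, could be:

```julia
using CSV, DataFrames

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

# One tab-separated line per matrix row, header row included.
s = join((join(row, '\t') for row in eachrow(dataMatrix)), '\n')

# Parse the text from an in-memory buffer; CSV.jl infers the column types.
df = CSV.read(IOBuffer(s), DataFrame; delim='\t')

println(eltype(df.value))
```

The `value` column should come back as Float64. Depending on the CSV.jl version, the text columns may use a string type other than `String`.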

Unrelated, but this answer reminds me of the joke about how a mathematician boils water.
                                                                        
                                                        
            
            
              
                