How to convert a mixed-type Matrix to DataFrame in Julia recognising the column types

深忆病人 2021-01-26 02:50

One nice feature of DataFrames is that it can store columns with different types and it can "auto-recognise" them, e.g.:

using DataFrames, DataStructures

df1          


        
4 answers
  • 2021-01-26 03:08
    mat2df(mat) = 
        DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
    

    Seems to work and is faster than @dan-getz's answer (at least for this data matrix) :)

    using DataFrames, BenchmarkTools
    
    dataMatrix = [
        "parName"   "region"    "forType"       "value";
        "vol"       "AL"        "broadL_highF"  3.3055628012;
        "vol"       "AL"        "con_highF"     2.1360975151;
        "vol"       "AQ"        "broadL_highF"  5.81984502;
        "vol"       "AQ"        "con_highF"     8.1462998309;
    ]
    
    mat2df(mat) = 
        DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
    
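    function mat2dfDan(mat)
        # Serialise the matrix to tab-separated text, then let DataFrames parse it.
        # Note: `indices` is the Julia 0.6 name; it became `axes` in Julia 0.7+.
        s = join([join([mat[i,j] for j in indices(mat, 2)], '\t') 
                    for i in indices(mat, 1)],'\n')
    
        DataFrames.inlinetable(s; separator='\t', header=true)
    end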
    


    julia> @benchmark mat2df(dataMatrix)
    
    BenchmarkTools.Trial: 
      memory estimate:  5.05 KiB
      allocs estimate:  75
      --------------
      minimum time:     18.601 μs (0.00% GC)
      median time:      21.318 μs (0.00% GC)
      mean time:        31.773 μs (2.50% GC)
      maximum time:     4.287 ms (95.32% GC)
      --------------
      samples:          10000
      evals/sample:     1
    
    julia> @benchmark mat2dfDan(dataMatrix)
    
    BenchmarkTools.Trial: 
      memory estimate:  17.55 KiB
      allocs estimate:  318
      --------------
      minimum time:     69.183 μs (0.00% GC)
      median time:      81.326 μs (0.00% GC)
      mean time:        90.284 μs (2.97% GC)
      maximum time:     5.565 ms (93.72% GC)
      --------------
      samples:          10000
      evals/sample:     1
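
    A minimal sketch of why the splat in `mat2df` yields typed columns, assuming Julia 1.x and Base only: slicing a `Matrix{Any}` gives a `Vector{Any}`, while rebuilding the vector from its splatted elements lets Julia choose the narrowest common element type. The names `col` and `narrowed` are illustrative and not part of the answer.

    ```julia
    dataMatrix = [
        "parName"   "region"    "forType"       "value";
        "vol"       "AL"        "broadL_highF"  3.3055628012;
        "vol"       "AL"        "con_highF"     2.1360975151;
    ]

    col = dataMatrix[2:end, 4]   # slice of a Matrix{Any}: eltype is Any
    narrowed = [col...]          # rebuilt from its elements: eltype is Float64

    println(eltype(col), " -> ", eltype(narrowed))
    ```

    Splatting is convenient for small matrices like this one; for very long columns it may become costly, so treat it as a sketch rather than a general recipe.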
    
  • 2021-01-26 03:09

    While I didn't find a complete solution, a partial one is to try to convert the individual columns ex-post:

    """
        convertDf!(df)
    
    Try to convert each column of the converted df from Any to Int64, Float64 or String (in that order).
    """
    function convertDf!(df)
        for c in names(df)
            try
              df[c] = convert(DataArrays.DataArray{Int64,1},df[c])
            catch
                try
                  df[c] = convert(DataArrays.DataArray{Float64,1},df[c])
                catch
                    try
                      df[c] = convert(DataArrays.DataArray{String,1},df[c])
                    catch
                    end
                end
            end
        end
    end 
    

    While surely incomplete, it is enough for my needs.
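
    The same try-in-order idea can be sketched per column with Base only, assuming Julia 1.x. `narrow_column` is a hypothetical helper, not a DataFrames function; it works on a plain vector so it does not depend on the `DataArrays` types used above.

    ```julia
    # Hypothetical helper: try Int64, then Float64, then String for one column.
    # Note that whole-valued floats such as 2.0 convert to Int64 without error,
    # so a column like Any[1.0, 2.0] ends up as Vector{Int64}.
    function narrow_column(col::AbstractVector)
        for T in (Int64, Float64, String)
            try
                return convert(Vector{T}, col)
            catch err
                # Only swallow the errors that signal "this type does not fit".
                err isa Union{MethodError, InexactError} || rethrow()
            end
        end
        return col  # nothing fit: leave the column unchanged
    end

    narrow_column(Any[1, 2, 3])      # Vector{Int64}
    narrow_column(Any[1.5, 2])       # Vector{Float64}
    narrow_column(Any["a", "b"])     # Vector{String}
    ```

    With a recent DataFrames version one would presumably apply it as `df[!, c] = narrow_column(df[!, c])` for each column name `c`; that usage is an assumption and is not taken from the answer.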

  • 2021-01-26 03:10

    While I think there may be a better way to go about the whole thing, this should do what you want.

    df = DataFrame()
    for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them.
        # check all types the same else assign to Any
        T = typeof(dataMatrix[2,ind])
        T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any
        # convert to type of second element then add to data frame
        df[s] = T.(dataMatrix[2:end,ind])
    end
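
    The loop above can also be sketched without mutating a DataFrame, assuming Julia 1.x and at least one data row below the header. `typed_columns` is a hypothetical name; it returns a NamedTuple of typed vectors, a form the `DataFrame` constructor accepts.

    ```julia
    # Hypothetical helper: header row -> column names, remaining rows -> typed vectors.
    function typed_columns(mat::AbstractMatrix)
        names = Tuple(Symbol.(mat[1, :]))
        cols = map(axes(mat, 2)) do j
            col = mat[2:end, j]
            T = typeof(first(col))
            # Keep T only if every entry has exactly that type, else fall back to Any.
            T = all(x -> typeof(x) == T, col) ? T : Any
            convert(Vector{T}, col)
        end
        return NamedTuple{names}(Tuple(cols))
    end

    m = Any["name" "value"; "a" 1.5; "b" 2.5]
    nt = typed_columns(m)   # nt.name is a Vector{String}, nt.value a Vector{Float64}
    ```

    Passing `nt` to `DataFrame(nt)` should then give one typed column per header entry.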
    
  • 2021-01-26 03:27

    Another method would be to reuse the working solution, i.e. convert the matrix into a string that DataFrames can parse. In code, this is:

    using DataFrames
    
    dataMatrix = [
        "parName"   "region"    "forType"       "value";
        "vol"       "AL"        "broadL_highF"  3.3055628012;
        "vol"       "AL"        "con_highF"     2.1360975151;
        "vol"       "AQ"        "broadL_highF"  5.81984502;
        "vol"       "AQ"        "con_highF"     8.1462998309;
    ]
    
    s = join(
      [join([dataMatrix[i,j] for j in indices(dataMatrix, 2)]
      , '\t') for i in indices(dataMatrix, 1)], '\n')
    
    df = DataFrames.inlinetable(s; separator='\t', header=true)
    

    The resulting df has its column types guessed by DataFrames.
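
    On Julia 0.7 and later, `indices` has been renamed `axes`, so the string-building step can be sketched as follows with Base only. The parsing step is left as a comment because it relies on an assumption: that a delimited-text reader such as CSV.jl is available, since `inlinetable` is not present in recent DataFrames releases.

    ```julia
    dataMatrix = [
        "parName"   "region"    "forType"       "value";
        "vol"       "AL"        "broadL_highF"  3.3055628012;
        "vol"       "AL"        "con_highF"     2.1360975151;
        "vol"       "AQ"        "broadL_highF"  5.81984502;
        "vol"       "AQ"        "con_highF"     8.1462998309;
    ]

    # One tab-separated line per matrix row, lines joined by newlines.
    s = join((join(dataMatrix[i, :], '\t') for i in axes(dataMatrix, 1)), '\n')

    # Assumption (not run here): with CSV.jl installed, something like
    #   CSV.read(IOBuffer(s), DataFrame; delim='\t')
    # would parse the text and infer the column types.
    ```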

    Unrelated, but this answer reminds me of the joke about how a mathematician boils water.
