One nice feature of DataFrames is that it can store columns of different types and "auto-recognise" them, e.g.:
using DataFrames, DataStructures
df1
mat2df(mat) =
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
Seems to work and is faster than @dan-getz's answer (at least for this data matrix) :)
using DataFrames, BenchmarkTools
dataMatrix = [
"parName" "region" "forType" "value";
"vol" "AL" "broadL_highF" 3.3055628012;
"vol" "AL" "con_highF" 2.1360975151;
"vol" "AQ" "broadL_highF" 5.81984502;
"vol" "AQ" "con_highF" 8.1462998309;
]
mat2df(mat) =
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
function mat2dfDan(mat)
    # Build a tab-separated string from the matrix passed in (not the global dataMatrix)
    s = join([join([mat[i,j] for j in indices(mat, 2)], '\t')
              for i in indices(mat, 1)], '\n')
    DataFrames.inlinetable(s; separator='\t', header=true)
end
julia> @benchmark mat2df(dataMatrix)
BenchmarkTools.Trial:
memory estimate: 5.05 KiB
allocs estimate: 75
--------------
minimum time: 18.601 μs (0.00% GC)
median time: 21.318 μs (0.00% GC)
mean time: 31.773 μs (2.50% GC)
maximum time: 4.287 ms (95.32% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mat2dfDan(dataMatrix)
BenchmarkTools.Trial:
memory estimate: 17.55 KiB
allocs estimate: 318
--------------
minimum time: 69.183 μs (0.00% GC)
median time: 81.326 μs (0.00% GC)
mean time: 90.284 μs (2.97% GC)
maximum time: 5.565 ms (93.72% GC)
--------------
samples: 10000
evals/sample: 1
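The splat in `[mat[2:end,i]...]` is what lets `mat2df` end up with typed columns: slicing a `Matrix{Any}` gives a `Vector{Any}`, while a vector literal recomputes its element type from the values it receives. A minimal sketch in Base Julia only, where `raw` and `mixed` are hypothetical stand-ins for column slices:

```julia
# `raw` stands in for one column slice such as mat[2:end, 4] of a Matrix{Any}.
raw = Any[3.3055628012, 2.1360975151, 5.81984502]

narrowed = [raw...]             # vector literal: element type is recomputed from the values
also_narrowed = identity.(raw)  # broadcasting identity narrows in the same way

println(eltype(raw))            # element type before narrowing
println(eltype(narrowed))       # element type after splatting
println(eltype(also_narrowed))  # element type after broadcasting

# A column holding unrelated types cannot be narrowed and stays Any.
mixed = Any[1, "AL"]
println(eltype([mixed...]))
```

Splatting can get expensive for long columns, so the broadcast form may be the safer choice on larger matrices; that is an assumption worth benchmarking rather than a measured result.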
While I didn't find a complete solution, a partial one is to try to convert the individual columns ex-post:
"""
    convertDf!(df)
Try to convert each column of the converted df from Any to Int64, Float64 or String (in that order).
"""
function convertDf!(df)
    for c in names(df)
        try
            df[c] = convert(DataArrays.DataArray{Int64,1},df[c])
        catch
            try
                df[c] = convert(DataArrays.DataArray{Float64,1},df[c])
            catch
                try
                    df[c] = convert(DataArrays.DataArray{String,1},df[c])
                catch
                end
            end
        end
    end
end
While surely incomplete, it is enough for my needs.
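The same try-in-order idea can be sketched on plain vectors, without depending on DataArrays. This is only an illustration in Base Julia: `narrow_column` is a hypothetical helper, not part of DataFrames, and it works on a `Vector{Any}` rather than on a data frame column.

```julia
# Hypothetical helper: try Int64, then Float64, then String, in that order.
function narrow_column(col::AbstractVector)
    for T in (Int64, Float64, String)
        try
            return convert(Vector{T}, col)
        catch
            # conversion failed for this candidate type; try the next one
        end
    end
    return col  # nothing fits: leave the column as it is
end

ints    = narrow_column(Any[1, 2, 3])
floats  = narrow_column(Any[1.5, 2])
strings = narrow_column(Any["AL", "AQ"])
mixed   = narrow_column(Any[1, "AL"])

println(eltype(ints))
println(eltype(floats))
println(eltype(strings))
println(eltype(mixed))
```

A loop over candidate types keeps the function flat and makes it easy to add another type to the list.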
While I think there may be a better way to go about the whole thing, this should do what you want.
df = DataFrame()
for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them.
    # check all types the same else assign to Any
    T = typeof(dataMatrix[2,ind])
    T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any
    # convert to type of second element then add to data frame
    df[s] = T.(dataMatrix[2:end,ind])
end
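A possible variation on the type check, sketched in Base Julia only: instead of falling back to `Any` whenever the element types differ, fold the element types with `promote_type`. This is a different rule from the loop above, since a column mixing `Int64` and `Float64` becomes `Float64` rather than `Any`. `promote_column` is a hypothetical helper name, and it assumes a non-empty column.

```julia
# Hypothetical helper: choose one element type for a column by promotion.
function promote_column(col::AbstractVector)
    # Fold the element types pairwise; unrelated types (e.g. Int64 and String) give Any.
    T = mapreduce(typeof, promote_type, col)
    return convert(Vector{T}, col)
end

numbers = promote_column(Any[1, 2.5])
labels  = promote_column(Any["vol", "vol"])
mixed   = promote_column(Any[1, "AL"])

println(eltype(numbers))
println(eltype(labels))
println(eltype(mixed))
```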
Another method would be to reuse the working solution, i.e. convert the matrix into a string in a format that DataFrames can consume. In code, this is:
using DataFrames
dataMatrix = [
"parName" "region" "forType" "value";
"vol" "AL" "broadL_highF" 3.3055628012;
"vol" "AL" "con_highF" 2.1360975151;
"vol" "AQ" "broadL_highF" 5.81984502;
"vol" "AQ" "con_highF" 8.1462998309;
]
s = join(
    [join([dataMatrix[i,j] for j in indices(dataMatrix, 2)], '\t')
     for i in indices(dataMatrix, 1)], '\n')
df = DataFrames.inlinetable(s; separator='\t', header=true)
The resulting df has its column types guessed by DataFrames.
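The string-building step on its own can be sketched as follows. One assumption to flag: on Julia 0.7 and later, `indices` was renamed to `axes`, so this sketch uses `axes`. The matrix `m` is a small hypothetical example with the same layout as `dataMatrix`, and only Base Julia is used.

```julia
# `m` is a hypothetical 2x2 matrix: a header row followed by one data row.
m = ["parName" "value";
     "vol"     3.5]

# Join each row with tabs, then join the rows with newlines.
rows = [join([m[i, j] for j in axes(m, 2)], '\t') for i in axes(m, 1)]
s = join(rows, '\n')

print(s)
```

The resulting `s` is ordinary tab-separated text, so it can be handed to whatever delimited-text reader is available in the installed package versions.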
Unrelated, but this answer reminds me of the "how a mathematician boils water" joke.