Julia DataFrame: combine specific calculations and transpose

Question

I need to do something quite specific, and I'm trying to do it the right way; in particular, I want it to be efficient.

I have a DataFrame that looks like this:

v = ["x","y","z"][rand(1:3, 10)]
df = DataFrame(Any[collect(1:10), v, rand(10)], [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED])

 Row │ USER_ID  GENRE_MAIN  TOTAL_LISTENED 
     │ Int64    String      Float64        
─────┼─────────────────────────────────────
   1 │       1  x                 0.237186
  12 │       1  y                 0.237186
  13 │       1  x                 0.254486
   2 │       2  z                 0.920804
   3 │       3  y                 0.140626
   4 │       4  x                 0.653306
   5 │       5  x                 0.83126
   6 │       6  x                 0.928973
   7 │       7  y                 0.519728
   8 │       8  x                 0.409969
   9 │       9  z                 0.798064
  10 │      10  x                 0.701332

I want to aggregate it by user (I have many rows per USER_ID) and run several calculations.

I need to calculate the top 1, 2, 3, 4, and 5 genres, album names, and artist names per USER_ID, together with their respective values (the corresponding TOTAL_LISTENED), laid out like this:

USER_ID │ ALBUM1_NAME │ ALBUM2_NAME │ ALBUM1_NAME_VALUE │ ALBUM2_NAME_VALUES │ ...... │ GENRE1 │ GENRE2

One row per USER_ID.

I have this solution, which covers 90% of what I want, but I can't work out how to modify it to also include the TOTAL_LISTENED values:

using Pkg

# Install the packages into the local project before loading them
Pkg.activate(".")
Pkg.add("DataFrames")
Pkg.add("Pipe")

using DataFrames, Pipe, Random

Random.seed!(1234)

df = DataFrame(USER_ID=rand(1:10, 80),
               GENRE_MAIN=rand(string.("genre_", 1:6), 80),
               ALBUM_NAME=rand(string.("album_", 1:6), 80),
               ALBUM_ARTIST_NAME=rand(string.("artist_", 1:6), 80))

function top5(sdf, col, prefix)
    return @pipe groupby(sdf, col) |>
                 combine(_, nrow) |>
                 sort!(_, :nrow, rev=true) |>
                 first(_, 5) |>
                 vcat(_[!, 1], fill(missing, 5 - nrow(_))) |>
                 DataFrame([string(prefix, i) for i in 1:5] .=> _)
end

@pipe groupby(df, :USER_ID) |>
      combine(_,
              x -> top5(x, :GENRE_MAIN, "genre"),
              x -> top5(x, :ALBUM_NAME, "album"), 
              x -> top5(x, :ALBUM_ARTIST_NAME, "artist"))

An example:

For user 1 in the first DataFrame above, I want the result to be:

 Row │ USER_ID  GENRE1  GENRE2  GENRE1_VALUE  GENRE2_VALUE  ......
     │ Int64    String  String  Float64       Float64
─────┼─────────────────────────────────────────────────────
   1 │       1  x       y           0.491672      0.237186  ......

I used only GENRE here, but I also want the same for ALBUM_NAME and ALBUM_ARTIST_NAME.

Afterwards I also want to compute a top-rank percentage: order the users by TOTAL_LISTENED and calculate their percentile, so I can rank them as top 5%, top 10%, or top 20% of the total. I can calculate the targeted quantile with:

using Statistics  # provides quantile

x = .05
quantile(df.TOTAL_LISTENED, x)

and then keep all the users whose TOTAL_LISTENED is above this quantile, but I don't know how to calculate it properly inside the combine call.
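
The percentile step can also be done outside of the per-genre combine. Below is a minimal sketch under stated assumptions: the synthetic data, the TOTAL and TIER column names, and the `tier` helper are hypothetical and only illustrate the idea. Note that "top 5%" corresponds to the 0.95 quantile of the per-user totals; users above the 0.05 quantile would make up roughly the top 95%.

```julia
using DataFrames, Statistics, Random

Random.seed!(1234)

# Synthetic data (assumption): 50 users with several rows each.
df = DataFrame(USER_ID = rand(1:50, 500), TOTAL_LISTENED = rand(500))

# One row per user with the summed listening time.
totals = combine(groupby(df, :USER_ID), :TOTAL_LISTENED => sum => :TOTAL)

# Thresholds: a user is in the top 5% when TOTAL is at or above the 0.95 quantile.
q95, q90, q80 = quantile(totals.TOTAL, [0.95, 0.90, 0.80])

# Hypothetical helper mapping a total to a tier label.
tier(t) = t >= q95 ? "top5%" : t >= q90 ? "top10%" : t >= q80 ? "top20%" : "rest"

totals.TIER = tier.(totals.TOTAL)
sort!(totals, :TOTAL, rev = true)
```

The resulting `totals` table has one row per USER_ID, so it could be joined to a wide per-user table on USER_ID if a single output is needed.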

Thank you

Answer 1:

As I commented on the previous post, I would recommend asking one specific question rather than redoing your whole project on StackOverflow. If you need that kind of help, https://discourse.julialang.org/ is a good place to discuss it, especially since your analysis has many steps and each requires a precise definition of what you want. It would also be best to share your full data set there, as the sample you provide here is too small for a proper analysis.

Here is an example of how to add the totals columns (I assume that you want the data ordered by the totals):

julia> using Random, DataFrames, Pipe

julia> Random.seed!(1234);

julia> df = DataFrame([rand(1:10, 100), rand('a':'k', 100), rand(100)],
                      [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED]);

julia> function top5(sdf, col, prefix)
           @pipe groupby(sdf, col) |>
                 combine(_, :TOTAL_LISTENED => sum => :SUM) |>
                 sort!(_, :SUM, rev=true) |>
                 first(_, 5) |>
                 vcat(_[!, 1], fill(missing, 5 - nrow(_)),
                      _[!, 2], fill(missing, 5 - nrow(_))) |>
                 DataFrame([[string(prefix, i) for i in 1:5];
                            [string(prefix, i, "_VALUE") for i in 1:5]] .=> _)
       end;

julia> @pipe groupby(df, :USER_ID) |>
             combine(_, x -> top5(x, :GENRE_MAIN, "genre"))
10×11 DataFrame
 Row │ USER_ID  genre1  genre2  genre3  genre4  genre5   genre1_VALUE  genre2_VALUE  genre3_VALUE  genre4_VALUE  genre5_VALUE    
     │ Int64    Char    Char    Char    Char    Char?    Float64       Float64       Float64       Float64       Float64?        
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       1  d       b       j       e       i             2.34715      2.014         1.68587       0.693472        0.377869
   2 │       4  b       e       d       c       missing       0.90263      0.589418      0.263121      0.107839  missing         
   3 │       8  c       d       i       k       j             1.55335      1.40416       0.977785      0.779468        0.118024
   4 │       2  a       e       f       g       k             1.34841      0.901507      0.87146       0.797606        0.669002
   5 │      10  a       e       f       i       d             1.60554      1.07311       0.820425      0.757363        0.678598
   6 │       7  f       i       g       c       a             2.59654      1.49654       1.15944       0.670488        0.258173
   7 │       9  i       b       e       a       g             1.57373      0.954117      0.603848      0.338918        0.133201
   8 │       5  f       g       c       k       d             1.33899      0.722283      0.664457      0.54016         0.507337
   9 │       3  d       c       f       h       e             1.63695      0.919088      0.544296      0.531262        0.0540101
  10 │       6  d       g       f       j       i             1.68768      0.97688       0.333207      0.259212        0.0636912
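
The same pattern can be extended to all three columns from the question. The following is a minimal sketch, not a definitive implementation: the helper name `topk_with_values`, the upper-case prefixes, and the synthetic data are assumptions made for illustration; only the column names come from the question.

```julia
using DataFrames, Random

Random.seed!(1234)

# Synthetic data (assumption): column names follow the question, values are random.
n = 200
df = DataFrame(USER_ID = rand(1:10, n),
               GENRE_MAIN = rand(string.("genre_", 1:6), n),
               ALBUM_NAME = rand(string.("album_", 1:6), n),
               ALBUM_ARTIST_NAME = rand(string.("artist_", 1:6), n),
               TOTAL_LISTENED = rand(n))

# Hypothetical helper: returns a one-row DataFrame holding the top-k levels of
# `col` (ranked by summed TOTAL_LISTENED) followed by their sums.
# Slots beyond the number of available levels are filled with missing.
function topk_with_values(sdf, col, prefix; k = 5)
    agg = combine(groupby(sdf, col), :TOTAL_LISTENED => sum => :SUM)
    sort!(agg, :SUM, rev = true)
    top = first(agg, k)
    out = DataFrame()
    for i in 1:k
        out[!, string(prefix, i)] = [i <= nrow(top) ? top[i, col] : missing]
    end
    for i in 1:k
        out[!, string(prefix, i, "_VALUE")] = [i <= nrow(top) ? top[i, :SUM] : missing]
    end
    return out
end

# One row per user; each helper call contributes 2k columns.
res = combine(groupby(df, :USER_ID),
              x -> topk_with_values(x, :GENRE_MAIN, "GENRE"),
              x -> topk_with_values(x, :ALBUM_NAME, "ALBUM"),
              x -> topk_with_values(x, :ALBUM_ARTIST_NAME, "ARTIST"))
```

Here `res` has the USER_ID column plus 30 result columns (for example GENRE1 through GENRE5 and GENRE1_VALUE through GENRE5_VALUE). Building the one-row DataFrame column by column keeps each name next to its value, at the cost of being more verbose than the `vcat` approach above.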

Source: https://stackoverflow.com/questions/65168149/julia-dataframe-combine-specific-calculations-and-tranpose

Tags

dataframe

julia