Question
I need to do something quite specific and I'm trying to do it the right way; in particular, I want it to be efficient.
I have a DataFrame that looks like this:
v = ["x","y","z"][rand(1:3, 10)]
df = DataFrame(Any[collect(1:10), v, rand(10)], [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED])
Row │ USER_ID GENRE_MAIN TOTAL_LISTENED
│ Int64 String Float64
─────┼─────────────────────────────────────
1 │ 1 x 0.237186
12 │ 1 y 0.237186
13 │ 1 x 0.254486
2 │ 2 z 0.920804
3 │ 3 y 0.140626
4 │ 4 x 0.653306
5 │ 5 x 0.83126
6 │ 6 x 0.928973
7 │ 7 y 0.519728
8 │ 8 x 0.409969
9 │ 9 z 0.798064
10 │ 10 x 0.701332
I want to aggregate it by user (I have many rows per USER_ID) and run several calculations.
I need to calculate the top 1, 2, 3, 4 and 5 genres, album names and artist names per USER_ID, together with their respective values (the corresponding TOTAL_LISTENED), laid out like this:
USER_ID │ ALBUM1_NAME │ ALBUM2_NAME │ ALBUM1_NAME_VALUE │ ALBUM2_NAME_VALUE │ ...... │ GENRE1 │ GENRE2
One row per USER_ID.
I have this solution, which covers 90% of what I want, but I can't modify it to also include the TOTAL_LISTENED values:
using Pkg
Pkg.activate(".")
Pkg.add("DataFrames")
Pkg.add("Pipe")
using DataFrames, Pipe, Random
Random.seed!(1234)
df = DataFrame(USER_ID=rand(1:10, 80),
GENRE_MAIN=rand(string.("genre_", 1:6), 80),
ALBUM_NAME=rand(string.("album_", 1:6), 80),
ALBUM_ARTIST_NAME=rand(string.("artist_", 1:6), 80))
function top5(sdf, col, prefix)
return @pipe groupby(sdf, col) |>
combine(_, nrow) |>
sort!(_, :nrow, rev=true) |>
first(_, 5) |>
vcat(_[!, 1], fill(missing, 5 - nrow(_))) |>
DataFrame([string(prefix, i) for i in 1:5] .=> _)
end
@pipe groupby(df, :USER_ID) |>
combine(_,
x -> top5(x, :GENRE_MAIN, "genre"),
x -> top5(x, :ALBUM_NAME, "album"),
x -> top5(x, :ALBUM_ARTIST_NAME, "artist"))
An example:
For user 1 of the DataFrame above, I want the result to be:
Row │ USER_ID GENRE1 GENRE2 GENRE1_VALUE GENRE2_VALUE ......
│ Int64 String String Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 1 x y 0.491672 0.237186 ......
I only show GENRE here, but I also want the same for ALBUM_NAME and ALBUM_ARTIST_NAME.
Afterwards I also want to compute a top-rank percentage: order the users by TOTAL_LISTENED and calculate their percentile, so I can rank them as top 5%, top 10% or top 20% of the total. I can calculate the targeted quantile with
x = .05
quantile(df.TOTAL_LISTENED, x)
and then just keep all the users whose TOTAL_LISTENED is above this quantile, but I don't know how to calculate it properly inside combine...
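The percentile step described above can be sketched outside of the top-5 combine call. This is only a sketch under assumptions: "top 5%" is taken to mean users whose summed TOTAL_LISTENED is at or above the 0.95 quantile (so the cutoff is `quantile(..., 1 - x)` rather than `quantile(..., x)`), and the toy data, the `per_user` table and the `TOP_5PCT`-style column names are made up for illustration.

```julia
using DataFrames, Statistics

# Toy data (hypothetical): 10 users, two rows each, so that user i sums to i.
df = DataFrame(USER_ID = repeat(1:10, inner=2),
               TOTAL_LISTENED = repeat(1:10, inner=2) ./ 2)

# One row per user with the summed listening time.
per_user = combine(groupby(df, :USER_ID), :TOTAL_LISTENED => sum => :TOTAL)

# For each tier, flag the users whose total reaches the (1 - pct/100) quantile.
for pct in (5, 10, 20)
    cutoff = quantile(per_user.TOTAL, 1 - pct / 100)
    per_user[!, Symbol("TOP_", pct, "PCT")] = per_user.TOTAL .>= cutoff
end
```

Since `per_user` has one row per USER_ID, it can afterwards be joined to any other per-user table on :USER_ID.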
Thank you
Answer 1:
As I commented on your previous post, I would recommend asking one specific question rather than redoing your whole project on StackOverflow. If you need that kind of help, https://discourse.julialang.org/ is a good place to discuss it, especially since your analysis involves many steps and each of them needs a precise definition of what exactly you want. It would also be best to share your full data set there, as the sample you provide here is too small for a proper analysis.
Here is an example of how to add the totals columns (I assume that you want the data to be ordered by the totals):
julia> using Random, DataFrames, Pipe
julia> Random.seed!(1234);
julia> df = DataFrame([rand(1:10, 100), rand('a':'k', 100), rand(100)],
[:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED]);
julia> function top5(sdf, col, prefix)
@pipe groupby(sdf, col) |>
combine(_, :TOTAL_LISTENED => sum => :SUM) |>
sort!(_, :SUM, rev=true) |>
first(_, 5) |>
vcat(_[!, 1], fill(missing, 5 - nrow(_)),
_[!, 2], fill(missing, 5 - nrow(_))) |>
DataFrame([[string(prefix, i) for i in 1:5];
[string(prefix, i, "_VALUE") for i in 1:5]] .=> _)
end;
julia> @pipe groupby(df, :USER_ID) |>
combine(_, x -> top5(x, :GENRE_MAIN, "genre"))
10×11 DataFrame
Row │ USER_ID genre1 genre2 genre3 genre4 genre5 genre1_VALUE genre2_VALUE genre3_VALUE genre4_VALUE genre5_VALUE
│ Int64 Char Char Char Char Char? Float64 Float64 Float64 Float64 Float64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 d b j e i 2.34715 2.014 1.68587 0.693472 0.377869
2 │ 4 b e d c missing 0.90263 0.589418 0.263121 0.107839 missing
3 │ 8 c d i k j 1.55335 1.40416 0.977785 0.779468 0.118024
4 │ 2 a e f g k 1.34841 0.901507 0.87146 0.797606 0.669002
5 │ 10 a e f i d 1.60554 1.07311 0.820425 0.757363 0.678598
6 │ 7 f i g c a 2.59654 1.49654 1.15944 0.670488 0.258173
7 │ 9 i b e a g 1.57373 0.954117 0.603848 0.338918 0.133201
8 │ 5 f g c k d 1.33899 0.722283 0.664457 0.54016 0.507337
9 │ 3 d c f h e 1.63695 0.919088 0.544296 0.531262 0.0540101
10 │ 6 d g f j i 1.68768 0.97688 0.333207 0.259212 0.0636912
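The same idea can be extended to several columns at once, which is what the question asks for (genre, album and artist). The following is a sketch rather than the method from the answer above: the helper `topn`, its `n` keyword and the toy data are hypothetical, and it returns a NamedTuple instead of a DataFrame so that `combine` yields exactly one row per user.

```julia
using DataFrames

# Hypothetical helper: top `n` values of `col` within one user's rows,
# ranked by summed TOTAL_LISTENED, padded with `missing` up to `n` entries.
function topn(sdf, col, prefix; n=5)
    totals = combine(groupby(sdf, col), :TOTAL_LISTENED => sum => :SUM)
    sort!(totals, :SUM, rev=true)
    top = first(totals, n)
    pad = fill(missing, n - nrow(top))
    labels = vcat(top[!, 1], pad)   # category names
    sums = vcat(top.SUM, pad)       # matching totals
    cols = vcat([Symbol(prefix, i) for i in 1:n],
                [Symbol(prefix, i, "_VALUE") for i in 1:n])
    return NamedTuple{Tuple(cols)}(Tuple(vcat(labels, sums)))
end

# Toy data (hypothetical), small enough to check by hand.
df = DataFrame(USER_ID = [1, 1, 1, 2],
               GENRE_MAIN = ["x", "y", "x", "z"],
               ALBUM_NAME = ["a1", "a2", "a1", "a3"],
               TOTAL_LISTENED = [0.25, 0.5, 0.5, 1.0])

# Merge the per-column results into one NamedTuple, giving one wide row per user.
result = combine(groupby(df, :USER_ID),
                 x -> merge(topn(x, :GENRE_MAIN, "GENRE"; n=2),
                            topn(x, :ALBUM_NAME, "ALBUM"; n=2)))
```

For user 1 this gives GENRE1 = "x" with GENRE1_VALUE = 0.75 and GENRE2 = "y" with GENRE2_VALUE = 0.5; user 2 has only one genre, so its second slot is `missing`. Adding ALBUM_ARTIST_NAME would be one more `topn` call inside `merge`.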
Source: https://stackoverflow.com/questions/65168149/julia-dataframe-combine-specific-calculations-and-tranpose