wide to long data table transformation with variables in columns and rows

Question

I have a csv with multiple tables with variables stored in both rows and columns.
About this csv:

I want to go from "wide" to "long"
There are multiple "data frames" in one csv
Each "data frame" has different types of variables

> df3
     V1          V2    V3     V4      V5     V6      V7    V8
1   nyc 123 main st month      1       2      3       4     5
2   nyc 123 main st     x  58568  567567 567909   35876 56943
3   nyc 123 main st     y   5345    3673   3453    3467   788
4   nyc 123 main st     z  53223  563894 564456   32409 56155
5                                                            
6    la  63 main st month      1       2      3       4     5
7    la  63 main st     a  87035 7467456   3363     863 43673
8    la  63 main st     b    345     456    345     678   345
9    la  63 main st     c  86690 7467000   3018     185 43328
10                                                           
11   sf 953 main st month      1       2      3       4     5
12   sf 953 main st     x 457456    3455 345345   56457  3634
13   sf 953 main st     b   5345    3673   3453    3467   788
14   sf 953 main st     z 452111    -218 341892   52990  2846

> df4
18 city     address month      x       y      z       a     b       c
19  nyc 123 main st     1  58568    5345  53223    null  null    null
20  nyc 123 main st     2 567567    3673 563894    null  null    null
21  nyc 123 main st     3 567909    3453 564456    null  null    null
22  nyc 123 main st     4  35876    3467  32409    null  null    null
23  nyc 123 main st     5  56943     788  56155    null  null    null
24   la  63 main st     1   null    null   null   87035   345   86690
25   la  63 main st     2   null    null   null 7467456   456 7467000
26   la  63 main st     3   null    null   null    3363   345    3018
27   la  63 main st     4   null    null   null     863   678     185
28   la  63 main st     5   null    null   null   43673   345   43328
29   sf 953 main st     1 457456    null 452111    null  5345    null
30   sf 953 main st     2   3455    null   -218    null  3673    null
31   sf 953 main st     3 345345    null 341892    null  3453    null
32   sf 953 main st     4  56457    null  52990    null  3467    null
33   sf 953 main st     5   3634    null   2846    null   788    null

The top is the data I have, the bottom is the transformation I want.

I'm most comfortable in R but I'm practicing Python, so any approach works.

Answer 1:

The sample data set provided by the OP suggests that all data frames within the csv file have the same structure, i.e., the same number, names, and positions of columns, and that the monthly columns V4 to V8 refer to the same months 1 to 5 for all "sub frames".

If this is true then we can treat the whole csv file as one data frame and convert it to the desired format by reshaping using melt() and dcast() from the data.table package:

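library(data.table)
# melt V4..V8 to long format (na.rm = TRUE drops the NA cells of the separator
# rows), drop the "month" header rows, cast the V3 values into columns
# (rleid(variable) numbers V4..V8 as months 1..5), then rename the id columns
setDT(df3)[, melt(.SD, id.vars = paste0("V", 1:3), na.rm = TRUE)][
  V3 != "month", dcast(.SD, V1 + V2 + rleid(variable) ~ forcats::fct_inorder(V3))][
    , setnames(.SD, 1:3, c("city", "address", "month"))]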

    city     address month      x    y      z       a    b       c
 1:   la  63 main st     1     NA   NA     NA   87035  345   86690
 2:   la  63 main st     2     NA   NA     NA 7467456  456 7467000
 3:   la  63 main st     3     NA   NA     NA    3363  345    3018
 4:   la  63 main st     4     NA   NA     NA     863  678     185
 5:   la  63 main st     5     NA   NA     NA   43673  345   43328
 6:  nyc 123 main st     1  58568 5345  53223      NA   NA      NA
 7:  nyc 123 main st     2 567567 3673 563894      NA   NA      NA
 8:  nyc 123 main st     3 567909 3453 564456      NA   NA      NA
 9:  nyc 123 main st     4  35876 3467  32409      NA   NA      NA
10:  nyc 123 main st     5  56943  788  56155      NA   NA      NA
11:   sf 953 main st     1 457456   NA 452111      NA 5345      NA
12:   sf 953 main st     2   3455   NA   -218      NA 3673      NA
13:   sf 953 main st     3 345345   NA 341892      NA 3453      NA
14:   sf 953 main st     4  56457   NA  52990      NA 3467      NA
15:   sf 953 main st     5   3634   NA   2846      NA  788      NA

The fct_inorder() function from Hadley's forcats package is used here to order the columns by their first appearance instead of alphabetical order a, b, c, x, y, z.

Note that the cities have also been ordered alphabetically. If this is crucial (but I doubt it is), the original order can be preserved as well by using

forcats::fct_inorder(V1) + V2 + rleid(variable) ~ forcats::fct_inorder(V3)

as dcast() formula.
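Since the OP mentioned practicing Python, the same melt-then-cast idea can be sketched with pandas. This is only a rough equivalent under assumptions that are not in the question: the raw csv is assumed to have no header line, empty lines between the blocks, and the same months 1 to 5 in V4 to V8 for every block. The names RAW, long_df and wide are made up for illustration, and pivot() with a list-valued index needs pandas 1.1 or later.

```python
import io

import pandas as pd

# Assumed raw csv layout (not shown in the question): no header line,
# one empty line between the blocks.
RAW = """\
nyc,123 main st,month,1,2,3,4,5
nyc,123 main st,x,58568,567567,567909,35876,56943
nyc,123 main st,y,5345,3673,3453,3467,788
nyc,123 main st,z,53223,563894,564456,32409,56155

la,63 main st,month,1,2,3,4,5
la,63 main st,a,87035,7467456,3363,863,43673
la,63 main st,b,345,456,345,678,345
la,63 main st,c,86690,7467000,3018,185,43328

sf,953 main st,month,1,2,3,4,5
sf,953 main st,x,457456,3455,345345,56457,3634
sf,953 main st,b,5345,3673,3453,3467,788
sf,953 main st,z,452111,-218,341892,52990,2846
"""

names = [f"V{i}" for i in range(1, 9)]
# read_csv skips empty lines by default, so the separators disappear here.
df3 = pd.read_csv(io.StringIO(RAW), header=None, names=names)
df3 = df3[df3["V3"] != "month"]  # drop the repeated "month" header rows

# Orders of first appearance, playing the role of fct_inorder() above.
var_order = list(dict.fromkeys(df3["V3"]))   # x, y, z, a, b, c
city_order = list(dict.fromkeys(df3["V1"]))  # nyc, la, sf

# Wide to long: one row per (city, address, variable, month column).
long_df = df3.melt(id_vars=["V1", "V2", "V3"], var_name="col", value_name="value")
# Assumption: V4..V8 always mean months 1..5, so "V4" -> 1, ..., "V8" -> 5.
long_df["month"] = long_df["col"].str[1:].astype(int) - 3

# Long to wide: the V3 values become columns, kept in first-appearance order.
wide = (
    long_df.pivot(index=["V1", "V2", "month"], columns="V3", values="value")
    .reindex(columns=var_order)
    .reset_index()
    .rename(columns={"V1": "city", "V2": "address"})
)
wide.columns.name = None

# Restore the original city order instead of the alphabetical one.
wide["city"] = pd.Categorical(wide["city"], categories=city_order, ordered=True)
wide = wide.sort_values(["city", "month"]).reset_index(drop=True)

print(wide)
```

Variables that a block lacks come out as NaN, which also turns the value columns into floats; whether that matters depends on what happens to the table afterwards.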

Data

Unfortunately, the OP didn't supply the result of dput(df3) which made it unnecessarily difficult to reproduce the data set as printed in the question:

df3 <- readr::read_table(
  "     V1          V2    V3     V4      V5     V6      V7    V8
  1   nyc 123 main st month      1       2      3       4     5
  2   nyc 123 main st     x  58568  567567 567909   35876 56943
  3   nyc 123 main st     y   5345    3673   3453    3467   788
  4   nyc 123 main st     z  53223  563894 564456   32409 56155
  5                                                            
  6    la  63 main st month      1       2      3       4     5
  7    la  63 main st     a  87035 7467456   3363     863 43673
  8    la  63 main st     b    345     456    345     678   345
  9    la  63 main st     c  86690 7467000   3018     185 43328
  10                                                           
  11   sf 953 main st month      1       2      3       4     5
  12   sf 953 main st     x 457456    3455 345345   56457  3634
  13   sf 953 main st     b   5345    3673   3453    3467   788
  14   sf 953 main st     z 452111    -218 341892   52990  2846"
)
library(data.table)
setDT(df3)[, V2 := paste(X3, V2)][, c("X1", "X3") := NULL]
setDF(df3)[]

    V1          V2    V3     V4      V5     V6    V7    V8
1  nyc 123 main st month      1       2      3     4     5
2  nyc 123 main st     x  58568  567567 567909 35876 56943
3  nyc 123 main st     y   5345    3673   3453  3467   788
4  nyc 123 main st     z  53223  563894 564456 32409 56155
5              NA            NA      NA     NA    NA    NA
6   la  63 main st month      1       2      3     4     5
7   la  63 main st     a  87035 7467456   3363   863 43673
8   la  63 main st     b    345     456    345   678   345
9   la  63 main st     c  86690 7467000   3018   185 43328
10             NA            NA      NA     NA    NA    NA
11  sf 953 main st month      1       2      3     4     5
12  sf 953 main st     x 457456    3455 345345 56457  3634
13  sf 953 main st     b   5345    3673   3453  3467   788
14  sf 953 main st     z 452111    -218 341892 52990  2846

Answer 2:

It would help if your df had proper column names first. Please assign column names once you have read in the data.

I have used the dplyr, tidyr and stringr libraries for this analysis (gather() and spread() come from tidyr) and also renamed the first 3 columns:

library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(stringsAsFactors=FALSE,
        city = c("nyc", "nyc", "nyc"),
     address = c("123 main st", "123 main st", "123 main st"),
       month = c("x", "y", "z"),
          X1 = c(58568L, 5345L, 53223L),
          X2 = c(567567L, 3673L, 563894L),
          X3 = c(567909L, 3453L, 564456L),
          X4 = c(35876L, 3467L, 32409L),
          X5 = c(56943L, 788L, 56155L)
)

df %>% gather(Type, Value, -c(city:month)) %>% 
        spread(month, Value) %>%
        mutate(month = str_sub(Type, 2, 2)) %>%
        select(-Type) %>%
        select(c(city, address, month, x:z))

  city     address month      x    y      z
1  nyc 123 main st     1  58568 5345  53223
2  nyc 123 main st     2 567567 3673 563894
3  nyc 123 main st     3 567909 3453 564456
4  nyc 123 main st     4  35876 3467  32409
5  nyc 123 main st     5  56943  788  56155
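The snippet above reshapes only the nyc block. As a rough sketch of how the same idea could be applied to every block, here is a plain-Python version (standard library only) that reads the month labels from each block's own "month" row instead of assuming they are identical everywhere. The raw csv layout (no header line, empty lines between blocks) is an assumption, and reshape_blocks is a made-up helper name.

```python
import csv
import io

# Assumed raw csv layout (not shown in the question): no header line,
# one empty line between the blocks.
RAW = """\
nyc,123 main st,month,1,2,3,4,5
nyc,123 main st,x,58568,567567,567909,35876,56943
nyc,123 main st,y,5345,3673,3453,3467,788
nyc,123 main st,z,53223,563894,564456,32409,56155

la,63 main st,month,1,2,3,4,5
la,63 main st,a,87035,7467456,3363,863,43673
la,63 main st,b,345,456,345,678,345
la,63 main st,c,86690,7467000,3018,185,43328

sf,953 main st,month,1,2,3,4,5
sf,953 main st,x,457456,3455,345345,56457,3634
sf,953 main st,b,5345,3673,3453,3467,788
sf,953 main st,z,452111,-218,341892,52990,2846
"""


def reshape_blocks(text):
    """Return one dict per (city, address, month), in order of appearance."""
    records = []
    months, by_month = [], {}
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            continue  # empty separator line between blocks
        city, address, name, *values = row
        if name == "month":
            # A "month" row opens a new block and supplies its month labels.
            months = values
            by_month = {
                m: {"city": city, "address": address, "month": int(m)}
                for m in months
            }
            records.extend(by_month.values())
        else:
            # Any other row is one variable; spread its values over the months.
            for m, v in zip(months, values):
                by_month[m][name] = int(v)
    return records


records = reshape_blocks(RAW)

# Column order by first appearance: city, address, month, x, y, z, a, b, c.
fields = list(dict.fromkeys(key for rec in records for key in rec))

# Write the reshaped table as csv; variables a block lacks are left empty.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields, restval="")
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

Because the rows are emitted in the order they are read, the cities stay in their original order (nyc, la, sf) without any extra sorting.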

Source: https://stackoverflow.com/questions/45555048/wide-to-long-data-table-transformation-with-variables-in-columns-and-rows

Tags: python, dataframe, data-manipulation, data-munging