Question
I have a csv with multiple tables with variables stored in both rows and columns.
About this csv:
- I want to go from "wide" to "long"
- There are multiple "data frames" in one csv
- There are different types of variables for each "data frame"
> df3
V1 V2 V3 V4 V5 V6 V7 V8
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
4 nyc 123 main st z 53223 563894 564456 32409 56155
5
6 la 63 main st month 1 2 3 4 5
7 la 63 main st a 87035 7467456 3363 863 43673
8 la 63 main st b 345 456 345 678 345
9 la 63 main st c 86690 7467000 3018 185 43328
10
11 sf 953 main st month 1 2 3 4 5
12 sf 953 main st x 457456 3455 345345 56457 3634
13 sf 953 main st b 5345 3673 3453 3467 788
14 sf 953 main st z 452111 -218 341892 52990 2846
> df4
18 city address month x y z a b c
19 nyc 123 main st 1 58568 5345 53223 null null null
20 nyc 123 main st 2 567567 3673 563894 null null null
21 nyc 123 main st 3 567909 3453 564456 null null null
22 nyc 123 main st 4 35876 3467 32409 null null null
23 nyc 123 main st 5 56943 788 56155 null null null
24 la 63 main st 1 null null null 87035 345 86690
25 la 63 main st 2 null null null 7467456 456 7467000
26 la 63 main st 3 null null null 3363 345 3018
27 la 63 main st 4 null null null 863 678 185
28 la 63 main st 5 null null null 43673 345 43328
29 sf 953 main st 1 457456 null 452111 null 5345 null
30 sf 953 main st 2 3455 null -218 null 3673 null
31 sf 953 main st 3 345345 null 341892 null 3453 null
32 sf 953 main st 4 56457 null 52990 null 3467 null
33 sf 953 main st 5 3634 null 2846 null 788 null
The top (df3) is the data I have; the bottom (df4) is the transformation I want.
I'm most comfortable in R but I'm practicing Python, so any approach works.
Answer 1:
The sample data set provided by the OP suggests that all data frames within the csv file
- have the same structure, i.e., the same number, names, and positions of columns
- and the monthly columns V4 to V8 refer to the same months 1 to 5 for all "sub frames".
If this is true, then we can treat the whole csv file as one data frame and convert it to the desired format by reshaping with melt() and dcast() from the data.table package:
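library(data.table)
# melt V4..V8 to long format; na.rm = TRUE drops the empty separator rows
setDT(df3)[, melt(.SD, id.vars = paste0("V", 1:3), na.rm = TRUE)][
  # drop the "month" header rows, then cast the variable names in V3 to columns;
  # rleid(variable) numbers V4..V8 as 1..5, which matches the months here
  V3 != "month", dcast(.SD, V1 + V2 + rleid(variable) ~ forcats::fct_inorder(V3))][
  , setnames(.SD, 1:3, c("city", "address", "month"))]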
city address month x y z a b c
1: la 63 main st 1 NA NA NA 87035 345 86690
2: la 63 main st 2 NA NA NA 7467456 456 7467000
3: la 63 main st 3 NA NA NA 3363 345 3018
4: la 63 main st 4 NA NA NA 863 678 185
5: la 63 main st 5 NA NA NA 43673 345 43328
6: nyc 123 main st 1 58568 5345 53223 NA NA NA
7: nyc 123 main st 2 567567 3673 563894 NA NA NA
8: nyc 123 main st 3 567909 3453 564456 NA NA NA
9: nyc 123 main st 4 35876 3467 32409 NA NA NA
10: nyc 123 main st 5 56943 788 56155 NA NA NA
11: sf 953 main st 1 457456 NA 452111 NA 5345 NA
12: sf 953 main st 2 3455 NA -218 NA 3673 NA
13: sf 953 main st 3 345345 NA 341892 NA 3453 NA
14: sf 953 main st 4 56457 NA 52990 NA 3467 NA
15: sf 953 main st 5 3634 NA 2846 NA 788 NA
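Since the OP mentioned practicing Python, here is a rough standard-library sketch of the same melt-then-cast idea. It is not a translation of the data.table code: the rows list is a hand-typed copy of df3, the helper names melt and cast are made up for illustration, and it relies on the same assumption as above, namely that the five value columns always stand for months 1 to 5.

```python
# Hypothetical in-memory copy of df3: one list per row, None for the blank separator rows.
rows = [
    ["nyc", "123 main st", "month", 1, 2, 3, 4, 5],
    ["nyc", "123 main st", "x", 58568, 567567, 567909, 35876, 56943],
    ["nyc", "123 main st", "y", 5345, 3673, 3453, 3467, 788],
    ["nyc", "123 main st", "z", 53223, 563894, 564456, 32409, 56155],
    None,
    ["la", "63 main st", "month", 1, 2, 3, 4, 5],
    ["la", "63 main st", "a", 87035, 7467456, 3363, 863, 43673],
    ["la", "63 main st", "b", 345, 456, 345, 678, 345],
    ["la", "63 main st", "c", 86690, 7467000, 3018, 185, 43328],
    None,
    ["sf", "953 main st", "month", 1, 2, 3, 4, 5],
    ["sf", "953 main st", "x", 457456, 3455, 345345, 56457, 3634],
    ["sf", "953 main st", "b", 5345, 3673, 3453, 3467, 788],
    ["sf", "953 main st", "z", 452111, -218, 341892, 52990, 2846],
]


def melt(rows):
    """Wide to long: one (city, address, month, variable, value) record per cell."""
    long_records = []
    for row in rows:
        if row is None or row[2] == "month":  # skip separators and month header rows
            continue
        city, address, variable, *values = row
        # Assumption: the value columns are months 1..5, in that order.
        for month, value in enumerate(values, start=1):
            long_records.append((city, address, month, variable, value))
    return long_records


def cast(long_records):
    """Long to wide: one dict per (city, address, month), one key per variable."""
    # Variable names in order of first appearance (x, y, z, a, b, c for this data).
    variables = list(dict.fromkeys(rec[3] for rec in long_records))
    wide = {}
    for city, address, month, variable, value in long_records:
        key = (city, address, month)
        if key not in wide:
            # Variables that a block does not have stay None.
            wide[key] = {"city": city, "address": address, "month": month,
                         **dict.fromkeys(variables)}
        wide[key][variable] = value
    return list(wide.values())


df4 = cast(melt(rows))
for record in df4:
    print(record)
```

Because Python dicts keep insertion order, the cities come out in their original order (nyc, la, sf) rather than alphabetically, and None plays the role of NA.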
The fct_inorder() function from Hadley's forcats package is used here to order the columns by their first appearance instead of in alphabetical order a, b, c, x, y, z.
Note that the cities have also been ordered alphabetically. If this is crucial (but I doubt it is), the original order can be preserved as well by using
forcats::fct_inorder(V1) + V2 + rleid(variable) ~ forcats::fct_inorder(V3)
as the dcast() formula.
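To make the difference between the two orderings concrete, here is a tiny Python illustration (plain Python, not forcats). The variables list is simply the V3 column of df3 in row order with the "month" rows left out; dict.fromkeys is used only as a convenient way to drop repeats while keeping first appearance.

```python
# V3 of df3 in row order, without the "month" header rows
variables = ["x", "y", "z", "a", "b", "c", "x", "b", "z"]
cities = ["nyc", "la", "sf"]

first_seen = list(dict.fromkeys(variables))  # drop repeats, keep first appearance
alphabetical = sorted(set(variables))        # what a default sort would give

print(first_seen)      # ['x', 'y', 'z', 'a', 'b', 'c']
print(alphabetical)    # ['a', 'b', 'c', 'x', 'y', 'z']
print(sorted(cities))  # ['la', 'nyc', 'sf']
```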
Data
Unfortunately, the OP didn't supply the result of dput(df3), which made it unnecessarily difficult to reproduce the data set as printed in the question:
df3 <- readr::read_table(
" V1 V2 V3 V4 V5 V6 V7 V8
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
4 nyc 123 main st z 53223 563894 564456 32409 56155
5
6 la 63 main st month 1 2 3 4 5
7 la 63 main st a 87035 7467456 3363 863 43673
8 la 63 main st b 345 456 345 678 345
9 la 63 main st c 86690 7467000 3018 185 43328
10
11 sf 953 main st month 1 2 3 4 5
12 sf 953 main st x 457456 3455 345345 56457 3634
13 sf 953 main st b 5345 3673 3453 3467 788
14 sf 953 main st z 452111 -218 341892 52990 2846"
)
library(data.table)
setDT(df3)[, V2 := paste(X3, V2)][, c("X1", "X3") := NULL]
setDF(df3)[]
V1 V2 V3 V4 V5 V6 V7 V8
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
4 nyc 123 main st z 53223 563894 564456 32409 56155
5 NA NA NA NA NA NA
6 la 63 main st month 1 2 3 4 5
7 la 63 main st a 87035 7467456 3363 863 43673
8 la 63 main st b 345 456 345 678 345
9 la 63 main st c 86690 7467000 3018 185 43328
10 NA NA NA NA NA NA
11 sf 953 main st month 1 2 3 4 5
12 sf 953 main st x 457456 3455 345345 56457 3634
13 sf 953 main st b 5345 3673 3453 3467 788
14 sf 953 main st z 452111 -218 341892 52990 2846
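For anyone who wants to rebuild the data in Python instead, the awkward part is the same: the address contains spaces, so a plain whitespace split yields a varying number of tokens. A possible workaround is to count fields from both ends of each printed line. This is only a sketch; the function name parse_printed_row is made up, and it assumes a one-word city and exactly five value columns. Only a few of the printed rows are included to keep it short.

```python
printed = """\
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
5
6 la 63 main st month 1 2 3 4 5
14 sf 953 main st z 452111 -218 341892 52990 2846
"""


def parse_printed_row(line):
    """Split one printed row into [city, address, variable, v1..v5]."""
    tokens = line.split()
    if len(tokens) == 1:  # only a row number: blank separator row
        return None
    city = tokens[1]                      # assumption: city is a single word
    address = " ".join(tokens[2:-6])      # everything between city and variable
    variable = tokens[-6]
    values = [int(t) for t in tokens[-5:]]  # assumption: five integer columns
    return [city, address, variable, *values]


rows = [parse_printed_row(line) for line in printed.splitlines()]
for row in rows:
    print(row)
```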
Answer 2:
It would help first if your df had proper column names; please set them once you have read in the data.
I have used the dplyr, tidyr, and stringr libraries for this analysis (gather() and spread() come from tidyr) and also renamed the first 3 columns:
df <- data.frame(stringsAsFactors=FALSE,
city = c("nyc", "nyc", "nyc"),
address = c("123 main st", "123 main st", "123 main st"),
month = c("x", "y", "z"),
X1 = c(58568L, 5345L, 53223L),
X2 = c(567567L, 3673L, 563894L),
X3 = c(567909L, 3453L, 564456L),
X4 = c(35876L, 3467L, 32409L),
X5 = c(56943L, 788L, 56155L)
)
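library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(Type, Value, -c(city:month)) %>%  # wide to long: X1..X5 become rows
  spread(month, Value) %>%                      # x, y, z become columns
  mutate(month = str_sub(Type, 2, 2)) %>%       # "X1" -> "1"
  select(-Type) %>%
  select(c(city, address, month, x:z))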
city address month x y z
1 nyc 123 main st 1 58568 5345 53223
2 nyc 123 main st 2 567567 3673 563894
3 nyc 123 main st 3 567909 3453 564456
4 nyc 123 main st 4 35876 3467 32409
5 nyc 123 main st 5 56943 788 56155
Source: https://stackoverflow.com/questions/45555048/wide-to-long-data-table-transformation-with-variables-in-columns-and-rows