Divide or split dataframe into multiple dfs based on empty row and header title

Question

I have a dataframe that holds multiple tables read from a single file. I want to divide it into multiple files (around 25 from this one file). The pattern is that a new df starts wherever there is one blank row followed by a header title. I have tried the approach from "Splitting dataframes in R based on empty rows", but it does not take care of a blank row within a new df (V1 column, 9th row). I want the data to be divided on an empty row plus a header title. My data and the code I have tried are given below. Also, how can I use the header title row as the data frame name for each newly created df?

 df = structure(list(V1 = c("Machine", "", "Machine", "V1", "03-09-2020", 
"", "Machine", "No", "Name", "a", "1", "2", "", "Machine", "No", 
""), V2 = c("Data", "", "run", "V2", "600119", "", "error", "SpNo", 
"", "a", "b", "c", "", "logs", "sp", ""), V3 = c("Editor", "", 
"information", "V3", "6", "", "messages", "OP", "", "", "b", 
"c", "", "", "op", ""), V4 = c("", "", "", "V4", "", "", "", 
"OP", "", "", "", "", "", "", "name", "")), class = "data.frame", row.names = c(NA, 
-16L))

dt <- df



## add column to indicate groups
dt$tbl_id <- cumsum(!nzchar(dt$V1))

unique(dt$tbl_id)

## remove blank lines
dt <- dt[nzchar(dt$V1), ]

## split the data frame
dt_s <- split(dt[, -ncol(dt)], dt$tbl_id)

## use first line as header and reset row numbers
dt_s <- lapply(dt_s, function(x) {
  colnames(x) <- x[1, ]
  x <- x[-1, ]
  rownames(x) <- NULL
  x
})

Any help would be greatly appreciated. Also, all the header titles will be the same in all the files. I am using lapply for the multi-file operations.

The expected output is:

Machine_run_information <- read.table(text="
V1  V2  V3  V4
03-09-2020  600119  -   6

",header = T)

Machine_error_messages <- read.table(text="
No  SpNo    OP  OP_Name
-   -   a   a
1   -   b   b
2   -   c   c

",header = T)

There will be 25 outputs similar to these.

Answer 1:

Maybe you can try

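# TRUE for rows in which every column is an empty string
u <- rowSums(df == "") == ncol(df)
# drop the blank rows, then split the rest by the running count of blank rows
out <- split(subset(df, !u), cumsum(u)[!u])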

which gives

> out
$`0`
       V1   V2     V3 V4
1 Machine Data Editor

$`1`
          V1     V2          V3 V4
3    Machine    run information
4         V1     V2          V3 V4
5 03-09-2020 600119           6

$`2`
        V1    V2       V3 V4
7  Machine error messages   
8       No  SpNo       OP OP
9     Name
10       a     a
11       1     b        b
12       2     c        c

$`3`
        V1   V2 V3   V4
14 Machine logs        
15      No   sp op name
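
The split above still leaves the title row and the header row inside each piece. A minimal base R sketch of the remaining steps follows, assuming that the first row of every piece is the title and the second row is the column header. The names `tidy_chunk`, `titles`, and `tables` are hypothetical, not part of the original answer. Because the sample header contains "OP" twice, `make.unique` turns the second one into "OP.1" rather than the "OP_Name" shown in the question.

```r
# Sample data copied from the question
df <- structure(list(
  V1 = c("Machine", "", "Machine", "V1", "03-09-2020", "", "Machine",
         "No", "Name", "a", "1", "2", "", "Machine", "No", ""),
  V2 = c("Data", "", "run", "V2", "600119", "", "error", "SpNo",
         "", "a", "b", "c", "", "logs", "sp", ""),
  V3 = c("Editor", "", "information", "V3", "6", "", "messages", "OP",
         "", "", "b", "c", "", "", "op", ""),
  V4 = c("", "", "", "V4", "", "", "", "OP", "", "", "", "", "", "",
         "name", "")
), class = "data.frame", row.names = c(NA, -16L))

# Split on rows where every column is empty (same idea as the answer above)
blank  <- rowSums(df == "") == ncol(df)
chunks <- split(df[!blank, ], cumsum(blank)[!blank])

# Build a name from the non-empty cells of each title row
titles <- vapply(chunks, function(x) {
  cells <- unlist(x[1, ], use.names = FALSE)
  paste(cells[nzchar(cells)], collapse = "_")
}, character(1))

# Hypothetical helper: row 1 is the title, row 2 the header, the rest is data
tidy_chunk <- function(x) {
  if (nrow(x) < 2) return(NULL)  # title only, no header row to promote
  body <- x[-(1:2), , drop = FALSE]
  names(body) <- make.unique(unlist(x[2, ], use.names = FALSE))
  rownames(body) <- NULL
  body
}

tables <- lapply(chunks, tidy_chunk)
names(tables) <- titles
tables <- Filter(Negate(is.null), tables)  # drop title-only pieces

names(tables)
```

Keeping the results in a named list is usually easier to work with in `lapply` than creating 25 separate variables. If separate objects are really needed, `list2env(tables, envir = globalenv())` would create one per list element.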

Answer 2:

Here is an approach using dplyr::group_split (which is in an experimental lifecycle).

df = structure(list(V1 = c("Machine", "", "Machine", "V1", "03-09-2020", 
"", "Machine", "No", "Name", "a", "1", "2", "", "Machine", "No", 
""), V2 = c("Data", "", "run", "V2", "600119", "", "error", "SpNo", 
"", "a", "b", "c", "", "logs", "sp", ""), V3 = c("Editor", "", 
"information", "V3", "6", "", "messages", "OP", "", "", "b", 
"c", "", "", "op", ""), V4 = c("", "", "", "V4", "", "", "", 
"OP", "", "", "", "", "", "", "name", "")), class = "data.frame", row.names = c(NA, 
-16L))

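library(dplyr)  # attaches the %>% pipe used below

df %>%
  dplyr::mutate(FLAG = rowSums(. == "") == ncol(.)) %>%  # TRUE for fully blank rows
  dplyr::mutate(GRP = cumsum(FLAG)) %>%                  # running count of blank rows as group id
  dplyr::filter(!FLAG) %>%                               # drop the blank separator rows
  dplyr::group_by(GRP) %>%
  dplyr::group_split() %>%
  lapply(function(f) dplyr::select(f, -FLAG, -GRP))      # remove the helper columns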

[[1]]
# A tibble: 1 x 4
  V1      V2    V3     V4   
  <chr>   <chr> <chr>  <chr>
1 Machine Data  Editor ""   

[[2]]
# A tibble: 3 x 4
  V1         V2     V3          V4   
  <chr>      <chr>  <chr>       <chr>
1 Machine    run    information ""   
2 V1         V2     V3          "V4" 
3 03-09-2020 600119 6           ""   

[[3]]
# A tibble: 6 x 4
  V1      V2      V3         V4   
  <chr>   <chr>   <chr>      <chr>
1 Machine "error" "messages" ""   
2 No      "SpNo"  "OP"       "OP" 
3 Name    ""      ""         ""   
4 a       "a"     ""         ""   
5 1       "b"     "b"        ""   
6 2       "c"     "c"        ""   

[[4]]
# A tibble: 2 x 4
  V1      V2    V3    V4    
  <chr>   <chr> <chr> <chr> 
1 Machine logs  ""    ""    
2 No      sp    "op"  "name"
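
Whichever splitting approach is used, the question also asks for the pieces to end up as separate files. A minimal base R sketch of that last step follows, assuming the pieces are already held in a named list of data frames. The list `tables`, the folder name `split_tables`, and the use of `tempdir()` are illustrative assumptions, with the two entries built from values in the question's sample data.

```r
# Illustrative named list; in practice this would be the result of the split
tables <- list(
  Machine_run_information = data.frame(
    V1 = "03-09-2020", V2 = "600119", V3 = "6", V4 = "",
    stringsAsFactors = FALSE
  ),
  Machine_logs = data.frame(
    No = character(), sp = character(), op = character(), name = character(),
    stringsAsFactors = FALSE
  )
)

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "split_tables")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# make.names() turns each title into a syntactically safe file stem
paths <- file.path(out_dir, paste0(make.names(names(tables)), ".csv"))

# Write one CSV per data frame, without the row-number column
for (i in seq_along(tables)) {
  write.csv(tables[[i]], file = paths[i], row.names = FALSE)
}

basename(paths)
```

A data frame with a header but no data rows, such as `Machine_logs` here, is written as a file containing only the header line.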

Source: https://stackoverflow.com/questions/63718774/divide-or-split-dataframe-into-multiple-dfs-based-on-empty-row-and-header-title

Tags

dataframe

dplyr

data.table