How to locate a structured region of data inside of a not structured data frame in R?

余生长醉 提交于 2019-12-13 08:45:04

问题


I have a certain kind of data frames that contain a subset of interest. The problem is that this subset, is non consistent between the different data frames. Nonetheless, in a more abstract level, follows a general structure: a rectangular region inside the data frame.

example1 <- data.frame(x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA), 
       y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
       z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA))

print(example1)

      x    y     z
1   name <NA>  <NA>
2  129-2 <NA>  <NA>
3   <NA> <NA>  <NA>
4   <NA> <NA>  <NA>
5    acc  deb asset
6      2    3     1
7      3    2     1
8      4    5     2
9   <NA> <NA>  <NA>
10  <NA> <NA>  <NA>

The example1 contain a clear rectangular región with a structure information:

5    acc  deb asset
6      2    3     1
7      3    2     1
8      4    5     2

As mentioned before, the region is not always consistent,

  1. the position of the columns are not always the same
  2. the name of the variables insde the subset of interest are not always the same

Here another example2:

example2 <- data.frame(x = c("name", "129-2", "wallabe #23", NA, NA, "acc", 2, 3, 4, NA ), 
       y = c(NA, NA, NA, NA, "balance", "deb", 3, 2, 5, NA),
       z = c(NA, NA, NA, NA, NA, "asset", 1, 1, 2, NA),
       u = c(NA, NA, NA, "currency:", NA, NA, NA, NA, NA, NA),
       i = c(NA, NA, NA, "USD", "result", "win", 2, 3, 1, NA),
       o = c(NA, NA, NA, NA, NA, "lose", 2, 2, 1, NA))

print(example2)
> example2
            x       y     z         u      i    o
1         name    <NA>  <NA>      <NA>   <NA> <NA>
2        129-2    <NA>  <NA>      <NA>   <NA> <NA>
3  wallabe #23    <NA>  <NA>      <NA>   <NA> <NA>
4         <NA>    <NA>  <NA> currency:    USD <NA>
5         <NA> balance  <NA>      <NA> result <NA>
6          acc     deb asset      <NA>    win lose
7            2       3     1      <NA>      2    2
8            3       2     1      <NA>      3    2
9            4       5     2      <NA>      1    1
10        <NA>    <NA>  <NA>      <NA>   <NA> <NA>

The example2 contain a not clear rectangular región:

6          acc     deb asset      <NA>    win lose
7            2       3     1      <NA>      2    2
8            3       2     1      <NA>      3    2
9            4       5     2      <NA>      1    1

One method to scan this dataframe to locate this kind of region inside of it?

Any idea is appreciated


回答1:


You might want to try the longest sequence with same amount of NAs:

findTable <- function(df){
  naSeq <- rowSums(is.na(df))          # How many NA per row
  myRle <- rle(naSeq )$length          # Find sequences length
  df[rep(myRle == max(myRle), myRle),] # Get longest sequence
}

findTable(example1)
    x   y     z
5 acc deb asset
6   2   3     1
7   3   2     1
8   4   5     2

findTable(example2)
    x   y     z    u   i    o
6 acc deb asset <NA> win lose
7   2   3     1 <NA>   2    2
8   3   2     1 <NA>   3    2
9   4   5     2 <NA>   1    1


来源:https://stackoverflow.com/questions/40007903/how-to-locate-a-structured-region-of-data-inside-of-a-not-structured-data-frame

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!