Convert unstructured csv file to a data frame

前端 未结 3 977
再見小時候
再見小時候 2021-01-13 16:07

I am learning R for text mining. I have a TV program schedule in form of CSV. The programs usually start at 06:00 AM and goes on until 05:00 AM the next day which is called

相关标签:
3条回答
  • 2021-01-13 16:11

    1) This sets up some functions and then consists of four transform(...) %>% subset(...) code fragments linked together using a magrittr pipeline. We assume DF is the output of the read.table in the question.

    First, load the zoo package so get access to na.locf. Define a Lead function which shifts each element by 1 position. Also define a datetime function which converts a date plus a h.m number to a datetime.

    Now convert the dates to "Date" class. The rows that are not dates will become NA. Use Lead to shift that vector by 1 position and then extract the NA positions effectively removing the weekday rows. Now use na.locf to fill in the dates and keep only rows with duplicated dates effectively removing the rows containing only a date. Next set Program as V1 and Synopsis as V2 except we must shift V2 using Lead since the Synopsis is on the second row of each pair. Keep only the odd positioned rows. Produce datetime and pick out desired columns.

    library(magrittr)
    library(zoo) # needed for na.locf
    
    Lead <- function(x, fill = NA) c(x[-1], fill)  # shift down and fill
    datetime <- function(date, time) {
                  time <- as.numeric(time)
                  as.POSIXct(sprintf("%s %.0f:%02f", date, time, 100 * (time %% 1))) + 
                          24 * 60 * 60 * (time < 6) # add day if time < 6
    }
    
    DF %>% 
    
       transform(date = as.Date(V1, "%d-%b-%y")) %>% 
       subset(Lead(is.na(date), TRUE)) %>%   # rm weekday rows
    
       transform(date = na.locf(date)) %>%  # fill in dates
       subset(duplicated(date)) %>% # rm date rows
    
       transform(Program = V2, Synopsis = Lead(V1)) %>% 
       subset(c(TRUE, FALSE)) %>%  # keep odd positioned rows only
    
       transform(Date = datetime(date, V1)) %>% 
       subset(select = c("Date", "Program", "Synopsis"))
    

    giving:

                     Date Program                            Synopsis
    1 2015-11-01 06:00:00     Tom  some information about the program
    2 2015-11-01 23:30:00   Jerry  some information about the program
    3 2015-11-02 05:00:00  Avatar  some information about the program
    4 2015-11-02 06:00:00     Tom  some information about the program
    5 2015-11-02 23:30:00   Jerry  some information about the program
    6 2015-11-03 05:00:00  Avatar  some information about the program
    

    2) dplyr and here it is using dplyr and the datetime function above. We could have replaced the transform and subset functions in (1) with dplyr mutate and filter and Lead with lead but for variety we do it another way:

    library(dplyr)
    library(zoo) # na.locf
    
    DF %>%
       mutate(date = as.Date(V1, "%d-%b-%t")) %>%
       filter(lead(is.na(date), default = TRUE)) %>% # rm weekday rows
       mutate(date = na.locf(date)) %>% # fill in dates
       group_by(date) %>%
       mutate(Program = V2, Synopsis = lead(V1)) %>%
       slice(seq(2, n(), by = 2)) %>%
       ungroup() %>%
       mutate(Date = datetime(date, V1)) %>%
       select(Date, Program, Synopsis)
    

    giving:

    Source: local data frame [6 x 3]
    
                     Date Program                            Synopsis
                   (time)   (chr)                               (chr)
    1 2015-11-01 06:00:00     Tom  some information about the program
    2 2015-11-01 23:30:00   Jerry  some information about the program
    3 2015-11-02 05:00:00  Avatar  some information about the program
    4 2015-11-02 06:00:00     Tom  some information about the program
    5 2015-11-02 23:30:00   Jerry  some information about the program
    6 2015-11-03 05:00:00  Avatar  some information about the program
    

    3) data.table This also uses na.locf from zoo and datetime defined in (1):

    library(data.table)
    library(zoo)
    
    dt <- data.table(DF)
    dt <- dt[, date := as.Date(V1, "%d-%b-%y")][
              shift(is.na(date), type = "lead", fill = TRUE)][, # rm weekday rows
              date := na.locf(date)][duplicated(date)][,  # fill in dates & rm date rows
              Synopsis := shift(V1, type = "lead")][seq(1, .N, 2)][, # align Synopsis
              c("Date", "Program") := list(datetime(date, V1), V2)][, 
              list(Date, Program, Synopsis)]
    

    giving:

    > dt
                      Date Program                            Synopsis
    1: 2015-11-01 06:00:00     Tom  some information about the program
    2: 2015-11-01 23:30:00   Jerry  some information about the program
    3: 2015-11-02 05:00:00  Avatar  some information about the program
    4: 2015-11-02 06:00:00     Tom  some information about the program
    5: 2015-11-02 23:30:00   Jerry  some information about the program
    6: 2015-11-03 05:00:00  Avatar  some information about the program
    

    UPDATE: Simplified (1) and added (2) and (3).

    0 讨论(0)
  • 2021-01-13 16:14

    An alternative solution with data.table:

    library(data.table)
    library(zoo)
    library(splitstackshape)
    
    txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
    tv <- readLines(txt)
    DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
    
    wd <- levels(weekdays(1:7, abbreviate = FALSE))
    
    DT <- DT[, temp := tv %chin% wd
             ][, day := tv[temp], by = 1:nrow(tvDT)
               ][, day := na.locf(day)
                 ][, temp := NULL
                   ][, idx := rleid(day)
                     ][, date := tv[2], by = idx
                       ][, .SD[-c(1,2)], by = idx]
    
    DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]
    DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
    
    DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
       ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)
         ][, .(datetime, Program, Info)]
    

    The result:

    > DT
                  datetime Program                               Info
    1: 2015-11-01 06:00:00     Tom some information about the program
    2: 2015-11-01 23:30:00   Jerry some information about the program
    3: 2015-11-02 05:00:00  Avatar some information about the program
    4: 2015-11-02 06:00:00     Tom some information about the program
    5: 2015-11-02 23:30:00   Jerry some information about the program
    6: 2015-11-03 05:00:00  Avatar some information about the program
    

    Explanation:

    1: read data, convert to a data.table & remove trailing |:

    txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
    tv <- readLines(txt)
    DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
    

    2: extract the weekdays into a new column

    wd <- levels(weekdays(1:7, abbreviate = FALSE)) # a vector with the full weekdays
    DT[, temp := tv %chin% wd
       ][, day := tv[temp], by = 1:nrow(tvDT)
         ][, day := na.locf(day)
           ][, temp := NULL]
    

    3: create an index per day & create a column with the dates

    DT[, idx := rleid(day)][, date := tv[2], by = idx]
    

    4: remove unnecessary lines

    DT <- DT[, .SD[-c(1,2)], by = idx]
    

    5: split the time and the program-name into separate rows & create a label column

    DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info")), by = idx]
    

    6: reshape into wide format using the 'rowid' function from the development version of data.table

    DT <- dcast(DT, idx + day + date + rowid(idx2) ~ idx2, value.var = "tv")[, idx2 := NULL]
    

    7: create a dattime column & set the late night time to the next day

    DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
       ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & datetime < 6) * 24 * 60 * 60)]
    

    8: keep the needed columns

    DT <- DT[, .(datetime, Program, Info)]
    
    0 讨论(0)
  • 2021-01-13 16:34

    It's a bit of a mess, but it seems to work:

    df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
    cat(txt)
    Sys.setlocale("LC_TIME", "English") # if needed
    weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
    days <- split(df, cumsum(df$V1 %in% weekdays))
    lapply(days, function(dayDF) {
      tmp <- cbind.data.frame(V1=dayDF[2, 1], do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2]==""))), stringsAsFactors = F)
      tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
      tmp[, 2] <- as.numeric(tmp[, 2])
      tmp[, 5] <- NULL
      idx <- c(FALSE, diff(tmp[, 2])<0)
      tmp[idx, 1] <- tmp[idx, 1] + 1
      return(tmp)
    }) -> days
    days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)  
    names(days) <- c("Date", "Synopsis", "Program")
    rownames(days) <- NULL
    days[, c(1, 3, 2)]
    #                  Date Program                            Synopsis
    # 1 2015-11-01 06:00:00     Tom  some information about the program
    # 2 2015-11-01 23:30:00   Jerry  some information about the program
    # 3 2015-11-02 05:00:00  Avatar  some information about the program
    # 4 2015-11-02 06:00:00     Tom  some information about the program
    # 5 2015-11-02 23:30:00   Jerry  some information about the program
    # 6 2015-11-03 05:00:00  Avatar  some information about the program
    
    0 讨论(0)
提交回复
热议问题