R - find/replace line breaks using regex

情深已故 2021-01-23 08:35

I'm trying to clean a bunch of .txt files in a folder using regex. I can't seem to get R to find line breaks.

This is the code I'm using. It works for character subst

1 answer
  • 2021-01-23 08:51

    You can't do that with xfun::gsub_dir, because it never sees the line breaks.

    Have a look at the source code:

    • The files are read in with read_utf8, which basically executes x = readLines(con, encoding = 'UTF-8', warn = FALSE), so every line arrives as a separate string with its line terminator stripped,
    • gsub is then applied to these lines one at a time, and when all replacements are done,
    • write_utf8 writes the lines back, joining them with the LF (newline) character.
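
The effect of the steps above can be checked with a minimal sketch. It uses only base R and a throwaway temporary file (the file contents are made up for illustration), and it mimics the read step rather than calling xfun itself:

```r
# Write a small two-line file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first", "second"), tmp)

# readLines() returns one element per line, with the terminators stripped.
x <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
has_newline <- grepl("\n", x, fixed = TRUE)

# A per-line substitution therefore finds no "\n" to replace.
replaced <- gsub("\n", "#", x, fixed = TRUE)

unlink(tmp)
```

Since no element of x contains a line break, any pattern that targets one has nothing to match.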

    You need a custom function for that. Here is a "quick and dirty" one that will replace all line breaks with #:

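    # Rewrite every file under `dir` in place, joining its lines with `newline`.
    lbr_change_gsub_dir = function(newline = '\n', encoding = 'UTF-8', dir = '.', recursive = TRUE) {
      files = list.files(dir, full.names = TRUE, recursive = recursive)
      for (f in files) {
        # readLines() strips the original line terminators
        x = readLines(f, encoding = encoding, warn = FALSE)
        # cat() puts `newline` between the lines (none after the last one)
        cat(x, sep = newline, file = f)
      }
    }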
    
    folder <- "C:\\MyFolder\\Here"
    lbr_change_gsub_dir(newline="#", dir=folder)
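
As a rough check of the read-then-cat step on a single file, here is a small sketch that works on a temporary file instead of a real folder (the three sample lines are invented for illustration):

```r
# Create a throwaway file with three lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c"), tmp)

# Same two steps as in the loop body: read the lines, write them back
# joined with "#" instead of a line break.
x <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
cat(x, sep = "#", file = tmp)

# The file now holds a single line without a trailing separator.
result <- readLines(tmp, warn = FALSE)

unlink(tmp)
```

Note that the file is overwritten in place, so it may be worth trying this on a copy of the folder first.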
    

    If you want to be able to match multiline patterns, paste the lines together, collapsing them with a newline, and then use any pattern you like:

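    # Apply a regex to the whole text of every file under `dir`, in place.
    lbr_gsub_dir = function(pattern, replacement, perl = TRUE, newline = '\n', encoding = 'UTF-8', dir = '.', recursive = TRUE) {
      files = list.files(dir, full.names = TRUE, recursive = recursive)
      for (f in files) {
        x <- readLines(f, encoding = encoding, warn = FALSE)
        # Collapse the lines into one string so the pattern can see line breaks
        x <- paste(x, collapse = newline)
        x <- gsub(pattern, replacement, x, perl = perl)
        # Write the result back (no newline is added after the last line)
        cat(x, file = f)
      }
    }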
    
    folder <- "C:\\1"
    lbr_gsub_dir("(?m)\\d+\\R(.+)", "\\1", dir = folder)
    

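    This removes a run of digits together with the line break that follows it whenever the next line is non-empty, so a digit-only line is dropped and the line after it is kept.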
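
To see what that pattern does without touching any files, here is a minimal sketch on an in-memory string (the sample text is made up; the (?m) flag is left out because the pattern has no ^ or $ anchors for it to affect):

```r
# Five lines collapsed into one string, as paste(x, collapse = "\n") would produce.
text <- paste(c("123", "foo", "bar", "456", "baz"), collapse = "\n")

# \d+ matches the digits, \R matches any line break (PCRE, hence perl = TRUE),
# and (.+) captures the following non-empty line, which "\\1" puts back.
cleaned <- gsub("\\d+\\R(.+)", "\\1", text, perl = TRUE)

# Split again to inspect the surviving lines.
lines_left <- strsplit(cleaned, "\n", fixed = TRUE)[[1]]
```

Here lines_left contains "foo", "bar" and "baz": the digit-only lines are gone and the lines after them are kept.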
