Question
I am doing a text analysis in R and need to convert the first letter of each sentence to lowercase while leaving other capitalized words as they are. So I used the command
x <- gsub("(\\..*?[A-Z])", '\\L\\1', x, perl=TRUE)
which worked, but only partially. The problem is that for the text analysis I had to convert the PDF files to txt format, and the txt files now contain a lot of empty lines (possibly page breaks or carriage returns), so the command I used does not convert the capital letters that appear at the start of new lines. I tried to eliminate the empty lines using different gsub patterns with multiple \s, \r and \n, but nothing works. When I call inspect(x) from the tm package, the output looks like this:
[346]
[347] Thank you.
[348]
[349] Vice President of Investor Relations
[350]
I would be grateful if anyone could help me!
Answer 1:
Given your output, the empty lines appear to be separate elements of a character vector. You can filter them out using grepl:
empty_lines = grepl('^\\s*$', x)
x = x[! empty_lines]
Then you can perform your subsequent analysis, but you probably still need to concatenate the lines first to get a single character string:
x = paste(x, collapse = '\n')
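Putting the steps above together with the command from the question, a minimal sketch could look like the following. The sample vector is made up for illustration, and `text` and `result` are hypothetical names. One assumption worth flagging: the sketch collapses with a space rather than '\n', because `.` in a PCRE pattern does not match a newline by default, so the question's pattern would not reach across a '\n' separator unless it were adjusted (for example with the `(?s)` flag).

```r
# Hypothetical sample resembling the inspect() output in the question
x <- c("", "Thank you.", "", "Vice President of Investor Relations", "   ")

# Drop elements that are empty or contain only whitespace
empty_lines <- grepl("^\\s*$", x)
x <- x[!empty_lines]

# Join the remaining lines into one string (space separator, see note above)
text <- paste(x, collapse = " ")

# Lowercase everything from a period up to the next capital letter
result <- gsub("(\\..*?[A-Z])", "\\L\\1", text, perl = TRUE)
print(result)
# [1] "Thank you. vice President of Investor Relations"
```

The very first letter of the string is not preceded by a period, so this pattern alone leaves it unchanged.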
Answer 2:
You can match a capital letter at the start of a line with ^[A-Z] and combine the two cases with the alternation operator |
x <- gsub("(\\..*?[A-Z]|^[A-Z])", '\\L\\1', x, perl=TRUE)
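As a quick check of how this pattern behaves, here is a small sketch on a made-up two-element vector (the sample strings are assumptions, not data from the question). It relies on each line being its own element of the vector, so that `^` anchors at the start of every line.

```r
# Hypothetical sample: each element is one line of the text
x <- c("Thank you. We appreciate it.", "Vice President of Investor Relations")

# Lowercase a capital at the start of an element, or the first capital after a period
x <- gsub("(\\..*?[A-Z]|^[A-Z])", "\\L\\1", x, perl = TRUE)
print(x)
# [1] "thank you. we appreciate it."         "vice President of Investor Relations"
```

Capitalized words inside a sentence, such as "President", are left as they are.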
And you can get rid of empty lines either before or after the above step with
x <- x[x != ""]
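One caveat, sketched below on made-up data: comparing against "" removes only elements that are exactly empty, so lines that contain spaces or tabs would survive. If the converted files contain such lines (an assumption, the question does not say), a whitespace-aware test with grepl is a possible alternative. The names `exact` and `blank` are hypothetical.

```r
# Hypothetical sample with an empty string and two whitespace-only strings
x <- c("", "Thank you.", "   ", "\t", "Vice President of Investor Relations")

# Removes only the exactly empty element
exact <- x[x != ""]
length(exact)
# [1] 4

# Removes empty and whitespace-only elements
blank <- x[!grepl("^\\s*$", x)]
print(blank)
# [1] "Thank you."                           "Vice President of Investor Relations"
```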
Source: https://stackoverflow.com/questions/37786091/cant-remove-blank-lines-in-txt-file-with-r