Can't remove blank lines in a txt file with R

眉间皱痕 submitted on 2019-12-24 07:16:34

Question


I am doing a text analysis in R and need to convert the first letter of each sentence to lowercase while leaving other capitalized words unchanged. I used the command

     x <- gsub("(\\..*?[A-Z])", '\\L\\1', x, perl=TRUE)

which worked, but only partially. For the text analysis I had to convert the PDF files to txt format, and the txt files now contain many empty lines (probably page breaks and carriage returns), so the command does not lowercase the capital letters that begin a new line. I tried to eliminate the empty lines with gsub using various combinations of \s, \r and \n, but nothing works. When I run inspect(x) from the tm package, the output looks like this:

[346]
[347]    Thank you.
[348]
[349]    Vice President of Investor Relations
[350]

I would be grateful if anyone could help me!


Answer 1:


Given your output, the empty lines appear to be separate character strings in a character vector. You need to filter those out using grepl, which returns a logical vector marking the elements that are empty or contain only whitespace:

empty_lines = grepl('^\\s*$', x)
x = x[! empty_lines]

Then you can perform your subsequent analysis, but you probably still need to concatenate the lines first to get a single character string:

x = paste(x, collapse = '\n')
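
The two steps above can be combined with the substitution from the question into one short sketch. The sample vector is hypothetical and only imitates the inspect() output shown in the question. The lines are joined with a space rather than '\n', on the assumption that with perl = TRUE the dot does not match a newline by default, so the pattern could not otherwise reach a capital letter on the next line.

```r
# Hypothetical sample imitating the inspect() output from the question
x <- c("", "   Thank you.", "   ", "   Vice President of Investor Relations", "  ")

# Drop elements that are empty or contain only whitespace
empty_lines <- grepl("^\\s*$", x)
x <- x[!empty_lines]

# Strip the padding and join the lines into a single string.
# A space is used as the separator here (an assumption, see the note above).
x <- paste(trimws(x), collapse = " ")

# Lowercase everything from each period up to the next capital letter
x <- gsub("(\\..*?[A-Z])", "\\L\\1", x, perl = TRUE)

print(x)
```

The very first capital of the text has no preceding period, so this pattern leaves it unchanged.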



Answer 2:


You can match a capital letter at the start of a line with ^[A-Z] and combine the two cases with the alternation operator |:

x <- gsub("(\\..*?[A-Z]|^[A-Z])", '\\L\\1', x, perl=TRUE)

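You can remove the empty lines either before or after the step above. Note that this comparison drops only elements that are exactly the empty string; elements containing only whitespace remain: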

x <- x[x != ""]
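
As a rough illustration of how this approach behaves on a character vector, the sketch below uses a hypothetical sample and compares the exact comparison with a whitespace-tolerant filter built from trimws() and nzchar(). That filter is an addition for illustration, not part of the answer itself.

```r
# Hypothetical sample: one empty string and one whitespace-only string
x <- c("", "Thank you.", "   ", "Vice President. Next Item")

# Exact comparison removes only the truly empty string
only_empty_removed <- x[x != ""]

# Trimming first also removes whitespace-only strings
blank_removed <- x[nzchar(trimws(x))]

# Lowercase the capital at the start of each element and after each period
result <- gsub("(\\..*?[A-Z]|^[A-Z])", "\\L\\1", blank_removed, perl = TRUE)

print(only_empty_removed)
print(result)
```

Because ^[A-Z] only matches a capital in the very first position, elements with leading spaces, as in the question's output, would need trimws() applied to the text itself before the substitution.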


Source: https://stackoverflow.com/questions/37786091/cant-remove-blank-lines-in-txt-file-with-r
