Removing rows containing special characters

北慕城南 posted on 2019-12-13 06:12:27

Question


I am filtering a massive dataset that is read in as a list of lines. I need to filter out lines with special markings and am getting stuck on some of them. Here is what I currently have:

library(R.utils)
library(stringr)

gunzip("movies.list.gz") #open file
movies <- readLines("movies.list") #read lines in
movies <- gsub("[\t]", '', movies) #remove tabs (\t)
#movies <- gsub(, '', movies)
a <- movies[!grepl("\\{", movies)] # removed any line that contained special character {
b <- a[!grepl("\\(V)", a)] #remove porn?
c <- b[!grepl("\\(TV)", b)] #remove tv
d <- c[!grepl("\\(VG)", c)] #remove video games
e <- d[!grepl("\\(\\?\\?\\?\\?\\)", d)] #remove anything with unknown date ex (????)
f <- e[!grepl("\\#)", e)] 
g <- e[!grepl("\\!)", f)]


i <- data.frame(g)
i <- i[-c(1:15),]
i <- data.frame(i)
i$Date <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 2)
i$Title <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 1)

I still need to clean it up a bit and remove the original column (i), but from the output you can see that it is not removing the lines with the special characters ! or #:

> head(i)
                                i      Date                Title
1            "!Next?" (1994)1994-1995 1994-1995            "!Next?" 
2         "#1 Single" (2006)2006-???? 2006-????         "#1 Single" 
3 "#1MinuteNightmare" (2014)2014-???? 2014-???? "#1MinuteNightmare" 
4           "#30Nods" (2014)2014-2015 2014-2015           "#30Nods" 
5       "#7DaysLater" (2013)2013-???? 2013-????       "#7DaysLater" 
6            "#ATown" (2014)2014-???? 2014-????            "#ATown" 

What I actually want to do is remove every row that contains one of those special characters. Everything I have tried has thrown errors. Any suggestions?


Answer 1:


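You could strip every character that is not alphanumeric, a "-", or a parenthesis like this. Note that this edits each string rather than dropping rows, and that it also removes spaces and quotes:

gsub("[^A-Za-z0-9()-]", "", row)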
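To see what a substitution like this does to the titles from the question, here is a minimal sketch. The `titles` vector is made-up sample data in the question's format, and the pattern is a variant that also keeps digits and spaces (an assumption about what you want to keep).

```r
# Hypothetical sample lines in the same shape as the question's data
titles <- c('"!Next?" (1994)', '"#1 Single" (2006)', 'Alien (1979)')

# Delete every character that is NOT a letter, digit, space,
# parenthesis, or hyphen (the hyphen is last so it is a literal)
cleaned <- gsub("[^A-Za-z0-9 ()-]", "", titles)

print(cleaned)
# The vector still has 3 elements: characters were stripped,
# but no rows were removed
print(length(cleaned))
```

So gsub is the right tool if you want to keep the row and clean its text; to drop the row entirely you need to subset with grepl instead.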



Answer 2:


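To remove the rows that contain # or !, match either character with a character class and keep only the lines that do not match (this assumes data is a character vector, as returned by readLines):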

data[!grepl(pattern = "[#!]", x = data)]
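A likely reason the patterns in the question had no effect is that "\\#)" and "\\!)" appear to look for the character followed by a literal ")", which never occurs in titles such as "#1 Single"; in addition, the line that builds g indexes e with a mask computed from f. Below is a small sketch of the character-class approach on made-up sample data (`movies` and `df` here are illustrative, not the real file).

```r
# Hypothetical sample lines in the same shape as the question's data
movies <- c('"!Next?" (1994)', '"#1 Single" (2006)',
            'Alien (1979)', 'Heat (1995)')

# TRUE for every line that contains '#' or '!' anywhere
has_marker <- grepl("[#!]", movies)

# Keep only the lines without either character
kept <- movies[!has_marker]
print(kept)

# The same filter on a data frame column; drop = FALSE keeps the
# result a data frame even when only one column is present
df <- data.frame(Title = movies, stringsAsFactors = FALSE)
df_kept <- df[!grepl("[#!]", df$Title), , drop = FALSE]
print(df_kept)
```

Computing the mask and subsetting from the same object in each step avoids the e/f mix-up.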

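To remove every row that contains any character outside an allowed set, use the character class suggested by @luke1018 with grepl. Put the hyphen last so it is read as a literal, and add any other characters that legitimately appear in your rows (spaces or double quotes, for example); otherwise those rows are dropped too:

data[!grepl(pattern = "[^A-Za-z0-9()-]", x = data)]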
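As a rough illustration of the allow-list idea on lines shaped like the ones in the question, the sketch below adds the space and the double quote to the allowed set. Both additions are assumptions about what counts as "not special" in your data, and the `movies` vector is made up.

```r
# Hypothetical sample lines in the same shape as the question's data
movies <- c('"!Next?" (1994)', '"#1 Single" (2006)',
            '"30 Rock" (2006-2013)', 'Alien (1979)')

# Allowed: letters, digits, space, double quote, parentheses, hyphen.
# Single quotes delimit the R string so the double quote needs no escape.
allowed_only <- !grepl('[^A-Za-z0-9 "()-]', movies)

# Keep only the lines made up entirely of allowed characters
kept <- movies[allowed_only]
print(kept)
```

Without the space and the quote in the class, every one of these sample lines would be flagged, so it is worth checking the pattern on a few known-good rows before running it on the full file.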


Source: https://stackoverflow.com/questions/36344180/removing-rows-containing-special-characters
