I am having some troubles with leading and trailing white space in a data.frame.
For example, I like to take a look at a specific row
in a data.fra
Probably the best way is to handle the trailing white spaces when you read your data file. If you use read.csv
or read.table
you can set the parameterstrip.white=TRUE
.
If you want to clean strings afterwards you could use one of these functions:
# Returns string without leading white space
trim.leading <- function (x) sub("^\\s+", "", x)
# Returns string without trailing white space
trim.trailing <- function (x) sub("\\s+$", "", x)
# Returns string without leading or trailing white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
To use one of these functions on myDummy$country
:
myDummy$country <- trim(myDummy$country)
To 'show' the white space you could use:
paste(myDummy$country)
which will show you the strings surrounded by quotation marks (") making white spaces easier to spot.
Another related problem occurs if you have multiple spaces in between inputs:
> a <- " a string with lots of starting, inter mediate and trailing whitespace "
You can then easily split this string into "real" tokens using a regular expression to the split
argument:
> strsplit(a, split=" +")
[[1]]
[1] "" "a" "string" "with" "lots"
[6] "of" "starting," "inter" "mediate" "and"
[11] "trailing" "whitespace"
Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
I tried trim(). It works well with white spaces as well as the '\n'.
x = '\n Harden, J.\n '
trim(x)
Another option is to use the stri_trim
function from the stringi
package which defaults to removing leading and trailing whitespace:
> x <- c(" leading space","trailing space ")
> stri_trim(x)
[1] "leading space" "trailing space"
For only removing leading whitespace, use stri_trim_left
. For only removing trailing whitespace, use stri_trim_right
. When you want to remove other leading or trailing characters, you have to specify that with pattern =
.
See also ?stri_trim
for more info.
I created a trim.strings ()
function to trim leading and/or trailing whitespace as:
# Arguments: x - character vector
# side - side(s) on which to remove whitespace
# default : "both"
# possible values: c("both", "leading", "trailing")
trim.strings <- function(x, side = "both") {
if (is.na(match(side, c("both", "leading", "trailing")))) {
side <- "both"
}
if (side == "leading") {
sub("^\\s+", "", x)
} else {
if (side == "trailing") {
sub("\\s+$", "", x)
} else gsub("^\\s+|\\s+$", "", x)
}
}
For illustration,
a <- c(" ABC123 456 ", " ABC123DEF ")
# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF"
# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456 " "ABC123DEF "
# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] " ABC123 456" " ABC123DEF"
Use grep or grepl to find observations with white spaces and sub to get rid of them.
names<-c("Ganga Din\t", "Shyam Lal", "Bulbul ")
grep("[[:space:]]+$", names)
[1] 1 3
grepl("[[:space:]]+$", names)
[1] TRUE FALSE TRUE
sub("[[:space:]]+$", "", names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"