I've been trying to understand how to deal with the output of strsplit
a bit better. I often have data such as this that I wish to split:
myd
Try this:
> read.table(text = mydata, sep = "/", as.is = TRUE, fill = TRUE)
   V1 V2 V3
1 144  4  5
2 154  2 NA
3 146  3  5
4 142 NA NA
5 143  4 NA
6 DNB NA NA
7  90 NA NA
If you want to treat DNB as an NA, then add the argument na.strings = "DNB".
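As a minimal sketch of the na.strings suggestion: mydata below is an assumption, reconstructed from the output shown above rather than taken from the question. Since read.table with fill = TRUE decides the number of columns from the first five lines of input, the sketch also passes col.names explicitly so that a longer row further down would not be mishandled.

```r
# Assumed input, reconstructed from the printed output (not the original data)
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

df <- read.table(text = mydata, sep = "/", as.is = TRUE, fill = TRUE,
                 na.strings = "DNB",             # treat "DNB" as missing
                 col.names = paste0("V", 1:3))   # fix the column count up front

# With "DNB" read as NA, V1 is no longer forced to character
str(df)
```

With "DNB" out of the way, every column can be parsed as integer, which is usually what you want before doing arithmetic on the pieces.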
If you really want to use strsplit, then try this:
> do.call(rbind, lapply(strsplit(mydata, "/"), function(x) head(c(x,NA,NA), 3)))
     [,1]  [,2] [,3]
[1,] "144" "4"  "5"
[2,] "154" "2"  NA
[3,] "146" "3"  "5"
[4,] "142" NA   NA
[5,] "143" "4"  NA
[6,] "DNB" NA   NA
[7,] "90"  NA   NA
Note: Using alexis_laz's observation that x[i] returns NA if i is not in 1:length(x), the last line of code above could be simplified to:
t(sapply(strsplit(mydata, "/"), "[", 1:3))
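If the goal is a data frame with typed columns, the same indexing trick can be taken one step further. This is only a sketch: mydata is reconstructed from the output above, the column names are made up for illustration, and vapply is swapped in for sapply so that the result is guaranteed to be a 3-row character matrix.

```r
# Assumed input, reconstructed from the printed output (not the original data)
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

# `[` with 1:3 pads short pieces with NA; vapply checks each result has length 3
parts <- vapply(strsplit(mydata, "/"), `[`, character(3), 1:3)
parts <- t(parts)                      # one row per input string

df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(df) <- c("first", "second", "third")   # hypothetical column names

# "first" contains "DNB", so only the other two columns are converted
df$second <- as.integer(df$second)
df$third  <- as.integer(df$third)
```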
You can assign the length inside sapply, resulting in NA where the current length is shorter than the assigned length.
s <- strsplit(mydata, "/")
sapply(s, function(x) { length(x) <- 3; x[2] })
# [1] "4" "2" "3" NA "4" NA NA
Then you can pass the index and the target length as extra arguments with mapply:
m <- max(sapply(s, length))
mapply(function(x, y, z) { length(x) <- z; x[y] }, s, 2, m)
# [1] "4" "2" "3" NA "4" NA NA
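The length-assignment idea can also be applied once to the whole list, after which any column can be pulled out of a matrix. A sketch, assuming mydata as reconstructed from the output above; it calls the replacement function `length<-` directly instead of wrapping it in an anonymous function.

```r
# Assumed input, reconstructed from the printed output (not the original data)
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

s <- strsplit(mydata, "/")
m <- max(lengths(s))                 # longest piece count, here 3

# Pad every element to length m; the padding is NA
padded <- lapply(s, `length<-`, m)
mat <- do.call(rbind, padded)        # 7 x 3 character matrix

second <- mat[, 2]                   # same values as the sapply() call above
```

Padding once and indexing the matrix avoids repeating the length assignment for every column you extract.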
You could use regex (if it is allowed):
library(stringr)
# perl() is deprecated since stringr 1.0; the lookbehind works in a plain pattern
str_extract(mydata, "(?<=\\d/)\\d+")
#[1] "4" "2" "3" NA "4" NA NA
str_extract(mydata, "(?<=/\\d/)\\d+")
#[1] "5" NA "5" NA NA NA NA
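For comparison, the same lookbehind can be used without stringr. This is a sketch in base R, again assuming mydata as reconstructed from the output above. regmatches drops the non-matching elements, so the result is written into a vector pre-filled with NA.

```r
# Assumed input, reconstructed from the printed output (not the original data)
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")

# First run of digits that is preceded by "<digit>/"
m <- regexpr("(?<=\\d/)\\d+", mydata, perl = TRUE)

second <- rep(NA_character_, length(mydata))
hit <- m != -1                        # regexpr() returns -1 where nothing matched
second[hit] <- regmatches(mydata, m)  # regmatches() returns the matches only
```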
(At least regarding 1D vectors) [ seems to return NA when "i > length(x)", whereas [[ returns an error.
x = runif(5)
x[6]
#[1] NA
x[[6]]
#Error in x[[6]] : subscript out of bounds
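A small sketch of the same difference that can be run non-interactively, since the error from [[ is caught with tryCatch. The vector and list here are made-up examples. Note that for a list, [ does not give NA but a list holding NULL.

```r
x <- c(10, 20, 30)

# `[` past the end of an atomic vector gives NA
out_single <- x[5]

# `[[` past the end signals an error, captured here as a condition object
err <- tryCatch(x[[5]], error = function(e) e)

# For a list, `[` past the end gives a list containing NULL
lst <- list(1, "a")
out_list <- lst[3]
```

This is why "[" is safe to use as the function in sapply over ragged strsplit output, while "[[" would stop at the first short element.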
Digging a bit, do_subset_dflt (i.e. [) calls ExtractSubset, where we notice that when a wanted index ("ii") is "> length(x)", NA is returned (a bit modified to be clean):
    if(0 <= ii && ii < nx && ii != NA_INTEGER)
        result[i] = x[ii];
    else
        result[i] = NA_INTEGER;
On the other hand, do_subset2_dflt (i.e. [[) returns an error if the wanted index ("offset") is "> length(x)" (modified a bit to be clean):
    if(offset < 0 || offset >= xlength(x)) {
        if(offset < 0 && (isNewList(x)) ...
        else errorcall(call, R_MSG_subs_o_b);
    }
where #define R_MSG_subs_o_b _("subscript out of bounds")
(I'm not sure about the above code snippets but they do seem relevant based on their returns)