Extracting breakpoints with intervals closed on the left

六眼飞鱼酱① 提交于 2019-11-28 09:43:43

问题


I'm looking at the example menu of the command cut() (example(cut)), specifically this part:

cut> aaa <- c(1,2,3,4,5,2,3,4,5,6,7)

cut> cut(aaa, 3)
[1] (0.994,3] (0.994,3] (3,5]     (3,5]     (3,5]     (0.994,3]
[7] (3,5]     (3,5]     (3,5]     (5,7.01]  (5,7.01] 
Levels: (0.994,3] (3,5] (5,7.01]

cut> cut(aaa, 3, dig.lab = 4, ordered = TRUE)
[1] (0.994,2.998] (0.994,2.998] (2.998,5.002] (2.998,5.002]
[5] (2.998,5.002] (0.994,2.998] (2.998,5.002] (2.998,5.002]
[9] (2.998,5.002] (5.002,7.006] (5.002,7.006]
Levels: (0.994,2.998] < (2.998,5.002] < (5.002,7.006]

cut> ## one way to extract the breakpoints
cut> labs <- levels(cut(aaa, 3))

cut> cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs) ),
cut+       upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs) ))
     lower upper
[1,] 0.994  3.00
[2,] 3.000  5.00
[3,] 5.000  7.01

Where the intervals are closed on the right (as shown above), then it shows me a way to extract the breakpoints of the data using cbind()

Now, let's suppose my data will by cut, but indicating that the intervals are closed on the left.

cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)

How can I extract now my breakpoints using the same command cbind()? (If there are more ways, you're welcome)


回答1:


Just use something like the following for your pattern, and use gsub instead: "\\[|\\]|\\(|\\)".

An example.

out <- levels(cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE))
gsub("\\[|\\]|\\(|\\)", "", out)
# [1] "0.994,2.998" "2.998,5.002" "5.002,7.006"

And, here's a quick way to read that data in:

read.csv(text = gsub("\\[|\\]|\\(|\\)", "", out), header = FALSE)
#      V1    V2
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006

FYI: The same pattern would work whether the intervals are closed on the left or on the right. Using your original example:

labs <- levels(cut(aaa, 3))
labs
# [1] "(0.994,3]" "(3,5]"     "(5,7.01]" 
read.csv(text = gsub("\\[|\\]|\\(|\\)", "", labs), header = FALSE)
#      V1   V2
# 1 0.994 3.00
# 2 3.000 5.00
# 3 5.000 7.01

As for alternatives, since you just need to strip out the first and last character before you can use read.csv, you can also easily use substr without having to fuss with regular expressions (if that's not your thing):

substr(labs, 2, nchar(labs)-1)
# [1] "0.994,3" "3,5"     "5,7.01" 

Update: A totally different alternative

Since it is obvious that R has to calculate these values and store them as part of the function in order to generate the output you see, it is not too difficult to manipulate the function to get it to output different things.

Looking at the code for cut.default, you'll find the following as the last few lines:

if (codes.only) 
    code
else factor(code, seq_along(labels), labels, ordered = ordered_result)

It's really easy to change the last few lines to output a list that contains the output of cut as the first item, and the calculated ranges (from the cut function directly, not extracting from the pasted together factor labels.

For instance, in the Gist I've posted at this link, I've changed those lines as follows:

if (codes.only) 
  FIN <- code
else FIN <- factor(code, seq_along(labels), labels, ordered = ordered_result)
list(output = FIN, ranges = data.frame(lower = ch.br[-nb], upper = ch.br[-1L]))

Now, compare:

cut(aaa, 3)
#  [1] (0.994,3] (0.994,3] (3,5]     (3,5]     (3,5]     (0.994,3] (3,5]     (3,5]    
#  [9] (3,5]     (5,7.01]  (5,7.01] 
# Levels: (0.994,3] (3,5] (5,7.01]
CUT(aaa, 3)
# $output
# [1] (0.994,3] (0.994,3] (3,5]     (3,5]     (3,5]     (0.994,3] (3,5]     (3,5]    
# [9] (3,5]     (5,7.01]  (5,7.01] 
# Levels: (0.994,3] (3,5] (5,7.01]
# 
# $ranges
#   lower upper
# 1 0.994     3
# 2     3     5
# 3     5  7.01

And, right = FALSE:

cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
#  [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
#  [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)
CUT(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
# $output
#  [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
#  [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)

# $ranges
#   lower upper
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006


来源:https://stackoverflow.com/questions/19689397/extracting-breakpoints-with-intervals-closed-on-the-left

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!