Using python I have created following data frame which contains similarity values:
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
2 0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
3 0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
4 0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
5 0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
6 0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000
I am trying to write a R script to generate another data frame that reflects the bins, but my condition of binning applies if the value is above 0.5 such that
Pseudocode:
if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
bin = 1
if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
bin = 2
if (cosinFcolor > 0.7 & cosinFcolor =< 0.8)
bin = 3
if (cosinFcolor > 0.8 & cosinFcolor <=0.9)
bin = 4
if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)
bin = 5
else
bin = 0
Based on above logic, I want to build a data frame
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
How can I start this as a script, or should I do this in python? I am trying to get familiar with R after finding out how powerful it is/number of machine learning packages it has. My goal is to build a classifier but first I need be familiar with R :)
Another cut answer that takes into account extrema:
dat <- read.table("clipboard", header=TRUE)
cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
2 0 0 5 0 2 2 0
3 1 0 2 0 0 1 0
4 0 0 3 0 1 1 0
5 1 3 1 0 4 0 0
6 0 0 1 0 0 0 0
Explanation
The cut function splits into bins depending on the cuts you specify. So let's take 1:10 and split it at 3, 5 and 7.
cut(1:10, c(3, 5, 7))
[1] <NA> <NA> <NA> (3,5] (3,5] (5,7] (5,7] <NA> <NA> <NA>
Levels: (3,5] (5,7]
You can see how it has made a factor where the levels are those in between the breaks. Also notice it doesn't include 3 (there's an include.lowest
argument which will include it). But these are terrible names for groups, let's call them group 1 and 2.
cut(1:10, c(3, 5, 7), labels=1:2)
[1] <NA> <NA> <NA> 1 1 2 2 <NA> <NA> <NA>
Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
[1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4
Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.
x[x=="4"] <- "1"
[1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4
This is slightly different to what I did before, notice I took away all the last labels at the end before, but I've done it this way here so you can better see how cut
works.
Ok, the apply
function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply
does. 1 applies the function to all rows, 2 applies to all columns. Apply the cut
function to each column of your data frame. Everything after cut
in the apply function are just arguments to cut
, which we discussed above.
Hope that helps.
You can also use findInterval
:
findInterval(seq(0, 1, l=20), seq(0.5, 1, by=0.1))
## [1] 0 0 0 0 0 0 0 0 0 1 1 2 2 3 4 4 5 5
With cut it's easy as pie
dtf <- read.table(
textConnection(
"cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
2 0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
3 0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
4 0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
5 0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
6 0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000"), sep = " ",
header = TRUE)
dtf$bin <- cut(dtf$cosinFcolor, breaks = c(0, seq(0.5, 1, by = .1)), labels = 0:5)
dtf
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard bin
1 0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000 3
2 0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000 0
3 0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353 1
4 0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000 0
5 0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000 1
6 0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000 0
Here's another solution using the bin_data()
function from the mltools package.
Binning one vector
library(mltools)
cosinFcolor <- c(0.77, 0.067, 0.514, 0.102, 0.56, 0.029)
binned <- bin_data(cosinFcolor, bins=c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), boundaryType = "[lorc")
binned
[1] (0.7, 0.8] [0, 0.5] (0.5, 0.6] [0, 0.5] (0.5, 0.6] [0, 0.5]
Levels: [0, 0.5] < (0.5, 0.6] < (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1]
# Convert to numbers 0, 1, ...
as.integer(binned) - 1L
Binning each column in the data.frame
df <- read.table(textConnection(
"cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000"
), sep = " ", header = TRUE)
for(col in colnames(df)) df[[col]] <- as.integer(bin_data(df[[col]], bins=c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), boundaryType = "[lorc")) - 1L
df
cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 3 0 0 1 1 0 0
2 0 0 5 0 2 2 0
3 1 0 2 0 0 1 0
4 0 0 3 0 1 1 0
5 1 3 1 0 4 0 0
6 0 0 1 0 0 0 0
来源:https://stackoverflow.com/questions/11963508/define-and-apply-custom-bins-on-a-dataframe