I have a dataset containing 100000 rows of data. I tried to do some countif
operations in Excel, but it was prohibitively slow. So I am wondering if this kind o
Here is an example with 100000 rows (occupations are set here from A to Z):
> a = data.frame(sex=sample(c("M", "F"), 100000, replace=T), occupation=sample(LETTERS, 100000, replace=T))
> sum(a$sex == "M" & a$occupation=="A")
[1] 1882
returns the number of males with occupation "A".
EDIT
As I understand from your comment, you want the counts of all possible combinations of sex and occupation. So first create a dataframe with all combinations:
combns = expand.grid(c("M", "F"), LETTERS)
and loop over its rows with apply to sum for your criteria and append the results to combns:
combns = cbind (combns, apply(combns, 1, function(x)sum(a$sex==x[1] & a$occupation==x[2])))
colnames(combns) = c("sex", "occupation", "count")
The first rows of your result look as follows:
sex occupation count
1 M A 1882
2 F A 1869
3 M B 1866
4 F B 1904
5 M C 1979
6 F C 1910
Does this solve your problem?
OR:
Much easier solution suggested by thelatemai:
table(a$sex, a$occupation)
A B C D E F G H I J K L M N O
F 1869 1904 1910 1907 1894 1940 1964 1907 1918 1892 1962 1933 1886 1960 1972
M 1882 1866 1979 1904 1895 1845 1946 1905 1999 1994 1933 1950 1876 1856 1911
P Q R S T U V W X Y Z
F 1908 1907 1883 1888 1943 1922 2016 1962 1885 1898 1889
M 1928 1938 1916 1927 1972 1965 1946 1903 1965 1974 1906
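If you need the counts as a data frame rather than a table object, one possible route in base R is as.data.frame(). This is only a minimal sketch on a tiny made-up data set (the values below are illustrative, not taken from the example above); naming the arguments to table() and passing responseName just makes the column names readable.

```r
# Tiny made-up data set, for illustration only
a_small <- data.frame(
  sex        = c("M", "F", "M", "F", "M"),
  occupation = c("A", "A", "B", "B", "A")
)

# Cross-tabulate, then flatten to one row per sex/occupation combination
counts <- as.data.frame(
  table(sex = a_small$sex, occupation = a_small$occupation),
  responseName = "count"
)
counts
```

The result has one row per combination, with the columns sex, occupation and count.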
Given a dataset
df <- data.frame( sex = c('M', 'M', 'F', 'F', 'M'),
occupation = c('analyst', 'dentist', 'dentist', 'analyst', 'cook') )
you can subset rows
df[df$sex == 'M',] # To get all males
df[df$occupation == 'analyst',] # All analysts
etc.
If you want to get the number of matching rows, just call the function nrow on the subset, such as
nrow(df[df$sex == 'M',])
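Conditions can also be combined with & to mimic a COUNTIFS over several columns. A small sketch, reusing the same five-row data set (the variable names n_male_analysts and n_check are just placeholders for this illustration):

```r
df <- data.frame(sex = c('M', 'M', 'F', 'F', 'M'),
                 occupation = c('analyst', 'dentist', 'dentist', 'analyst', 'cook'))

# Count rows where both conditions hold
n_male_analysts <- nrow(df[df$sex == 'M' & df$occupation == 'analyst', ])

# Summing the logical vector gives the same count without building the subset
n_check <- sum(df$sex == 'M' & df$occupation == 'analyst')
```

Both approaches agree as long as the columns contain no missing values.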
Easy peasy. Your data frame will look like this:
df <- data.frame(sex=c('M','F','M'),
occupation=c('Student','Analyst','Analyst'))
You can then do the equivalent of a COUNTIF by first specifying the IF part, like so:
df$sex == 'M'
This will give you a logical (boolean) vector, i.e. a vector of TRUE and FALSE. What you want is to count the observations for which the condition is TRUE. Since TRUE and FALSE double as 1 and 0 in R, you can simply sum() over the logical vector. The equivalent of COUNTIF(sex='M') is therefore
sum(df$sex == 'M')
Should there be rows in which sex is not specified, the above will give back NA. In that case, if you just want to ignore the missing observations, use
sum(df$sex == 'M', na.rm=TRUE)
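To see the difference, here is a small sketch with one missing value. The data frame df_na and the variables with_na and without_na are made up for this illustration:

```r
# Made-up data with one unspecified sex
df_na <- data.frame(sex = c('M', NA, 'F', 'M'))

# The comparison yields TRUE, NA, FALSE, TRUE, so the plain sum is NA
with_na <- sum(df_na$sex == 'M')

# Dropping the NA before summing counts only the TRUE values
without_na <- sum(df_na$sex == 'M', na.rm = TRUE)
```

Here with_na is NA and without_na is 2.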
library(matrixStats)
> data <- rbind(c("M", "F", "M"), c("Student", "Analyst", "Analyst"))
> rowCounts(data, value = 'M') # output = 2 0
> rowCounts(data, value = 'F') # output = 1 0
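If you would rather not load an extra package, the same per-row counts can probably be obtained in base R by comparing the matrix to the value and summing each row. This is a sketch of an alternative, not part of matrixStats; m_counts and f_counts are placeholder names:

```r
data <- rbind(c("M", "F", "M"), c("Student", "Analyst", "Analyst"))

# data == "M" is a logical matrix; rowSums counts the TRUE values in each row
m_counts <- rowSums(data == "M")
f_counts <- rowSums(data == "F")
```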
table is the obvious choice, but it returns an object of class table, which takes a few annoying steps to transform back into a data.frame. So, if you're OK using dplyr, you can use the command tally:
library(dplyr)
df = data.frame(sex=sample(c("M", "F"), 100000, replace=T), occupation=sample(c('Analyst', 'Student'), 100000, replace=T))
df %>% group_by_all() %>% tally()
# A tibble: 4 x 3
# Groups: sex [2]
sex occupation `n()`
<fct> <fct> <int>
1 F Analyst 25105
2 F Student 24933
3 M Analyst 24769
4 M Student 25193