I have some code that displays the abundance of phyla, and genus within that phyla, as a stacked bar graph. I edited the code such that all the NA elements appear at the to
Welcome to stackoverflow. You're doing some tricky stuff here! I think it's hard to do this in a function and the biggest snag is putting the NAs at the top. Using just tidyverse
piping, I was able to put this together.
This is your base set up + a little prep for folks without phyloseq
# how to install if needed
#source('http://bioconductor.org/biocLite.R')
#biocLite('phyloseq')
library(tidyverse)
library(phyloseq)
library(scales)
library(RColorBrewer)
data("GlobalPatterns")
# filter phyloseq data
TopNOTUs <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:100])
gp.ch <- prune_species(TopNOTUs, GlobalPatterns)
# create dataframe
mdf <- psmelt(gp.ch)
First I collapse the records into counts n
prep <-
mdf %>%
mutate(Genus = fct_explicit_na(Genus, "NA")) %>%
# summarizes data
count(Phylum, Genus) %>% # returns n as a count
mutate(
group = paste(Phylum, Genus, sep = "-"),
Phylum = fct_reorder(Phylum, n, sum),
has_genus = Genus != "NA"
) %>%
# this step helps with the factor ordering
arrange(Phylum, has_genus, n) %>%
mutate(group = fct_inorder(group)) %>%
# I then find some totals & an rank based on the value of n
group_by(Phylum) %>%
mutate(
ord = row_number(),
total = n()
) %>%
ungroup()
# Phylum Genus n group has_genus ord total
# <fct> <fct> <int> <chr> <lgl> <int> <int>
# Tenericutes NA 52 Tenericutes-NA FALSE 1 2
# Tenericutes Clostridium 26 Tenericutes-Clostridium TRUE 2 2
# Actinobacteria NA 130 Actinobacteria-NA FALSE 1 3
# Actinobacteria Rothia 26 Actinobacteria-Rothia TRUE 2 3
# Actinobacteria Bifidobacter~ 78 Actinobacteria-Bifidobact~ TRUE 3 3
Then I use the factor values to populate the hcl()
function (similar to your hue_pal()
df <-
prep %>%
mutate(
group = fct_inorder(group), # ordering in the stack
hue = as.integer(Phylum)*25,
light_base = 1-(ord)/(total+2),
light = floor(light_base * 100)
) %>%
# if the genus is missing, use white, otherwise create a hexcode
mutate(hex = ifelse(!has_genus, "#ffffff", hcl(h = hue, l = light)))
Then the plot
ggplot(df, aes(Phylum, n)) +
geom_col(aes(fill = group), colour = "grey") +
scale_fill_manual(values = df$hex, breaks = (df$group)) +
ggtitle("Phylum and Genus Frequency") +
ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5))
For your second question, keep all of the above code for prep
and df
and then join these to your original mdf
table. The purpose of the df
table is only to generate the colors and prep
is a helper table. There should be a 1:1 between genus
and hex
. Including the sample
column in prep
returns 780 rows instead of 30 and there is no longer a 1:1. This is why you are not getting the results you would like. (I think it is the ord
column that gets thrown off). So use the above and then add this. I included a set.seed()
and sample_frac()
to make the changes more obvious. I also rotated it for readability.
set.seed(1234)
final_df <-
mdf %>%
sample_frac(0.9) %>%
mutate(
Genus = fct_explicit_na(Genus, "NA"),
# these 2 lines will sort in descending order by Proteobacteria
rank = as.integer(Phylum == "Proteobacteria" & Genus != "NA"), # T/F == 1/0
Sample = fct_reorder(Sample, rank, mean)
) %>%
count(Phylum, Genus, Sample, rank) %>%
left_join(df %>% select(-n))
ggplot(final_df, aes(Sample, n)) +
geom_col(aes(fill = group), position="fill") +#
scale_fill_manual("Genus", values = df$hex, breaks = (df$group)) +
ggtitle("Phylum and Genus Frequency") +
ylab("Frequency") +
scale_y_continuous(labels = percent, expand = expand_scale(0)) +
coord_flip() +
theme(plot.title = element_text(hjust = 0.5))