How to factor sub group by category?

前端 未结 1 1875
星月不相逢
星月不相逢 2021-01-16 06:11

I have some code that displays the abundance of phyla, and genus within that phyla, as a stacked bar graph. I edited the code such that all the NA elements appear at the to

相关标签:
1条回答
  • 2021-01-16 06:18

    Welcome to stackoverflow. You're doing some tricky stuff here! I think it's hard to do this in a function and the biggest snag is putting the NAs at the top. Using just tidyverse piping, I was able to put this together.

    This is your base set up + a little prep for folks without phyloseq

    # how to install if needed
    #source('http://bioconductor.org/biocLite.R')
    #biocLite('phyloseq')
    library(tidyverse)
    library(phyloseq)
    library(scales)
    library(RColorBrewer)
    data("GlobalPatterns")
    
    # filter phyloseq data
    TopNOTUs <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:100])
    gp.ch <- prune_species(TopNOTUs, GlobalPatterns)
    
    # create dataframe
    mdf <- psmelt(gp.ch)
    

    First I collapse the records into counts n

    prep <-
      mdf %>%
      mutate(Genus = fct_explicit_na(Genus, "NA")) %>% 
      # summarizes data
      count(Phylum, Genus) %>% # returns n as a count
      mutate(
        group = paste(Phylum, Genus, sep = "-"),
        Phylum = fct_reorder(Phylum, n, sum),
        has_genus = Genus != "NA"
      ) %>% 
      # this step helps with the factor ordering
      arrange(Phylum, has_genus, n) %>% 
      mutate(group = fct_inorder(group)) %>% 
      # I then find some totals & an rank based on the value of n
      group_by(Phylum) %>% 
      mutate(
        ord = row_number(),
        total = n()
      ) %>% 
      ungroup()
    
    #  Phylum         Genus             n group                      has_genus   ord total
    #  <fct>          <fct>         <int> <chr>                      <lgl>     <int> <int>
    #  Tenericutes    NA               52 Tenericutes-NA             FALSE         1     2
    #  Tenericutes    Clostridium      26 Tenericutes-Clostridium    TRUE          2     2
    #  Actinobacteria NA              130 Actinobacteria-NA          FALSE         1     3
    #  Actinobacteria Rothia           26 Actinobacteria-Rothia      TRUE          2     3
    #  Actinobacteria Bifidobacter~    78 Actinobacteria-Bifidobact~ TRUE          3     3
    

    Then I use the factor values to populate the hcl() function (similar to your hue_pal()

    df <-
      prep %>% 
      mutate(
        group = fct_inorder(group), # ordering in the stack
        hue = as.integer(Phylum)*25,
        light_base = 1-(ord)/(total+2),
        light = floor(light_base * 100)
      ) %>% 
      # if the genus is missing, use white, otherwise create a hexcode
      mutate(hex = ifelse(!has_genus, "#ffffff", hcl(h = hue, l = light)))
    

    Then the plot

    ggplot(df, aes(Phylum, n)) + 
      geom_col(aes(fill = group), colour = "grey") +
      scale_fill_manual(values = df$hex, breaks = (df$group)) +
      ggtitle("Phylum and Genus Frequency") +
      ylab("Frequency") +
      theme(plot.title = element_text(hjust = 0.5))
    

    For your second question, keep all of the above code for prep and df and then join these to your original mdf table. The purpose of the df table is only to generate the colors and prep is a helper table. There should be a 1:1 between genus and hex. Including the sample column in prep returns 780 rows instead of 30 and there is no longer a 1:1. This is why you are not getting the results you would like. (I think it is the ord column that gets thrown off). So use the above and then add this. I included a set.seed() and sample_frac() to make the changes more obvious. I also rotated it for readability.

    set.seed(1234)
    final_df <- 
      mdf %>% 
      sample_frac(0.9) %>% 
      mutate(
        Genus = fct_explicit_na(Genus, "NA"),
        # these 2 lines will sort in descending order by Proteobacteria
        rank = as.integer(Phylum == "Proteobacteria" & Genus != "NA"), # T/F == 1/0
        Sample = fct_reorder(Sample, rank, mean)
      ) %>% 
      count(Phylum, Genus, Sample, rank) %>% 
      left_join(df %>% select(-n))
    
    
    ggplot(final_df, aes(Sample, n)) + 
      geom_col(aes(fill = group), position="fill") +#
      scale_fill_manual("Genus", values = df$hex, breaks = (df$group)) +
      ggtitle("Phylum and Genus Frequency") +
      ylab("Frequency") +
      scale_y_continuous(labels = percent, expand = expand_scale(0)) +
      coord_flip() +
      theme(plot.title = element_text(hjust = 0.5))
    

    0 讨论(0)
提交回复
热议问题