How to convert a data.frame to a tree structure object such as a dendrogram

生来不讨喜 2020-12-24 09:39

I have a data.frame object. For a simple example:

> data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb',…


        
2 Answers
  • 2020-12-24 10:14

    data.frame to Newick

    I did my PhD in computational phylogenetics, and somewhere along the way I produced this code, which I used once or twice when I got data in this format (nonstandard in the phylogenetic sense). The script traverses the data frame as if it were a tree and pastes the labels it meets into a Newick string, a standard format that can then be converted into any kind of tree object.

    I guess the script could be optimized (I used it so rarely that more work on it would have reduced the overall efficiency), but it is better to share it than to let it collect dust lying around on my hard drive.

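        ## recursion function: returns the Newick fragment for node `a` found in column `i`
        ## NOTE: `df` is not an argument; it is looked up in the enclosing environment
        traverse <- function(a,i,innerl){
            if(i < (ncol(df))){
                ## distinct children of node `a` in the next column
                alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
                desc <- NULL
                ## a node with a single child is collapsed into that child
                if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
                else {
                    for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
                    ## optional label for the inner node
                    il <- NULL; if(innerl==TRUE) il <- a
                    (newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
                }
            }
            else {
                ## last column reached: `a` is a leaf label
                (newickout <- a)
            }
        }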
    
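        ## data.frame to newick function
        df2newick <- function(df, innerlabel=FALSE){
            ## traverse() looks up `df` by name instead of taking it as an argument,
            ## so rebind a local copy to this call's environment: it then sees the
            ## `df` passed here, whatever the caller's variable is named
            traverse <- traverse
            environment(traverse) <- environment()
            alevel <- as.character(unique(df[,1]))
            newick <- NULL
            for(x in alevel) newick <- c(newick,traverse(x,1,innerlabel))
            (newick <- paste("(",paste(newick,collapse=","),");",sep=""))
        }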
    

    The main function df2newick() takes two arguments:

    • df, the data frame to be transformed (an object of class data.frame)
    • innerlabel, a logical (boolean) flag that tells the function whether to write labels for inner nodes

    To demonstrate it on your example:

        df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
        myNewick <- df2newick(df)
        #[1] "((Abb,Acc),((Bad,Bae),Bdd));"
    

    Now you can read it into an object of class phylo with read.tree() from the ape package:

        library(ape)
        mytree <- read.tree(text=myNewick)
        plot(mytree)
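
    Since the question asks for a dendrogram, one possible route from the phylo object is sketched below. It assumes that ape is installed and that the tree is rooted and strictly binary (true for this example). Because the Newick string carries no branch lengths, the sketch assigns arbitrary ones with compute.brlen() so that as.hclust() accepts the tree; the resulting heights are therefore not meaningful, only the topology is. The names tree and dend are illustrative.

```r
# Sketch: phylo -> hclust -> dendrogram (only runs if the ape package is available)
if (requireNamespace("ape", quietly = TRUE)) {
    tree <- ape::read.tree(text = "((Abb,Acc),((Bad,Bae),Bdd));")
    # The Newick string has no branch lengths; Grafen's method assigns
    # arbitrary lengths that make the tree ultrametric
    tree <- ape::compute.brlen(tree, method = "Grafen")
    # ape's as.hclust() method for phylo objects expects a rooted,
    # binary, ultrametric tree; as.dendrogram() then converts the hclust
    dend <- as.dendrogram(as.hclust(tree))
}
```

    plot(dend) would then draw the result. For trees with multifurcations or single-child nodes this conversion is not expected to work as is.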
    

    If you want to add inner node labels to the Newick string, you can use this:

        myNewick <- df2newick(df, TRUE)
        #[1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
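
    For comparison, here is a minimal sketch of the same traversal that hands the matching rows down the recursion explicitly, so nothing is looked up outside the functions and a label that occurs under two different parents is not merged. The names fragments and df_to_newick are hypothetical and not part of the script above; single-child nodes are collapsed in the same way.

```r
# Newick fragments for the distinct labels in column `col` of `data`,
# where `data` holds only the rows that share the same ancestors
fragments <- function(data, col, inner_labels) {
    labels <- unique(as.character(data[[col]]))
    # Last column: the labels are the leaves
    if (col == ncol(data)) return(labels)
    vapply(labels, function(lab) {
        rows <- data[as.character(data[[col]]) == lab, , drop = FALSE]
        kids <- fragments(rows, col + 1, inner_labels)
        # A node with a single child is collapsed into that child
        if (length(kids) == 1) return(kids)
        paste0("(", paste(kids, collapse = ","), ")", if (inner_labels) lab else "")
    }, character(1), USE.NAMES = FALSE)
}

# Wrap the top-level fragments into a complete Newick string
df_to_newick <- function(data, inner_labels = FALSE) {
    paste0("(", paste(fragments(data, 1, inner_labels), collapse = ","), ");")
}

example <- data.frame(x = c('A','A','B','B','B'),
                      y = c('Ab','Ac','Ba','Ba','Bd'),
                      z = c('Abb','Acc','Bad','Bae','Bdd'))
df_to_newick(example)
# [1] "((Abb,Acc),((Bad,Bae),Bdd));"
df_to_newick(example, TRUE)
# [1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
```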
    

    Hope this is useful (and maybe my PhD wasn't a complete waste of time ;-)


    Additional note for your dataframe format:

    As you can see, the df2newick function ignores inner nodes with only one child (which is the form best suited to most phylogenetic methods anyway, and the only one that was relevant to me). The df objects that I originally got and used with this script were in this format:

        df <- data.frame(x=c('A','A','B','B','B'), y=c('Abb','Acc','Ba', 'Ba','Bdd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
    

    Very similar to yours, but there the inner single-child nodes simply had the same name as their children, whereas you have distinct names for these nodes too, and those names get ignored. This might not be relevant, but if it is, you can simply disable part of the recursion function, like this:

        traverse <- function(a,i,innerl){
            if(i < (ncol(df))){
                alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
                desc <- NULL
                ##if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
                ##else {
                    for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
                    il <- NULL; if(innerl==TRUE) il <- a
                    (newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
                ##}
            }
            else { (newickout <- a) }
        }
    

    and, calling df2newick(df, TRUE), you would get something like this:

        [1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"
    

    This looks really odd to me, but I add it just in case, because it now includes all the information from your original data frame.

  • 2020-12-24 10:15

    I don't know much about the internal structure of dendrograms in R, but the following code will create a nested list structure with the hierarchy that I think you are looking for:

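        stree = function(x,level=0) {
            # x is a character vector
            # the result is a hierarchical structure of lists (that contain lists, etc.),
            # built one character per level; the names of the lists are the node values

            level = level+1
            # only one string left: its remaining characters become a single leaf
            if (length(x)==1) {
                result = list()
                result[[substring(x[1],level)]]=list()
                return(result)
            }
            result=list()
            # group the strings by their character at the current position
            this.level = substring(x,level,level)
            next.levels = unique(this.level)
            for (p in next.levels) {
                if (p=="") {
                    # a string has run out of characters here; note that `$p`
                    # uses the literal name "p", not the value of p
                    result$p = list()
                } else {
                    ids = which(this.level==p)
                    result[[p]] = stree(x[ids],level)
                }
            }
            result
        }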
    

    It operates on a vector of strings, so for your data frame you would need to call stree(as.character(df[,3])).
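
    stree() derives the hierarchy from the characters of the strings. If the levels should instead follow the columns of the data frame, a column-wise variant could look like the sketch below; df_to_nested is a hypothetical name, and the example data frame is the one used earlier in this thread.

```r
# Nested list in which each level corresponds to one column of `data`;
# the list names are the node labels and leaves are empty lists
df_to_nested <- function(data, col = 1) {
    # Past the last column: nothing below a leaf
    if (col > ncol(data)) return(list())
    keys <- unique(as.character(data[[col]]))
    out <- lapply(keys, function(k) {
        # Keep only the rows under node `k`, then descend one column
        rows <- data[as.character(data[[col]]) == k, , drop = FALSE]
        df_to_nested(rows, col + 1)
    })
    names(out) <- keys
    out
}

example <- data.frame(x = c('A','A','B','B','B'),
                      y = c('Ab','Ac','Ba','Ba','Bd'),
                      z = c('Abb','Acc','Bad','Bae','Bdd'))
tree <- df_to_nested(example)
names(tree$B$Ba)
# [1] "Bad" "Bae"
```

    str(tree) prints the whole hierarchy.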

    Hope this helps.
