Merging data from many files and plotting them


Edited to clean up some typos and address the multiple K value issue.

I'm going to assume that you've placed all your .csv files in a single directory (and that there's nothing else in this directory). I will also assume that each .csv really does have the same structure (same number of columns, in the same order). I would begin by generating a list of the file names:

myCSVs <- list.files("path/to/directory")
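If you can't guarantee that the directory contains only those .csv files, list.files() also accepts a pattern argument to restrict the listing; a small addition on my part:

# Optional: only pick up files whose names end in .csv
myCSVs <- list.files("path/to/directory", pattern = "\\.csv$")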

Then I would 'loop' over the list of file names using lapply, reading each file into a data frame using read.csv:

setwd("path/to/directory")
# This function reads in one file and appends a column with the K value
# taken from the file name. You may need to tinker with the particulars here.
myFun <- function(fn){
  tmp <- read.csv(fn)
  tmp$K <- strsplit(fn, ".", fixed = TRUE)[[1]][1]
  tmp
}
dataList <- lapply(myCSVs, FUN = myFun)

Depending on the structure of your .csv files, you may need to pass some additional arguments to read.csv. Finally, I would combine this list of data frames into a single data frame:

myData <- do.call(rbind, dataList)

Then you should have all your data in a single data frame, myData, that you can pass to ggplot.
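As a rough sketch of that last step, assuming (as in your question) that the combined data frame has columns id, diff and count alongside the K column added above, you could plot the normalized difference against id with one panel per K value:

library(ggplot2)

# One panel per K value: normalized difference against id
ggplot(myData, aes(x = id, y = sqrt(diff / count))) +
  geom_point() +
  facet_wrap(~ K)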

As for the statistical aspect of your question, it's a little difficult to offer an opinion without concrete examples of your data. Once you've figured the programming part out, you could ask a separate question that provides some sample data (either here, or on stats.stackexchange.com) and folks will be able to suggest some visualization or analysis techniques that may help.

I'm not familiar with the background of your question, but I hope I've understood your request correctly.

Your command:

ggplot(data = data) + geom_point(aes(x = id, y = sqrt(diff / count)))

plots the normalized difference against id (the cycle). You mentioned that "in theory the greater id, the lower diff should be", so this plot is a way to check that assumption visually. You can also express the relationship as a single number: Spearman's rank correlation coefficient, which can be computed with cor(x, y, method = 'spearman').
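For example, along these lines (assuming your data frame data has the id, diff and count columns used in your ggplot call):

# Spearman's rank correlation between id and the normalized difference;
# a value close to -1 would support "the greater id, the lower diff"
cor(data$id, sqrt(data$diff / data$count), method = "spearman")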

You mentioned that you want to "plot content of my files in order to let me decide which value of K is the best (for which in general the diff is the lowest)". So you probably need to load all of the files, with something like sapply(files, read.csv, simplify = FALSE) where files is a vector of the file names, and then reshape the loaded data into a single data frame with four columns: K, id, diff and count. Then you can visualize the dataset in three dimensions with levelplot() from the latticeExtra package (sorry, I don't know how to do this with ggplot2), or colour-code the third dimension in 2-D with ggplot2's geom_tile() function, or use facets to visualize the data in a grid of panels.
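A minimal sketch of the geom_tile() approach, assuming you have assembled a data frame myData with the four columns K, id, diff and count:

library(ggplot2)

# Heat map: id on x, K on y, tile colour encoding the normalized difference
ggplot(myData, aes(x = id, y = K, fill = sqrt(diff / count))) +
  geom_tile()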
