Question
I have written an application that analyzes data and writes the results to a CSV file. The file contains three columns: id, diff and count.
1. id is the id of the cycle - in theory, the greater the id, the lower diff should be
2. diff is the sum of (Estimator - RealValue)^2 over all observations in the cycle
3. count is the number of observations during the cycle
For 15 different values of a parameter K, I am generating a CSV file named %K%.csv, where %K% is the value used. So my total number of files is 15.
What I would like to do is write a simple loop in R that can plot the content of my files, in order to let me decide which value of K is the best (the one for which, in general, the diff is the lowest).
For a single file I am doing something like:
ggplot(data = data) + geom_point(aes(x = id, y = sqrt(diff/count)))
Does what I am trying to do make sense? Please note that statistics is completely not my domain, nor is R (but you probably could have figured that out already).
Is there a better approach I could choose? And from a theoretical point of view, am I doing what I expect to be doing?
I would be very grateful for any comments, hints, criticism and answers.
Answer 1:
Edited to clean up some typos and address the multiple K value issue.
I'm going to assume that you've placed all your .csv files in a single directory (and that there's nothing else in this directory). I will also assume that each .csv really does have the same structure (same number of columns, in the same order). I would begin by generating a list of the file names:
myCSVs <- list.files("path/to/directory")
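If the directory might nonetheless contain other files, a small variant (not in the original answer, but using the standard pattern argument of list.files) restricts the listing to .csv files:
# Only match file names ending in .csv, in case the
# directory contains anything else
myCSVs <- list.files("path/to/directory", pattern = "\\.csv$")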
Then I would 'loop' over the list of file names using lapply, reading each file into a data frame using read.csv:
setwd("path/to/directory")

# This function just reads in the file and appends a
# column with the K value taken from the file name.
# You may need to tinker with the particulars here.
myFun <- function(fn) {
  tmp <- read.csv(fn)
  tmp$K <- strsplit(fn, ".", fixed = TRUE)[[1]][1]
  tmp
}
dataList <- lapply(myCSVs, FUN = myFun, ...)
Depending on the structure of your .csv files, you may need to pass some additional arguments to read.csv. Finally, I would combine this list of data frames into a single data frame:
myData <- do.call(rbind, dataList)
Then you should have all your data in a single data frame, myData, that you can pass to ggplot.
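For instance, here is a minimal plotting sketch (assuming the column names from the question and the K column added above) that overlays all 15 parameter values on one plot by mapping K to colour:
library(ggplot2)

# One point per cycle, coloured by K, so the trends for
# all 15 parameter values can be compared directly
ggplot(data = myData) +
  geom_point(aes(x = id, y = sqrt(diff/count), colour = K))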
As for the statistical aspect of your question, it's a little difficult to offer an opinion without concrete examples of your data. Once you've figured out the programming part, you could ask a separate question that provides some sample data (either here or on stats.stackexchange.com), and folks will be able to suggest visualization or analysis techniques that may help.
Answer 2:
I am not familiar with the background of your question, but I hope I have understood your request correctly.
Your command:
ggplot(data = data) + geom_point(aes(x = id, y = sqrt(diff/count)))
plots the relationship of the normalized difference against the cycle id. You mentioned that "in theory the greater the id, the lower diff should be", so this plot lets you validate that assumption visually. There is actually another way to do this with a single number: the Spearman correlation coefficient, which can be computed with cor(x, y, method = 'spearman').
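As a sketch of how that might look per parameter value (assuming the combined myData data frame with columns K, id, diff and count from the first answer):
# Spearman correlation between cycle id and normalized diff,
# computed separately for each K; values close to -1 support
# "the greater the id, the lower the diff"
sapply(split(myData, myData$K),
       function(d) cor(d$id, sqrt(d$diff / d$count), method = "spearman"))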
You also mentioned that you want to "plot the content of my files in order to let me decide which value of K is the best (for which in general the diff is the lowest)". So you first need to load all these files, with something like sapply(list.files("path/to/directory"), read.csv, simplify = FALSE), and after that convert the loaded files into a single data set with FOUR columns: K, id, diff and count. Then you can visualize the data set in three dimensions with the levelplot function (from the lattice package, extended by latticeExtra; sorry, I don't know how to do this with ggplot2), or you can colour-code the third dimension in 2-D using the geom_tile function of ggplot2, or you can use facets to visualize the data in a grid, as in the sketch below.
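A minimal faceting sketch (again assuming the same four-column myData data frame):
library(ggplot2)

# One panel per K value; the panel whose points sit lowest
# overall indicates the best K
ggplot(myData, aes(x = id, y = sqrt(diff / count))) +
  geom_point() +
  facet_wrap(~ K)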
Source: https://stackoverflow.com/questions/7311372/merging-data-from-many-files-and-plot-them