Question
I have hundreds of text files with the following information in each file:
*****Auto-Corelation Results******
1 .09 -.19 .18 non-Significant
*****STATISTICS FOR MANN-KENDELL TEST******
S= 609
VAR(S)= 162409.70
Z= 1.51
Random : No trend at 95%
*****SENs STATISTICS ******
SEN SLOPE = .24
Now I want to read all these files, "collect" the Sen's statistic from each file (e.g. .24), and compile them into one file along with the corresponding file names. I have to do it in R.
I have worked with CSV files but I am not sure how to handle text files.
This is the code I am using now:
require(gtools)
GG <- grep("*.txt", list.files(), value = TRUE)
GG <- mixedsort(GG)
S <- sapply(seq(GG), function(i){
  X <- readLines(GG[i])
  grep("SEN SLOPE", X, value = TRUE)
})
spl <- unlist(strsplit(S, ".*[^.0-9]"))
SenStat <- as.numeric(spl[nzchar(spl)])
SenStat <- data.frame(SenStat, file = GG)
write.table(SenStat, "sen.csv", sep = ", ", row.names = FALSE)
The current code is not able to read all the values correctly and gives this warning:
Warning message:
NAs introduced by coercion
Also, I am not getting the file names in the other column of the output. Please help!
Diagnosis 1
The code is picking up the = sign as well. This is the output of print(spl):
[1] "" "5.55" "" "-.18" "" "3.08" "" "3.05" "" "1.19" "" "-.32"
[13] "" ".22" "" "-.22" "" ".65" "" "1.64" "" "2.68" "" ".10"
[25] "" ".42" "" "-.44" "" ".49" "" "1.44" "" "=-1.07" "" ".38"
[37] "" ".14" "" "=-2.33" "" "4.76" "" ".45" "" ".02" "" "-.11"
[49] "" "=-2.64" "" "-.63" "" "=-3.44" "" "2.77" "" "2.35" "" "6.29"
[61] "" "1.20" "" "=-1.80" "" "-.63" "" "5.83" "" "6.33" "" "5.42"
[73] "" ".72" "" "-.57" "" "3.52" "" "=-2.44" "" "3.92" "" "1.99"
[85] "" ".77" "" "3.01"
Diagnosis 2
Found the problem, I think. The negative sign is a bit tricky. In some files it is
SEN SLOPE =-1.07
and in others
SEN SLOPE = -.11
When there is no space after the =, the split keeps the = attached to the number and I get an NA (the first form), while the second form is read correctly. How can I modify the regex to fix this? Thanks!
Answer 1:
Assume "text.txt" is one of your text files. Read it into R with readLines; you can then use grep to find the line containing SEN SLOPE. With no further arguments, grep returns the index number(s) of the element(s) where the regular expression was found; here we find that it is the 11th line. Add the value = TRUE argument to get the line as it reads.
x <- readLines("text.txt")
grep("SEN SLOPE", x)
## [1] 11
( gg <- grep("SEN SLOPE", x, value = TRUE) )
## [1] "SEN SLOPE = .24"
To find all the .txt files in the working directory we can use list.files with a regular expression. Note that pattern takes a regular expression, not a glob, so "\\.txt$" is the precise pattern.
list.files(pattern = "\\.txt$")
## [1] "text.txt"
LOOPING OVER MULTIPLE FILES
I created a second text file, text2.txt, with a different SEN SLOPE value to illustrate how I might apply this method over multiple files. We can use sapply, followed by strsplit, to get the desired spl values.
GG <- list.files(pattern = "\\.txt$")
S <- sapply(seq_along(GG), function(i){
  X <- readLines(GG[i])
  ## handle empty files - added 04/23/14 (as per comment)
  ifelse(length(X) > 0, grep("SEN SLOPE", X, value = TRUE), NA)
})
## regex captures up to and including "=" and any surrounding
## space - changed 04/23/14 (as per comment)
spl <- unlist(strsplit(S, split = ".*((=|(\\s=))|(=\\s|\\s=\\s))"))
SenStat <- as.numeric(spl[nzchar(spl)])
Then we can put the results into a data frame and send it to a file with write.table.
( SenStatDf <- data.frame(SenStat, file = GG) )
## SenStat file
## 1 0.46 text2.txt
## 2 0.24 text.txt
We can write it to a file with
write.table(SenStatDf, "myFile.csv", sep = ", ", row.names = FALSE)
UPDATED 07/21/2014:
Since the result is being written to a file, this can be made much simpler (and faster) with
( SenStatDf <- cbind(
  SenSlope = c(lapply(GG, function(x){
    y <- readLines(x)
    z <- y[grepl("SEN SLOPE", y)]
    ## \\s* rather than \\s+ so "=-1.07" (no space after "=") also splits
    unlist(strsplit(z, split = ".*=\\s*"))[-1]
  }), recursive = TRUE),
  file = GG
) )
# SenSlope file
# [1,] ".46" "test2.txt"
# [2,] ".24" "test.txt"
And then written and read into R with
write.table(SenStatDf, "myFile.txt", row.names = FALSE)
read.table("myFile.txt", header = TRUE)
# SenSlope file
# 1 1.24 test2.txt
# 2 0.24 test.txt
Answer 2:
First make a sample text file:
cat('*****Auto-Corelation Results******
1 .09 -.19 .18 non-Significant
*****STATISTICS FOR MANN-KENDELL TEST******
S= 609
VAR(S)= 162409.70
Z= 1.51
Random : No trend at 95%
*****SENs STATISTICS ******
SEN SLOPE = .24',file='samp.txt')
Then read it in:
tf <- readLines('samp.txt')
Now extract the appropriate line:
sen_text <- grep('SEN SLOPE',tf,value=T)
And then get the value past the equals sign:
sen_value <- as.numeric(unlist(strsplit(sen_text,'='))[2])
Then combine these results for each of your files (no file structure was mentioned in the original question).
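A minimal sketch of that combining step, writing one CSV of values and file names. The sample file names here are hypothetical; splitting on "=" and taking the second piece handles both "= .24" and "=-1.07", since as.numeric tolerates leading whitespace.

```r
# Create two small sample files to demonstrate (hypothetical names)
cat("*****SENs STATISTICS ******\nSEN SLOPE = .24\n", file = "samp1.txt")
cat("*****SENs STATISTICS ******\nSEN SLOPE =-1.07\n", file = "samp2.txt")

files <- c("samp1.txt", "samp2.txt")  # in practice: list.files(pattern = "\\.txt$")
sen_values <- sapply(files, function(f) {
  tf <- readLines(f)
  sen_text <- grep("SEN SLOPE", tf, value = TRUE)
  as.numeric(unlist(strsplit(sen_text, "="))[2])  # numeric part after "="
})
result <- data.frame(file = files, SenStat = sen_values, row.names = NULL)
write.csv(result, "sen_all.csv", row.names = FALSE)
# result$SenStat is c(0.24, -1.07)
```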
Answer 3:
If your text files are always in that format (e.g. SEN SLOPE is always on line 11) and the layout is identical across all your files, you can do what you need in just two lines.
char_vector <- readLines("Path/To/Document/sample.txt")
statistic <- as.numeric(strsplit(char_vector[11]," ")[[1]][5])
That will give you 0.24.
You can then iterate over all your files via an apply statement or a for loop.
For clarity:
> char_vector[11]
[1] "SEN SLOPE = .24"
and
> strsplit(char_vector[11]," ")
[[1]]
[1] "SEN" "SLOPE" "=" "" ".24"
Thus you want [[1]] [5] of the result from strsplit.
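Assuming every file really does have SEN SLOPE as the 11th line with that exact spacing, the iteration might be sketched as follows (the file name samp.txt is hypothetical):

```r
# Build one sample file matching the fixed 11-line layout
writeLines(c(rep("filler", 10), "SEN SLOPE = .24"), "samp.txt")

files <- "samp.txt"  # in practice: list.files(pattern = "\\.txt$")
stats <- sapply(files, function(f) {
  char_vector <- readLines(f)
  as.numeric(strsplit(char_vector[11], " ")[[1]][5])  # 5th token of line 11
})
res <- data.frame(file = files, SEN.SLOPE = stats, row.names = NULL)
# res$SEN.SLOPE is 0.24
```

Note that this fixed-position split breaks on the "=-1.07" form from the question, where the value fuses with the "=" and becomes the third token, not the fifth.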
Answer 4:
Step 1: Save the complete fileNames in a single variable:
fileNames <- dir(dataDir, full.names = TRUE)  # dataDir is the path to your files
Step 2: Let's read and process one of the files, and check that it gives correct results:
data.frame(
  file = basename(fileNames[1]),
  SEN.SLOPE = as.numeric(tail(
    strsplit(grep('SEN SLOPE', readLines(fileNames[1]), value = TRUE), "=")[[1]], 1))
)
Step 3: Do this for all the fileNames:
do.call(
  rbind,
  lapply(fileNames, function(fileName) {
    data.frame(
      file = basename(fileName),
      SEN.SLOPE = as.numeric(tail(
        strsplit(grep('SEN SLOPE', readLines(fileName), value = TRUE), "=")[[1]], 1))
    )
  })
)
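To sanity-check the parsing inside that call on the tricky case from the question, the literal line below is taken from Diagnosis 2; splitting on "=" and taking the last piece handles the no-space negative form as well:

```r
# "=-1.07" has no space after "=", but the last piece of the
# "=" split is still the bare number
line <- "SEN SLOPE =-1.07"
val <- as.numeric(tail(strsplit(grep("SEN SLOPE", line, value = TRUE), "=")[[1]], 1))
# val is -1.07
```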
Hope this helps!!
Source: https://stackoverflow.com/questions/23038367/how-do-i-read-information-from-text-files