Question
I am trying to read a binary file in R containing a simple 2D array of 360x180 values. For reference, the binary file can be found here:
http://transcom.project.asu.edu/download/transcom03/smoothmap.fix.2.bin
Here is what the readme for this .bin says:
The file 'smoothmap.fix.2.bin' contains a single real, binary array dimensioned 360 x 180. The array contains the numbers 1 through 22, denoting each of the 22 basis functions in the TransCom 3 experiment. This file was written on an SGI Origin 2000 hosting UNIX.
And my code:
to.read <- file("smoothmap.fix.2.bin", "rb")
raw.transcom <- readBin(to.read, integer(), n = 360*180, size = 4, endian = "big")
transcom <- matrix(raw.transcom, 180, 360, byrow = F)
Now raw.transcom contains only junk values:
unique(raw.transcom)
[1] 259200 0 1101004800 1082130432 1092616192 1097859072 1100480512 1102053376 1086324736
[10] 1077936128 1101529088 1095761920 1096810496 1099956224 1091567616 1084227584 1090519040 1094713344
[19] 1099431936 1073741824 1093664768 1088421888 1065353216 1098907648
Why would that be?
I've been looking at this for an hour now and I'm stumped. I've played around with the endian-ness setting and the 'size' argument in readBin, but that did not help.
How can I read in this file correctly?
Answer 1:
Well, I didn't have time to poke at the "R" way to do this right away, but I do have access to GDL, so I threw this together:
Data = read_binary('smoothmap.fix.2.bin',DATA_TYPE=4,ENDIAN='big');
Data = Data[1:64800]
Data = reform(Data,[360,180])
openw,unit,'testfile.dat',/get_lun
printf,unit,Data
free_lun,unit
and managed to generate: http://rud.is/dl/testfile.dat.gz
If you grab that and do:
x <- as.numeric(scan("testfile.dat.gz", "numeric"))
length(x)
## [1] 64800
table(x)
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 7951 1643 1189 796 868 1688 864 2345 2487 509 733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950
It definitely looks like it's got the right values for the definition you specified and you can turn that into a matrix.
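For instance, the reshaping might look like this (a sketch; it assumes the values fill column-wise, matching the readme's 360 x 180 ordering, with x being the vector scanned above):
transcom <- matrix(x, nrow = 360, ncol = 180)
dim(transcom)
## [1] 360 180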
Check back, though, as I now need to figure out how to do this in R :-)
UPDATE
GOT IT!
I'm really kind of glad I found the IDL code to verify the R results against.
x <- readBin("smoothmap.fix.2.bin", "raw", file.size("smoothmap.fix.2.bin"))
x <- x[-(1:4)]
x <- x[-((length(x)-3):length(x))]
table(readBin(rawConnection(x), "numeric", 360*180, 4, endian="big"))
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## 7951 1643 1189 796 868 1688 864 2345 2487 509 733 1410 5144 2388 2433 4111 7617 2450 1671 2058 9161 2334 2950
Ideally, we'd check that the first 4 and last 4 bytes are equal, but this hack should get you through.
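That marker check could look something like this sketch, run against the raw vector as originally read (before the two strips above); it is essentially what the full function below does:
raw_all <- readBin("smoothmap.fix.2.bin", "raw", file.size("smoothmap.fix.2.bin"))
head_marker <- readBin(raw_all[1:4], "integer", n = 1, size = 4, endian = "big")
tail_marker <- readBin(raw_all[(length(raw_all)-3):length(raw_all)], "integer", n = 1, size = 4, endian = "big")
stopifnot(head_marker == tail_marker)  # front and back markers should agree
stopifnot(head_marker == 360*180*4)    # and equal the record size in bytes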
Putting it all together
I've added some validation code as well…
#' Read in a binary array, likely written with IDL
#'
#' @param x path to file (auto-expanded & tested for existence)
#' @param n number of `float` elements to read in
#' @param endian endian-ness (default `big`)
#' @return numeric vector of length `n`
read_binary_float <- function(x, n, endian="big") {

  x <- normalizePath(path.expand(x))
  x <- readBin(con = x, what = "raw", n = file.size(x))

  first4 <- x[1:4]                       # extract front record marker
  last4  <- x[(length(x)-3):length(x)]   # extract back record marker

  # convert both to long ints
  f4c <- rawConnection(first4)
  on.exit(close(f4c), add=TRUE)
  f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian = endian)

  l4c <- rawConnection(last4)
  on.exit(close(l4c), add=TRUE)
  l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian = endian)

  # validation
  stopifnot(f4 == l4)   # check front/back markers are equal
  stopifnot(f4 == n*4)  # check that `n` matches the expected record size in bytes

  # strip off front and back markers
  x <- x[-(1:4)]
  x <- x[-((length(x)-3):length(x))]

  # slurp it all in
  rc <- rawConnection(x)
  on.exit(close(rc), add=TRUE)
  readBin(con = rc, what = "numeric", n = n, size = 4L, endian = endian)

}
Quick example:
library(magrittr)
read_binary_float("smoothmap.fix.2.bin", 360*180) %>%
matrix(nrow = 360, ncol = 180) %>%
image()
This file seems to conform to the Fortran "unformatted I/O" spec (https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc4/index.html), which confirms the structure we saw: each record is wrapped in a pair of 4-byte length markers,
"record length in bytes" | record data | "record length in bytes"
So the function could be generalized to support more than just float conversion:
read_binary_array <- function(x, type=c("byte", "integer", "float"), endian="big") {

  type <- match.arg(trimws(tolower(type)), c("byte", "integer", "float"))
  type_size <- unname(c("byte"=1, "integer"=4, "float"=4)[type])

  x <- normalizePath(path.expand(x))
  x <- readBin(con = x, what = "raw", n = file.size(x))

  first4 <- x[1:4]
  last4  <- x[(length(x)-3):length(x)]

  f4c <- rawConnection(first4)
  on.exit(close(f4c), add=TRUE)
  f4 <- readBin(con = f4c, what = "integer", n = 1, size = 4L, endian = endian)

  l4c <- rawConnection(last4)
  on.exit(close(l4c), add=TRUE)
  l4 <- readBin(con = l4c, what = "integer", n = 1, size = 4L, endian = endian)

  stopifnot(f4 == l4)              # check front/back markers are equal
  stopifnot(f4 %% type_size == 0)  # should have nothing left over

  n_rec <- f4 / type_size

  message(sprintf("Reading in %s records...", scales::comma(n_rec)))  # needs the scales package

  x <- x[-(1:4)]
  x <- x[-((length(x)-3):length(x))]

  rc <- rawConnection(x)
  on.exit(close(rc), add=TRUE)

  what <- switch(type, byte="raw", integer="integer", float="numeric")

  dat <- readBin(con = rc, what = what, n = n_rec, size = type_size, endian = endian)

  dat

}
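A quick usage sketch of this generalized version (the type choice and the reshape here are assumptions based on the readme, not part of the function itself):
vals <- read_binary_array("smoothmap.fix.2.bin", type = "float")
## Reading in 64,800 records...
transcom <- matrix(vals, nrow = 360, ncol = 180)
image(transcom)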
Answer 2:
This is incomplete; I'm posting it to show progress.
There might be an undocumented "feature" in the data file, in that eight of its bytes are not part of the data. (The file is 259208 bytes, but 360*180*4 == 259200.) However, I do find something else interesting:
d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180, size = 4, endian = "big")
head(d)
# [1] 259200 0 0 0 0 0
I'm going to infer that the first 4-byte integer (259200) indicates the size of the data, so I suggest we can discard it. You might argue that you have the proper length of a vector here, but that's only because you forced readBin to stop loading data. From ?readBin:
n: integer. The (maximal) number of records to be read. You can use an over-estimate here, but not too large as storage is reserved for 'n' items.
So it should be safe to read beyond the expected file size; readBin will handle EOF by itself. I'll increase n by 10, arbitrarily:
length(d)
# [1] 64800
d <- readBin(file("~/Downloads/smoothmap.fix.2.bin", "rb"), integer(), n = 360*180+10, size = 4, endian = "big")
length(d)
# [1] 64802
tail(d)
# [1] 1098907648 1098907648 1098907648 1098907648 1098907648 259200
(Note that even though I asked for another 10 values, only two more were available. So you know, the rationale for the n argument is pre-allocation of memory, nothing more.) That 259200 is there again at the end; I'll infer that it confirms the end of the data, so we should be able to safely discard those two (first/last) numbers.
d <- d[-c(1, length(d))]
The first non-zero number is:
head(which(d>0))
# [1] 4321 4322 4323 4324 4325 4326
d[4321]
# [1] 1101004800
and looking at the bits:
intToBits(d[4321])
# [1] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 01 01
# [26] 00 00 00 00 00 01 00
So if you infer a direct binary interpretation, that value is 2820, which doesn't match the smoothmap.readme description of available values. Furthermore, this is what we are expecting to see:
intToBits(22)
# [1] 00 01 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
So it appears that your bits are ... not in the right order, or something like that. If you intToBits all the unique values, you'll notice that all bits 1-19 (the least-significant bits) are zero.
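One quick way to confirm that observation (a sketch, reusing the d from above):
low_bits <- sapply(unique(d), function(a) as.integer(intToBits(a))[1:19])
all(low_bits == 0)
## [1] TRUE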
From here, I'm at a loss ...
sapply(unique(d), function(a) packBits(rev(intToBits(a)), type="integer"))
# [1] 0 1410 258 1154 3714 6530 3458 770 514 5506 2690 1666 2434 2178 1282 130 642 4482 2 3202 1794 508 386
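Consistent with what Answer 1 found, these integers are the bit patterns of 4-byte IEEE-754 floats; reinterpreting one of them as a float gives a value in the expected 0-22 range (a sketch):
readBin(packBits(intToBits(1101004800)), "numeric", n = 1, size = 4, endian = "little")
## [1] 20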
Source: https://stackoverflow.com/questions/47947793/reading-a-binary-map-file-in-r