Question
I have millions of JSON files, where each file contains the same columns, let's say x and y. The lengths of x and y are equal within a single file, but can differ between two different files.
The problem is that the only thing that separates the data is the name of the file. So when combining the files I'd like to have the file name included as a third column. Is this possible using sparklyr::spark_read_json, i.e. when reading with wildcards?
MWE:
library(sparklyr)
## Spark connection
sc <- spark_connect(master = "local", version = "2.1.0")
## Create data
data_dir <- tempdir()
tbl_json1 <- data.frame(x = 1:3, y = 1:3)
tbl_json2 <- data.frame(x = 1:10, y = 1:10)
## Write data to disk
write(jsonlite::toJSON(tbl_json1), sprintf("%s/tab1.json", data_dir))
write(jsonlite::toJSON(tbl_json2), sprintf("%s/tab2.json", data_dir))
## Read both files using wildcard
combined_table <- spark_read_json(
sc,
name = "combined_table",
path = sprintf("%s/*.json", data_dir)
)
## Transfer results to R
library(dplyr)
dt <- combined_table %>% collect()
# # A tibble: 13 x 2
# x y
# <dbl> <dbl>
# 1 1. 1.
# 2 2. 2.
# 3 3. 3.
# 4 4. 4.
# 5 5. 5.
# 6 6. 6.
# 7 7. 7.
# 8 8. 8.
# 9 9. 9.
# 10 10. 10.
# 11 1. 1.
# 12 2. 2.
# 13 3. 3.
Wanted output
# # A tibble: 13 x 3
# x y id
# <dbl> <dbl> <chr>
# 1 1. 1. tab2
# 2 2. 2. tab2
# 3 3. 3. tab2
# 4 4. 4. tab2
# 5 5. 5. tab2
# 6 6. 6. tab2
# 7 7. 7. tab2
# 8 8. 8. tab2
# 9 9. 9. tab2
# 10 10. 10. tab2
# 11 1. 1. tab1
# 12 2. 2. tab1
# 13 3. 3. tab1
Answer 1:
You can disable eager caching with memory = FALSE (you really should anyway with millions of files, since it skips caching the whole dataset into Spark memory on read):
combined_table <- spark_read_json(
sc,
name = "combined_table",
path = sprintf("%s/*.json", data_dir),
  memory = FALSE
)
and use the input_file_name function:
combined_table %>% mutate(id = input_file_name())
# Source: lazy query [?? x 3]
# Database: spark_connection
x y id
<dbl> <dbl> <chr>
1 1 1 file:///tmp/RtmpnIAUek/tab2.json
2 2 2 file:///tmp/RtmpnIAUek/tab2.json
3 3 3 file:///tmp/RtmpnIAUek/tab2.json
4 4 4 file:///tmp/RtmpnIAUek/tab2.json
5 5 5 file:///tmp/RtmpnIAUek/tab2.json
6 6 6 file:///tmp/RtmpnIAUek/tab2.json
7 7 7 file:///tmp/RtmpnIAUek/tab2.json
8 8 8 file:///tmp/RtmpnIAUek/tab2.json
9 9 9 file:///tmp/RtmpnIAUek/tab2.json
10 10 10 file:///tmp/RtmpnIAUek/tab2.json
# ... with more rows
If needed, it can be combined with Hive's parse_url UDF to strip the file:// scheme and keep just the path:
combined_table %>% mutate(id = parse_url(input_file_name(), "FILE"))
# Source: lazy query [?? x 3]
# Database: spark_connection
x y id
<dbl> <dbl> <chr>
1 1 1 /tmp/RtmpnIAUek/tab2.json
2 2 2 /tmp/RtmpnIAUek/tab2.json
3 3 3 /tmp/RtmpnIAUek/tab2.json
4 4 4 /tmp/RtmpnIAUek/tab2.json
5 5 5 /tmp/RtmpnIAUek/tab2.json
6 6 6 /tmp/RtmpnIAUek/tab2.json
7 7 7 /tmp/RtmpnIAUek/tab2.json
8 8 8 /tmp/RtmpnIAUek/tab2.json
9 9 9 /tmp/RtmpnIAUek/tab2.json
10 10 10 /tmp/RtmpnIAUek/tab2.json
# ... with more rows
and you can use other string processing functions to extract individual bits of information.
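For example, to get exactly the wanted id column ("tab1"/"tab2"), one option is Spark SQL's regexp_extract function, which sparklyr passes through to Spark SQL unchanged. This is a minimal sketch, not part of the original answer; it assumes the files sit directly under data_dir and all end in .json:

## Sketch: extract the base file name (without directory or ".json") as id.
## "[.]" matches a literal dot without backslash escapes, which avoids
## escaping issues when the pattern string is sent through to Spark SQL.
combined_table %>%
  mutate(id = regexp_extract(input_file_name(), "([^/]+)[.]json$", 1))

Applied to a path like file:///tmp/RtmpnIAUek/tab2.json, this should yield id values such as "tab2", matching the wanted output.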
Source: https://stackoverflow.com/questions/51440162/adding-name-of-file-when-using-sparklyrspark-read-json