Question
I have millions of JSON files, where each file contains the same columns, let's say x and y. The lengths of x and y are equal within a single file, but can differ between two different files.
The problem is that the only thing that separates the data is the name of the file. So when combining the files I'd like to have the file name included as a third column. Is this possible using sparklyr::spark_read_json, i.e. when reading with wildcards?
MWE:
library(sparklyr)
## Spark connection
sc <- spark_connect(master = "local", version = "2.1.0")
## Create data
data_dir <- tempdir()
tbl_json1 <- data.frame(x = 1:3, y = 1:3)
tbl_json2 <- data.frame(x = 1:10, y = 1:10)
## Write data to disk
write(jsonlite::toJSON(tbl_json1), sprintf("%s/tab1.json", data_dir))
write(jsonlite::toJSON(tbl_json2), sprintf("%s/tab2.json", data_dir))
## Read both files using wildcard
combined_table <- spark_read_json(
sc,
name = "combined_table",
path = sprintf("%s/*.json", data_dir)
)
## Transfer results to R
library(dplyr)
dt <- combined_table %>% collect()
# # A tibble: 13 x 2
# x y
# <dbl> <dbl>
# 1 1. 1.
# 2 2. 2.
# 3 3. 3.
# 4 4. 4.
# 5 5. 5.
# 6 6. 6.
# 7 7. 7.
# 8 8. 8.
# 9 9. 9.
# 10 10. 10.
# 11 1. 1.
# 12 2. 2.
# 13 3. 3.
Wanted output
# # A tibble: 13 x 3
# x y id
# <dbl> <dbl> <chr>
# 1 1. 1. tab2
# 2 2. 2. tab2
# 3 3. 3. tab2
# 4 4. 4. tab2
# 5 5. 5. tab2
# 6 6. 6. tab2
# 7 7. 7. tab2
# 8 8. 8. tab2
# 9 9. 9. tab2
# 10 10. 10. tab2
# 11 1. 1. tab1
# 12 2. 2. tab1
# 13 3. 3. tab1
Answer 1:
You can disable eager caching with memory = FALSE (you really should anyway with millions of files, since it skips caching the whole dataset into Spark memory on read):
combined_table <- spark_read_json(
sc,
name = "combined_table",
path = sprintf("%s/*.json", data_dir),
  memory = FALSE
)
and use the input_file_name function:
combined_table %>% mutate(id = input_file_name())
# Source: lazy query [?? x 3]
# Database: spark_connection
x y id
<dbl> <dbl> <chr>
1 1 1 file:///tmp/RtmpnIAUek/tab2.json
2 2 2 file:///tmp/RtmpnIAUek/tab2.json
3 3 3 file:///tmp/RtmpnIAUek/tab2.json
4 4 4 file:///tmp/RtmpnIAUek/tab2.json
5 5 5 file:///tmp/RtmpnIAUek/tab2.json
6 6 6 file:///tmp/RtmpnIAUek/tab2.json
7 7 7 file:///tmp/RtmpnIAUek/tab2.json
8 8 8 file:///tmp/RtmpnIAUek/tab2.json
9 9 9 file:///tmp/RtmpnIAUek/tab2.json
10 10 10 file:///tmp/RtmpnIAUek/tab2.json
# ... with more rows
If needed, it can be combined with Hive's parse_url UDF to strip the file:// scheme and keep just the path:
combined_table %>% mutate(id = parse_url(input_file_name(), "FILE"))
# Source: lazy query [?? x 3]
# Database: spark_connection
x y id
<dbl> <dbl> <chr>
1 1 1 /tmp/RtmpnIAUek/tab2.json
2 2 2 /tmp/RtmpnIAUek/tab2.json
3 3 3 /tmp/RtmpnIAUek/tab2.json
4 4 4 /tmp/RtmpnIAUek/tab2.json
5 5 5 /tmp/RtmpnIAUek/tab2.json
6 6 6 /tmp/RtmpnIAUek/tab2.json
7 7 7 /tmp/RtmpnIAUek/tab2.json
8 8 8 /tmp/RtmpnIAUek/tab2.json
9 9 9 /tmp/RtmpnIAUek/tab2.json
10 10 10 /tmp/RtmpnIAUek/tab2.json
# ... with more rows
and you can use other string processing functions to extract individual bits of information.
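For example, to get exactly the wanted id column ("tab1"/"tab2"), one option is Spark SQL's regexp_extract function, which sparklyr passes through to Spark SQL unchanged. This is a minimal sketch, not part of the original answer; it assumes the files sit directly under data_dir and all end in .json:

## Sketch: extract the base file name (without directory or ".json") as id.
## "[.]" matches a literal dot without backslash escapes, which avoids
## escaping issues when the pattern string is sent through to Spark SQL.
combined_table %>%
  mutate(id = regexp_extract(input_file_name(), "([^/]+)[.]json$", 1))

Applied to a path like file:///tmp/RtmpnIAUek/tab2.json, this should yield id values such as "tab2", matching the wanted output.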
Source: https://stackoverflow.com/questions/51440162/adding-name-of-file-when-using-sparklyrspark-read-json