Sparklyr: how to center a Spark table based on column?


I have a Spark table:

simx
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...

and a handle connected to this table in the R environment. How can I center this table by column, i.e. subtract each column's mean from every value in that column?

1 Answer

    You can just use mutate_each / mutate_all:

    library(dplyr)
    
    df <- data.frame(x=c(1, 2, 3), y = c(-4, 5, 6), z = c(42, 42, 42))
    sdf <- copy_to(sc, df, overwrite=TRUE)
    

    mutate_all(sdf, funs(. - mean(.)))
    

    Source:   query [3 x 3]
    Database: spark connection master=local[*] app=sparklyr local=TRUE
    
          x         y     z
      <dbl>     <dbl> <dbl>
    1    -1 -6.333333     0
    2     0  2.666667     0
    3     1  3.666667     0
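
    As a side note, mutate_each and funs() are deprecated in current dplyr releases. A minimal sketch of the same centering with across(), assuming a dplyr / dbplyr combination recent enough to translate across() and purrr-style lambdas for remote tables; it compiles to the same window-function plan discussed below:

    # same centering with the newer across() syntax; assumes the installed
    # dbplyr (>= 2.x) translates across() for Spark/database backends
    mutate(sdf, across(everything(), ~ . - mean(., na.rm = TRUE)))
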
    

    but it is expanded to a really inefficient (and unacceptable for large datasets) window function application. You may be better off with a more verbose solution:

    # collect the per-column means into a local data frame
    avgs <- summarize_all(sdf, funs(mean)) %>% as.data.frame()
    
    # build one "column - mean" expression per column
    exprs <- as.list(paste(colnames(sdf), "-", avgs))
    
    sdf %>%  
      spark_dataframe() %>%                        # drop to the underlying Java object
      invoke("selectExpr", exprs) %>%              # evaluate the centering expressions
      invoke("toDF", as.list(colnames(sdf))) %>%   # restore the original column names
      invoke("registerTempTable", "centered")      # expose the result as a temp table
    
    tbl(sc, "centered")
    
    Source:   query [3 x 3]
    Database: spark connection master=local[*] app=sparklyr local=TRUE
    
          x         y     z
      <dbl>     <dbl> <dbl>
    1    -1 -6.333333     0
    2     0  2.666667     0
    3     1  3.666667     0
    

    It is not as pretty as the dplyr approach but, unlike the former, it does the sensible thing.

    If you want to skip all the invokes you can use dplyr to do the same thing:

    transmute_(sdf, .dots = setNames(exprs, colnames(sdf)))
    
    Source:   query [3 x 3]
    Database: spark connection master=local[*] app=sparklyr local=TRUE
    
          x         y     z
      <dbl>     <dbl> <dbl>
    1    -1 -6.333333     0
    2     0  2.666667     0
    3     1  3.666667     0
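
    The underscored verb transmute_() is deprecated in current dplyr as well. A sketch of the same idea with tidy evaluation, assuming the exprs and sdf objects built above and using rlang to splice the string expressions:

    library(rlang)
    
    # parse the "column - mean" strings into expressions, name them after the
    # original columns, and splice them into transmute()
    sdf %>% transmute(!!!set_names(parse_exprs(unlist(exprs)), colnames(sdf)))
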
    

    Execution plans:

    A helper function (see also dbplyr::remote_query for the physical plan):

    optimizedPlan <- function(df) {
      df %>% 
        spark_dataframe() %>%
        invoke("queryExecution") %>%
        invoke("optimizedPlan")
    }
    

    dplyr version:

    mutate_all(sdf, funs(. - mean(.))) %>% optimizedPlan()
    
    <jobj[190]>
      class org.apache.spark.sql.catalyst.plans.logical.Project
      Project [x#2877, y#2878, (z#1123 - _we0#2894) AS z#2879]
    +- Window [avg(z#1123) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS _we0#2894]
       +- Project [x#2877, (y#1122 - _we0#2892) AS y#2878, z#1123]
          +- Window [avg(y#1122) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS _we0#2892]
             +- Project [(x#1121 - _we0#2890) AS x#2877, z#1123, y#1122]
                +- Window [avg(x#1121) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS _we0#2890]
                   +- Project [y#1122, z#1123, x#1121]
                      +- InMemoryRelation [x#1121, y#1122, z#1123], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `df`
                         :  +- *Scan csv [x#1121,y#1122,z#1123] Format: CSV, InputPaths: file:/tmp/RtmpiEECCe/spark_serialize_f848ebf3e065c9a204092779c3e8f32ce6afdcb6e79bf6b9868ae9ff198a..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<x:double,y:double,z:double>
    

    Spark solution:

    tbl(sc, "centered") %>% optimizedPlan()
    
    <jobj[204]>
      class org.apache.spark.sql.catalyst.plans.logical.Project
      Project [(x#1121 - 2.0) AS x#2339, (y#1122 - 2.33333333333333) AS y#2340, (z#1123 - 42.0) AS z#2341]
    +- InMemoryRelation [x#1121, y#1122, z#1123], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `df`
       :  +- *Scan csv [x#1121,y#1122,z#1123] Format: CSV, InputPaths: file:/tmp/RtmpiEECCe/spark_serialize_f848ebf3e065c9a204092779c3e8f32ce6afdcb6e79bf6b9868ae9ff198a..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<x:double,y:double,z:double>
    

    dplyr optimized:

    transmute_(sdf, .dots = setNames(exprs, colnames(sdf))) %>% optimizedPlan()
    
    <jobj[272]>
      class org.apache.spark.sql.catalyst.plans.logical.Project
      Project [(x#1121 - 2.0) AS x#4792, (y#1122 - 2.33333333333333) AS y#4793, (z#1123 - 42.0) AS z#4794]
    +- InMemoryRelation [x#1121, y#1122, z#1123], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `df`
       :  +- *Scan csv [x#1121,y#1122,z#1123] Format: CSV, InputPaths: file:/tmp/RtmpiEECCe/spark_serialize_f848ebf3e065c9a204092779c3e8f32ce6afdcb6e79bf6b9868ae9ff198a..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<x:double,y:double,z:double>
    

    Notes:

    Spark SQL is not that good at handling wide datasets. With core Spark you usually combine the features into a single Vector column, and Spark provides a number of transformers which can be used to operate on Vector data; see the sketch below.
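
    For example, a minimal sketch of that Vector-column route using sparklyr's feature transformers, ft_vector_assembler plus ft_standard_scaler with with_mean = TRUE and with_std = FALSE (which only centers). Note that the result is a single Vector column (named "features_centered" here), not separate numeric columns:

    # assemble all numeric columns into one Vector column, then center it;
    # ft_standard_scaler fits on the data and returns the transformed tbl
    centered_vec <- sdf %>%
      ft_vector_assembler(
        input_cols = colnames(sdf),
        output_col = "features"
      ) %>%
      ft_standard_scaler(
        input_col  = "features",
        output_col = "features_centered",
        with_mean  = TRUE,   # subtract the column means
        with_std   = FALSE   # keep the original scale
      )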
