How to flatten the data of different data types by using Sparklyr package?

后端 未结 1 1587
慢半拍i
慢半拍i 2021-01-08 01:01

Introduction

R code is written by using Sparklyr package to create database schema. [Reproducible code and database is given]

相关标签:
1条回答
  • 2021-01-08 02:02

    Alright, so this is one possible way to unnest it all.

    You can use the schema information to create all of the nested names. For example, entities.media.additional_media_info, then you can just use SQL to select them.

    This is a bit labour intensive, and may not generalise, but it works

    I would like to think this should be quick too, as it is only a SELECT statement.

    columns_to_flatten <- sdf_schema_json(sample_tbl, simplify = T) %>%
      # using rlist package for ease of use
      rlist::list.flatten(use.names = T) %>%
      # get names
      names() %>%
      # remove contents of brackets and whitespace
      gsub("\\(.*?\\)|\\s", "", .) %>%
      # add alias for column names, dot replaced with double underscore
      # this avoids duplicate names that would otherwise occur with singular
      {paste(., "AS", gsub("\\.", "__", .))} %>%
      # required, otherwise doesn't seem to work
      sub("variants", "variants[0]", .)
    
    # construct query
    sql_statement <- paste("SELECT",
                           paste(columns_to_flatten, collapse = ", "),
                           "FROM example")
    
    # execute on spark cluster, save as table in cluster
    spark_session(sc) %>%
      sparklyr::invoke("sql", sql_statement) %>%
      sparklyr::invoke("createOrReplaceTempView", "flattened_example")
    

    tbl(sc, "flattened_example") %>%
      sdf_schema_viewer()
    


    The SQL generated looks like this, rather simple, just long:

    SELECT contributors AS contributors, coordinates AS coordinates, created_at AS created_at, display_text_range AS display_text_range, entities.hashtags.indices AS entities__hashtags__indices, entities.hashtags.text AS entities__hashtags__text, entities.media.additional_media_info.description AS entities__media__additional_media_info__description, entities.media.additional_media_info.embeddable AS entities__media__additional_media_info__embeddable, entities.media.additional_media_info.monetizable AS entities__media__additional_media_info__monetizable, entities.media.additional_media_info.title AS entities__media__additional_media_info__title, entities.media.display_url AS entities__media__display_url, entities.media.expanded_url AS entities__media__expanded_url, entities.media.id AS entities__media__id, entities.media.id_str AS entities__media__id_str, entities.media.indices AS entities__media__indices, entities.media.media_url AS entities__media__media_url, entities.media.media_url_https AS entities__media__media_url_https, entities.media.sizes.large.h AS entities__media__sizes__large__h, entities.media.sizes.large.resize AS entities__media__sizes__large__resize, entities.media.sizes.large.w AS entities__media__sizes__large__w FROM example
    
    0 讨论(0)
提交回复
热议问题