Read Json file into a data.frame without nested lists

前端 未结 4 1011
清歌不尽
清歌不尽 2020-11-30 06:50

I am trying to load a json file into a data.frame in r. I have had some luck with the fromJSON function in the jsonlite package - But am getting nested lists and am not sur

相关标签:
4条回答
  • 2020-11-30 07:15

    So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.

    First, I had to understand your data set. It appears to be information about health providers.

     providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE ) 
     types = sapply(providers,"[[","type")
     table(types)
    
     # FACILITY INDIVIDUAL 
     #    279       2977 
    
    • FACILITY entries have the "ID" fields facility_name and facility_type.
    • INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
    • All entries have "ID" fields npi and last_updated_on.
    • All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.

    Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tac on a similarly denormalized data frame with "ID" information using another merge.

    library(dplyr)
    unfurl_npi_data = function (x) {
      repeat_cols = c("plans","addresses")
      id_cols = setdiff(names(x),repeat_cols)
      repeat_data = x[repeat_cols]
      id_data  = x[id_cols]
    
      # Denormalized ID data
      id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
      atomic_colnames = names(which(!sapply(id_data, is.list)))
      df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
      colnames(id_data_df)[df_atomic_cols] = atomic_colnames
    
      # Join the plans and addresses (denormalized)
      repeated_data = lapply(repeat_data, rbind_all)
      repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
    
      merge(id_data_df, repeated_data_crossed)
    }
    
    providers2 = split(providers, types)
    providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))
    

    Then do some cleanup.

    unique_df = function(x) {
      chr_col_names = names(which(sapply(x, class) == "character"))
      for( col in chr_col_names )
        x[[col]] = toupper(x[[col]])
      unique(x)
    }
    providers3 = lapply(providers3, unique_df)
    facilities = providers3[["FACILITY"]]
    individuals = providers3[["INDIVIDUAL"]]
    rm(providers, providers2, providers3)
    

    And now you can ask some interesting questions. For example, how many addresses does each health care provider have?

     unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
     num_addresses = unique_providers %>% count(first, middle, last, gender)
     table(num_addresses$n)
    
     #    1    2    3    4    5    6    7    8    9   12   13 
     # 2258  492  119   33   43   21    6    1    2    1    1 
    

    At addresses with more than five people, what is the percent of male healthcare providers?

    address_pcts = unique_providers %>% 
      group_by(address, city, state) %>%
      filter(n()>5) %>%
      arrange(address) %>%
      summarise(pct_male = sum(gender=="MALE")/n())
    library(ggplot2)
    qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
    

    And on and on...

    0 讨论(0)
  • 2020-11-30 07:17

    My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:

    ##--------------------------------------
    ## libraries
    ##--------------------------------------
    library(rjson);
    library(RCurl);
    
    ##--------------------------------------
    ## get data
    ##--------------------------------------
    URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
    jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data
    

    Next, to get a deep understanding of the structure and cleanness of the data, I wrote a set of helper functions:

    ##--------------------------------------
    ## helper functions
    ##--------------------------------------
    ## apply a function to a set of nodes at the same depth level in a recursive list structure
    levelApply <- function(
        nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
        keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
        func=identity, ## a function to run separately on each node once keyList has been exhausted
        ..., ## further arguments passed to func()
        joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
    ) {
        if (length(keyList) == 0L) {
            ret <- if (is.null(nodes)) NULL else func(nodes,...)
        } else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
            ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
            if (!is.null(joinFunc))
                ret <- do.call(joinFunc,ret);
        } else {
            ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
        }; ## end if
        ret;
    }; ## end if
    ## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
    levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
    levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret
    

    The key to understanding the above is the keyList parameter. Let's say you have a list like this:

    list(NULL,'addresses',2:3,'city')
    

    That would select all city strings underneath the second and third address elements underneath the addresses list underneath all elements of the main list.

    There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:

    unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
    ## [1] "1001 Noble St"  "Fairbanks"      "1650 Cowles St" "Fairbanks"
    

    Here are the remaining helper functions I wrote in the process of working on this problem:

    ## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
    keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
    keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };
    
    ## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
    tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
    tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };
    
    ## stringification functions for display
    mkcsv <- function(v) paste0(collapse=',',v);
    keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));
    
    ## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
    keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);
    
    ## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
    rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
    getNPIRow <- function(df,npi) which(df$npi == npi);
    npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);
    

    I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:

    ##--------------------------------------
    ## data examination
    ##--------------------------------------
    ## type of object -- plain unnamed list => array, length 3256
    levelApplyToVec(jsonRList,list(),tlStr);
    ## [1] "list[3256]"
    
    ## unique types of main array elements => all named lists => hashes
    unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
    ## [1] "namedlist"
    
    ## get the union of keys among all hashes
    allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
    ##  [1] "npi"             "type"            "facility_name"   "facility_type"   "addresses"       "plans"           "last_updated_on" "name"            "speciality"      "accepting"       "languages"       "gender"
    
    ## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
    keyCombosWithCount(jsonRList,list(NULL),allKeys);
    ##    npi type facility_name facility_type addresses plans last_updated_on  name speciality accepting languages gender .count
    ## 1 TRUE TRUE          TRUE          TRUE      TRUE  TRUE            TRUE FALSE      FALSE     FALSE     FALSE  FALSE    279
    ## 2 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE   TRUE   2973
    ## 3 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE  FALSE      4
    
    ## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
    keyTypes(jsonRList,list(NULL),allKeys);
    ##                key                                                                                        tl
    ## 1              npi                                                                              character[1]
    ## 2             type                                                                              character[1]
    ## 3    facility_name                                                                              character[1]
    ## 4    facility_type                                                    character[1],character[2],character[3]
    ## 5        addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
    ## 6            plans                                                                                   list[9]
    ## 7  last_updated_on                                                                              character[1]
    ## 8             name                                                                                 namedlist
    ## 9       speciality                                       character[1],character[2],character[3],character[4]
    ## 10       accepting                                                                              character[1]
    ## 11       languages                          character[2],character[3],character[4],character[6],character[5]
    ## 12          gender                                                                              character[1]
    
    ## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them
    
    ## ==== addresses =====
    ## note: the addresses key is always present under main array elements
    ## unique types of address elements across all hashes => all named lists, thus nested hashes
    unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
    ## [1] "namedlist"
    
    ## union of keys among all address element hashes
    allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
    ## [1] "address"   "city"      "state"     "zip"       "phone"     "address_2"
    
    ## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
    keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
    ##   address city state  zip phone address_2 .count
    ## 1    TRUE TRUE  TRUE TRUE  TRUE     FALSE   1898
    ## 2    TRUE TRUE  TRUE TRUE  TRUE      TRUE   2575
    
    ## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
    keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
    ##         key           tl
    ## 1   address character[1]
    ## 2      city character[1]
    ## 3     state character[1]
    ## 4       zip character[1]
    ## 5     phone character[1]
    ## 6 address_2 character[1]
    
    ## ==== plans =====
    ## note: the plans key is always present under main array elements
    ## unique types of plan elements across all hashes => all named lists, thus nested hashes
    unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
    ## [1] "namedlist"
    
    ## union of keys among all plan element hashes
    allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
    ## [1] "plan_id_type" "plan_id"      "network_tier"
    
    ## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
    keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
    ##   plan_id_type plan_id network_tier .count
    ## 1         TRUE    TRUE         TRUE  29304
    
    ## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
    keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
    ##            key           tl
    ## 1 plan_id_type character[1]
    ## 2      plan_id character[1]
    ## 3 network_tier character[1]
    
    ## ==== name =====
    ## note: the name key is *not* always present under main array elements
    ## union of keys among all name hashes
    allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
    ## [1] "first"  "middle" "last"
    
    ## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
    keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
    ##   first middle last .count
    ## 1  TRUE   TRUE TRUE   2679
    ## 2  TRUE  FALSE TRUE    298
    
    ## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
    keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
    ##      key           tl
    ## 1  first character[1]
    ## 2 middle character[1]
    ## 3   last character[1]
    

    Here's my summary of the data:

    • one top-level main list, length 3256.
    • each element is a hash with inconsistent key sets. There are 12 keys in total across all main hashes, with 3 patterns of key sets present.
    • 6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a list of variable length, plans is a list always of length 9, and name is a hash.
    • each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
    • each plans list element is a hash with 3 keys to scalar strings, no inconsistencies.
    • each name hash has first and last but not always middle scalar strings.

    The most important observation here is that there are no type-inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion. We can flatten all the data into a two-dimensional structure provided we associate columns with deep-enough nodes, such that all columns correspond to a single scalar string node in the input list.

    Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() I defined earlier.

    ##--------------------------------------
    ## solution
    ##--------------------------------------
    ## recursively traverse the list structure, building up a column at each leaf node
    extractLevelColumns <- function(
        nodes, ## current level node selection
        ..., ## additional arguments to data.frame()
        keyList=list(), ## current key path under main list
        sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
        mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
    ) {
        cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
        if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
        tlList <- lapply(nodes,tl);
        typeList <- do.call(c,lapply(tlList,`[[`,'type'));
        if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
        type <- typeList[1L];
        if (type == 'namedlist') { ## hash; recurse
            allKeys <- unique(do.call(c,lapply(nodes,names)));
            ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
        } else if (type == 'list') { ## array; recurse
            lenList <- do.call(c,lapply(tlList,`[[`,'len'));
            maxLen <- max(lenList,na.rm=T);
            allIndexes <- seq_len(maxLen);
            ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
        } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
            lenList <- do.call(c,lapply(tlList,`[[`,'len'));
            maxLen <- max(lenList,na.rm=T);
            if (is.null(sep)) {
                ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
            } else {
                ## keep original type if maxLen is 1, IOW don't stringify
                ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
            }; ## end if
        } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
        if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
        ret;
    }; ## end extractLevelColumns()
    
    ## simple interface function
    flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
    

    The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where the value was missing, and then transforming to a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList to the string column name. Multiple columns are returned as a list of data.frames from each recursive call and likewise from the top-level call.

    It also validates that there are no type-inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, so this validation step is appropriate.

    flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.

    An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.

    I've also added a sep parameter which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements to a single string, and only one column is generated for the node.

    In terms of performance, it's very fast. Here's a demo:

    ## actually run it
    system.time({ df <- flattenList(jsonRList); });
    ## extractLevelColumns(): /
    ## extractLevelColumns(): /npi
    ## extractLevelColumns(): /type
    ## extractLevelColumns(): /facility_name
    ## extractLevelColumns(): /facility_type
    ## extractLevelColumns(): /addresses
    ## extractLevelColumns(): /addresses/1
    ## extractLevelColumns(): /addresses/1/address
    ## extractLevelColumns(): /addresses/1/city
    ##
    ## ... snip ...
    ##
    ## extractLevelColumns(): /plans/9/network_tier
    ## extractLevelColumns(): /last_updated_on
    ## extractLevelColumns(): /name
    ## extractLevelColumns(): /name/first
    ## extractLevelColumns(): /name/middle
    ## extractLevelColumns(): /name/last
    ## extractLevelColumns(): /speciality
    ## extractLevelColumns(): /accepting
    ## extractLevelColumns(): /languages
    ## extractLevelColumns(): /gender
    ##    user  system elapsed
    ##   2.265   0.000   2.268
    

    Result:

    class(df); dim(df); names(df);
    ## [1] "data.frame"
    ## [1] 3256  126
    ##   [1] "npi"                    "type"                   "facility_name"          "facility_type.1"        "facility_type.2"        "facility_type.3"        "addresses.1.address"    "addresses.1.city"       "addresses.1.state"
    ##  [10] "addresses.1.zip"        "addresses.1.phone"      "addresses.1.address_2"  "addresses.2.address"    "addresses.2.city"       "addresses.2.state"      "addresses.2.zip"        "addresses.2.phone"      "addresses.2.address_2"
    ##  [19] "addresses.3.address"    "addresses.3.city"       "addresses.3.state"      "addresses.3.zip"        "addresses.3.phone"      "addresses.3.address_2"  "addresses.4.address"    "addresses.4.city"       "addresses.4.state"
    ##  [28] "addresses.4.zip"        "addresses.4.phone"      "addresses.4.address_2"  "addresses.5.address"    "addresses.5.address_2"  "addresses.5.city"       "addresses.5.state"      "addresses.5.zip"        "addresses.5.phone"
    ##  [37] "addresses.6.address"    "addresses.6.address_2"  "addresses.6.city"       "addresses.6.state"      "addresses.6.zip"        "addresses.6.phone"      "addresses.7.address"    "addresses.7.address_2"  "addresses.7.city"
    ##  [46] "addresses.7.state"      "addresses.7.zip"        "addresses.7.phone"      "addresses.8.address"    "addresses.8.address_2"  "addresses.8.city"       "addresses.8.state"      "addresses.8.zip"        "addresses.8.phone"
    ##  [55] "addresses.9.address"    "addresses.9.address_2"  "addresses.9.city"       "addresses.9.state"      "addresses.9.zip"        "addresses.9.phone"      "addresses.10.address"   "addresses.10.address_2" "addresses.10.city"
    ##  [64] "addresses.10.state"     "addresses.10.zip"       "addresses.10.phone"     "addresses.11.address"   "addresses.11.address_2" "addresses.11.city"      "addresses.11.state"     "addresses.11.zip"       "addresses.11.phone"
    ##  [73] "addresses.12.address"   "addresses.12.address_2" "addresses.12.city"      "addresses.12.state"     "addresses.12.zip"       "addresses.12.phone"     "addresses.13.address"   "addresses.13.city"      "addresses.13.state"
    ##  [82] "addresses.13.zip"       "addresses.13.phone"     "plans.1.plan_id_type"   "plans.1.plan_id"        "plans.1.network_tier"   "plans.2.plan_id_type"   "plans.2.plan_id"        "plans.2.network_tier"   "plans.3.plan_id_type"
    ##  [91] "plans.3.plan_id"        "plans.3.network_tier"   "plans.4.plan_id_type"   "plans.4.plan_id"        "plans.4.network_tier"   "plans.5.plan_id_type"   "plans.5.plan_id"        "plans.5.network_tier"   "plans.6.plan_id_type"
    ## [100] "plans.6.plan_id"        "plans.6.network_tier"   "plans.7.plan_id_type"   "plans.7.plan_id"        "plans.7.network_tier"   "plans.8.plan_id_type"   "plans.8.plan_id"        "plans.8.network_tier"   "plans.9.plan_id_type"
    ## [109] "plans.9.plan_id"        "plans.9.network_tier"   "last_updated_on"        "name.first"             "name.middle"            "name.last"              "speciality.1"           "speciality.2"           "speciality.3"
    ## [118] "speciality.4"           "accepting"              "languages.1"            "languages.2"            "languages.3"            "languages.4"            "languages.5"            "languages.6"            "gender"
    

    The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:

    rowToFrame(df[1L,]);
    ##                     column           value
    ## 1                      npi      1063645026
    ## 2                     type        FACILITY
    ## 3            facility_name EXPRESS SCRIPTS
    ## 4          facility_type.1      Pharmacies
    ## 5          facility_type.2            <NA>
    ## 6          facility_type.3            <NA>
    ## 7      addresses.1.address    4750 E 450 S
    ## 8         addresses.1.city      WHITESTOWN
    ## 9        addresses.1.state              IN
    ## 10         addresses.1.zip           46075
    ## 11       addresses.1.phone      2012695236
    ## 12   addresses.1.address_2            <NA>
    ## 13     addresses.2.address            <NA>
    ## 14        addresses.2.city            <NA>
    ## 15       addresses.2.state            <NA>
    ## 16         addresses.2.zip            <NA>
    ## 17       addresses.2.phone            <NA>
    ## 18   addresses.2.address_2            <NA>
    ## 19     addresses.3.address            <NA>
    ## 20        addresses.3.city            <NA>
    ## 21       addresses.3.state            <NA>
    ##
    ## ... snip ...
    ##
    ## 77        addresses.12.zip            <NA>
    ## 78      addresses.12.phone            <NA>
    ## 79    addresses.13.address            <NA>
    ## 80       addresses.13.city            <NA>
    ## 81      addresses.13.state            <NA>
    ## 82        addresses.13.zip            <NA>
    ## 83      addresses.13.phone            <NA>
    ## 84    plans.1.plan_id_type    HIOS-PLAN-ID
    ## 85         plans.1.plan_id  38344AK0620003
    ## 86    plans.1.network_tier   HERITAGE-PLUS
    ## 87    plans.2.plan_id_type    HIOS-PLAN-ID
    ## 88         plans.2.plan_id  38344AK0620004
    ## 89    plans.2.network_tier   HERITAGE-PLUS
    ## 90    plans.3.plan_id_type    HIOS-PLAN-ID
    ## 91         plans.3.plan_id  38344AK0620006
    ## 92    plans.3.network_tier   HERITAGE-PLUS
    ## 93    plans.4.plan_id_type    HIOS-PLAN-ID
    ## 94         plans.4.plan_id  38344AK0620008
    ## 95    plans.4.network_tier   HERITAGE-PLUS
    ## 96    plans.5.plan_id_type    HIOS-PLAN-ID
    ## 97         plans.5.plan_id  38344AK0570001
    ## 98    plans.5.network_tier   HERITAGE-PLUS
    ## 99    plans.6.plan_id_type    HIOS-PLAN-ID
    ## 100        plans.6.plan_id  38344AK0570002
    ## 101   plans.6.network_tier   HERITAGE-PLUS
    ## 102   plans.7.plan_id_type    HIOS-PLAN-ID
    ## 103        plans.7.plan_id  38344AK0980003
    ## 104   plans.7.network_tier   HERITAGE-PLUS
    ## 105   plans.8.plan_id_type    HIOS-PLAN-ID
    ## 106        plans.8.plan_id  38344AK0980006
    ## 107   plans.8.network_tier   HERITAGE-PLUS
    ## 108   plans.9.plan_id_type    HIOS-PLAN-ID
    ## 109        plans.9.plan_id  38344AK0980012
    ## 110   plans.9.network_tier   HERITAGE-PLUS
    ## 111        last_updated_on      2015-10-14
    ## 112             name.first            <NA>
    ## 113            name.middle            <NA>
    ## 114              name.last            <NA>
    ## 115           speciality.1            <NA>
    ## 116           speciality.2            <NA>
    ## 117           speciality.3            <NA>
    ## 118           speciality.4            <NA>
    ## 119              accepting            <NA>
    ## 120            languages.1            <NA>
    ## 121            languages.2            <NA>
    ## 122            languages.3            <NA>
    ## 123            languages.4            <NA>
    ## 124            languages.5            <NA>
    ## 125            languages.6            <NA>
    ## 126                 gender            <NA>
    

    I've tested the result pretty thoroughly by doing many spot-checks on individual records, and it all looks correct. Let me know if you have any questions.

    0 讨论(0)
  • 2020-11-30 07:23

    Update: 21 February 2016

    col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.


    In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.

    Here's a sample dataset that recreates those conditions:

    mydf <- data.frame(id = 1:3, type = c("A", "A", "B"), 
                       facility = I(list(c("x", "y"), NULL, "x")),
      address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3), 
                       data.frame(v1 = 1:2, v2 = 3:4, v3 = 5), 
                       data.frame(v1 = 1, v2 = NA, v3 = 3))))
    
    mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                              preference = c(TRUE, FALSE, TRUE))
    

    The str of this sample data.frame looks like:

    str(mydf)
    ## 'data.frame':    3 obs. of  5 variables:
    ##  $ id      : int  1 2 3
    ##  $ type    : Factor w/ 2 levels "A","B": 1 1 2
    ##  $ facility:List of 3
    ##   ..$ : chr  "x" "y"
    ##   ..$ : NULL
    ##   ..$ : chr "x"
    ##   ..- attr(*, "class")= chr "AsIs"
    ##  $ address :List of 3
    ##   ..$ :'data.frame': 1 obs. of  3 variables:
    ##   .. ..$ v1: num 1
    ##   .. ..$ v2: num 2
    ##   .. ..$ v4: num 3
    ##   ..$ :'data.frame': 2 obs. of  3 variables:
    ##   .. ..$ v1: int  1 2
    ##   .. ..$ v2: int  3 4
    ##   .. ..$ v3: num  5 5
    ##   ..$ :'data.frame': 1 obs. of  3 variables:
    ##   .. ..$ v1: num 1
    ##   .. ..$ v2: logi NA
    ##   .. ..$ v3: num 3
    ##   ..- attr(*, "class")= chr "AsIs"
    ##  $ person  :'data.frame':    3 obs. of  3 variables:
    ##   ..$ name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
    ##   ..$ age       : num  20 32 23
    ##   ..$ preference: logi  TRUE FALSE TRUE
    ## NULL
    

    One way you can "flatten" this is to "fix" the list columns. There are three fixes.

    1. flatten (from "jsonlite") will take care of columns like the "person" column.
    2. Columns like the "facility" column can be fixed using toString, which would convert each element to a comma separated item or which can be converted into multiple columns.
    3. Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together).

    We can take care of the second and third points with a function like the following:

    col_fixer <- function(x, vec2col = FALSE) {
      if (!is.list(x[[1]])) {
        if (isTRUE(vec2col)) {
          as.data.table(data.table::transpose(x))
        } else {
          vapply(x, toString, character(1L))
        }
      } else {
        temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
        temp[, .time := sequence(.N), by = .id]
        value_vars <- setdiff(names(temp), c(".id", ".time"))
        dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
      }
    }
    

    We'll integrate that and the flatten function in another function that would do most of the processing.

    Flattener <- function(indf, vec2col = FALSE) {
      require(data.table)
      require(jsonlite)
      indf <- flatten(indf)
      listcolumns <- sapply(indf, is.list)
      newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
      indf[listcolumns] <- list(NULL)
      cbind(indf, newcols)
    }
    

    Running the function gives us:

    Flattener(mydf)
    ##   id type person.name person.age person.preference facility address.v1_1
    ## 1  1    A          AA         20              TRUE     x, y            1
    ## 2  2    A          BB         32             FALSE                     1
    ## 3  3    B          CC         23              TRUE        x            1
    ##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
    ## 1           NA            2           NA            3           NA           NA
    ## 2            2            3            4           NA           NA            5
    ## 3           NA           NA           NA           NA           NA            3
    ##   address.v3_2
    ## 1           NA
    ## 2            5
    ## 3           NA
    

    Or, with the vectors going into separate columns:

    Flattener(mydf, TRUE)
    ##   id type person.name person.age person.preference facility.V1 facility.V2
    ## 1  1    A          AA         20              TRUE           x           y
    ## 2  2    A          BB         32             FALSE        <NA>        <NA>
    ## 3  3    B          CC         23              TRUE           x        <NA>
    ##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
    ## 1            1           NA            2           NA            3           NA
    ## 2            1            2            3            4           NA           NA
    ## 3            1           NA           NA           NA           NA           NA
    ##   address.v3_1 address.v3_2
    ## 1           NA           NA
    ## 2            5            5
    ## 3            3           NA
    

    Here's the str:

    str(Flattener(mydf))
    ## 'data.frame':    3 obs. of  14 variables:
    ##  $ id               : int  1 2 3
    ##  $ type             : Factor w/ 2 levels "A","B": 1 1 2
    ##  $ person.name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
    ##  $ person.age       : num  20 32 23
    ##  $ person.preference: logi  TRUE FALSE TRUE
    ##  $ facility         : chr  "x, y" "" "x"
    ##  $ address.v1_1     : num  1 1 1
    ##  $ address.v1_2     : num  NA 2 NA
    ##  $ address.v2_1     : num  2 3 NA
    ##  $ address.v2_2     : num  NA 4 NA
    ##  $ address.v4_1     : num  3 NA NA
    ##  $ address.v4_2     : num  NA NA NA
    ##  $ address.v3_1     : num  NA 5 3
    ##  $ address.v3_2     : num  NA 5 NA
    ## NULL
    

    On your "providers" object, this runs very quickly and consistently:

    library(microbenchmark)
    out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
    out
    # Unit: milliseconds
    #                        expr        min         lq      mean    median        uq       max neval
    #        Flattener(providers)  104.18939  126.59295  157.3744  138.4185  174.5222  308.5218   100
    #  Flattener(providers, TRUE)   67.56471   86.37789  109.8921   96.3534  121.4443  301.4856   100
    #      flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793   100
    
    library(ggplot2)
    qplot(y = time, data = out, colour = expr) ## Via @TylerRinker
    

    0 讨论(0)
  • 2020-11-30 07:35

    This answer is rather a data organization suggestion (and is much shorter than the bounty-attracting answers around;)

    If you want to keep the semantics of the fields, like keep all plan_ids in a single column, you can normalize your data design a bit, and do joins afterwards, if you need the information together:

    library(dplyr)
    
    # notice the simplifyVector=F
    providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F) 
    
    # pick and repeat fields for each element of array
    # {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
    # gives data.frame 
    # field1, field2 array.af1 array.af2
    # val     val2  av1        av2
    # val     val2  av3        av4
    denormalize <- function(data, fields, array) {
      data.frame(
        c(
          data[fields], 
          as.list(
            bind_rows(
              lapply(data[[array]], data.frame)))))
    }
    
    plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
    addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
    npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]), 
                             c('npi', 'type', 'last_updated_on')))
    

    Then you can first filter on the data and join in other information afterwards:

    addresses_df %>%
      filter(city == "Healy") %>%
      left_join(plans_df, by="npi") ->
      plans_in_healy
    
    0 讨论(0)
提交回复
热议问题