Parsing XML file with known structure and repeating elements

孤街浪徒 2021-01-29 03:11

I'm trying to parse information from an XML file that contains a lot of elements with repeating names.

Here is an example of the type of file I am trying to parse, conta

2 answers
  • 2021-01-29 03:47

    I stored OP's XML in a file but duplicated the single record that was provided!

    This could be slicker with add-on packages (I would use dplyr and the %>% pipe), but I held back. I do advise using xml2 instead of XML. You can use XPath expressions to target the nodes of interest.

    library(xml2)
    x <- read_xml("so.xml")
    (elements <- xml_find_all(x, ".//dict/dict/array/dict"))
    #> {xml_nodeset (2)}
    #> [1] <dict>\n                    <key>IE_KEY_80211D_FIRST_CHANNEL</key>\n ...
    #> [2] <dict>\n                    <key>IE_KEY_80211D_FIRST_CHANNEL</key>\n ...
    
    ## isolate the key nodes ... will become variable names
    keys <- lapply(elements, xml_find_all, "key")
    keys <- lapply(keys, xml_text)
    ## I advise checking that keys are uniform across the records here!
    (keys <- keys[[1]])
    #> [1] "IE_KEY_80211D_FIRST_CHANNEL" "IE_KEY_80211D_MAX_POWER"    
    #> [3] "IE_KEY_80211D_NUM_CHANNELS"
    
    ## isolate integer data
    integers <- lapply(elements, xml_find_all, "integer")
    integers <- lapply(integers, xml_text)
    ## as.is = TRUE avoids the "as.is not specified" warning in recent R versions
    integers <- lapply(integers, type.convert, as.is = TRUE)
    yay <- as.data.frame(do.call(rbind, integers))
    names(yay) <- keys
    yay
    #>   IE_KEY_80211D_FIRST_CHANNEL IE_KEY_80211D_MAX_POWER
    #> 1                           1                      27
    #> 2                           1                      27
    #>   IE_KEY_80211D_NUM_CHANNELS
    #> 1                         11
    #> 2                         11
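
    The check that keys are uniform across records is only mentioned in a comment above, not shown. A minimal sketch in base R, assuming the per-record keys are held as a list of character vectors (the state of `keys` after the second `lapply()`); `keys_by_record` and `keys_are_uniform` are hypothetical names, and the sample values are copied from the output above:

    ```r
    ## Hypothetical helper: TRUE only if every record carries exactly the
    ## same keys, in the same order, as the first record.
    keys_are_uniform <- function(keys_by_record) {
      first <- keys_by_record[[1]]
      all(vapply(keys_by_record,
                 function(k) identical(k, first),
                 logical(1)))
    }

    ## Stand-in data: two records with the three keys printed above.
    keys_by_record <- list(
      c("IE_KEY_80211D_FIRST_CHANNEL", "IE_KEY_80211D_MAX_POWER",
        "IE_KEY_80211D_NUM_CHANNELS"),
      c("IE_KEY_80211D_FIRST_CHANNEL", "IE_KEY_80211D_MAX_POWER",
        "IE_KEY_80211D_NUM_CHANNELS")
    )

    ## Stop early if the records disagree, before keys[[1]] is used as names.
    stopifnot(keys_are_uniform(keys_by_record))
    ```

    If the check fails, `rbind()` would still combine the rows, but the column names taken from the first record would no longer describe every row.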
    
  • 2021-01-29 04:13

    New answer after the significant edit to the question.

    I stored OP's XML in a file BUT DUPLICATED THE SINGLE RECORD PROVIDED! I'm letting myself use %>% now. I get 16 elements per record where OP gets 18 because the actual XML posted contains no evidence of HT_CAPS_IE and HT_IE. Given the way we're doing this now, it's more about computation on lists than XML, which seems unavoidable. The link between keys and data is more based on adjacency than structure.
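
    To make the adjacency point concrete, here is a small base R sketch on a toy record. It assumes that a record converted with `as_list()` is a flat list alternating key, value, key, value; the tag names `integer` and `string` are assumptions for illustration, and only the two values are taken from the output further down.

    ```r
    ## Toy stand-in for one record: odd positions hold <key> nodes,
    ## even positions hold the value that belongs to the key before it.
    record <- list(
      key = list("AGE"),      integer = list("0"),
      key = list("SSID_STR"), string  = list("Oliver")
    )

    ## c(TRUE, FALSE) is recycled along the list, so it picks positions 1, 3, ...
    keys <- unlist(record[c(TRUE, FALSE)], use.names = FALSE)

    ## c(FALSE, TRUE) picks positions 2, 4, ...; unlist() collapses each
    ## one-element list to a plain character value.
    values <- lapply(record[c(FALSE, TRUE)], unlist)

    ## Pair each value with its key purely by position.
    paired <- stats::setNames(values, keys)
    str(paired)
    ```

    Nothing in the structure ties a value to its key, so a missing or extra node anywhere in a record would shift every pairing after it.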

    library(magrittr)
    library(xml2)
    
    ## ugly workaround: xml2 does not seem to ignore insignificant whitespace?
    x <- "so.xml" %>%
      scan(what = character(), sep = "\n", strip.white = TRUE) %>%
      paste0(collapse = "") %>% 
      read_xml
    
    ## isolate each record
    (records <- x %>%
      xml_children() %>%
      xml_children())
    #> {xml_nodeset (2)}
    #> [1] <dict>\n  <key>80211D_IE</key>\n  <dict>\n    <key>IE_KEY_80211D_CHA ...
    #> [2] <dict>\n  <key>80211D_IE</key>\n  <dict>\n    <key>IE_KEY_80211D_CHA ...
    
    ## turn each record into a list
    records_list <- records %>% lapply(as_list)
    str(records_list, max.level = 1)
    #> List of 2
    #>  $ :List of 32
    #>  $ :List of 32
    
    ## IRL here's where I check that ...
    ##  we have key, THINGY, key, THINGY, etc. within each record
    ##  we have THINGY1, THINGY2, etc. across all records
    
    ## store item names from record 1
    keys <- records_list[[1]][c(TRUE, FALSE)] %>% unlist
    
    ## isolate the data, do obvious simplifications, apply item names
    jfun <- function(x) if(is.list(x) && length(x) > 1) x else unlist(x)
    z <- records_list %>%
      lapply(`[`, c(FALSE, TRUE)) %>% 
      lapply(`names<-`, keys) %>% 
      lapply(lapply, jfun)
    
    ## done!
    str(z[[1]], max.level = 1)
    #> List of 16
    #>  $ 80211D_IE    :List of 4
    #>  $ AGE          : chr "0"
    #>  $ AP_MODE      : chr "2"
    #>  $ BEACON_INT   : chr "100"
    #>  $ BSSID        : chr "ac:5d:10:73:c3:11"
    #>  $ CAPABILITIES : chr "1073"
    #>  $ CHANNEL      : chr "2"
    #>  $ CHANNEL_FLAGS: chr "10"
    #>  $ IE           : chr "AAZPbGl2ZXIBCIKEiwwSlhgkAwECBwZVUyABCxswGAEAAA+sAgIAAA+sBAAPrAIBAAAPrAIAAN0aAFDyAQEAAFDyAgIAAFDyBABQ8gIBAABQ8gIqAQAyBDBIYGw="
    #>  $ NOISE        : chr "0"
    #>  $ RATES        :List of 12
    #>  $ RSN_IE       :List of 8
    #>  $ RSSI         : chr "-74"
    #>  $ SSID         : chr "T2xpdmVy"
    #>  $ SSID_STR     : chr "Oliver"
    #>  $ WPA_IE       :List of 8
    