How to scrape this Squawka page?

执笔经年 2021-01-02 16:36

I am trying to extract the following information:

On the page

http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matche

1 answer
  • 2021-01-02 17:17

    Peter, as the others indicated, you can do this with Selenium. I also like to use the excellent selectr package. The idea is to interact with the site only briefly, pull out the page's data, and do the rest in R. squawkData should then contain everything needed.

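    # RSelenium::startServer() # if needed
    library(RSelenium)
    library(XML)      # provides xmlParse(), xmlValue(), xmlAttrs()
    library(selectr)  # provides querySelectorAll()
    remDr <- remoteDriver()
    remDr$open()
    remDr$setImplicitWaitTimeout(3000)  # milliseconds
    remDr$navigate("http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches")
    # The page holds its match data in the JavaScript object squawkaDp;
    # serialize its XML document to a string and hand it back to R.
    squawkData <- remDr$executeScript("return new XMLSerializer().serializeToString(squawkaDp.xml);", list())
    # Select every time_slice node under crosses with a CSS selector.
    example <- querySelectorAll(xmlParse(squawkData[[1]]), "crosses time_slice")
    example[[1]]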
    
    
    <time_slice name="0 - 5" id="1">
      <event player_id="531" mins="4" secs="39" minsec="279" team="44" type="Failed">
        <start>73.1,87.1</start>
        <end>97.9,49.1</end>
      </event>
    </time_slice> 
    

    Disclaimer: I am the author of the RSelenium package. Basic vignettes on its operation are available: "RSelenium basics" and "RSelenium: Testing Shiny apps".

    Further info can be accessed easily using selectr:

    > xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "players #531 name")[[1]])
    [1] "Charlie Adam"
    
    > xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "game team#44 long_name")[[1]])
    [1] "Stoke City"
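
    For readers without a live browser session, here is a minimal sketch of the same kind of lookup on a hand-written fragment. The element layout (a squawka root, player elements carrying an id attribute) and the helper name player_name are assumptions for illustration, not the site's confirmed schema; only the selector form and the player name come from the output above.

    ```r
    library(XML)
    library(selectr)

    # Hypothetical fragment: the real squawkaDp.xml layout may differ.
    fragment <- '<squawka>
      <players>
        <player id="531"><name>Charlie Adam</name></player>
      </players>
    </squawka>'
    doc <- xmlParse(fragment, asText = TRUE)

    # Hypothetical helper: look up a player's name by id,
    # using the same selector form as above.
    player_name <- function(doc, id) {
      hits <- querySelectorAll(doc, sprintf("players #%s name", id))
      if (length(hits) == 0) return(NA_character_)  # no such player
      xmlValue(hits[[1]])
    }

    player_name(doc, "531")
    ```

    Parsing the document once and passing it around avoids calling xmlParse again for every lookup.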
    

    UPDATE:
    To process example into a data frame, you can do something like:

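    out <- lapply(example, function(x) {
      # x is one time_slice node; x['event'] lists its event children
      if (length(x['event']) > 0) {
        res <- lapply(x['event'], function(y) {
          # start from the event's attributes, then add the start/end coordinates
          matchAttrs <- as.list(xmlAttrs(y))
          matchAttrs$start <- xmlValue(y['start']$start)
          matchAttrs$end <- xmlValue(y['end']$end)
          matchAttrs
        })
        # one row per event in this time slice
        return(do.call(rbind.data.frame, res))
      }
      # time slices without events yield NULL and contribute no rows
    })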
    
    > head(do.call(rbind, out))
            player_id mins secs minsec team   type     start       end
    event         531    4   39    279   44 Failed 73.1,87.1 97.9,49.1
    event5        311    6   33    393   31 Failed 92.3,13.1 93.0,31.0
    event1        376    8   57    537   31 Failed  97.7,6.1 96.7,16.4
    event6        311   13   50    830   31 Failed  99.5,0.5 94.9,42.6
    event11       311   14   11    851   31 Failed  99.5,0.5 93.1,51.0
    event7        311   17   41   1061   31 Failed 99.5,99.5 92.6,50.1
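
    The start and end columns above are still "x,y" strings. As a possible next step, the sketch below splits them into numeric columns with base R only. The three rows are copied from the output above so the snippet runs on its own; in practice you would start from do.call(rbind, out). The helper name split_xy and the new column names are my own choices, not part of any package.

    ```r
    # Rows copied from the head() output above; replace with do.call(rbind, out).
    events <- data.frame(
      player_id = c("531", "311", "376"),
      start = c("73.1,87.1", "92.3,13.1", "97.7,6.1"),
      end   = c("97.9,49.1", "93.0,31.0", "96.7,16.4"),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: split "x,y" strings into two numeric columns
    # named <prefix>_x and <prefix>_y.
    split_xy <- function(xy, prefix) {
      parts <- do.call(rbind, strsplit(as.character(xy), ",", fixed = TRUE))
      out <- data.frame(as.numeric(parts[, 1]), as.numeric(parts[, 2]))
      names(out) <- paste0(prefix, c("_x", "_y"))
      out
    }

    events <- cbind(events,
                    split_xy(events$start, "start"),
                    split_xy(events$end, "end"))
    str(events)
    ```

    The as.character() call is there because, depending on the R version, the columns built by rbind.data.frame may be factors rather than character vectors.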
    