Extracting an HTML table from a website in R

闹比i 2020-12-20 04:41

Hi, I am trying to extract the table from the Premier League website.

The package I am using is rvest package and the code I am using in th

2 Answers
  • 2020-12-20 05:18

    Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):

    library(RSelenium)
    library(rvest)
    
    # initialize browser and driver with RSelenium
    # (phantom() is defunct in newer RSelenium releases; this needs an older version)
    ptm <- phantom()
    rd <- remoteDriver(browserName = 'phantomjs')
    rd$open()
    
    # grab source for page
    rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
    html <- rd$getPageSource()[[1]]
    
    # clean up
    rd$close()
    ptm$stop()
    
    # parse with rvest
    df <- html %>% read_html() %>% 
        html_node('#ismr-event-history table.ism-table') %>% 
        html_table() %>% 
        setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
        setNames(gsub('\\s', '_', names(.)))
    
    str(df)
    ## 'data.frame':    20 obs. of  10 variables:
    ##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
    ##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
    ##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
    ##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
    ##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
    ##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
    ##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
    ##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
    ##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
    ##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...
    

    As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can calculate those values yourself.
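
    For readers who would rather not pull in the tidyverse, the cleaning step can be sketched in base R. This is only an illustration: the helper name to_number is hypothetical (not part of any package), and the sample data frame reuses a few values from the str() output above. The pound sign is written as \u00a3 so the snippet does not depend on the file encoding.

    ```r
    # Sample columns mimicking the scraped table (values from the output above);
    # "\u00a3" is the pound sign
    df <- data.frame(
      Gameweek     = c("GW1", "GW2", "GW3", "GW4"),
      Overall_Rank = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
      Value        = c("\u00a3100.0", "\u00a3100.0", "\u00a399.9", "\u00a3100.0"),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: drop everything except digits, dots and minus signs,
    # then convert what is left to numeric
    to_number <- function(x) as.numeric(gsub("[^0-9.-]", "", x))

    # Convert only the columns that hold formatted numbers
    df$Overall_Rank <- to_number(df$Overall_Rank)
    df$Value        <- to_number(df$Value)

    str(df)
    ```

    Unlike the mutate_if() call, this touches only the columns named explicitly, so Gameweek keeps its "GW1"-style labels.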

  • 2020-12-20 05:32

    This solution uses RSelenium along with the XML package. It also assumes that you have a working installation of RSelenium that works with Firefox. Just make sure the directory containing the Firefox starter script is on your PATH.
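
    One way to check this from inside R is sketched below. It assumes the executable is named firefox and that any process started from R inherits the session's PATH; the two candidate directories are the ones mentioned in this answer and may differ on your machine.

    ```r
    # Locate the firefox executable as seen from this R session ("" if not found)
    firefox_path <- Sys.which("firefox")

    if (!nzchar(firefox_path)) {
      # Candidate directories taken from this answer; adjust for your system
      candidates <- c("/Applications/Firefox.app/Contents/MacOS/",
                      "/usr/lib/firefox/")
      found <- candidates[dir.exists(candidates)]
      if (length(found) > 0) {
        # Append the existing directories to PATH for this R session only
        Sys.setenv(PATH = paste(c(Sys.getenv("PATH"), found),
                                collapse = .Platform$path.sep))
        firefox_path <- Sys.which("firefox")
      }
    }

    message("firefox: ",
            if (nzchar(firefox_path)) firefox_path else "<not on PATH>")
    ```

    The change made by Sys.setenv() lasts only for the current session; a permanent fix belongs in your shell profile.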

    If you are using OS X, you will need to add /Applications/Firefox.app/Contents/MacOS/ to your PATH. Or, if you're on an Ubuntu machine, it's likely /usr/lib/firefox/. Once you're sure this is working, you can move on to R with the following:

    # Install RSelenium and XML for R
    #install.packages("RSelenium")
    #install.packages("XML")
    
    # Import packages
    library(RSelenium)
    library(XML)
    
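    # Check and start servers for Selenium
    # (checkForServer() and startServer() are defunct in newer RSelenium releases,
    # where rsDriver() takes over this role; the calls below need an older version)
    checkForServer()
    startServer()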
    
    # Use firefox as a browser and a port that's not used
    remote_driver <- remoteDriver(browserName="firefox", port=4444)
    remote_driver$open(silent = TRUE)
    
    # Use RSelenium to browse the site
    epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
    remote_driver$navigate(epl_link)
    elem <- remote_driver$findElement(using="class", value="ism-table")
    
    # Get the HTML source (getElementAttribute() returns a list, so take its first element)
    elemtxt <- elem$getElementAttribute("outerHTML")[[1]]
    
    # Use the XML package to work with the HTML source
    elem_html <- htmlTreeParse(elemtxt, useInternalNodes = TRUE, asText = TRUE)
    
    # Convert the table into a dataframe
    games_table <- readHTMLTable(elem_html, header = TRUE, stringsAsFactors = FALSE)[[1]]
    
    # Change the column names into something legible
    names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
    names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))
    
    # Convert the fields into numeric values
    games_table <- transform(games_table,
                             GR    = as.numeric(gsub(",", "", GR)),
                             OP    = as.numeric(gsub(",", "", OP)),
                             OR    = as.numeric(gsub(",", "", OR)),
                             Value = as.numeric(gsub("£", "", Value)))
    

    This should yield:

     GW   GP PB      GR TM TC   OP      OR Value CPW
     GW1  34  1 2406373  0  0   34 2406373 100.0
     GW2  47  6 2659789  0  0   81 2448674 100.0
     GW3  53  9  541258  2  0  134 1914025  99.9
     GW4  51  7  905524  0  0  185 1461665 100.0
     GW5  66 14  379438  3  4  247  958889 100.1
     GW6  66  2  303704  2  4  309  510376  99.9
     GW7  65  9  138792  2  4  370  232474  99.8
     GW8  63  3  108363  0  0  433   87967 100.4
     GW9  48  8 1114609  2  0  481   75385 100.9
     GW10 90  2   71210  0  0  571   27716 101.1
     GW11 71  2  421706  3  4  638   16083 100.9
     GW12 35  9 2798661  2  4  669   31820 101.2
     GW13 41  8 2738535  1  0  710   53487 101.1
     GW14 82 15  308725  0  0  792   29436 100.2
     GW15 55  9 1048808  2  4  843   29399 100.6
     GW16 49  8 1801549  0  0  892   35142 100.7
     GW17 48  4 2116706  2  0  940   40857 100.7
     GW18 42  2 3315031  0  0  982   78136 100.8
     GW19 41  9 2600618  0  0 1023   99048 100.6
     GW20 53  0 1644385  0  0 1076  113148 100.8
    

    Please note that the column CPW (change from previous week) is a vector of empty strings.
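
    If you want that column filled in, it could be derived from the ranks. The sketch below assumes that CPW is meant to show the movement of the overall rank (OR) between consecutive weeks, which is a guess about what the arrow images on the page represent; the labels "up", "down" and "same" are made up for illustration.

    ```r
    # Sample rows copied from the table above (OR = overall rank)
    games_table <- data.frame(
      GW = c("GW1", "GW2", "GW3", "GW4"),
      OR = c(2406373, 2448674, 1914025, 1461665),
      stringsAsFactors = FALSE
    )

    # Rank movement versus the previous week; the first week has no predecessor
    rank_diff <- c(NA, diff(games_table$OR))

    # A lower rank number is better, so a negative difference is an improvement
    games_table$CPW <- ifelse(is.na(rank_diff), NA_character_,
                       ifelse(rank_diff < 0, "up",
                       ifelse(rank_diff > 0, "down", "same")))

    print(games_table)
    ```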

    I hope this helps.
