Importing Wikipedia tables in R

别跟我提以往 2021-01-01 15:31

I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, as it treats the whole page as a table. In Google Spreadsheet, I can ent

4 Answers
  • 2021-01-01 16:22

    Building on Andrie's answer and addressing SSL: if you can take one additional library dependency, fetch the page with httr and pass the downloaded text to readHTMLTable:

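    library(httr)
    library(XML)
    
    url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"
    
    # Fetch the page over HTTPS with httr instead of passing the URL to readHTMLTable
    r <- GET(url)
    
    # Parse the tables from the downloaded HTML text
    doc <- readHTMLTable(
      doc=content(r, "text"))
    
    # A one-element list holding the sixth table; doc[[6]] gives the data frame itself
    doc[6]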
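
    Note that doc[6] and doc[[6]] are not the same thing. A minimal sketch of the difference in base R, using a small made-up list (the names `tables`, `a` and `b` are placeholders, not anything returned by Wikipedia):

    ```r
    # Hypothetical stand-in for the list of tables returned by readHTMLTable()
    tables <- list(a = data.frame(x = 1:2), b = data.frame(y = 3:4))

    # Single brackets keep the list wrapper: the result is a list of length 1
    as_list <- tables[2]

    # Double brackets extract the element itself: the result is a data frame
    as_frame <- tables[[2]]

    print(class(as_list))
    print(class(as_frame))
    ```

    If you want to work with the table directly (for example to access its columns), the double-bracket form is usually the one you need.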
    
  • 2021-01-01 16:27

    Here is a solution that works with the secure (https) link:

    install.packages("htmltab")
    library(htmltab)
    htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)
    
  • 2021-01-01 16:34

    The function readHTMLTable in package XML is ideal for this.

    Try the following:

    library(XML)
    doc <- readHTMLTable(
             doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
    
    doc[[6]]
    
                V1         V2                 V3                              V4
    1       County Population Land Area (sq mi) Population Density (per sq mi)
    2        Alger      9,862                918                            10.7
    3       Baraga      8,735                904                             9.7
    4     Chippewa     38,413               1561                            24.7
    5        Delta     38,520               1170                            32.9
    6    Dickinson     27,427                766                            35.8
    7      Gogebic     17,370               1102                            15.8
    8     Houghton     36,016               1012                            35.6
    9         Iron     13,138               1166                            11.3
    10    Keweenaw      2,301                541                             4.3
    11        Luce      7,024                903                             7.8
    12    Mackinac     11,943               1022                            11.7
    13   Marquette     64,634               1821                            35.5
    14   Menominee     25,109               1043                            24.3
    15   Ontonagon      7,818               1312                             6.0
    16 Schoolcraft      8,903               1178                             7.6
    17       TOTAL    317,258             16,420                            19.3
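
    In the output above the real column names ended up in row 1 and the numbers are text with thousands separators. One possible clean-up is sketched below. It rebuilds three rows of the output by hand so that it runs without a network connection; `raw`, `tidy` and `to_number` are names made up for this illustration.

    ```r
    # Three rows copied from the output above, plus the header row (hypothetical stand-in for doc[[6]])
    raw <- data.frame(
      V1 = c("County", "Alger", "Baraga", "TOTAL"),
      V2 = c("Population", "9,862", "8,735", "317,258"),
      V3 = c("Land Area (sq mi)", "918", "904", "16,420"),
      V4 = c("Population Density (per sq mi)", "10.7", "9.7", "19.3"),
      stringsAsFactors = FALSE
    )

    # Use the first row as column names, then drop it from the data
    tidy <- raw[-1, ]
    names(tidy) <- vapply(raw[1, ], as.character, character(1))
    rownames(tidy) <- NULL

    # Strip thousands separators and convert every column except County to numeric
    to_number <- function(x) as.numeric(gsub(",", "", as.character(x), fixed = TRUE))
    tidy[-1] <- lapply(tidy[-1], to_number)

    str(tidy)
    ```

    readHTMLTable also accepts header and colClasses arguments, which may let you do part of this at read time, depending on how the page marks up its tables.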
    

    readHTMLTable returns a list containing one data.frame per table element found in the HTML page. You can use names to see how each element is labelled:

    > names(doc)
     [1] "NULL"                                                                               
     [2] "toc"                                                                                
     [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
     [4] "NULL"                                                                               
     [5] "Cities and Villages of the Upper Peninsula"                                         
     [6] "Upper Peninsula Land Area and Population Density by County"                         
     [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
     [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
     [9] "NULL"                                                                               
    [10] "NULL"                                                                               
    [11] "NULL"                                                                               
    [12] "NULL"                                                                               
    [13] "NULL"                                                                               
    [14] "NULL"                                                                               
    [15] "NULL"                                                                               
    [16] "NULL" 
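
    Because the position of a table can change when the page is edited, selecting by name may be more robust than selecting by index. A small sketch, using a made-up list in place of the real result (`tables`, `idx` and `density` are illustrative names, and the data frames are placeholders):

    ```r
    # Hypothetical stand-in for the list returned by readHTMLTable()
    tables <- list(
      data.frame(x = 1),
      data.frame(County = c("Alger", "Baraga")),
      data.frame(y = 2)
    )
    names(tables) <- c(
      "NULL",
      "Upper Peninsula Land Area and Population Density by County",
      "NULL"
    )

    # Find the position of the table whose name contains a known phrase
    idx <- grep("Land Area", names(tables), fixed = TRUE)

    # Extract that table as a data frame (assumes exactly one match)
    density <- tables[[idx]]

    # Drop the unnamed entries if only the labelled tables are of interest
    labelled <- tables[names(tables) != "NULL"]
    print(names(labelled))
    ```

    With the real page you would run the same grep call on names(doc).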
    
  • 2021-01-01 16:38

    One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you:

    http://www.omegahat.org/RGoogleDocs/run.html

    You can then use the =ImportHtml Google Docs function with all its pre-built magic.
