I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table
The output of my R code should be a table with the following columns: the element's short name (symbol), its long name, and the URL of the element's wiki page.
Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!
But alas, this is not what you want:
library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
# readHTMLTable() parses every <table> on the page into a list of data.frames
tables = readHTMLTable(url)
# ... look through the list to find the one you want...
table = tables[3]
table
$`NULL`
Group # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 Period <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 1 1H 2He <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 2 3Li 4Be 5B 6C 7N 8O 9F 10Ne <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 3 11Na 12Mg 13Al 14Si 15P 16S 17Cl 18Ar <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 4 19K 20Ca 21Sc 22Ti 23V 24Cr 25Mn 26Fe 27Co 28Ni 29Cu 30Zn 31Ga 32Ge 33As 34Se 35Br 36Kr
6 5 37Rb 38Sr 39Y 40Zr 41Nb 42Mo 43Tc 44Ru 45Rh 46Pd 47Ag 48Cd 49In 50Sn 51Sb 52Te 53I 54Xe
7 6 55Cs 56Ba * 72Hf 73Ta 74W 75Re 76Os 77Ir 78Pt 79Au 80Hg 81Tl 82Pb 83Bi 84Po 85At 86Rn
8 7 87Fr 88Ra ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 * Lanthanoids 57La 58Ce 59Pr 60Nd 61Pm 62Sm 63Eu 64Gd 65Tb 66Dy 67Ho 68Er 69Tm 70Yb 71Lu <NA> <NA>
11 ** Actinoids 89Ac 90Th 91Pa 92U 93Np 94Pu 95Am 96Cm 97Bk 98Cf 99Es 100Fm 101Md 102No 103Lr <NA> <NA>
The names are gone and the atomic number runs into the symbol.
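You could in principle salvage those cells by splitting the leading atomic number off the symbol. A minimal base R sketch, using sample values from the garbled output above (it still doesn't recover the full element names):
# split strings like "1H" into atomic number and symbol
cells <- c('1H', '2He', '36Kr')  # sample values from the table above
number <- as.integer(sub('^([0-9]+).*$', '\\1', cells))
symbol <- sub('^[0-9]+', '', cells)
data.frame(number, symbol)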
So back to the drawing board...
My DOM-walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, keeps only those with a "title" attribute (that's where the full element name lives; the link text is the symbol), and sticks what you want in a data.frame. It also picks up every other such link on the page, but we're lucky: the elements are the first 118 such links:
library(XML)
library(plyr)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
# don't forget to parse the HTML, doh!
doc = htmlParse(url)
# get every link in a table cell:
links = getNodeSet(doc, '//table/tr/td/a')
# make a data.frame for each node that has non-blank text, an href, and a 'title' attribute:
df = ldply(links, function(x) {
  symbol = xmlValue(x)            # the link text is the element's symbol
  if (symbol == '') symbol = NULL
  name = xmlGetAttr(x, 'title')   # the 'title' attribute holds the full name
  link = xmlGetAttr(x, 'href')
  if (!is.null(symbol) && !is.null(name) && !is.null(link))
    data.frame(name, symbol, link)
})
# only keep the actual elements -- we're lucky they're first!
df = head(df, 118)
head(df)
name symbol link
1 Hydrogen H /wiki/Hydrogen
2 Helium He /wiki/Helium
3 Lithium Li /wiki/Lithium
4 Beryllium Be /wiki/Beryllium
5 Boron B /wiki/Boron
6 Carbon C /wiki/Carbon
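If you want absolute URLs rather than the relative /wiki/... paths, one more line cleans that up (a small addition, assuming the df built above):
df$link = sub('^', 'http://en.wikipedia.org', df$link)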
Try this:
library(XML)
URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)
# extract attributes and value of all 'a' tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
m1 <- xpathApply(root, "//table[3]//a", f)
m2 <- suppressWarnings(do.call(rbind, m1))
# extract rows that correspond to chemical symbols
ix <- grep("^[[:upper:]][[:lower:]]{0,2}", m2[, "class"])
m3 <- m2[ix, 1:3]
colnames(m3) <- c("URL", "Name", "Symbol")
m3[,1] <- sub("^", "http://en.wikipedia.org", m3[,1])
m3[,2] <- sub(" .*", "", m3[,2])
A bit of the output:
> dim(m3)
[1] 118 3
> head(m3)
URL Name Symbol
[1,] "http://en.wikipedia.org/wiki/Hydrogen" "Hydrogen" "H"
[2,] "http://en.wikipedia.org/wiki/Helium" "Helium" "He"
[3,] "http://en.wikipedia.org/wiki/Lithium" "Lithium" "Li"
[4,] "http://en.wikipedia.org/wiki/Beryllium" "Beryllium" "Be"
[5,] "http://en.wikipedia.org/wiki/Boron" "Boron" "B"
[6,] "http://en.wikipedia.org/wiki/Carbon" "Carbon" "C"
We can make this more compact by enhancing the XPath expression: start with Jeffrey's expression (since it nearly gets the elements at the top) and add a qualifier that excludes the empty links, which pins it down exactly. In that case xpathSApply can be used to eliminate the need for do.call or the plyr package. The last bit, where we fix up odds and ends, is the same as before. This produces a matrix rather than a data frame, which seems preferable since the content is entirely character.
library(XML)
URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)
# extract attributes and value of all non-empty 'a' tags in the 3rd table's cells
f <- function(x) c(xmlAttrs(x), xmlValue(x))
M <- t(xpathSApply(root, "//table[3]/tr/td/a[.!='']", f))[1:118,]
# nicer column names, fix up URLs, fix up Mercury (its title is "Mercury (element)")
colnames(M) <- c("URL", "Name", "Symbol")
M[,1] <- sub("^", "http://en.wikipedia.org", M[,1])
M[,2] <- sub(" .*", "", M[,2])
View(M)
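If you do want a data frame at the end after all, the matrix converts in one line:
df <- as.data.frame(M, stringsAsFactors = FALSE)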
Do you have to scrape Wikipedia? You can run this SPARQL query against Wikidata instead (results):
SELECT
?elementLabel
?symbol
?article
WHERE
{
?element wdt:P31 wd:Q11344;
wdt:P1086 ?n;
wdt:P246 ?symbol.
OPTIONAL {
?article schema:about ?element;
schema:inLanguage "en";
schema:isPartOf <https://en.wikipedia.org/>.
}
FILTER (?n >= 1 && ?n <= 118).
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?n
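If you want the results in R rather than in the browser, one option is to send the query to the Wikidata endpoint yourself. A minimal sketch, assuming the httr and jsonlite packages (WikidataQueryServiceR::query_wikidata() is a ready-made alternative):
library(httr)
library(jsonlite)
endpoint <- "https://query.wikidata.org/sparql"
query <- 'SELECT ?elementLabel ?symbol ?article WHERE {
  ?element wdt:P31 wd:Q11344; wdt:P1086 ?n; wdt:P246 ?symbol.
  OPTIONAL { ?article schema:about ?element;
             schema:inLanguage "en";
             schema:isPartOf <https://en.wikipedia.org/>. }
  FILTER (?n >= 1 && ?n <= 118).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} ORDER BY ?n'
# ask the endpoint for SPARQL JSON results and flatten the bindings
res <- GET(endpoint, query = list(query = query, format = "json"))
b <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$results$bindings
elements <- data.frame(name   = b$elementLabel$value,
                       symbol = b$symbol$value,
                       url    = b$article$value)
head(elements)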
Sorry if this doesn't answer your question directly, but it should help people looking to get the same information in a cleaner manner.