Question
I use rvest to scrape this website. It contains data in the following form (simplified):
<div class="editor-type">Editors</div>
<div class="editor">
<div class="editor-name"><h3>Otto Heath</h3></div>
<span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor">
<div class="editor-name"><h3>Kathrin Smets</h3></div>
<span class="editor-affiliation">Royal Holloway University of London</span>
</div>
<div class="editor-type">Associate Editor</div>
<div class="editor">
<div class="editor-name"><h3>Rosa Dassonville</h3></div>
<span class="editor-affiliation">University of Montreal</span>
</div>
<div class="editor">
<div class="editor-name"><h3>Matthias Wagner</h3></div>
<span class="editor-affiliation">University of Wagner</span>
</div>
<div class="editor-type">Editorial Assistant</div>
<div class="editor">
<div class="editor-name"><h3>Markus Polacko</h3></div>
<span class="editor-affiliation">Royal Holloway University of London</span>
</div>
I can easily scrape editor-type and editor-name into separate lists, e.g. like this:
library("rvest")
webpage <- read_html(url("https://www.journals.elsevier.com/electoral-studies/editorial-board"))
editorial_types <- webpage %>%
html_nodes(xpath = "//div[@class='editor-type']")
editor_names <- webpage %>%
html_nodes(xpath = "//div[@class='editor']/descendant::div[@class='editor-name']")
However, I want to combine them into a single list. It should contain one element per editor-type (Editors, Associate Editor, etc.), each holding the respective editor-name values as sub-elements, perhaps like this:
list_of_editors
[[1]] Editors
[1] Otto Heath
[2] Kathrin Smets
[[2]] Associate Editor
[1] Rosa Dassonville
[2] Matthias Wagner
[[3]] Editorial Assistant
[1] Markus Polacko
How can I achieve that?
Answer 1:
This was a bit tricky because the page is a flat list of titles and names rather than a hierarchical one. The strategy is to collect all of the child nodes, pick out the ones containing a title, and then extract the names from the nodes that lie between consecutive title nodes.
library(rvest)
library(dplyr)
# Read the document
webpage <- read_html("https://www.journals.elsevier.com/electoral-studies/editorial-board")
# Find the parent node
pubeditors <- webpage %>% html_nodes("div.publication-editors")
# Get its child nodes (titles and editors, in document order)
editorsnodes <- html_children(pubeditors)
# Positions of the nodes that hold a position title
titlesnodesnum <- which(html_attr(editorsnodes, "class") == "publication-editor-type")
# Create the vector of titles
titles <- editorsnodes[titlesnodesnum] %>% html_text() %>% trimws()
# Append a sentinel one past the last node so the final group has an end
titlesnodesnum <- c(titlesnodesnum, length(editorsnodes) + 1)
# Extract the names between consecutive title nodes
answer <- lapply(2:length(titlesnodesnum), function(n) {
  start <- titlesnodesnum[n - 1] + 1  # first node in the subcategory
  end <- titlesnodesnum[n] - 1        # last node in the subcategory
  editorsnodes[start:end] %>% html_nodes("div.publication-editor-name") %>% html_text() %>% trimws()
})
# Name the list elements by title
names(answer) <- titles
answer
$Editors
[1] "Oliver Heath" "Kaat Smets"
$`Associate Editor`
[1] "Ruth Dassonneville" "Markus Wagner"
$`Editorial Assistant`
[1] "Matt Polacko"
$`Editorial Board`
[1] "Eva Anduiza" "Paolo Bellucci" "Amanda Bittner" "Andre Blais" "Damien Bol"
[6] "Shaun Bowler" "Barry Burden" "David Butler" "Rosie Campbell" "Miguel Carreras"
[11] "Harold D Clarke" "Brian Crisp" "Ruth Dassonneville" "Martin Elff" "Geoffrey Evans"
[16] "Steve Fisher" "Rob Ford" "Aina Gallego" "Thomas Gschwend" "Carolien van Ham"
[21] "Chris Hanretty" "Elina Kestilä-Kekkonen" "Ann-Kristin Kölln" "Mona Krewel" "Matthew Lebo"
[26] "Michael Lewis-Beck" "Ian McAllister" "Caitlin Milazzo" "Andreas Murr" "Anja Neundorf"
[31] "Sergi Pardos" "Charles Pattie" "Mikael Persson" "Stephanie Reher" "Jason Reifler"
[36] "Robert Rohrschneider" "Eline de Rooij" "Jan Rovny" "Shane Singh" "Mary Stegmaier"
[41] "Laura Stephenson" "Rune Stubager" "Nick Vivyan" "Herbert Weisberg" "Christopher Wlezien"
[46] "Georgios Xezonakis" "Elizabeth Zechmeister" "Adam Ziegfeld"
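
For anyone who wants to try the grouping idea without hitting the live site, here is a minimal sketch that runs on the simplified markup from the question. It is a variation on the approach above, not the same code: instead of computing start and end positions, it uses cumsum() over a "is this a title?" flag to give every node a group number and then split() to build the list. The wrapper div, the variable names, and the class names (editor-type, editor, editor-name) are assumptions taken from the simplified example; the real page uses different class names, as the code above shows.

```r
library(rvest)

# Simplified markup from the question, wrapped in a hypothetical parent div
html <- '<div class="editors">
<div class="editor-type">Editors</div>
<div class="editor"><div class="editor-name"><h3>Otto Heath</h3></div></div>
<div class="editor"><div class="editor-name"><h3>Kathrin Smets</h3></div></div>
<div class="editor-type">Associate Editor</div>
<div class="editor"><div class="editor-name"><h3>Rosa Dassonville</h3></div></div>
<div class="editor"><div class="editor-name"><h3>Matthias Wagner</h3></div></div>
<div class="editor-type">Editorial Assistant</div>
<div class="editor"><div class="editor-name"><h3>Markus Polacko</h3></div></div>
</div>'

page <- read_html(html)

# Title nodes and editor nodes together, in document order
nodes <- html_nodes(page, "div.editor-type, div.editor")

# TRUE for title nodes, FALSE for editor nodes
is_title <- html_attr(nodes, "class") == "editor-type"

# Running count of titles seen so far: each editor gets its title's index
group <- cumsum(is_title)

titles <- trimws(html_text(nodes[is_title]))
editor_names <- nodes[!is_title] %>%
  html_node("div.editor-name") %>%
  html_text() %>%
  trimws()

# Fixing the factor levels keeps titles without any editors as empty elements
list_of_editors <- split(
  editor_names,
  factor(group[!is_title], levels = seq_along(titles))
)
names(list_of_editors) <- titles

list_of_editors
```

Because the factor levels are fixed up front, a title that is immediately followed by another title ends up as an empty character vector rather than causing an indexing problem.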
Source: https://stackoverflow.com/questions/64188257/r-webscraping-various-div-classes-into-lists-with-sub-elements