It is difficult to formulate the question, but with an example, it is simple to understand.
I use R to parse HTML code.
In the following, I have a html code call
It's easier to select the enclosing tag (the div here) for each record, then look for each tag inside it. With rvest and purrr, which I find simpler:
library(rvest)
library(purrr)
html %>% read_html() %>%
html_nodes('.line') %>%
map_df(~list(number = .x %>% html_node('.number') %>% html_text(),
surface = .x %>% html_node('.surface') %>% html_text()))
#> # A tibble: 2 × 2
#> number surface
#> <chr> <chr>
#> 1 Number 1 Surface 1
#> 2 <NA> Surface 2
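If you'd rather not depend on purrr, here is a minimal sketch of the same idea using rvest alone. It relies on html_node() returning exactly one result per input node, so a missing span comes back as NA instead of being dropped. The names lines and result are only illustrative, and the html string repeats the sample from the Data section so the snippet runs on its own.

```r
library(rvest)

# Same sample markup as in the Data section (repeated so this runs standalone)
html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'

# One node per enclosing div
lines <- html_nodes(read_html(html), ".line")

# html_node() on a node set yields one match per div (missing when absent),
# and html_text() turns a missing match into NA
result <- data.frame(
  number  = html_text(html_node(lines, ".number")),
  surface = html_text(html_node(lines, ".surface")),
  stringsAsFactors = FALSE
)

print(result)
```

The key point is to call html_node() (singular) on the set of divs rather than html_nodes() on the whole document, since the plural form silently skips divs that lack the span and the columns then no longer line up.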
library( 'XML' ) # load library
doc = htmlParse( html ) # parse html
# define xpath expression. div contains class = line, within which span has classes number and surface
xpexpr <- '//div[ @class = "line" ]'
a1 <- lapply( getNodeSet( doc, xpexpr ), function( x ) { # loop through nodeset
y <- xmlSApply( x, xmlValue, trim = TRUE ) # get xmlvalue
names(y) <- xmlApply( x, xmlAttrs ) # get xmlattributes and assign it as names to y
y # return y
} )
Loop through a1, extract the values of number and surface, and set the names accordingly. Then column-bind the number and surface values.
nm <- c( 'number', 'surface' )
do.call( 'cbind', lapply( a1, function( x ) setNames( x[ nm ], nm ) ) )
# [,1] [,2]
# number "Number 1" NA
# surface "Surface 1" "Surface 2"
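The matrix above has one column per div. If you want one row per div instead (the same shape as a data frame of records), transposing it is enough. A small sketch in base R follows. The matrix m is rebuilt by hand from the printed output so the snippet runs without the XML package, and the names m and df are only illustrative.

```r
# Rebuild the matrix shown above: rows are fields, columns are divs
m <- cbind(
  c(number = "Number 1", surface = "Surface 1"),
  c(number = NA,         surface = "Surface 2")
)

# Transpose so each div becomes a row, then convert to a data frame
df <- as.data.frame(t(m), stringsAsFactors = FALSE)

print(df)
```

With the real result you would pass the do.call( 'cbind', ... ) expression to t() directly instead of rebuilding m.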
Data:
html <- '<div class="line">
<span class="number">Number 1</span>
<span class="surface">Surface 1</span>
</div>
<div class="line">
<span class="surface">Surface 2</span>
</div>'