Extract links from html table

I'm trying to extract the links from the following webpage, http://ipt.humboldt.org.co/, that are of type "Specimen". I can get the table from the webpage using the followi

2 Answers

Happy的楠姐

2020-12-16 22:32

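```
library(XML)
# Cell handler for readHTMLTable: keep the link attributes when a cell has an <a> child
xmlFun <- function(x) {
  y <- xpathSApply(x, './a', xmlAttrs)
  if (length(y) > 0) {
    list(href = y, orig = xmlValue(x))
  } else {
    xmlValue(x)
  }
}
# tableNodes is not defined here; presumably it is the node set built in the question
ans <- readHTMLTable(tableNodes[[1]], elFun = xmlFun, stringsAsFactors = FALSE)
# Parse each Name entry back into R so that href and orig become usable values
ans$Name <- lapply(ans$Name, function(x) { unlist(eval(parse(text = x))) })
# Name entries of the rows whose Subtype is Specimen
ans$Name[ans$Subtype == 'Specimen']
```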
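
To see what the element function returns for each cell, here is a minimal sketch that calls it on two cells directly. The inline HTML and the example.org URL are made up for illustration and are not taken from the real page.

```r
library(XML)

# Same element function as in the answer above
xmlFun <- function(x) {
  y <- xpathSApply(x, "./a", xmlAttrs)
  if (length(y) > 0) {
    list(href = y, orig = xmlValue(x))
  } else {
    xmlValue(x)
  }
}

# Hypothetical one-row table: the first cell holds a link, the second is plain text
html <- '<table><tr><td><a href="http://example.org/a">Resource A</a></td><td>Specimen</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)
cells <- getNodeSet(doc, "//td")

with_link <- xmlFun(cells[[1]])     # a list holding the link attributes and the cell text
without_link <- xmlFun(cells[[2]])  # only the cell text

str(with_link)
print(without_link)
```

A cell with a link therefore carries both pieces of information, which is why the Name column needs a second pass before the href values can be used.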

一生所求

2020-12-16 22:40
It ended up being an intricate XPath expression:
```
library(XML)
sitePage<-htmlParse("http://ipt.humboldt.org.co/")
hyperlinksYouNeed<-getNodeSet(sitePage,"//table[@id='resourcestable']
                                        //td[5][.='Specimen']
                                        /preceding-sibling
                                        ::td[3]
                                        /a
                                        /@href")
```
Let me explain the XPath expression piece by piece:
- //table[@id='resourcestable'] -> Selects the main table on the page, the one whose id is 'resourcestable'.
- //td[5][.='Specimen'] -> Keeps only the fifth-column cells whose text is exactly 'Specimen', i.e. the rows whose Subtype is Specimen.
- /preceding-sibling -> Switches to the preceding-sibling axis, so from that cell we now look backwards along the row.
- ::td[3] -> Takes the third td counting backwards from where we are. Be careful: preceding-sibling counts in reverse, so td[1] is the Type column, td[2] is the Organisation column and td[3] is the Name column we want.
- /a -> Gets the a node inside that cell.
- /@href -> Finally, takes the value of its href attribute.
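
The reverse counting of preceding-sibling is easy to check on a small table. The sketch below assumes a column order of logo, Name, Organisation, Type, Subtype, which is inferred from the explanation above rather than confirmed against the live page; the HTML and URLs are invented for the example.

```r
library(XML)

# Hypothetical two-row stand-in for the resources table (assumed column order:
# logo, Name, Organisation, Type, Subtype)
html <- '<table id="resourcestable">
<tr><td>logo</td><td><a href="http://example.org/a">Resource A</a></td><td>Org 1</td><td>Occurrence</td><td>Specimen</td></tr>
<tr><td>logo</td><td><a href="http://example.org/b">Resource B</a></td><td>Org 2</td><td>Occurrence</td><td>Observation</td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# Cells in the fifth column whose text is exactly 'Specimen'
base <- "//table[@id='resourcestable']//td[5][.='Specimen']"

# One step back is the column just before Subtype; three steps back is Name
back1 <- xpathSApply(doc, paste0(base, "/preceding-sibling::td[1]"), xmlValue)
back3 <- xpathSApply(doc, paste0(base, "/preceding-sibling::td[3]"), xmlValue)

# The href of the link inside the Name cell, as a plain character vector
links <- unname(as.character(
  xpathSApply(doc, paste0(base, "/preceding-sibling::td[3]/a/@href"))
))

print(back1)
print(back3)
print(links)
```

Only the row whose fifth cell is Specimen contributes a link, and the position inside preceding-sibling is counted from the matched cell, not from the start of the row.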