R: posting search forms and scraping results

Submitted 2019-12-04 20:38:11

First, I'd recommend you use httr instead of RCurl - for most problems it's much easier to use.

library(httr)

r <- GET("http://www.washingtonpost.com/newssearch/search.html", 
  query = list(
    st = "Dilma Rousseff"
  )
)
stop_for_status(r)  # raise an error for any 4xx/5xx response
content(r)

Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:

r <- GET("http://www.washingtonpost.com/newssearch/search.html", 
  query = list(
    st = "Dilma Rousseff",
    startat = 10
  )
)
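Putting the two together, you can walk through the result pages by incrementing startat. This is a sketch: the step size of 10 and the page count are assumptions inferred from the pagination behaviour described above.

```r
library(httr)

# Fetch the first three result pages; startat appears to advance in steps of 10
pages <- lapply(seq(0, 20, by = 10), function(offset) {
  r <- GET("http://www.washingtonpost.com/newssearch/search.html",
    query = list(
      st = "Dilma Rousseff",
      startat = offset
    )
  )
  stop_for_status(r)  # fail loudly on HTTP errors
  content(r)          # parsed page, ready for link extraction
})
```

Be polite when looping like this: add a `Sys.sleep()` between requests if you plan to fetch many pages.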

Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:

# devtools::install_github("hadley/rvest")
library(rvest)

page <- html(r)                             # parse the response into an HTML document
links <- page[sel(".pb-feed-headline a")]   # select headline links with a CSS selector
links["href"]                               # the link URLs
html_text(links)                            # the link text

I highly recommend reading the SelectorGadget tutorial and using it to figure out which CSS selectors you need.
