scrape | 易学教程

Beautifulsoup unable to extract data using attrs=class

Question: I am extracting data for a research project and have successfully used findAll('div', attrs={'class':'someClassName'}) on many websites, but this particular website (WebSite Link) doesn't return any values when I use the attrs option. When I don't use the attrs option, I get the entire HTML DOM. Here is the simple code that I started with to test it out:

    soup = bs(urlopen(url))
    for div in soup.findAll('div', attrs={'class':'data'}):
        print div

Answer 1: My code is working fine, with requests import
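One way to narrow down a problem like this is to check whether the class name appears at all in the HTML the server actually returns. The following is a minimal sketch using only the Python standard library, not the approach from the truncated answer. `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page source, and `ClassCounter` and `count_divs_with_class` are names invented for this illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <div> start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_divs_with_class(html_text, class_name):
    """Return how many <div> elements in html_text carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    return parser.count


# Hypothetical sample standing in for the page source returned by the server.
SAMPLE_HTML = '<div class="data row">a</div><div class="other">b</div><div>c</div>'
print(count_divs_with_class(SAMPLE_HTML, "data"))
```

If the count is zero for the real page source, the elements are probably added by JavaScript after loading, or the server returns different markup to scripts than to browsers. Both are assumptions to verify against the actual response.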

Scraping XML with JSoup

Question: I'm trying to scrape an RSS feed located here. At the moment I'm just trying to wrap my head around JSoup, so the following code is merely a proof of concept (or an attempt at one, at least).

    public static void grabShakers(String url) throws IOException {
        doc = Jsoup.connect(url).get();
        desc = doc.select("title");
        links = doc.select("link");
        price = doc.select("span.price");
    }

It grabs the title of each item perfectly. The output of each link is simply ten repeated closing link tags and it never
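A plausible cause, not confirmed by the truncated excerpt, is that the feed is being parsed as HTML. In HTML, link is a void element with no text content, whereas in RSS it wraps a URL. Parsing the feed as XML avoids the mismatch. The sketch below shows the idea in Python with the standard library. `SAMPLE_RSS` is a made-up feed and `parse_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First item</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the text content of the first matching child.
        items.append((item.findtext("title"), item.findtext("link")))
    return items


for title, link in parse_items(SAMPLE_RSS):
    print(title, link)
```

Because an XML parser treats `<link>` like any other element, its URL text is preserved instead of being dropped.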

WebScraping dynamic pages in R

Question: I will change the website to make this question better. I am still facing similar issues: I can't get by with the rvest package alone, and an answer may be easier to obtain with RSelenium. Website: http://ravimaailma.fi/cg/tulokset/20/ and I want to obtain the links from the main article that lead to individual race results. The links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/ I'm trying to use plain rvest, as I thought that would be all that was needed here
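If the result links are present in the static HTML, they can be collected by filtering anchors on the URL pattern quoted in the question. This is a minimal Python sketch under that assumption. `SAMPLE_HTML` is invented markup, and `LinkCollector` and `result_links` are hypothetical names. If the links are injected by JavaScript, a browser-driving tool such as the RSelenium package mentioned in the question would be needed instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def result_links(html_text, base_url, marker="/article/tulokset/"):
    """Return only the links whose URL contains the result-page marker."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return [link for link in collector.links if marker in link]


# Hypothetical markup; the real page structure is not shown in the question.
SAMPLE_HTML = """
<a href="/article/tulokset/pori-18-11-2017-tulokset/8718/">Pori</a>
<a href="/cg/tulokset/20/">Next page</a>
"""
print(result_links(SAMPLE_HTML, "http://ravimaailma.fi/cg/tulokset/20/"))
```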

How to scrape page requiring cookies and javascript in PHP

Question: Is there an easy way to emulate cookies and JavaScript in a PHP script that scrapes a web page requiring them? The current response shows:

    <body><noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript></body>

I put this in the code and it made no difference:

    $strCookie = 'PHPSESSID=' . $_COOKIE['PHPSESSID'] . '; path=/';
    curl_setopt( $ch, CURLOPT_COOKIE, $strCookie );

Answer 1: HTML inside the <noscript> </noscript> will
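For the cookie half of the problem, the usual pattern is to let the HTTP client store the cookies the target site sets and send them back on later requests, rather than forwarding the scraper's own visitor cookies. The sketch below shows that pattern in Python with the standard library. `build_cookie_opener` is a hypothetical helper and the User-Agent value is a placeholder. It does not execute JavaScript, so it only helps if cookies are the sole obstacle.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return an opener that stores received cookies and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Placeholder User-Agent; some sites vary their response on this header.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; example-scraper)")]
    return opener, jar


opener, jar = build_cookie_opener()
# Usage (not executed here): the first request collects cookies set by the
# site, and the second request sends them back automatically.
#   opener.open("http://example.com/")
#   html = opener.open("http://example.com/page").read()
print(len(jar))
```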

Scraping a table from a website using R (Rvest).. or VBA if possible

Question: I am trying to scrape the table from this URL: "https://hutdb.net/17/players". I have spent a lot of time learning rvest and using SelectorGadget, but whenever I try to get an output I always get the same error (character(0)).

    library(rvest)
    library(magrittr)
    url <- read_html("https://hutdb.net/17/players")
    table <- url %>% html_nodes("td") %>% html_text()

Any help would be appreciated.

Answer 1: The data is dynamically loaded and cannot be retrieved directly from the HTML. But, looking at
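When a table is loaded dynamically, the page often fetches its rows from a separate data request, frequently as JSON, which can be found in the browser's network inspector. The excerpt is cut off before naming any such request, so the sketch below only illustrates turning a JSON payload into table rows. The payload shape in `SAMPLE_PAYLOAD`, its field names, and the `rows_from_payload` helper are all hypothetical.

```python
import json

# Hypothetical payload; the real site's response format is not shown above.
SAMPLE_PAYLOAD = (
    '{"data": [{"name": "Player A", "overall": 90},'
    ' {"name": "Player B", "overall": 88}]}'
)


def rows_from_payload(payload_text, columns):
    """Convert a JSON payload with a "data" list into rows of column values."""
    records = json.loads(payload_text)["data"]
    # Missing keys become None rather than raising an error.
    return [[record.get(col) for col in columns] for record in records]


for row in rows_from_payload(SAMPLE_PAYLOAD, ["name", "overall"]):
    print(row)
```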

Crawling tables from webpage

Question: I'm trying to extract CSU employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried using the urllib2 and requests libraries, but neither of them returned the actual table from the webpage. I guess the reason could be that the table is generated dynamically by JavaScript. Below is my code using requests. from lxml import html import requests page = requests.get("http://www.sacbee.com/statepay/
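One detail worth noting about the URL in the question: everything after the `#` is a fragment, which HTTP clients do not send to the server, so the search parameters only take effect when JavaScript in the page reads them. The short sketch below, using the Python standard library, separates and decodes the fragment to show what the page's script would receive. The variable names are chosen for illustration only.

```python
from urllib.parse import unquote, urldefrag

URL = (
    "http://www.sacbee.com/statepay/"
    "#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento"
)

# Split off the fragment; only `base` is sent in the HTTP request.
base, fragment = urldefrag(URL)
# Percent-decode the fragment to reveal the embedded search parameters.
decoded = unquote(fragment)

print(base)
print(decoded)
```

The decoded fragment suggests which parameters a background data request might carry, which is a starting point for finding it in the browser's network inspector.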

Java scrap website with login required using Jsoup

Question: I'd like to print some data (divs with class="news_article") from streetinsider.com. I created an account, and I need to log in to access that data. Can anyone explain why this code is not working? I've tried a lot, but nothing works.

    public static final String SPLIT_INTERNET_URL = "http://www.streetinsider.com/Special+Dividends?offset=55";
    public static final String SPLIT_LOGIN = "https://www.streetinsider.com/login.php";
    /**
     * @param args the command line arguments
     * @throws java.io
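Scraping behind a login generally takes two steps: POST the credentials to the login URL, then reuse the session cookies from that response when requesting the protected page. The sketch below shows only the first step, building the POST request, in Python. The form field names `email` and `password` are hypothetical; the real names have to be read from the site's login form. `build_login_request` is an illustrative helper, not part of the original code.

```python
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password):
    """Build a form-encoded POST request carrying the login credentials."""
    # Hypothetical field names; inspect the real login form to find them.
    form = urllib.parse.urlencode({"email": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("ascii"), method="POST")


request = build_login_request(
    "https://www.streetinsider.com/login.php", "user@example.com", "secret"
)
print(request.get_method(), request.full_url)
```

Sending this request through an opener built with `urllib.request.HTTPCookieProcessor` would keep the session cookie, so that a later request for the protected page is made as the logged-in user.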

How do I scrape information off ASP.NET websites when paging and JavaScript links are being used?

Question: I have been given a staff list that is supposed to be up to date, but it doesn't match an intranet People Finder written in ASP.NET. As the information is sensitive, I am not able to access the database the People Finder uses, so the only way I can get at the information is by scraping the structure, starting with the top brass and then going through each tier in turn. Each person has a staff number, which then forms the URL http://intranet/peoplefinder/index.aspx?srn
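ASP.NET Web Forms pages commonly implement paging and JavaScript links as postbacks: the link calls `__doPostBack`, which submits the page's form along with hidden fields such as `__VIEWSTATE`, `__EVENTTARGET`, and `__EVENTARGUMENT`. Assuming the People Finder works this way, which the excerpt does not confirm, a scraper can reproduce a click by collecting the hidden fields and posting them back. The sketch below builds that form data in Python. The control name `ctl00$Pager`, the sample markup, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def postback_form(html_text, event_target, event_argument=""):
    """Return form data that imitates a __doPostBack(target, argument) call."""
    parser = HiddenFieldParser()
    parser.feed(html_text)
    form = dict(parser.fields)
    form["__EVENTTARGET"] = event_target
    form["__EVENTARGUMENT"] = event_argument
    return form


# Hypothetical page fragment; a real page carries a much longer view state.
SAMPLE_HTML = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc123" />'
    '<input type="text" name="q" /></form>'
)
print(postback_form(SAMPLE_HTML, "ctl00$Pager", "Page$2"))
```

The resulting dictionary would be URL-encoded and sent as a POST to the same page, with the hidden fields refreshed from each new response.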

Scraping Javascript generated data

Question: I'm working on a project with the World Bank analyzing their procurement processes. The WB maintains websites for each of its projects, containing links and data for the associated contracts issued (example). Contract-related data is available under the procurement tab. I'd like to be able to pull a project's contract information from this site, but the links and associated data are generated using embedded JavaScript, and the URLs of the pages displaying contract awards and other data don
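When the data is embedded in the page's own script as a literal, it can sometimes be recovered without running any JavaScript. The sketch below pulls a JSON array assigned to a variable out of inline script text. It is only an illustration: the variable name `contracts`, the sample markup, and the `embedded_json` helper are hypothetical, and the pattern is deliberately simple, so it will miss data that is built up programmatically or fetched in a separate request.

```python
import json
import re

# Hypothetical inline script; the real page's script is not shown above.
SAMPLE_HTML = """<script>
var contracts = [{"id": "C-1", "url": "/contract/1"}, {"id": "C-2", "url": "/contract/2"}];
</script>"""


def embedded_json(html_text, variable):
    """Return the JSON literal assigned to `variable`, or None if not found."""
    # Match `var NAME = [...];` or `var NAME = {...};` up to the closing
    # bracket that is followed by a semicolon.
    pattern = r"var\s+" + re.escape(variable) + r"\s*=\s*(\[.*?\]|\{.*?\})\s*;"
    match = re.search(pattern, html_text, re.DOTALL)
    return json.loads(match.group(1)) if match else None


for contract in embedded_json(SAMPLE_HTML, "contracts"):
    print(contract["id"], contract["url"])
```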