screen-scraping

How to create a thumbnail image of HTML content stored in a database

Submitted by 喜你入骨 on 2020-01-16 18:18:06
Question: If you were to convert this HTML content into a small thumbnail image, how would you do it? P.S. I'm trying to do this to let users on my site browse through their posts (which contain HTML elements, e.g. strong, i, and img tags). Answer 1: See here: http://www.thumbalizr.com/index.php They have a pretty decent, simple API. Simple syntax to use the API: http://api.thumbalizr.com/?url=http://www.ford.de&width=250 Parameters: check http://api.thumbalizr.com for more details. So you can simply have <img …
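
A minimal sketch of fetching such a thumbnail server-side in Python, assuming the endpoint and parameters shown in the answer above (the service may require registration or an API key; the output filename is illustrative):

    import requests  # third-party HTTP client

    # Build the thumbnail request from the API syntax shown above.
    params = {"url": "http://www.ford.de", "width": 250}
    resp = requests.get("http://api.thumbalizr.com/", params=params, timeout=30)
    resp.raise_for_status()

    # Save the returned image bytes so they can be served as a thumbnail.
    with open("thumbnail.png", "wb") as f:
        f.write(resp.content)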

How do I automate navigation to a website that requires authentication?

Submitted by 柔情痞子 on 2020-01-16 01:55:33
Question: Here's what I'm trying to achieve. I would like to write a script that will navigate to a website that requires me to be authenticated as myself (say Facebook, Live Spaces, Twitter, or any other) and then have that script search for certain information on one of the site's pages. I've done something similar in the past with the Windows.Forms WebBrowser control, which is a full-blown implementation of IE that can be controlled through code and will store whatever cookies you get once …
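
One scriptable alternative to the WebBrowser control is a cookie-aware HTTP session. A minimal Python sketch, assuming a plain form-based login (the login URL and form field names are hypothetical; JavaScript-heavy or two-factor logins need a real browser driver instead):

    import requests

    # A Session persists cookies across requests, much like the
    # WebBrowser control keeps you logged in between navigations.
    session = requests.Session()

    # Hypothetical login endpoint and form field names.
    session.post("https://example.com/login",
                 data={"username": "me", "password": "secret"})

    # Subsequent requests carry the session cookies automatically.
    page = session.get("https://example.com/private/page")
    if "Welcome" in page.text:  # illustrative success check
        print("Authenticated fetch succeeded")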

Selenium with PhantomJS: Form being validated but not submitted

Submitted by 别说谁变了你拦得住时间么 on 2020-01-15 18:50:15
Question: I'm having a strange problem submitting a form through Selenium WebDriver's PhantomJS API. Upon clicking the submit button, the form gets validated (are the username and password too short, or blank, etc.), but it never actually gets submitted. That is, if I submit an invalid form and check a screenshot, there are alert notifications; if I submit a valid form, nothing happens. The JS on the page is supposed to validate the form and then submit it when the submit button is clicked. A …
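
When the page's validate-then-submit logic runs asynchronously, a common workaround is to click and then explicitly wait for a navigation instead of checking a screenshot immediately. A minimal Python sketch, assuming PhantomJS and hypothetical element IDs:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS()  # deprecated in newer Selenium; headless Chrome/Firefox also works
    driver.get("https://example.com/signup")  # hypothetical form page

    driver.find_element(By.ID, "username").send_keys("someuser")
    driver.find_element(By.ID, "password").send_keys("longenoughpassword")
    old_url = driver.current_url
    driver.find_element(By.ID, "submit").click()

    # Give the page's JavaScript time to validate and submit:
    # wait for the URL to change rather than screenshotting right away.
    WebDriverWait(driver, 10).until(EC.url_changes(old_url))
    print(driver.current_url)
    driver.quit()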

Crawl a list of sites one by one with Scrapy

Submitted by ∥☆過路亽.° on 2020-01-15 10:33:51
Question: I am trying to crawl a list of sites with Scrapy. I tried putting the list of website URLs in start_urls, but then found it used more memory than I could afford. Is there any way to make Scrapy crawl only one or two sites at a time? Answer 1: You can try setting CONCURRENT_REQUESTS = 1 so that you aren't overloaded with data: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests Answer 2: You can define a start_requests method which iterates through requests to your URLs. This …
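
A minimal sketch combining both answers, using Scrapy's standard Spider API (the spider name, URL file, and parsing logic are illustrative): concurrency is capped via custom_settings, and start_requests yields requests lazily so the URL list is never expanded into in-flight requests all at once.

    import scrapy

    class OneByOneSpider(scrapy.Spider):
        name = "one_by_one"  # illustrative
        # Answer 1: throttle to a single in-flight request.
        custom_settings = {"CONCURRENT_REQUESTS": 1}

        def start_requests(self):
            # Answer 2: a generator yields requests one at a time,
            # so the full URL list never sits in memory as requests.
            with open("sites.txt") as f:  # hypothetical one-URL-per-line file
                for line in f:
                    yield scrapy.Request(line.strip(), callback=self.parse)

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}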

"Exception in thread "main" java.lang.NullPointerException" error when running web scraper program

Submitted by 北战南征 on 2020-01-15 06:22:11
Question: I'm fairly new to web scraping and have limited knowledge of Java. Every time I run this code, I get the error:

    Exception in thread "main" java.lang.NullPointerException
        at sws.SWS.scrapeTopic(SWS.java:38)
        at sws.SWS.main(SWS.java:26)
    Java Result: 1 BUILD SUCCESSFUL (total time: 0 seconds)

My code is:

    import java.io.*;
    import java.net.*;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class SWS {
        /**
         * @param args the command line arguments
         */
        public static void main(String[] …
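
The stack trace points at a dereference inside scrapeTopic, which with jsoup usually means a selector matched nothing and returned null. The same defensive pattern, sketched in Python with requests and BeautifulSoup for illustration (the URL and selector are hypothetical; the original Java code is cut off above):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/topic").text  # hypothetical page
    soup = BeautifulSoup(html, "html.parser")

    # Equivalent of jsoup's select(...).first(): this may legitimately
    # be None, so check before dereferencing instead of letting it crash.
    node = soup.select_one("div.topic-title")
    if node is None:
        print("selector matched nothing; check the page structure")
    else:
        print(node.get_text(strip=True))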

CasperJS: how to click multiple links in a table while collecting data from the web / .click() doesn't work?

Submitted by a 夏天 on 2020-01-14 14:36:33
Question: I want to scrape some web data using CasperJS. The data is in a table, and each row contains a link leading to a page with more detail. The script has a loop iterating through all the table rows. I want Casper to click the link, collect the data on the sub-page, and go back one history step to process the next table row. The problem is that click() doesn't work, and I don't know why. Is there any way to fix this? (Note: a JavaScript function, viewContact, is invoked via the href.) Here is the code …
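
When a link's href only invokes a JavaScript function (here viewContact), one alternative to clicking is calling that function in the page directly. A Python/Selenium sketch of the same idea (casper.evaluate can do the equivalent in CasperJS; the URL, table selector, and function arguments are hypothetical):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # any driver works for the pattern
    driver.get("https://example.com/contacts")  # hypothetical listing page

    # Collect all hrefs up front: elements go stale after navigating away.
    links = driver.find_elements(By.CSS_SELECTOR, "table tr a")
    hrefs = [a.get_attribute("href") for a in links]

    for href in hrefs:
        # Instead of .click(), run the javascript: href directly,
        # e.g. "javascript:viewContact(42)" -> execute viewContact(42).
        driver.execute_script(href.replace("javascript:", ""))
        # ... collect detail-page data here ...
        driver.back()  # one history step back to the table

    driver.quit()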

PHP cURL: Operation timed out after 120308 milliseconds with X out of -1 bytes received

Submitted by 半世苍凉 on 2020-01-14 09:29:30
Question: I'm occasionally seeing this error (see title) in my scraping script. X is an integer number of bytes > 0: the actual number of bytes the web server sent in the response. I debugged the issue with Charles proxy; as the capture shows, there is no Content-Length: header in the response, and the proxy keeps waiting for data (so cURL waited for two minutes and gave up). The cURL error code is 28. Below is some debug info from verbose cURL output with var_export'ed curl_getinfo( …
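
The same failure mode can be probed outside PHP. A Python sketch, assuming an illustrative URL, that checks whether the server declares a Content-Length and reads the body with an explicit read timeout, so a stalled response fails fast instead of hanging for two minutes:

    import requests

    url = "https://example.com/slow-endpoint"  # illustrative

    # timeout=(connect, read): the read timeout caps how long we wait
    # between chunks, which is what bites when Content-Length is missing
    # and the server stalls mid-response.
    resp = requests.get(url, stream=True, timeout=(10, 30))
    print("Content-Length:", resp.headers.get("Content-Length", "<missing>"))

    received = 0
    try:
        for chunk in resp.iter_content(chunk_size=8192):
            received += len(chunk)
    except requests.exceptions.RequestException as e:
        print(f"transfer died after {received} bytes: {e}")
    print("total bytes received:", received)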

Scrapy is following and scraping non-allowed links

Submitted by 痞子三分冷 on 2020-01-14 08:58:54
Question: I have a CrawlSpider set up to follow certain links and scrape a news magazine, where links to each issue follow this URL scheme: http://example.com/YYYY/DDDD/index.htm where YYYY is the year and DDDD is the three- or four-digit issue number. I only want issues 928 onwards, and my rules are below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code), but the spider seems determined to follow non-allowed links …
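
The actual rules are cut off above, but here is a sketch of the usual shape under the URL scheme described, with the "issue >= 928" constraint expressed via process_links rather than a pure regex, since numeric comparisons are awkward to encode in a pattern (spider name, domain, and callback body are illustrative):

    import re
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    ISSUE_RE = re.compile(r"/\d{4}/(\d{3,4})/index\.htm")

    def issues_from_928(links):
        # Drop any extracted link whose issue number is below 928.
        return [l for l in links
                if (m := ISSUE_RE.search(l.url)) and int(m.group(1)) >= 928]

    class MagazineSpider(CrawlSpider):
        name = "magazine"  # illustrative
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]
        rules = (
            Rule(LinkExtractor(allow=r"/\d{4}/\d{3,4}/index\.htm"),
                 process_links=issues_from_928, callback="parse_issue"),
        )

        def parse_issue(self, response):
            yield {"url": response.url}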