scrape

Web Scraping - Google Map Website - is it possible to scrape?

Submitted by 狂风中的少年 on 2019-12-26 06:48:48
Question: Just joined SO, so I was wondering if you can help me with this issue. We used to scrape a website to get all the contact information for CrossFit gyms in the US/world, as the information was pretty openly exposed. Now, however, they have changed their website to map.crossfit.com, so the information is embedded within a Google-style map, and you can only get the information for each gym (name, address, phone #, etc.) by zooming in and choosing them one by one, which would take me…
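Map front-ends like this usually load their pins from a JSON endpoint that you can spot in the browser's network tab while the map is panning; once found, you parse that payload directly instead of clicking markers one by one. A minimal sketch, assuming a hypothetical payload shape (the real endpoint URL and field names must be discovered from the site itself):

```python
import json

# Hypothetical marker payload, shaped like what a map endpoint might return.
# The endpoint and the field names here are assumptions, not the real API.
payload = """
{"gyms": [
  {"name": "CrossFit Example", "address": "123 Main St", "phone": "555-0100"},
  {"name": "CrossFit Sample",  "address": "456 Oak Ave", "phone": "555-0199"}
]}
"""

gyms = json.loads(payload)["gyms"]
rows = [(g["name"], g["address"], g["phone"]) for g in gyms]
print(rows)
```

Once the real endpoint is known, the same parsing code runs over the response body of a single HTTP request, with no zooming required.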

(Web scraping) I've located the proper tags, now how do I extract the text?

Submitted by 被刻印的时光 ゝ on 2019-12-25 01:14:36
Question: I'm creating my first web scraping application; it collects the titles of games currently on the "new and trending" tab on https://store.steampowered.com/. Once I figure out how to do this, I want to repeat the process with prices and export both to a spreadsheet in separate columns. I've successfully found the tags that contain the text I'm trying to extract (the title), but I'm unsure how to extract the titles once I've located their containers. from urllib.request import urlopen from bs4…
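Once you have the container tags, `.get_text()` (or the `.text` property) pulls the visible string out of each one. A self-contained sketch using stand-in markup (the `tab_item_name` class is an assumption here; check the live page's markup for the real class):

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the class name is an assumption, inspect the live page.
html = """
<div class="tab_item_name">Game One</div>
<div class="tab_item_name">Game Two</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns Tag objects; get_text(strip=True) extracts trimmed text.
titles = [tag.get_text(strip=True) for tag in soup.find_all(class_="tab_item_name")]
print(titles)  # ['Game One', 'Game Two']
```

The same list comprehension works for prices: locate the price containers, call `.get_text()` on each, and zip the two lists into spreadsheet rows.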

PHP file_get_contents returns a different web page than I see in my browser

Submitted by 匆匆过客 on 2019-12-24 17:13:18
Question: I'm trying to load a website into a variable, but when I run my code it returns a different page than the one I see in my browser. Here's my code: $query = $_GET['q']; $url = 'https://www.google.com/search?q='.str_replace(' ','+',$query); $doc = file_get_contents($url, false, $context); echo $doc; I see a (I guess older) different version of Google on my website than when I go to Google's website itself. Can anyone help me? Edit: Here are some screenshots: This is my website / This is Google. To be more…
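Google commonly serves a stripped-down legacy page to clients that send no browser-like User-Agent header, which is the usual cause of this mismatch. The question's code is PHP (where the header would go into the options passed to stream_context_create), but the idea is language-neutral; a Python sketch of attaching the header before the request is sent:

```python
import urllib.request

# A browser-like User-Agent string; servers often vary what they serve on it.
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Build the request with the header attached; nothing is sent at this point,
# so the header can be inspected before the request goes out.
req = urllib.request.Request(
    "https://www.google.com/search?q=test",
    headers={"User-Agent": ua},
)
print(req.get_header("User-agent"))
```

With the header in place, `urllib.request.urlopen(req)` would fetch the page the way a browser identifies itself, which typically matches what you see interactively.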

Trying to scrape an image from an image URL (using Python urllib) but getting HTML instead

Submitted by ε祈祈猫儿з on 2019-12-24 14:40:44
Question: I've tried to get the image from the following URL: http://upic.me/i/fj/the_wonderful_mist_once_again_01.jpg. I can right-click and save-as, but when I tried to use urlretrieve, like import urllib img_url = 'http://upic.me/i/fj/the_wonderful_mist_once_again_01.jpg' urllib.urlretrieve( img_url, 'cover.jpg'), I found that it is HTML instead of a .jpg image, but I don't know why. Could you please tell me why my method does not work? Is there any option that can mimic the right-click save-as method? Answer 1: …
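This usually happens because the image host checks the Referer or User-Agent header and serves an HTML error page to bare clients; sending browser-like headers with the request is the typical fix. Independently of that, you can detect the failure by checking the magic bytes of what was downloaded instead of trusting the file extension:

```python
def looks_like_jpeg(data: bytes) -> bool:
    # Every JPEG file starts with the two bytes FF D8.
    return data[:2] == b"\xff\xd8"

jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16   # start of a real JPEG
html_bytes = b"<!DOCTYPE html><html>..."          # an error page in disguise

print(looks_like_jpeg(jpeg_bytes))  # True
print(looks_like_jpeg(html_bytes))  # False
```

Running this check on the first bytes of the downloaded file tells you immediately whether the server handed back an image or an HTML page.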

How to scrape address from websites using Scrapy? [closed]

Submitted by 有些话、适合烂在心里 on 2019-12-24 14:16:33
Question: [Closed: this question needs to be more focused and is not currently accepting answers. Closed 4 years ago.] I am using Scrapy and I need to scrape the address from the "contact us" page of a given domain. The domains are provided as results of the Google Search API, so I do not know what the exact structure of each web page is going to be. Is this kind of scraping possible? Any examples…
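With no known page structure, the usual approach is heuristic: first locate the contact page by keyword-matching link hrefs and text, then run pattern matching for address-like strings on that page. A sketch of the first step in plain Python (in Scrapy the same filter would run over the results of `response.css('a')`; the hint words are assumptions, extend them for your domains):

```python
# Keywords that commonly appear in contact-page URLs or link text (assumed list).
CONTACT_HINTS = ("contact", "about", "find-us", "location")

def contact_links(links):
    """Keep (href, text) pairs whose href or text suggests a contact page."""
    return [
        (href, text) for href, text in links
        if any(h in href.lower() or h in text.lower() for h in CONTACT_HINTS)
    ]

links = [
    ("/products", "Products"),
    ("/contact-us", "Contact Us"),
    ("/blog", "Blog"),
    ("/about", "About our company"),
]
print(contact_links(links))
```

Heuristics like this will not be perfect across arbitrary domains, but they narrow a full crawl down to one or two candidate pages per site before the address extraction runs.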

Extract elements from an HTML page

Submitted by ぐ巨炮叔叔 on 2019-12-24 08:58:41
Question: I downloaded some YouTube comment pages and I want to extract the username (or user display name) and the link from the following code block: <p class="metadata"> <span class="author "> <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKG174zFqbQCFZmaIQodtmyE0A%3D%3D" dir="ltr">Sabil Muhammad</a> </span> <span class="time" dir="ltr"> <a dir="ltr" href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs"> il y a 1…
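Both pieces live on anchor elements: the username is the text of the `<a>` inside `span.author`, and the permalink is the `href` of the `<a>` inside `span.time`. A sketch with BeautifulSoup against a reduced version of the snippet above:

```python
from bs4 import BeautifulSoup

# Reduced version of the snippet from the question.
html = """
<p class="metadata">
  <span class="author ">
    <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA">Sabil Muhammad</a>
  </span>
  <span class="time">
    <a href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs">...</a>
  </span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS class selectors match even when the attribute has extra whitespace.
name = soup.select_one("span.author a").get_text(strip=True)
link = soup.select_one("span.time a")["href"]
print(name, link)
```

Looping over `soup.select("p.metadata")` applies the same two selectors to every comment on the page.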

CSS selectors to be used for scraping specific links

Submitted by ぃ、小莉子 on 2019-12-24 07:29:32
Question: I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the required links. I am trying to collect the links under the "UPCOMING EVENTS" tab on http://kiascenehai.pk/, but this is just for learning how to get the specified links. I am looking for a fix for this problem, and also for suggestions on how to retrieve specific links using CSS selectors. from bs4 import BeautifulSoup import requests url = "http://kiascenehai.pk/" r = requests.get(url) data =…
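With BeautifulSoup, `select()` accepts the same CSS selector you would copy from Firebug, although a short hand-written selector is far more robust than the auto-generated full path, which breaks as soon as the page layout shifts. A self-contained sketch (the markup and class names here are stand-ins, not the real structure of kiascenehai.pk):

```python
from bs4 import BeautifulSoup

# Stand-in markup; check the real page's classes with the inspector.
html = """
<div class="upcoming-events">
  <a href="/event/music-night">Music Night</a>
  <a href="/event/art-expo">Art Expo</a>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# A scoped selector: only anchors inside the events container are matched.
links = [a["href"] for a in soup.select("div.upcoming-events a")]
print(links)  # ['/event/music-night', '/event/art-expo']
```

The footer link is excluded because the selector is anchored to the events container rather than to the whole document.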

Node can't scrape certain pages

Submitted by ぃ、小莉子 on 2019-12-24 03:47:16
Question: I don't know if this is something to do with ColdFusion pages or what, but I can't scrape these .cfm pages. In a Node REPL, in a directory where request is installed, run: node> var request = require('request'); node> var url = 'http://linguistlist.org/callconf/browse-conf-action.cfm?confid=173395'; node> request(url, function (err, res, body) { if (err) { console.log(err) } else { console.log('body:', body) }; }); I've tried some other .cfm sites and they work, but here I am only getting blank results, so I…

R rvest for() loop and "Error: server error: (503) Service Unavailable"

Submitted by 丶灬走出姿态 on 2019-12-23 05:09:03
Question: I'm new to web scraping, but I am excited to be using rvest in R. I tried to use it to scrape particular data about companies. I created a for loop (171 URLs), and when I run it, it stops on the 6th or 7th URL with an error: Error in parse.response(r, parser, encoding = encoding) : server error: (503) Service Unavailable. When I restart my loop from the 7th URL it gets through two or three more and stops again with the same error. My loop: library(rvest) thing<-c("http://www.informazione-aziende.it/Azienda_ LA…
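A 503 that appears a few requests into a loop usually means the server is rate-limiting you; the standard cure is to pause between requests and retry failures with increasing delays. In R this would be Sys.sleep() plus tryCatch() around read_html(); for consistency with the other examples here, the same pattern sketched in Python, with the fetcher injected so the retry logic can be exercised without a network:

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=0.01):
    """Call fetch(url); on failure, wait and retry with a doubling delay."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of retries: propagate the error
            time.sleep(delay)
            delay *= 2  # back off, giving the server room to recover

# A fake fetcher that fails twice with a 503-like error, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "<html>ok</html>"

result = fetch_with_retry(flaky, "http://example.com")
print(result)  # <html>ok</html>
```

Adding an unconditional short sleep between the 171 URLs, on top of the retry, often prevents the 503s from appearing at all.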

Extract Links from Facebook activity feed

Submitted by 懵懂的女人 on 2019-12-23 04:55:16
Question: I'm trying to get the links from a Facebook activity feed. I've tried extracting the HTML from the iframe, but that doesn't work because of cross-domain restrictions. Then I tried cURL, but that doesn't work because of the JavaScript. http://developers.facebook.com/docs/reference/plugins/activity Any ideas? Answer 1: It is not possible; basically it requires your session, which you can't maintain in an iframe. I tried this a few days ago with file_get_contents and cURL, but nothing worked. Answer 2: You could try using a…