How to scrape address from websites using Scrapy? [closed]

Submitted by 有些话、适合烂在心里 on 2019-12-24 14:16:33

Question


I am using Scrapy and I need to scrape the address from the "Contact Us" page of a given domain. The domains are provided by the Google Search API, so I do not know in advance what the exact structure of each web page will be. Is this kind of scraping possible? Any examples would be nice.


Answer 1:


Providing a few examples would help make a better answer, but the general idea could be to:

  • find the "Contact Us" link
  • follow the link and extract the address

assuming you don't have any information about the websites you'll be given.

Let's focus on the first problem.

The main problem here is that websites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. But you can cover the most common cases:

  • follow `a` tags with the text "Contact Us", "Contact", "About Us", "About", etc.
  • check /about, /contact_us and similar endpoints, examples:
    • http://www.sample.com/contact.php
    • http://www.sample.com/contact
  • follow all links that have contact, about, etc. in their text

From these you can build a set of Rules for your CrawlSpider.

The second problem is no easier: you don't know where on the page an address is located (and maybe it isn't on the page at all), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.
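Before reaching for NLP, a simple regex heuristic can catch common cases. The sketch below assumes US-style street addresses (number, street name, common street suffix); real-world address formats vary far too widely for this to be reliable, so treat it as a first-pass filter rather than a parser.

```python
import re

# Rough heuristic: a 1-5 digit number, one or more words, then a common
# street suffix. This is an illustrative assumption, not a general parser.
ADDRESS_RE = re.compile(
    r"\d{1,5}\s+\w+(?:\s\w+)*\s+"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b\.?",
    re.IGNORECASE,
)


def find_addresses(text):
    """Return candidate address strings found in free text."""
    return ADDRESS_RE.findall(text)


print(find_addresses("Visit us at 221 Baker Street, London."))
# → ['221 Baker Street']
```

Anything this misses (international formats, addresses split across HTML elements) is where the NLP/ML approaches mentioned above come in; libraries built for address parsing are also worth evaluating before training your own model.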



Source: https://stackoverflow.com/questions/28145409/how-to-scrape-address-from-websites-using-scrapy
