Question
I am using Scrapy and I need to scrape the address from the "Contact Us" page of a given domain. The domains are provided as results of the Google Search API, so I do not know what the exact structure of the web page is going to be. Is this kind of scraping possible? Any examples would be nice.
Answer 1:
Providing a few examples would help to make a better answer, but the general idea could be to:
- find the "Contact Us" link
- follow the link and extract the address
assuming you don't have any information about the websites you'll be given.
Let's focus on the first problem.
The main problem here is that websites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. But you can cover the most common cases:
- follow the `a` tag with the text "Contact Us", "Contact", "About Us", "About", etc.
- check `/about`, `/contact_us` and similar endpoints, for example:
  - http://www.sample.com/contact.php
  - http://www.sample.com/contact
- follow all links that have `contact`, `about`, etc. in their text
From these you can build a set of Rules for your CrawlSpider.
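As a minimal sketch of that heuristic, here is a plain-Python function that scores a link's text and `href` against common "contact"-style keywords (the keyword list is an assumption, not exhaustive); the same keywords could equally be fed into a `LinkExtractor(allow=...)` pattern for a CrawlSpider's Rules:

```python
# Hypothetical keyword list -- extend it for your target sites.
CONTACT_WORDS = ("contact", "about")

def looks_like_contact_link(text, href):
    """Heuristic: does this link likely lead to a 'Contact Us' page?

    Checks both the anchor text and the URL path, case-insensitively.
    """
    haystack = "{} {}".format(text, href).lower()
    return any(word in haystack for word in CONTACT_WORDS)

# Example link set a spider might extract from a page:
links = [
    ("Home", "/"),
    ("Contact Us", "/contact.php"),
    ("About", "/about"),
    ("Blog", "/blog"),
]
candidates = [href for text, href in links if looks_like_contact_link(text, href)]
print(candidates)  # ['/contact.php', '/about']
```

This only filters candidate links; in a real CrawlSpider you would attach a callback to the matching Rule and parse the followed page there.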
The second problem is no easier: you don't know where on the page the address is located (and maybe it doesn't exist on the page at all), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.
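Before reaching for NLP, a crude regular expression can catch the simplest cases. The sketch below assumes US-style street addresses (number + capitalized street name + common suffix) and will miss most other formats, which is exactly why the answer points toward NLP for anything robust:

```python
import re

# Very rough US-style pattern: house number, 1-3 capitalized words,
# then a common street suffix. Purely illustrative, not production-ready.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Dr|Drive|Ln|Lane)\b"
)

def find_addresses(page_text):
    """Return all substrings that look like simple street addresses."""
    return ADDRESS_RE.findall(page_text)

sample = "Visit us at 123 Main Street, Springfield, or call us."
print(find_addresses(sample))  # ['123 Main Street']
```

In a Scrapy callback you would run this over the page's extracted text (e.g. `" ".join(response.css("body ::text").getall())`) and yield any matches as items.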
Source: https://stackoverflow.com/questions/28145409/how-to-scrape-address-from-websites-using-scrapy