How to scrape address from websites using Scrapy? [closed]

Submitted by 有些话、适合烂在心里 on 2019-12-24 14:16:33

Question


I am using Scrapy and I need to scrape the address from the "Contact Us" page of a given domain. The domains are provided by the Google Search API, so I do not know in advance what the exact structure of each web page will be. Is this kind of scraping possible? Any examples would be nice.


Answer 1:


Providing a few examples would help make a better answer, but the general idea could be to:

  • find the "Contact Us" link
  • follow the link and extract the address

assuming you don't have any information about the websites you'll be given.

Let's focus on the first problem.

The main problem here is that websites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. But you can cover the most common cases:

  • follow `a` tags with the text "Contact Us", "Contact", "About Us", "About", etc.
  • check /about, /contact_us and similar endpoints, examples:
    • http://www.sample.com/contact.php
    • http://www.sample.com/contact
  • follow all links that have contact, about, etc. in their text

From these you can build a set of Rules for your CrawlSpider.

The second problem is no easier: you don't know where on the page an address is located (and maybe it isn't on the page at all), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.
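Before reaching for NLP, a simple regex heuristic can catch common cases. The sketch below assumes US-style street addresses (number, street name, common street suffix); real-world address formats vary far too widely for this to be reliable, so treat it as a first-pass filter rather than a parser.

```python
import re

# Rough heuristic: a 1-5 digit number, one or more words, then a common
# street suffix. This is an illustrative assumption, not a general parser.
ADDRESS_RE = re.compile(
    r"\d{1,5}\s+\w+(?:\s\w+)*\s+"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b\.?",
    re.IGNORECASE,
)


def find_addresses(text):
    """Return candidate address strings found in free text."""
    return ADDRESS_RE.findall(text)


print(find_addresses("Visit us at 221 Baker Street, London."))
# → ['221 Baker Street']
```

Anything this misses (international formats, addresses split across HTML elements) is where the NLP/ML approaches mentioned above come in; libraries built for address parsing are also worth evaluating before training your own model.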



Source: https://stackoverflow.com/questions/28145409/how-to-scrape-address-from-websites-using-scrapy
