Question
I have a website URL such as www.example.com.
I want to collect social media links from this website, for example the Facebook URL (facebook.com/example) and the Twitter URL (twitter.com/example), if they appear anywhere on any page of the site.
How can I accomplish this? Please suggest any tutorials, blogs, or technologies.
Answer 1:
Since you don't know exactly where (on which page of the website) those links are located, you probably want to base your spider on the CrawlSpider class. Such a spider lets you define rules for link extraction and navigation through the website. See this minimal example:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(allow_domains=('example.com', )), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = dict()
        item['page'] = response.url
        item['facebook_urls'] = response.xpath('//a[contains(@href, "facebook.com")]/@href').extract()
        item['twitter_urls'] = response.xpath('//a[contains(@href, "twitter.com")]/@href').extract()
        yield item
This spider will crawl all pages of the example.com website and extract URLs containing facebook.com and twitter.com.
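Note that `contains(@href, "facebook.com")` is a plain substring test, so it can also match share-button links (such as `facebook.com/sharer/...`) or URLs that only mention the domain in a query string. Below is a minimal sketch of a post-filter that could be applied to the extracted hrefs. The helper name `social_profile`, the `SOCIAL_HOSTS` mapping, and the `NON_PROFILE_PREFIXES` list are illustrative assumptions, not part of Scrapy, and the prefix list is not exhaustive.

```python
from urllib.parse import urlparse

# Illustrative assumptions: extend both collections for your own needs.
SOCIAL_HOSTS = {"facebook.com": "facebook", "twitter.com": "twitter"}
NON_PROFILE_PREFIXES = ("sharer", "share", "intent", "dialog", "plugins")


def social_profile(url):
    """Return (network, handle) if url looks like a profile link, else None."""
    # Normalise protocol-relative ("//host/...") and scheme-less hrefs
    # so that urlparse can find the hostname.
    if url.startswith("//"):
        url = "https:" + url
    elif "://" not in url:
        url = "https://" + url

    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Match on the hostname only, never on the path or query string.
    network = None
    for domain, name in SOCIAL_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            network = name
            break
    if network is None:
        return None

    # The first path segment is treated as the handle, unless it is a
    # known share/widget endpoint.
    segments = [s for s in parsed.path.split("/") if s]
    if not segments or segments[0].lower() in NON_PROFILE_PREFIXES:
        return None
    return network, segments[0]
```

Inside `parse_page`, each extracted href could be passed through `social_profile` and kept only when the result is not `None`.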
Answer 2:
import requests
from html_to_etree import parse_html_bytes
from extract_social_media import find_links_tree
res = requests.get('http://www.jpmorganchase.com')
tree = parse_html_bytes(res.content, res.headers.get('content-type'))
set(find_links_tree(tree))
Source: https://github.com/fluquid/extract-social-media
Answer 3:
Most likely you want to:
1. Search for links in the header/footer of the HTML page layout, as that is the most common place for them.
2. Cross-reference them with the links found on other pages of the same site.
3. Check whether the name of the site/organization appears in the link. This one is not reliable, though, as the name may differ a bit or the account may use a completely unrelated handle.
That is all I can think of.
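The cross-referencing and name-matching heuristics above could be sketched roughly as follows. This assumes the candidate links have already been collected per page (for example by a crawler). The function name `rank_social_links` and the scoring scheme are hypothetical choices for illustration, and links are compared as exact strings, with no normalisation of scheme or trailing slashes.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_social_links(links_by_page, site_name):
    """Rank candidate links, most likely official profile first.

    links_by_page: dict mapping a page URL to the social links found on it.
    site_name: site/organization name to look for in the handle.
    """
    pages_seen = Counter()
    for links in links_by_page.values():
        # set() so that a link repeated on one page is counted once.
        pages_seen.update(set(links))

    def score(link):
        handle = urlparse(link).path.strip("/").lower()
        # Weak signal only: the official handle may not contain the name.
        name_bonus = 1 if site_name.lower() in handle else 0
        return (pages_seen[link], name_bonus)

    return sorted(pages_seen, key=score, reverse=True)


links_by_page = {
    "http://www.example.com/": [
        "https://facebook.com/example",
        "https://twitter.com/example",
    ],
    "http://www.example.com/about": [
        "https://facebook.com/example",
        "https://twitter.com/example",
    ],
    "http://www.example.com/blog/1": [
        "https://facebook.com/example",
        "https://twitter.com/someauthor",
    ],
}
ranked = rank_social_links(links_by_page, "example")
```

Links that appear on many pages (typically those in a shared header or footer) rise to the top, while a link seen on a single page, such as an author's personal account, sinks to the bottom.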
Source: https://stackoverflow.com/questions/46580177/how-to-extract-social-information-from-a-given-website