How to extract social information from a given website?

Submitted by 魔方 西西 on 2020-06-27 06:38:22

Question


I have a website URL such as www.example.com.

I want to collect social media links from this website, for example its Facebook URL (facebook.com/example) or Twitter URL (twitter.com/example), if they appear anywhere on any page of the site.

How can I accomplish this? Suggestions for tutorials, blogs, or technologies are welcome.


Answer 1:


Since you don't know exactly where (on which page of the website) those links are located, you probably want to base your spider on Scrapy's CrawlSpider class. Such a spider lets you define rules for link extraction and navigation through the website. See this minimal example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    # Follow only links that stay on example.com and run parse_page
    # on every page reached that way.
    rules = (
        Rule(LinkExtractor(allow_domains=('example.com', )), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # One item per crawled page: the page URL plus every <a href>
        # on it that mentions facebook.com or twitter.com.
        item = dict()
        item['page'] = response.url
        item['facebook_urls'] = response.xpath('//a[contains(@href, "facebook.com")]/@href').extract()
        item['twitter_urls'] = response.xpath('//a[contains(@href, "twitter.com")]/@href').extract()
        yield item

This spider crawls every page of example.com that it can reach by following links. For each page it yields the page URL together with any link URLs containing facebook.com or twitter.com.
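Matching on the substring "facebook.com" or "twitter.com" also picks up share buttons and unrelated hosts that merely contain those strings. As a rough sketch of one way to post-filter the extracted hrefs, the helper below uses only the standard library. The names SOCIAL_DOMAINS, SHARE_SEGMENTS and classify_social_url are hypothetical and not part of Scrapy, and the list of share endpoints is an assumption that would need tuning for real sites.

```python
from urllib.parse import urlparse

# Hypothetical mapping from registered domain to network name; extend as needed.
SOCIAL_DOMAINS = {
    "facebook.com": "facebook",
    "twitter.com": "twitter",
}

# Assumed first path segments of share/intent endpoints, which are not profiles.
SHARE_SEGMENTS = {"sharer", "sharer.php", "share", "intent"}


def classify_social_url(href):
    """Return the network name if href looks like a profile link, else None."""
    parsed = urlparse(href)
    # Accept http(s) and protocol-relative links only (skips mailto:, javascript:, ...).
    if parsed.scheme not in ("http", "https", ""):
        return None
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    # A bare domain or a share endpoint is not a profile link.
    if not segments or segments[0].lower() in SHARE_SEGMENTS:
        return None
    for domain, network in SOCIAL_DOMAINS.items():
        # Match the domain itself or any subdomain (www., m., ...),
        # but not hosts that merely contain the string.
        if host == domain or host.endswith("." + domain):
            return network
    return None
```

Inside parse_page, each extracted href could be passed through classify_social_url and kept only when the result is not None.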




Answer 2:


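import requests
from html_to_etree import parse_html_bytes
from extract_social_media import find_links_tree

# Download the page and parse the raw bytes into an element tree.
res = requests.get('http://www.jpmorganchase.com')
tree = parse_html_bytes(res.content, res.headers.get('content-type'))

# Collect the social media links found in the tree, without duplicates.
set(find_links_tree(tree))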

Source: https://github.com/fluquid/extract-social-media




Answer 3:


Most likely you want to:

1. Search for links in the header and footer of the HTML page layout, as that is the most common place for them.
2. Cross-reference the links you find with those found on other pages of the same site.
3. Check whether the name of the site or organization appears in the link. This check is not reliable, though, since the handle may differ a bit from the name or be something unrelated entirely.

That is all I can think of.
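The cross-referencing idea in point 2 could be sketched roughly as follows: count on how many crawled pages each social link appears and keep those that show up on a large share of them, on the assumption that a site's own profiles sit in a shared header or footer. The function names, the min_share threshold of 0.5, and the normalization rules (lowercasing, dropping "www." and the query string) are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def normalize(href):
    """Reduce a link to 'host/path' so trivial variants compare equal."""
    parsed = urlparse(href)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Assumption: handles are case-insensitive and the query string is irrelevant.
    return host + parsed.path.rstrip("/").lower()


def rank_social_links(links_by_page, min_share=0.5):
    """Return (link, page_count) pairs seen on at least min_share of the pages.

    links_by_page maps a page URL to the social hrefs found on that page.
    """
    counts = Counter()
    for hrefs in links_by_page.values():
        # A set ensures each link is counted once per page.
        counts.update({normalize(h) for h in hrefs})
    total = len(links_by_page)
    if total == 0:
        return []
    return [(link, n) for link, n in counts.most_common() if n / total >= min_share]


pages = {
    "http://www.example.com/": ["https://twitter.com/Example", "https://www.facebook.com/example/"],
    "http://www.example.com/about": ["https://twitter.com/example"],
    "http://www.example.com/blog": ["https://twitter.com/example", "https://twitter.com/someguest"],
}
print(rank_social_links(pages))
```

With this made-up input, only twitter.com/example appears on at least half of the pages, so it is the only link kept. Links that appear on a single page, such as a guest author's profile, are filtered out.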



Source: https://stackoverflow.com/questions/46580177/how-to-extract-social-information-from-a-given-website
