Extract Google Search Results

后端 未结 2 1638
無奈伤痛
無奈伤痛 2020-12-30 16:32

I would like to periodically check what sub-domains are being listed by Google.

To obtain list of sub-domains, I type \'site:example.com\' in Google search box - thi

相关标签:
2条回答
  • 2020-12-30 17:12

    Regex is a bad idea for parsing HTML. It's cryptic to read and relies of well-formed HTML.

    Try BeautifulSoup for Python. Here's an example script that returns URLs from the first 10 pages of a site:domain.com Google query.

    import sys # Used to add the BeautifulSoup folder the import path
    import urllib2 # Used to read the html document
    
    if __name__ == "__main__":
        ### Import Beautiful Soup
        ### Here, I have the BeautifulSoup folder in the level of this Python script
        ### So I need to tell Python where to look.
        sys.path.append("./BeautifulSoup")
        from BeautifulSoup import BeautifulSoup
    
        ### Create opener with Google-friendly user agent
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    
        ### Open page & generate soup
        ### the "start" variable will be used to iterate through 10 pages.
        for start in range(0,10):
            url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start*10)
            page = opener.open(url)
            soup = BeautifulSoup(page)
    
            ### Parse and find
            ### Looks like google contains URLs in <cite> tags.
            ### So for each cite tag on each page (10), print its contents (url)
            for cite in soup.findAll('cite'):
                print cite.text
    

    Output:

    stackoverflow.com/
    stackoverflow.com/questions
    stackoverflow.com/unanswered
    stackoverflow.com/users
    meta.stackoverflow.com/
    blog.stackoverflow.com/
    chat.meta.stackoverflow.com/
    ...
    

    Of course, you could append each result to a list so you can parse it for subdomains. I just got into Python and scraping a few days ago, but this should get you started.

    0 讨论(0)
  • 2020-12-30 17:19

    The Google Custom Search API can deliver results in ATOM XML format

    Getting Started with Google Custom Search

    0 讨论(0)
提交回复
热议问题