Link extraction from a Google page in Bash

拈花ヽ惹草 submitted on 2019-12-22 18:19:52

Question


I'm writing a Bash script that extracts all the links from a Google results page. I fetch the page with the w3m utility, using this script:

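#!/bin/bash
# Performs a Google search for the word given as the first argument
# and saves the rendered page to .google

word=$1

# Quote the variable so the test also works for empty or multi-word input
if [ -z "$word" ]
then
    echo "Search word missing!" >&2
    echo "Aborting..." >&2
    exit 1
fi

a="www.google.com/search?q="
search=$a$word

# The redirection creates .google, so no separate touch is needed
w3m -no-cookie "$search" > .google

sleep 1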

Next, I need to extract all the sites from this page. My idea was to take every string that starts with www. and ends with /:

echo `grep -wo "www[^/]*" .google`> .temp

The problem is that this misses many links that don't start with www, and it risks breaking when a site's address doesn't end with /.

Is there a better way to get the URLs from this response?


Answer 1:


Link extraction is a hard problem. However, the lynx program has a handy -dump option that lets you skip most (or all) of the HTML parsing.

Specifically, note the References section at the bottom of the output. You could take everything from the References line onward and strip off the leading reference numbers:

$ lynx -dump 'http://www.seomoz.org/'
   #[1]RSS 2.0 [2]publisher

   [3]SEOmoz
     * [4]Log in
     * [5]Sign up
     * [6]Help
          + [7]Help Resources
          + [8]Support Forums
          + [9]Request a Feature
          + [10]Contact Us

     * [11]Features
     * [12]Pricing & Plans
     * [13]Community
          + [14]SEO Blog
          + [15]YOUmoz User Blog
          + [16]Top Users
          + [17]Events
          + [18]Recommended Companies
     * [19]Resources
          + [20]Learn SEO
          + [21]SEO Tools
          + [22]PRO Q&A Forum
          + [23]Mozscape API
     * [24]Blog
          + [25]SEO Blog
          + [26]YOUmoz User Blog
     * [27]About
          + [28]Our TAGFEE Mission
          + [29]Meet the Mozzers
          + [30]Contact Us
          + [31]Join Our Team
          + [32]Press & Awards
          + [33]Events

   Search SEOmoz

   ____________________ Search

SEO & Social Monitoring

Made Simple.

   SEOmoz PRO combines SEO management, social media monitoring, actionable
   recommendations, and so much more in one easy-to-use platform. Try it
   free for 30 days.
   [34]Try it for Free!

   [35]Take a tour of SEOmoz PRO
   or see [36]plans & pricing

     * Campaign Overview
     * Social Dashboard
     * Crawl Diagnostics
     * Dashboard
     * Google Analytics
     * Link Analysis

Loved By...

     * Zillow
     * Disney
     * Overstock
     * Best Buy
     * Yelp
     * Sun Microsystems


   Roger Mozbot

Be My Buddy...

     * [37]RSS
     * [38]Twitter
     * [39]Facebook
     * [40]Google+

Effectively Manage Your SEO and Monitor Your Social Media

   [41]Link Analysis

   Analyze links and track key performance metrics in an efficient
   all-in-one dashboard.
   [42]Identify SEO Issues

   Identify critical SEO issues and get actionable recommendations.
   [43]Monitor Changes

   Automatically monitor changes to your rankings and take control of your
   organic traffic.
   Avinash Kaushik

     "SEOmoz tools provide best of class data. Their tools are a
     must-have for marketers looking to optimize their organic search
     results."

Avinash Kaushik,

   Author, Web Analytics 1.0: An Hour A Day
   Patrick Altoft

     "SEOmoz has enabled us to scale our link-building process quickly
     without compromising on quality."

Patrick Altoft,

   CEO, Branded3

Latest from the SEOmoz Blog
     __________________________________________________________________

   [44]jennita

[45]Winners of #MozCation 2012

   Posted by [46]jennita on 08/04/2012
   Whoa. Ever have one of those times where your expectations are
   completely blown out of the water? Well that's what happened during
   this year's nomination for a MozCation. Wait, wait, wait, before I get
   too far ahead of myself, I...
   [47]Read Full Entry

   13

   2
   [48]13 Comments
     __________________________________________________________________

Latest from the Community YouMoz Blog
     __________________________________________________________________

   [49]larry.kim

[50]Does SEO Even Work for Small Businesses?

   Posted by [51]larry.kim on 08/03/2012
   Clicks on paid search listings beat out organic listings by nearly a
   2:1 margin for keywords with high commercial intent in the US. Is SEO
   still a viable marketing tactic for the average small business owner?
   [52]Read Full Entry

   17

   3
   [53]28 Comments
     __________________________________________________________________

Voted Best SEO Tool 2010!

   [54]Try it for Free!

Looking for SEO consulting?

   SEOmoz doesn't provide consulting, but our friends at [55]Distilled
   still do. Rock on!

   Copyright © 1996-2012 SEOmoz. All Rights Reserved.

Product and Tools

     * [56]SEOmoz PRO
     * [57]Pricing and Plans
     * [58]Open Site Explorer
     * [59]SEO Toolbar
     * [60]Mozscape API
     * [61]More SEO Tools

Company

     * [62]About
     * [63]SEO Blog
     * [64]YOUmoz Blog
     * [65]Affiliate Program
     * [66]Terms & Privacy Policy
     * [67]PRO Perks

Popular Content

     * [68]Link Building
     * [69]Reputation Management
     * [70]Analytics
     * [71]Social Media
     * [72]Content & Blogging
     * [73]See All Categories

Stay in Touch

     *
          + [74]RSS
          + [75]Twitter
          + [76]Facebook
          + [77]LinkedIn
     *


    SEOmoz
    119 Pine St. Suite 400
    Seattle, WA 98101
    206.632.3171
     * [78]Contact Us
     * [79]Sitemap

References

   1. http://feeds.feedburner.com/seomoz
   2. https://plus.google.com/112544075040456048636
   3. http://www.seomoz.org/
   4. https://www.seomoz.org/users/login
   5. https://www.seomoz.org/users/register
   6. http://www.seomoz.org/
   7. http://www.seomoz.org/help
   8. http://www.seomoz.org/q
   9. http://seomoz.zendesk.com/forums/293194-seomoz-PRO-feature-requests
  10. http://www.seomoz.org/about/contact
  11. http://www.seomoz.org/features
  12. http://www.seomoz.org/plans
  13. http://www.seomoz.org/community
  14. http://www.seomoz.org/blog
  15. http://www.seomoz.org/ugc
  16. http://www.seomoz.org/users
  17. http://www.seomoz.org/about/events
  18. http://www.seomoz.org/article/recommended
  19. http://www.seomoz.org/resources
  20. http://www.seomoz.org/learn-seo
  21. http://www.seomoz.org/tools
  22. http://www.seomoz.org/q
  23. http://www.seomoz.org/api
  24. http://www.seomoz.org/blog
  25. http://www.seomoz.org/blog
  26. http://www.seomoz.org/ugc
  27. http://www.seomoz.org/about
  28. http://www.seomoz.org/about/mission
  29. http://www.seomoz.org/about/team
  30. http://www.seomoz.org/about/contact
  31. http://www.seomoz.org/about/jobs
  32. http://www.seomoz.org/about/press
  33. http://www.seomoz.org/about/seo-events
  34. http://www.seomoz.org/cart/freetrial?pg=home
  35. http://www.seomoz.org/features
  36. http://www.seomoz.org/plans
  37. http://feeds.feedburner.com/seomoz
  38. http://twitter.com/seomoz
  39. http://www.facebook.com/SEOmoz
  40. https://plus.google.com/112544075040456048636?prsrc=3
  41. http://www.seomoz.org/features
  42. http://www.seomoz.org/features
  43. http://www.seomoz.org/features
  44. http://www.seomoz.org/users/profile/81197
  45. http://www.seomoz.org/blog/winners-mozcation-2012
  46. http://www.seomoz.org/users/profile/81197
  47. http://www.seomoz.org/blog/winners-mozcation-2012
  48. http://www.seomoz.org/blog/winners-mozcation-2012#comments
  49. http://www.seomoz.org/users/profile/402613
  50. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses
  51. http://www.seomoz.org/users/profile/402613
  52. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses
  53. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses#comments
  54. http://www.seomoz.org/cart/freetrial?pg=features
  55. http://www.seomoz.org/dp/distilled
  56. http://www.seomoz.org/features
  57. http://www.seomoz.org/plans
  58. http://www.opensiteexplorer.org/
  59. http://www.seomoz.org/seo-toolbar
  60. http://www.seomoz.org/api
  61. http://www.seomoz.org/tools
  62. http://www.seomoz.org/about
  63. http://www.seomoz.org/blog
  64. http://www.seomoz.org/ugc
  65. http://www.seomoz.org/dp/seomoz-pro-affiliate-program
  66. http://www.seomoz.org/terms-and-privacy
  67. http://www.seomoz.org/pro-perks
  68. http://www.seomoz.org/blog/category/4
  69. http://www.seomoz.org/blog/category/19
  70. http://www.seomoz.org/blog/category/8
  71. http://www.seomoz.org/blog/category/18
  72. http://www.seomoz.org/blog/category/1
  73. http://www.seomoz.org/blog
  74. http://feeds.feedburner.com/seomoz
  75. http://twitter.com/seomoz
  76. http://www.facebook.com/SEOmoz
  77. http://www.linkedin.com/groups?about=&gid=2976409&trk=anet_ug_grppro
  78. http://www.seomoz.org/about/contact
  79. http://www.seomoz.org/sitemap
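
Taking the References section and stripping the numbers could be sketched roughly as follows. The function name extract_refs is made up for illustration, and the sketch assumes the dump contains a line consisting of exactly "References" followed by entries of the form "N. URL", as in the output above; a page whose body text contains such a line would confuse it.

```shell
#!/bin/bash
# extract_refs (hypothetical helper): reads `lynx -dump` output on stdin
# and prints one URL per line, taken from the References section.
extract_refs() {
    awk '
        # Start collecting once the "References" heading line is seen
        /^References$/ { in_refs = 1; next }
        # Inside the section, keep lines whose first field looks like "12."
        in_refs && $1 ~ /^[0-9]+\.$/ { print $2 }
    '
}

# Example (needs lynx and network access, so it is left commented out):
#   lynx -dump 'http://www.seomoz.org/' | extract_refs | sort -u
```

Piping the result through sort -u removes the duplicates that appear when a page links to the same URL several times. lynx also has a -listonly option that, combined with -dump, prints only the list of links, which may make the first awk rule unnecessary.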



Answer 2:


You might want to grep for <a href=" and take the value up to the next quotation mark, then filter out all the JavaScript links. This solution is probably not foolproof either, though.
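
A minimal sketch of this idea, assuming the raw HTML (not the rendered text) is available on stdin and that each link is written as a lowercase href with a double-quoted value on a single line. The function name extract_hrefs is hypothetical, and the -o option is an extension supported by GNU and BSD grep rather than part of POSIX.

```shell
#!/bin/bash
# extract_hrefs (hypothetical helper): reads raw HTML on stdin and prints
# the href value of each <a> tag, skipping javascript: pseudo-links.
extract_hrefs() {
    # 1. Print each opening <a ...> tag up to the closing quote of its href
    grep -o '<a [^>]*href="[^"]*"' |
        # 2. Keep only the text between the quotes
        sed 's/.*href="\([^"]*\)"/\1/' |
        # 3. Drop javascript: links
        grep -v '^javascript:'
}

# Example (assumes page.html holds the downloaded HTML source):
#   extract_hrefs < page.html
```

Single-quoted or unquoted attribute values, uppercase tags, and tags split across lines are all missed, which is one reason this approach is not foolproof. Relative links such as /path are printed as they appear, without being resolved against the page's address.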



Source: https://stackoverflow.com/questions/11820061/link-extraction-from-a-google-page-in-bash
