Question
I'm writing a bash script that extracts all the links from a Google results page. I fetch the page with the w3m utility and this script:
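#!/bin/bash
# Performs a Google search for the word given as the first argument
# and saves the page rendered by w3m to .google.
word=$1
touch .google
# Quote the variable so the test also works when it is empty or contains spaces.
if [ -z "$word" ]
then
    echo "Search word missing!" >&2
    echo "Aborting..." >&2
    exit 1
fi
a="www.google.com/search?q="
search=$a$word
w3m -no-cookie "$search" > .google
sleep 1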
Next, I have to get all the sites from this page. I was thinking of taking every string that starts with www. and ends with /
grep -wo "www[^/]*" .google > .temp
The problem is that this misses many links that don't start with www, and it risks breaking whenever a site doesn't end with /.
What would be a better way to get the URLs from this response?
Answer 1:
Link extraction is a hard problem. However, the lynx program has a handy -dump option that lets you skip most (or all) of the HTML parsing.
Specifically, note the References section at the bottom of the output. You could take the output from that line onward and strip off the leading reference numbers:
$ lynx -dump 'http://www.seomoz.org/'
#[1]RSS 2.0 [2]publisher
[3]SEOmoz
* [4]Log in
* [5]Sign up
* [6]Help
+ [7]Help Resources
+ [8]Support Forums
+ [9]Request a Feature
+ [10]Contact Us
* [11]Features
* [12]Pricing & Plans
* [13]Community
+ [14]SEO Blog
+ [15]YOUmoz User Blog
+ [16]Top Users
+ [17]Events
+ [18]Recommended Companies
* [19]Resources
+ [20]Learn SEO
+ [21]SEO Tools
+ [22]PRO Q&A Forum
+ [23]Mozscape API
* [24]Blog
+ [25]SEO Blog
+ [26]YOUmoz User Blog
* [27]About
+ [28]Our TAGFEE Mission
+ [29]Meet the Mozzers
+ [30]Contact Us
+ [31]Join Our Team
+ [32]Press & Awards
+ [33]Events
Search SEOmoz
____________________ Search
SEO & Social Monitoring
Made Simple.
SEOmoz PRO combines SEO management, social media monitoring, actionable
recommendations, and so much more in one easy-to-use platform. Try it
free for 30 days.
[34]Try it for Free!
[35]Take a tour of SEOmoz PRO
or see [36]plans & pricing
* Campaign Overview
* Social Dashboard
* Crawl Diagnostics
* Dashboard
* Google Analytics
* Link Analysis
Loved By...
* Zillow
* Disney
* Overstock
* Best Buy
* Yelp
* Sun Microsystems
Roger Mozbot
Be My Buddy...
* [37]RSS
* [38]Twitter
* [39]Facebook
* [40]Google+
Effectively Manage Your SEO and Monitor Your Social Media
[41]Link Analysis
Analyze links and track key performance metrics in an efficient
all-in-one dashboard.
[42]Identify SEO Issues
Identify critical SEO issues and get actionable recommendations.
[43]Monitor Changes
Automatically monitor changes to your rankings and take control of your
organic traffic.
Avinash Kaushik
"SEOmoz tools provide best of class data. Their tools are a
must-have for marketers looking to optimize their organic search
results."
Avinash Kaushik,
Author, Web Analytics 1.0: An Hour A Day
Patrick Altoft
"SEOmoz has enabled us to scale our link-building process quickly
without compromising on quality."
Patrick Altoft,
CEO, Branded3
Latest from the SEOmoz Blog
__________________________________________________________________
[44]jennita
[45]Winners of #MozCation 2012
Posted by [46]jennita on 08/04/2012
Whoa. Ever have one of those times where your expectations are
completely blown out of the water? Well that's what happened during
this year's nomination for a MozCation. Wait, wait, wait, before I get
too far ahead of myself, I...
[47]Read Full Entry
13
2
[48]13 Comments
__________________________________________________________________
Latest from the Community YouMoz Blog
__________________________________________________________________
[49]larry.kim
[50]Does SEO Even Work for Small Businesses?
Posted by [51]larry.kim on 08/03/2012
Clicks on paid search listings beat out organic listings by nearly a
2:1 margin for keywords with high commercial intent in the US. Is SEO
still a viable marketing tactic for the average small business owner?
[52]Read Full Entry
17
3
[53]28 Comments
__________________________________________________________________
Voted Best SEO Tool 2010!
[54]Try it for Free!
Looking for SEO consulting?
SEOmoz doesn't provide consulting, but our friends at [55]Distilled
still do. Rock on!
Copyright © 1996-2012 SEOmoz. All Rights Reserved.
Product and Tools
* [56]SEOmoz PRO
* [57]Pricing and Plans
* [58]Open Site Explorer
* [59]SEO Toolbar
* [60]Mozscape API
* [61]More SEO Tools
Company
* [62]About
* [63]SEO Blog
* [64]YOUmoz Blog
* [65]Affiliate Program
* [66]Terms & Privacy Policy
* [67]PRO Perks
Popular Content
* [68]Link Building
* [69]Reputation Management
* [70]Analytics
* [71]Social Media
* [72]Content & Blogging
* [73]See All Categories
Stay in Touch
*
+ [74]RSS
+ [75]Twitter
+ [76]Facebook
+ [77]LinkedIn
*
SEOmoz
119 Pine St. Suite 400
Seattle, WA 98101
206.632.3171
* [78]Contact Us
* [79]Sitemap
References
1. http://feeds.feedburner.com/seomoz
2. https://plus.google.com/112544075040456048636
3. http://www.seomoz.org/
4. https://www.seomoz.org/users/login
5. https://www.seomoz.org/users/register
6. http://www.seomoz.org/
7. http://www.seomoz.org/help
8. http://www.seomoz.org/q
9. http://seomoz.zendesk.com/forums/293194-seomoz-PRO-feature-requests
10. http://www.seomoz.org/about/contact
11. http://www.seomoz.org/features
12. http://www.seomoz.org/plans
13. http://www.seomoz.org/community
14. http://www.seomoz.org/blog
15. http://www.seomoz.org/ugc
16. http://www.seomoz.org/users
17. http://www.seomoz.org/about/events
18. http://www.seomoz.org/article/recommended
19. http://www.seomoz.org/resources
20. http://www.seomoz.org/learn-seo
21. http://www.seomoz.org/tools
22. http://www.seomoz.org/q
23. http://www.seomoz.org/api
24. http://www.seomoz.org/blog
25. http://www.seomoz.org/blog
26. http://www.seomoz.org/ugc
27. http://www.seomoz.org/about
28. http://www.seomoz.org/about/mission
29. http://www.seomoz.org/about/team
30. http://www.seomoz.org/about/contact
31. http://www.seomoz.org/about/jobs
32. http://www.seomoz.org/about/press
33. http://www.seomoz.org/about/seo-events
34. http://www.seomoz.org/cart/freetrial?pg=home
35. http://www.seomoz.org/features
36. http://www.seomoz.org/plans
37. http://feeds.feedburner.com/seomoz
38. http://twitter.com/seomoz
39. http://www.facebook.com/SEOmoz
40. https://plus.google.com/112544075040456048636?prsrc=3
41. http://www.seomoz.org/features
42. http://www.seomoz.org/features
43. http://www.seomoz.org/features
44. http://www.seomoz.org/users/profile/81197
45. http://www.seomoz.org/blog/winners-mozcation-2012
46. http://www.seomoz.org/users/profile/81197
47. http://www.seomoz.org/blog/winners-mozcation-2012
48. http://www.seomoz.org/blog/winners-mozcation-2012#comments
49. http://www.seomoz.org/users/profile/402613
50. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses
51. http://www.seomoz.org/users/profile/402613
52. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses
53. http://www.seomoz.org/ugc/does-seo-even-work-for-small-businesses#comments
54. http://www.seomoz.org/cart/freetrial?pg=features
55. http://www.seomoz.org/dp/distilled
56. http://www.seomoz.org/features
57. http://www.seomoz.org/plans
58. http://www.opensiteexplorer.org/
59. http://www.seomoz.org/seo-toolbar
60. http://www.seomoz.org/api
61. http://www.seomoz.org/tools
62. http://www.seomoz.org/about
63. http://www.seomoz.org/blog
64. http://www.seomoz.org/ugc
65. http://www.seomoz.org/dp/seomoz-pro-affiliate-program
66. http://www.seomoz.org/terms-and-privacy
67. http://www.seomoz.org/pro-perks
68. http://www.seomoz.org/blog/category/4
69. http://www.seomoz.org/blog/category/19
70. http://www.seomoz.org/blog/category/8
71. http://www.seomoz.org/blog/category/18
72. http://www.seomoz.org/blog/category/1
73. http://www.seomoz.org/blog
74. http://feeds.feedburner.com/seomoz
75. http://twitter.com/seomoz
76. http://www.facebook.com/SEOmoz
77. http://www.linkedin.com/groups?about=&gid=2976409&trk=anet_ug_grppro
78. http://www.seomoz.org/about/contact
79. http://www.seomoz.org/sitemap
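The post-processing described above could look roughly like this. It is only a sketch, assuming the dump has the layout shown (a line reading exactly References, followed by numbered URLs); the function name extract_refs is made up for illustration and is not part of lynx:

```shell
#!/bin/bash
# extract_refs (hypothetical helper): read `lynx -dump` output on stdin and
# print one URL per line from the "References" section.
# Assumption: the section starts with a line that is exactly "References" and
# each entry looks like "  12. http://example.org/".
extract_refs() {
    awk '
        /^References$/ { found = 1; next }          # start of the link list
        found && $1 ~ /^[0-9]+\.$/ { print $2 }     # drop the "12." prefix
    '
}

# Intended use (needs lynx and network access, so it is left commented out):
#   lynx -dump 'http://www.seomoz.org/' | extract_refs | sort -u
```

Piping the result through sort -u removes the duplicates that appear when a page links to the same URL several times, at the cost of losing the original order.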
Answer 2:
You might want to grep for <a href=" and take the value up to the next quote character, then filter out all the javascript stuff. This solution is probably not foolproof either, though.
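A rough sketch of that idea, assuming the input is the raw HTML source rather than the text that w3m renders (the .google file from the question would not contain any href attributes). The function name extract_hrefs is hypothetical, and the pattern only handles double-quoted attribute values:

```shell
#!/bin/bash
# extract_hrefs (hypothetical helper): read raw HTML on stdin and print the
# value of every double-quoted href attribute, skipping javascript: links.
# Assumption: grep supports -o (GNU and BSD grep do).
extract_hrefs() {
    grep -o 'href="[^"]*"' \
        | sed -e 's/^href="//' -e 's/"$//' \
        | grep -v -i '^javascript:'
}

# Intended use (needs curl and network access, so it is left commented out):
#   curl -s 'http://www.seomoz.org/' | extract_hrefs
```

Because it matches on href alone, this also picks up stylesheet and other link tags, and it returns relative paths unchanged, so the output may still need filtering.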
Source: https://stackoverflow.com/questions/11820061/link-extraction-from-a-google-page-in-bash