Question
The need has arisen within our organisation to monitor, on a daily basis, where our site appears (both organic and PPC) on page 1 of Google for certain keywords, and where a key competitor appears for those same keywords.
In the immediate short term a colleague is doing this by hitting Google manually and jotting down the results. Yep.
It occurs to us we can write a script (e.g. using C#) to do this.
I know Analytics will tell us an awful lot but it doesn't note the competitor's position, plus I don't think it has other data we want.
Question is, is there an existing basic tool which does this (for free, I guess)? And if we write it ourselves, where do we start, and are there obvious pitfalls to avoid (for example, can Google detect and block automated requests)?
Edit: To those answers suggesting using the Google API - this post over on Google Groups would appear to rule that out completely:
The Custom Search API requires you to set up a Custom Search Engine (CSE) which must be set to search particular sites rather than the entire web.
The Custom Search API TOS explicitly prohibit you from making automated queries, which would be key to "regularly and accurately" measuring the SERP of a site.
Jeremy R. Geerdes
Answer 1:
Google actually does prohibit scraping of their search results without "human" interaction (see 5.3, and here). I'm not advocating you do so. The concern they state is that having too many people doing this could cause issues (how many search terms would you look for?), as well as possibly gaming the rankings themselves.
Having said that, you could possibly run a search and iterate through the results in the returned HTML, as I have below. Or, you could try some of the services available to help you do this:
http://www.googlerankings.com/
(Note: I am in no way affiliated with this website, it is only an example.)
I am sure there are plenty of SEO companies that would also provide this as a service. I would recommend exploring those options before getting into scraping.
I went ahead and threw together a quick C# class that pulls basic information from a Google search results page. It uses the HTML Agility Pack mentioned in another answer, a pretty nifty open-source library for iterating over web pages that lets you use XPath to find what you are looking for in the page. In this case, "//span//cite" gives you the URL of each result, so this example uses that.
To use, do the following:
GoogleRankScrape.Do(
"google scraping",
"C:\\rankings\\",
"//span//cite",
new string[] {"stackoverflow.com","wikipedia.org","okeydoke.org"},
100
);
This could be wrapped in a C# console app, with the Windows Task Scheduler running that app once a day. There are many other ways this could go; this is only an example.
The GoogleRankScrape code follows:
using System;
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

class GoogleRankScrape
{
    // query:   the search phrase
    // dest:    directory that receives rankings.txt and the per-run files
    // path:    XPath selecting the nodes that hold each result's URL
    // matches: domains to look for in the results
    // depth:   number of results to request
    public static void Do(string query, string dest, string path, string[] matches, int depth)
    {
        // Build paths under dest instead of changing the process-wide working directory.
        Directory.CreateDirectory(dest);
        DateTime now = DateTime.Now;
        string stamp = now.ToString("yyyyMMddHHmmss");
        string rankingsPath = Path.Combine(dest, "rankings.txt");
        string pagePath = Path.Combine(dest, "page" + stamp + ".html");
        string outputPath = Path.Combine(dest, "output" + stamp + ".txt");

        // Escape the query so spaces and punctuation survive in the URL.
        string url = "http://www.google.com/search?q=" + Uri.EscapeDataString(query) + "&num=" + depth;

        // Fetch the results page and keep a copy on disk for later inspection.
        HtmlDocument doc = new HtmlWeb().Load(url);
        doc.Save(pagePath);

        // Domains already logged, so each one is reported at its first (best) rank only.
        HashSet<string> found = new HashSet<string>();

        // Opening in append mode creates rankings.txt if it does not exist yet.
        using (StreamWriter rankings = new StreamWriter(rankingsPath, true))
        using (StreamWriter output = new StreamWriter(outputPath, false))
        {
            rankings.WriteLine("Date: " + now.ToString("f"));
            rankings.WriteLine("Date: " + stamp); // compact stamp, matches the saved file names
            rankings.WriteLine("Depth: " + depth);
            rankings.WriteLine("Query: " + query);

            // SelectNodes returns null when nothing matches the XPath.
            HtmlNodeCollection cites = doc.DocumentNode.SelectNodes(path);
            if (cites != null)
            {
                int rank = 1;
                foreach (HtmlNode node in cites)
                {
                    // Keep only the part before the first "/" (or, failing that, the first space).
                    string line = node.InnerText;
                    int cut = line.IndexOf('/');
                    if (cut <= 0) cut = line.IndexOf(' ');
                    if (cut > 0) line = line.Substring(0, cut);
                    output.WriteLine(line);

                    foreach (string match in matches)
                    {
                        // Add returns false if the domain was already logged.
                        if (line.Contains(match) && found.Add(match))
                        {
                            rankings.WriteLine("Rank: " + rank + ", " + match);
                        }
                    }
                    rank++;
                }
            }

            // Report every domain that was not seen, including when no nodes matched at all.
            foreach (string match in matches)
            {
                if (!found.Contains(match))
                {
                    rankings.WriteLine("Rank: not ranked, " + match);
                }
            }
            rankings.WriteLine("==========");
        }
    }
}
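For the console app idea, here is a rough sketch of what the wrapper might look like. It is not part of the class above: DailyRankJob, its Scrape delegate and the depth of 10 (roughly page 1) are all names and values I made up for illustration. The delegate has the same parameter order as GoogleRankScrape.Do, so that method can be passed straight in, and a failure on one keyword doesn't stop the others.

```csharp
using System;
using System.Collections.Generic;

static class DailyRankJob
{
    // Hypothetical delegate; same parameter order as GoogleRankScrape.Do:
    // query, dest, path, matches, depth.
    public delegate void Scrape(string query, string dest, string path, string[] matches, int depth);

    // Runs one scrape per keyword and returns the keywords that failed,
    // so a single bad request does not abort the rest of the run.
    public static List<string> Run(IEnumerable<string> keywords, string dest, string[] domains, Scrape scrape)
    {
        var failed = new List<string>();
        foreach (string raw in keywords)
        {
            string keyword = raw.Trim();
            if (keyword.Length == 0) continue; // skip blank entries in the keyword list

            try
            {
                // Depth 10 is an assumption: roughly the first page of results.
                scrape(keyword, dest, "//span//cite", domains, 10);
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine("Failed: " + keyword + " (" + ex.Message + ")");
                failed.Add(keyword);
            }
        }
        return failed;
    }
}
```

From Main you would call something like DailyRankJob.Run(File.ReadAllLines("keywords.txt"), @"C:\rankings\", new[] { "oursite.example", "competitor.example" }, GoogleRankScrape.Do), where keywords.txt and the two domains are placeholders, and then point a daily scheduled task at the compiled exe.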
Answer 2:
You could develop a simple C# program using Html Agility Pack. It's a very good open source library to manipulate HTML, and it's very easy to use.
Regarding Google blocking automated requests, if you are only going to check once a day and there are not a lot of keywords to check, I don't think you'll have any problem.
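If you do end up with more than a couple of keywords, one way to keep the volume low is to pause between checks instead of firing them all at once. A minimal sketch, assuming a hypothetical PoliteRunner helper and a pause length you pick yourself; I don't know what Google's actual thresholds are, so this is not a guarantee of anything.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class PoliteRunner
{
    // Calls check once per keyword, sleeping for pause between calls
    // (but not after the last one).
    public static void RunSpaced(IList<string> keywords, Action<string> check, TimeSpan pause)
    {
        for (int i = 0; i < keywords.Count; i++)
        {
            check(keywords[i]);
            if (i < keywords.Count - 1)
            {
                Thread.Sleep(pause);
            }
        }
    }
}
```

For example, PoliteRunner.RunSpaced(keywords, CheckKeyword, TimeSpan.FromMinutes(1)) would spread five keywords over about four minutes, where CheckKeyword stands for whatever method does the actual lookup.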
Answer 3:
Perhaps a look at the Google search API might give you a hint on how to access searches directly?
I haven't tried it myself, but it could also be a solution. See the search API.
Answer 4:
Did you consider using the stats from Google Webmaster Tools?
They provide detailed reports on your site's ranking for given search phrases, amongst other useful features.
Admittedly those reports don't provide your competitors' positions, so using the Google Search API would be the best way to get all the data you need.
Answer 5:
If you have a Mac then you can use Fake. It's incredible.
http://fakeapp.com/
If you only have Windows then I'd write it myself. The best way to do it would be to write jQuery to snatch what you want. It wouldn't take 30 minutes to do it using jQuery. You can run a scheduled task against your page and you'll have the solution you wanted.
Source: https://stackoverflow.com/questions/4689671/produce-a-script-to-hit-google-once-a-day-and-log-our-serp-position