Question
I'm doing some research work into content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.
Examples include www.housingmaps.com and the now-closed www.chicagocrime.org.
If there is a URL that can be used for reference, that would be perfect!
Answer 1:
For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.
For example, to extract the categories you could:
// Scrape category data from the Craigslist home page.
// $h is an instance of the author's http helper class (definition not shown).
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";
if (!$h->fetch($url, 300)) {
    echo "<h2>There is a problem with the http request!</h2>";
    exit();
}
// Each category looks like: <option value="ccc">community
// Group 1 captures the abbreviation ("ccc"), group 2 the name ("community").
preg_match_all('/<option value="([^"]*)">([^<\n]*)/', $h->body, $categoryTemp);
$catNames = $categoryTemp[2];
// Return the array of category names, or an empty array if nothing matched.
if (count($catNames) > 0) {
    return $catNames;
} else {
    return array();
}
Answer 2:
An alternative to scraping (and getting blocked), using frames, or Google search is to use a data broker or data exchange service.
3taps is a beta service which provides a developer API to many services, including Craigslist. Their team also built Craiggers to demonstrate a use case of this API. Founder Greg Kidd told me that 3taps harvests Craigslist data from non-Craigslist sources where it is already indexed and cached so that it doesn't put any strain on Craigslist. Other 3taps data sources are also listed, but these stats make it unclear whether they're currently supported. Their goal is to Democratize the Exchange of Data.
80legs is a crawling service which provides a less real-time but potentially more comprehensive option. Their data dump-style service includes crawl packages for hundreds of sites, including Amazon, Facebook, and Zillow (though not, I believe, Craigslist at present). Their newer effort, Datafiniti, provides a search engine over this type of data.
Answer 3:
Another option would be to use YQL or Yahoo Pipes to gather the results.
Craiglook and HousingMaps use them to gather results.
Answer 4:
The problem with any scraping solution of craigslist is that they automatically block any IP address that accesses them 'too much' - which usually means more than a few hundred times a day. So as soon as your tool got any kind of popularity, it would be shut down.
That's why the only Craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or Google (like allofcraigs.com).
3taps gathers Craigslist listings from third-party sources 'in the wild', such as the Google and Bing caches.
Edit: this answer is no longer up to date. Most classifieds search engines that include results from craigslist now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.
Answer 5:
I've done a lot of data aggregation from sites like eBay, Craigslist, and Zillow. Each source requires a different method to aggregate the data.
For Craigslist, I got the data using RSS feeds. I only wanted specific data in specific categories in specific cities, and the RSS feeds worked fine for me. If you're trying to get all the data, and you overuse the RSS feeds, Craigslist will likely ban you. Also, you won't be able to get all the data from Craigslist feeds, because the feeds show most of the data but not all. If your reliability doesn't need to be 100%, then RSS is the easiest way to do it.
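As a rough illustration of the RSS approach, here is a minimal Python sketch that pulls the title, link, and description out of each feed item. The sample feed, its URLs, and the helper names `local_name` and `parse_items` are made up for this example; a real Craigslist feed may use different elements or namespaces, so treat this as a starting point rather than a working client.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed. A real feed would be downloaded from a URL instead,
# and its exact layout is an assumption here.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>example listings</title>
    <item>
      <title>Sunny 2br apartment</title>
      <link>http://example.org/apa/1.html</link>
      <description>Close to transit.</description>
    </item>
    <item>
      <title>Studio near downtown</title>
      <link>http://example.org/apa/2.html</link>
      <description>Utilities included.</description>
    </item>
  </channel>
</rss>
"""


def local_name(tag):
    """Strip any XML namespace so namespaced and plain RSS items look alike."""
    return tag.rsplit("}", 1)[-1]


def parse_items(feed_xml):
    """Return one dict per <item>, keeping only title, link and description."""
    root = ET.fromstring(feed_xml)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        items.append({key: fields.get(key, "") for key in ("title", "link", "description")})
    return items


if __name__ == "__main__":
    for item in parse_items(SAMPLE_FEED):
        print(item["title"], "->", item["link"])
```

Because only the fields present in the feed can be extracted, anything the feed omits would still have to come from another source.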
Answer 6:
I am guessing screen scraping.
I do not think there is a Craigslist API yet, and I do not think they will release one.
So the only way to go is to scrape the data. You could use the cURL library and heavy regex to scrape the data you want off a page.
If you see a link, access that page, scrape the new page, get the data, and show it or store it.
And so on.
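The fetch, scrape, follow-links loop described above could be sketched like this in Python. This is only an illustration, not how any particular mashup works: `LinkCollector`, `crawl`, and the in-memory `FAKE_SITE` are hypothetical names invented for the example, and the `fetch` argument stands in for whatever HTTP client (cURL or otherwise) you actually use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, store it, then queue the links it contains."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # fetch returns the page body, or None on failure
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B again</a><p>listing A</p>',
    "http://example.org/b.html": "<p>listing B</p>",
}

if __name__ == "__main__":
    for url in crawl("http://example.org/", FAKE_SITE.get):
        print(url)
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the network layer, and `max_pages` puts a hard cap on how many requests a single run can make.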
Answer 7:
I just made one:
http://cdn.javascriptmvc.com/videos/jobs/craigslist.js
That produces:
http://cdn.javascriptmvc.com/videos/jobs/craigslist.html
Must be run in rhino.
Answer 8:
While continuing to research this area, I found an awesome site that does partly what I'm interested in:
Crazedlist
It uses the HTTP Referer of the client browser, which is interesting but not ideal. The author of the site also claims to have royally ticked off CL, which I understand. The site also gives a clear example of a business need similar to mine, which is why I'm interested in this topic.
Source: https://stackoverflow.com/questions/237124/how-do-craigslist-mashups-get-data