Anyone know of a way to programatically detect a parked web page? That is, those pages that you accidentally type in (or intentionally sometimes) and they are hosted by a domain parking service with nothing but ads on them.
I am working on a linking network and want to make sure that sites that expire don't end up getting snatched by someone else and then being a parked page.
Here is a test that I think may catch a decent number of them. It takes advantage of the fact you don't actually want to have real web sites up for your parked domains. It looks for the wildcarding of both subdomain and path. Lets say we have this URL in our system
http://www.example.com/method-to-detect-parked.
First I would check the actual URL and hash it or grab a copy for comparison.
My second check would be to
http://random.example.com/random
If it matches the original link or even succeeds, you have a pretty good indicator that the page is parked. If it fails I might check both the subdomain and path individually. If the page randomly changes some elements, you may want to choose a few items to compare. For example make a list of links included in the page and compare those or maybe the title tag.
I would say that you'll have to examine the WHOIS records for the sites in question and/or the actual content of the pages and develop some heuristics as to what constitutes a "parked page".
Take goooogle.com, looking at their WHOIS record shows that they are owned by "Privacy Protection" and that their DNS servers are ns1/ns2.fastpark.net. If you look at the source for the site, they're silly enough to have a CSS file named "style_park.css" :)
All in all, I don't think you'll be able to come up with a generic way to do it. You'll probably end up with some ever evolving rule base or blacklist
You could just rely on your users to "Report this link"... which would put it into a queue to review later?
Look at the creation date of the dns/whois record, and compare it to the add date of the link. If the DNS is newer, that's a link that needs manual checking.
Or: check http://example.com/ and http://example.com/xxxxxxrandomstringxxxxx . If those two pages are identical, you've got some sort of problem that needs manual checking. Either the primary page you wanted to link to is broken, or the domain is parked and all pages return the same value. This test is not 100%, because some parked pages echo back elements from the URL.
If you just want to check an existing website, a service like http://www.linkalarm.com/ does this well.
来源:https://stackoverflow.com/questions/490025/method-to-detect-a-parked-page