Let's say you're given http://nytimes.com. How would you pull out the "main" image?
The reason I'm asking is because Flipboard is able to grab the main image from a
I don't believe there's a standard method. You could start by looking for an Open Graph Protocol image tag. Facebook uses these to select images for URLs posted in status updates and comments.
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
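As a rough illustration of reading that tag, here is a minimal sketch using only Python's standard library. The class name `OpenGraphImageParser` and the helper `find_og_image` are made up for this example, and it assumes you already have the page's HTML as a string (fetching the page is left out).

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Collects the content of every <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image" and attr_map.get("content"):
            self.images.append(attr_map["content"])


def find_og_image(html):
    """Return the first og:image URL in the markup, or None if there is none."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.images[0] if parser.images else None


sample = (
    '<html><head>'
    '<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>'
    '</head></html>'
)
print(find_og_image(sample))
```

If the page has no such tag, `find_og_image` returns `None`, and you would have to fall back to some other heuristic.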
If you're prepared to use a third party, Embedly offer this as a chargeable service.
Embedly provides a powerful API to convert standard URLs into embedded videos, images, and rich article previews from 218 leading providers.
There are many strategies to determine what is the "main" image of a URL:
I've created a JavaScript library that uses most of these techniques to determine the "main" picture of a URL: ImageResolver.
Facebook allows the user to pick one of several images that it has deemed to be a "main" image. As far as automatically determining a "main" image, I would judge it based on page position, size, relation to text, and (if you wanted to be more sophisticated) its visual content.
For example, you could use a simple face detection program, or look at color breakdowns to determine if the picture was "interesting" to you or not.
EDIT: In the case of www.nytimes.com, I would probably just look at the page structure, because a large carousel of images is located right underneath an H1 tag.
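To make the position/size/relation-to-text idea from this answer a bit more concrete, here is a small scoring sketch. Everything in it is an assumption for illustration: the `Candidate` fields, the `near_heading` flag, and the weights are invented, and it presumes you have already extracted each image's dimensions and position from the page somehow.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    width: int
    height: int
    position: int        # 0 for the first image in document order
    near_heading: bool   # hypothetical flag: image sits next to a heading or article text


def score(candidate):
    """Combine size, page position and relation to text into one number.

    The weights are arbitrary; tune them for the pages you care about.
    """
    # Area relative to a 400x400 reference, capped so huge images don't dominate.
    area_score = min(candidate.width * candidate.height / (400 * 400), 4.0)
    # Earlier images score higher than later ones.
    position_score = 1.0 / (1 + candidate.position)
    # Bonus for images placed close to a heading or body text.
    text_score = 1.0 if candidate.near_heading else 0.0
    return area_score + position_score + text_score


def best_candidate(candidates):
    """Return the highest-scoring candidate, or None for an empty list."""
    return max(candidates, key=score, default=None)


candidates = [
    Candidate("logo.png", 120, 60, position=0, near_heading=False),
    Candidate("hero.jpg", 800, 600, position=1, near_heading=True),
]
print(best_candidate(candidates).url)
```

Visual checks such as face detection or color analysis could be added as further terms in `score`, but they are left out here.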
There really isn't anything that is considered the "main" image in a web page; nothing in HTML or otherwise distinguishes it. Not to mention you'd probably also have to look at the images referenced from CSS (background images and so on). But if I had to do this, here is what I would do:
1. First I would decide on a suitable minimum image size, let's say 400x400. (I don't want to pick just any old image; something really small would likely scale horribly.)
2. I would then iterate through each image on the page.
3. For each image I encountered, I would check its size. If it was 400x400 (my predefined size) or larger, I would use this image. If it wasn't, I would check whether it's the largest image I've found so far and, if so, keep its information stored off to the side.
4. Once I had checked a predefined number of images (for argument's sake let's say 10, though you'd probably go much higher), I'd use the largest image I've found (stored off to the side), because I wouldn't want to scan the page indefinitely looking for images!
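The procedure above can be sketched roughly like this. It assumes the images have already been collected in page order along with their dimensions; the `Image` tuple and the function name `pick_main_image` are just names chosen for this example, and "largest" is taken to mean largest area.

```python
from typing import NamedTuple


class Image(NamedTuple):
    url: str
    width: int
    height: int


def pick_main_image(images, min_width=400, min_height=400, max_checked=10):
    """Return the first image meeting the minimum size, else the largest one seen.

    Only the first `max_checked` images are examined, so the scan stays bounded.
    """
    largest = None
    for checked, image in enumerate(images):
        if checked >= max_checked:
            break
        # Big enough: take it straight away.
        if image.width >= min_width and image.height >= min_height:
            return image
        # Otherwise remember it if it is the largest (by area) so far.
        if largest is None or image.width * image.height > largest.width * largest.height:
            largest = image
    return largest


images = [
    Image("icon.png", 100, 100),
    Image("thumb.jpg", 300, 200),
    Image("lead.jpg", 500, 450),
    Image("gallery.jpg", 900, 900),
]
print(pick_main_image(images).url)
```

Note that this returns the first image that clears the threshold, not the biggest one on the page; that matches the steps as described, but you may prefer to keep scanning and take the overall largest.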