Question
Get .txt file instead of .jpg - using WebClient and DownloadFile();
I'm trying to download the .jpg from this URL:
http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600-h/avengers02_B&W_UL.jpg
Using this code:
private void TEST_button1_Click(object sender, EventArgs e)
{
    WebClient MyDownloader = new WebClient();
    MyDownloader.DownloadFile(@"http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600-h/avengers02_B&W_UL.jpg", @"c:\test.jpg");
}
However, when I run this, I end up with a file called test.jpg that contains HTML markup:
<html>
<head>
<title>avengers02_B&W_UL.jpg (image)</title>
<script type="text/javascript">
<!--
if (top.location != self.location) top.location = self.location;
// -->
</script>
</head>
<body bgcolor="#ffffff" text="#000000">
<img src="http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600/avengers02_B%26W_UL.jpg" alt="[avengers02_B&W_UL.jpg]" border=0>
</body>
</html>
How can I download the actual .jpg?
Any help is greatly appreciated - thank you!
Answer 1:
There is a way to do it. The URL returns an HTML page that wraps the image, so first download that HTML into a string and extract the real image URL from the img tag. Then use that URL to download the file.
using System.Net;
using System.Text.RegularExpressions;

using (var client = new WebClient())
{
    // This URL returns an HTML wrapper page, not the image itself.
    var path = @"http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600-h/avengers02_B&W_UL.jpg";
    var content = client.DownloadString(path);
    // Capture the quoted src attribute value of the first <img> tag.
    var regex = new Regex(@"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)");
    var match = regex.Match(content);
    if (match.Success)
    {
        client.DownloadFile(match.Groups["url"].Value, @"e:\test1.jpg");
    }
}
Answer 2:
If the server returns HTML for your request to a particular URL, you can't do much to force it to return something else at that URL.
What you can do is parse the response with HtmlAgilityPack, find the URL of the actual image, and fetch it in a second request.
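A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is installed and that the first img tag in the response is the one you want. The class and method names (WrapperPageImage, ExtractFirstImageSrc, Download) are made up for this illustration and are not part of any library.

```csharp
using System;
using System.Net;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class WrapperPageImage
{
    // Hypothetical helper: returns the src of the first <img> that has one,
    // or null when the HTML contains no such tag.
    public static string ExtractFirstImageSrc(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        var img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        if (img == null)
        {
            return null;
        }

        // HtmlDecode turns entities such as &amp; back into &;
        // percent-escapes such as %26 are left untouched.
        return WebUtility.HtmlDecode(img.GetAttributeValue("src", ""));
    }

    // Hypothetical helper: first request fetches the HTML wrapper,
    // second request fetches the image it points at.
    public static void Download(string pageUrl, string targetPath)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            string imageUrl = ExtractFirstImageSrc(html);
            if (imageUrl == null)
            {
                throw new InvalidOperationException("No <img> tag found in the response.");
            }
            client.DownloadFile(imageUrl, targetPath);
        }
    }
}
```

Calling WrapperPageImage.Download with the s1600-h URL from the question and a local path such as c:\test.jpg should then save the image rather than the wrapper page, provided the server still responds as shown in the question.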
Answer 3:
Clicking that link causes two downloads: first a page of HTML (mislabelled with the suffix .jpg), and then the image referenced in that HTML.
So perhaps you need to fetch the URL of the img tag in the HTML returned by the first request?
http://1.bp.blogspot.com/_pK6J3MTn5co/S6kuH3aqbeI/AAAAAAAACUY/06axvmjU91k/s1600/avengers02_B%26W_UL.jpg
I'm guessing that removing -h from the original URL might point to the actual file that you're after.
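That guess can be sketched as a small URL rewrite. This assumes the pattern seen in this one example holds (a size folder ending in -h serves the HTML wrapper, and the same folder without -h serves the image), which the thread does not confirm. BlogspotImageUrl and ToDirectUrl are hypothetical names used only for illustration.

```csharp
using System;

public static class BlogspotImageUrl
{
    // Hypothetical helper: turns ".../s1600-h/name.jpg" into ".../s1600/name.jpg".
    // Assumption: only a trailing "-h" on the last folder marks the wrapper page.
    public static string ToDirectUrl(string wrapperUrl)
    {
        int lastSlash = wrapperUrl.LastIndexOf('/');
        if (lastSlash <= 0)
        {
            return wrapperUrl; // no folder part to rewrite
        }

        string directory = wrapperUrl.Substring(0, lastSlash);
        string fileName = wrapperUrl.Substring(lastSlash + 1);

        if (directory.EndsWith("-h", StringComparison.Ordinal))
        {
            directory = directory.Substring(0, directory.Length - 2);
        }

        // Unescape first so an already-escaped name is not escaped twice,
        // then escape so that '&' becomes %26, as in the img src above.
        string escapedName = Uri.EscapeDataString(Uri.UnescapeDataString(fileName));
        return directory + "/" + escapedName;
    }
}
```

For the URL in the question this produces the same address as the img src shown above, which can then be passed to WebClient.DownloadFile. Parsing the returned HTML is the safer route if the folder naming ever differs.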
Here's hoping that you have permission to scrape these files...
Source: https://stackoverflow.com/questions/11303101/get-txt-file-instead-of-jpg-using-webclient-and-downloadfile