Here is a snippet of the code:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
WebRequest.DefaultWebProxy = null;//Ensure that
There are still some problems when requesting the web page "www.google.fr" with a WebRequest.
I checked the raw request and response with Fiddler. The problem comes from Google's servers. The HTTP response headers say charset=ISO-8859-1 and the text itself is encoded as ISO-8859-1, while the HTML says charset=UTF-8. This is inconsistent and leads to encoding errors.
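To make the kind of error concrete, here is a minimal sketch of what happens when bytes are decoded with the wrong charset. The class and method names (MojibakeDemo and its two methods) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    // Encodes the text as ISO-8859-1, then (wrongly) decodes those bytes as UTF-8.
    // This is the direction described above: Latin-1 bytes, HTML claiming UTF-8.
    public static string DecodeLatin1AsUtf8(string text)
    {
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
        return Encoding.UTF8.GetString(latin1Bytes);
    }
}
```

For example, MojibakeDemo.DecodeUtf8AsLatin1("é") gives "Ã©", because the two UTF-8 bytes 0xC3 0xA9 are each read as a separate Latin-1 character. In the other direction, accented characters are not valid UTF-8 on their own and are replaced by U+FFFD.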
After many tests, I managed to find a workaround. Just add:
myHttpWebRequest.UserAgent = "Mozilla/5.0";
to your code, and Google's response will magically become UTF-8 throughout.
Firstly, the easier way of writing that code is to use a StreamReader and ReadToEnd:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
{
    using (Stream resStream = response.GetResponseStream())
    {
        StreamReader reader = new StreamReader(resStream, Encoding.???);
        return reader.ReadToEnd();
    }
}
Then it's "just" a matter of finding the right encoding. How did you create the file? If it's with Notepad then you probably want Encoding.Default, but that's obviously not portable, as it's the default encoding for your PC.
On a well-run web server, the response will indicate the encoding in its headers. Having said that, in some cases the response headers claim one thing and the HTML claims another.
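As a rough sketch of how to spot such a disagreement, the helper below pulls the charset label out of a Content-Type header value and out of the HTML text, then compares the two. CharsetSniffer, FindCharset and Disagree are hypothetical names, and the regular expression is a simplification, not a full HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8' (case-insensitive).
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?([A-Za-z0-9._:-]+)",
        RegexOptions.IgnoreCase);

    // Returns the first charset label found in the text, or null if there is none.
    public static string FindCharset(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match m = CharsetPattern.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }

    // True when both labels are present and name different charsets.
    public static bool Disagree(string contentTypeHeader, string html)
    {
        string fromHeader = FindCharset(contentTypeHeader);
        string fromHtml = FindCharset(html);
        return fromHeader != null && fromHtml != null
            && !string.Equals(fromHeader, fromHtml, StringComparison.OrdinalIgnoreCase);
    }
}
```

With the Google case described earlier, Disagree("text/html; charset=ISO-8859-1", "<meta charset=\"UTF-8\">") returns true. Which of the two labels to trust is still a judgment call for the caller.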
I studied the same problem with the help of Wireshark, a great protocol analyser. I think there are some design shortcomings in the HttpWebResponse class. In fact, the whole message entity is downloaded the first time you invoke the GetResponse() method of the HttpWebRequest class, but the framework has no place to hold the data in the HttpWebResponse class or anywhere else, so you end up having to get the response stream a second time.
This is code that downloads only once.
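String FinalResult = "";
HttpWebRequest Request = (HttpWebRequest)System.Net.WebRequest.Create( URL );
HttpWebResponse Response = (HttpWebResponse)Request.GetResponse();
Stream ResponseStream = Response.GetResponseStream();
StreamReader Reader = new StreamReader( ResponseStream );
bool NeedEncodingCheck = true;
while( true )
{
    string NewLine = Reader.ReadLine(); // This may not work for compressed (zipped) HTML.
    if( NewLine == null )
    {
        break;
    }
    FinalResult += NewLine;
    FinalResult += Environment.NewLine;
    if( NeedEncodingCheck )
    {
        int Start = NewLine.IndexOf( "charset=" );
        if( Start >= 0 ) // IndexOf returns -1 when nothing is found; 0 is a valid match.
        {
            Start += "charset=".Length;
            if( Start < NewLine.Length && ( NewLine[Start] == '"' || NewLine[Start] == '\'' ) )
            {
                Start++; // Skip an optional opening quote.
            }
            int End = NewLine.IndexOfAny( new[] { ' ', '"', '\'', ';', '>' }, Start );
            if( End < 0 )
            {
                End = NewLine.Length; // The charset name runs to the end of the line.
            }
            // Caveat: the old reader may already have buffered bytes beyond this line;
            // the new reader only sees what is still unread in ResponseStream.
            Reader = new StreamReader( ResponseStream, Encoding.GetEncoding(
                NewLine.Substring( Start, End - Start ) ) ); // Replace Reader with new encoding.
            NeedEncodingCheck = false;
        }
    }
}
Reader.Close();
Response.Close();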
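One thing to watch out for when swapping readers on the same stream: StreamReader reads ahead into its own buffer, so the underlying stream is usually further along than the line you just read. The sketch below shows this with a MemoryStream standing in for the response stream; ReaderBufferDemo and PositionAfterFirstLine are names invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderBufferDemo
{
    // Reads a single line, then reports how far the underlying stream has advanced.
    public static long PositionAfterFirstLine(byte[] content)
    {
        using (var stream = new MemoryStream(content))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            reader.ReadLine();       // returns only the first line...
            return stream.Position;  // ...but the reader has consumed more of the stream
        }
    }
}
```

For a small input such as the 18 bytes of "line one\nline two\n", the returned position is 18, not 9: the whole content was pulled into the first reader's buffer, so a second StreamReader created on the same stream at that point would have nothing left to read. Buffering the raw bytes first and decoding them afterwards avoids this.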
There are some good solutions here, but they all seem to be trying to parse the charset out of the content type string. Here's a solution using System.Net.Mime.ContentType, which should be more reliable, and shorter.
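var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var encoding = System.Text.Encoding.Default;
// The Content-Type header may be absent, and the ContentType constructor rejects null or empty input.
var contentTypeHeader = client.ResponseHeaders[HttpResponseHeader.ContentType];
if (!String.IsNullOrEmpty(contentTypeHeader))
{
    var contentType = new System.Net.Mime.ContentType(contentTypeHeader);
    if (!String.IsNullOrEmpty(contentType.CharSet))
    {
        encoding = System.Text.Encoding.GetEncoding(contentType.CharSet);
    }
}
string result = encoding.GetString(data);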
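If you want the header parsing on its own, without the download, something like the following might do. It is only a sketch: ContentTypeCharset and FromHeader are hypothetical names, and falling back silently on a malformed header or an unknown charset name is a choice made here, not something the framework does for you.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class ContentTypeCharset
{
    // Returns the encoding named by a Content-Type header value, or the fallback
    // when the header is missing, malformed, or carries no usable charset.
    public static Encoding FromHeader(string headerValue, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(headerValue)) return fallback;
        try
        {
            var contentType = new ContentType(headerValue);
            return string.IsNullOrEmpty(contentType.CharSet)
                ? fallback
                : Encoding.GetEncoding(contentType.CharSet);
        }
        catch (FormatException) { return fallback; }   // header could not be parsed
        catch (ArgumentException) { return fallback; } // charset name not recognized
    }
}
```

Usage would look like ContentTypeCharset.FromHeader(client.ResponseHeaders[HttpResponseHeader.ContentType], Encoding.UTF8), followed by GetString(data) on the result.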
In case you don't want to download the page twice, I slightly modified Alex's code using the answer to "How do I put a WebResponse into a memory stream?". Here's the result:
public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    if (String.IsNullOrEmpty(Charset))
    {
        // no charset supplied; ISO-8859-1 is used here as an assumed fallback
        Charset = "ISO-8859-1";
    }
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response into memory stream
    MemoryStream memoryStream = new MemoryStream();
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        byte[] buffer = new byte[1024];
        int byteCount;
        while ((byteCount = responseStream.Read(buffer, 0, buffer.Length)) > 0)
        {
            memoryStream.Write(buffer, 0, byteCount);
        }
    }
    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();
    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart >= 0)
    {
        CharsetStart += "charset=".Length;
        // skip an optional opening quote, as in <meta charset="utf-8">
        if (CharsetStart < strWebPage.Length &&
            (strWebPage[CharsetStart] == '"' || strWebPage[CharsetStart] == '\''))
        {
            CharsetStart++;
        }
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '"', '\'', ';', '>' }, CharsetStart);
        if (CharsetEnd < 0)
        {
            CharsetEnd = strWebPage.Length;
        }
        string RealCharset =
            strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);
        // real charset meta-tag in HTML differs from supplied server header???
        // (charset names are case-insensitive, so compare them that way)
        if (!String.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);
            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);
            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);
            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }
    // dispose the first stream reader object
    sr.Close();
    return strWebPage;
}