Question
I want to read the text of a website without the HTML tags and headers. I just need the text as it is displayed in the web browser.
I don't want output like this:
<html>
<body>
bla bla </td><td>
bla bla
<body>
<html>
I just need the text "bla bla bla bla".
I have used the WebClient and HttpWebRequest methods to get the HTML content and then tried to split the received data, but that does not work reliably: if I switch to a different website, the tags may change.
So is there any way to get only the displayed text of a website programmatically?
Answer 1:
Here is how you would do it using the HtmlAgilityPack.
First your sample HTML:
var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";
Load it up (as a string in this case):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
If getting it from the web, similar:
var web = new HtmlWeb();
var doc = web.Load(url);
Now select only text nodes with non-whitespace and trim them.
var text = doc.DocumentNode.Descendants()
.Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
.Select(x => x.InnerText.Trim());
You can get this as a single joined string if you like:
String.Join(" ", text)
Of course, this will only work for simple web pages. Anything complex will also return nodes with data you clearly don't want, such as JavaScript functions.
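One way to reduce that noise is to skip text nodes whose parent is a script or style element and to decode entities before joining. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the class name VisibleText and method name Extract are hypothetical names chosen for illustration, and the filter only covers script and style, not content hidden through CSS.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleText
{
    // Hypothetical helper: returns the text nodes of an HTML string,
    // ignoring the contents of <script> and <style> elements.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode.Descendants()
            // Keep text nodes only (comments and elements are dropped)
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Skip text that sits directly inside a script or style element
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            // Decode entities such as &amp; and trim surrounding whitespace
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            // Drop nodes that were whitespace only
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }
}
```

Calling VisibleText.Extract(html) on a downloaded page would then replace the Descendants/Where/Select chain shown earlier in this answer.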
Answer 2:
You need to use a dedicated HTML parser. That is the only reliable way to extract content from a non-regular language such as HTML.
See: What is the best way to parse html in C#?
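Answer 3:
// Requires: using System.Net;
public string GetwebContent(string urlForGet)
{
    // Create the WebClient; the using block disposes it when done
    using (var client = new WebClient())
    {
        // Download the page as a string (this is the raw HTML, tags included)
        return client.DownloadString(urlForGet);
    }
}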
Answer 4:
I think this link can help you.
/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source)
{
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    bool inside = false;
    for (int i = 0; i < source.Length; i++)
    {
        char let = source[i];
        if (let == '<')
        {
            inside = true;
            continue;
        }
        if (let == '>')
        {
            inside = false;
            continue;
        }
        if (!inside)
        {
            array[arrayIndex] = let;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}
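Note that this character-array approach leaves HTML entities such as &amp; undecoded and keeps the line breaks that sat between tags. Below is a minimal sketch of a follow-up step, assuming its input has already been passed through StripTagsCharArray; the class name TextCleanup and method name NormalizeText are hypothetical names for illustration, not part of the linked code.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper: tidies text from which the tags were already removed.
    public static string NormalizeText(string stripped)
    {
        // Turn entities such as &amp; or &lt; back into their characters
        string decoded = WebUtility.HtmlDecode(stripped);
        // Collapse each run of whitespace (including line breaks) into one space
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For the sample in the question, TextCleanup.NormalizeText(StripTagsCharArray(html)) should yield "bla bla bla bla". As with the stripping function itself, this does nothing about script or style contents, which would still appear in the result.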
Answer 5:
// Reading web page content in a C# program
// Requires: using System; using System.IO; using System.Net;
// Specify the web page to read
WebRequest request = WebRequest.Create("http://aspspider.info/snallathambi/default.aspx");
// Get the response and wrap its stream in a reader; the using blocks dispose both
using (WebResponse response = request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    // Read the whole body; a fixed number of ReadLine calls would truncate long pages and drop the line breaks
    string str = reader.ReadToEnd();
    Console.Write(str);
}
Source: https://stackoverflow.com/questions/10579292/how-to-read-the-website-content-in-c