Question
I have the content below in Java, and I want to strip only the HTML tags, not the newline characters:
<p>test1 <b>test2</b> test 3 </p> //line 1
<p>test4 </p> //line 2
If I open this content in a rich text editor, line 1 and line 2 are displayed on separate lines (without showing the </p> tag). In Notepad, however, the content is shown along with the </p> tags. To remove all HTML tags I used
Jsoup.parse(aboveContent).text()
This removes all the HTML tags, but line 1 and line 2 now appear on the same line in Notepad. Somehow Jsoup also removes the newline character.
What I tried:
I also tried replacing each </p> with \r\n and then removing the HTML tags with
Jsoup.parse(contentWith\r\n-Insteadof-</p>Tag ).text()
but Jsoup still removes the end-of-line characters (in the debugger I can see line 1 and line 2 on the same line).
How can I make Jsoup strip only the HTML tags but not the newline characters?
Answer 1:
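You get a single line because text() normalizes whitespace: every run of whitespace characters, including line breaks, is collapsed into a single space.
But you can use a StringBuilder and append the text of each paragraph on its own line: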
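import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StripTagsKeepLines {
    public static void main(String[] args) {
        final String html = "<p>test1 <b>test2</b> test 3 </p>"
                + "<p>test4 </p>";
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        // Visit every <p> element and put its text on a line of its own.
        for (Element element : doc.select("p")) {
            // element.text() returns the text of this element (= without tags).
            sb.append(element.text()).append('\n');
        }
        // trim() drops the trailing newline added after the last paragraph.
        System.out.println(sb.toString().trim());
    }
}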
Output:
test1 test2 test 3
test4
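To see why the \r\n replacement from the question has no visible effect, it may help to imitate the whitespace normalization with plain Java. The sketch below is only an approximation built on a regular expression, not jsoup's actual implementation, and the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

public class WhitespaceNormalizationDemo {
    // Assumption: normalization is approximated as "collapse each run of
    // whitespace into one space, then trim". This is not jsoup's own code.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    public static String normalize(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Text as it might look after </p> was replaced with \r\n.
        String withLineBreak = "test1 test2 test 3 \r\ntest4 ";
        // The \r\n is whitespace too, so it collapses into a single space.
        System.out.println(normalize(withLineBreak)); // prints "test1 test2 test 3 test4"
    }
}
```

Because the line break is treated like any other whitespace, it has to be added back after the text has been extracted, which is what the StringBuilder loop above does.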
Answer 2:
You can also do this:
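import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // newer jsoup releases deprecate this in favor of Safelist

public static String cleanNoMarkup(String input) {
    final Document.OutputSettings outputSettings = new Document.OutputSettings().prettyPrint(false);
    // The empty string is the base URI; Whitelist.none() drops every tag but keeps the text nodes.
    return Jsoup.clean(input, "", Whitelist.none(), outputSettings);
}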
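The important things here are:
1. Whitelist.none() - so no markup is allowed
2. .prettyPrint(false) - so line breaks are not removed
Note that this keeps only the line breaks already present in the input text; it does not insert a new one for each </p>.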
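Since the question is about how the result looks in Notepad, the line endings may also matter: some Windows editors, including older Notepad versions, only break lines on \r\n. A minimal sketch of converting whatever line breaks survive the tag stripping, using only the standard library (the class and the toCrLf helper are hypothetical names, not part of jsoup):

```java
public class LineEndings {
    // Hypothetical helper: rewrites every line break (\r\n, lone \r, or lone \n)
    // as \r\n. The \r\n alternative comes first so an existing pair is matched
    // as one unit and not doubled.
    public static String toCrLf(String text) {
        return text.replaceAll("\\r\\n|\\r|\\n", "\r\n");
    }

    public static void main(String[] args) {
        String stripped = "test1 test2 test 3\ntest4";
        String converted = toCrLf(stripped);
        // Make the control characters visible for the demo output.
        System.out.println(converted.replace("\r", "\\r").replace("\n", "\\n"));
        // prints: test1 test2 test 3\r\ntest4
    }
}
```

The helper would be applied to the string returned by either answer before writing it to a file.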
Source: https://stackoverflow.com/questions/14453047/jsoup-to-strip-only-html-tags-not-new-line-character