Question
Possible duplicate: RegEx matching HTML tags and extracting text
I need to get the text between HTML tags such as <p></p>, or any other tag. My pattern is this:
Pattern pText = Pattern.compile(">([^>|^<]*?)<");
Does anyone know a better pattern? This one is not very useful. I need it to extract the content of a web page for indexing.
Thanks
Answer 1:
SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.
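Answer 2:
It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed. Inside [...], | is a literal character, and ^ only negates when it is the first character. Just specify the characters that you don't want to match: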
Pattern pText = Pattern.compile(">([^<>]*?)<");
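As a rough illustration of how the corrected pattern behaves, here is a minimal sketch. The class name TagTextExtractor and the method extractText are made up for this example, and it assumes simple markup with no comments, scripts, or attribute values containing > or <. It also skips whitespace-only matches, which the pattern itself does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class TagTextExtractor {
    // Capture any run of characters that contains neither '<' nor '>'.
    private static final Pattern P_TEXT = Pattern.compile(">([^<>]*?)<");

    static List<String> extractText(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = P_TEXT.matcher(html);
        while (m.find()) {
            // group(1) is the text between a closing '>' and the next '<'.
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                result.add(text); // drop empty gaps such as "</b></p>"
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```

Text before the first tag or after the last one is never matched, because the pattern needs a > on the left and a < on the right.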
Answer 3:
Don't use regular expressions when parsing HTML.
Use XPath instead (if your HTML is well formed). You can reference text nodes using the text() function very easily.
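A minimal sketch of the XPath approach, using only the javax.xml.xpath API that ships with the JDK. The class name XPathTextExtractor and the method extractText are invented for this example. It assumes the input is well-formed XML (XHTML without a DOCTYPE or HTML-only entities such as &nbsp;). Real-world HTML usually is not, in which case an HTML parser would be needed first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class XPathTextExtractor {
    static List<String> extractText(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "//text()" selects every text node in the document.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                if (!text.isEmpty()) {
                    result.add(text); // skip whitespace-only nodes
                }
            }
            return result;
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression or input that is not well formed.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

To restrict the result to a particular element, the expression can be narrowed, for example "//p//text()" for text inside paragraphs only.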
Source: https://stackoverflow.com/questions/2088412/get-text-between-html-tags