Get text between HTML tags

天大地大妈咪最大 submitted on 2019-12-24 18:17:13

Question


Possible duplicate: RegEx matching HTML tags and extracting text

I need to get the text between HTML tags such as <p></p>, or any other tag. My pattern is this:

Pattern pText = Pattern.compile(">([^>|^<]*?)<");

Does anyone know a better pattern? This one is not very useful. I need it to extract the content of a web page for indexing.

Thanks


Answer 1:


SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.




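Answer 2:


It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed. Inside [...], | is a literal character, and so is a ^ that is not in first position, so [^>|^<] also excludes | and ^ from the match. Just list the characters that you don't want to match: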

Pattern pText = Pattern.compile(">([^<>]*?)<");
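
To see what the corrected pattern actually captures, here is a minimal sketch. The class name TextBetweenTags, the extract helper, and the sample input are made up for illustration and are not part of the original answer; trimming and skipping empty matches is also an added assumption. It only behaves on simple markup and can be fooled by comments, scripts, or attribute values that contain > or <.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TextBetweenTags {
    // Pattern from the answer: a '>', then a lazily matched run of
    // characters that are neither '<' nor '>', then a '<'.
    private static final Pattern P_TEXT = Pattern.compile(">([^<>]*?)<");

    // Returns every non-blank run of text found between a '>' and the next '<'.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = P_TEXT.matcher(html);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) {   // skip empty matches such as "<p></p>"
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, !]
        System.out.println(extract("<p>Hello <b>world</b>!</p>"));
    }
}
```

Note that the text comes back in fragments, one per gap between tags, with no information about which element each fragment belonged to. That is one reason the other answers recommend a parser.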



Answer 3:


Don't use regular expressions when parsing HTML.

Use XPath instead (if your HTML is well-formed). The text() function makes it easy to select text nodes.
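
A minimal sketch of that idea, using only the XML and XPath APIs that ship with the JDK. The class name XPathTextNodes, the textNodes helper, and the sample document are assumptions for illustration. It works only when the input parses as well-formed XML (for example XHTML); typical real-world HTML will make the parser throw.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathTextNodes {
    // Parses well-formed markup and returns every non-blank text node.
    static List<String> textNodes(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // "//text()" selects all text nodes anywhere in the document.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate("//text()", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            // Input that is not well-formed ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // prints [Hello, world, !]
        System.out.println(
                textNodes("<html><body><p>Hello <b>world</b>!</p></body></html>"));
    }
}
```

A narrower expression such as //p/text() would restrict the result to text directly inside p elements.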



Source: https://stackoverflow.com/questions/2088412/get-text-between-html-tags
