Extract text between two links in HTML through Java

Submitted by 让人想犯罪 __ on 2019-12-23 01:59:09

Question


I am trying to retrieve the text data from an ePub file using Java. The text of the ePub file lies within an HTML file that is formatted something like this:

<h2 id="pgepubid00001">Chapter I</h2>

<p>Some text</p>
<p>Another line of Text</p>

<br/>

<h2 id="pgepubid00002">Chapter II</h2>

etc..

Before opening this file, I already know the id of the chapter I need to extract and can find the id of the next chapter as well. Because of this, I thought a logical approach would be to parse the file with a SAX parser and extract the text of each paragraph until I reach the link of the next chapter. But this is proving quite a task.

Of course, everything is dynamic, so there is no fixed link to go to. The HTML is semi-strictly formatted, so I didn't expect parsing to be so much of a problem. Can anyone recommend a good way to extract the text needed?

The solution needs to be Java only; no other languages can be used. I am looking to implement this on an Android device.


Answer 1:


Well, you know the ids of the chapters, so why not use String.indexOf?

int start = text.indexOf("<h2 id=\"pgepubid00001\">");
int end = text.indexOf("<h2 id=\"pgepubid00002\">");

// Java's substring takes (beginIndex, endIndex), so pass end, not end - start
String whatYoureLookingFor = text.substring(start, end);

Keep it simple.
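A slightly fuller sketch of the same indexOf idea, in case it helps. The class name ChapterExtractor, the method extractChapter, and the sample "Chapter II" content are made up for illustration. It assumes each chapter heading starts with <h2 id="...", treats a missing end heading as "read to the end of the file", and strips tags with a crude regex that is only adequate for simple markup like the sample above (entities such as &amp; are not decoded).

```java
public class ChapterExtractor {

    /**
     * Returns the plain text between the heading with id startId and the
     * heading with id endId, or null if the start heading is not found.
     * The heading text of the starting chapter is included in the result.
     */
    public static String extractChapter(String html, String startId, String endId) {
        int start = html.indexOf("<h2 id=\"" + startId + "\"");
        if (start < 0) {
            return null; // start heading not found
        }
        // Look for the next heading only after the start heading.
        int end = html.indexOf("<h2 id=\"" + endId + "\"", start + 1);
        if (end < 0) {
            end = html.length(); // assumed: last chapter runs to the end of the file
        }
        String fragment = html.substring(start, end);
        // Crude tag stripping: replace every tag with a space, then collapse whitespace.
        return fragment.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"pgepubid00001\">Chapter I</h2>\n"
                + "<p>Some text</p>\n"
                + "<p>Another line of Text</p>\n"
                + "<br/>\n"
                + "<h2 id=\"pgepubid00002\">Chapter II</h2>\n"
                + "<p>More text</p>";
        // prints "Chapter I Some text Another line of Text"
        System.out.println(extractChapter(html, "pgepubid00001", "pgepubid00002"));
    }
}
```

This uses only java.lang.String, so it should run on Android as well. If the markup is less regular than the sample, a real HTML or XML parser is likely the safer choice.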



Source: https://stackoverflow.com/questions/5690219/extract-text-between-two-links-in-html-through-java
