I am trying to retrieve the text data from an ePub file using Java. The text of the ePub file lies within an HTML file that is formatted something like this:
<h2 id="pgepubid00001">Chapter I</h2>
<p>Some text</p>
<p>Another line of Text</p>
<br/>
<h2 id="pgepubid00002">Chapter II</h2>
etc.
Before opening this file I already know the id of the chapter I need to extract, and I can find the id of the next chapter too. Because of this, I thought a logical approach would be to run the file through a SAX parser and extract the text of each paragraph until I reach the link for the next chapter. But this is proving quite a task.
Of course, everything is dynamic, so there is no fixed link to go to. The HTML is semi-strictly formatted, so I didn't expect parsing to be so much of a problem. Can anyone recommend a good way to extract the text needed?
The solution needs to be Java only; no other languages can be used. I am looking to implement this on an Android device.
Well, you know the ids of the chapters, so why not use String.indexOf?
int start = text.indexOf("<h2 id=\"pgepubid00001\">");
int end = text.indexOf("<h2 id=\"pgepubid00002\">", start);
// Java's substring takes (beginIndex, endIndex), not (start, length)
String whatYoureLookingFor = text.substring(start, end);
Keep it simple.
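If you want only the paragraph text rather than the raw markup, the same idea can be extended a little. What follows is a minimal sketch, not a definitive implementation: the class name ChapterExtractor and the method extractChapter are made up for illustration, and it assumes the headings appear exactly as `<h2 id="...">` and that paragraphs are not nested. It does not decode HTML entities such as `&amp;`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
public class ChapterExtractor {

    // Matches the contents of each <p>...</p> element, across line breaks.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /**
     * Returns the text of every paragraph between the heading with startId
     * and the heading with endId. Returns an empty list if startId is not found.
     */
    public static List<String> extractChapter(String html, String startId, String endId) {
        List<String> paragraphs = new ArrayList<>();
        // Assumption: headings are written exactly as <h2 id="...">.
        int start = html.indexOf("<h2 id=\"" + startId + "\">");
        if (start < 0) {
            return paragraphs;
        }
        int end = html.indexOf("<h2 id=\"" + endId + "\">", start);
        if (end < 0) {
            end = html.length(); // no following heading: read to the end of the file
        }
        Matcher m = PARAGRAPH.matcher(html.substring(start, end));
        while (m.find()) {
            // Drop any inline tags (e.g. <i>, <a>) left inside the paragraph.
            paragraphs.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"pgepubid00001\">Chapter I</h2>\n"
                + "<p>Some text</p>\n"
                + "<p>Another line of Text</p>\n"
                + "<br/>\n"
                + "<h2 id=\"pgepubid00002\">Chapter II</h2>\n"
                + "<p>Next chapter</p>";
        for (String p : extractChapter(html, "pgepubid00001", "pgepubid00002")) {
            System.out.println(p);
        }
    }
}
```

This uses only java.util and java.util.regex, so it should work on Android without extra libraries. If the markup turns out to be less regular than your sample, a real HTML parser would be the safer choice.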
Source: https://stackoverflow.com/questions/5690219/extract-text-between-two-links-in-html-through-java