I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I thought that I could just target divs containing content with XPa
Yes, you're on the right track with XPath -- it's ideal for selecting parts of an XML document.
For example, for this XML,
<h2>Title A</h2>
<div>Some Content</div>
<div>More Content</div>
<h2>Title B</h2>
this XPath,
//div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']
will select this content,
<div>Some Content</div>
<div>More Content</div>
between the two h2 titles, as requested.
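To try the expression from application code, here is a minimal Python sketch. It assumes the third-party lxml library is installed, since the standard library's ElementTree does not support the sibling axes. The `<root>` wrapper, the trailing "Other Content" div, and the helper name `select_between` are made up for illustration.

```python
from lxml import etree  # third-party; assumed to be installed

# Assumed markup: the <root> wrapper and the last <div> are hypothetical,
# added only to show that content after 'Title B' is not selected.
XML = """
<root>
  <h2>Title A</h2>
  <div>Some Content</div>
  <div>More Content</div>
  <h2>Title B</h2>
  <div>Other Content</div>
</root>
"""

XPATH = ("//div[preceding-sibling::h2 = 'Title A' "
         "and following-sibling::h2 = 'Title B']")


def select_between(xml_text, xpath):
    """Return the text of every element matched by the XPath expression."""
    tree = etree.fromstring(xml_text)
    return [element.text for element in tree.xpath(xpath)]


selected = select_between(XML, XPATH)
print(selected)
```

The last div is skipped because no h2 equal to 'Title B' follows it, so only the two divs between the titles are returned.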
Update to address OP's self-answer:
For this new XML example,
<h2>Summary</h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2>Location</h2>
<p>Paragraph</p>
the XPath I provided above can easily be adapted,
//*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']
to select this XML,
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
as requested.
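If a full XPath 1.0 engine is not available, the same "everything between two headings" selection can be approximated with a plain loop over siblings. The sketch below uses only Python's standard library, whose ElementTree XPath subset lacks the preceding-sibling and following-sibling axes. The `<body>` wrapper, the p/ul/li tag names, and the helper `elements_between` are assumptions for illustration, and real Wikipedia HTML would first need a parser that tolerates non-XML markup.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the <body> wrapper and the p/ul/li tags are guesses
# at what the flattened example looked like.
HTML = """
<body>
  <h2>Summary</h2>
  <p>Paragraph</p>
  <ul><li>List1</li><li>List2</li><li>List3</li></ul>
  <p>Paragraph</p>
  <h2>Location</h2>
  <p>Paragraph</p>
</body>
"""


def elements_between(parent, start_title, end_title, heading="h2"):
    """Collect the children of parent that sit between two headings.

    Returns an empty list if either heading is missing, which mirrors
    the XPath selecting nothing in that case.
    """
    collected = []
    inside = False
    for child in parent:
        is_heading = child.tag == heading
        if inside and is_heading and child.text == end_title:
            return collected
        if inside:
            collected.append(child)
        elif is_heading and child.text == start_title:
            inside = True
    return []  # end heading never reached (or start heading never found)


root = ET.fromstring(HTML)
section = elements_between(root, "Summary", "Location")
tags = [element.tag for element in section]
list_items = [li.text for li in section[1]]
print(tags, list_items)
```

Unlike the XPath, this loop compares the heading text exactly and only looks at direct children of one parent, so it is a rough equivalent rather than a drop-in replacement.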