I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I thought that I could just target divs containing content with XPa
Yes, you're on the right track with XPath -- it's ideal for selecting parts of an XML document.
For example, for this XML,
<r>
<h2>Title A</h2>
<div>Some Content</div>
<div>More Content</div>
<h2>Title B</h2>
</r>
this XPath,
//div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']
will select this content,
<div>Some Content</div>
<div>More Content</div>
between the two h2 titles, as requested.
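For anyone wiring this into an application, here is a minimal Python sketch of the same selection. It assumes the standard library's xml.etree.ElementTree, whose limited XPath support has no sibling axes, so the predicate is mimicked by hand rather than evaluated as XPath. The helper name divs_between is made up for illustration, and it only looks at the direct children of one parent, whereas //div searches at any depth.

```python
import xml.etree.ElementTree as ET

XML = """<r>
<h2>Title A</h2>
<div>Some Content</div>
<div>More Content</div>
<h2>Title B</h2>
</r>"""


def divs_between(parent, start_title, end_title):
    """Hypothetical helper: mimic
    div[preceding-sibling::h2 = start_title and following-sibling::h2 = end_title]
    for the direct children of a single parent element."""
    children = list(parent)
    selected = []
    for i, child in enumerate(children):
        if child.tag != "div":
            continue
        # preceding-sibling::h2 = start_title
        has_start = any(c.tag == "h2" and c.text == start_title for c in children[:i])
        # following-sibling::h2 = end_title
        has_end = any(c.tag == "h2" and c.text == end_title for c in children[i + 1:])
        if has_start and has_end:
            selected.append(child)
    return selected


root = ET.fromstring(XML)
result = [d.text for d in divs_between(root, "Title A", "Title B")]
print(result)
```

With a full XPath 1.0 engine the expression above can be evaluated directly, so a loop like this is only needed when sibling axes are unavailable.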
Update to address OP's self-answer:
For this new XML example,
<div>
<h2><span>Summary</span></h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2><span>Location</span></h2>
<p>Paragraph</p>
</div>
the XPath I provided above can easily be adapted,
//*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']
to select this XML,
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
as requested.
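The reason the plain = 'Summary' comparison still works here is that XPath compares against the string-value of the h2, which is the concatenation of all its descendant text, including the text inside the span. The sketch below illustrates that idea in Python, again assuming xml.etree.ElementTree and hand-written sibling checks instead of real XPath evaluation. The names string_value and between_headings are hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """<div>
<h2><span>Summary</span></h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2><span>Location</span></h2>
<p>Paragraph</p>
</div>"""


def string_value(element):
    """Concatenate all descendant text, like the XPath string-value of an element."""
    return "".join(element.itertext())


def between_headings(parent, start_title, end_title):
    """Hypothetical helper: mimic
    *[preceding-sibling::h2 = start_title and following-sibling::h2 = end_title]
    for the direct children of a single parent element."""
    children = list(parent)
    selected = []
    for i, child in enumerate(children):
        has_start = any(
            c.tag == "h2" and string_value(c) == start_title for c in children[:i]
        )
        has_end = any(
            c.tag == "h2" and string_value(c) == end_title for c in children[i + 1:]
        )
        if has_start and has_end:
            selected.append(child)
    return selected


root = ET.fromstring(HTML)

# The h2 has no text of its own; the title lives in the nested span.
direct_text = root.find("h2").text
tags = [e.tag for e in between_headings(root, "Summary", "Location")]
print(direct_text, tags)
```

Comparing only the heading's own text would miss the title, which is the same reason a text() test placed directly on the h2 does not match.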
With help from kjhughes's suggestion, I managed to get the code working.
I was unable to make the = 'Text' part work, so I replaced it with [text() = 'text'].
That alone wasn't enough, as the title of the content I need is located inside a span in an h2 tag, so I had to adapt the XPath a bit more.
This is what I came up with:
//*[preceding-sibling::h2/span[text() = 'Summary'] and following-sibling::h2/span[text() = 'Location']]
I tested it using http://www.xpathtester.com/xpath on this HTML:
<div>
<h2><span>Summary</span></h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2><span>Location</span></h2>
<p>Paragraph</p>
</div>
Which gave me the following result:
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
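In case it helps someone doing the same thing in code rather than in an online tester, here is a rough Python sketch of the same idea of matching on the span inside each h2. It assumes xml.etree.ElementTree from the standard library, which supports the [tag='text'] predicate but not sibling axes, so the two headings are located first and the elements between them are taken with a list slice. This is an illustration, not the code from my application.

```python
import xml.etree.ElementTree as ET

HTML = """<div>
<h2><span>Summary</span></h2>
<p>Paragraph</p>
<ul>
<li>List1</li>
<li>List2</li>
<li>List3</li>
</ul>
<p>Paragraph</p>
<h2><span>Location</span></h2>
<p>Paragraph</p>
</div>"""

root = ET.fromstring(HTML)
children = list(root)

# Find each heading by the text of its child span.
start = root.find("h2[span='Summary']")
end = root.find("h2[span='Location']")
if start is None or end is None:
    raise ValueError("heading not found")

# Keep the siblings strictly between the two headings.
section = children[children.index(start) + 1 : children.index(end)]
tags = [e.tag for e in section]
print(tags)
```

Real Wikipedia markup is HTML rather than well-formed XML, so a more forgiving parser would likely be needed before a step like this.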