XPath to get markup between two headings

野性不改 2021-01-24 01:17

I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I thought that I could just target divs containing content with XPa

2 Answers
  • 2021-01-24 01:46

    Yes, you're on the right track with XPath -- it's ideal for selecting parts of an XML document.

    For example, for this XML,

    <r>
       <h2>Title A</h2>
       <div>Some Content</div>
       <div>More Content</div>
       <h2>Title B</h2>
    </r>
    

    this XPath,

    //div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']
    

    will select this content,

    <div>Some Content</div>
    <div>More Content</div>
    

    between the two h2 titles, as requested.
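
For readers without a full XPath engine at hand: Python's standard `xml.etree.ElementTree` does not support the `preceding-sibling` and `following-sibling` axes, so the sketch below swaps the XPath for a plain walk over the children of the parent element. The function name `between_headings` and its parameters are made up for illustration, and the `.strip()` on the heading text is an added assumption that the XPath `=` comparison above does not make.

```python
import xml.etree.ElementTree as ET

XML = """<r>
   <h2>Title A</h2>
   <div>Some Content</div>
   <div>More Content</div>
   <h2>Title B</h2>
</r>"""


def between_headings(parent, start_title, end_title, heading_tag="h2"):
    """Return the child elements of `parent` that sit between two headings.

    Hypothetical helper: a procedural stand-in for the sibling-axis XPath,
    not an XPath evaluation.
    """
    selected = []
    inside = False
    for child in parent:
        # Heading title = all text inside the heading, nested tags included.
        title = None
        if child.tag == heading_tag:
            title = "".join(child.itertext()).strip()  # strip() is an assumption
        if not inside:
            inside = title == start_title  # start collecting after this heading
        elif title == end_title:
            return selected                # closing heading reached
        else:
            selected.append(child)
    return []  # the closing heading was never found, so nothing qualifies


root = ET.fromstring(XML)
selected = between_headings(root, "Title A", "Title B")
print([ET.tostring(el, encoding="unicode").strip() for el in selected])
```

Like the XPath, the helper returns nothing when the closing heading is missing, since in that case no element has a following `h2` sibling with the requested title.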


    Update to address OP's self-answer:

    For this new XML example,

    <div>
        <h2><span>Summary</span></h2>
        <p>Paragraph</p>
        <ul>
            <li>List1</li>
            <li>List2</li>
            <li>List3</li>
        </ul>
        <p>Paragraph</p>
    
        <h2><span>Location</span></h2>
        <p>Paragraph</p>
    </div>
    

    the XPath I provided above can easily be adapted,

    //*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']
    

    to select this XML,

    <p>Paragraph</p>  
    <ul>
       <li>List1</li>
       <li>List2</li>
       <li>List3</li>
    </ul>    
    <p>Paragraph</p>
    

    as requested.

  • 2021-01-24 01:59

    With help from kjhughes's suggestion, I managed to get the code working.

    I was unable to make the = 'Text' comparison work, so I replaced it with [text() = 'Text'].

    That alone wasn't enough, as the title of the content I need is located inside a span in an h2 tag, so I had to adapt the XPath a bit more.

    This is what I came up with:

    //*[preceding-sibling::h2/span[text() = 'Summary'] and following-sibling::h2/span[text() = 'Location']]
    

    I tested it using http://www.xpathtester.com/xpath on this HTML:

    <div>
        <h2><span>Summary</span></h2>
        <p>Paragraph</p>
        <ul>
            <li>List1</li>
            <li>List2</li>
            <li>List3</li>
        </ul>
        <p>Paragraph</p>
    
        <h2><span>Location</span></h2>
        <p>Paragraph</p>
    </div>
    

    Which gave me the following result:

    <p>Paragraph</p>
    <ul>
        <li>List1</li>
        <li>List2</li>
        <li>List3</li>
    </ul>
    <p>Paragraph</p>
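
The two comparison styles used in this thread look at different things. In XPath, `h2 = 'Summary'` compares the string-value of the `h2` (the concatenated text of all its descendants), while `text() = 'Summary'` tests only text nodes that are direct children of the context element. The sketch below mirrors that distinction with Python's standard `xml.etree.ElementTree`; it is an analogy, not an XPath evaluation. The whitespace-padded heading at the end is a made-up input showing one way an exact string comparison can fail on real pages, which may or may not be what happened here.

```python
import xml.etree.ElementTree as ET

h2 = ET.fromstring("<h2><span>Summary</span></h2>")

# Text sitting directly inside <h2>: there is none, the title lives in <span>.
direct_text = h2.text

# Concatenated text of all descendants, the analogue of the XPath string-value.
string_value = "".join(h2.itertext())

# Text sitting directly inside the nested <span>.
span_text = h2.find("span").text

# Hypothetical pretty-printed heading: the string-value now carries whitespace,
# so an exact comparison with 'Summary' no longer matches.
padded = ET.fromstring("<h2>\n  <span>Summary</span>\n</h2>")
raw_padded = "".join(padded.itertext())

# Collapsing whitespace, similar in spirit to XPath's normalize-space().
normalized = " ".join(raw_padded.split())

print(repr(direct_text), repr(string_value), repr(span_text))
print(repr(raw_padded), repr(normalized))
```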
    