XPath: accessing information in nodes

Question

I need to scrape information from a website that contains property details.

<div class="inner">
  <div class="col">
    <h2><a href="house-in-digana-for-sale-kandy-32">House in Digana </a></h2>
    <div class="meta">
      <div class="date"></div>
      <span class="category">Houses</span>,
      <span class="location">Kandy</span>
    </div>
  </div>
  <div class="attr polar">
    <span class="data">Rs. 3,600,000</span>
  </div>
</div>

What are the XPath expressions for "Kandy" and "Rs. 3,600,000"?

Answer 1:

It is not wise to address text nodes directly using text() because of nuances in an XML document.

Rather, addressing an element node directly returns the concatenation of all descendant text nodes as the element value, which is what people usually want (and think they are getting when they address text nodes).

The canonical example I use in the classroom is this piece of OCR'ed content as XML:

<cost>39<!--that 9 may be an 8-->.22</cost>

The value of the element selected by the XPath address cost is "39.22". In XSLT 1.0, however, the value of the XPath address cost/text() is "39", which is incomplete, because only the first text node is used. In XSLT 2.0 (which is how the question is tagged), you get two text nodes, "39" and ".22", which look correct if you concatenate them. But if you pass them to a function requiring a singleton argument, you will get a run-time error. When you address the element instead, its text is concatenated into a single string, which is suitable for a singleton argument.
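The difference between the text nodes and the element value can be reproduced outside of XSLT. The following is a minimal sketch using Python's standard xml.dom.minidom rather than an XPath engine; the helper string_value is a hypothetical name introduced here to imitate how XPath computes the value of an element.

```python
from xml.dom import minidom

doc = minidom.parseString("<cost>39<!--that 9 may be an 8-->.22</cost>")
cost = doc.documentElement

# The comment splits the element content into two separate text nodes,
# which is roughly what cost/text() addresses.
text_nodes = [n.data for n in cost.childNodes if n.nodeType == n.TEXT_NODE]


def string_value(node):
    """Hypothetical helper: concatenate all descendant text nodes,
    imitating the XPath string value of an element."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    if node.nodeType == node.ELEMENT_NODE:
        return "".join(string_value(child) for child in node.childNodes)
    return ""  # comments and other node types contribute nothing


first_text = text_nodes[0]   # what an XSLT 1.0 value-of on cost/text() would show
whole = string_value(cost)   # what addressing the element itself gives

print(text_nodes)
print(first_text)
print(whole)
```

Taking only the first text node loses ".22", while the element value keeps the whole number.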

I tell students that in all of my professional work there are only very (very!) few times that I ever have to use text() in my stylesheets.

So //span[@class='location' or @class='data'] would find the two fields, provided those are the only such elements in the entire document. If they are not, you may need to use a relative path starting with ".//span" from a context node inside the document tree.
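As a rough illustration, here is a sketch that pulls both fields out of the question's markup with Python's standard xml.etree.ElementTree. It assumes the fragment is well-formed XML (the outer div is closed here), which real-world HTML often is not. ElementTree supports only a subset of XPath without the "or" operator, so the single expression is split into two relative lookups; the helper name field is hypothetical.

```python
import xml.etree.ElementTree as ET

# The markup from the question, with the outer <div> closed so that it parses as XML.
html = """
<div class="inner">
  <div class="col">
    <h2><a href="house-in-digana-for-sale-kandy-32">House in Digana </a></h2>
    <div class="meta">
      <div class="date"></div>
      <span class="category">Houses</span>,
      <span class="location">Kandy</span>
    </div>
  </div>
  <div class="attr polar">
    <span class="data">Rs. 3,600,000</span>
  </div>
</div>
"""

root = ET.fromstring(html)


def field(context, css_class):
    """Hypothetical helper: return the text of the first <span> with the given
    class below the context node, or None if there is no such element."""
    span = context.find(f".//span[@class='{css_class}']")
    if span is None:
        return None
    # Address the element and join all of its descendant text,
    # instead of picking out a single text node.
    return "".join(span.itertext()).strip()


location = field(root, "location")
price = field(root, "data")

print(location)
print(price)
```

Because the lookups start with ".//span" and are evaluated from a context node, the same helper could be called once per listing element when a page contains several properties.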

Source: https://stackoverflow.com/questions/17244493/xpath-accessing-information-in-nodes

Tags

xpath

xpath-2.0