Here is some HTML from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0, as shown in Google Chrome, that I want to parse for a project.
It is important to inspect the string returned by page.text and not just rely on the page source as shown by your Chrome browser. Web sites can return different content depending on the User-Agent, and moreover, GUI browsers such as Chrome may change the content by executing JavaScript, whereas requests.get does not.
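To see which User-Agent requests would send, and how to override it, here is a minimal sketch. The browser string and the helper name build_request are hypothetical, and nothing here claims that this particular site actually varies its response by User-Agent. The requests are only prepared, not sent.

```python
import requests

URL = 'http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0'
# Hypothetical browser-like string, for illustration only.
BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0'


def build_request(session, url, user_agent=None):
    """Prepare (but do not send) a GET request, optionally overriding User-Agent."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    # prepare_request merges the session's default headers with ours.
    return session.prepare_request(requests.Request('GET', url, headers=headers))


session = requests.Session()
default_req = build_request(session, URL)
browser_req = build_request(session, URL, BROWSER_UA)

print(default_req.headers['User-Agent'])  # the library's own default string
print(browser_req.headers['User-Agent'])  # the overridden string

# To actually fetch and compare the two responses:
#   page = session.send(browser_req, timeout=10)
```

Sending both prepared requests and comparing the two page.text values is one way to check whether the server treats them differently.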
If you write the contents to a file
import requests
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
# page.content is bytes, which is what a file opened in 'wb' mode expects
with open('/tmp/test', 'wb') as f:
    f.write(page.content)
and use a text editor to search for "yui_3_18_1_3_1434380225687_700"
you'll find that there is no tag with that attribute value.
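The same check can be done in code rather than in a text editor. This is a minimal sketch: find_markers is a hypothetical helper, and sample_text is a made-up stand-in for page.text, not the real response.

```python
def find_markers(html_text, markers):
    """Return a dict mapping each marker to True if it occurs in html_text."""
    return {marker: (marker in html_text) for marker in markers}


# Hypothetical stand-in for page.text; the real response will differ.
sample_text = '<div><h3>Name of Substance</h3><div>Acetaldehyde</div></div>'

report = find_markers(
    sample_text,
    ['yui_3_18_1_3_1434380225687_700', 'Name of Substance'],
)
for marker, present in report.items():
    print(f'{marker!r}: {"found" if present else "missing"}')
```

With the real page you would pass page.text instead of sample_text.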
If instead you search for "Name of Substance", you'll find it in the returned HTML, in a fragment that also contains the text "Search for this InChIKey on the Web".
Therefore, instead you could use:
In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'
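The tree object is not defined in this answer. A minimal sketch, assuming it was built with lxml.html.fromstring, is shown here against a small made-up fragment. SAMPLE_HTML is hypothetical and only imitates the general shape that the XPath relies on (a label tag whose parent also contains the div with the value). The real page's markup will differ.

```python
import lxml.html

# Hypothetical fragment; with the real page, use lxml.html.fromstring(page.text).
SAMPLE_HTML = """
<html><body>
  <div class="section">
    <h3>Name of Substance</h3>
    <ul><li><div>Acetaldehyde</div></li></ul>
  </div>
</body></html>
"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# Find the label, step up to its parent, then collect every div below the parent.
matches = tree.xpath('//*[text()="Name of Substance"]/..//div')

# Guard against an empty result before indexing.
name = matches[0].text_content().strip() if matches else None
print(name)
```

The guard on matches matters in practice: if the site changes its markup, the XPath returns an empty list and indexing with [0] would raise IndexError.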
How this XPath was found:
Starting from the tag that contains the text "Name of Substance":
In [215]: tree.xpath('//*[text()="Name of Substance"]')
Out[215]: [<Element ...>]
The div that we want is not inside this tag, but it is inside the tag's parent. Therefore, go up to the parent:
In [216]: tree.xpath('//*[text()="Name of Substance"]/..')
Out[216]: [<Element ...>]
and then use //div to search for all divs inside the parent:
In [217]: tree.xpath('//*[text()="Name of Substance"]/..//div')
Out[217]: [<Element div ...>, ...]
The first div is the one that we want:
In [218]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0]
Out[218]: <Element div ...>
and we can extract the text using the text_content method:
In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'