I wrote following python code:
from bs4 import BeautifulSoup
import urllib2
url= \'http://www.example.com\'
page = urllib2.urlopen(url)
soup = BeautifulSoup
The problem is that BS4 is not a complete web browser. It is only an HTML parser. It doesn't parse CSS, nor Javascript.
A complete web browser does at least four things:
Still not sure? Now look at your code. BS4 does not even include the first step, fetching the web page, to do that you had to use urllib2
.
Dynamic sites usually include Javascript to run on the browser and periodically update contents. BS4 doesn't provide that, and so you won't see them, and furthermore never will by using only BS4. Why? Because item (3) above, downloading and executing the Javascript program is not happening. It would be happing in IE, Firefox, or Chrome, and that's why those work to show dynamic content while the BS4-only scraping does not show it.
PhantomJS and CasperJS provide a more mechanized browser that often can run the JavaScript codes enabling dynamic websites. But CasperJS and PhantomJS are programmed in server-side Javascript, not Python.
Apparently, some people are using a browser built into PyQt4 for these kinds of dynamic screenscaping tasks, isolating part of the DOM, and sending that to BS4 for parsing. That might allow for a Python solution.
In comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it might be fetched and parsed with urllib2/BS4. This can be determined by careful examination of the Javascript that is running at a site, particularly you could look for setTimeout
and setInterval
which schedules updates, or ajax
, or jQuery's .load
function for fetching data from the back end. Javascripts for updates of dynamic content will usually only fetch data from back-end URLs of the same web site. If they use jQuery $('#frequenz')
refers to the div, and by searching for this in the JS you may find the code that updates the div. Without jQuery the JS update would probably use document.getElementById('frequenz')
.
You're missing a tiny bit of code:
from bs4 import BeautifulSoup
import urllib2
url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
freq = soup.find('div', attrs={'id':'frequenz'})
print freq.string # Added .string
This should do it:
freq.text.strip()
As in
>>> html = '<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>'
>>> soup = BeautifulSoup(html)
>>> soup.text.strip()
u'tempsensor'