I am working on a web-scraping project. One of the websites I am working with loads its data through JavaScript.
There was a suggestion on one of my earlier questi
You can fetch the JavaScript from the page and execute it through an interpreter (such as V8 or Rhino). However, you can get good results far more easily with a functional testing tool such as Selenium or Splinter. These tools launch a real browser and load the page in it; this can be slow, but it ensures that the content the browser would display is available.
For example, consider the HTML document below:
<html>
<head>
<title>Test</title>
<script type="text/javascript">
function addContent(divId) {
var div = document.getElementById(divId);
div.innerHTML = '<em>My content!</em>';
}
</script>
</head>
<body>
<p>The element below will receive content</p>
<div id="mydiv"></div>
<script type="text/javascript">addContent('mydiv')</script>
</body>
</html>
The script below uses Splinter. Splinter launches Firefox and, once the page has finished loading, reads the content that JavaScript added to the div:
import os.path

from splinter import Browser

# Browser() with no arguments launches Firefox.
browser = Browser()
browser.visit('file://' + os.path.realpath('test.html'))
# The div is filled in by JavaScript while the page loads.
elements = browser.find_by_css("#mydiv")
div = elements[0]
print(div.value)
browser.quit()
The content added by JavaScript is printed to stdout.
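For the first route mentioned in this answer (pulling the JavaScript out of the page so it can be handed to an interpreter), the extraction step can be sketched with the standard library alone. This is only an illustration: `ScriptExtractor` and `extract_inline_scripts` are hypothetical names, and the sketch collects inline scripts only; scripts referenced through a `src` attribute would have to be downloaded separately.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect the source text of every inline <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # finished script bodies, in document order
        self._in_script = False  # True while inside a <script> element
        self._buffer = []        # text chunks of the current script

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._buffer).strip())
            self._in_script = False


def extract_inline_scripts(html):
    """Return a list with the text of each inline script in `html`."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.scripts
```

Applied to the test document above, this returns two strings: the definition of `addContent` and the call `addContent('mydiv')`. Running them still requires an interpreter that provides a `document` object, which is why the browser-based approach is usually less work.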
You can call Node.js through subprocess.Popen.
Here is an example of how I do it:
print(execute('''function (args) {
var result = 0;
args.map(function (i) {
result += i;
});
return result;
}''', args=[[1, 2, 3, 4, 5]]))
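The `execute` helper called above is not defined in the snippet. A minimal sketch of what such a helper might look like follows. It assumes a `node` binary is available on the PATH and that the function returns a JSON-serializable value; `build_node_script` and the `node_binary` parameter are hypothetical names introduced for illustration, not part of the original answer.

```python
import json
import subprocess


def build_node_script(function_source, args):
    """Wrap a JS function expression so that Node applies it to `args`
    (a list of positional arguments) and prints the result as JSON."""
    return "console.log(JSON.stringify((%s).apply(null, %s)));" % (
        function_source,
        json.dumps(args),
    )


def execute(function_source, args=(), node_binary="node"):
    """Run a JS function in a Node.js subprocess and return its result.

    Assumption: the function returns something JSON can represent;
    a function that returns undefined would make json.loads fail.
    """
    script = build_node_script(function_source, list(args))
    proc = subprocess.Popen(
        [node_binary, "-e", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", "replace"))
    return json.loads(out.decode("utf-8"))
```

With this sketch, `args=[[1, 2, 3, 4, 5]]` passes one positional argument (the array) to the JavaScript function. Starting a new process per call is slow, so this suits occasional calls rather than tight loops.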
To interact with JavaScript from Python I use WebKit, the rendering engine behind Safari (and, historically, Chrome). There are Python bindings to WebKit through Qt. In particular, there is a method for executing JavaScript called evaluateJavaScript().
Here is a full example to execute JavaScript and extract the final HTML.
An interesting alternative I discovered recently is the Python bond module, which can be used to communicate with a Node.js process (V8 engine).
Usage is very similar to the PyV8 bindings, but you can use any Node.js library directly without modification, which is a major selling point for me.
Your Python code would look like this:
val = js.call('add2', var1, var2)
or even:
add2 = js.callable('add2')
val = add2(var1, var2)
Calling functions is definitely slower than with PyV8, though, so it depends greatly on your needs. If you need to use an npm package that does a lot of heavy lifting, bond is great. You can even run several Node.js processes in parallel.
But if you just need to call a handful of JS functions (for instance, to share the same validation functions between the browser and the backend), PyV8 will definitely be a lot faster.
Find a JavaScript interpreter that has Python bindings (try Rhino, V8, or SpiderMonkey). Once you have found one, it should come with examples of how to use it from Python.
Python itself, however, does not include a JavaScript interpreter.
I did a whole run-down of the different methods recently:
PyQt4, node.js/zombie.js, PhantomJS
PhantomJS was the winner hands down: very straightforward, with lots of examples.