How to apply javascript to html simulating a browser

Question

I've searched the Internet for how to build a simple headless browser, because I'm interested in how a browser works internally. I'd like to implement a simple headless browser myself.

What I mean is: suppose you have an HTML string and a JavaScript string, both obtained from HTTP requests to a server. How can I apply the JavaScript to the HTML string?

For example, I request the HTML source file from some server X and get this response:

<html>
    <head>
         <script type="text/javascript" src="javascript.js"></script>
    </head>
    <body>
        <p id="content"></p>
    </body>
</html>

Then I request the javascript.js file and get this:

document.getElementById("content").textContent = "Hello";

How can I apply the content of the javascript.js file to the HTML file? Are the steps I should follow something like this?

Parse the HTML source into DOM elements
Apply the JavaScript to the DOM

I'd like to do it with Java, Scala or Node.js. I don't know if the main idea is clear; I'm Latin American and my English isn't very good, sorry about that. If something is unclear, please let me know in the comments and I'll edit my post.

EDIT: in other words, what I'd like is a function like this (in pseudocode):

function applu(html, js){
    // Apply js into html
}

Answer 1:

If you're looking for a headless browser, I'm sure you're aware of PhantomJS. PhantomJS is a headless browser based on Apple's WebKit browser engine.

You're asking for a lot here. You need:

A JavaScript runtime (such as V8) to run the JavaScript.
A web engine to bring the HTML, and the Document Object Model it defines, to life.

Both of those take millions of lines of code to implement.

My recommendation is to integrate your program with PhantomJS. PhantomJS is a headless web browser and a JavaScript environment. If you're using Scala, start PhantomJS as a child process and send messages to it via std i/o. The JS part of PhantomJS means that you use it via its JavaScript API, so you'd also have to write a JS script to handle the messages coming in from std i/o. It's not well documented, but PhantomJS's system module has system.stdin and system.stdout APIs to handle the messages.
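The child-process-over-std-i/o pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the child here is a plain Node script standing in for a PhantomJS script (so the sketch runs without PhantomJS installed), the parent is Node rather than Scala, and the JSON message shape (`id`, `html`, `length`) and the `sendToChild` helper are made up for the example.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical child script: reads one JSON message from stdin and writes a
// JSON reply to stdout. In a real setup this role would be played by a
// PhantomJS script; plain Node is used here only as a stand-in.
const childSource = `
  let input = '';
  process.stdin.setEncoding('utf8');
  process.stdin.on('data', chunk => { input += chunk; });
  process.stdin.on('end', () => {
    const message = JSON.parse(input);
    const reply = { id: message.id, length: message.html.length };
    process.stdout.write(JSON.stringify(reply));
  });
`;

// Send one request to the child over stdin and parse its reply from stdout.
function sendToChild(message) {
  const result = spawnSync(process.execPath, ['-e', childSource], {
    input: JSON.stringify(message),
    encoding: 'utf8',
  });
  if (result.status !== 0) {
    throw new Error(`child failed: ${result.stderr}`);
  }
  return JSON.parse(result.stdout);
}

const reply = sendToChild({ id: 1, html: '<p id="content"></p>' });
console.log(reply);
```

A long-lived child that handles many messages would more likely use an asynchronous spawn and one JSON message per line; the one-shot call above just keeps the sketch short.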

That's a lot of work, and a lot of extra resources outside the JVM, to get it working. I saw that you're using Scala, so you could go with a simpler solution and use jsoup to parse and modify the HTML document; however, you would have to write the transformations in Scala (or Java), since jsoup does not execute JavaScript.

Actually, now that I think about it, you should use jsdom paired with Node.js. jsdom implements the DOM API without actually rendering anything, which might be what you need. jsdom is made for Node.js, which is headless. You can also use Node's std i/o to send messages to and from the JVM if you want to use both Scala and Node.

Here is a proof of concept that uses jsdom to evaluate the JavaScript and modify the HTML. It's a really simple solution, and of the options above it is the most resource-efficient for the given task (and this is a hard task).

I made a gist for you with a very simple proof of concept. To run the gist do:

git clone https://gist.github.com/c8aef41ee27e5304e94f6a255b048f87.git apply-js-to-html
cd apply-js-to-html
npm install
node example.js

This is the meat of the example:

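const jsdom = require('jsdom');

// Applies the `js` source to the document parsed from `html` and resolves
// with the resulting markup. Uses the callback-style jsdom.env API.
module.exports = function (html, js) {
    return new Promise((resolve, reject) => {
        jsdom.env(html, (error, window) => {
            if (error) {
                // Stop here so the script is not evaluated after a failure.
                return reject(error);
            }
            try {
                (function evalInContext () {
                    'use strict';
                    // Expose the jsdom globals as locals so the evaled
                    // script can refer to `document` and `window`.
                    const document = this.document;
                    const window = this.window;
                    eval(js);
                    resolve(window.document.documentElement.innerHTML);
                }).call(window);
            } catch (e) {
                reject(e);
            }
        });
    });
};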

And here is the module in use:

const applu = require('./index');

const html = `
    <html>
        <head></head>
        <body>
            <p id="content"></p>
        </body>
    </html>
`;

const js = `document.getElementById("content").innerHTML = "Hello";`;

applu(html, js).then(result => {
    console.log('input html: ', html);
    console.log('output html: ', result);
}).catch(err => console.error(err));

And here is the output of the code:

input html:  
    <html>
        <head></head>
        <body>
            <p id="content"></p>
        </body>
    </html>

output html:  <head></head>
        <body>
            <p id="content">Hello</p>


</body>

jsdom creates a headless window and document environment that doesn't render anything. You can use eval and call it in context, with window as the this value. I've also declared document and window again so that the JS being evaled has those variables in scope.
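To illustrate the scoping idea on its own, here is a small sketch that swaps eval for Node's built-in vm module, which gives a script its globals from a context object. Note the assumptions: `fakeWindow` and `elements` are hypothetical stand-ins with just enough of `document` for the demo, not a real DOM, and this is a different technique from the eval-with-this approach used in the gist.

```javascript
const vm = require('vm');

// Hypothetical stand-in for a DOM: just enough of `document` for the demo.
const elements = { content: { innerHTML: '' } };
const fakeWindow = {
  document: { getElementById: id => elements[id] || null },
};
fakeWindow.window = fakeWindow;

const js = 'document.getElementById("content").innerHTML = "Hello";';

// The properties of the context object become the script's global variables,
// so `document` and `window` resolve without any wrapper function.
vm.runInNewContext(js, fakeWindow);

console.log(elements.content.innerHTML); // prints "Hello"
```

The vm module is not a security boundary, so, as with eval, this should only be used on scripts you trust.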

This is just a basic POC; you'll have to iron out the details yourself.

Source: https://stackoverflow.com/questions/42333359/how-to-apply-javascript-to-html-simulating-a-browser

Tags

javascript

html

node.js

scala

headless-browser