Performant parsing of HTML pages with Node.js and XPath

I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way t

6 Answers

清酒与你

2020-12-07 21:24

There may never be a single right way to parse HTML pages. A first look at web scraping and crawling tools suggests that Scrapy could be a good candidate for your needs; it accepts both CSS and XPath selectors. In the Node.js world, there is a fairly new module, node-osmosis. It is built on libxmljs, so it should support both CSS and XPath, although I did not find any example using XPath.

离开以前

2020-12-07 21:30

You can do so in several steps.

Parse HTML with parse5. The downside is that the result is not a DOM, though parsing is fast enough and W3C-compliant.
Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
Parse that XHTML again with xmldom. Now you finally have a DOM.
The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so unprefixed queries like //a won't match anything.

Finally you get something like this.

```
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    // Read the page and parse it into parse5's DOM-like tree
    const html = await fs.readFile('./test.htm');
    const document = parse5.parse(html.toString());
    // Serialize the tree to XHTML, then re-parse it into a real DOM
    const xhtml = xmlser.serializeToString(document);
    const doc = new dom().parseFromString(xhtml);
    // Bind the prefix "x" to the XHTML namespace so element names can match
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();
```
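
The namespace caveat above can also be sidestepped without registering a prefix, because the XPath 1.0 function local-name() ignores namespaces. The sketch below is only an illustration: `localStep`, `localPath`, and `attrValues` are hypothetical helper names, not part of xpath or xmldom, and `attrValues` assumes it is given the array of attribute nodes returned by a query such as the one above.

```javascript
// Hypothetical helpers (not part of the xpath or xmldom packages).

// Build a namespace-agnostic XPath step for one element name,
// e.g. "a" becomes "*[local-name()='a']".
function localStep(tagName) {
    if (!/^[A-Za-z_][\w.-]*$/.test(tagName)) {
        throw new Error('Unexpected element name: ' + tagName);
    }
    return "*[local-name()='" + tagName + "']";
}

// Join element names into a "//parent/child" query and
// optionally select an attribute of the last element.
function localPath(tagNames, attribute) {
    const steps = tagNames.map(localStep).join('/');
    return '//' + steps + (attribute ? '/@' + attribute : '');
}

// Turn attribute nodes (objects exposing a `value` property,
// as DOM Attr nodes do) into plain strings.
function attrValues(nodes) {
    return Array.from(nodes, (node) => node.value);
}

console.log(localPath(['a'], 'href'));
// prints //*[local-name()='a']/@href
```

With the plain `xpath.select` function, the generated string could be passed in place of "//x:a/@href", and `attrValues(nodes)` would then give the link targets as strings instead of node objects.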

自闭症患者

2020-12-07 21:34
With just one line, you can do it with xpath-html:
```
const xpath = require("xpath-html");

// `html` is the page source as a string
const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");
```
深忆病人

2020-12-07 21:36

I have just started using htmlstrip-native (npm install htmlstrip-native), which uses a native implementation to parse HTML and extract the relevant parts. It claims to be 50 times faster than the pure JavaScript implementation (I have not verified that claim).

Depending on your needs, you can use html-strip directly, or lift the code and bindings from the C++ used internally in htmlstrip-native to make your own.

If you want to use XPath, use the wrapper already available here: https://www.npmjs.org/package/xpath

南笙

2020-12-07 21:38
Libxmljs is currently the fastest implementation (according to something like a benchmark), since it is only a set of bindings to the libxml C library, which supports XPath 1.0 queries:
```
var libxmljs = require("libxmljs");
// `xml` is a string of well-formed XML
var xmlDoc = libxmljs.parseXml(xml);
// XPath query: get() returns the first matching node
var gchild = xmlDoc.get('//grandchild');
```
However, you need to sanitize your HTML first and convert it to proper XML. For that, you could either use the HTML Tidy command-line utility (tidy -q -asxml input.html) or, if you want to keep it Node-only, something like xmlserializer should do the trick.
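
If you go the HTML Tidy route from Node, one way to call it is through child_process. This is a rough sketch, assuming the tidy binary is installed and on your PATH; `tidyToXhtml` is a hypothetical name, and the flags are the ones from the command above.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: run `tidy -q -asxml <file>` and resolve with
// the XHTML that tidy writes to stdout.
// Assumes an HTML Tidy binary named "tidy" is available on PATH.
function tidyToXhtml(inputPath) {
    return new Promise((resolve, reject) => {
        execFile(
            'tidy',
            ['-q', '-asxml', inputPath],
            { maxBuffer: 64 * 1024 * 1024 }, // allow large pages
            (err, stdout) => {
                // tidy exits with status 1 when it only reported warnings;
                // the converted document is still written in that case.
                if (err && err.code !== 1) {
                    reject(err);
                    return;
                }
                resolve(stdout);
            }
        );
    });
}
```

The resulting string could then be passed to `libxmljs.parseXml` as in the snippet above. Any other failure (tidy missing, or exit status 2 for errors) rejects the promise.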
轮回少年

2020-12-07 21:39
I think Osmosis is what you're looking for.
- Uses native libxml C bindings
- Supports CSS 3.0 and XPath 1.0 selector hybrids
- Sizzle selectors, Slick selectors, and more
- No large dependencies like jQuery, cheerio, or jsdom
- HTML parser features
  - Fast parsing
  - Very fast searching
  - Small memory footprint
- HTML DOM features
  - Load and search ajax content
  - DOM interaction and events
  - Execute embedded and remote scripts
  - Execute code in the DOM
Here's an example:
```
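// Assumes `url` is defined and `assert` is a nodeunit-style test object
var osmosis = require('osmosis');
var count = 0;

osmosis.get(url)
    .find('//div[@class]/ul[2]/li')
    .then(function () {
        count++;
    })
    .done(function () {
        assert.ok(count == 2);
        assert.done();
    });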
```