Question
I am trying to find all the text in an HTML document along with each text node's parent tag. In the example below, the variable html holds the sample HTML from which I try to extract the tags and the text.
This works and, as expected, outputs the tags with their text.
I have used cheerio to traverse the DOM; its API works much like jQuery's.
const cheerio = require("cheerio");
const html = `
<html>
<head></head>
<body>
<p>
Regular bail is the legal procedure through which a court can direct
release of persons in custody under suspicion of having committed an offence,
usually on some conditions which are designed to ensure
that the person does not flee or otherwise obstruct the course of justice.
These conditions may require executing a “personal bond”, whereby a person
pledges a certain amount of money or property which may be forfeited if
there is a breach of the bail conditions. Or, a court may require
executing a bond “with sureties”, where a person is not seen as
reliable enough and may have to present
<em>other persons</em> to vouch for her,
and the sureties must execute bonds pledging money / property which
may be forfeited if the accused person breaches a bail condition.
</p>
</body>
</html>
`;
const getNodeType = function (renderedHTML, el, nodeType) {
  const $ = cheerio.load(renderedHTML)
  return $(el).find(":not(iframe)").addBack().contents().filter(function () {
    return this.nodeType == nodeType;
  });
}
let allTextPairs = [];
const $ = cheerio.load(html);
getNodeType(html, $("html"), 3).map((i, node) => {
  const parent = node.parentNode.tagName;
  const nodeValue = node.nodeValue.trim();
  allTextPairs.push([parent, nodeValue])
});
console.log(allTextPairs);
But the problem is that the extracted text nodes are out of order. In the logged output, other persons is reported at the end, although it should occur before to vouch for her .... Why does this happen? How can I prevent this?
Answer 1:
You might want to just walk the tree depth-first, which visits the nodes in document order. The walk function is courtesy of this gist.
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}
walk(cheerio.load(html).root()[0], (node, parents) => {
  if (node.type === "text" && node.data.trim()) {
    console.log(parents[parents.length - 1].name, node.data);
  }
});
This only prints each parent tag name and its text, but you could just as well push them into that allTextPairs array of yours.
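As a rough sketch of that last suggestion, the same depth-first walk can fill an array of [parent, text] pairs instead of logging. The helper name collectTextPairs and the hand-built sampleTree are hypothetical and only for illustration; the sketch assumes nodes shaped like the ones cheerio exposes (type, name, data, children), so it runs without cheerio installed.

```javascript
// Hypothetical helper: collects [parentTagName, trimmedText] pairs in document order.
// Assumes nodes shaped like the ones cheerio exposes: { type, name, data, children }.
function collectTextPairs(root) {
  const pairs = [];
  (function walk(node, parent) {
    // Keep only text nodes that are not pure whitespace.
    if (node.type === "text" && node.data.trim()) {
      pairs.push([parent ? parent.name : undefined, node.data.trim()]);
    }
    // Visit children left to right, so pairs come out in document order.
    (node.children || []).forEach((child) => walk(child, node));
  })(root, null);
  return pairs;
}

// Hand-built stand-in for a parsed <p>...<em>...</em>...</p> fragment.
const sampleTree = {
  type: "tag",
  name: "p",
  children: [
    { type: "text", data: " may have to present " },
    { type: "tag", name: "em", children: [{ type: "text", data: "other persons" }] },
    { type: "text", data: " to vouch for her, " },
  ],
};

const allTextPairs = collectTextPairs(sampleTree);
console.log(allTextPairs);
```

With cheerio, you would presumably pass cheerio.load(html).root()[0] as the root instead of sampleTree; the em text then lands between the two surrounding p text nodes rather than at the end.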
Source: https://stackoverflow.com/questions/63270123/extracting-text-tags-in-order-how-can-this-be-done