Extracting text tags in order - How can this be done?

Posted by 风格不统一 on 2021-02-11 17:07:43

Question


I am trying to find all the text in an HTML document, along with the parent tag of each piece of text. In the example below, the variable named html holds the sample HTML from which I try to extract the tags and the text. This works and, as expected, outputs the tags with their text.

Here I have used cheerio to traverse the DOM. Its traversal API closely mirrors jQuery's.

const cheerio = require("cheerio");

const html = `
                    <html>
                <head></head>
                <body>
                <p>
                  Regular bail is the legal procedure through which a court can direct 
                  release of persons in custody under suspicion of having committed an offence, 
                  usually on some conditions which are designed to ensure 
                  that the person does not flee or otherwise obstruct the course of justice. 
                  These conditions may require executing a “personal bond”, whereby a person
                  pledges a certain amount of money or property which may be forfeited if 
                  there is a breach of the bail conditions. Or, a court may require
                  executing a bond “with sureties”, where a person is not seen as 
                  reliable enough and may have to present 
                  <em>other persons</em> to vouch for her, 
                  and the sureties must execute bonds pledging money / property which 
                  may be forfeited if the accused person breaches a bail condition.
                </p>
                </body>
            </html>

`;

const getNodeType = function (renderedHTML, el, nodeType) {
    const $ = cheerio.load(renderedHTML)

    return $(el).find(":not(iframe)").addBack().contents().filter(function () {
        return this.nodeType == nodeType;
    });
}

let allTextPairs = [];
const $ = cheerio.load(html);
getNodeType(html, $("html"), 3).map((i, node) => {
    const parent = node.parentNode.tagName;
    const nodeValue = node.nodeValue.trim();
    allTextPairs.push([parent, nodeValue]);
});

console.log(allTextPairs);

The problem is that the text nodes are extracted out of order. In the output, "other persons" is reported at the end, although it should occur before "to vouch for her ...". Why does this happen? How can I prevent this?


Answer 1:


You might want to just walk the tree depth-first, which visits the nodes in document order. The walk function is courtesy of a gist.

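// Depth-first walk: visit el, then each child in order, so nodes come out in document order.
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}

// Start at the document root; parents holds the ancestors of the current node.
walk(cheerio.load(html).root()[0], (node, parents) => {
  // Skip whitespace-only text nodes.
  if (node.type === "text" && node.data.trim()) {
    // The last ancestor is the immediate parent tag.
    console.log(parents[parents.length - 1].name, node.data);
  }
});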

This prints each parent tag name with its text, but you could just as well push the pairs into that allTextPairs array of yours.
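
Collecting the pairs into an array could look roughly like the sketch below. The helper name collectTextPairs and the hand-built sampleTree are assumptions made for illustration. The tree only mimics the type, name, data and children fields that the walk above relies on, so the snippet runs without cheerio installed.

```javascript
// Depth-first walk, same shape as the walk() used in the answer.
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}

// Hypothetical helper: collect [parentTagName, trimmedText] pairs in document order.
function collectTextPairs(root) {
  const pairs = [];
  walk(root, (node, parents) => {
    if (node.type !== "text") return;
    const text = node.data.trim();
    if (!text) return; // skip whitespace-only text nodes
    const parent = parents[parents.length - 1];
    pairs.push([parent ? parent.name : null, text]);
  });
  return pairs;
}

// Hand-built stand-in for a parsed <p> element (an assumption, not real cheerio output).
const sampleTree = {
  type: "tag",
  name: "p",
  children: [
    { type: "text", data: "\n  may have to present " },
    { type: "tag", name: "em", children: [{ type: "text", data: "other persons" }] },
    { type: "text", data: " to vouch for her\n" },
  ],
};

const allTextPairs = collectTextPairs(sampleTree);
console.log(allTextPairs);
```

With cheerio, the same helper would presumably be called as collectTextPairs(cheerio.load(html).root()[0]), mirroring the root node used in the answer. As for why the original approach reorders things: a likely explanation is that .contents() gathers the child nodes of each matched element in turn, so the text inside em is only listed after all of p's own text nodes.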



Source: https://stackoverflow.com/questions/63270123/extracting-text-tags-in-order-how-can-this-be-done
