Possible to get HtmlNode's position & length within original input?

Submitted by 只谈情不闲聊 on 2020-01-04 06:51:13

Question


Consider the following HTML fragment (_ is used for whitespace):

<head>
    ...
    <link ... ___/>
    <!-- ... -->
    ...
</head>

I'm using Html Agility Pack (HAP) to read HTML files/fragments and to strip out links. What I want to do is find the LINK (and some other) elements and then replace them with whitespace, like so:

<head>
    ...
    ____________
    <!-- ... -->
    ...
</head>

The parsing part seems to be working so far: I get the nodes I'm looking for. However, HAP tries to fix the HTML content, while I need everything to stay exactly the same except for the changes I'm making. HAP also seems to have quite a few bugs when writing back content that it read in previously. So the approach I want to take is to let HAP parse the input, then go back to the original input and replace the content I don't want.

The problem is that HtmlNode doesn't seem to have an input length property. It has StreamPosition, which seems to indicate where reading of the node's content started within the input, but I couldn't find a length property that would tell me how many characters were consumed to build the node.

I tried using the OuterHtml property but, unfortunately, HAP tries to fix the LINK by removing the ___/ part (a LINK element is not supposed to be closed). Because of this, OuterHtml.Length returns the wrong length.

Is there a way in HAP to get this information?


Answer 1:


I ended up modifying the code of HtmlAgilityPack to expose a new property that returns the private _outerlength field of HtmlNode.

public virtual int OuterLength
{
    get
    {
        return ( _outerlength );
    }
}

This seems to be working fine so far.




Answer 2:


If you want to achieve the same result without recompiling HAP, then use reflection to access the private variable.

I usually wouldn't recommend using reflection to access private variables, but I recently ran into exactly this situation and used reflection because I was unable to use a recompiled version of the assembly. To do this, create a static variable that holds the FieldInfo object (to avoid looking it up on every use):

// Requires: using System.Reflection; using HtmlAgilityPack;
private static readonly FieldInfo HtmlNodeOuterLengthFieldInfo = typeof(HtmlNode).GetField("_outerlength", BindingFlags.NonPublic | BindingFlags.Instance);

Then whenever you want to access the true length of the original outer HTML:

var match = htmlDocument.DocumentNode.SelectSingleNode("xpath");
var htmlLength = (int)HtmlNodeOuterLengthFieldInfo.GetValue(match);
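
With a start offset and a length in hand, the replace-with-whitespace step from the question could look roughly like the sketch below. SpanBlanker.BlankOut is a hypothetical helper, not part of HAP. It assumes that the start comes from HtmlNode.StreamPosition, that the length comes from the reflected _outerlength value, and that both are character offsets into the same string that was handed to the parser, which is worth verifying for your HAP version and input encoding.

```csharp
using System;
using System.Text;

public static class SpanBlanker
{
    // Hypothetical helper (not part of HAP): overwrites the characters in
    // input[start .. start + length) with spaces. Line breaks inside the span
    // are left alone so line numbers in the output match the original.
    public static string BlankOut(string input, int start, int length)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (start < 0 || length < 0 || start > input.Length - length)
            throw new ArgumentOutOfRangeException(nameof(start));

        var sb = new StringBuilder(input);
        for (int i = start; i < start + length; i++)
        {
            if (sb[i] != '\r' && sb[i] != '\n')
                sb[i] = ' ';
        }
        return sb.ToString();
    }
}
```

For the node above, the call would be something like SpanBlanker.BlankOut(originalHtml, match.StreamPosition, htmlLength), where originalHtml is assumed to hold the exact text that was parsed. Because the result has the same length as the input, the offsets of all other nodes stay valid, so several nodes can be blanked one after another in any order.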


Source: https://stackoverflow.com/questions/12861994/possible-to-get-htmlnodes-position-length-within-original-input
