Get character offsets for elements in jsoup

小鲜肉 2021-02-09 16:54

I need to map jsoup elements back to specific character offsets in the source HTML. In other words, if I have HTML that looks like this:

Hello 
World
1 Answer
  •  长发绾君心
    2021-02-09 17:39

    I don't believe Jsoup has this functionality. This question seems closer to lexical analysis than HTML parsing.

    I would write a grammar, and then write a lexer against that grammar which would tokenize the HTML, and supply the offsets that you're looking for.

    First, parse the document with Jsoup to verify that it is valid HTML.

    Then, lexically analyze the document against a grammar. A grammar might look like:

    Document := {optional-opening-tag} | {literal} {optional-opening-tag} | {optional-closing-tag}
    
    optional-opening-tag := ["<" {literal} ">" {optional-opening-tag}|{literal} ] | ""
    
    optional-closing-tag := "</" {literal} ">" | ""
    
    literal := any string of characters not beginning with whitespace, or containing "<"
    

    Store each token that you find in an object that records the token, the index of its first character, and its length.
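The token object described above could be sketched roughly as follows. This is only a minimal illustration, not part of jsoup: the names `OffsetLexer`, `Token`, and `tokenize` and the sample HTML string are all hypothetical. It assumes that every `<` opens a tag ending at the next `>`, so comments, script bodies, and `>` inside attribute values are not handled. Unlike the `literal` rule in the grammar, it keeps leading whitespace inside text tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a jsoup class): splits HTML into tag and text tokens with source offsets. */
final class OffsetLexer {

    /** One token: its text, the index of its first character, and its length. */
    static final class Token {
        final String text;
        final int start;
        final int length;
        final boolean isTag;

        Token(String text, int start, boolean isTag) {
            this.text = text;
            this.start = start;
            this.length = text.length();
            this.isTag = isTag;
        }

        @Override
        public String toString() {
            return (isTag ? "TAG  " : "TEXT ") + start + "+" + length + " " + text;
        }
    }

    /**
     * Simplifying assumption: a '<' always starts a tag that ends at the next '>'.
     * Real HTML (comments, CDATA, scripts, quoted '>') needs a fuller lexer.
     */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            boolean isTag = html.charAt(pos) == '<';
            int end;
            if (isTag) {
                int close = html.indexOf('>', pos);
                // An unterminated tag runs to the end of the input.
                end = (close < 0) ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', pos);
                // Text runs up to the next tag, or to the end of the input.
                end = (open < 0) ? html.length() : open;
            }
            tokens.add(new Token(html.substring(pos, end), pos, isTag));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, chosen only for illustration.
        for (Token t : tokenize("<p>Hello <b>World</b></p>")) {
            System.out.println(t);
        }
    }
}
```

Each `Token` carries the offset and length that the question asks for, so mapping a token back to the source is `html.substring(t.start, t.start + t.length)`. Associating these tokens with jsoup elements would still need a separate matching step, which this sketch does not attempt.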
