Remove html tags AND get start/end indices of marked-down text?

痞子三分冷 submitted on 2019-12-24 22:04:11

Question


I have a bunch of text in Markdown format:

a**b**c

renders as abc, with the b in bold.

I've converted it to HTML tags to make it more regular:

a<strong>b</strong>c

I know there are a lot of tools out there to convert this to plain text, but I want to do that AND get the indices of the inner text for each markdown/tag.

For example, the input

a<strong>b</strong>c 

would return both the stripped text:

abc

and give me the start (the position of the first tagged character, b) and the end (the position of the first character AFTER the tagged string, c), so for this example (start, end) = (1, 2). This also has to work on nested tags. I know there are a lot of libraries out there (I'm using Python 3) to remove the tags, but I haven't found one that does both tasks. Can anyone help by either pointing out something that does this, or describing an algorithm that might work?

Examples of nested markup:

Some tags can be nested inside their own tag type indefinitely:

<sup><sup>There</sup></sup> <sup><sup>was</sup></sup> <sup><sup>another</sup></sup> <sup><sup>thread</sup></sup> <sup><sup>like</sup></sup> <sup><sup>this</sup></sup>

Lists as well:

<ul>
<li>https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB</li>
<li>79</li>
<li>Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! </li>
</ul>

Strikethrough can also be nested inside italic, and so on:

<em><strike>a</strike></em>

Answer 1:


It looks like what you want is an HTML parser. HTML parsers are complicated things, so you want to use an existing library (creating your own is hard and likely to fail on many edge cases). Unfortunately, as highlighted in this question, most of the existing HTML parsing libraries do not retain position information. The good news is that the one HTML parser which reliably retains position information is in the Python standard library (see HTMLParser in the html.parser module). And as you are using Python 3, the earlier problems with that parser have been fixed.

A basic example might look like this:

from html.parser import HTMLParser


class StripTextParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.data = []
        super().__init__(*args, **kwargs)

    def handle_data(self, data):
        if data.strip():
            # Only keep strings that contain more than whitespace.
            startpos = self.getpos()
            # `self.getpos()` returns `(line, column)` of the start position
            # in the HTML source. Add the length of the data to get the end
            # position (this assumes the data does not span multiple lines).
            endpos = (startpos[0], startpos[1] + len(data))
            self.data.append((data, startpos, endpos))


def strip_text(html):
    parser = StripTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.data

test1 = "<sup><sup>There</sup></sup> <sup><sup>was</sup></sup> <sup><sup>another</sup></sup> <sup><sup>thread</sup></sup> <sup><sup>like</sup></sup> <sup><sup>this</sup></sup>" 

print(strip_text(test1))

# Outputs: [('There', (1, 10), (1, 15)), ('was', (1, 38), (1, 41)), ('another', (1, 64), (1, 71)), ('thread', (1, 94), (1, 100)), ('like', (1, 123), (1, 127)), ('this', (1, 150), (1, 154))]


test2 = """
<ul>
<li>https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB</li>
<li>79</li>
<li>Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! </li>
</ul>
"""

print(strip_text(test2))

# Outputs: [('https://steamcommunity.com/tradeoffer/new/partner=30515749&token=WOIxg5eB', (3, 4), (3, 77)), ('79', (4, 4), (4, 6)), ('Why did the elephants get kicked out of the public pool?  THEY KEPT DROPPING THEIR TRUNKS! ', (5, 4), (5, 95))]

test3 = "<em><strike>a</strike></em>"

print(strip_text(test3))

# Outputs: [('a', (1, 12), (1, 13))]

Without more specific information about the desired output format, I just created a list of tuples. Note that each position is a (line, column) pair in the HTML source, not an offset into the stripped text. Of course, you can refactor to fit your specific needs. And if you want all of the whitespace, remove the if data.strip(): check.
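
If flat character offsets are more convenient than (line, column) pairs, the pairs reported by getpos() can be converted by precomputing where each line of the input starts. The following is a minimal sketch rather than part of the original answer: FlatOffsetParser and flat_offsets are hypothetical names, and it assumes the text contains no character references (such as &amp;), because those are decoded before handle_data() is called and would make len(data) shorter than the span in the source.

```python
from html.parser import HTMLParser


class FlatOffsetParser(HTMLParser):
    """Collect (text, start, end), with start/end as offsets into the source."""

    def __init__(self, source):
        super().__init__()
        # Offset at which each line begins; getpos() line numbers are 1-based.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.chunks = []
        self.feed(source)
        self.close()

    def handle_data(self, data):
        if data.strip():
            line, col = self.getpos()
            start = self.line_starts[line - 1] + col
            # Assumption: no character references in the text, so the data
            # has the same length here as it has in the source.
            self.chunks.append((data, start, start + len(data)))


def flat_offsets(source):
    """Return a list of (text, start, end) for every non-blank text run."""
    return FlatOffsetParser(source).chunks
```

For the third test string, flat_offsets("<em><strike>a</strike></em>") returns [('a', 12, 13)]. Because the end offset is start plus length, slicing the source with each (start, end) pair gives back the text even when a text run spans several lines.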




Answer 2:


This code could be a good starting point for you. Hope it helps.

import sys
from html.parser import HTMLParser

line = sys.argv[1]


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Keep state per instance; class-level lists would be shared
        # between all parser objects.
        self.stripped_text = ""
        self.open_tags = []  # stack of (tag, start index in stripped text)
        self.spans = []      # (tag, start, end) for every closed tag

    def handle_starttag(self, tag, attrs):
        # The inner text of this tag starts at the current end of the
        # stripped text.
        self.open_tags.append((tag, len(self.stripped_text)))

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, start = self.open_tags.pop(i)
                self.spans.append((tag, start, len(self.stripped_text)))
                break

    def handle_data(self, data):
        # Indices advance by the full length of the text, not by one.
        self.stripped_text += data

    def print_indices(self):
        for tag, start, end in self.spans:
            print("%s (%d, %d)" % (tag, start, end))


parser = MyHTMLParser()
parser.feed(line)
parser.close()
print(parser.stripped_text)
parser.print_indices()


Source: https://stackoverflow.com/questions/31953451/remove-html-tags-and-get-start-end-indices-of-marked-down-text
