Matching patterns in Python

失恋的感觉 2021-01-22 08:15

I have an XML file which contains the following strings:

abcdef
 pqrst


        
3 answers
  • 2021-01-22 08:23

    You could use lxml.etree.XMLParser with the recover=True option:

    import sys
    from lxml import etree
    
    invalid_xml = """
    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5>2 and 3<5 and
    try to remove non xml compatible characters.</field>
    """
    root = etree.fromstring("<root>%s</root>" % invalid_xml,
                            parser=etree.XMLParser(recover=True))
    root.getroottree().write(sys.stdout.buffer)  # write() emits bytes, so on Python 3 use the binary stream
    

    Output

    <root>
    <field name="id">abcdef</field>
    <field name="intro"> pqrst</field>
    <field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
    try to remove non xml compatible characters.</field>
    </root>
    

    Note: > is kept in the document as &gt;, while < is removed entirely (a bare < is not valid in XML text).

    Regex-based solution

    For simple XML-like content you could use re.split() to separate the tags from the text and make the substitutions only in the non-tag text regions:

    import re
    from itertools import zip_longest  # named izip_longest on Python 2
    from xml.sax.saxutils import escape  # '<' -> '&lt;'
    
    # assumptions:
    #   doc = *( start_tag / end_tag / text )
    #   start_tag = '<' name *attr [ '/' ] '>'
    #   end_tag = '<' '/' name '>'
    ws = r'[ \t\r\n]*'  # allow ws between any token
    name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
    attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
    start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
    end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
    tag = '{start_tag} | {end_tag}'
    
    assert '{{' not in tag
    while '{' in tag: # unwrap definitions
        tag = tag.format(**vars())
    
    tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)
    
    # escape &, <, > in the text
    iters = [iter(tag_regex.split(invalid_xml))] * 2
    pairs = zip_longest(*iters, fillvalue='')  # iterate 2 items at a time
    print(''.join(escape(text) + tag for text, tag in pairs))
    

    To avoid false positives for tags you could remove some of '{ws}' above.

    Output

    <field name="id">abcdef</field>
    <field name="intro" > pqrst</field>
    <field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
    try to remove non xml compatible characters.</field>
    

    Note: both < and > are escaped in the text.

    You could call any function instead of escape(text) above, e.g.:

    def escape4human(text):
        return text.replace('<', 'less than').replace('>', 'greater than')
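
As a rough, self-contained sketch of plugging such a function into the split-on-tags approach: the TAG_REGEX below is a simplified stand-in for the pattern built above, and transform_text and the sample string are hypothetical names made up for this illustration.

```python
import re

# Simplified tag pattern (an assumption for this sketch): a start tag with
# double-quoted attributes, or an end tag. Deliberately stricter than real XML.
TAG_REGEX = re.compile(r'(</?[a-zA-Z]+(?:\s+[a-zA-Z]+\s*=\s*"[^"]*")*\s*/?>)')


def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')


def transform_text(doc, func):
    """Apply func to the text between tags and leave the tags untouched."""
    parts = TAG_REGEX.split(doc)
    # With one capturing group, split() alternates text, tag, text, tag, ...
    # so even indexes hold text and odd indexes hold tags.
    return ''.join(func(part) if i % 2 == 0 else part
                   for i, part in enumerate(parts))


sample = '<field name="desc">5>2 and 3<5</field>'
result = transform_text(sample, escape4human)
print(result)
# <field name="desc">5greater than2 and 3less than5</field>
```

Any callable that takes and returns a string can be passed as func, for example xml.sax.saxutils.escape.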
    
  • 2021-01-22 08:27

    This seems to work for >:

    re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)
    

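    (?<!...) is a negative lookbehind assertion,

    (?!...) is a negative lookahead assertion,

    and consecutive assertions combine as a logical AND,

    so the whole expression means: substitute every occurrence of '>' that is not preceded by ' " ' and is not preceded by ' "'. Note that the lookahead (?! ) sits before the '>', so it only checks that the '>' itself is not a space and never excludes a match; to skip a '>' that is followed by a space, place it after the character instead: '>(?! )'.

    The < case is similar.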
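
To see how the lookbehind and the lookahead behave, and why the position of the lookahead matters, here is a small sketch. The sample string and the 'GT' replacement are made up for illustration and are not taken from the question's data.

```python
import re

sample = 'x>y x "> y x> y'

# Negative lookbehind only: skip a '>' that directly follows a double quote.
no_quote_before = re.sub(r'(?<!")>', 'GT', sample)
print(no_quote_before)    # xGTy x "> y xGT y

# Lookbehind AND lookahead: also skip a '>' that is followed by a space.
# The lookahead comes after the '>' so that it inspects the next character.
strict = re.sub(r'(?<!")>(?! )', 'GT', sample)
print(strict)             # xGTy x "> y x> y

# Lookahead placed before the '>': it tests the '>' itself, which is never
# a space, so it excludes nothing and the result equals the first one.
misplaced = re.sub(r'(?<!")(?! )>', 'GT', sample)
print(misplaced)          # xGTy x "> y xGT y
```

Lookarounds match a position without consuming characters, which is why several of them can be stacked at the same spot and all have to hold.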

  • 2021-01-22 08:36

    Use ElementTree for XML parsing.
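
A minimal sketch of what that looks like with the standard library's xml.etree.ElementTree. The sample documents are assumptions made up for this illustration. ElementTree is a strict parser, so it only accepts the text once < and > have been escaped, and it rejects a raw < inside text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Escaped text parses fine and is unescaped again when read back.
well_formed = '<root><field name="desc">%s</field></root>' % escape('5>2 and 3<5')
root = ET.fromstring(well_formed)
field = root.find('field')
print(field.get('name'), '->', field.text)   # desc -> 5>2 and 3<5

# A raw '<' in text is not well-formed XML, so the parser raises ParseError.
try:
    ET.fromstring('<root><field name="desc">3<5</field></root>')
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)   # False
```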
