Splitting strings in python

问题

I have a string which is like this:

this is [bracket test] "and quotes test "

I'm trying to write something in Python to split it up by space while ignoring spaces within square braces and quotes. The result I'm looking for is:

['this','is','bracket test','and quotes test ']

回答1:

Here's a simplistic solution that works with your test input:

import re
re.findall('\[[^\]]*\]|\"[^\"]*\"|\S+',s)

This will return any code that matches either

a open bracket followed by zero or more non-close-bracket characters followed by a close bracket,
a double-quote followed by zero or more non-quote characters followed by a quote,
any group of non-whitespace characters

This works with your example, but might fail for many real-world strings you may encounter. For example, you didn't say what you expect with unbalanced brackets or quotes,or how you want single quotes or escape characters to work. For simple cases, though, the above might be good enough.

回答2:

To complete Bryan post and match exactly the answer :

>>> import re
>>> txt = 'this is [bracket test] "and quotes test "'
>>> [x[1:-1] if x[0] in '["' else x for x in re.findall('\[[^\]]*\]|\"[^\"]*\"|\S+', txt)]
['this', 'is', 'bracket test', 'and quotes test ']

Don't misunderstand the whole syntax used : This is not several statments on a single line but a single functional statment (more bugproof).

回答3:

Here's a simplistic parser (tested against your example input) that introduces the State design pattern.

In real world, you probably want to build a real parser using something like PLY.

class SimpleParser(object):

    def __init__(self):
        self.mode = None
        self.result = None

    def parse(self, text):
        self.initial_mode()
        self.result = []
        for word in text.split(' '):
            self.mode.handle_word(word)
        return self.result

    def initial_mode(self):
        self.mode = InitialMode(self)

    def bracket_mode(self):
        self.mode = BracketMode(self)

    def quote_mode(self):
        self.mode = QuoteMode(self)


class InitialMode(object):

    def __init__(self, parser):
        self.parser = parser

    def handle_word(self, word):
        if word.startswith('['):
            self.parser.bracket_mode()
            self.parser.mode.handle_word(word[1:])
        elif word.startswith('"'):
            self.parser.quote_mode()
            self.parser.mode.handle_word(word[1:])
        else:
            self.parser.result.append(word)


class BlockMode(object):

    end_marker = None

    def __init__(self, parser):
        self.parser = parser
        self.result = []

    def handle_word(self, word):
        if word.endswith(self.end_marker):
            self.result.append(word[:-1])
            self.parser.result.append(' '.join(self.result))
            self.parser.initial_mode()
        else:
            self.result.append(word)

class BracketMode(BlockMode):
    end_marker = ']'

class QuoteMode(BlockMode):
    end_marker = '"'

回答4:

Here's a more procedural approach:

#!/usr/bin/env python

a = 'this is [bracket test] "and quotes test "'

words = a.split()
wordlist = []

while True:
    try:
        word = words.pop(0)
    except IndexError:
        break
    if word[0] in '"[':
        buildlist = [word[1:]]
        while True:
            try:
                word = words.pop(0)
            except IndexError:
                break
            if word[-1] in '"]':
                buildlist.append(word[:-1])
                break
            buildlist.append(word)
        wordlist.append(' '.join(buildlist))
    else:
        wordlist.append(word)

print wordlist

回答5:

Well, I've encountered this problem quite a few times, which led me to write my own system for parsing any kind of syntax.

The result of this can be found here; note that this may be overkill, and it will provide you with something that lets you parse statements with both brackets and parentheses, single and double quotes, as nested as you want. For example, you could parse something like this (example written in Common Lisp):

(defun hello_world (&optional (text "Hello, World!"))
    (format t text))

You can use nesting, brackets (square) and parentheses (round), single- and double-quoted strings, and it's very extensible.

The idea is basically a configurable implementation of a Finite State Machine which builds up an abstract syntax tree character-by-character. I recommend you look at the source code (see link above), so that you can get an idea of how to do it. It's capable via regular expressions, but try writing a system using REs and then trying to extend it (or even understand it) later.

回答6:

Works for quotes only.

rrr = []
qqq = s.split('\"')
[ rrr.extend( qqq[x].split(), [ qqq[x] ] )[ x%2]) for x in range( len( qqq ) )]
print rrr

来源：https://stackoverflow.com/questions/234512/splitting-strings-in-python

标签

python

string

split

parsing

tokenize