Accommodate two types of quotes in a regex

Question

I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes:

" and “

There's a very subtle difference between the two. Currently, I explicitly mention both of these types in my regex:

\"*\“*

I am afraid, though, that future data may contain a different 'type' of quote on which my regex will fail. How many different types of quotes exist? Is there a way to normalize them to just one type so that my regex won't break on unseen data?

Edit -

My input data consists of HTML files. I decode each line as ASCII and then unescape URL-encoded sequences and HTML entities:

escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))

where line is each line of the HTML file. I need to 'ignore' non-ASCII bytes because not all files in my database have the same encoding, and I don't know the encoding before reading a file.

Edit2

I am unable to do this with the replace function. I tried replace('"','') but it doesn't replace the other type of quote, '“'. If I add that one in another replace call, Python raises a non-ASCII character error.

Condition

No external libraries are allowed; only native Python libraries may be used.

Answer 1:

I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.

You could keep a list of common Unicode quotation mark characters (here's a list for a good start) and build the part of the regex that matches quotation marks programmatically.
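A minimal sketch of that idea, assuming Python 3 (where str is already Unicode) and only the standard library. The names QUOTE_CHARS, build_quote_pattern and strip_quotes are hypothetical, and the starter list is deliberately short; extend it as new characters turn up in your data.

```python
import re

# Hypothetical starter list of quotation mark characters (not exhaustive).
QUOTE_CHARS = [
    '\u0022',  # QUOTATION MARK "
    '\u0027',  # APOSTROPHE '
    '\u00ab',  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    '\u00bb',  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    '\u2018',  # LEFT SINGLE QUOTATION MARK
    '\u2019',  # RIGHT SINGLE QUOTATION MARK
    '\u201c',  # LEFT DOUBLE QUOTATION MARK
    '\u201d',  # RIGHT DOUBLE QUOTATION MARK
]


def build_quote_pattern(chars):
    """Compile a character class that matches any one character in chars."""
    # re.escape keeps the class valid even if a listed character is special in a regex.
    return re.compile('[' + ''.join(re.escape(c) for c in chars) + ']')


QUOTE_RE = build_quote_pattern(QUOTE_CHARS)


def strip_quotes(text):
    """Remove every listed quotation mark from text."""
    return QUOTE_RE.sub('', text)
```

Keeping the characters in a list means a newly discovered quote is a one-line change rather than an edit to the regex itself.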

Answer 2:

I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.

How many different types of quotes exist?

29, according to Unicode, see below.

The Unicode standard provides a definitive text file on Unicode properties, PropList.txt, which includes a list of quotation marks. Since Python's re module does not support Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:

// placed on multiple lines for readability, remove spaces
// and then place in your regex in place of the current quotes
[\u0022   \u0027    \u00AB    \u00BB
\u2018    \u2019    \u201A    \u201B
\u201C    \u201D    \u201E    \u201F
\u2039    \u203A    \u300C    \u300D
\u300E    \u300F    \u301D    \u301E
\u301F    \uFE41    \uFE42    \uFE43
\uFE44    \uFF02    \uFF07    \uFF62
\uFF63]
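As a rough illustration of how the class above might be used, here is a sketch assuming Python 3 and the standard re module. The names QUOTATION_MARKS, QUOTE_CLASS and remove_quotes are hypothetical. Contiguous code points from the list are written as ranges, so the class still covers exactly the 29 characters listed.

```python
import re

# The code points listed above; contiguous runs are collapsed into ranges.
QUOTATION_MARKS = (
    '\u0022\u0027\u00ab\u00bb'
    '\u2018-\u201f'            # eight general-punctuation quotes
    '\u2039\u203a'
    '\u300c-\u300f'            # CJK corner brackets
    '\u301d-\u301f'            # CJK double prime quotation marks
    '\ufe41-\ufe44'            # vertical presentation forms
    '\uff02\uff07\uff62\uff63'
)

QUOTE_CLASS = re.compile('[' + QUOTATION_MARKS + ']')


def remove_quotes(text):
    """Delete every character matched by the quotation mark class."""
    return QUOTE_CLASS.sub('', text)
```

To normalize instead of delete, pass a replacement such as '"' to sub, keeping in mind that the list mixes single and double quotes, so one replacement character may not suit both.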

As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.

Answer 3:

Turns out there's a much easier way to do this. Just put the 'u' prefix in front of the regex string literal you write in Python.

regexp = ur'\"*\“*'

Make sure you pass the re.UNICODE flag when you compile, search, or match your regex against your string.

re.findall(regexp, string, re.UNICODE)

Don't forget to include the

#!/usr/bin/python
# -*- coding:utf-8 -*-

at the start of the source file to make sure unicode strings can be written in your source file.
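For comparison, a sketch assuming Python 3, where str literals are Unicode and need no prefix. Writing the curly quote as the escape \u201c keeps the source file pure ASCII, so the pattern does not depend on the file's encoding declaration. The names QUOTES and find_quotes are hypothetical, and the pattern uses a character class rather than the original `\"*\“*`, which can also match the empty string.

```python
import re

# '\u201c' is LEFT DOUBLE QUOTATION MARK written as an escape sequence,
# so this source file contains only ASCII characters.
QUOTES = re.compile('["\u201c]')


def find_quotes(text):
    """Return every straight or left curly double quote found in text."""
    return QUOTES.findall(text)
```

The same escape works in a plain replace call, for example text.replace('\u201c', ''), which avoids the non-ASCII source error described in the question.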

Source: https://stackoverflow.com/questions/9860400/accommodate-two-types-of-quotes-in-a-regex

Tags

python

regex

quotes

double-quotes