I have a file that contains "straight" (normal, ASCII) quotes, and I'm trying to convert them to real quotation mark glyphs (“curly” quotes, U+2018 to U+201D). Since the tran
guess which curly quote character to use, if possible
It is not, in the general case.
The simple algorithm that most automatic converters use is just to look at the character immediately before the ' or ". If it's a space, the start of a line, an opening bracket or another opening quote, choose the opening quote; otherwise choose the closing one. The advantage of this method is that it can run as-you-type, so when it chooses the wrong one you can generally correct it.
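To make the previous-character rule concrete, here is a minimal sketch in JavaScript. The function name `naiveCurl` and the exact set of "opener" characters are my own assumptions, not taken from any particular converter.

```javascript
// Characters after which a straight quote is treated as an opening quote.
// This set is an assumption; real converters differ in the details.
const OPENERS = new Set([" ", "\t", "\n", "(", "[", "{", "\u2018", "\u201C"]);

function naiveCurl(text) {
  let out = "";
  for (const ch of text) {
    if (ch !== "'" && ch !== '"') {
      out += ch;
      continue;
    }
    // Start of text counts as start of line.
    const prev = out.length === 0 ? "\n" : out[out.length - 1];
    const opening = OPENERS.has(prev);
    if (ch === '"') {
      out += opening ? "\u201C" : "\u201D";
    } else {
      out += opening ? "\u2018" : "\u2019";
    }
  }
  return out;
}
```

Note that `naiveCurl("Fish 'n' Chips")` turns the first apostrophe into an opening quote, which is exactly the kind of wrong guess discussed further down.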
we want to leave apostrophes alone
I agree! But not many people do. It's normal typesetting practice to turn an apostrophe into a curly single quote (U+2019, the same glyph as a closing single quote). Personally I prefer to leave them as they are, to distinguish them from enclosing quotes, making the text easier (I find) to read, and possible to process automatically.
However, this really is just my taste, and it is not generally considered justified merely because the Unicode standard names the straight character (U+0027) APOSTROPHE.
'tis possible apostrophes are at the beginning of words
Indeed. There is no way to tell an apostrophe from a potential open quote in cases like the classic Fish 'n' Chips, short of enormous amounts of cultural context.
(Not to mention primes, okinas, glottal stops and various other uses of the apostrophe...)
The best thing to do, of course, is install a keyboard layout that can type smart quotes directly. I have ‘’ on AltGr+[], “” on AltGr+Shift+[], –— on AltGr+[Shift]+dash, and so on.
["I like 'That '70s show'", she said]
I originally thought maybe using multiple passes over the text to gain context insight might help but that would not solve all instances.
The best thing you could do is draw up a list of possible words and expressions like 'twas, 'tis, '70's etc. and throw them in the dictionary with auto-correction on them to convert the straights to curls and vice versa. Spell checks run on every word anyway, don't they? (Sorry, that doesn't help your emacs problem.)
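A rough sketch of that word-list idea, in JavaScript. The list `APOSTROPHE_WORDS`, the decade pattern and the function name `protectApostrophes` are all hypothetical; a usable list would be far longer and language-specific.

```javascript
// Hypothetical, deliberately tiny list of words that start with an apostrophe.
const APOSTROPHE_WORDS = ["'twas", "'tis", "'n'", "'em"];
// Decades such as '70s or '90s (an assumed pattern, not an exhaustive one).
const DECADE = /'(\d0s)\b/g;

function protectApostrophes(text) {
  let out = text.replace(DECADE, "\u2019$1");
  for (const word of APOSTROPHE_WORDS) {
    // Escape regex metacharacters in the word before building a pattern.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const re = new RegExp("(^|\\s)" + escaped + "(?=\\s|$)", "gi");
    // Keep the leading whitespace and the original letter case;
    // only the straight apostrophes are swapped for U+2019.
    out = out.replace(re, (m, lead) =>
      lead + m.slice(lead.length).replace(/'/g, "\u2019"));
  }
  return out;
}
```

Running this before any open/close guessing leaves fewer straight single quotes to decide about. It will still misfire on text where a listed word really does follow an opening quote.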
OO ignores the single quote curving altogether, from what I can tell.
Wikipedia has a bit of info on these pesky things.
It looks like your initial post covers most of the ideas I was going to write here; this is what I've got left...
For the apostrophe example ("I like 'That '70s show'", she said), it's unlikely that quotes will be nested directly inside quotes of the same type. You could take advantage of that.
Best way to do this in my opinion is to make the code only handle unambiguous cases (double quotes are pretty simple). For the ones with multiple possible choices, store their position in a list and examine it when it's finished. You might find a few more easily-coded cases in there, or you might just decide to fix them manually.
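Something like the following is what I have in mind, as a sketch only. Which cases count as "unambiguous" is my assumption here: double quotes are decided by the preceding character, and a single quote with a letter on both sides is taken to be an apostrophe. Everything else is left straight and its position recorded. The name `curlUnambiguous` is made up.

```javascript
function curlUnambiguous(text) {
  const out = text.split("");
  const ambiguous = []; // positions of single quotes left for later review
  const isLetter = (c) => c !== undefined && /\p{L}/u.test(c);
  for (let i = 0; i < text.length; i++) {
    const prev = text[i - 1]; // undefined at the start of the text
    const next = text[i + 1]; // undefined at the end of the text
    if (text[i] === '"') {
      const opening = prev === undefined || /[\s(\[{]/.test(prev);
      out[i] = opening ? "\u201C" : "\u201D";
    } else if (text[i] === "'") {
      if (isLetter(prev) && isLetter(next)) {
        out[i] = "\u2019"; // apostrophe inside a word, e.g. don't
      } else {
        ambiguous.push(i);
      }
    }
  }
  return { text: out.join(""), ambiguous };
}
```

Because every replacement is a single character, the recorded positions stay valid in the returned text, so you can walk the `ambiguous` list afterwards and fix each one by hand or with further rules.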
The basic thing is to always try to find matching pairs. Given that every quote has a matching quote you could make your program ask for your help only where it's unsure which is the matching quote.
Opening quotes are always at the start of a line or have a space in front of them. Closing quotes always have a space after them. If you find a colon followed by a quote, it's probably a closing quote.
If the letter following the quote is upper case it's probably an opening quote.
If there's a punctuation mark in front of the quote it's probably a closing quote.
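These hints can be combined into a simple score, sketched below. The equal weights, the punctuation set and the name `guessQuoteRole` are assumptions of mine; the point is only that conflicting or missing hints produce "unsure", which is where the program would ask for help.

```javascript
// Returns "open", "close" or "unsure" for the quote at index i of text.
function guessQuoteRole(text, i) {
  const prev = text[i - 1];
  const next = text[i + 1];
  let score = 0; // positive means opening, negative means closing
  // Start of text or whitespace in front: evidence for an opening quote.
  if (prev === undefined || /\s/.test(prev)) score += 1;
  // End of text or whitespace behind: evidence for a closing quote.
  if (next === undefined || /\s/.test(next)) score -= 1;
  // Upper-case letter right after the quote: probably opening.
  if (next !== undefined && /\p{Lu}/u.test(next)) score += 1;
  // Punctuation (including a colon) right before the quote: probably closing.
  if (prev !== undefined && /[.,;:!?]/.test(prev)) score -= 1;
  if (score > 0) return "open";
  if (score < 0) return "close";
  return "unsure";
}
```

An apostrophe inside a word such as don't scores zero and comes back as "unsure", while a word-initial apostrophe as in 'tis still looks like an opening quote to these rules.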
Try to do it iteratively. In the first round, the program should show you all the quotes whose function it can determine with confidence, just so you can check that it hasn't made any errors.
In the second round, it should ask about the quotes it is unsure of, such as those that could be either opening quotes or apostrophes. For every opening quote, it then has to find the matching closing quote automatically.
Another, maybe less complex, idea could be:
Find all non-quotes (apostrophes and the like) by asking the user about each mark that could be either a quote or a non-quote.
All the remaining quotes should be fairly easy to convert. Opening quotes have a space or newline in front of them, and closing quotes have one after them.
One last thought:
You should break the process up, for example by processing one paragraph at a time. If your program makes an error, which it probably will given the complexity of language, it's easier for you to correct it, and the program can start fresh with the next paragraph.
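As a small illustration of the paragraph-wise idea, this sketch alternates opening and closing double quotes and resets its state at every blank line. It handles double quotes only, and the function name `curlDoubleQuotesByParagraph` is made up.

```javascript
function curlDoubleQuotesByParagraph(text) {
  return text
    .split(/(\n\s*\n)/) // the capture group keeps the blank-line separators
    .map((part) => {
      let open = false; // state is reset for every paragraph
      return part.replace(/"/g, () => {
        open = !open;
        return open ? "\u201C" : "\u201D";
      });
    })
    .join("");
}
```

An unclosed quote in one paragraph then cannot flip every quote in the rest of the file; the damage stops at the paragraph break.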
Here is a regular expression that might help for double-quotes:
/([^\s\(]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|\n\n)([^\s\)\.\,;]?)/gms
It will restart at each paragraph, and it will identify pairs of quotes (and will also allow you to check that the spacing is correct before and after the quotes, if that's useful).
Numbered element identification
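1 non-whitespace character before the leading quote (if any, and not an opening paren)
2 whitespace after the leading quote
3 the quoted text itself
4 last backslash-escaped run inside the quoted text (inner group of 3)
5 whitespace before the trailing quote
6 trailing quote (or double newline, i.e. the start of a new paragraph)
7 character after the trailing quote, if it is not whitespace, a right paren, a period, a comma or a semicolon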
I think it would be reasonable to extend this for your other cases (I just haven't had the need to yet.)
It's JavaScript syntax. It's pretty fast, though I haven't optimized it beyond "good enough": it will process, say, a 400-page book in about a second. I think it would be hard to match its speed procedurally.
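For what it's worth, here is one way the expression above could be plugged into `String.prototype.replace`. The wrapper `curlDoublePairs` and the way the groups are reassembled are assumptions for illustration, not the original poster's code, and the sketch does nothing careful for paragraphs that contain an unclosed quote.

```javascript
// The regular expression quoted above, unchanged.
const QUOTE_PAIR = /([^\s\(]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|\n\n)([^\s\)\.\,;]?)/gms;

function curlDoublePairs(text) {
  return text.replace(
    QUOTE_PAIR,
    (match, before, ws1, body, _escaped, ws2, end, after) => {
      // Group 6 is either the trailing quote or a paragraph break;
      // only a real trailing quote is turned into a closing curly quote.
      const close = end === '"' ? "\u201D" : end;
      return before + "\u201C" + ws1 + body + ws2 + close + after;
    }
  );
}
```

The whitespace groups are passed through untouched here; you could just as well inspect `ws1` and `ws2` to flag stray spaces inside the quotes.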
Computational linguistics anyone?
Somebody mentioned that with a vast amount of cultural context, it might be feasible. So the overkill but most accurate automated solution to the problem is shallow parsing. This requires a corpus of whatever language and mode you're dealing with (e.g. the Brown corpus for general English).
Develop a classifier for curly quotes based on the syntactic context of the curly quotes occurring in the corpus. Finally, give your arbitrary syntactic context with a straight quote to your classifier and out pops the most probable quote character!
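To show the shape of the train-then-classify idea, here is a toy stand-in. It is not shallow parsing: instead of syntactic context it only counts which curly glyph appears between which classes of neighbouring characters in a text that already uses curly quotes. All names (`charClass`, `train`, `predict`) and the feature choice are assumptions for illustration.

```javascript
// Maps each curly glyph to the straight character it would replace.
const CURLY = { "\u2018": "'", "\u2019": "'", "\u201C": '"', "\u201D": '"' };

// Very coarse context feature: the class of a neighbouring character.
function charClass(c) {
  if (c === undefined) return "edge";
  if (/\s/.test(c)) return "space";
  if (/\p{Lu}/u.test(c)) return "upper";
  if (/\p{Ll}/u.test(c)) return "lower";
  if (/\p{N}/u.test(c)) return "digit";
  return "punct";
}

// Count, for each (straight char, previous class, next class) context,
// how often each curly glyph occurs in the training text.
function train(corpus) {
  const counts = new Map();
  for (let i = 0; i < corpus.length; i++) {
    const straight = CURLY[corpus[i]];
    if (straight === undefined) continue;
    const key = straight + "|" + charClass(corpus[i - 1]) + "|" + charClass(corpus[i + 1]);
    if (!counts.has(key)) counts.set(key, new Map());
    const byGlyph = counts.get(key);
    byGlyph.set(corpus[i], (byGlyph.get(corpus[i]) || 0) + 1);
  }
  return counts;
}

// Most frequent glyph for the context of the straight quote at index i,
// or null if that context never occurred in the training text.
function predict(model, text, i) {
  const key = text[i] + "|" + charClass(text[i - 1]) + "|" + charClass(text[i + 1]);
  const byGlyph = model.get(key);
  if (!byGlyph) return null;
  let best = null;
  let bestCount = 0;
  for (const [glyph, n] of byGlyph) {
    if (n > bestCount) {
      best = glyph;
      bestCount = n;
    }
  }
  return best;
}
```

A real classifier would use richer features (part-of-speech tags, neighbouring words) and a much larger corpus; the `null` result for unseen contexts is where such a system would fall back to a default or ask the user.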