Discover identical adjacent strings with regex and Python

Question

Consider this text:

...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,

genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast

beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow

(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)

beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...

I would like to parse this text with Python and keep only the strings that appear exactly twice in a row. For example, an acceptable result would be:

bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne

because each of these strings appears immediately next to an identical copy of itself, like this:

bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne

So how can I search for adjacent, identical strings with a regular expression? Thanks!

Answer 1:

You can use the following regex:

(\b.+)\1

Or, to just match and capture the unique substring part:

(\b.+)(?=\1)

The \b makes sure the match starts at a word boundary, so it cannot begin in the middle of a word. Then .+ matches one or more characters other than a newline (in single-line mode, re.DOTALL in Python, . also matches a newline). Finally, the backreference \1 matches exactly the same sequence of characters that was captured by (\b.+).

With the (?=\1) look-ahead version, the matched text does not contain the duplicate part, because a look-ahead only checks what follows and does not consume any text.
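
To make the difference between the two patterns concrete, here is a minimal sketch that runs both over three lines taken from the question. The variable names (consuming, lookahead and so on) are only illustrative and are not part of the original answer.

```python
import re

# Three sample lines taken from the question's text.
text = (
    "bedeubedeu France The Provençal name for tripe\n"
    "bee balmbee balm Bergamot\n"
    "beefbeef The meat of the animal known as a cow"
)

# Variant 1: the backreference \1 consumes the duplicate as well.
consuming = re.compile(r'(\b.+)\1')
# Variant 2: the look-ahead only checks that the duplicate follows.
lookahead = re.compile(r'(\b.+)(?=\1)')

# group(0) is the whole match, group(1) is the captured unique part.
full_matches = [m.group(0) for m in consuming.finditer(text)]
short_matches = [m.group(0) for m in lookahead.finditer(text)]
captured = [m.group(1) for m in consuming.finditer(text)]

print(full_matches)   # ['bedeubedeu', 'bee balmbee balm', 'beefbeef']
print(short_matches)  # ['bedeu', 'bee balm', 'beef']
print(captured)       # ['bedeu', 'bee balm', 'beef']
```

With either pattern, group 1 holds the unique part. Only the whole match (group 0) differs.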

UPDATE

See Python demo:

#!/usr/bin/env python3
import re

# Capture a run of characters that starts at a word boundary
# and is immediately followed by an identical copy of itself.
p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for m in p.finditer(test_str):
    # group(1) is the unique part, without its duplicate
    print(m.group(1))

Output:

zyme
abbrühen
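
One caveat is that \b allows a match to start at any word boundary, so a word that repeats internally elsewhere in a line is reported too. If the doubled headwords always begin a line (an assumption based on the sample text, not something the answer states), anchoring with ^ and re.MULTILINE is one possible way to narrow the results. The sample line below is made up for illustration.

```python
import re

# Made-up line: "couscous" contains an internal repetition ("cous" + "cous").
line = "beefbeef Used in couscous dishes"

# Pattern from the answer: a match may start at any word boundary.
anywhere = re.compile(r'(\b.+)\1')
# Assumed variant: only accept a repetition that starts a line.
line_start = re.compile(r'^(.+)\1', re.MULTILINE)

# findall returns the contents of group 1 for each match.
anywhere_hits = anywhere.findall(line)
line_start_hits = line_start.findall(line)

print(anywhere_hits)    # ['beef', 'cous']
print(line_start_hits)  # ['beef']
```

Whether the anchored form is preferable depends on the real data. It would miss a doubled entry that does not start at the beginning of a line.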

Source: https://stackoverflow.com/questions/32207449/discover-identically-adjacent-strings-with-regex-and-python

Tags

python

regex

regex-negation

regex-lookarounds