Regex for catching abbreviations

Question

I am trying to make a regex that matches abbreviations and their full forms in a string. I have a regex that catches some cases but on the example below, it catches more words than it should. Could anyone please help me fix this?

import re

x = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (SEM).'

re.findall(r'\b([A-Za-z][a-z]+(?:\s[A-Za-z][a-z]+)+)\s+\(([A-Z][A-Z]*[A-Z]\b\.?)',x)

Output:

[('Confirmatory factor analysis', 'CFA'),
 ('special case of what is known as structural equation modelling', 'SEM')]

Answer 1:

The regex alone cannot tell how many words before (CFA) make up the so-called full form. The way to find out is to count the letters in group 2 (call that number l), split group 1 on whitespace, keep the last l words, and rejoin them.
Your regex would accept (CFA.) but not (C.F.A.), so a slight modification is in order to allow an optional period after each letter. It also appears you are trying to say that the abbreviation must consist of two or more letters; there is an easier way to express that.

The change to group 2 in the regex:

(                    # start of group 2
  (?:                # start of non-capturing group
     [A-Z]           # an uppercase letter
     \.?             # optionally followed by a period
  )                  # end of non-capturing group
  {2,}               # the non-capturing group is repeated 2 or more times
)                    # end of group 2
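
To try the modified group 2 on its own, the commented pattern can be compiled with re.VERBOSE, which tells Python to ignore the whitespace and comments inside the pattern. This is only a minimal sketch: the names ABBREVIATION and is_abbreviation and the sample strings are made up for illustration, and the outer capturing parentheses are dropped because the pattern stands alone here.

```python
import re

# Group 2 from the answer, compiled on its own in verbose mode.
ABBREVIATION = re.compile(r"""
    (?:          # start of non-capturing group
       [A-Z]     # an uppercase letter
       \.?       # optionally followed by a period
    ){2,}        # the group is repeated 2 or more times
""", re.VERBOSE)


def is_abbreviation(text):
    """Return True if the whole string matches the group 2 pattern."""
    return ABBREVIATION.fullmatch(text) is not None


for candidate in ["CFA", "C.F.A.", "CFA.", "A", "Cfa"]:
    print(candidate, is_abbreviation(candidate))
```

The first three candidates should be accepted and the last two rejected, since a single letter or a lowercase letter does not give two repetitions of the group.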

The code:

#!/usr/bin/env python3

import re

x = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (S.E.M.).'
results = []
split_regex = re.compile(r'\s+')
for m in re.finditer(r'\b([A-Za-z][a-z]*(?:\s[A-Za-z][a-z]*)+)\s+\(((?:[A-Z]\.?){2,})\)', x):
    abbreviation = m[2]
    # number of letters in the abbreviation, ignoring periods
    l = sum(c.isalpha() for c in abbreviation)
    # keep only the last l words of group 1
    full_form = ' '.join(split_regex.split(m[1])[-l:])
    results.append([full_form, abbreviation])
print(results)

Prints

[['Confirmatory factor analysis', 'CFA'], ['structural equation modelling', 'S.E.M.']]

Answer 2:

Try this. It works by looking for an uppercase string enclosed in parentheses, then validating that the initials of the preceding words match the abbreviation.


import re

string = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (SEM).'

abbrvs = re.findall(r"\(([A-Z][A-Z]+)\)", string)  # find potential abbrvs

# split on whitespace, periods and commas, dropping empty tokens
words = [w for w in re.split(r"[\s.,]+", string) if w]

validated_abbrvs = []
for abbrv in abbrvs:
    end = words.index(f"({abbrv})")
    start = max(end - len(abbrv), 0)  # avoid a negative slice start
    full_name = words[start:end]  # locate preceding words
    if "".join(w[0].upper() for w in full_name) == abbrv:  # validate it matches abbrv
        validated_abbrvs.append((abbrv, " ".join(full_name)))

print(validated_abbrvs)
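
One limitation of words.index is that it always returns the first occurrence, so an abbreviation that appears twice is checked against the same preceding words both times. A possible variant, sketched here under the assumption that words consist only of ASCII letters, uses the match positions from re.finditer instead. The helper name find_abbreviations and the second sample sentence are hypothetical.

```python
import re


def find_abbreviations(text):
    """Return (abbreviation, full form) pairs whose initials match."""
    pairs = []
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbrv = m.group(1)
        # Words that appear before this particular parenthesis.
        preceding = re.findall(r"[A-Za-z]+", text[:m.start()])
        # Take as many trailing words as the abbreviation has letters.
        full_name = preceding[-len(abbrv):]
        initials = "".join(w[0].upper() for w in full_name)
        if initials == abbrv:
            pairs.append((abbrv, " ".join(full_name)))
    return pairs


string = ('Confirmatory factor analysis (CFA)  is a special case of what is '
          'known as structural equation modelling (SEM).')
print(find_abbreviations(string))

# The second (SEM) follows "fit the model", so only the first one validates.
repeated = 'Structural equation modelling (SEM) is common. We fit the model (SEM) again.'
print(find_abbreviations(repeated))
```

Because each candidate is validated against the text before its own position, the second (SEM) in the last example is rejected rather than reported a second time.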

Answer 3:

I used a regular expression to split the string on ( or ), then built a list of tuples by pairing each text segment with the parenthesized content that follows it.

import re
x = 'Confirmatory factor analysis (CFA)  is a special case of what is known as structural equation modelling (SEM).'
lst = re.split(r'[()]', x)  # even indexes: text, odd indexes: parenthesized content
lst = [(lst[i*2].strip(), lst[i*2+1].strip()) for i in range(len(lst)//2)]
final = []
for text, abbr in lst:
    # keep as many trailing words as the abbreviation has letters
    full_form = ' '.join(text.split()[-len(abbr):])
    final.append((abbr, full_form))
print(final)

Result:

[('CFA', 'Confirmatory factor analysis'),
 ('SEM', 'structural equation modelling')]

Source: https://stackoverflow.com/questions/60658473/regex-for-catching-abbreviations

Tags

python

regex

python-3.x

string