Question
In the course of processing a large textual chemical database with Perl, I was faced with the problem of using a regex to match chemical formulae. I have seen these two previous topics, but the answers suggested there are too loose for my requirements.
Specifically, my (admittedly limited) research has led me to this posting, which gives a regex for the currently accepted chemical symbols. I'll copy it here for reference:
[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb
(Thus e.g. C, Cm, and Cn will pass, but not Cg or Cx.)
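As a quick check of that claim, here is a minimal sketch that anchors the symbol alternation so the whole string must be exactly one symbol. The helper name is_symbol is hypothetical, and the C[...] class here includes m and n so that Cm and Cn are covered.

```perl
use strict;
use warnings;

# Element-symbol alternation from the question
# (the C[...] class includes m and n so that Cm and Cn are covered).
my $symbol = qr/[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb/;

# Hypothetical helper: true only if the whole string is one symbol.
# Without the \A ... \z anchors, "Cg" would still "match" via its leading C.
sub is_symbol {
    my ($s) = @_;
    return $s =~ /\A(?:$symbol)\z/ ? 1 : 0;
}

for my $candidate (qw(C Cm Cn Cg Cx)) {
    printf "%-2s %s\n", $candidate, is_symbol($candidate) ? 'valid' : 'invalid';
}
```

The anchors matter because the single-letter class is tried first: unanchored, any string starting with C would be accepted.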
As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both e.g. C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.
So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?
(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)
Answer 1:
Brief
I decided to build one large regex that does what you want while still keeping it readable. This regex would be used in conjunction with a loop to go over the matches for bracket or parentheses groups.
Assumptions
I am assuming the following since the OP has not given a full list of positive and negative matches:
- Nested parentheses aren't possible
- Nested square brackets aren't possible
- Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
- Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group
If any of these assumptions are incorrect, please let me know so that I can fix the regex accordingly.
Answer
View this regex in use here
Code
(?(DEFINE)
(?# Periodic elements )
(?<Hydrogen>H)
(?<Helium>He)
(?<Lithium>Li)
(?<Beryllium>Be)
(?<Boron>B)
(?<Carbon>C)
(?<Nitrogen>N)
(?<Oxygen>O)
(?<Fluorine>F)
(?<Neon>Ne)
(?<Sodium>Na)
(?<Magnesium>Mg)
(?<Aluminum>Al)
(?<Silicon>Si)
(?<Phosphorus>P)
(?<Sulfur>S)
(?<Chlorine>Cl)
(?<Argon>Ar)
(?<Potassium>K)
(?<Calcium>Ca)
(?<Scandium>Sc)
(?<Titanium>Ti)
(?<Vanadium>V)
(?<Chromium>Cr)
(?<Manganese>Mn)
(?<Iron>Fe)
(?<Cobalt>Co)
(?<Nickel>Ni)
(?<Copper>Cu)
(?<Zinc>Zn)
(?<Gallium>Ga)
(?<Germanium>Ge)
(?<Arsenic>As)
(?<Selenium>Se)
(?<Bromine>Br)
(?<Krypton>Kr)
(?<Rubidium>Rb)
(?<Strontium>Sr)
(?<Yttrium>Y)
(?<Zirconium>Zr)
(?<Niobium>Nb)
(?<Molybdenum>Mo)
(?<Technetium>Tc)
(?<Ruthenium>Ru)
(?<Rhodium>Rh)
(?<Palladium>Pd)
(?<Silver>Ag)
(?<Cadmium>Cd)
(?<Indium>In)
(?<Tin>Sn)
(?<Antimony>Sb)
(?<Tellurium>Te)
(?<Iodine>I)
(?<Xenon>Xe)
(?<Cesium>Cs)
(?<Barium>Ba)
(?<Lanthanum>La)
(?<Cerium>Ce)
(?<Praseodymium>Pr)
(?<Neodymium>Nd)
(?<Promethium>Pm)
(?<Samarium>Sm)
(?<Europium>Eu)
(?<Gadolinium>Gd)
(?<Terbium>Tb)
(?<Dysprosium>Dy)
(?<Holmium>Ho)
(?<Erbium>Er)
(?<Thulium>Tm)
(?<Ytterbium>Yb)
(?<Lutetium>Lu)
(?<Hafnium>Hf)
(?<Tantalum>Ta)
(?<Tungsten>W)
(?<Rhenium>Re)
(?<Osmium>Os)
(?<Iridium>Ir)
(?<Platinum>Pt)
(?<Gold>Au)
(?<Mercury>Hg)
(?<Thallium>Tl)
(?<Lead>Pb)
(?<Bismuth>Bi)
(?<Polonium>Po)
(?<Astatine>At)
(?<Radon>Rn)
(?<Francium>Fr)
(?<Radium>Ra)
(?<Actinium>Ac)
(?<Thorium>Th)
(?<Protactinium>Pa)
(?<Uranium>U)
(?<Neptunium>Np)
(?<Plutonium>Pu)
(?<Americium>Am)
(?<Curium>Cm)
(?<Berkelium>Bk)
(?<Californium>Cf)
(?<Einsteinium>Es)
(?<Fermium>Fm)
(?<Mendelevium>Md)
(?<Nobelium>No)
(?<Lawrencium>Lr)
(?<Rutherfordium>Rf)
(?<Dubnium>Db)
(?<Seaborgium>Sg)
(?<Bohrium>Bh)
(?<Hassium>Hs)
(?<Meitnerium>Mt)
(?<Darmstadtium>Ds)
(?<Roentgenium>Rg)
(?<Copernicium>Cn)
(?<Nihonium>Nh)
(?<Flerovium>Fl)
(?<Moscovium>Mc)
(?<Livermorium>Lv)
(?<Tennessine>Ts)
(?<Oganesson>Og)
(?# Regex )
(?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
(?<Num>(?:[1-9]\d*)?)
(?<ElementGroup>(?:(?&Element)(?&Num))+)
(?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
(?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$
Explanation
- The first part of the (?(DEFINE)) section lists each periodic element (ordered by atomic number for easy lookup).
- The Element group acts as a simple or | between each of the elements listed in the first part, with the symbols ordered alphabetically by first character and then by symbol length, longest first (so as not to catch, for example, Carbon C instead of Calcium Ca).
- ElementGroup specifies a group of chemicals in the format: one or more Element, each optionally followed by a positive integer without a leading zero (specified by the group Num)
  - Valid Examples
    - C - Element
    - CH - Element followed by another Element
    - CH3 - Element followed by another Element and a Num
    - O2 - Element followed by a Num
  - Invalid Examples
    - N0 - 0 cannot be used explicitly
    - N01 - the Num group requires the number to begin with 1-9 (or to be absent)
    - A - Element does not exist
    - c - Element does not exist (the regex is case sensitive)
- ElementParenthesesGroup specifies one or more groupings of ElementGroup between parentheses ( ), containing at least one ElementGroup
  - Valid Examples
    - (CH) - ElementGroup surrounded by parentheses
    - (CH3) - ElementGroup surrounded by parentheses
    - (CH3NO4) - multiple ElementGroup surrounded by parentheses
    - (CH3NO4)2 - multiple ElementGroup surrounded by parentheses followed by a Num
  - Invalid Examples
    - (CH[NO4]) - Only ElementGroup is valid inside ElementParenthesesGroup
- ElementSquareBracketGroup specifies a grouping of ElementParenthesesGroup or ElementGroup between square brackets [ ], containing at least one ElementParenthesesGroup and one other group (ElementParenthesesGroup or ElementGroup)
  - Valid Examples
    - [CH3(NO4)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
    - [(NO4)CH]2 - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup, followed by Num
    - [(NO4)(CH3)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
  - Invalid Examples
    - [(NO4)] - Does not contain a second group, so the brackets [ ] are redundant
    - [NO4] - Does not contain an ElementParenthesesGroup
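The rules described above can be exercised in Perl with a cut-down version of the pattern. This is only a sketch under stated assumptions: the Element list is reduced to a handful of symbols for brevity, ParenGroup and SquareGroup are shortened stand-ins for ElementParenthesesGroup and ElementSquareBracketGroup, \A and \z replace ^ and $, and the helper is_formula is hypothetical.

```perl
use strict;
use warnings;

my $formula_regex = qr{
    \A (?: (?&SquareGroup) | (?&ParenGroup) | (?&ElementGroup) )+ \z
    (?(DEFINE)
        (?<Element>      Cl | Si | C | F | H | N | O )
        (?<Num>          (?: [1-9] \d* )? )
        (?<ElementGroup> (?: (?&Element) (?&Num) )+ )
        (?<ParenGroup>   \( (?&ElementGroup) \) (?&Num) )
        (?<SquareGroup>
            \[
            (?: (?&ParenGroup) (?: (?&ElementGroup) | (?&ParenGroup) )+
              | (?: (?&ElementGroup) | (?&ParenGroup) )+ (?&ParenGroup)
            )
            \]
            (?&Num)
        )
    )
}x;

# Hypothetical helper: true if the whole string is a formula under these rules.
sub is_formula {
    my ($s) = @_;
    return $s =~ $formula_regex ? 1 : 0;
}

# Examples taken from the explanation and from the question.
for my $candidate ('C2H6O', '(CH3)2CFCOO(CH2)2Si(CH3)2Cl', '[CH3(NO4)]',
                   '[(NO4)CH]2', 'N0', 'N01', '(CH[NO4])', '[(NO4)]', '[NO4]') {
    printf "%-30s %s\n", $candidate, is_formula($candidate) ? 'valid' : 'invalid';
}
```

Two-letter symbols come before one-letter symbols in the Element alternation, which mirrors the ordering rule described for the full pattern.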
Additional Information
I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.
Ensure the following flags are set:
- g - ensures global matches
- x - ensures whitespace is ignored
- if the data is across multiple lines (separated by a newline character), use m for multi line
Note: Regex will only capture the last group of type X that it finds (and overwrite the previously captured group of said type X). This is the default behaviour of regex and there is currently no way to override it. This may give you undesirable results. You can see this with the last example in the linked regex, as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl, since there are multiple of each group type.
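The overwriting behaviour, and one possible way around it, can be sketched in Perl as follows. The token pattern here is a deliberately loose stand-in (any letters and digits), not the strict element grammar, and the variable names are illustrative only. Walking the string with \G and the g and c flags makes each group a separate match, so every one can be collected.

```perl
use strict;
use warnings;

my $formula = '(CH3)2CFCOO(CH2)2';

# Loose stand-in tokens (an assumption for brevity, not the strict grammar).
my $token = qr/(?<Parentheses>\([A-Za-z0-9]+\)\d*)|(?<Group>[A-Za-z0-9]+)/;

# One anchored match with a repeated group: only the last iteration survives.
$formula =~ /\A(?:$token)+\z/ or die "no match";
my $last_paren = $+{Parentheses};    # '(CH2)2'; the earlier '(CH3)2' is lost

# Workaround: consume one token per match, continuing where the last one ended.
my @tokens;
pos($formula) = 0;
while ($formula =~ /\G(?:$token)/gc) {
    push @tokens, defined $+{Parentheses}
        ? [ Parentheses => $+{Parentheses} ]
        : [ Group       => $+{Group} ];
}

print "last parentheses group: $last_paren\n";
print "$_->[0]: $_->[1]\n" for @tokens;
```

After the loop, comparing pos($formula) with length($formula) shows whether the whole string was consumed.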
Answer 2:
It is best not to assemble such a large regex manually. Instead, let's assume we have an array of atoms @atoms. We can then create a regex matching any of these atoms like:
my ($atoms_regex) = map qr/$_/, join '|', map quotemeta, sort { length($b) <=> length($a) or $a cmp $b } @atoms;
(Sort all items so that longer atom names come first, since the first alternative that matches wins and a plain sort would try C before Cl; then escape all items with quotemeta, join them with a | for alternatives, and compile the regex.)
You can add any used abbreviations to the @atoms array.
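A small sketch of why the order of alternatives matters when building such a regex. The @atoms list here is a made-up subset, and the variable names are illustrative. Perl tries alternatives left to right and takes the first one that matches, so a one-letter symbol placed before a two-letter symbol with the same first letter wins.

```perl
use strict;
use warnings;

# Made-up subset of symbols, for illustration only.
my @atoms = qw(C Cl H O Si S);

# Longest symbols first, ties broken alphabetically.
my @ordered     = sort { length($b) <=> length($a) or $a cmp $b } @atoms;
my $alternation = join '|', map quotemeta, @ordered;    # 'Cl|Si|C|H|O|S'
my $atoms_regex = qr/$alternation/;

# With C listed before Cl, the shorter symbol is taken and "l" is left over.
my ($short_first) = 'Cl' =~ /\A(C|Cl)/;                 # 'C'

# With the longest-first ordering, the full symbol is taken.
my ($long_first) = 'Cl' =~ /\A($atoms_regex)/;          # 'Cl'

print "alternation: $alternation\n";
print "short first: $short_first, long first: $long_first\n";
```

Backtracking can sometimes recover from a wrong first choice, but not when the surrounding quantifier is possessive, so ordering the alternatives is the safer option.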
Next, we can write a regex that allows grouping and numbering. Our regex will match any number of items, where an item may be an atom or a group, and may be followed by a number:
my $chemical_formula_regex = qr/
(?&item)++
(?(DEFINE)
(?<item> (?: \((?&item)++\) | \[(?&item)++\] | $atoms_regex ) [0-9]* )
)
/x;
Within the (?(DEFINE) ...) group we can define named subpatterns with (?<name> ...). A subpattern is like a subroutine for a regex. We can call those subpatterns with (?&name). This allows us to structure the regex without unnecessary repetition.
The /x flag allows us to use whitespace and linebreaks and comments to lay out the regex in a more readable fashion. Regexes don't have to be an incomprehensible mess!
The ++ quantifier instead of + is not strictly necessary, but prevents unwanted backtracking. That may be a bit faster when a match fails.
Source: https://stackoverflow.com/questions/46200305/a-strict-regular-expression-for-matching-chemical-formulae