Parsing a string that represents a chemical reaction and verifying whether the reaction is possible

独厮守ぢ 2021-01-13 01:55

I have to write a program that takes a user's chemical equation as input, like 12 CO2 + 6 H2O -> 2 C6H12O6 + 12 O2, and checks whether the number of atoms on both sides is the same...

3 Answers
  • 2021-01-13 02:09

    The way a proper parser, such as one generated by ANTLR, works is to 1) convert the text into a stream of lexical tokens, then 2) parse the tokens with lookahead into a parse tree.

    Lookahead is useful to know when to "end" a particular structural level of parsing.

    For your requirements, you might be able to skip the distinction between lexing and parsing and just parse from the text directly -- however, an appreciation and use of lookahead would potentially be useful.

    In particular, a buffer that holds the upcoming (remaining) text, tests matches (e.g. regexes) against it, and consumes matches from the front could be useful. This could be implemented either by modifying the remaining string or by advancing an index within it.

    Given such a buffer, your pseudocode might look like:

    class EquationParser {
        protected ParseBuffer buffer;
    
        // parse Equation;      
        //      -- of form "Sum -> Sum".
        //      -- eg. "12 CO2 + 6 H2O -> 2 C6H12O6 + 12 O2"
        public Equation parseEquation() {
    
            Sum left = parseSum();
            matchAndConsume("->");
            Sum right = parseSum();
            return new Equation( left, right);
        }
    
        // parse Sum;
        //      -- eg. "12 CO2 + 6 H2O"
        public Sum parseSum() {
            // parse 1 or more Product terms;
            Sum result = new Sum();
            result.add( parseProduct());
            while (buffer.lookaheadMatch("\\+")) {
                buffer.consumeMatch("\\+");
                result.add( parseProduct());
            }
            return result;
        }
    
        // parse Product term;
        //      -- of form "N formula", or just "formula".
        //      -- eg. "12 CO2" or "CO2"
        public Product parseProduct() {
            int quantity = 1;
            if (buffer.lookaheadMatch("\\d+")) {
                quantity = Integer.parseInt( buffer.consumeMatch("\\d+"));
            }
            Formula formula = parseFormula();
            return new Product( quantity, formula);
        }
    
        // parse Formula;
        //      -- eg. "C6H12O6" or "CO2"
        public Formula parseFormula() {
            Formula result = new Formula();
            result.add( parseTerm());
            while (buffer.lookaheadMatch("[A-Z][a-z]?\\d*")) {
                result.add( parseTerm());
            }
            return result;
        }
    
        // parse Term;
        //      -- eg. "C6", "C", "Co6", or "Co"
        public Term parseTerm() {
            // ... reader exercise to implement...
        }
    
    
        protected void matchAndConsume (String patt) {
            if (! buffer.lookaheadMatch( patt))
                throw new ParseFailed("parse failed:  expected "+patt);
            buffer.consumeMatch( patt);
        }
    }
    

    This is conceptual example code: it is untested and does not include the buffer or a complete parser. It is the reader's job to flesh these out into a complete solution.
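
    As a starting point for the missing piece, here is a minimal sketch of what the buffer could look like, using the "advance an index" variant. The class name ParseBuffer and the methods lookaheadMatch and consumeMatch come from the pseudocode above; everything else (the atEnd helper, skipping whitespace before every match, throwing IllegalStateException) is an assumption made for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a cursor over the remaining input text.
final class ParseBuffer {
    private final String text;
    private int pos = 0;   // index of the first unconsumed character

    ParseBuffer(String text) {
        this.text = text;
    }

    // Assumption: whitespace between tokens is insignificant, so skip it.
    private void skipWhitespace() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    // Build a matcher restricted to the unconsumed remainder of the text.
    private Matcher matcherAtFront(String regex) {
        skipWhitespace();
        return Pattern.compile(regex).matcher(text).region(pos, text.length());
    }

    // True if the regex matches at the very front of the remaining text.
    boolean lookaheadMatch(String regex) {
        return matcherAtFront(regex).lookingAt();
    }

    // Consume the match at the front and return the matched text.
    String consumeMatch(String regex) {
        Matcher m = matcherAtFront(regex);
        if (!m.lookingAt()) {
            throw new IllegalStateException("expected " + regex + " at index " + pos);
        }
        pos = m.end();
        return m.group();
    }

    // True once only whitespace (or nothing) is left.
    boolean atEnd() {
        skipWhitespace();
        return pos == text.length();
    }
}
```

    Because this sketch skips whitespace before every match, "C O2" and "CO2" would parse identically. If that is too lenient for your input, drop skipWhitespace() and match spaces explicitly in the parser.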

  • 2021-01-13 02:24

    This question is asking for a simple parser for a simple type of equation. I am assuming that you do not need to support all kinds of irregular equations with parentheses and weird symbols.

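    Just to be safe, I would use a lot of String.split() instead of hand-written regexes. (String.split() itself takes a regex, which is why the + sign is escaped as \\+ in the code below.)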

    A (relatively) simple solution would do the following:

    1. Split on ->
    2. Make sure there are two pieces
    3. Sum up each piece:
      1. Split on +
      2. Parse each molecule and sum up the atoms:
        1. Parse optional multiplier
        2. Find all matches to molecule regex
        3. Convert the numbers and add them up by element
    4. Compare the results

    Each level of parsing can be handily done in a separate method. Using regex is probably the best way to parse the individual molecules, so I borrowed the expression from here: https://codereview.stackexchange.com/questions/2345/simplify-splitting-a-string-into-alpha-and-numeric-parts. The regex is pretty much trivial, so please bear with me:

    import java.util.Map;
    import java.util.HashMap;
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;
    
    public class SimpleChemicalEquationParser
    {
        // Counts of elements on each side
        private Map<String, Integer> left;
        private Map<String, Integer> right;
    
        public SimpleChemicalEquationParser(String eqn)
        {
            this.left = new HashMap<>();
            this.right = new HashMap<>();
            parse(eqn);
        }
    
        public boolean isBalanced()
        {
            return left.equals(right);
        }
    
        public boolean isSimpleBalanced()
        {
            return leftCount() == rightCount();
        }
    
        public int leftCount()
        {
            return left.values().stream().mapToInt(Integer::intValue).sum();
        }
    
        public int rightCount()
        {
            return right.values().stream().mapToInt(Integer::intValue).sum();
        }
    
        private void parse(String eqn)
        {
            String[] sides = eqn.split("->");
            if(sides.length != 2) {
                throw new RuntimeException("Check your equation. There should be exactly one -> symbol somewhere");
            }
            parseSide(sides[0], this.left);
            parseSide(sides[1], this.right);
        }
    
        private void parseSide(String side, Map<String, Integer> counter)
        {
            String[] molecules = side.split("\\+");
            for(String molecule : molecules) {
                parseMolecule(molecule, counter);
            }
        }
    
        private void parseMolecule(String molecule, Map<String, Integer> counter)
        {
            molecule = molecule.trim();
            Matcher matcher = Pattern.compile("([a-zA-Z]+)\\s*([0-9]*)").matcher(molecule);
            int multiplier = 1;
            int endIndex = 0;
            while(matcher.find()) {
                String separator = molecule.substring(endIndex, matcher.start()).trim();
                if(!separator.isEmpty()) {
                    // Check if there is a premultiplier before the first element
                    if(endIndex == 0) {
                        String multiplierString = molecule.substring(0, matcher.start()).trim();
                        try {
                            multiplier = Integer.parseInt(multiplierString);
                        } catch(NumberFormatException nfe) {
                            throw new RuntimeException("Invalid prefix \"" + multiplierString +
                                                       "\" to molecule \"" + molecule.substring(matcher.start()) + "\"");
                        }
                    } else {
                        throw new RuntimeException("Nonsensical characters \"" + separator +
                                                   "\" in molecule \"" + molecule + "\"");
                    }
                }
                parseElement(multiplier, matcher.group(1), matcher.group(2), counter);
                endIndex = matcher.end();
            }
            if(endIndex != molecule.length()) {
                throw new RuntimeException("Invalid end to side: \"" + molecule.substring(endIndex) + "\"");
            }
        }
    
        private void parseElement(int multiplier, String element, String atoms, Map<String, Integer> counter)
        {
            if(!atoms.isEmpty())
                multiplier *= Integer.parseInt(atoms);
            if(counter.containsKey(element))
                multiplier += counter.get(element);
            counter.put(element, multiplier);
        }
    
        public static void main(String[] args)
        {
            // Collect all command line arguments into one equation
            StringBuilder sb = new StringBuilder();
            for(String arg : args)
                sb.append(arg).append(' ');
    
            String eqn = sb.toString();
            SimpleChemicalEquationParser parser = new SimpleChemicalEquationParser(eqn);
            boolean simpleBalanced = parser.isSimpleBalanced();
            boolean balanced = parser.isBalanced();
    
            System.out.println("Left: " + parser.leftCount());
            for(Map.Entry<String, Integer> entry : parser.left.entrySet()) {
                System.out.println("    " + entry.getKey() + ": " + entry.getValue());
            }
            System.out.println();
    
            System.out.println("Right: " + parser.rightCount());
            for(Map.Entry<String, Integer> entry : parser.right.entrySet()) {
                System.out.println("    " + entry.getKey() + ": " + entry.getValue());
            }
            System.out.println();
    
            System.out.println("Atom counts match: " + simpleBalanced);
            System.out.println("Elements match: " + balanced);
        }
    }
    

    All the work is done by the parse method and its subordinates, which make a sort of virtual call tree. Since this approach makes it especially easy to check that the atoms of each element are actually balanced, I have gone ahead and done that here. This class prints the counts of the atoms on each side of the equation, whether or not the raw counts balance out, and whether or not they match by element type. Here are a couple of example runs:

    OP's original example:

    $ java -cp . SimpleChemicalEquationParser '12 C O2 + 6 H2O -> 2 C6H12O6 + 12 O2'
    Left: 54
        C: 12
        H: 12
        O: 30
    
    Right: 72
        C: 12
        H: 24
        O: 36
    
    Atom counts match: false
    Elements match: false
    

    Added ozone to make the number of atoms match up:

    $ java -cp . SimpleChemicalEquationParser '12 C O2 + 6 H2O + 6 O3 -> 2 C6H12O6 + 12 O2'
    Left: 72
        C: 12
        H: 12
        O: 48
    
    Right: 72
        C: 12
        H: 24
        O: 36
    
    Atom counts match: true
    Elements match: false 
    

    Added water to make everything match up:

    $ java -cp . SimpleChemicalEquationParser '12 C O2 + 12 H2O -> 2 C6H12O6 + 12 O2'
    Left: 72
        C: 12
        H: 24
        O: 36
    
    Right: 72
        C: 12
        H: 24
        O: 36
    
    Atom counts match: true
    Elements match: true
    

    Notice that I added a space between C and O in CO2. This is because my current regex for molecules, ([a-zA-Z]+)\\s*([0-9]*), allows any run of letters to represent an element, so CO2 without the space would be read as a single element "CO" with a count of 2. If your elements are always going to be simple one-letter elements, change this to ([a-zA-Z])\\s*([0-9]*) (remove the + quantifier). If they are going to be properly written symbols, one or two letters with the first uppercase and the second always lowercase, do this instead: ([A-Z][a-z]?)\\s*([0-9]*). I recommend the latter option. For both modified versions, the space in C O2 will no longer be necessary.
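
    To see the recommended pattern in isolation, here is a small sketch that applies ([A-Z][a-z]?)\\s*([0-9]*) to a single formula. The class ElementRegexDemo and the method countAtoms are hypothetical names used only for this illustration, and unlike parseMolecule above, this sketch does no validation: characters the pattern does not match are silently skipped by find().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of SimpleChemicalEquationParser.
final class ElementRegexDemo {
    // Element symbol: one uppercase letter plus an optional lowercase letter,
    // followed by optional whitespace and an optional atom count.
    private static final Pattern ELEMENT =
            Pattern.compile("([A-Z][a-z]?)\\s*([0-9]*)");

    // Count the atoms of each element in a single formula such as "C6H12O6".
    static Map<String, Integer> countAtoms(String formula) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ELEMENT.matcher(formula);
        while (m.find()) {
            // An empty count group means exactly one atom.
            int n = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
            counts.merge(m.group(1), n, Integer::sum);
        }
        return counts;
    }
}
```

    With this pattern, countAtoms("CO2") gives {C=1, O=2}, countAtoms("C6H12O6") gives {C=6, H=12, O=6}, and countAtoms("NaCl") gives {Na=1, Cl=1}, with no extra spaces needed.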

  • 2021-01-13 02:32

    Whenever I need to parse some text in Java, I mostly end up just using regular expressions, so I'd recommend you do the same.

    You can test regular expressions at regex101.com.

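    They are also easy to use in Java:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final String inputText = ...; // the text to search
    final Pattern pattern = Pattern.compile("Some regex code");
    final Matcher matcher = pattern.matcher(inputText);
    if (matcher.find()) {
        // group(0) is the whole match
        System.out.println(matcher.group(0));
    }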
    

    Inside a regex you can define capturing groups with ( and ) and then retrieve the results with matcher.group(int).

    For example, you may first separate the equation using (.*) -> (.*).

    Then loop over the left and right groups using find() with: (\d+) (\w+)(?: \+| -|$).

    After that you can use group(1) for the amount and group(2) for the molecule's formula.

    If needed, you can also iterate over the second group (the formula) with (\w)(\d?) to get the exact element distribution. The first group is then the element. For example, the text CO2 yields two hits: the first has group(1) -> C and an empty group(2); the second has group(1) -> O and group(2) -> 2. Note that this last pattern only handles one-letter symbols and single-digit counts.
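
    Put together, the three steps might look like the sketch below. The class RegexEquationCheck and its method names are hypothetical. The first two patterns are the ones given above; for the element step this sketch assumes a tightened pattern, ([A-Z][a-z]?)(\d*), in place of (\w)(\d?), so that two-letter symbols and multi-digit counts such as H12 are handled.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class illustrating the regex-only approach.
final class RegexEquationCheck {
    private static final Pattern EQUATION = Pattern.compile("(.*) -> (.*)");
    private static final Pattern TERM = Pattern.compile("(\\d+) (\\w+)(?: \\+| -|$)");
    // Assumption: tightened variant of (\w)(\d?), see the lead-in.
    private static final Pattern ELEMENT = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    // Sum the atoms per element on one side, e.g. "12 CO2 + 6 H2O".
    static Map<String, Integer> countSide(String side) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher term = TERM.matcher(side);
        while (term.find()) {
            int amount = Integer.parseInt(term.group(1));    // coefficient
            Matcher el = ELEMENT.matcher(term.group(2));     // formula
            while (el.find()) {
                // An empty count group means exactly one atom.
                int n = el.group(2).isEmpty() ? 1 : Integer.parseInt(el.group(2));
                counts.merge(el.group(1), amount * n, Integer::sum);
            }
        }
        return counts;
    }

    // True if both sides contain the same number of atoms of every element.
    static boolean isBalanced(String equation) {
        Matcher eq = EQUATION.matcher(equation.trim());
        if (!eq.matches()) {
            throw new IllegalArgumentException("expected \"left -> right\"");
        }
        return countSide(eq.group(1)).equals(countSide(eq.group(2)));
    }
}
```

    For the equation from the question, isBalanced("12 CO2 + 6 H2O -> 2 C6H12O6 + 12 O2") returns false, because the left side has 12 hydrogen atoms and the right side has 24. One limitation to be aware of: since the term pattern requires (\d+), a molecule written without an explicit coefficient is silently skipped, so write "1 C6H12O6" rather than "C6H12O6".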

    Test your regex here: regex101#Q6KMJo
