Elisp mechanism for converting PCRE regexps to emacs regexps

傲寒 2021-01-31 18:17

I admit a significant bias toward PCRE regexps over Emacs ones, if for no other reason than that when I type a '(' I pretty much always want a grouping operator. And, o

4 answers
  •  北恋 (original poster)
    2021-01-31 18:31

    https://github.com/joddie/pcre2el is the up-to-date version of this answer.

    pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

    • convert Emacs syntax to PCRE
    • convert either syntax to rx, an S-expression based regexp syntax
    • untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
    • show the complete list of strings (productions) matching a regexp, provided the list is finite
    • provide live font-locking of regexp syntax (so far only for Elisp buffers – other modes on the TODO list)

    The text of the original answer follows...


    Here's a quick and ugly Emacs Lisp solution (EDIT: now located more permanently here). It's based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:

    • parenthesis grouping ( .. )
    • alternation |
    • numerical repeats {M,N}
    • string quoting \Q .. \E
    • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
    • character classes: \d, \D, \h, \H, \s, \S, \v, \V
    • \w and \W left as they are (using Emacs' own idea of word and non-word characters)
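    To make the list above concrete, here is a small hand-written table of PCRE patterns and the Emacs regexps they correspond to. It is only an illustrative sketch: the names `my-pcre-examples` and `my-check-pcre-examples` are hypothetical, the translations were written by hand rather than produced by the converter, and the checker merely confirms that each Emacs regexp behaves as recorded.

```elisp
;; Each entry is (EMACS-REGEXP TEST-STRING EXPECTED-MATCH-P).
;; The PCRE pattern each Emacs regexp was translated from is in the comment.
(defconst my-pcre-examples
  '(("\\(foo\\|bar\\)" "bar"  t)     ; PCRE: (foo|bar)
    ("^a\\{2,3\\}$"    "aaa"  t)     ; PCRE: ^a{2,3}$
    ("^a\\{2,3\\}$"    "a"    nil)
    ("^a\\.b$"         "a.b"  t)     ; PCRE: ^\Qa.b\E$
    ("^a\\.b$"         "axb"  nil)
    ("^[0-9]+$"        "2021" t)     ; PCRE: ^\d+$
    ("^[^0-9]+$"       "2021" nil))  ; PCRE: ^\D+$
  "Hand-translated PCRE patterns with a sample string and the expected result.")

(defun my-check-pcre-examples ()
  "Return t if every entry in `my-pcre-examples' matches as recorded."
  (let ((case-fold-search nil)
        (ok t))
    (dolist (entry my-pcre-examples ok)
      (let ((matched (and (string-match-p (nth 0 entry) (nth 1 entry)) t)))
        (unless (eq matched (nth 2 entry))
          (setq ok nil))))))
```

    Evaluating `(my-check-pcre-examples)` is a quick way to sanity-check a translation by hand before trusting any automatic converter.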

    It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. In the case of character classes including something like \D, this is done by converting into a non-capturing group with alternation.
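    As an illustration of the character-class case, the PCRE class `[a-c\D]` (a, b, c, or any non-digit) cannot be written as a single Emacs bracket expression, so it becomes a shy group with two alternatives. The sketch below writes that result by hand; `my-class-with-non-digit` and `my-class-matches-p` are hypothetical names used only for this example.

```elisp
;; PCRE:  [a-c\D]
;; Emacs: \(?:[a-c]\|[^0-9]\)
(defconst my-class-with-non-digit "\\(?:[a-c]\\|[^0-9]\\)"
  "Hand-written Emacs equivalent of the PCRE class [a-c\\D].")

(defun my-class-matches-p (char-string)
  "Return t if CHAR-STRING, a one-character string, matches the class."
  (let ((case-fold-search nil))
    ;; Anchor at both ends so that exactly one character must match.
    (and (string-match-p (concat "\\`" my-class-with-non-digit "\\'")
                         char-string)
         t)))
```

    With this definition, `(my-class-matches-p "b")` and `(my-class-matches-p "x")` are non-nil, while `(my-class-matches-p "5")` is nil.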

    It passes the tests I wrote for it, but there are certainly bugs, and the method of scanning token-by-token is probably slow. In other words, no warranty. But perhaps it will do enough of the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)
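    The token-by-token scanning idea can be sketched in a few lines before reading the full code. This is a deliberately reduced, hypothetical version (`my-mini-pcre-to-elisp` is not part of the answer's code): it only rewrites `(`, `)` and `|`, and copies every other character unchanged.

```elisp
(defun my-mini-pcre-to-elisp (pcre)
  "Translate only (, ) and | in PCRE to Emacs syntax; copy everything else."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((case-fold-search nil)
          (accum '()))
      (while (not (eobp))
        (push (cond
               ;; Grouping and alternation need a backslash in Emacs.
               ((looking-at "[(|)]")
                (goto-char (match-end 0))
                (concat "\\" (match-string 0)))
               ;; Anything else, including a newline, is copied as-is.
               (t
                (forward-char)
                (string (char-before))))
              accum))
      (apply #'concat (nreverse accum)))))
```

    For example, `(my-mini-pcre-to-elisp "(foo|bar)+")` returns the Lisp string `"\\(foo\\|bar\\)+"`. Every branch moves point forward, so the loop always terminates; the full code follows the same pattern with many more cases.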

    (eval-when-compile (require 'cl))
    
    (defvar pcre-horizontal-whitespace-chars
      (mapconcat 'char-to-string
                 '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                          #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                          #x205F #x3000)
                 ""))
    
    (defvar pcre-vertical-whitespace-chars
      (mapconcat 'char-to-string
                 '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))
    
    (defvar pcre-whitespace-chars
      (mapconcat 'char-to-string '(9 10 12 13 32) ""))
    
    (defvar pcre-horizontal-whitespace
      (concat "[" pcre-horizontal-whitespace-chars "]"))
    
    (defvar pcre-non-horizontal-whitespace
      (concat "[^" pcre-horizontal-whitespace-chars "]"))
    
    (defvar pcre-vertical-whitespace
      (concat "[" pcre-vertical-whitespace-chars "]"))
    
    (defvar pcre-non-vertical-whitespace
      (concat "[^" pcre-vertical-whitespace-chars "]"))
    
    (defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))
    
    (defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))
    
    (eval-when-compile
      (defmacro pcre-token-case (&rest cases)
        "Consume a token at point and evaluate corresponding forms.
    
    CASES is a list of `cond'-like clauses, (REGEXP FORMS
    ...). Considering CASES in order, if the text at point matches
    REGEXP then moves point over the matched string and returns the
    value of FORMS. Returns `nil' if none of the CASES matches."
        (declare (debug (&rest (sexp &rest form))))
        `(cond
          ,@(mapcar
             (lambda (case)
               (let ((token (car case))
                     (action (cdr case)))
                 `((looking-at ,token)
                   (goto-char (match-end 0))
                   ,@action)))
             cases)
          (t nil))))
    
    (defun pcre-to-elisp (pcre)
      "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
      (with-temp-buffer
        (insert pcre)
        (goto-char (point-min))
        (let ((capture-count 0) (accum '())
              (case-fold-search nil))
          (while (not (eobp))
            (let ((translated
                   (or
                    ;; Handle tokens that are treated the same in
                    ;; character classes
                    (pcre-re-or-class-token-to-elisp)   
    
                    ;; Other tokens
                    (pcre-token-case
                     ("|" "\\|")
                     ("(" (setq capture-count (1+ capture-count)) "\\(")
                     (")" "\\)")
                     ("{" "\\{")
                     ("}" "\\}")
    
                     ;; Character class
                     ("\\[" (pcre-char-class-to-elisp))
    
                     ;; Backslash + digits => backreference or octal char?
                     ("\\\\\\([0-9]+\\)"
                      (let* ((digits (match-string 1))
                             (dec (string-to-number digits)))
                        ;; from "man pcrepattern": If the number is
                        ;; less than 10, or if there have been at
                        ;; least that many previous capturing left
                        ;; parentheses in the expression, the entire
                        ;; sequence is taken as a back reference.   
                        (cond ((< dec 10) (concat "\\" digits))
                              ((>= capture-count dec)
                               (error "backreference \\%s can't be used in Emacs regexps"
                                      digits))
                              (t
                               ;; from "man pcrepattern": if the
                               ;; decimal number is greater than 9 and
                               ;; there have not been that many
                               ;; capturing subpatterns, PCRE re-reads
                               ;; up to three octal digits following
                               ;; the backslash, and uses them to
                               ;; generate a data character. Any
                               ;; subsequent digits stand for
                               ;; themselves.
                               (goto-char (match-beginning 1))
                               (re-search-forward "[0-7]\\{0,3\\}")
                               (char-to-string (string-to-number (match-string 0) 8))))))
    
                     ;; Regexp quoting.
                     ("\\\\Q"
                      (let ((beginning (point)))
                        (search-forward "\\E")
                        (regexp-quote (buffer-substring beginning (match-beginning 0)))))
    
                     ;; Various character classes
                     ("\\\\d" "[0-9]")
                     ("\\\\D" "[^0-9]")
                     ("\\\\h" pcre-horizontal-whitespace)
                     ("\\\\H" pcre-non-horizontal-whitespace)
                     ("\\\\s" pcre-whitespace)
                     ("\\\\S" pcre-non-whitespace)
                     ("\\\\v" pcre-vertical-whitespace)
                     ("\\\\V" pcre-non-vertical-whitespace)
    
                     ;; Use Emacs' native notion of word characters
                     ("\\\\[Ww]" (match-string 0))
    
                     ;; Any other escaped character
                     ("\\\\\\(.\\)" (regexp-quote (match-string 1)))
    
                     ;; Any normal character, including a literal newline
                     ;; (which "." alone does not match)
                     ("\\(?:.\\|\n\\)" (match-string 0))))))
              (push translated accum)))
          (apply 'concat (reverse accum)))))
    
    (defun pcre-re-or-class-token-to-elisp ()
      "Consume the PCRE token at point and return its Elisp equivalent.
    
    Handles only tokens which have the same meaning in character
    classes as outside them."
      (pcre-token-case
       ("\\\\a" (char-to-string #x07))  ; bell
       ("\\\\c\\(.\\)"                  ; control character
        (char-to-string
         ;; PCRE upcases the character, then inverts bit 6 (hex 40)
         (logxor (string-to-char (upcase (match-string 1))) 64)))
       ("\\\\e" (char-to-string #x1b))  ; escape
       ("\\\\f" (char-to-string #x0c))  ; formfeed
       ("\\\\n" (char-to-string #x0a))  ; linefeed
       ("\\\\r" (char-to-string #x0d))  ; carriage return
       ("\\\\t" (char-to-string #x09))  ; tab
       ("\\\\x\\([0-9A-Fa-f]\\{2\\}\\)"  ; \xhh, two hex digits
        (char-to-string (string-to-number (match-string 1) 16)))
       ("\\\\x{\\([0-9A-Fa-f]*\\)}"      ; \x{hhh..}
        (char-to-string (string-to-number (match-string 1) 16)))))
    
    (defun pcre-char-class-to-elisp ()
      "Consume the remaining PCRE character class at point and return its Elisp equivalent.
    
    Point should be after the opening \"[\" when this is called, and
    will be just after the closing \"]\" when it returns."
      (let ((accum '("["))
            (pcre-char-class-alternatives '())
            (negated nil))
        (when (looking-at "\\^")
          (setq negated t)
          (push "^" accum)
          (forward-char))
        (when (looking-at "\\]") (push "]" accum) (forward-char))
    
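        (while (not (looking-at "\\]"))
          ;; Without this check an unterminated class would loop forever.
          (when (eobp) (error "Unterminated character class in PCRE"))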
          (let ((translated
                 (or
                  (pcre-re-or-class-token-to-elisp)
                  (pcre-token-case              
                   ;; Backslash + digits => always an octal char
                   ("\\\\\\([0-7]\\{1,3\\}\\)"    
                    (char-to-string (string-to-number (match-string 1) 8)))
    
                   ;; Various character classes. To implement negative char classes,
                   ;; we cons them onto the list `pcre-char-class-alternatives' and
                   ;; transform the char class into a shy group with alternation
                   ("\\\\d" "0-9")
                   ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                                  pcre-char-class-alternatives) "")
                   ("\\\\h" pcre-horizontal-whitespace-chars)
                   ("\\\\H" (push (if negated
                                      pcre-horizontal-whitespace
                                    pcre-non-horizontal-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\s" pcre-whitespace-chars)
                   ("\\\\S" (push (if negated
                                      pcre-whitespace
                                    pcre-non-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\v" pcre-vertical-whitespace-chars)
                   ("\\\\V" (push (if negated
                                      pcre-vertical-whitespace
                                    pcre-non-vertical-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\w" (push (if negated "\\W" "\\w") 
                                  pcre-char-class-alternatives) "")
                   ("\\\\W" (push (if negated "\\w" "\\W") 
                                  pcre-char-class-alternatives) "")
    
                   ;; Leave POSIX syntax unchanged
                   ("\\[:[a-z]*:\\]" (match-string 0))
    
                   ;; Ignore other escapes
                   ("\\\\\\(.\\)" (match-string 0))
    
                   ;; Copy everything else
                   ("." (match-string 0))))))
            (push translated accum)))
        (push "]" accum)
        (forward-char)
        (let ((class
               (apply 'concat (reverse accum))))
          (when (or (equal class "[]")
                    (equal class "[^]"))
            (setq class ""))
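          (if (not pcre-char-class-alternatives)
              class
            ;; Skip CLASS when it is empty; an empty alternative would
            ;; match the empty string.
            (concat "\\(?:"
                    (mapconcat 'identity
                               (if (equal class "")
                                   pcre-char-class-alternatives
                                 (cons class pcre-char-class-alternatives))
                               "\\|")
                    "\\)")))))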
    
