Extracting urls from @font-face by searching within @font-face for replacement

-上瘾入骨i 2020-12-19 19:59

I have a web service that rewrites urls in css files so that they can be served via a CDN.

The css files can contain urls to images or fonts.

I currently hav

2 Answers
  • 2020-12-19 20:35

    You can use this:

    $pattern = <<<'LOD'
    ~
    (?(DEFINE)
        (?<quoted_content>
            (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
        )
        (?<comment> /\* .*? \*/ )
        (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
        (?<other_content>
            (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
              | \Bu | u(?!rl\s*+\() | /(?!\*) 
              | \g<url_start> \g<url_skip> ["']?+
            )++
        )
        (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
        (?<url_start> url\( \s*+ ["']?+ )
    )
    
    \g<comment> (*SKIP)(*FAIL) |
    
    \g<anchor> \g<other_content>?+ \g<url_start> \K [./]*+ 
    
    ( [^"'\s)}]*+ )    # url
    ~xs
    LOD;
    
    $result = preg_replace($pattern, 'http://cdn.test.com/fonts/$8', $data);
    print_r($result);
    

    Test string:

    $data = <<<'LOD'
    @font-face {
      font-family: 'FontAwesome';
      src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
      src: url(fonts/fontawesome-webfont.eot?#iefix&v=4.0.3) format("embedded-opentype"),
         /*url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"),*/
           url("http://domain.com/fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"),
           url('fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular') format("svg");
      font-weight: normal;
      font-style: normal;
    }
    /*
    @font-face {
      font-family: 'Font1';
      src: url("fonts/font1.eot");
    } */
    @font-face {
      font-family: 'Fon\'t2';
      src: url("fonts/font2.eot");
    }
    @font-face {
      font-family: 'Font3';
      src: url("../fonts/font3.eot");
    }
    LOD;
    

    Main idea:

    For readability, the pattern is divided into named subpatterns. The (?(DEFINE)...) group doesn't match anything; it is only a definition section.

    The main trick of this pattern is the use of the \G anchor, which means: at the start of the string or contiguous to the previous match. I added a negative lookbehind (?<!^) to exclude the first of these two cases.

    The <anchor> named subpattern is the most important because it allows a match only if @font-face { is found or immediately after the end of a url (this is why you see a ["']?+ there: it consumes the optional closing quote).
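
    As a minimal sketch of the \G idea in isolation (the input string and the "ids:" marker are made up for illustration and are not part of the pattern above), only numbers that directly continue a run started by "ids:" are matched:

```php
<?php
// Hypothetical input: only the numbers that directly follow "ids:" should match.
$text = 'skip 7 8; ids: 1 2 3; skip 9';

// Branch 1: continue exactly where the previous match ended (\G),
//           but not at the very start of the string ((?<!^)).
// Branch 2: the entry point "ids:".
$pattern = '~(?:\G(?<!^)|ids:)\s*+\K\d+~';

preg_match_all($pattern, $text, $m);
print_r($m[0]); // 1, 2 and 3; the 7, 8 and 9 are never reachable through either branch
```

    The same mechanism is what keeps the url matches confined to the inside of an @font-face block.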

    <other_content> represents everything that is not a url section, but it also matches the url sections that must be skipped (urls that begin with "http:", "https:" or "data:"). The important detail of this subpattern is that it can't match the closing curly bracket of @font-face.

    The only job of <url_start> is to match url( followed by optional whitespace and an optional opening quote.

    \K removes everything that has been matched so far from the match result.

    ([^"'\s)}]*+) matches the url. Together with the leading ./ or ../ matched by [./]*+, it is the only thing that stays in the match result; since only the capture group is reused in the replacement, those leading prefixes are dropped.
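
    A small sketch of \K on its own, assuming a single hypothetical CSS line and the CDN host used in the replacement above: everything up to the opening quote is matched but kept out of the replaced text.

```php
<?php
// Hypothetical one-line stylesheet.
$css = 'src: url("../fonts/a.eot");';

// url( + optional quote are matched, then \K discards them from the match.
// [./]*+ eats the leading "../", group 1 holds the rest of the url.
$pattern = '~url\(\s*+["\']?+\K[./]*+([^"\'\s)}]*+)~';

$out = preg_replace($pattern, 'http://cdn.test.com/$1', $css);
echo $out, "\n"; // src: url("http://cdn.test.com/fonts/a.eot");
```

    Without \K, the replacement string would have to re-emit url(" itself.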

    Since <other_content> and the url subpattern can't match a } (one that is outside quoted or comment parts), you can be sure never to match anything outside of the @font-face definition. A second consequence is that the pattern always fails after the last url. Thus, at the next attempt the "contiguous branch" will fail until the next @font-face.

    Another trick:

    The main pattern begins with \g<comment> (*SKIP)(*FAIL) | to skip all content inside comments /*....*/. \g<comment> refers to the basic subpattern that describes what a comment looks like. (*SKIP) forbids retrying the substring that has been matched before it (to its left, by \g<comment>) if the pattern fails to its right. (*FAIL) forces the pattern to fail. With this trick, comments are skipped and are never a match result (since the pattern fails).
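
    The skip-and-fail idiom can be tried on its own. This is a reduced sketch with a made-up input, not the full pattern:

```php
<?php
// Hypothetical input: one url inside a comment, one outside.
$css = '/* url(a.png) */ body { background: url(b.png); }';

// Left branch: consume a whole comment, then fail and forbid retrying inside it.
// Right branch: the thing we actually want, here the value inside url(...).
$pattern = '~/\*.*?\*/(*SKIP)(*FAIL)|url\(\K[^)]+~s';

preg_match_all($pattern, $css, $m);
print_r($m[0]); // only b.png; a plain url\(\K[^)]+ would also return a.png
```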

    Subpattern details:

    quoted_content: used in <other_content> to avoid matching url( or /* when they appear inside quotes.

    (["'])              # capture group: the opening quote
    (?>                 # atomic group: all possible content between quotes
        [^"'\\]++       # all that is not a quote or a backslash
      |                 # OR
        \\{2}           # two backslashes (an escaped backslash: it doesn't escape what follows)
      |                 # OR
        \\.             # any escaped character
      |                 # OR
        (?!\g{-1})["']  # the other quote (the one that is not in the capture group)
    )*+                 # repeat the atomic group zero or more times
    \g{-1}              # backreference to the last capturing group
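
    To see this subpattern at work outside the full pattern, here is a short sketch that runs it as a standalone regex over a made-up CSS line (the line itself is an assumption for illustration):

```php
<?php
// The quoted_content subpattern, used here as a pattern of its own.
$pattern = <<<'REGEX'
~
(["'])                                             # opening quote
(?> [^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+   # content, including escapes
\g{-1}                                             # the same quote closes the string
~xs
REGEX;

// Hypothetical line with an escaped quote and a url( hidden inside a string.
$css = <<<'CSS'
font-family: 'Fon\'t2'; content: "it's url(x)";
CSS;

preg_match_all($pattern, $css, $m);
print_r($m[0]); // two strings: 'Fon\'t2' and "it's url(x)"
```

    Because the whole string is consumed in one piece, the url(x) inside the second string is never seen as the start of a url.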
    

    other_content: everything up to, but not including, the closing curly bracket or a url that does not begin with http:, https: or data:

    (?>                     # open an atomic group
        [^u}/"']++          # all characters that are not problematic
      |
        \g<quoted_content>  # string inside quotes
      |
        \g<comment>         # string inside comments
      |
        \Bu                 # "u" not preceded by a word boundary
      |
        u(?!rl\s*+\()       # "u" not followed by "rl("  (not the start of an url definition)
      |                   
        /(?!\*)             # "/" not followed by "*" (not the start of a comment)
      |
        \g<url_start>       # match a url that begins with "http:", "https:" or "data:"
        \g<url_skip> ["']?+ # up to the possible closing quote
    )++                     # repeat the atomic group one or more times
    

    anchor

    \G(?<!^) ["']?+    # contiguous to a previous match, with a possible closing quote
    |                  # OR
    @font-face \s*+ {  # start of the @font-face definition
    

    Note:

    You can improve the main pattern:

    After the last url of @font-face, the regex engine tries the "contiguous branch" of <anchor> and matches all characters up to the }, which makes the pattern fail. Then, at each of those same characters, the regex engine must try both branches of <anchor> (which will always fail until the }).

    To avoid these useless attempts, you can change the main pattern to:

    \g<comment> (*SKIP)(*FAIL) |
    
    \g<anchor> \g<other_content>?+
    (?>
        \g<url_start> \K [./]*+  ([^"'\s)}]*+)
      | 
        } (*SKIP)(*FAIL)
    )
    

    With this new version, the first character after the last url is matched by the "contiguous branch", \g<other_content> matches everything up to the }, \g<url_start> fails immediately, the } is matched, and (*SKIP)(*FAIL) makes the pattern fail and forbids retrying these characters.
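
    Put together, the definitions and the improved main pattern could look like the sketch below. The shortened stylesheet is made up for this example, the braces are escaped as a precaution, and the replacement uses 'http://cdn.test.com/$8' (without the extra fonts/ segment) so that the path is not doubled; all three are my assumptions, not part of the pattern itself.

```php
<?php
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted_content>
        (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
    )
    (?<comment> /\* .*? \*/ )
    (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
    (?<other_content>
        (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
          | \Bu | u(?!rl\s*+\() | /(?!\*)
          | \g<url_start> \g<url_skip> ["']?+
        )++
    )
    (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ \{ )
    (?<url_start> url\( \s*+ ["']?+ )
)

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+
(?>
    \g<url_start> \K [./]*+ ( [^"'\s)}]*+ )   # url, still capture group 8
  |
    \} (*SKIP)(*FAIL)                          # end of block: fail and skip ahead
)
~xs
LOD;

// Hypothetical, shortened stylesheet.
$data = <<<'CSS'
@font-face {
  font-family: 'Demo';
  src: url("fonts/a.eot"),
       url("http://domain.com/fonts/a.ttf"),
       url('../fonts/a.svg');
}
body { background: url(img/b.png); }
CSS;

$result = preg_replace($pattern, 'http://cdn.test.com/$8', $data);
echo $result, "\n";
```

    In this sketch the two relative font urls are rewritten, while the absolute url and the url outside @font-face are left alone.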

  • 2020-12-19 20:52

    Disclaimer: You're probably better off using a library, because this is tougher than you think. I also want to start this answer with how to match URLs that are not within @font-face {}. I also assume that the braces {} are balanced within @font-face {}.
    Note: I'm going to use "~" as the delimiter instead of "/", which will relieve me from escaping later on in my expressions. Also note that I will be posting online demos from regex101.com; on that site I'll be using the g modifier. You should remove the g modifier and just use preg_match_all().
    Let's use some regex fu!

    Part 1: matching URLs that are not within @font-face {}

    1.1 Matching @font-face {}

    Oh yes, this might sound "weird" but you will notice later on why :)
    We'll need some recursive regex here:

    @font-face\s*    # Match @font-face and some spaces
    (                # Start group 1
       \{            # Match {
       (?:           # A non-capturing group
          [^{}]+     # Match anything except {} one or more times
          |          # Or
          (?1)       # Recurse/rerun the expression of group 1
       )*            # Repeat 0 or more times
       \}            # Match }
    )                # End group 1
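
    A quick way to watch the recursion work is to feed it nested braces. The input below is artificial (real @font-face blocks rarely nest) and only serves to exercise (?1):

```php
<?php
// Same expression as above, written on one line.
$regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';

// Hypothetical input with a nested pair of braces inside @font-face.
$css = 'p { a: 1 } @font-face { x { y } z } q { b: 2 }';

preg_match($regex, $css, $m);
echo $m[0], "\n"; // @font-face { x { y } z }
echo $m[1], "\n"; // { x { y } z }
```

    The inner { y } is consumed by the recursive call, so the outer closing brace is paired with the correct opening one.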
    

    demo

    1.2 Escaping @font-face {}

    We'll put (*SKIP)(*FAIL) right after the previous regex, so that every @font-face block is skipped. See this answer to get an idea of how it works.

    demo

    1.3 Matching url()

    We'll use something like this:

    url\s*\(         # Match url, optionally some whitespaces and then (
    \s*              # Match optionally some whitespaces
    ("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
    (?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
    (?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
    \2               # Match what was matched in group 2
    \s*              # Match optionally some whitespaces
    \)               # Match )
    

    Note that I'm using \2 because I've appended this to the previous regex which has group 1.
    Here's another use of ("|')(?:[^\\]|\\.)*?\1.

    demo

    1.4 Matching the value inside url()

    You might have guessed that we need some lookaround-fu. The problem is the lookbehind, since it needs to be fixed-length. As a workaround, I'll introduce you to the \K escape sequence: it resets the beginning of the reported match to the current position, so everything matched before it is left out of the result.
    Let's drop \K into our expression and use a lookahead for the closing part; our final regex will be:

    @font-face\s*    # Match @font-face and some spaces
    (                # Start group 1
       \{            # Match {
       (?:           # A non-capturing group
          [^{}]+     # Match anything except {} one or more times
          |          # Or
          (?1)       # Recurse/rerun the expression of group 1
       )*            # Repeat 0 or more times
       \}            # Match }
    )                # End group 1
    (*SKIP)(*FAIL)   # Skip it
    |                # Or
    url\s*\(         # Match url, optionally some whitespaces and then (
    \s*              # Match optionally some whitespaces
    ("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
    \K               # Reset the match
    (?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
    (?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
    (?=              # Lookahead
       \2            # Match what was matched in group 2
       \s*           # Match optionally some whitespaces
       \)            # Match )
    )
    

    demo

    1.5 Using the pattern in PHP

    We'll need to escape some things like quotes and backslashes (inside the single-quoted PHP string, \\\\ becomes \\, which the regex engine reads as one literal \), use the right function and the right modifiers:

    $regex = '~
    @font-face\s*    # Match @font-face and some spaces
    (                # Start group 1
       \{            # Match {
       (?:           # A non-capturing group
          [^{}]+     # Match anything except {} one or more times
          |          # Or
          (?1)       # Recurse/rerun the expression of group 1
       )*            # Repeat 0 or more times
       \}            # Match }
    )                # End group 1
    (*SKIP)(*FAIL)   # Skip it
    |                # Or
    url\s*\(         # Match url, optionally some whitespaces and then (
    \s*              # Match optionally some whitespaces
    ("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
    \K               # Reset the match
    (?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
    (?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
    (?=              # Lookahead
       \2            # Match what was matched in group 2
       \s*           # Match optionally some whitespaces
       \)            # Match )
    )
    ~xs';
    
    $input = file_get_contents($css_file);
    preg_match_all($regex, $input, $m);
    echo '<pre>'. print_r($m[0], true) . '</pre>';
    

    demo

    Part 2: matching URLs that are within @font-face {}

    2.1 Different approach

    I want to do this part with 2 regexes, because it would be a pain to match URLs that are within @font-face {} while keeping track of the state of the braces {} in a recursive regex.

    And since we already have the pieces we need, we'll only need to apply them in some code:

    1. Match all @font-face {} instances
    2. Loop through these and match all url()'s

    2.2 Putting it into code

    $results = array(); // Just an empty array;
    $fontface_regex = '~
    @font-face\s*    # Match @font-face and some spaces
    (                # Start group 1
       \{            # Match {
       (?:           # A non-capturing group
          [^{}]+     # Match anything except {} one or more times
          |          # Or
          (?1)       # Recurse/rerun the expression of group 1
       )*            # Repeat 0 or more times
       \}            # Match }
    )                # End group 1
    ~xs';
    
    $url_regex = '~
    url\s*\(         # Match url, optionally some whitespaces and then (
    \s*              # Match optionally some whitespaces
    ("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
    \K               # Reset the match
    (?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
    (?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
    (?=              # Lookahead
       \1            # Match what was matched in group 1
       \s*           # Match optionally some whitespaces
       \)            # Match )
    )
    ~xs';
    
    $input = file_get_contents($css_file);
    
    preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
    if(!empty($fontfaces[0])){ // If there is at least one match then
        foreach($fontfaces[0] as $fontface){ // Foreach instance
            preg_match_all($url_regex, $fontface, $r); // Let's match the URLs
            if(!empty($r[0])){ // If there is a hit
                $results[] = $r[0]; // Then add this block's list of URLs to the results array
            }
        }
    }
    echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results
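
    Since the question is about rewriting rather than just listing the URLs, the same two regexes can be nested in preg_replace_callback(). This is only a sketch: the function name rewriteFontFaceUrls, the $cdnBase parameter and the sample stylesheet are hypothetical, and stripping the leading "./" or "../" with ltrim() is an assumption about how the CDN paths are laid out.

```php
<?php
// Hypothetical helper: prefix relative url() values inside @font-face blocks.
function rewriteFontFaceUrls(string $css, string $cdnBase): string
{
    $fontfaceRegex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~s';
    $urlRegex = '~url\s*\(\s*("|\'|)\K(?!["\']?(?:https?://|ftp://))(?:[^\\\\]|\\\\.)*?(?=\1\s*\))~s';

    // Outer callback: receives each whole @font-face block.
    $rewriteBlock = function (array $block) use ($urlRegex, $cdnBase): string {
        // Inner callback: receives only the url value, thanks to \K and the lookahead.
        $rewriteUrl = function (array $url) use ($cdnBase): string {
            // Assumption: leading dots and slashes are not wanted on the CDN.
            return $cdnBase . ltrim($url[0], './');
        };
        return preg_replace_callback($urlRegex, $rewriteUrl, $block[0]) ?? $block[0];
    };

    return preg_replace_callback($fontfaceRegex, $rewriteBlock, $css) ?? $css;
}

// Hypothetical stylesheet.
$css = '@font-face { src: url("../fonts/font3.eot"), url(http://domain.com/x.ttf); }' . "\n"
     . 'p { background: url(img/a.png); }';

$out = rewriteFontFaceUrls($css, 'http://cdn.test.com/');
echo $out, "\n";
```

    Only the relative url inside @font-face is rewritten here; the absolute url and the url in the p rule are returned unchanged.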
    

    demo

