I have a web service that rewrites URLs in CSS files so that they can be served via a CDN.
The CSS files can contain URLs to images or fonts.
I currently hav
You can use this:
$pattern = <<<'LOD'
~
(?(DEFINE)
(?<quoted_content>
(["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
)
(?<comment> /\* .*? \*/ )
(?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
(?<other_content>
(?> [^u}/"']++ | \g<quoted_content> | \g<comment>
| \Bu | u(?!rl\s*+\() | /(?!\*)
| \g<url_start> \g<url_skip> ["']?+
)++
)
(?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
(?<url_start> url\( \s*+ ["']?+ )
)
\g<comment> (*SKIP)(*FAIL) |
\g<anchor> \g<other_content>?+ \g<url_start> \K [./]*+
( [^"'\s)}]*+ ) # url
~xs
LOD;
$result = preg_replace($pattern, 'http://cdn.test.com/fonts/$8', $data);
print_r($result);
Test string:
$data = <<<'LOD'
@font-face {
font-family: 'FontAwesome';
src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
src: url(fonts/fontawesome-webfont.eot?#iefix&v=4.0.3) format("embedded-opentype"),
/*url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"),*/
url("http://domain.com/fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"),
url('fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular') format("svg");
font-weight: normal;
font-style: normal;
}
/*
@font-face {
font-family: 'Font1';
src: url("fonts/font1.eot");
} */
@font-face {
font-family: 'Fon\'t2';
src: url("fonts/font2.eot");
}
@font-face {
font-family: 'Font3';
src: url("../fonts/font3.eot");
}
LOD;
For readability, the pattern is divided into named subpatterns. The (?(DEFINE)...) group doesn't match anything; it is only a definition section.
The main trick of this pattern is the \G anchor, which means: start of the string or contiguous to the previous match. I added a negative lookbehind (?<!^) to exclude the "start of the string" case.
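To see the \G trick in isolation, here is a minimal sketch. The input string and the id: marker are made up for the illustration and are not part of the pattern above:

```php
<?php
// Hypothetical input: only the numbers that directly follow "id:" should match.
$subject = '7 id: 1 2 3 x 4';

// \G(?<!^)  continue right after the previous match, but never at the start of the string
// id:       entry point of a chain of contiguous matches
$pattern = '~(?:\G(?<!^)|id:)\s*+\K\d++~';

preg_match_all($pattern, $subject, $m);
print_r($m[0]); // the chain breaks at "x", so neither "7" nor "4" is matched
```

Once the chain is broken, the contiguous branch cannot succeed again until the entry point (here id:, in the real pattern @font-face {) is found once more.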
The <anchor> named subpattern is the most important one, because it allows a match only if @font-face { is found or immediately after the end of a url (this is why it contains ["']?+, which consumes the possible closing quote).
<other_content> represents everything that is not a url section, but it also matches the url sections that must be skipped (urls that begin with "http:", "https:" or "data:"). The important detail is that this subpattern can't match the closing curly bracket of @font-face.
The only job of <url_start> is to match url(" (the quote is optional).
\K removes everything that has been matched before it from the match result.
([^"'\s)}]*+) matches the url. Together with the leading ./ or ../ (matched by [./]*+), it is the only thing that stays in the match result.
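As a small sketch of how \K and the capture group work together, here is a cut-down pattern applied to a single declaration. The stylesheet line and the cdn.example.com host are assumptions made for the example:

```php
<?php
// Hypothetical one-line stylesheet and CDN host, for illustration only.
$css = 'src: url("../fonts/a.eot");';

// Everything before \K must be present but is dropped from the match.
// [./]*+ is part of the match, so the leading "../" gets replaced too,
// while only group 1 (the url without it) is reused in the replacement.
$pattern = '~url\(\s*+["\']?+\K[./]*+([^"\'\s)}]*+)~';

$out = preg_replace($pattern, 'https://cdn.example.com/$1', $css);
echo $out; // src: url("https://cdn.example.com/fonts/a.eot");
```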
Since <other_content> and the url subpattern can't match a } (unless it is inside a quoted or commented part), you are sure never to match anything outside of the @font-face definition. A second consequence is that the pattern always fails after the last url. Thus, on the following attempts, the "contiguous branch" will fail until the next @font-face.
The main pattern begins with \g<comment> (*SKIP)(*FAIL) | to skip all content inside comments /*....*/. \g<comment> refers to the basic subpattern that describes what a comment looks like. (*SKIP) forbids retrying the substring that has been matched before it (on its left, by \g<comment>) if the pattern fails on its right. (*FAIL) forces the pattern to fail.
With this trick, comments are skipped and never become a match result (since the pattern fails).
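The same idiom can be tried on its own. This is only a sketch with a made-up input and a deliberately naive url branch, not the full pattern:

```php
<?php
// Hypothetical input with one commented-out url and one live url.
$css = '/* url(old.png) */ a { background: url(new.png); }';

// First branch: consume a whole comment, then force a failure without
// allowing a retry inside it. Second branch: what we actually want.
$pattern = '~/\*.*?\*/(*SKIP)(*FAIL)|url\([^)]*+\)~s';

preg_match_all($pattern, $css, $m);
print_r($m[0]); // only url(new.png) is reported
```

Without the first branch, the engine would happily find url(old.png) inside the comment.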
quoted_content:
It is used in <other_content> to avoid matching a url( or /* that is inside quotes.
(["']) # capture group: the opening quote
(?> # atomic group: all possible content between quotes
[^"'\\]++ # all that is not a quote or a backslash
| # OR
\\{2} # two backslashes (an escaped backslash, which doesn't escape what follows)
| # OR
\\. # any escaped character
| # OR
(?!\g{-1})["'] # the other quote (the one that is not in the capture group)
)*+ # repeat the atomic group zero or more times
\g{-1} # backreference to the last capturing group
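A quick way to check this subpattern is to run it alone against a couple of quoted strings. The declarations below are invented for the test:

```php
<?php
// The quoted_content subpattern on its own (nowdoc, so no PHP-level escaping).
$pattern = <<<'RX'
~(["']) (?> [^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}~x
RX;

// Hypothetical declarations: an escaped quote, and the "other" quote inside a string.
$css = <<<'CSS'
font-family: 'Fon\'t2'; content: "a 'b' c";
CSS;

preg_match_all($pattern, $css, $m);
print_r($m[0]); // two strings: 'Fon\'t2' and "a 'b' c"
```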
other_content: everything except the closing curly bracket and the urls that don't start with http:, https: or data:
(?> # open an atomic group
[^u}/"']++ # all characters that are not problematic
|
\g<quoted_content> # string inside quotes
|
\g<comment> # string inside comments
|
\Bu # "u" not preceded by a word boundary
|
u(?!rl\s*+\() # "u" not followed by "rl(" (not the start of an url definition)
|
/(?!\*) # "/" not followed by "*" (not the start of a comment)
|
\g<url_start> # match the url that begins with "http:"
\g<url_skip> ["']?+ # until the possible quote
)++ # repeat the atomic group one or more times
anchor:
\G(?<!^) ["']?+ # contiguous to the previous match, with a possible closing quote
| # OR
@font-face \s*+ { # start of the @font-face definition
You can improve the main pattern:
After the last url of @font-face, the regex engine attempts to match with the "contiguous branch" of <anchor> and matches all characters until the } that makes the pattern fail. Then, at each of these same characters, the regex engine must try the two branches of <anchor> (which will always fail until the }).
To avoid these useless tries, you can change the main pattern to:
\g<comment> (*SKIP)(*FAIL) |
\g<anchor> \g<other_content>?+
(?>
\g<url_start> \K [./]*+ ([^"'\s)}]*+)
|
} (*SKIP)(*FAIL)
)
With this new scenario, the first character after the last url is matched by the "contiguous branch", \g<other_content> matches everything up to the }, \g<url_start> fails immediately, the } is matched, and (*SKIP)(*FAIL) makes the pattern fail and forbids retrying these characters.
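Put together, the definitions and the improved main pattern could look like the sketch below. The sample stylesheet and the cdn.example.com host are assumptions made for the example; the url is still capture group 8, because the atomic group does not capture:

```php
<?php
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted_content>
        (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
    )
    (?<comment> /\* .*? \*/ )
    (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
    (?<other_content>
        (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
          | \Bu | u(?!rl\s*+\() | /(?!\*)
          | \g<url_start> \g<url_skip> ["']?+
        )++
    )
    (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
    (?<url_start> url\( \s*+ ["']?+ )
)

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+
(?>
    \g<url_start> \K [./]*+ ( [^"'\s)}]*+ )   # group 8: the url
  |
    } (*SKIP)(*FAIL)
)
LOD;
$pattern .= "\n~xs";

// Hypothetical stylesheet: relative, commented-out and absolute urls,
// plus a rule outside @font-face that must stay untouched.
$css = <<<'CSS'
@font-face {
  font-family: 'Demo';
  src: url("../fonts/a.eot"),
       /* url("fonts/old.woff"), */
       url(http://example.org/b.ttf),
       url('fonts/c.svg');
}
.logo { background: url(img/logo.png); }
CSS;

// Hypothetical CDN host.
$out = preg_replace($pattern, 'https://cdn.example.com/$8', $css);
echo $out;
```

Only the two relative urls inside @font-face should end up rewritten; the commented-out url, the absolute url and the .logo rule are left alone.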
Disclaimer: You may be better off using a library, because this is tougher than you think. I also want to start this answer with how to match URLs that are not within @font-face {}. I also assume that the brackets {} are balanced within @font-face {}.
Note: I'm going to use "~" as the delimiter instead of "/"; this will relieve me from escaping later on in my expressions. Also note that I will be posting online demos from regex101.com, where I'll be using the g modifier. In PHP, you should remove the g modifier and just use preg_match_all().
Let's use some regex Fu !!!
Oh yes, this might sound "weird" but you will notice later on why :)
We'll need some recursive regex here:
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
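A compact way to try the recursion is shown below. The input is made up, and the nested block is only there to exercise (?1):

```php
<?php
// Hypothetical input: the nested braces are only there to exercise the recursion.
$css = 'a { b: c } @font-face { x: y; @nested { z } } d { }';

// (?1) re-runs group 1, so every inner { ... } must be balanced as well.
$pattern = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';

preg_match($pattern, $css, $m);
echo $m[0]; // @font-face { x: y; @nested { z } }
```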
We'll use (*SKIP)(*FAIL) just after the previous regex, so that whatever it matched is skipped. See this answer to get an idea how it works.
We'll use something like this:
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
(?!["']?(?:https?://|ftp://)) # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
\2 # Match what was matched in group 2
\s* # Match optionally some whitespaces
\) # Match )
Note that I'm using \2 because I've appended this to the previous regex, which has group 1.
Here's another use of ("|')(?:[^\\]|\\.)*?\1.
You might have guessed we need to use some lookaround-fu; the problem is with a lookbehind, since it needs to be fixed length. I've got a workaround for that: I'll introduce you to the \K escape sequence. It resets the beginning of the match to the current position in the token list.
Well, let's drop \K somewhere in our expression and use a lookahead; our final regex will be:
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
(*SKIP)(*FAIL) # Skip it
| # Or
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K # Reset the match
(?!["']?(?:https?://|ftp://)) # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?= # Lookahead
\2 # Match what was matched in group 2
\s* # Match optionally some whitespaces
\) # Match )
)
We'll need to escape some things like quotes and backslashes (\\\\ = \), use the right function and the right modifiers:
$regex = '~
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
(*SKIP)(*FAIL) # Skip it
| # Or
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|\'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K # Reset the match
(?!["\']?(?:https?://|ftp://)) # Put your negative-rules here (do not match URLs with http, https or ftp)
(?:[^\\\\]|\\\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?= # Lookahead
\2 # Match what was matched in group 2
\s* # Match optionally some whitespaces
\) # Match )
)
~xs';
$input = file_get_contents($css_file);
preg_match_all($regex, $input, $m);
echo '<pre>'. print_r($m[0], true) . '</pre>';
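Since the question is about rewriting the URLs rather than listing them, the same expression can also be handed to preg_replace(). The following is a sketch under assumptions: the inline stylesheet and the cdn.example.com host are hypothetical, and a nowdoc is used so that no PHP-level escaping is needed:

```php
<?php
$regex = <<<'RX'
~
@font-face\s* ( \{ (?: [^{}]+ | (?1) )* \} ) (*SKIP)(*FAIL)   # skip whole @font-face blocks
|
url\s*\( \s* ("|'|) \K                   # url( and the optional quote, dropped from the match
(?!["']?(?:https?://|ftp://))            # leave absolute URLs alone
(?:[^\\]|\\.)*?                          # the URL itself
(?= \2 \s* \) )                          # up to the closing quote and )
RX;
$regex .= "\n~xs";

// Hypothetical stylesheet: one url inside @font-face, one relative, one absolute.
$css = <<<'CSS'
@font-face { src: url('fonts/a.eot'); }
.x { background: url(img/b.png); }
.y { background: url("http://example.org/c.png"); }
CSS;

// $0 is the match after \K, i.e. the bare URL; the host is a placeholder.
$out = preg_replace($regex, 'https://cdn.example.com/$0', $css);
echo $out;
```

With this input, only the url in the .x rule should be prefixed; the @font-face block is skipped as a whole and the absolute URL is rejected by the negative lookahead.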
I want to do this part in 2 regexes because it would be a pain to match URLs that are within @font-face {} while taking care of the state of braces {} in a recursive regex.
And since we already have the pieces we need, we'll only need to apply them in some code:
@font-face {} instances:
$results = array(); // Just an empty array
$fontface_regex = '~
@font-face\s* # Match @font-face and some spaces
( # Start group 1
\{ # Match {
(?: # A non-capturing group
[^{}]+ # Match anything except {} one or more times
| # Or
(?1) # Recurse/rerun the expression of group 1
)* # Repeat 0 or more times
\} # Match }
) # End group 1
~xs';
$url_regex = '~
url\s*\( # Match url, optionally some whitespaces and then (
\s* # Match optionally some whitespaces
("|\'|) # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K # Reset the match
(?!["\']?(?:https?://|ftp://)) # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*? # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?= # Lookahead
\1 # Match what was matched in group 1
\s* # Match optionally some whitespaces
\) # Match )
)
~xs';
$input = file_get_contents($css_file);
preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
if(isset($fontfaces[0])){ // If there is a match then
foreach($fontfaces[0] as $fontface){ // Foreach instance
preg_match_all($url_regex, $fontface, $r); // Let's match the url's
if(!empty($r[0])){ // If there is a hit
$results[] = $r[0]; // Then add it to the results array
}
}
}
echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results