What does the underscore mean in the following regex?
[a-zA-Z0-9_]
The _ seems to make no difference, so I don't understand its purpose.
With the exception of the special character sequences ([., [:, and [=), range expressions (e.g., [a-z]), and the circumflex at the beginning ([^), every character inside a bracket expression means the character itself, just like that underscore.
As a side note, that expression is commonly written as \w (word character, ignoring Unicode and locale), and it is commonly used to define the set of characters allowed in variable names.
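To check the \w equivalence concretely, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about Perl-style regexes). It assumes Python's re module, where \w is Unicode-aware by default and the re.ASCII flag restricts it to the ASCII sense. The variable names are made up for this example.

```python
import re

# The explicit class from the question, and the \w shorthand limited to ASCII.
explicit = re.compile(r"[a-zA-Z0-9_]")
shorthand = re.compile(r"\w", re.ASCII)

# Collect which of the 128 ASCII characters each pattern accepts.
ascii_chars = [chr(code) for code in range(128)]
explicit_matches = {c for c in ascii_chars if explicit.fullmatch(c)}
shorthand_matches = {c for c in ascii_chars if shorthand.fullmatch(c)}

# True when both patterns accept exactly the same ASCII characters.
same_on_ascii = explicit_matches == shorthand_matches

# Without re.ASCII, \w also accepts non-ASCII letters such as U+00E9.
unicode_w_matches_accent = re.fullmatch(r"\w", "\u00e9") is not None
ascii_w_matches_accent = shorthand.fullmatch("\u00e9") is not None
```

On ASCII input the two patterns should agree; they differ only once non-ASCII letters or digits are involved.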
It means to match the underscore character in addition to lowercase letters, uppercase letters, and numbers.
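The underscore only appears to make no difference when the input contains no underscores. A small sketch in Python (used here purely for illustration, with a made-up sample string) compares the class with and without it:

```python
import re

# Same class, once with the underscore and once without it.
with_underscore = re.compile(r"[a-zA-Z0-9_]+")
without_underscore = re.compile(r"[a-zA-Z0-9]+")

# Hypothetical sample input containing underscores.
text = "user_id = max_count"

# With the underscore, each name is matched as one piece.
found_with = with_underscore.findall(text)

# Without it, every name is split at the underscore.
found_without = without_underscore.findall(text)
```

Here found_with keeps user_id and max_count whole, while found_without breaks them into user, id, max, and count.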
It means the underscore is also matched.
Regular expressions are documented in perlre. That's the place to check whenever you have a question about regular expressions. The Regular-Expressions.info site is very helpful too.
To get you started, the thing you are looking at is called a "character class". Any of the characters inside a character class can match.
You can make a range of characters with the -, so a-z is any of the lowercase letters in that range. A-Z are the uppercase letters and 0-9 are the digits. The _ is a literal underscore. Taken together those are the legal characters for a Perl identifier (variable names and so on). That's the \w character class in the ASCII sense (and not the expanded Unicode sense).
People often use that to match a Perl variable name, but there's a rule that people forget: the first character of a user-defined name has to be a letter or underscore (not a digit). That means you should use a different character class for the first character:
[A-Za-z_][A-Za-z0-9_]*
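As a rough sketch of how that pattern behaves, the following Python snippet (Python is used only for illustration) anchors it to the whole string with fullmatch. The helper name is_identifier is hypothetical, and the check covers only the bare name, without any Perl sigil such as $ or @.

```python
import re

# First character: letter or underscore. Remaining characters: letters, digits, underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(name: str) -> bool:
    """Return True if the whole string fits the identifier pattern (hypothetical helper)."""
    return IDENTIFIER.fullmatch(name) is not None
```

With this sketch, names such as count or _tmp1 pass, while 1st (leading digit), my-var (hyphen), and the empty string are rejected.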
The underscore means an underscore.