I'm trying to extract a JIRA identifier from a line of text.
JIRA identifiers are of the form [A-Z]+-[0-9]+. I have the following pattern:
foreach m
If you include sample data with your question, you get the best shot at answers from those who might not have Jira, etc.
Here's another take on it:
my $matcher = qr/ (?: (?<=\A) | (?<=\s) )
([A-Z]{1,4}-[1-9][0-9]{0,6})
(?=\z|\s|[[:punct:]]) /x;
while ( <DATA> )
{
    chomp;
    my @matches = /$matcher/g;
    printf "line: %s\n\tmatches: %s\n",
        $_,
        @matches ? join(", ", @matches) : "none";
}
__DATA__
JIRA-001 is not valid but JIRA-1 is and so is BIN-10000,
A-1, and TACO-7133 but why look for BIN-10000000 or BINGO-1?
Remember that [0-9] will match 0001 and friends, which you probably don't want. I think, but can't verify, that Jira truncates issue prefixes to 4 characters max, so my regex only allows 1-4 capital letters; that is easy to change if I'm wrong. 10 million tickets seems like a reasonably high top end for issue numbers. I also allowed for trailing punctuation. You may have to season that kind of thing to taste when dealing with wild data. You need the g flag, and you need to capture into an array instead of a scalar, if you're matching strings that could have more than one issue id.
line: JIRA-001 is not valid but JIRA-1 is and so is BIN-10000,
matches: JIRA-1, BIN-10000
line: A-1, and TACO-7133 but why look for BIN-10000000 or BINGO-1?
matches: A-1, TACO-7133
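For comparison, here is a rough JavaScript sketch of the same bounded pattern, since JavaScript comes up later in this thread. The names ISSUE_KEY and findIssueKeys are made up for illustration. The explicit ASCII ranges stand in for Perl's [[:punct:]] (JavaScript has no POSIX classes), and the leading boundary is a consumed (?:^|\s) instead of a lookbehind so it also runs on older engines. The 1-4 letter and 7-digit limits are carried over from the guess above, not from anything Jira documents.

```javascript
// Hypothetical helper: collect every issue key found in one line of text.
// Limits (1-4 capital letters, up to 7 digits, no leading zero) mirror the Perl pattern above.
const ISSUE_KEY = /(?:^|\s)([A-Z]{1,4}-[1-9][0-9]{0,6})(?=$|\s|[!-\/:-@\[-`{-~])/g;

function findIssueKeys(line) {
  // matchAll requires the g flag; m[1] is the key without the leading whitespace.
  return Array.from(line.matchAll(ISSUE_KEY), (m) => m[1]);
}

console.log(findIssueKeys("JIRA-001 is not valid but JIRA-1 is and so is BIN-10000,"));
console.log(findIssueKeys("A-1, and TACO-7133 but why look for BIN-10000000 or BINGO-1?"));
```

With the sample data this should report the same keys as the Perl run: JIRA-1 and BIN-10000 for the first line, A-1 and TACO-7133 for the second.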
You can use alternation to make sure that the character before your pattern is either whitespace or the beginning of the string. Similarly, make sure that the pattern is followed by either whitespace or the end of the string.
You can use this regex:
my ( $id ) = ( $line =~ /(?:\s|^)([A-Z]+-[0-9]+)(?=\s|$)/ );
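One detail worth spelling out: the trailing (?=\s|$) only looks ahead, it does not consume the whitespace. That matters as soon as you match repeatedly (with /g, say), because a consumed separator is no longer available as the leading boundary of the next id. A small JavaScript sketch of the difference follows; CONSUMING, LOOKAHEAD and allKeys are names I made up for the illustration.

```javascript
// Hypothetical comparison: consuming the trailing whitespace vs. only looking ahead at it.
const CONSUMING = /(?:\s|^)([A-Z]+-[0-9]+)(?:\s|$)/g;
const LOOKAHEAD = /(?:\s|^)([A-Z]+-[0-9]+)(?=\s|$)/g;

function allKeys(pattern, line) {
  // m[1] is the captured id without the surrounding whitespace.
  return Array.from(line.matchAll(pattern), (m) => m[1]);
}

const line = "AB-1 CD-2 EF-3";
console.log(allKeys(CONSUMING, line)); // [ 'AB-1', 'EF-3' ]  (CD-2 lost its leading space)
console.log(allKeys(LOOKAHEAD, line)); // [ 'AB-1', 'CD-2', 'EF-3' ]
```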
Atlassian themselves have a couple of webpages floating around that suggest a good (Java) regex is this:
((?<!([A-Z]{1,10})-?)[A-Z]+-\d+)
(Source: https://confluence.atlassian.com/display/STASHKB/Integrating+with+custom+JIRA+issue+key)
Test String:
"BF-18 abc-123 X-88 ABCDEFGHIJKL-999 abc XY-Z-333 abcDEF-33 ABC-1"
Matches:
BF-18, X-88, ABCDEFGHIJKL-999, DEF-33, ABC-1
But, I don't really like it because it will match the "DEF-33" from "abcDEF-33", whereas I prefer to ignore "abcDEF-33" altogether. So in my own code I'm using:
((?<!([A-Za-z]{1,10})-?)[A-Z]+-\d+)
Notice how "DEF-33" is no longer matched:
Test String:
"BF-18 abc-123 X-88 ABCDEFGHIJKL-999 abc XY-Z-333 abcDEF-33 ABC-1"
Matches:
BF-18, X-88, ABCDEFGHIJKL-999, ABC-1
I also needed this regex in JavaScript. Unfortunately, JavaScript did not support lookbehind, (?<!a)b, when I wrote this (it was only added in ES2018), so I had to port it to lookahead, a(?!b), and reverse everything:
var jira_matcher = /\d+-[A-Z]+(?!-?[a-zA-Z]{1,10})/g
This means the string to be matched needs to be reversed ahead of time, too:
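// Small helper: JavaScript strings have no built-in reverse()
function reverse(str) {
    return str.split("").reverse().join("")
}
var s = "BF-18 abc-123 X-88 ABCDEFGHIJKL-999 abc XY-Z-333 abcDEF-33 ABC-1"
s = reverse(s)
var m = s.match(jira_matcher) || []; // match() returns null when nothing matches
// Also need to reverse all the results!
for (var i = 0; i < m.length; i++) {
    m[i] = reverse(m[i])
}
m.reverse()
console.log(m)
// Output:
[ 'BF-18', 'X-88', 'ABCDEFGHIJKL-999', 'ABC-1' ]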
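If your runtime implements ES2018 lookbehind assertions (that is an assumption about your environment; older engines reject the pattern with a SyntaxError), the reversal trick should not be needed. Here is a minimal sketch that runs both patterns from above directly. The names ATLASSIAN_STYLE, STRICTER and extractKeys are mine, and I dropped the capture groups since match() with the g flag returns whole matches anyway.

```javascript
// Assumes ES2018 lookbehind support; this will not parse on older engines.
const ATLASSIAN_STYLE = /(?<![A-Z]{1,10}-?)[A-Z]+-\d+/g;  // pattern quoted from the Atlassian page
const STRICTER = /(?<![A-Za-z]{1,10}-?)[A-Z]+-\d+/g;      // variant that also rejects a lowercase prefix

function extractKeys(pattern, text) {
  // match() with a g flag returns every match, or null when there are none.
  return text.match(pattern) || [];
}

const sample = "BF-18 abc-123 X-88 ABCDEFGHIJKL-999 abc XY-Z-333 abcDEF-33 ABC-1";
console.log(extractKeys(ATLASSIAN_STYLE, sample)); // includes 'DEF-33'
console.log(extractKeys(STRICTER, sample));        // 'DEF-33' is gone
```

On the test string this should give the same two result lists shown earlier for the Java patterns.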