RegEx pattern to limit dashes in these circumstances

∥☆過路亽.° submitted on 2019-12-12 11:19:26

Question


Scenario

I'm using a third-party file renaming application, written in Delphi and with Pascal Script support: http://www.den4b.com/?x=products&product=renamer

The application allows regular expressions to be used to rename files. This means that if what I need to do with a filename cannot be accomplished with a single RegEx, I can chain several expressions, or use Pascal Script code, to adjust the filename until it is properly formatted for the needs of this question or anything else.

Problem

I need to format song filenames like the ones below. In these filenames the "...featuring artist" part is at the right of the string; I need to match it and move it to the left part of the string.

  • Carbin & Sirmark - Sorry Feat. Sevener
  • Kristjan Cash Cash - Take Me Home Feat. Bebe Rexha (Revoke Remix)

To make this simple to understand, we could imagine the filename tokenized like this:

[0]ARTIST   [1]DASH   [2]TRACK   [3]FEAT_ARTIST   [4]POSSIBLE_ADDITIONAL_INFO_INSIDE:()[]{}

Then what I need to do with a RegEx is reorder the tokens of the filename like this:

[0]ARTIST   [3]FEAT_ARTIST   [1]DASH   [2]TRACK   [4]POSSIBLE_ADDITIONAL_INFO_INSIDE:()[]{}

I currently do that using this RegEx:

\A([^-]?)\s-\s*(.?)\s([([])?((ft[.\s]|feat[.\s]|featuring[.\s])[^(){}[]]*)([)]])?(.+)?\Z

Replacing with:

$1 $4 - $2$7

The problem begins here, because the [0]ARTIST and [2]TRACK tokens could contain dashes, as in this filename:

  • Dj E-nergy C-21 - My Super-hero track! feat Dj Ass-hole

Then, correct me if I'm wrong, but I think it's simply impossible to solve this in a general way: a machine can't predict where one token ends and the next begins, or what is a name and what isn't, because I can't know how many dashes the filename contains.

For that reason, instead of chasing a naive perfection that could produce bad filenames because of the number of dashes inside, I prefer an exclusion solution that limits the dashes the expression should match in the filename.

Question

Taking the RegEx shown above as the starting point, how could I extend it to exclude filenames whose [0]ARTIST or [2]TRACK tokens contain dashes?

...Or in other words, how can I tell my RegEx to avoid modifying a filename when it contains more than one dash before the "...featuring artist" part (not after)?

Basically, the RegEx should determine whether [1]DASH is found more than once before [3]FEAT_ARTIST; if so, it should exclude that filename (leave it unmodified).

I know how to limit the occurrences of a RegEx group, more or less like this: ([\-]){1} to match only one dash occurrence, but I'm not sure how to implement it in the expression I'm using.


Expected Results

Just some random examples...

One dash only before the [3]FEAT_ARTIST so we can know when to separate [0]ARTIST from [2]TRACK tokens.

  • From: 'Carbin & Sirmark - Sorry Feat. Sevener'
  • To: 'Carbin & Sirmark Feat. Sevener - Sorry'

One dash only before the [3]FEAT_ARTIST so we can know when to separate [0]ARTIST from [2]TRACK tokens. With [4]POSSIBLE_ADDITIONAL_INFO_INSIDE:()[]{}.

  • From: 'Flight Facilities - Heart Attack Feat. Owl Eyes (Snakehips Remix)'
  • To: 'Flight Facilities Feat. Owl Eyes - Heart Attack (Snakehips Remix)'

One dash only before the [3]FEAT_ARTIST so we can know when to separate [0]ARTIST from [2]TRACK tokens. With [4]POSSIBLE_ADDITIONAL_INFO_INSIDE:()[]{} which also contains dashes.

  • From: 'Flight Facilities - Heart Attack Feat. Owl Eyes [Snake--hips Remix]'
  • To: 'Flight Facilities Feat. Owl Eyes - Heart Attack [Snake--hips Remix]'

One dash only between the [0]ARTIST and [2]TRACK tokens, but the filename doesn't have a [3]FEAT_ARTIST, so we don't touch it.

  • From: 'Fedde Le Grand - Cinematic'
  • To: 'Fedde Le Grand - Cinematic'

One dash only between the [0]ARTIST and [2]TRACK tokens, but the [3]FEAT_ARTIST is before the [1]DASH, so we don't touch it.

  • From: 'Fedde Le Grand Feat. Denny White - Cinematic'
  • To: 'Fedde Le Grand Feat. Denny White - Cinematic'

[0]ARTIST has dashes, so we can't know where to separate the [0]ARTIST and [2]TRACK tokens; the RegEx should exclude this filename and leave it unmodified.

  • From: 'Artist-Name - Track Name feat someone'
  • To: 'Artist-Name - Track Name feat someone'

[2]TRACK has dashes, so we can't know where to separate the [0]ARTIST and [2]TRACK tokens; the RegEx should exclude this filename and leave it unmodified.

  • From: 'Artist Name - Track-Name feat someone'
  • To: 'Artist Name - Track-Name feat someone'

Both the [0]ARTIST and [2]TRACK tokens have dashes, so we can't know where to separate them; the RegEx should exclude this filename and leave it unmodified.

  • From: 'Dj E-nergy C-21 - My Super-hero track! feat Dj Ass-hole'
  • To: 'Dj E-nergy C-21 - My Super-hero track! feat Dj Ass-hole'

Both the [0]ARTIST and [2]TRACK tokens have dashes and [3]FEAT_ARTIST doesn't exist, so again there is nothing to do here.

  • From: 'Dj E-nergy C-21 - My Super-hero track!'
  • To: 'Dj E-nergy C-21 - My Super-hero track!'

I hope this helps to understand what I need.


Answer 1:


Try with:

^(.+)\s+-\s+(.+?)\s+[fF](t|eat(uring)?)?\.?([^([\])\n]+)(.+)?$

DEMO

and use replace with: $1 Feat.$5 - $2$6

I tried it with ReNamer and Regex101, and it also works if the artist name contains a space-dash-space sequence ( + - + ), as in artist - name, BUT it will fail if such a fragment appears in the title part.

The ^(.+)\s+-\s+ part uses a greedy quantifier, .+, before the space-dash-space sequence that is treated as the delimiter between the artist name and the track title. It matches as much as it can, up to the last occurrence of space-dash-space. Because of that, it "ignores" dashes with spaces in the artist name, but it will cause an invalid match if such an element occurs in the track title. So:

  • Artist - name - track title feat. someone: it will be matched and modified properly,
  • Artist name - track - title feat. someone: it will fail, as the text will be split on the last dash.
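The greedy split described above can be reproduced outside ReNamer. The following is a minimal Python sketch, assuming Python's re module treats this pattern the same way ReNamer's engine does. The function name move_feat is hypothetical, the replacement uses Python's \1 syntax instead of $1, and the [ inside the character class is escaped for Python.

```python
import re

# Pattern from this answer; the "[" inside the character class is escaped for Python.
PATTERN = re.compile(
    r'^(.+)\s+-\s+(.+?)\s+[fF](t|eat(uring)?)?\.?([^(\[\])\n]+)(.+)?$'
)
# Python spelling of the replacement "$1 Feat.$5 - $2$6".
REPLACEMENT = r'\1 Feat.\5 - \2\6'


def move_feat(name):
    """Move the trailing 'feat' part in front of the last space-dash-space."""
    return PATTERN.sub(REPLACEMENT, name)


# The greedy (.+) runs up to the LAST " - ", so an extra " - " in the artist is harmless...
print(move_feat("Artist - name - track title feat. someone"))
# Artist - name Feat. someone - track title

# ...but an extra " - " in the title puts the split in the wrong place.
print(move_feat("Artist name - track - title feat. someone"))
# Artist name - track Feat. someone - title
```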

Instead of (ft[.\s]|feat[.\s]|featuring[.\s]) I used [fF](t|eat(uring)?)?\.?, which matches similar input but should work faster (it should reduce backtracking a little).

In my demo there is a space followed by + instead of \s+ (as above), because \s+ would match across lines in the demonstration and show invalid results; in single-line cases, like your problem, \s+ should work fine.




Answer 2:


I think the only thing you need to realize/change is that there is a distinguishable difference between the "separator hyphen" and the "embedded hyphens". Namely none of the embedded hyphens would have spaces on BOTH sides (I expect; you'll need to verify that). All you should need to do is change the beginning of your regexp above from \A([^-]?)\s-\s* to \A(.?)\s-\s+...
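As a rough illustration of that distinction, here is a small Python sketch (the helper name split_on_separator is hypothetical, and Python's re module is used as a stand-in for the renaming tool's engine). It splits only on hyphens that have whitespace on both sides, so embedded hyphens stay inside their token.

```python
import re

# A "separator hyphen" has whitespace on BOTH sides; embedded hyphens do not.
SEPARATOR = re.compile(r'\s+-\s+')


def split_on_separator(name):
    """Split a filename on separator hyphens only."""
    return SEPARATOR.split(name)


print(split_on_separator("Dj E-nergy C-21 - My Super-hero track! feat Dj Ass-hole"))
# ['Dj E-nergy C-21', 'My Super-hero track! feat Dj Ass-hole']
```

If the resulting list has more than two items, the filename contains more than one separator and is ambiguous.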




Answer 3:


I put all your file names into text editor UltraEdit version 22.10:

Carbin & Sirmark - Sorry Feat. Sevener
Kristjan Cash Cash - Take Me Home Feat. Bebe Rexha (Revoke Remix)
Dj E-nergy C-21 - My Super-hero track! feat Dj Ass-hole
Flight Facilities - Heart Attack Feat. Owl Eyes (Snakehips Remix)
Flight Facilities - Heart Attack Feat. Owl Eyes [Snake--hips Remix]
Fedde Le Grand - Cinematic
Fedde Le Grand Feat. Denny White - Cinematic
Artist-Name - Track Name feat someone
Artist Name - Track-Name feat someone
Dj E-nergy C-21 - My Super-hero track! feat Dj Ass-hole
Dj E-nergy C-21 - My Super-hero track!

With the Perl regular expression search string

^(.+) - (.+?) ((?:featuring|feat\.?|ft\.?) +(?:[^\r\n (\[{]| (?![(\[{]))+)

and the replace string

$1 $3 - $2

those file names were modified with a case insensitive Replace All to

Carbin & Sirmark Feat. Sevener - Sorry
Kristjan Cash Cash Feat. Bebe Rexha - Take Me Home (Revoke Remix)
Dj E-nergy C-21 feat Dj Ass-hole - My Super-hero track!
Flight Facilities Feat. Owl Eyes - Heart Attack (Snakehips Remix)
Flight Facilities Feat. Owl Eyes - Heart Attack [Snake--hips Remix]
Fedde Le Grand - Cinematic
Fedde Le Grand Feat. Denny White - Cinematic
Artist-Name feat someone - Track Name
Artist Name feat someone - Track-Name
Dj E-nergy C-21 feat Dj Ass-hole - My Super-hero track!
Dj E-nergy C-21 - My Super-hero track!

which looks like what you want. UltraEdit uses the Boost Perl regular expression library.

If the file renaming tool also supports negative lookaheads and greedy matching, an expression that may be useful for this task is:

\A(.+) - (.+?) ((?:featuring|feat\.?|ft\.?) +(?:[^ (\[{]| (?![(\[{]))+)

and the replace string is also:

$1 $3 - $2

Explanation of the search string:

^ ... start of a line
\A ... start of buffer

(.+) -  ... a greedy expression in a capturing group which matches any character (except newline characters) one or more times, up to the last occurrence of space-dash-space that still results in a positive match for the entire expression. The space-dash-space itself is not included in the group.

(.+?)  ... a non-greedy expression, also in a capturing group, matching any character (except newline characters) one or more times up to the next occurrence of a space followed by ...

(?:featuring|feat\.?|ft\.?) + ... word featuring OR abbreviation feat with or without a dot OR abbreviation ft with or without a dot AND 1 or more spaces.

( ... beginning of the third capturing group.

(?:[^\r\n (\[{]| (?![(\[{]))+ ... a non-capturing group matching either

  • a character not being
    • a carriage return or a line-feed (UE search string only), or
    • a space, or
    • an opening parenthesis, or
    • an opening square bracket, or
    • an opening brace

or

  • a space, using a negative lookahead to check that the next character is not
    • an opening parenthesis, or
    • an opening square bracket, or
    • an opening brace

one or more times. In other words, this last expression matches everything up to the end of the file name, or up to ( or [ or {, not including the space to the left of those characters, to avoid getting space-space-dash after FEAT_ARTIST after the replace.

) ... finally ends the third capturing group.
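To see the explained pieces work together, here is a minimal Python sketch of the \A variant of the search string with the replace string $1 $3 - $2. It assumes Python's re module is close enough to the Boost/Perl behaviour for this pattern; the function name move_feat_artist is hypothetical, and re.IGNORECASE stands in for the case-insensitive Replace All.

```python
import re

# The \A search string from this answer, matched case-insensitively.
PATTERN = re.compile(
    r'\A(.+) - (.+?) ((?:featuring|feat\.?|ft\.?) +(?:[^ (\[{]| (?![(\[{]))+)',
    re.IGNORECASE,
)


def move_feat_artist(name):
    """Python spelling of the replace string "$1 $3 - $2"."""
    return PATTERN.sub(r'\1 \3 - \2', name)


# Group 3 stops before " (", so the bracketed part is left where it was.
print(move_feat_artist(
    "Flight Facilities - Heart Attack Feat. Owl Eyes (Snakehips Remix)"))
# Flight Facilities Feat. Owl Eyes - Heart Attack (Snakehips Remix)
```

Names without a trailing feat part, such as Fedde Le Grand - Cinematic, do not match and come back unchanged.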


Edit 1: Also working (in UltraEdit) is the search string:

^(.+) - (.+?) ((?:featuring|feat|ft)[ .]+(?:[^\r\n (\[{]| (?![(\[{]))+)

which would also accept featuring., but makes the expression a little simpler.


Edit 2: Also working (in UltraEdit) is the search string:

^((?:.(?! - ))+.) - ((?:.(?! - ))+) ((?:featuring|feat|ft)[ .]+(?:[^\r\n (\[{]| (?![(\[{]))+)

which ignores all lines containing two space-dash-space sequences to the left of FEAT_ARTIST.

This expression matches character by character, using a negative lookahead to check that the string after the current character is not space-dash-space. For the first capturing group this selects the string up to the last character left of the first space-dash-space. In the second capturing group there must be no further space-dash-space, as that would definitely produce a negative result for the entire expression.
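The effect of the Edit 2 expression can be checked with a short Python sketch. It again assumes Python's re module behaves like the Boost/Perl engine for this pattern, and the function name move_feat_if_single_separator is hypothetical.

```python
import re

# The Edit 2 search string: neither group 1 nor group 2 may run across a " - ".
PATTERN = re.compile(
    r'^((?:.(?! - ))+.) - ((?:.(?! - ))+) '
    r'((?:featuring|feat|ft)[ .]+(?:[^\r\n (\[{]| (?![(\[{]))+)',
    re.IGNORECASE,
)


def move_feat_if_single_separator(name):
    """Apply "$1 $3 - $2" only when a single " - " precedes the feat part."""
    return PATTERN.sub(r'\1 \3 - \2', name)


print(move_feat_if_single_separator("Carbin & Sirmark - Sorry Feat. Sevener"))
# Carbin & Sirmark Feat. Sevener - Sorry

# Two space-dash-space sequences: no match, so the name is returned unchanged.
print(move_feat_if_single_separator("Artist - name - track title feat. someone"))
# Artist - name - track title feat. someone
```

Note that only space-dash-space sequences are counted. A name with an embedded hyphen, such as Artist-Name - Track Name feat someone, still matches and is rewritten.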




Answer 4:


With the help of @m.cekiera's regex I solved this by using a pascal-script that prevents any replacement when more than one dash is found in the filename:

// Formats an audio filename that has the "...featuring artist" part at the end of filename.
//------------------------------------------------------------------------------------------


// Pseudo-Example:
//
// From: [0]ARTIST_NAME  [1]DASH  [2]TRACK_TITLE  [3]FEAT_ARTIST  [4]POSSIBLE_ADDITIONAL_INFO_INSIDE:()[]{}
// To:   [0]ARTIST_NAME  [3]FEAT_ARTIST  [1]DASH  [2]TRACK_TITLE  [4]POSSIBLE_ADDITIONAL_INFO_INSIDE:()[]{}

// Real-Example:
//
// From: Carbin & Sirmark - Sorry Feat. Sevener.mp3
// To:   Carbin & Sirmark Feat. Sevener - Sorry.mp3

// Known limitations:
//
// • If the [0]ARTIST_NAME or [2]TRACK_TITLE parts contain any " - " the script will not work properly.
//   By default the script prevents any replacement on such filenames, so don't worry.


var
  rgxPattern: string;
  rgxReplace: string;
  dashCount: integer;
  baseName: string;
  extension: WideString;

begin

  baseName  := WideExtractBaseName(FileName);
  extension := WideExtractFileExt(FileName);

  // The regular expression that matches the filename parts.
  // http://stackoverflow.com/questions/32807698/regex-pattern-to-limit-dashes-in-these-circumstances
  rgxPattern := '^(.+)\s+-\s+(.+?)\s+[fF](t|eat(uring)?)?\.?([^([\])\n]+)(.+)?$';
  rgxReplace := '$1 Feat.$5 - $2$6';

  // The highest index of the " - " matches found in the filename
  // (High returns -1 for no match and 0 for exactly one match).
  dashCount := high(MatchesRegEx(baseName, '\s-\s' , false));

  // If only one " - " is found then...
  If (dashCount = 0) Then
    begin // Do the replacement.
      baseName := ReplaceRegEx(baseName, rgxPattern, rgxReplace, false, true);
      FileName := baseName + extension;
    end;

end.
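For readers who want to try the same guard outside ReNamer, here is a rough Python equivalent of the script's logic. It is only a sketch: the function name rename is hypothetical, os.path.splitext stands in for WideExtractBaseName and WideExtractFileExt, and the [ inside the character class is escaped for Python.

```python
import os
import re

SEPARATOR = re.compile(r'\s-\s')
PATTERN = re.compile(
    r'^(.+)\s+-\s+(.+?)\s+[fF](t|eat(uring)?)?\.?([^(\[\])\n]+)(.+)?$'
)


def rename(file_name):
    """Move the feat part only when the base name has exactly one " - "."""
    base, ext = os.path.splitext(file_name)
    # Guard: zero or several separators means the name is left untouched.
    if len(SEPARATOR.findall(base)) != 1:
        return file_name
    return PATTERN.sub(r'\1 Feat.\5 - \2\6', base) + ext


print(rename("Carbin & Sirmark - Sorry Feat. Sevener.mp3"))
# Carbin & Sirmark Feat. Sevener - Sorry.mp3

print(rename("Artist - name - track title feat. someone.mp3"))
# Artist - name - track title feat. someone.mp3
```

As in the script, only space-dash-space sequences are counted, so names with embedded hyphens (Artist-Name) are still rewritten.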


Source: https://stackoverflow.com/questions/32807698/regex-pattern-to-limit-dashes-in-these-circumstances
