Parse subtitle file using regex C#

南笙 2021-01-26 19:49

I need to find the number, the in and out timecode points and all lines of the text.

9
00:09:48,347 --> 00:09:52,818
- Let's see... what else she's got?
-          


        
5 Answers
  • 2021-01-26 19:59

    I think there are two problems with the regex. The first is that the . near the end, in (?<Sub>.+), does not match newlines. You could modify it to:

    (?<Sub>(.|[\r\n])+?)
    

    Or you could specify RegexOptions.Singleline as an option to the regex. The only thing the option does is make the dot match newlines.

    The second problem is that .+ matches as many lines as it can. You can make it non-greedy and have it stop at the first empty line, like this:

    (?<Sub>(.|[\r\n])+?(?=\r\n\r\n|$))
    

    This matches the least amount of text that ends with an empty line or the end of the string.
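
For illustration, here is a minimal C# sketch of how the two fixes might be combined. It uses the RegexOptions.Singleline variant, so the Sub group is written with a plain dot. The prefix that captures the number and the timecodes is my assumption, since only the Sub group appears above, and the names SubGroupDemo, Parse, Number, In and Out are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubGroupDemo
{
    // Only the Sub group follows the answer; the Number/In/Out prefix is assumed.
    // Singleline makes the dot match \r and \n, so "." replaces "(.|[\r\n])".
    private static readonly Regex Entry = new Regex(
        @"(?<Number>\d+)\r\n(?<In>\S+) --> (?<Out>\S+)\r\n(?<Sub>.+?(?=\r\n\r\n|$))",
        RegexOptions.Singleline);

    public static List<(string Number, string In, string Out, string Sub)> Parse(string srt)
    {
        var result = new List<(string Number, string In, string Out, string Sub)>();
        foreach (Match m in Entry.Matches(srt))
        {
            result.Add((
                m.Groups["Number"].Value,
                m.Groups["In"].Value,
                m.Groups["Out"].Value,
                // Trim drops a stray \r left behind when the file ends with one line break.
                m.Groups["Sub"].Value.Trim()));
        }
        return result;
    }
}
```

The lazy quantifier plus the lookahead is what keeps each match inside one entry: without the lookahead a lazy .+? would stop after a single character, and without the laziness it would run to the end of the file.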

  • 2021-01-26 20:07

    I used this regex in my Ruby parser:

    slines.scan(/(^[0-9]+)\r?\n(.*? --> .*?)\r?\n(.*?)(?=^[0-9]+\r?\n|\s+\Z)/im).map{|z| [z[0],[z[1],z[2].strip]]}
    

    where "slines" is the whole subtitle file read into memory.

  • 2021-01-26 20:08

    I am using the following regular expression to parse .srt files:

    @"(?<number>\d+)\r\n(?<start>\S+)\s-->\s(?<end>\S+)\r\n(?<text>(.|[\r\n])+?)\r\n\r\n"
    

    Regular Expression Language - Quick Reference
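
A rough sketch of how that pattern could be used from C#, converting the captures into typed values. The names SrtRegexParser, Parse and ParseTime are hypothetical, and the timecode format hh:mm:ss,fff is assumed from the sample in the question. Note that the pattern only matches entries followed by an empty line, so this sketch appends one to the input to avoid losing the last entry; that padding is my addition, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SrtRegexParser
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex Entry = new Regex(
        @"(?<number>\d+)\r\n(?<start>\S+)\s-->\s(?<end>\S+)\r\n(?<text>(.|[\r\n])+?)\r\n\r\n");

    public static List<(int Number, TimeSpan Start, TimeSpan End, string Text)> Parse(string srt)
    {
        var entries = new List<(int Number, TimeSpan Start, TimeSpan End, string Text)>();
        // Pad the input so the final entry is also followed by an empty line.
        foreach (Match m in Entry.Matches(srt + "\r\n\r\n"))
        {
            entries.Add((
                int.Parse(m.Groups["number"].Value, CultureInfo.InvariantCulture),
                ParseTime(m.Groups["start"].Value),
                ParseTime(m.Groups["end"].Value),
                m.Groups["text"].Value));
        }
        return entries;
    }

    // Assumes timecodes of the form 00:09:48,347.
    private static TimeSpan ParseTime(string s) =>
        TimeSpan.ParseExact(s, @"hh\:mm\:ss\,fff", CultureInfo.InvariantCulture);
}
```

The pattern expects \r\n line endings, so a file saved with bare \n would need its line endings normalized first.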

  • 2021-01-26 20:17

    If I were you, I'd step back from a regex-based implementation and look at a state machine, walking through the file line by line. Your format looks simple enough to handle with maybe 20-40 lines of easy-to-understand code, but too complex for a reasonable regex.
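
To make that concrete, here is one way such a state machine might look in C#. It is only a sketch: the Cue class, the State enum and the choice to skip malformed lines silently are my assumptions, not something this answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class Cue
{
    public int Number;
    public string In = "";
    public string Out = "";
    public List<string> Lines = new List<string>();
}

public static class SrtStateMachine
{
    private enum State { ExpectNumber, ExpectTimecode, ReadText }

    public static List<Cue> Parse(TextReader reader)
    {
        var cues = new List<Cue>();
        var state = State.ExpectNumber;
        Cue current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            switch (state)
            {
                case State.ExpectNumber:
                {
                    // Blank or non-numeric lines are skipped until a number shows up.
                    if (int.TryParse(line.Trim(), out int number))
                    {
                        current = new Cue { Number = number };
                        state = State.ExpectTimecode;
                    }
                    break;
                }
                case State.ExpectTimecode:
                {
                    string[] parts = line.Split(new[] { " --> " }, StringSplitOptions.None);
                    if (parts.Length == 2)
                    {
                        current.In = parts[0].Trim();
                        current.Out = parts[1].Trim();
                        state = State.ReadText;
                    }
                    else
                    {
                        state = State.ExpectNumber; // malformed entry: drop it and resync
                    }
                    break;
                }
                case State.ReadText:
                {
                    if (line.Trim().Length == 0)
                    {
                        cues.Add(current); // an empty line closes the entry
                        state = State.ExpectNumber;
                    }
                    else
                    {
                        current.Lines.Add(line);
                    }
                    break;
                }
            }
        }
        // The last entry may not be followed by an empty line.
        if (state == State.ReadText) cues.Add(current);
        return cues;
    }
}
```

Since it reads through TextReader.ReadLine, it accepts both \r\n and \n line endings, and it can be fed a StreamReader for a file or a StringReader for text already in memory.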

  • 2021-01-26 20:23

    I would personally split the lines into an array and loop through it, examining each line and doing a regex match only for the StartTime --> EndTime lines. Then you can use some fairly simple logic to grab Order from the previous line and the text from the lines that follow (by searching ahead to find the next StartTime --> EndTime line and backtracking two lines).

    I think this chops the problem up a little, so that you don't have one regular expression trying to do it all.
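
Something along these lines, as a sketch only. The class name SrtLineScanner, the tuple field names and the exact timecode pattern are placeholders of mine, and a subtitle text line that itself looks like "a --> b" would be mistaken for a timecode line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrtLineScanner
{
    // Applied to one line at a time, so no Multiline or Singleline option is needed.
    private static readonly Regex Timecode = new Regex(
        @"^(?<start>\S+)\s+-->\s+(?<end>\S+)\s*$");

    public static List<(string Order, string Start, string End, string Text)> Parse(string srt)
    {
        string[] lines = srt.Replace("\r\n", "\n").Split('\n');

        // Pass 1: find every timecode line (it needs a previous line for Order).
        var rows = new List<int>();
        for (int i = 1; i < lines.Length; i++)
        {
            if (Timecode.IsMatch(lines[i])) rows.Add(i);
        }

        // Pass 2: Order is the line before; text runs up to two lines
        // before the next timecode line (the empty line and the next Order).
        var result = new List<(string Order, string Start, string End, string Text)>();
        for (int k = 0; k < rows.Count; k++)
        {
            int row = rows[k];
            int end = k + 1 < rows.Count ? rows[k + 1] - 2 : lines.Length;
            var text = new List<string>();
            for (int j = row + 1; j < end; j++)
            {
                if (lines[j].Trim().Length > 0) text.Add(lines[j]);
            }
            Match m = Timecode.Match(lines[row]);
            result.Add((
                lines[row - 1].Trim(),
                m.Groups["start"].Value,
                m.Groups["end"].Value,
                string.Join("\n", text)));
        }
        return result;
    }
}
```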
