Parsing srt subtitles

粉色の甜心 2021-02-08 12:31

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 -->          


        
6 answers
  • 2021-02-08 12:52
    import re

    # `text` holds the full contents of the .srt file.
    splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
    regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
    for s in splits:
        r = regex.search(s)
        if r:  # skip blocks that are not well-formed subtitle entries
            print(r.groups())
    
  • 2021-02-08 12:52

    Here's a snippet I wrote which converts SRT files into dictionaries:

    import re
    def srt_time_to_seconds(time):
        split_time=time.split(',')
        major, minor = (split_time[0].split(':'), split_time[1])
        # An hour is 3600 seconds (1440 is the number of minutes in a day).
        return int(major[0])*3600 + int(major[1])*60 + int(major[2]) + float(minor)/1000
    
    def srt_to_dict(srtText):
        subs=[]
        for s in re.sub('\r\n', '\n', srtText).split('\n\n'):
            st = s.split('\n')
            if len(st)>=3:
                split = st[1].split(' --> ')
                subs.append({'start': srt_time_to_seconds(split[0].strip()),
                             'end': srt_time_to_seconds(split[1].strip()),
                             'text': '<br />'.join(st[2:])
                            })
        return subs
    

    Usage:

    # Assumes the snippet above is saved as srt_to_dict.py
    from srt_to_dict import srt_to_dict
    with open('test.srt', "r") as f:
        srtText = f.read()
    print(srt_to_dict(srtText))
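
    As a sanity check on the timestamp arithmetic, here is a minimal sketch that does the same conversion with `datetime.timedelta` from the standard library, so the hour and minute factors are not written by hand. The name `srt_timestamp_to_seconds` is hypothetical and not part of the snippet above.

```python
import datetime
import re

# Hypothetical helper for illustration only.
_TIMESTAMP = re.compile(r'(\d+):(\d{2}):(\d{2}),(\d{3})')

def srt_timestamp_to_seconds(stamp):
    """Convert an 'HH:MM:SS,mmm' SRT timestamp to seconds as a float."""
    match = _TIMESTAMP.match(stamp.strip())
    if match is None:
        raise ValueError('not an SRT timestamp: %r' % stamp)
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    # timedelta handles the unit conversions (1 hour = 3600 seconds).
    delta = datetime.timedelta(hours=hours, minutes=minutes,
                               seconds=seconds, milliseconds=millis)
    return delta.total_seconds()

print(srt_timestamp_to_seconds('00:00:12,815'))  # 12.815
print(srt_timestamp_to_seconds('01:02:03,500'))  # 3723.5
```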
    
  • 2021-02-08 12:56

    I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

    I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

    Here's a usage example with your sample input:

    >>> import srt, pprint
    >>> gen = srt.parse('''\
    ... 1
    ... 00:00:12,815 --> 00:00:14,509
    ... Chlapi, jak to jde s
    ... těma pracovníma světlama?.
    ... 
    ... 2
    ... 00:00:14,815 --> 00:00:16,498
    ... Trochu je zesilujeme.
    ... 
    ... 3
    ... 00:00:16,934 --> 00:00:17,814
    ... Jo, sleduj.
    ... 
    ... ''')
    >>> pprint.pprint(list(gen))
    [Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
     Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
     Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]
    
  • 2021-02-08 13:00

    Here's some code I had lying around to parse SRT files:

    from __future__ import division
    
    import datetime
    
    class Srt_entry(object):
        def __init__(self, lines):
            def parsetime(string):
                hours, minutes, seconds = string.split(u':')
                hours = int(hours)
                minutes = int(minutes)
                seconds = float(u'.'.join(seconds.split(u',')))
                return datetime.timedelta(0, seconds, 0, 0, minutes, hours)
            self.index = int(lines[0])
            start, arrow, end = lines[1].split()
            self.start = parsetime(start)
            if arrow != u"-->":
                raise ValueError
            self.end = parsetime(end)
            self.lines = lines[2:]
            if not self.lines[-1]:
                del self.lines[-1]
        # Python 3: __str__ takes the place of __unicode__.
        def __str__(self):
            def delta_to_string(d):
                hours = (d.days * 24) \
                        + (d.seconds // (60 * 60))
                minutes = (d.seconds // 60) % 60
                seconds = d.seconds % 60 + d.microseconds / 1000000
                return u','.join((u"%02d:%02d:%06.3f"
                                  % (hours, minutes, seconds)).split(u'.'))
            return (str(self.index) + u'\n'
                    + delta_to_string(self.start)
                    + ' --> '
                    + delta_to_string(self.end) + u'\n'
                    + u''.join(self.lines))
    
    
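    import io

    entries = []
    entry = []
    # The encoding is an assumption; pass the file's real encoding here.
    with io.open("foo.srt", encoding="utf-8-sig") as srt_file:
        for line in srt_file:
            if line.strip() == u'':
                # A blank line closes the current entry.
                if entry:
                    entries.append(Srt_entry(entry))
                entry = []
            else:
                entry.append(line)
    # The last entry may not be followed by a blank line.
    if entry:
        entries.append(Srt_entry(entry))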
    
  • 2021-02-08 13:03

    Why not use pysrt?

  • 2021-02-08 13:04

    The text is followed by an empty line, or the end of file. So you can use:

    r' .... (?P<text>.*?)(\n\n|$)'
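
    One way this idea could be spelled out is sketched below. The answer elides everything before the text group, so the index and timing parts of the pattern, and the names `BLOCK` and `parse_blocks`, are assumptions made for illustration. The text group is lazy and stops at the first blank line or at the end of the input.

```python
import re

# Hypothetical full pattern: only the lazy text group and its terminator
# come from the answer above; the rest is assumed.
BLOCK = re.compile(
    r'(?P<index>\d+)[ \t]*\n'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})[ \t]*\n'
    r'(?P<text>.*?)(?:\n\s*\n|\s*$)',
    re.DOTALL,
)

def parse_blocks(srt_text):
    """Return one dict per subtitle block with index, start, end and text."""
    # Normalise Windows line endings so a blank line is always '\n\n'.
    srt_text = srt_text.replace('\r\n', '\n')
    return [m.groupdict() for m in BLOCK.finditer(srt_text)]

sample = (
    '1\n'
    '00:00:12,815 --> 00:00:14,509\n'
    'Chlapi, jak to jde s\n'
    'těma pracovníma světlama?.\n'
    '\n'
    '2\n'
    '00:00:14,815 --> 00:00:16,498\n'
    'Trochu je zesilujeme.\n'
)
for block in parse_blocks(sample):
    print(block['index'], block['start'], block['end'], repr(block['text']))
```

    All values come back as strings; multi-line subtitle text keeps its internal newline.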
    