Parsing srt subtitles

后端未结

关注

 6  1779

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 -->


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2021-02-08 12:52
              
            
            
                                                                       
splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
for s in splits:
    r = regex.search(s)
    print r.groups()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  后悔当初        
                
              
                            
                2021-02-08 12:52
              
            
            
                                                                       
Here's a snippet I wrote which converts SRT files into dictionaries:

import re
def srt_time_to_seconds(time):
    split_time=time.split(',')
    major, minor = (split_time[0].split(':'), split_time[1])
    return int(major[0])*1440 + int(major[1])*60 + int(major[2]) + float(minor)/1000

def srt_to_dict(srtText):
    subs=[]
    for s in re.sub('\r\n', '\n', srtText).split('\n\n'):
        st = s.split('\n')
        if len(st)>=3:
            split = st[1].split(' --> ')
            subs.append({'start': srt_time_to_seconds(split[0].strip()),
                         'end': srt_time_to_seconds(split[1].strip()),
                         'text': '<br />'.join(j for j in st[2:len(st)])
                        })
    return subs


Usage:

import srt_to_dict
with open('test.srt', "r") as f:
        srtText = f.read()
        print srt_to_dict(srtText)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  悲&欢浪女        
                
              
                            
                2021-02-08 12:56
              
            
            
                                                                       
I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

Here's a usage example with your sample input:

>>> import srt, pprint
>>> gen = srt.parse('''\
... 1
... 00:00:12,815 --> 00:00:14,509
... Chlapi, jak to jde s
... těma pracovníma světlama?.
... 
... 2
... 00:00:14,815 --> 00:00:16,498
... Trochu je zesilujeme.
... 
... 3
... 00:00:16,934 --> 00:00:17,814
... Jo, sleduj.
... 
... ''')
>>> pprint.pprint(list(gen))
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
 Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
 Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2021-02-08 13:00
              
            
            
                                                                       
Here's some code I had lying around to parse SRT files:

from __future__ import division

import datetime

class Srt_entry(object):
    def __init__(self, lines):
        def parsetime(string):
            hours, minutes, seconds = string.split(u':')
            hours = int(hours)
            minutes = int(minutes)
            seconds = float(u'.'.join(seconds.split(u',')))
            return datetime.timedelta(0, seconds, 0, 0, minutes, hours)
        self.index = int(lines[0])
        start, arrow, end = lines[1].split()
        self.start = parsetime(start)
        if arrow != u"-->":
            raise ValueError
        self.end = parsetime(end)
        self.lines = lines[2:]
        if not self.lines[-1]:
            del self.lines[-1]
    def __unicode__(self):
        def delta_to_string(d):
            hours = (d.days * 24) \
                    + (d.seconds // (60 * 60))
            minutes = (d.seconds // 60) % 60
            seconds = d.seconds % 60 + d.microseconds / 1000000
            return u','.join((u"%02d:%02d:%06.3f"
                              % (hours, minutes, seconds)).split(u'.'))
        return (unicode(self.index) + u'\n'
                + delta_to_string(self.start)
                + ' --> '
                + delta_to_string(self.end) + u'\n'
                + u''.join(self.lines))


srt_file = open("foo.srt")
entries = []
entry = []
for line in srt_file:
    if options.decode:
        line = line.decode(options.decode)
    if line == u'\n':
        entries.append(Srt_entry(entry))
        entry = []
    else:
        entry.append(line)
srt_file.close()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  逝去的感伤        
                
              
                            
                2021-02-08 13:03
              
            
            
                                                                       
Why not use pysrt?
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2021-02-08 13:04
              
            
            
                                                                       
The text is followed by an empty line, or the end of file. So you can use:

r' .... (?P<text>.*?)(\n\n|$)'

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复