Remove words from a subtitle file that aren't in a wordlist (of common words)

Posted by 青春壹個敷衍的年華 on 2021-02-10 14:51:16


I have some subtitle files, and I don't intend to learn every single word in them. There is no need to learn hard terms such as cleidocranial or dysplasia.

I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify it or run it. (I'm using Linux.)

Here is our example:

subtitle file (.srt):

00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.

wordlist of 3000 common words (.txt):


Output we need (.srt):

00:00:13,000 --> 00:00:15,000
People with * * are good.

Or just mark them if it's possible (.srt):

00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

If there is a solution that works only with plain text (without timecodes), that's fine too; just explain how to run it.
Thank you.


The following processes the 3rd line only of every '.srt' file. It can be easily adapted to process other lines and/or other files.

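import os
import re
from glob import glob

# Load the wanted words, lowercased, into a set for fast lookups.
with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    if filename_in.endswith('_new.srt'):
        continue  # skip output files written by an earlier run
    # The '_new' suffix is inferred from the sample output shown in this answer.
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                # Split into words (odd indices) and separators (even indices).
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)  # write every line, modified or not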

Result (for the subtitle.srt you gave as an example):

! cat subtitle_new.srt
00:00:13,000 --> 00:00:15,000
People with * * are good.

Alternative: just add a '*' next to out-of-vocabulary words:

# replace:
#                 parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]

The output is then:

00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.


  • The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
  • We use glob to find all filenames ending in '.srt'.
  • For each such file, we construct a new filename from its base name (os.path.splitext drops the extension) plus a '_new' suffix, as in the result shown above.
  • We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
  • line.strip() removes the trailing newline.
  • We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it leaves in single quotes such as "don't"; it may or may not be what you want, adapt at will of course).
  • We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
  • The words themselves are every other element of parts, starting at index 1.
  • We replace the words by '*' if their lowercase form is not in keep_words.
  • Finally we re-assemble that line, and generally output all lines to the new file.
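
If you want to process every subtitle text line rather than only the 3rd line, the split-and-replace step described above can be wrapped in a function. This is only a sketch: the names mask_line, mask_subtitles and TIMECODE are made up for illustration, and it assumes that a line made up only of digits is a cue number and that timecode lines look like the one in the question.

```python
import re

# Assumed SRT timecode shape, e.g. "00:00:13,000 --> 00:00:15,000".
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->")


def mask_line(line, keep_words, mark=False):
    """Replace (or, with mark=True, flag with '*') words not in keep_words."""
    parts = re.split(r"([\w']+)", line)
    # Odd indices hold the words; even indices hold the separators,
    # so punctuation and the trailing newline are preserved.
    parts[1::2] = [
        w if w.lower() in keep_words else (f"{w}*" if mark else "*")
        for w in parts[1::2]
    ]
    return "".join(parts)


def mask_subtitles(lines, keep_words, mark=False):
    """Yield all lines, masking only the subtitle text lines."""
    for line in lines:
        text = line.strip()
        if not text or text.isdigit() or TIMECODE.match(text):
            yield line  # blank line, cue number, or timecode: copy as-is
        else:
            yield mask_line(line, keep_words, mark)


keep = {"people", "with", "are", "good"}
demo = [
    "1\n",
    "00:00:13,000 --> 00:00:15,000\n",
    "People with cleidocranial dysplasia are good.\n",
    "\n",
]
print("".join(mask_subtitles(demo, keep)))
```

To use it on a real file, pass the open input file as lines and write each yielded line to the output file. A subtitle line that consists only of a number would be mistaken for a cue number and left untouched.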


You could simply run a Python script like this:

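with open("words.txt", "rt") as words:
    # build a set with every common word, lowercased
    wordList = set(words.read().lower().split("\n"))

# NOTE: the file names were missing from the original post;
# "subtitles.srt" and "subtitles_out.srt" are placeholders.
with open("subtitles.srt", "rt") as subtitles:
    with open("subtitles_out.srt", "wt") as out:
        for line in subtitles:
            # lines starting with a digit (cue numbers, timecodes) are copied unchanged
            if not line[:1].isdigit():
                marked = []
                for word in line.split():
                    # strip surrounding punctuation before the lookup
                    bare = word.strip(".,!?;:\"'()").lower()
                    marked.append(word if bare in wordList else f"*{word}*")
                line = " ".join(marked) + "\n"
            # write every line, so nothing is dropped or duplicated
            out.write(line)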

This script marks every word that is not in the common-words file as *word*. It leaves the original file untouched and writes everything to a new output file.
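
Since the question also asks how to run something like this on Linux, here is a rough sketch of a small command-line script. The script name mark_words.py, the function names and the three positional arguments are all assumptions made for illustration. It treats the input as plain text and leaves purely numeric tokens alone, so timecode lines pass through unchanged.

```python
import re
import sys


def load_words(path):
    """Read one word per line into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def mark_unknown(text, keep_words):
    """Append '*' to every word that is neither in keep_words nor a number."""
    def fix(match):
        word = match.group()
        if word.isdigit() or word.lower() in keep_words:
            return word
        return word + "*"
    return re.sub(r"[\w']+", fix, text)


def main(wordlist_path, in_path, out_path):
    keep_words = load_words(wordlist_path)
    with open(in_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(mark_unknown(line, keep_words))


if __name__ == "__main__":
    if len(sys.argv) == 4:
        main(*sys.argv[1:])
    else:
        print("usage: python3 mark_words.py WORDLIST INFILE OUTFILE")
```

Saved as mark_words.py (a hypothetical name), it would be run from a terminal as: python3 mark_words.py words.txt subtitle.srt subtitle_marked.srt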

