Question
I have some subtitle files, and I don't intend to learn every single word in them; there is no need to learn hard terms like cleidocranial, dysplasia...
I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify it or run it. (I'm using Linux.)
Here is our example:
subtitle file (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.
wordlist of 3000 common words (.txt):
...
people
with
are
good
...
Output we need (.srt):
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Or just mark them if it's possible (.srt):
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
If there is a solution that works only with plain text (without timecodes), that's OK; just explain how to run it.
Thank you.
Answer 1:
The following processes only the 3rd line of every '.srt' file. It can be easily adapted to process other lines and/or other files.
import os
import re
from glob import glob

with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)
Result (for the subtitle.srt you gave as an example):
! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
Alternative: just add a '*' next to out-of-vocabulary words:
# replace:
# parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]
The output is then:
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
Explanation:
- The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
- We use glob to find all filenames ending in '.srt'.
- For each such file, we construct a new filename derived from it as '..._new.srt'.
- We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0). line.strip() removes the trailing newline.
- We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it leaves in single quotes such as "don't"; it may or may not be what you want, adapt at will of course).
- We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
- The words themselves are every other element of parts, starting at index 1.
- We replace the words by '*' if their lowercase form is not in keep_words.
- Finally we re-assemble that line, and generally output all lines to the new file.
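The split-and-replace step described in the list above can be tried on its own, without any files. The following is a minimal sketch, not part of the original answer: the helper name mask_line, its mark_only flag, and the small in-memory keep_words set (standing in for the contents of words.txt) are all assumptions made for illustration.

```python
import re

# Same pattern as in the answer: runs of word characters and apostrophes.
WORD_RE = re.compile(r"([\w']+)")

def mask_line(line, keep_words, mark_only=False):
    """Replace (or, with mark_only=True, mark) words not found in keep_words."""
    parts = WORD_RE.split(line)
    # Because the pattern has a capturing group, the words sit at the odd
    # indices of `parts` and the separators at the even indices.
    parts[1::2] = [
        w if w.lower() in keep_words else (f"{w}*" if mark_only else "*")
        for w in parts[1::2]
    ]
    return "".join(parts)

# Hypothetical stand-in for the word list read from words.txt
keep_words = {"people", "with", "are", "good"}
text = "People with cleidocranial dysplasia are good."

print(WORD_RE.split("People, who are good."))
print(mask_line(text, keep_words))
print(mask_line(text, keep_words, mark_only=True))
```

To run it on Linux, save it to a file (for example mask_demo.py, a made-up name) and call python3 mask_demo.py in a terminal. The same applies to the full script above: put it in the directory that contains words.txt and the .srt files and run it with python3.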
Answer 2:
You could simply run a Python script like this:
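import re

with open("words.txt", "rt") as words:
    # build a lowercase set of the common words for fast lookup
    word_set = {w.strip().lower() for w in words if w.strip()}

def mark(match):
    # keep known words, wrap unknown ones as *word*
    word = match.group()
    return word if word.lower() in word_set else f"*{word}*"

with open("subtitle.srt", "rt") as subtitles, open("subtitle_output.srt", "wt") as out:
    for line in subtitles:
        if line[0].isdigit():
            # cue numbers and timecodes start with a digit: copy them unchanged
            out.write(line)
        else:
            # write each text line exactly once, with unknown words marked
            out.write(re.sub(r"[\w']+", mark, line))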
This script wraps every word that is not in the common-words file in asterisks (*word*). It leaves the original file untouched and writes everything to a new output file.
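One caveat: the script decides whether to skip a line by checking if it starts with a digit, so a subtitle text line such as "3 people ..." would also be copied unchanged. A possible refinement, sketched below under assumptions, is to recognise cue numbers and timecodes by their shape instead. The helper names is_metadata and mark_srt are hypothetical, the timecode pattern assumes the HH:MM:SS,mmm --> HH:MM:SS,mmm layout shown in the question, and treating pure numbers as known words is a design choice of this sketch.

```python
import re

# Assumed shapes of the non-text lines in an .srt file
INDEX_RE = re.compile(r"^\d+\s*$")
TIMECODE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)

def is_metadata(line):
    """True for cue numbers, timecode lines and blank separator lines."""
    return not line.strip() or bool(INDEX_RE.match(line) or TIMECODE_RE.match(line))

def mark_srt(text, keep_words):
    """Wrap out-of-list words in asterisks, on subtitle text lines only."""
    def mark(match):
        word = match.group()
        # Assumption: plain numbers are never treated as unknown words.
        if word.isdigit() or word.lower() in keep_words:
            return word
        return f"*{word}*"

    out = []
    for line in text.splitlines(keepends=True):
        out.append(line if is_metadata(line) else re.sub(r"[\w']+", mark, line))
    return "".join(out)

# Made-up cue whose text starts with a digit
sample = (
    "2\n"
    "00:00:13,000 --> 00:00:15,000\n"
    "3 people with cleidocranial dysplasia are good.\n"
)
print(mark_srt(sample, {"people", "with", "are", "good"}))
```

A text line that consists of nothing but a number is still indistinguishable from a cue number here and is left as-is.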
Source: https://stackoverflow.com/questions/65550885/remove-words-from-a-subtitle-file-that-arent-in-a-wordlist-of-common-words