Reading only the words of a specific speaker and adding those words to a list

既然无缘 2021-01-28 06:08

I have a transcript, and to analyze each speaker I need to add only that speaker's words to a string. The problem I'm having is that each line does not start with

2 answers
  •  说谎 (original poster)
    2021-01-28 07:00

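    A regex works well here. Since you'll apply it to every line, you can save a little processing by compiling it once before the loop. A line that matches the pattern starts a new speaker; a line that doesn't is treated as a continuation of the most recent speaker.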

    import re

    speaker_words = {}
    # A speaker line looks like "NAME: words..."; compile once, reuse for every line.
    speaker_pattern = re.compile(r'^(\w+?):(.*)$')

    with open("transcript.txt", "r") as f:
        current_speaker = None
        for line in f:
            line = line.strip()
            match = speaker_pattern.match(line)
            if match is not None:
                # New speaker: remember the name and keep only the text after the colon.
                current_speaker = match.group(1)
                line = match.group(2).strip()
                if current_speaker not in speaker_words:
                    speaker_words[current_speaker] = []
            if current_speaker:
                # Lines without a label belong to the most recent speaker.
                # You may want to do some sort of punctuation filtering too.
                words = line.split()  # splits on any whitespace, drops empty strings
                speaker_words[current_speaker].extend(words)

    print(speaker_words)
    

    This outputs the following (reformatted here for readability):

    {
        "BOB": ['blah', 'blah', 'blah', 'blah', 'blah', 'hello', 'goodbye', 'etc.', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah.'],
        "JERRY": ['.............................................', '...............']
    }
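
The code comment above mentions punctuation filtering, and the question asks for each speaker's words as a string. A minimal sketch of both follows, assuming it is acceptable to strip leading and trailing punctuation and to lowercase each word. The function name `collect_speaker_words` and the sample lines are hypothetical and are not the asker's data.

```python
import re
import string

# Same idea as the answer: a line of the form "NAME: text" starts a new speaker.
SPEAKER_PATTERN = re.compile(r'^(\w+):(.*)$')

def collect_speaker_words(lines):
    """Group cleaned words by speaker; unlabeled lines continue the previous speaker."""
    speaker_words = {}
    current_speaker = None
    for line in lines:
        line = line.strip()
        match = SPEAKER_PATTERN.match(line)
        if match is not None:
            current_speaker = match.group(1)
            line = match.group(2)
            speaker_words.setdefault(current_speaker, [])
        if current_speaker is None:
            continue  # text before the first speaker label is ignored
        for word in line.split():
            # Assumption: trim surrounding punctuation and lowercase each word.
            cleaned = word.strip(string.punctuation).lower()
            if cleaned:  # skip tokens that were punctuation only
                speaker_words[current_speaker].append(cleaned)
    return speaker_words

# Hypothetical sample transcript.
sample = [
    "BOB: Hello there, Jerry!",
    "How are you?",
    "JERRY: Fine... thanks.",
]
result = collect_speaker_words(sample)
bob_text = " ".join(result["BOB"])  # one string per speaker, as the question asks
print(bob_text)
```

Because the function takes any iterable of lines, an open file object can be passed in directly in place of the sample list.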
    
