How do I program a bigram table in Python?


I'm doing this homework and I'm stuck at this point. How can I compute bigram frequencies for English text, as conditional probabilities, in Python?

1 answer

悲&欢浪女

2021-01-27 09:23

Assuming your file has no other punctuation (easy enough to strip out):

import itertools

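def pairwise(s):
    # yield overlapping pairs: (s0, s1), (s1, s2), ...
    a, b = itertools.tee(s)
    next(b, None)  # advance the second iterator by one; the default avoids StopIteration on empty input
    return zip(a, b)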

import string

letters = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z', 52 characters in total
index = {ch: i for i, ch in enumerate(letters)}  # letter -> row/column in `counts`

counts = [[0] * 52 for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    # get pairwise characters from the text
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        if a not in index or b not in index:
            continue  # skip digits, punctuation, and other non-letters
        # row = the "given" character, column = the character that follows it
        # (ord(a) - ord('a') would be negative for uppercase letters and silently hit the wrong row)
        counts[index[a]][index[b]] += 1

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities

totals = [sum(row) for row in counts]
for given in range(52):
    if not totals[given]:
        continue  # this character never appeared as a "given" character
    for i in range(52):
        counts[given][i] /= totals[given]  # P(letters[i] | letters[given])

I haven't tested this, but it should be a good start.
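
Since the question asks for a table, here is a minimal sketch of how such a nested list could be printed as one. It is self-contained and works on an in-memory string instead of a file. The names bigram_table and format_table, the alphabet parameter, and the column width are assumptions made for illustration and are not part of the code above. Pairs that contain a character outside the alphabet are skipped.

```python
def bigram_table(text, alphabet):
    """Return rows of P(next | given) for the characters in `alphabet`."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0] * n for _ in range(n)]
    # zip(text, text[1:]) yields each adjacent pair of characters
    for a, b in zip(text, text[1:]):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
    table = []
    for row in counts:
        total = sum(row)
        # rows with no observations stay all zeros instead of dividing by zero
        table.append([c / total if total else 0.0 for c in row])
    return table


def format_table(table, alphabet):
    """Render the table as text: one row per given character, one column per next character."""
    header = "    " + " ".join(f"{ch:>5}" for ch in alphabet)
    lines = [header]
    for ch, row in zip(alphabet, table):
        lines.append(f"{ch:>3} " + " ".join(f"{p:5.2f}" for p in row))
    return "\n".join(lines)


table = bigram_table("abab", "ab")
print(format_table(table, "ab"))
```

Each row is labelled with the given character and each column with the character that follows it, so every row with at least one observation sums to 1.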

Here's a dictionary version, which should be easier to read and debug:

counts = {}
with open('path/to/input') as infile:
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        # key by the characters themselves, so lookups read as counts['b']['a']
        if a not in counts:
            counts[a] = {}
        if b not in counts[a]:
            counts[a][b] = 0
        counts[a][b] += 1

answer = {}
for given, chardict in counts.items():  # iterate over the counts, not the (still empty) answer
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count / total

Now, answer contains the probabilities you are after. If you want the probability of 'a', given 'b', look at answer['b']['a'].
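
For comparison, the same dictionary idea can be sketched with collections.Counter and collections.defaultdict from the standard library, which removes the explicit "not in" checks. The function name bigram_probabilities, and the choice to lowercase the text and drop non-letters, are assumptions made for this example.

```python
from collections import Counter, defaultdict


def bigram_probabilities(text):
    """Map each character to a dict of P(next character | that character)."""
    # keep letters only and fold case; an assumption made for this sketch
    chars = [c for c in text.lower() if c.isalpha()]
    counts = defaultdict(Counter)
    for given, nxt in zip(chars, chars[1:]):
        counts[given][nxt] += 1
    answer = {}
    for given, following in counts.items():
        total = sum(following.values())
        answer[given] = {ch: n / total for ch, n in following.items()}
    return answer


answer = bigram_probabilities("abac")
print(answer["a"]["b"])  # probability of 'b', given 'a'
```

A pair that never occurred is simply absent from the result, so answer['b'].get('c', 0.0) is a safe way to look one up.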
