I\'m doing this homework, and I am stuck at this point. I can\'t program Bigram frequency in the English language, \'conditional probability\' in python?
Assuming your file has no other punctuation (easy enough to strip out):
import itertools
def pairwise(s):
a,b = itertools.tee(s)
next(b)
return zip(a,b)
counts = [[0 for _ in range(52)] for _ in range(52)] # nothing has occurred yet
with open('path/to/input') as infile:
for a,b in pairwise(char for line in infile for word in line.split() for char in word): # get pairwise characters from the text
given = ord(a) - ord('a') # index (in `counts`) of the "given" character
char = ord(b) - ord('a') # index of the character that follows the "given" character
counts[given][char] += 1
# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
totals = [sum(count[i] for i in range(52)) for count in counts]
for given in range(52):
if not totals[given]:
continue
for i in range(len(counts[given])):
counts[given][i] /= totals[given]
I haven't tested this, but it should be a good start
Here's a dictionary version, which should be easier to read and debug:
counts = {}
with open('path/to/input') as infile:
for a,b in pairwise(char for line in infile for word in line.split() for char in word):
given = ord(a) - ord('a')
char = ord(b) - ord('a')
if given not in counts:
counts[given] = {}
if char not in counts[given]:
counts[given][char] = 0
counts[given][char] += 1
answer = {}
for given, chardict in answer.items():
total = sum(chardict.values())
for char, count in chardict.items():
answer[given][char] = count/total
Now, answer
contains the probabilities you are after. If you want the probability of 'a', given 'b', look at answer['b']['a']