How do I program bigrams as a table in Python?

猫巷女王i 2021-01-27 08:37

I'm doing this homework and I'm stuck at this point. How can I compute bigram frequencies for English text, i.e. the "conditional probability" of each character given the one before it, in Python?

1 Answer
  • 2021-01-27 09:23

    Assuming your file has no other punctuation (easy enough to strip out):

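    import itertools
    import string
    
    LETTERS = string.ascii_letters                 # 'a'..'z' followed by 'A'..'Z': 52 symbols
    INDEX = {c: i for i, c in enumerate(LETTERS)}  # letter -> row/column in `counts`
    
    def pairwise(s):
        """Yield consecutive pairs (s0, s1), (s1, s2), ..."""
        a, b = itertools.tee(s)
        next(b, None)  # advance the second iterator; the default avoids StopIteration on empty input
        return zip(a, b)
    
    counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
    with open('path/to/input') as infile:
        # get pairwise characters from the text; whitespace is dropped, so pairs can span word boundaries
        chars = (char for line in infile for word in line.split() for char in word)
        for a, b in pairwise(chars):
            given = INDEX[a]  # index (in `counts`) of the "given" character
            char = INDEX[b]   # index of the character that follows the "given" character
            counts[given][char] += 1
    
    # now that we have the number of occurrences, let's divide by the totals to get conditional probabilities
    
    totals = [sum(row) for row in counts]
    for given in range(52):
        if not totals[given]:
            continue  # this character never appeared as a "given"; leave its row as zeros
        for i in range(len(counts[given])):
            counts[given][i] /= totals[given]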
    

    I haven't tested this, but it should be a good start
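
    Since the question asks for a table, here is a minimal sketch of the same idea run on an in-memory string and printed as a grid. The names `bigram_table` and `format_table` are made up for this illustration, and restricting the alphabet to lowercase letters (after lower-casing the text and dropping everything else) is an assumption, not something the homework necessarily requires.

```python
import string

def bigram_table(text, alphabet=string.ascii_lowercase):
    """Return a table where table[i][j] estimates P(alphabet[j] | alphabet[i])."""
    index = {c: i for i, c in enumerate(alphabet)}
    n = len(alphabet)
    table = [[0.0] * n for _ in range(n)]
    # lower-case the text and keep only characters that are in the alphabet
    chars = [c for c in text.lower() if c in index]
    # count each (given, following) pair of adjacent characters
    for a, b in zip(chars, chars[1:]):
        table[index[a]][index[b]] += 1
    # turn each non-empty row of counts into probabilities
    for row in table:
        total = sum(row)
        if total:
            row[:] = [x / total for x in row]
    return table

def format_table(table, alphabet, symbols):
    """Render the rows and columns for `symbols` as a plain-text grid."""
    index = {c: i for i, c in enumerate(alphabet)}
    header = "    " + "  ".join(f"{s:>5}" for s in symbols)
    lines = [header]
    for g in symbols:
        cells = "  ".join(f"{table[index[g]][index[s]]:5.2f}" for s in symbols)
        lines.append(f"{g:>2}  {cells}")
    return "\n".join(lines)

table = bigram_table("abab ac")
print(format_table(table, string.ascii_lowercase, "abc"))
```

    Each row is the "given" character and each column is the character that follows it, so every row that has any data sums to 1.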

    Here's a dictionary version, which should be easier to read and debug:

    counts = {}
    with open('path/to/input') as infile:
        for a, b in pairwise(char for line in infile for word in line.split() for char in word):
            # key on the characters themselves so the result can be read as answer['b']['a']
            if a not in counts:
                counts[a] = {}
            if b not in counts[a]:
                counts[a][b] = 0
            counts[a][b] += 1
    
    answer = {}
    for given, chardict in counts.items():
        total = sum(chardict.values())  # how often `given` was followed by anything
        answer[given] = {}
        for char, count in chardict.items():
            answer[given][char] = count / total
    

    Now, answer contains the probabilities you are after. If you want the probability of 'a' given 'b' (that is, of 'a' immediately following 'b'), look at answer['b']['a'].
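
    To check that kind of lookup on a small input, the dictionary approach can also be sketched with `collections.Counter` and `collections.defaultdict`, which removes the "is this key present yet?" bookkeeping. The function name `bigram_probabilities` and the sample string are made up for illustration; as above, whitespace is dropped, so pairs can span word boundaries.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Map each character to a dict of P(next character | character)."""
    # drop whitespace but keep every other character as-is
    chars = [c for word in text.split() for c in word]
    counts = defaultdict(Counter)
    for given, char in zip(chars, chars[1:]):
        counts[given][char] += 1
    # normalise each character's follower counts by their total
    return {
        given: {char: n / sum(following.values()) for char, n in following.items()}
        for given, following in counts.items()
    }

probs = bigram_probabilities("abab ac")
print(probs['b']['a'])  # probability of 'a' given 'b'
print(probs['a'])       # all followers of 'a' with their probabilities
```

    In this sample 'b' is always followed by 'a', while 'a' is followed by 'b' twice and by 'c' once, so the probabilities under 'a' split two to one.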
