Building a Transition Matrix using words in Python/Numpy

抹茶落季
抹茶落季 2020-12-09 22:32

I'm trying to build a 3x3 transition matrix with this data:

days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds',
  'rai


        
6 Answers
  • 2020-12-09 23:11

    I like a combination of pandas and itertools for this. The code block is a bit longer than the above, but don't conflate verbosity with speed. (The window func should be very fast; the pandas portion will be slower admittedly.)

    First, make a "window" function. Here's one from the itertools cookbook. This gets you to a list of tuples of transitions (state1 to state2).

    from itertools import islice
    
    def window(seq, n=2):
        """Sliding window of width n from seq. From the old itertools recipes."""
        it = iter(seq)
        result = tuple(islice(it, n))
        if len(result) == n:
            yield result
        for elem in it:
            result = result[1:] + (elem,)
            yield result
    
    # list(window(days))
    # [('rain', 'rain'),
    #  ('rain', 'rain'),
    #  ('rain', 'clouds'),
    #  ('clouds', 'rain'),
    #  ('rain', 'sun'),
    # ...
    

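    Then use a pandas groupby + value_counts operation to tabulate the transitions from each state1 to each state2. Dividing by counts.sum() normalizes over all pairs, so the entries are joint frequencies that sum to 1 across the whole table: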

    import pandas as pd
    
    pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
    counts = pairs.groupby('state1')['state2'].value_counts()
    probs = (counts / counts.sum()).unstack()
    

    Your result looks like this:

    print(probs)
    state2  clouds  rain   sun
    state1                    
    clouds    0.13  0.09  0.10
    rain      0.06  0.11  0.09
    sun       0.13  0.06  0.23
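
    If you want conditional probabilities instead (each row summing to 1), one option is to normalize within each state1 group. This is only a sketch: the short `days` list below is made up for illustration and is not the data from the question.

```python
import pandas as pd

# Short illustrative sequence (hypothetical; substitute the real list of days).
days = ['rain', 'rain', 'clouds', 'sun', 'sun', 'rain', 'clouds', 'clouds', 'sun']

# Consecutive (today, tomorrow) pairs.
pairs = pd.DataFrame({'state1': days[:-1], 'state2': days[1:]})

# normalize=True divides within each state1 group, so every row sums to 1.
# fill_value=0 covers transitions that never occur in the data.
cond = (pairs.groupby('state1')['state2']
             .value_counts(normalize=True)
             .unstack(fill_value=0))
print(cond)
```

    Each row of `cond` is then the distribution of tomorrow's state given today's state.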
    
  • 2020-12-09 23:12

    Here is a "pure" numpy solution. It creates 3x3 tables where the zeroth dimension (row number) corresponds to today and the last dimension (column number) corresponds to tomorrow.

    The conversion from words to indices is done by truncating after the first letter and then using a lookup table.

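    For counting, numpy.add.at is used, because it accumulates correctly when the same (today, tomorrow) index pair occurs more than once.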

    This was written with efficiency in mind. It does a million words in less than a second.

    import numpy as np
    
    report = [
      'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
      'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
      'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
      'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
      'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
      'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
      'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
      'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
      'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
      'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
      'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
      'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
      'sun', 'sun', 'rain']
    
    # create np array, keep only first letter (by forcing dtype)
    # obviously, this only works because rain, sun, clouds start with different
    # letters
    # cast to int type so we can use for indexing
    ri = np.array(report, dtype='|S1').view(np.uint8)
    # create lookup
    c, r, s = 99, 114, 115 # you can verify this using chr and ord
    lookup = np.empty((s+1,), dtype=int)
    lookup[[c, r, s]] = np.arange(3)
    # translate c, r, s to 0, 1, 2
    rc = lookup[ri]
    # get counts (of pairs (today, tomorrow))
    cnts = np.zeros((3, 3), dtype=int)
    np.add.at(cnts, (rc[:-1], rc[1:]), 1)
    # or as probs
    probs = cnts / cnts.sum()
    # or as conditional probs (if today is sun, how probable is rain tomorrow, etc.)
    cond = cnts / cnts.sum(axis=-1, keepdims=True)
    
    print(cnts)
    print(probs)
    print(cond)
    
    # [[13  9 10]
    #  [ 6 11  9]
    #  [13  6 23]]
    # [[ 0.13  0.09  0.1 ]
    #  [ 0.06  0.11  0.09]
    #  [ 0.13  0.06  0.23]]
    # [[ 0.40625     0.28125     0.3125    ]
    #  [ 0.23076923  0.42307692  0.34615385]
    #  [ 0.30952381  0.14285714  0.54761905]]
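
    The first-letter trick only works while every state starts with a different letter. A possible variant that does not depend on that, sketched here under the assumption that np.unique with return_inverse=True is acceptable for the encoding step (the short `report` list is made up for illustration):

```python
import numpy as np

# Short illustrative sequence (hypothetical; substitute the full report list).
report = ['rain', 'rain', 'clouds', 'sun', 'sun', 'rain', 'clouds', 'clouds', 'sun']

# states: sorted unique labels; codes: index of each entry of report into states.
states, codes = np.unique(report, return_inverse=True)

n = len(states)
cnts = np.zeros((n, n), dtype=int)
# np.add.at accumulates every occurrence of a repeated index pair, whereas
# cnts[codes[:-1], codes[1:]] += 1 would add at most 1 per distinct pair.
np.add.at(cnts, (codes[:-1], codes[1:]), 1)

# Conditional probabilities: each row (today) sums to 1.
cond = cnts / cnts.sum(axis=1, keepdims=True)

print(states)
print(cnts)
print(cond)
```

    Rows and columns follow the order of `states`. If some state never has a successor in the data, its row sum is zero and the division produces nan for that row, so that case would need separate handling.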
    
  • 2020-12-09 23:18

    If you don't mind using pandas, there's a one-liner for extracting the transition probabilities:

    import pandas as pd

    pd.crosstab(pd.Series(days[1:],name='Tomorrow'),
                pd.Series(days[:-1],name='Today'),normalize=1)
    

    Output:

    Today      clouds      rain       sun
    Tomorrow                             
    clouds    0.40625  0.230769  0.309524
    rain      0.28125  0.423077  0.142857
    sun       0.31250  0.346154  0.547619
    

    Here the (forward) probability that tomorrow will be sunny given that today it rained is found at the column 'rain', row 'sun'. If you would like to have backward probabilities (what might have been the weather yesterday given the weather today), switch the first two parameters.

    If you would like the probabilities stored in rows rather than columns, set normalize=0. Note, however, that doing this directly on the example above yields backward probabilities stored as rows. To get the same result as above but transposed, you could either a) transpose the result, or b) switch the order of the first two parameters and set normalize=0.

    If you just want to keep the results as numpy 2-d array (and not as a pandas dataframe), type .values after the last parenthesis.
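
    As a rough sketch of the row-oriented variant (option b), assuming a short made-up `days` list rather than the data from the question:

```python
import pandas as pd

# Short illustrative sequence (hypothetical; substitute the real list of days).
days = ['rain', 'rain', 'clouds', 'sun', 'sun', 'rain', 'clouds', 'clouds', 'sun']

today = pd.Series(days[:-1], name='Today')
tomorrow = pd.Series(days[1:], name='Tomorrow')

# Rows are today's state; normalize='index' (same as normalize=0) makes each row sum to 1.
P = pd.crosstab(today, tomorrow, normalize='index')

# Forward probability of clouds tomorrow given rain today: row 'rain', column 'clouds'.
p_rain_to_clouds = P.loc['rain', 'clouds']

# Plain 2-d numpy array; to_numpy() is an alternative to .values.
P_array = P.to_numpy()

print(P)
print(p_rain_to_clouds)
```

    With this orientation the lookup reads as P.loc[today, tomorrow], which matches the usual row-stochastic convention for transition matrices.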

  • 2020-12-09 23:19

    It seems you want to create a matrix of the probability of rain coming after sun, clouds coming after sun, and so on. You can produce the probability matrix (not a math term) like so:

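    from collections import Counter

    def probabilityMatrix():
        # `data` is the list of daily weather reports from the question
        # count only days that have a successor, so each state's probabilities sum to 1
        occurrencesOfEach = Counter(data[:-1])
        # count each (today, tomorrow) pair
        myMatrix = Counter(zip(data, data[1:]))
        probabilityMatrix = {key : myMatrix[key] / occurrencesOfEach[key[0]] for key in myMatrix}
        return probabilityMatrix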
    
    print(probabilityMatrix())
    

    However, you probably want to spit out the probability for every type of weather based on today's weather:

    def getTomorrowsProbability(weather):
        probMatrix = probabilityMatrix()
        return {key[1] : probMatrix[key]  for key in probMatrix if key[0] == weather}
    
    print(getTomorrowsProbability('sun'))
    
  • 2020-12-09 23:23

    Below is another alternative using pandas. The transitions list can be replaced with 'rain', 'clouds', etc.

    import pandas as pd
    transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2
    # pair each state with its successor; the last state has no successor
    df = pd.DataFrame({'state': transitions[:-1], 'next_state': transitions[1:]})
    cross_tab = pd.crosstab(df['state'], df['next_state'])
    cross_tab.div(cross_tab.sum(axis=1), axis=0)
    
  • 2020-12-09 23:30
    1. Convert the reports from the days into index codes.
    2. Iterate through the array, grabbing the codes for yesterday's weather and today's.
    3. Use those indices to tally the combination in your 3x3 matrix.

    Here's the coding set-up to get you started.

    report = [
      'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
      'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
      'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
      'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
      'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
      'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
      'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
      'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
      'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
      'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
      'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
      'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
      'sun', 'sun', 'rain']
    
    weather_dict = {"sun":0, "clouds":1, "rain": 2}
    weather_code = [weather_dict[day] for day in report]
    print(weather_code)
    
    for n in range(1, len(weather_code)):
        yesterday_code = weather_code[n-1]
        today_code     = weather_code[n]
    
    # You now have the indices you need for your 3x3 matrix.
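
    One way the three steps could be finished, sketched in plain Python with a list of lists standing in for the 3x3 matrix (a numpy array would work the same way). The short `report` list is made up for illustration; the names `counts` and `matrix` are not from the original set-up.

```python
# Short illustrative sequence (hypothetical; substitute the full report list).
report = ['rain', 'rain', 'clouds', 'sun', 'sun', 'rain', 'clouds', 'clouds', 'sun']

# Step 1: convert the reports into index codes.
weather_dict = {"sun": 0, "clouds": 1, "rain": 2}
weather_code = [weather_dict[day] for day in report]

# Steps 2 and 3: walk over consecutive days and tally each combination.
# counts[i][j] = number of times state i was followed by state j.
counts = [[0] * 3 for _ in range(3)]
for yesterday_code, today_code in zip(weather_code, weather_code[1:]):
    counts[yesterday_code][today_code] += 1

# Row-normalize to get transition probabilities; a row with no
# outgoing transitions is left as zeros instead of dividing by zero.
matrix = []
for row in counts:
    total = sum(row)
    matrix.append([c / total if total else 0.0 for c in row])

print(counts)
print(matrix)
```

    Here matrix[i][j] is the estimated probability of moving from state i to state j, using the index order given in weather_dict.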
    