Conditionally replace values in one list using another list of different length and ranges, based on percentage overlap, in Python

予麋鹿 2021-01-14 17:47

One text file, 'Truth', contains the following values:

0.000000    3.810000    Three
3.810000    3.910923    NNNN
3.910923    5.429000    AAAA
5.429000            


        
3 answers
  • 2021-01-14 18:23

    This is "just" number crunching - here is one way:

    raw_test = [[0.000000   , 3.810000  ,  'Three'],
            [3.810000   , 3.910923  ,  'Three'],
            [3.910923   , 5.429000  ,  'AAAA '],
            [5.429000   , 7.060000  ,  'Three'],
            [7.060000   , 8.411000  ,  'Three'],
            [8.411000   , 8.971000  ,  'Zero'],
            [8.971000   , 13.40600  ,  'Three'],
            [13.40600   , 13.82700  ,  'Zero'], 
            [13.82700   , 15.935554 ,  'Two'], 
            [15.935554  , 20.138337 ,  'Two'],]
    
    raw_truth = [[0.000000 ,   1.00000   ,  'MMMM'],
       [1.000    ,   3.810000  ,  'Three'],
       [3.810000 ,   3.910923  ,  'NNNN'],
       [3.910923 ,   5.429000  ,  'AAAA'],
       [5.429000 ,   6.0000    ,  'MMMM'],
       [6.0000   ,   7.060000  ,  'AAAA'],
       [7.060000 ,   8.411000  ,  'MMMM'],
       [8.411000 ,   8.971000  ,  'MMMM'],
       [8.971000 ,   11.00     ,  'abcd'],
       [11.00    ,   13.40600  ,  'MMMM'],
       [13.40600 ,   13.82700  ,  'Zero'],
       [13.82700 ,   15.935554 ,  'One'],]
    
    truth = {}
    for mi,ma,key in raw_truth:
      truth.setdefault((mi,ma), key)
    
    test = [ (mi,ma,ma - mi,lab) for mi,ma,lab in raw_test ]
    
    overlap = []
    overlap.append(["test-min","test-max","test-size","test-lab",
                    "#","truth-min","truth-max","truth-lab",
                    "#","min-over","max-over","over-size","%"])
    
    for mi,ma,siz,lab in test:
      for key in truth:
        truMi,truMa = key
        truVal = truth[key]
    
        if mi < truMa and ma > truMi: # coarse filter: ranges overlap (also true when test contains truth)
          minOv = max(truMi,mi)
          maxOv = min(truMa,ma)
          sizOv = maxOv-minOv
          perc = sizOv/(siz/100.0)
          if perc > 0: # fine filter
            overlap.append([mi,ma,siz,lab,
                            '#',truMi,truMa,truVal,
                            '#',minOv,maxOv, sizOv, perc ])
    
    # just some printing:    
    print(truth)
    print()    
    
    print(test)
    print()    
    
    for d in overlap:
      for x in d:
        if type(x) is str:
          if x == '#':
            print( '  |  ', end ="")    
          else:
            print( '{:<10}'.format(x), end ="")  
        else:
          print( '{:<10.5f}'.format(x), end ="")
      print(" %")
    
    # the print statements are python3 - at the time this answer was written, the question
    # had no python 2 tag. Replace the python 3 print statements with
    #    print '  |  ',
    #    print '{:<10}'.format(x),  
    #    print '{:<10.5f}'.format(x),    
    # etc. or adapt them accordingly - see https://stackoverflow.com/a/2456292/7505395
    

    Output:

    test-min  test-max  test-size test-lab    |  truth-min truth-max truth-lab   |  min-over  max-over  over-size %          %
    0.00000   3.81000   3.81000   Three       |  0.00000   1.00000   MMMM        |  0.00000   1.00000   1.00000   26.24672   %
    0.00000   3.81000   3.81000   Three       |  1.00000   3.81000   Three       |  1.00000   3.81000   2.81000   73.75328   %
    3.81000   3.91092   0.10092   Three       |  3.81000   3.91092   NNNN        |  3.81000   3.91092   0.10092   100.00000  %
    3.91092   5.42900   1.51808   AAAA        |  3.91092   5.42900   AAAA        |  3.91092   5.42900   1.51808   100.00000  %
    5.42900   7.06000   1.63100   Three       |  5.42900   6.00000   MMMM        |  5.42900   6.00000   0.57100   35.00920   %
    5.42900   7.06000   1.63100   Three       |  6.00000   7.06000   AAAA        |  6.00000   7.06000   1.06000   64.99080   %
    7.06000   8.41100   1.35100   Three       |  7.06000   8.41100   MMMM        |  7.06000   8.41100   1.35100   100.00000  %
    8.41100   8.97100   0.56000   Zero        |  8.41100   8.97100   MMMM        |  8.41100   8.97100   0.56000   100.00000  %
    8.97100   13.40600  4.43500   Three       |  8.97100   11.00000  abcd        |  8.97100   11.00000  2.02900   45.74972   %
    8.97100   13.40600  4.43500   Three       |  11.00000  13.40600  MMMM        |  11.00000  13.40600  2.40600   54.25028   %
    13.40600  13.82700  0.42100   Zero        |  13.40600  13.82700  Zero        |  13.40600  13.82700  0.42100   100.00000  %
    13.82700  15.93555  2.10855   Two         |  13.82700  15.93555  One         |  13.82700  15.93555  2.10855   100.00000  %
    

    Disclaimer: I haven't checked every number by hand, I only glanced at the output, so verify it yourself. You still need to apply the truth-lab wherever the percentage meets your threshold.
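The last step mentioned above, applying the truth label where the percentage fits, could look roughly like the sketch below. It assumes the 13-column row layout of the `overlap` list built in this answer (header row first); the function name `apply_labels` and the 80% threshold are illustrative assumptions, not part of the original answer.

```python
def apply_labels(overlap_rows, min_percent=80.0):
    """Pick, per test interval, the truth label with the largest overlap percentage.

    overlap_rows uses the layout from the answer (header row first):
    [test-min, test-max, test-size, test-lab, '#',
     truth-min, truth-max, truth-lab, '#', min-over, max-over, over-size, %]
    min_percent is a hypothetical threshold; below it the test label is kept.
    """
    best = {}   # (test-min, test-max) -> (best percent so far, label to use)
    order = []  # remembers the order in which test intervals first appear
    for row in overlap_rows[1:]:  # skip the header row
        key = (row[0], row[1])
        test_lab, truth_lab, perc = row[3], row[7], row[12]
        if key not in best:
            best[key] = (0.0, test_lab)  # default: keep the original label
            order.append(key)
        if perc >= min_percent and perc > best[key][0]:
            best[key] = (perc, truth_lab)
    return [(mi, ma, best[(mi, ma)][1]) for mi, ma in order]


# A few rows copied from the printed output above.
header = ["test-min", "test-max", "test-size", "test-lab",
          "#", "truth-min", "truth-max", "truth-lab",
          "#", "min-over", "max-over", "over-size", "%"]
rows = [
    header,
    [0.0, 3.81, 3.81, 'Three', '#', 0.0, 1.0, 'MMMM', '#', 0.0, 1.0, 1.0, 26.24672],
    [0.0, 3.81, 3.81, 'Three', '#', 1.0, 3.81, 'Three', '#', 1.0, 3.81, 2.81, 73.75328],
    [3.81, 3.910923, 0.100923, 'Three', '#', 3.81, 3.910923, 'NNNN',
     '#', 3.81, 3.910923, 0.100923, 100.0],
]

relabelled = apply_labels(rows)
for item in relabelled:
    print(item)
```

With these rows, the first interval keeps 'Three' (no single truth range reaches 80%) and the second becomes 'NNNN'. Test intervals that produced no overlap rows at all do not appear in the result, so they would have to be merged back in separately.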

  • 2021-01-14 18:27

    This assumes that the ranges never overlap, that they are ordered, and that each smaller range in test always fits fully inside one of the larger ranges of truth.

    You can perform a merge similar to the merge in merge sort. Here's a code snippet that should do what you like:

    def in_range(truth_item, test_item):
        return truth_item[0] <= test_item[0] and truth_item[1] >= test_item[1]
    
    
    def update_test_items(truth_items, test_items):
        current_truth_index = 0
        for test_item in test_items:
            while not in_range(truth_items[current_truth_index], test_item):
                current_truth_index += 1
                if current_truth_index >= len(truth_items):
                    return
    
            test_item[2] = truth_items[current_truth_index][2]
    
    
    update_test_items(truth, test)
    

    Calling update_test_items will modify test by adding in the appropriate values from truth.

    You can also set a condition for the update, say 80% coverage, and leave the value unchanged if it isn't met.

    def has_enough_coverage(truth_item, test_item):
        truth_item_size = truth_item[1] - truth_item[0]
        test_item_size = test_item[1] - test_item[0]
        return test_item_size / truth_item_size >= .8
    
    
    def in_range(truth_item, test_item):
        return truth_item[0] <= test_item[0] and truth_item[1] >= test_item[1]
    
    
    def update_test_items(truth_items, test_items):
        current_truth_index = 0
        for test_item in test_items:
            while not in_range(truth_items[current_truth_index], test_item):
                current_truth_index += 1
                if current_truth_index >= len(truth_items):
                    return
    
            if has_enough_coverage(truth_items[current_truth_index], test_item):
                test_item[2] = truth_items[current_truth_index][2]
    
    
    update_test_items(truth, test)
    

    This will only update the test item if it covers 80%+ of the truth range.

    Note that these only work if the initial assumptions hold; otherwise you'll run into issues. The approach is also efficient: each list is walked once, so it runs in O(N) time.
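Since the merge depends on those assumptions, it may be worth checking them before calling `update_test_items`. The helpers below are a minimal sketch; the names `check_intervals` and `all_contained` are hypothetical, and the sample `test` values are made up for illustration (the `truth` rows are the ones shown in the question).

```python
def check_intervals(items):
    """Raise ValueError unless items are well-formed, ordered and non-overlapping.

    items is a list of [start, end, label] entries.
    """
    previous_end = None
    for index, (start, end, _label) in enumerate(items):
        if end < start:
            raise ValueError("item %d ends before it starts" % index)
        if previous_end is not None and start < previous_end:
            raise ValueError("item %d overlaps or precedes item %d"
                             % (index, index - 1))
        previous_end = end


def all_contained(truth_items, test_items):
    """True if every test range fits fully inside some truth range.

    This is a plain quadratic check, meant for validation only.
    """
    return all(
        any(tr[0] <= te[0] and tr[1] >= te[1] for tr in truth_items)
        for te in test_items
    )


truth = [[0.0, 3.81, 'Three'], [3.81, 3.910923, 'NNNN'], [3.910923, 5.429, 'AAAA']]
test = [[0.0, 1.0, 'x'], [1.0, 3.81, 'y'], [3.81, 3.910923, 'z']]  # made-up labels

check_intervals(truth)  # raises ValueError if the list is unsorted or overlapping
check_intervals(test)
print(all_contained(truth, test))
```

If either check fails, the merge above can stop early and leave the remaining test items unchanged, so validating first makes that failure visible.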

  • 2021-01-14 18:41

    I am not sure I fully understand your question, but if you mean what I think you mean, you need to worry about going out of bounds and about the fact that truth and test won't line up at the same index j, as you mentioned.

    A way around that is to use two different indices, truth[j] and test[k] (or whatever you want to call them). You could of course use two loops that each iterate over the whole of test, but that wouldn't be efficient.

    I would suggest using the second index as a counter that keeps going up by 1: think of it as a while loop that runs while the value test[k] is within the range of truth[j], doing what you are currently doing.

    Whenever test[k] goes past the range of the current truth[j], you continue to the next j (the next interval in truth).

    Hope that helps and makes sense.


    l_truth = len(truth)
    l_test = len(test)
    
    count = 0
    
    res = []
    
    for j in range(l_truth):
        count2= count
        for k in range(count2,l_test):
            if truth[j][2]== 'MMMM': 
                min_truth = truth[j][0]
                max_truth = truth[j][1]
                min_test = test[k][0]
                max_test = test[k][1]
    
                #diff_truth = max_truth - min_truth
                diff_test = max_test - min_test
    
                if (min_truth <= min_test) and (max_truth >= max_test):
                    res.append((test[k][0], test[k][1],truth[j][2]))
                    count +=1
                elif (min_truth <= min_test) and (max_truth <= max_test):
                    #diff_min = min_truth - min_test
                    diff_max = max_test - max_truth
                    ratio = diff_max/diff_test
                    if ratio <= 0.2:
                        res.append((test[k][0], test[k][1],truth[j][2]))
                        count +=1
                elif (min_truth >= min_test) and (max_truth >= max_test):
                    diff_min = min_truth - min_test
                    #diff_max = max_test - max_truth
                    ratio = diff_min/diff_test
                    if ratio <= 0.2:
                        res.append((test[k][0], test[k][1],truth[j][2]))
                        count+=1
                elif (min_truth >= min_test) and (max_truth <= max_test):
                    diff_min = min_truth - min_test
                    diff_max = max_test - max_truth
                    ratio = (diff_min+diff_max)/diff_test
                    if ratio <= 0.2:
                        res.append((test[k][0], test[k][1],truth[j][2]))
                        count+=1
                else:
                    pass
            else:
                continue
    
    for item in res:
        print(item)  # the parenthesised form works in both Python 2 and Python 3
    

    Check if this works. I actually had to use two loops, but I am sure there are other more efficient ways of doing this.
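One possible single-pass variant of the two-index idea described in this answer is sketched below. It assumes both lists are sorted and non-overlapping, and it relabels with whatever truth label covers enough of the test interval rather than only 'MMMM'. The name `sweep_relabel` and the 0.8 threshold are assumptions for illustration.

```python
def sweep_relabel(truth, test, min_fraction=0.8):
    """Walk two sorted, non-overlapping interval lists with two indices.

    Returns a new list of (start, end, label) for test. A label is replaced
    when a single truth interval covers at least min_fraction of the test
    interval; otherwise the original label is kept.
    """
    result = []
    j = 0  # index into truth; it only ever moves forward
    for start, end, label in test:
        size = end - start
        # Skip truth intervals that end at or before the start of this test interval.
        while j < len(truth) and truth[j][1] <= start:
            j += 1
        new_label = label
        k = j  # look ahead from j; the next test interval may still need truth[j]
        while k < len(truth) and truth[k][0] < end:
            overlap = min(end, truth[k][1]) - max(start, truth[k][0])
            if size > 0 and overlap / size >= min_fraction:
                new_label = truth[k][2]
                break
            k += 1
        result.append((start, end, new_label))
    return result


# Values taken from the data used elsewhere on this page.
truth = [(5.429, 6.0, 'MMMM'), (6.0, 7.06, 'AAAA'),
         (7.06, 8.411, 'MMMM'), (8.411, 8.971, 'MMMM')]
test = [(5.429, 7.06, 'Three'), (7.06, 8.411, 'Three'), (8.411, 8.971, 'Zero')]

swept = sweep_relabel(truth, test)
for item in swept:
    print(item)
```

Here the first test interval keeps 'Three' because neither truth range covers 80% of it, while the other two become 'MMMM'. Because the truth index never moves backwards, the work stays roughly proportional to the combined length of the two lists when the inputs are sorted and non-overlapping.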
