Creating confusion matrix from multiple .csv files

I have a lot of .csv files with the following format.

For column 1, I want to read each row and compare it with the value in the previous row. If the current value is greater than or equal to the previous one, I continue comparing; if it is smaller, I divide the current value by the previous value and proceed. For example, in the table given above, the first value in column 1 that meets this requirement is 327 (because 327 is smaller than the previous value, 340); dividing 327 by 340 gives 0.96. My Python script should exit right after printing the criterion (A) as given below.
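The rule described above can be tried on an in-memory list before reading any file. This is only a minimal sketch of that rule, not the script from the question: `drop_ratios` is a hypothetical helper name, and the sample values other than 340 and 327 are made up for illustration.

```python
def drop_ratios(values):
    """Return current / previous for every value smaller than its predecessor."""
    ratios = []
    previous = None
    for current in values:
        # Only a drop (current < previous) produces a ratio.
        if previous is not None and current < previous:
            ratios.append(current / previous)
        previous = current
    return ratios


# 300 and 350 are made-up neighbours; 340 -> 327 is the drop from the question.
ratios = drop_ratios([300, 340, 327, 350])
print([round(r, 2) for r in ratios])  # prints [0.96]
```

Keeping the rule in a function that takes a plain list makes it easy to check separately from the CSV parsing.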

from __future__ import division
import csv

def category(val):
    if 0.8 < val <= 0.9:
        return "A"
    if abs(val - 0.7) < 1e-10:
        return "B"
    if 0.5 < val < 0.7:
        return "C"
    if abs(val - 0.5) < 1e-10:
        return "E"
    return "D"

with open("test.csv", "r") as csvfile:
    ff = csv.reader(csvfile)

    results = []
    previous_value = 0
    for col1, col2 in ff:
        # Skip the header row and any non-numeric cells.
        if not col1.isdigit():
            continue
        value = int(col1)
        if value >= previous_value:
            previous_value = value
            continue
        else:
            # The value dropped: record current / previous.
            result = value / previous_value
            results.append(result)
            print(category(result))
            previous_value = value
    print(results)
    print(sum(results))
    print(category(sum(results) / len(results)))

Finally, I want to run my script on all the .csv files in the current directory and build a confusion matrix like the following. Let's say A1.csv, A2.csv and A3.csv are supposed (or predicted) to print A; B1.csv, B2.csv and B3.csv are supposed (or predicted) to print B; C1.csv, C2.csv and C3.csv are supposed (or predicted) to print C; and so on. How can we automatically create a confusion matrix like the following from multiple .csv files using Python?

As shown below, the colored blocks of the matrix (the row labels) show the counts of A (number of true values for A), B (number of true values for B), C (number of true values for C), and so on, as produced by the control logic of the category() function given above. The column labels come from the same if-else logic (A, B, C, D and E).

Add a function get_predict(filename) that derives the expected label from the file name:

def get_predict(filename):
    if 'Alex' in filename:
        return 'Alexander'
    else:
        return filename[0]

Read n files and compute the confusion matrix using pandas crosstab:

import csv
import os
import pandas as pd

def get_category(filepath):
    def category(val):
        print('predict({}; abs({})'.format(val, abs(val)))
        if 0.8 < val <= 0.9:
            return "A"
        if abs(val - 0.7) < 1e-10:
            return "B"
        if 0.5 < val < 0.7:
            return "C"
        if abs(val - 0.5) < 1e-10:
            return "E"
        return "D"

    with open(filepath, "r") as csvfile:
        ff = csv.reader(csvfile)

        results = []
        previous_value = 0
        for col1, col2 in ff:
            value = int(col1)
            if value >= previous_value:
                previous_value = value
            else:
                results.append(value / previous_value)
                previous_value = value

    return category(sum(results) / len(results))

matrix = {'actual':[], 'predict':[]}
path = 'test/confusion'
for filename in os.listdir( path ):
    # The first Char in filename is Predict Key
    matrix['predict'].append(filename[0])
    matrix['actual'].append(get_category(os.path.join(path, filename)))

df = pd.crosstab(pd.Series(matrix['actual'], name='Actual'),
                 pd.Series(matrix['predict'], name='Predicted')
                 )
print(df)

Output: (Reading "A.csv, B.csv, C.csv" with the given example Data three times)
Predicted  A  B  C
Actual            
A          3  0  0
B          0  3  0
C          0  0  3

Tested with Python:3.4.2 - pandas:0.19.2
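Note that pd.crosstab only lists labels that actually occur, so a label such as D or E that never shows up gets no row or column. If the matrix should always show all five labels, one option is to tally the pairs by hand. The following is a minimal sketch in plain Python, assuming the (actual, predicted) pairs have already been collected per file as in the loop above; `build_matrix`, `LABELS` and the sample pairs are hypothetical and only serve the illustration.

```python
from collections import Counter

LABELS = ["A", "B", "C", "D", "E"]  # the labels category() can return


def build_matrix(pairs, labels=LABELS):
    """Count (actual, predicted) pairs into a len(labels) x len(labels) grid."""
    counts = Counter(pairs)
    # Rows are actual labels, columns are predicted labels; unseen pairs count as 0.
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


# Made-up pairs standing in for the per-file results.
pairs = [("A", "A"), ("A", "A"), ("B", "B"), ("D", "C")]
matrix = build_matrix(pairs)

# Print a header row of predicted labels, then one row per actual label.
print("  " + " ".join(LABELS))
for label, row in zip(LABELS, matrix):
    print(label + " " + " ".join(str(n) for n in row))
```

Because the label list is fixed up front, the grid keeps the same shape no matter which categories appear in a given set of files.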

Scikit-learn is a good option in your case, as it provides a confusion_matrix function. Here is an approach you can easily extend.

from sklearn.metrics import confusion_matrix

# Read your csv files
with open('A1.csv', 'r') as readFile:
    true_values = [int(ff) for ff in readFile]
with open('B1.csv', 'r') as readFile:
    predictions = [int(ff) for ff in readFile]

# Produce the confusion matrix
confusionMatrix = confusion_matrix(true_values, predictions)

print(confusionMatrix)

The output has the following form (the exact counts depend on the contents of the two files):

[[0 2]
 [0 2]]

For more hints, see the following question:

How to write a confusion matrix in Python?

Source: https://stackoverflow.com/questions/44215561/creating-confusion-matrix-from-multiple-csv-files

Tags: python, csv, confusion-matrix