I do a lot of data analysis in Perl and am trying to replicate this work in Python using pandas, NumPy, matplotlib, etc.
The general workflow goes as follows:
You are getting the following error:
    NameError: name 'MultiIndex' is not defined
because you are not importing MultiIndex directly when you import Series and DataFrame.
You have:
    from pandas import Series, DataFrame
You need:
    from pandas import Series, DataFrame, MultiIndex
Alternatively, you can refer to MultiIndex as pd.MultiIndex, since you are importing pandas as pd.
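To make the two options concrete, here is a minimal sketch showing that the directly imported name and pd.MultiIndex refer to the same class. The labels and values are made up purely for illustration.

```python
import pandas as pd
from pandas import Series, DataFrame, MultiIndex

# Hypothetical two-level labels, used only for illustration
tuples = [("run1", "a"), ("run1", "b"), ("run2", "a")]
index = MultiIndex.from_tuples(tuples, names=["run", "sample"])

# A Series and a DataFrame that share the two-level index
s = Series([1.0, 2.0, 3.0], index=index)
df = DataFrame({"value": s})

# Both spellings point at the same class, so either import style works
same_class = MultiIndex is pd.MultiIndex

# Selecting on the outer level returns every row under "run1"
run1_total = s.loc["run1"].sum()
```

Either spelling is fine; using the pd. prefix simply avoids having to extend the import line each time you need another pandas name.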
Hopefully this helps you get started?
    import os
    import sys

    def regex_match(line):
        # stand-in for a real regex: keep only lines containing 'LOOPS'
        return 'LOOPS' in line

    my_hash = {}
    for fd in os.listdir(sys.argv[1]):  # for each file in this directory
        with open(os.path.join(sys.argv[1], fd)) as f:
            for line in f:  # get each line of the file
                if regex_match(line):  # if it's a line I want
                    fields = line.rstrip('\n').split('\t')  # get the data I want
                    # store the data (assumes the third column is numeric)
                    my_hash[fields[1]] = float(fields[2])
    for key in my_hash:  # data science can go here?
        # do_something is a placeholder for your own analysis function
        do_something(key, my_hash[key] * 12)
    # plots
P.S. Make the first line
    #!/usr/bin/python
(or whatever path "which python" returns) to run the script as an executable.
To glob your files, use Python's built-in glob module.
To read your CSV files after globbing them, use the read_csv function. You can import it with from pandas.io.parsers import read_csv, or call it as pd.read_csv if you have imported pandas as pd.
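As a rough sketch of how glob and read_csv fit together, the snippet below reads every matching file in a directory into a single DataFrame. The helper name load_csvs, the source_file column, and the throwaway demo files are all assumptions made for illustration; your real files and columns will differ.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csvs(directory, pattern="*.csv"):
    """Read every file in `directory` matching `pattern` into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # Remember which file each row came from (hypothetical column name)
        frame["source_file"] = os.path.basename(path)
        frames.append(frame)
    # Note: pd.concat raises ValueError if no files matched
    return pd.concat(frames, ignore_index=True)


# Demo with two small throwaway files in a temporary directory
demo_dir = tempfile.mkdtemp()
for name, rows in [("a.csv", "x,y\n1,2\n3,4\n"), ("b.csv", "x,y\n5,6\n")]:
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(rows)

combined = load_csvs(demo_dir)
```

Keeping the file name as a column makes it easy to group or index by file later on.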
As for the MultiIndex feature, once you have built a pandas DataFrame with read_csv, you can use a MultiIndex to organize your data and slice it any way you want.
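A small sketch of what organizing and slicing with a MultiIndex can look like. The column names (source_file, sample, value) and the data are hypothetical; substitute whatever columns your own files contain.

```python
import pandas as pd

# Hypothetical data standing in for rows read from several files
df = pd.DataFrame({
    "source_file": ["a.csv", "a.csv", "b.csv", "b.csv"],
    "sample": ["s1", "s2", "s1", "s2"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Promote two columns to a two-level index; sorting keeps lookups predictable
indexed = df.set_index(["source_file", "sample"]).sort_index()

# All rows for one file (selects on the outer level)
one_file = indexed.loc["a.csv"]

# One sample across all files (cross-section on the inner level)
one_sample = indexed.xs("s1", level="sample")

# Aggregate per file by grouping on an index level
per_file_mean = indexed.groupby(level="source_file")["value"].mean()
```

Selecting with .loc works on the outer level, while .xs is a convenient way to cut across an inner level without reordering the index.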
Three pertinent links for your reference:
MultiIndex dataframes in pandas: "understanding MultiIndex" and "Benefits of panda's multiindex?"
glob in a directory to grab and manipulate your files: "extract values/renaming filename in python"