Question
For the life of me I can't seem to get the structure I want and have it function properly, so in a fit of rage I come to you guys.
Setup: I have a directory called Futures_Contracts containing about 30 folders, each named after its underlying asset, and inside each folder are the 6 nearest-expiration contracts in CSV format. Every CSV has the same format and contains Date, O, H, L, C, V, OI, Expiration Month.
Note: O, H, L, C, V, OI are open, high, low, close, volume and open interest (for those not familiar). Also assume close is synonymous with settlement below.
Task: The goal is to load the futures data into a multi-index pandas DataFrame such that the top-level index is the underlying commodity symbol, the mid-level index is the expiration month-year, and the lowest level is the OHLC data. The end goal is to have something that lets me start hacking at the zipline module to get it running on futures.
My feeble attempt:
import os
import datetime

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series

plt.rcParams['figure.figsize'] = (16, 8)  # default figure size

deliveries = {}
commodities = {}
columns = 'open', 'high', 'low', 'settle', 'volume', 'interest', 'delivery'  # contract fields
path = os.getcwd() + '/Futures_Contracts/'  # futures path
for sym in os.listdir(path):
    if sym[0] != '.':  # weed out hidden files
        deliveries[sym] = []
        i = 0
        for contract in os.listdir(path + sym):
            # pull in the CSV
            temp = pd.read_csv(path + sym + '/' + contract, index_col=0, parse_dates=True, names=columns)
            # add contract to dict in the form MonthCode-YY
            deliveries[sym].append(str(contract[:-4][-1] + contract[:-4][:-1][-2:]))
            commodities[sym] = deliveries[sym]
            commodities[sym][i] = temp
            i += 1
This somewhat works; however, it is really just a nested container (a dict of lists) that holds a DataFrame at the end, so slicing is extremely clunky:
commodities['SB2'][0]['settle'].plot()
commodities['SB2'][3]['settle'].plot()
commodities['SB2'][4]['settle'].plot()
commodities['SB2'][3]['settle'].plot()
commodities['SB2'][4]['settle'].plot()
commodities['SB2'][5]['settle'].plot()
This yields a single matplotlib chart.
Optimally I would be able to slice across each of the index levels so that I can compare data across assets, expirations, dates and values. I would also like to label what I am looking at; in the matplotlib chart everything is simply named 'settle'.
There is surely a way to do this, but I am just not smart enough to figure it out.
Answer 1:
I think you're going to be much better off getting this into one DataFrame, so consider using a MultiIndex. Here's a toy example, which I think will translate well to your code:
In [11]: dfN13 = pd.DataFrame([[1, 2]], columns=[['N13', 'N13'], ['a', 'b']])
In [12]: dfM13 = pd.DataFrame([[3, 4]], columns=[['M13', 'M13'], ['a', 'b']])
These are the DataFrames in your example, but the columns' first level is just the asset name.
In [13]: df = pd.concat([dfN13, dfM13], axis=1)
In [14]: df
Out[14]:
  N13    M13
    a  b   a  b
0   1  2   3  4
For convenience we can label the column levels and the index.
In [15]: df.columns.names = ['asset', 'chart']
In [16]: df.index.names = ['date'] # well, not in this toy example
In [17]: df
Out[17]:
asset N13    M13
chart   a  b   a  b
date
0       1  2   3  4
Note: This looks quite like your spreadsheet.
And we can grab out a specific chart (e.g. ohlc) using xs:
In [18]: df.xs('a', level='chart', axis=1)
Out[18]:
asset  N13  M13
date
0        1    3
In [19]: df.xs('a', level='chart', axis=1).plot() # win
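The toy example above has two column levels; the same `pd.concat` idea extends to three (asset, expiry, field) by concatenating dicts of DataFrames, since the dict keys become a new outer column level. A minimal sketch, assuming made-up symbols and prices; `make_contract` is a hypothetical stand-in for reading one contract's CSV:

```python
import pandas as pd

dates = pd.date_range("2013-06-03", periods=3, freq="D")


def make_contract(base):
    # Hypothetical stand-in for one contract's CSV (values are invented).
    return pd.DataFrame(
        {
            "open": [base, base + 1, base + 2],
            "settle": [base + 0.5, base + 1.5, base + 2.5],
        },
        index=dates,
    )


# asset -> expiry -> DataFrame
raw = {
    "SB": {"N13": make_contract(16.0), "V13": make_contract(17.0)},
    "CC": {"N13": make_contract(2200.0)},
}

# The inner concat adds the expiry level, the outer concat adds the asset level.
per_asset = {sym: pd.concat(contracts, axis=1) for sym, contracts in raw.items()}
df = pd.concat(per_asset, axis=1)
df.columns.names = ["asset", "expiry", "data"]
df.index.name = "date"

# Pull one field across every asset and expiry.
settle = df.xs("settle", level="data", axis=1)
```

Here `settle` keeps the `asset` and `expiry` levels as its columns, so each series stays identifiable when plotted.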
Answer 2:
OK, this seemed to work.
import os

import pandas as pd

commodities = {}
columns = 'open', 'high', 'low', 'settle', 'volume', 'interest', 'delivery'  # contract fields
path = os.getcwd() + '/Futures_Contracts/'  # futures path
for sym in os.listdir(path):
    if sym[0] != '.':  # weed out hidden files
        c_expirations = {}
        for contract in os.listdir(path + sym):
            # expiry key in the form MonthCode + YY, taken from the file name
            expiry = contract[:-4][-1] + contract[:-4][:-1][-2:]
            c_expirations[expiry] = pd.read_csv(path + sym + '/' + contract, index_col=0, parse_dates=True, names=columns)
        # one frame per asset, with expiry as the top column level
        commodities[sym] = pd.concat(c_expirations, axis=1)
# one frame overall, with asset as the top column level
df_data = pd.concat(commodities, axis=1)
df_data.columns.names = 'asset', 'expiry', 'data'
And a look at what it prints out:
print(df_data)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1568 entries, 2007-04-16 00:00:00 to 2013-06-17 00:00:00
Columns: 1197 entries, (CC2, H14, open) to (ZW, Z13, delivery)
dtypes: float64(1197)
It really just came down to tinkering with Andy's advice and applying it at large scale.
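For the slicing and labelling that the question asks about, a frame shaped like `df_data` can be cut along any of the three column levels. The following is only a sketch on a small synthetic frame: the symbols, expiries and numbers are placeholders, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same column levels as df_data (values invented).
dates = pd.date_range("2013-06-03", periods=4, freq="D")
cols = pd.MultiIndex.from_product(
    [["CC2", "SB2"], ["N13", "V13"], ["open", "settle"]],
    names=["asset", "expiry", "data"],
)
values = np.arange(len(dates) * len(cols), dtype=float).reshape(len(dates), len(cols))
df_data = pd.DataFrame(values, index=dates, columns=cols)

# One asset: every expiry and field for SB2 (the asset level is dropped).
sb = df_data["SB2"]

# One field across every asset and expiry.
settle = df_data.xs("settle", level="data", axis=1)

# One expiry across every asset.
n13 = df_data.xs("N13", level="expiry", axis=1)

# A single series, addressed by its full (asset, expiry, field) key.
sb_n13_settle = df_data[("SB2", "N13", "settle")]

# Flatten the remaining levels into readable labels, e.g. for a plot legend.
labelled = settle.copy()
labelled.columns = ["{} {}".format(asset, expiry) for asset, expiry in settle.columns]
```

Calling `labelled.plot()` would then give each line a legend entry such as "SB2 N13" rather than a bare 'settle'.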
Source: https://stackoverflow.com/questions/17178263/commodity-futures-hierarchical-data-structure