Question
I am new to Python and trying to create something equivalent to MATLAB's "cell array". Let's say I have 100 customers indexed 'C001', 'C002', etc., and I have different data for each customer:
- Size of premises in square meters [real number]
- categorical data showing whether they are 'commercial', 'residential' or 'other'
- hourly time series of their electricity consumption in 2014 i.e. datetime-indexed array of 8760 real values
What is the best way to build such a dataset in Python 2.7 that combines single values, categorical data and time-indexed arrays? I am trying to use pandas for this but have had no success so far.
Thank you very much in advance
Answer 1:
The equivalent of a MATLAB cell array is a numpy object array. However, these are rarely used because they are rarely what you want in practice. In most cases where someone would use a cell array in MATLAB, a list or nested list would suffice:
>>> a = [obj1, obj2, obj3, obj4]
>>> b = [[obj1, obj2], [obj3, obj4]]
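For completeness, a numpy object array might look like the minimal sketch below. The stored values are made up for illustration and are not part of the original question:

```python
import numpy as np

# An object array can hold arbitrary Python objects in each slot,
# much like a MATLAB cell array.
cells = np.empty(3, dtype=object)
cells[0] = 120.5            # a single real value
cells[1] = 'residential'    # a string
cells[2] = np.arange(5)     # an array with its own shape

print(cells.dtype)
print(cells[2].sum())
```

Each element is just a reference to a Python object, so numpy's vectorized arithmetic gives little benefit here, which is part of why such arrays are rarely the right tool.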
However, that is not what you want to do in your case. Your question is a classic example of an XY problem: you are asking how to implement a particular solution to your problem, rather than asking how to solve the problem itself. Python can do a lot of things MATLAB can't, so trying to make Python behave like MATLAB will often result in sub-optimal solutions.
In this case, what you want is a pandas DataFrame. It is nothing at all like a MATLAB cell array, but fits your data set much better. You can use a MultiIndex on the columns to store each customer's parameters (name, size, category), with one column per customer holding its time series and the timestamps as the row index. This allows you to index by name, size, category, date, etc. You can calculate, for example, the mean energy usage for each category of property in the third quarter for properties over 500 square meters in just one line of code.
So here is an example of how you could structure such data:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Per-customer parameters (random placeholder sizes in square meters)
>>> names = ['C001', 'C002', 'C003', 'C004']
>>> sizes = np.random.random(4)*1000
>>> category = ['Commercial', 'Residential', 'Residential', 'Other']
>>> # Placeholder time series: 100 weekly samples for each of the 4 customers
>>> ts = np.random.random([100, 4])
>>> timestamps = pd.date_range('1/1/2011', periods=100, freq='W')
>>>
>>> cols = pd.MultiIndex.from_arrays([names, sizes, category])
>>>
>>> df = pd.DataFrame(ts, index=timestamps, columns=cols)
>>> df.columns.names = ['Name', 'Size', 'Category']
>>> df.index.name = 'Time'
>>>
>>> print(df)
Name C001 C002 C003 C004
Size 36.719201 732.278278 795.755755 551.383120
Category Commercial Residential Residential Other
Time
2011-01-02 0.108720 0.018492 0.057233 0.694548
2011-01-09 0.959845 0.968857 0.422210 0.975767
2011-01-16 0.709676 0.119963 0.004481 0.830328
2011-01-23 0.084271 0.535408 0.209943 0.668001
2011-01-30 0.626125 0.052301 0.212636 0.995429
2011-02-06 0.376399 0.199327 0.482884 0.632472
2011-02-13 0.302807 0.353679 0.599427 0.993996
2011-02-20 0.185445 0.005769 0.755981 0.923540
2011-02-27 0.109611 0.994292 0.873782 0.542741
2011-03-06 0.561404 0.778414 0.595238 0.082001
2011-03-13 0.056986 0.869344 0.459753 0.450071
2011-03-20 0.261320 0.675317 0.603043 0.371950
2011-03-27 0.890803 0.061619 0.831677 0.801890
2011-04-03 0.498199 0.846559 0.370336 0.225477
2011-04-10 0.248914 0.693038 0.145255 0.233058
2011-04-17 0.621441 0.683213 0.048944 0.650139
2011-04-24 0.459869 0.055751 0.912097 0.457605
2011-05-01 0.814447 0.780415 0.184241 0.429139
2011-05-08 0.586905 0.209121 0.428080 0.246584
2011-05-15 0.754021 0.909181 0.846984 0.948835
2011-05-22 0.513610 0.203925 0.338072 0.596325
2011-05-29 0.497080 0.557908 0.916812 0.680242
2011-06-05 0.646791 0.641024 0.399427 0.308346
2011-06-12 0.573922 0.539285 0.098703 0.461480
2011-06-19 0.062978 0.939339 0.713087 0.380326
2011-06-26 0.422484 0.109185 0.459734 0.800468
2011-07-03 0.962368 0.632361 0.388565 0.503425
2011-07-10 0.802551 0.261161 0.590494 0.526307
2011-07-17 0.261447 0.686405 0.636970 0.622476
2011-07-24 0.634331 0.630028 0.069925 0.504036
... ... ... ... ...
2012-05-06 0.185331 0.375717 0.658463 0.697377
2012-05-13 0.273510 0.665318 0.756944 0.083542
2012-05-20 0.895984 0.850881 0.680869 0.987420
2012-05-27 0.450593 0.262195 0.458893 0.199141
2012-06-03 0.696102 0.332312 0.419764 0.338074
2012-06-10 0.113108 0.167605 0.812625 0.329429
2012-06-17 0.527418 0.087454 0.868973 0.744649
2012-06-24 0.977674 0.831538 0.410719 0.598423
2012-07-01 0.577802 0.141307 0.310356 0.276271
2012-07-08 0.772117 0.288240 0.820701 0.548857
2012-07-15 0.699628 0.467952 0.429433 0.304482
2012-07-22 0.782641 0.337854 0.561191 0.572241
2012-07-29 0.010225 0.962770 0.793041 0.166877
2012-08-05 0.895516 0.628526 0.782264 0.908301
2012-08-12 0.787210 0.698185 0.255306 0.741693
2012-08-19 0.042833 0.556469 0.165885 0.408108
2012-08-26 0.942076 0.377714 0.927170 0.119004
2012-09-02 0.567978 0.007891 0.777752 0.869950
2012-09-09 0.120134 0.417996 0.328654 0.484447
2012-09-16 0.833769 0.946456 0.594471 0.569707
2012-09-23 0.515544 0.090017 0.344200 0.498175
2012-09-30 0.419152 0.315412 0.683195 0.498630
2012-10-07 0.879582 0.958591 0.531812 0.051948
2012-10-14 0.488241 0.683242 0.096560 0.197295
2012-10-21 0.425213 0.279539 0.476436 0.492512
2012-10-28 0.238334 0.836782 0.901589 0.132700
2012-11-04 0.030562 0.797666 0.238895 0.550427
2012-11-11 0.875454 0.973046 0.457116 0.154175
2012-11-18 0.557967 0.895320 0.478239 0.448102
2012-11-25 0.075152 0.047344 0.650615 0.293129
[100 rows x 4 columns]
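The query mentioned earlier (mean usage per category in the third quarter, for properties over 500 square meters) could be sketched roughly as follows. This is one possible approach, not the only one. The fixed sizes and the seeded random generator are assumptions made only so that the snippet is self-contained and reproducible:

```python
import numpy as np
import pandas as pd

# Rebuild a small frame with the same layout as above.
# Sizes are fixed illustrative values; the series are random placeholders.
names = ['C001', 'C002', 'C003', 'C004']
sizes = [36.7, 732.3, 795.8, 551.4]
category = ['Commercial', 'Residential', 'Residential', 'Other']
timestamps = pd.date_range('2011-01-01', periods=100, freq='W')
ts = np.random.RandomState(0).random_sample((100, 4))

cols = pd.MultiIndex.from_arrays([names, sizes, category],
                                 names=['Name', 'Size', 'Category'])
df = pd.DataFrame(ts, index=timestamps, columns=cols)
df.index.name = 'Time'

# Boolean masks: rows in the third quarter, columns with Size > 500.
in_q3 = df.index.quarter == 3
is_large = df.columns.get_level_values('Size') > 500
selected = df.loc[in_q3, is_large]

# Mean over time for each customer, then averaged within each category.
result = selected.mean().groupby(level='Category').mean()
print(result)
```

With the masks inlined, this collapses to a single expression. Because every customer has the same number of samples here, averaging the per-customer means gives the same result as averaging all values in a category directly.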
Source: https://stackoverflow.com/questions/40609838/what-is-the-equivalent-to-a-matlab-cell-array