Question
Given time series data, I'm trying to run a panel OLS regression with fixed effects in Python. I found this way to do it:
Fixed effect in Pandas or Statsmodels
My input data looks like this (I will call it df):
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
So first I have to transform it into a MultiIndex frame (_13, _14, _15 represent data from 2013, 2014 and 2015, in that order):
import numpy
import pandas

df = df.dropna()
df = df.drop_duplicates()
# Three year-end dates: 2013-12-31, 2014-12-31, 2015-12-31
rng = pandas.date_range(start='2013-01-01', periods=3, freq='A')
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
# .loc replaces the removed .ix indexer; the selection is unchanged
d1 = numpy.array(df.loc[:, ['Score_13', 'Permits_13']])
d2 = numpy.array(df.loc[:, ['Score_14', 'Permits_14']])
d3 = numpy.array(df.loc[:, ['Score_15', 'Permits_15']])
# Stack the three years in the same order as the index (date varies slowest)
data = numpy.concatenate((d1, d2, d3), axis=0)
s = pandas.DataFrame(data, index=index, columns=['y', 'x'])
s = s.drop_duplicates()
Which results in something like this:
y x
date id
2013-12-31 P.S. 015 ROBERTO CLEMENTE 284 12
P.S. 019 ASHER LEVY 296 18
P.S. 020 ANNA SILVER 294 9
P.S. 034 FRANKLIN D. ROOSEVELT 294 3
P.S. 064 ROBERT SIMON 287 3
P.S. 110 FLORENCE NIGHTINGALE 313 0
P.S. 134 HENRIETTA SZOLD 290 4
P.S. 137 JOHN L. BERNSTEIN 276 4
P.S. 140 NATHAN STRAUS 282 13
P.S. 142 AMALIA CASTRO 290 7
P.S. 184M SHUANG WEN 327 5
P.S. 188 THE ISLAND SCHOOL 279 4
HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 255 4
TECHNOLOGY, ARTS, AND SCIENCES STUDIO 282 18
THE EAST VILLAGE COMMUNITY SCHOOL 306 35
UNIVERSITY NEIGHBORHOOD MIDDLE SCHOOL 277 4
THE CHILDREN'S WORKSHOP SCHOOL 302 35
NEIGHBORHOOD SCHOOL 299 15
EARTH SCHOOL 305 3
SCHOOL FOR GLOBAL LEADERS 286 15
TOMPKINS SQUARE MIDDLE SCHOOL 306 3
P.S. 001 ALFRED E. SMITH 303 20
P.S. 002 MEYER LONDON 306 8
P.S. 003 CHARRETTE SCHOOL 325 62
P.S. 006 LILLIE D. BLAKE 333 89
P.S. 011 WILLIAM T. HARRIS 320 30
P.S. 033 CHELSEA PREP 313 5
P.S. 040 AUGUSTUS SAINT-GAUDENS 326 23
P.S. 041 GREENWICH VILLAGE 326 25
P.S. 042 BENJAMIN ALTMAN 314 30
... ... ... ...
2015-12-31 P.S. 054 CHARLES W. LENG 309 2
P.S. 055 HENRY M. BOEHM 311 3
P.S. 56 THE LOUIS DESARIO SCHOOL 323 4
P.S. 057 HUBERT H. HUMPHREY 287 2
SPACE SHUTTLE COLUMBIA SCHOOL 307 0
P.S. 060 ALICE AUSTEN 303 1
I.S. 061 WILLIAM A MORRIS 291 2
MARSH AVENUE SCHOOL FOR EXPEDITIONARY LEARNING 316 0
P.S. 069 DANIEL D. TOMPKINS 307 2
I.S. 072 ROCCO LAURIE 308 1
I.S. 075 FRANK D. PAULO 318 9
THE MICHAEL J. PETRIDES SCHOOL 310 0
STATEN ISLAND SCHOOL OF CIVIC LEADERSHIP 309 0
P.S. 075 MAYDA CORTIELLA 282 19
P.S. 086 THE IRVINGTON 286 38
P.S. 106 EDWARD EVERETT HALE 280 27
P.S. 116 ELIZABETH L FARRELL 291 3
P.S. 123 SUYDAM 287 14
P.S. 145 ANDREW JACKSON 285 4
P.S. 151 LYNDON B. JOHNSON 271 27
J.H.S. 162 THE WILLOUGHBY 283 22
P.S. 274 KOSCIUSKO 282 2
J.H.S. 291 ROLAND HAYES 279 13
P.S. 299 THOMAS WARREN FIELD 288 5
I.S. 347 SCHOOL OF HUMANITIES 284 45
I.S. 349 MATH, SCIENCE & TECH. 285 45
P.S. 376 301 9
P.S. 377 ALEJANDRINA B. DE GAUTIER 277 3
P.S. /I.S. 384 FRANCES E. CARTER 291 4
ALL CITY LEADERSHIP SECONDARY SCHOOL 325 18
However, when I try to call:
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get an error:
ValueError: Can't convert non-uniquely indexed DataFrame to Panel
This is my first time using pandas, so this may be a simple question, but I can't see what the problem is. As far as I can tell, I have a multi-index object as required.
I don't understand why I have duplicates (I added several drop_duplicates() calls to try to get rid of any duplicated data, though I don't think that is the answer). If I have data for the same school for three years, shouldn't I expect some duplication anyway (looking just at the Name column, for example)?
EDIT
df is 935 rows × 7 columns after dropping the rows that contain NaNs, so I expected s to be 2805 rows × 2 columns, which is exactly what I have.
If I run this:
s = s.reset_index().groupby(s.index.names).first()
reg = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)
I get another error:
ValueError: operands could not be broadcast together with shapes (2763,) (3,)
Thank you.
Answer 1:
Using the provided pickle file, I ran the regression and it worked fine.
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x>
Number of Observations: 2763
Number of Degrees of Freedom: 4
R-squared: 0.0268
Adj R-squared: 0.0257
Rmse: 16.4732
F-stat (1, 2759): 25.3204, p-value: 0.0000
Degrees of Freedom: model 3, resid 2759
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.1666 0.0191 8.72 0.0000 0.1292 0.2041
---------------------------------End of Summary---------------------------------
I ran this in a Jupyter Notebook.
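For anyone hitting the same ValueError: the drop from 2805 rows to 2763 after groupby(...).first() means that 42 rows shared a (date, id) pair with another row, which would be consistent with some school names appearing more than once in df. Note that drop_duplicates() only removes rows that are identical in every column, so two rows with the same Name but different scores both survive. Below is a minimal sketch of how one might check for this, assuming a small made-up frame in the question's layout (the repeated school and its values are invented for illustration and are not the asker's data):

```python
import pandas as pd

# Hypothetical wide-format data; "P.S. 019 ASHER LEVY" is repeated on purpose
# with slightly different scores, so drop_duplicates() keeps both rows.
df = pd.DataFrame({
    "Name": ["P.S. 015 ROBERTO CLEMENTE", "P.S. 019 ASHER LEVY", "P.S. 019 ASHER LEVY"],
    "Permits_13": [12.0, 18.0, 18.0], "Score_13": [284, 296, 297],
    "Permits_14": [22, 51, 51], "Score_14": [279, 301, 300],
    "Permits_15": [32, 55, 55], "Score_15": [283, 308, 309],
})
df = df.dropna().drop_duplicates()

# Reshape wide -> long, one block of rows per year, indexed by (date, id).
frames = []
for suffix, year in [("13", 2013), ("14", 2014), ("15", 2015)]:
    part = df[["Name", f"Score_{suffix}", f"Permits_{suffix}"]].copy()
    part.columns = ["id", "y", "x"]
    part["date"] = year
    frames.append(part)
s = pd.concat(frames, ignore_index=True).set_index(["date", "id"])

# Flag every row whose (date, id) pair occurs more than once.
dup_mask = s.index.duplicated(keep=False)
print(s[dup_mask])

# Keep the first row per (date, id), similar in spirit to groupby(...).first().
s_unique = s[~s.index.duplicated(keep="first")]
print(s_unique.index.is_unique)
```

Whether keeping the first row is the right choice depends on why the names repeat; if the repeated names are actually different schools, a distinct identifier column would be a safer index than Name.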
Source: https://stackoverflow.com/questions/37260035/pandas-multi-index-cant-convert-non-uniquely-indexed-dataframe-to-panel