Question
My task:
For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.
The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.
Question 5
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)
This function should return a single string value.
import pandas as pd
census_df = pd.read_csv('census.csv')
census_df = census_df[census_df['SUMLEV']==50]
census_df_2 = census_df.groupby(by='STNAME',axis=0)
This, however, does not group the DataFrame by 'STNAME', which can be seen when executing census_df_2.head().
I suppose this should work on a grouped DataFrame:
def answer_five():
    return census_df_2[ census_df_2['COUNTY'].count() == max( census_df_2['COUNTY'].count() ) ].index().tolist()[0]
answer_five()
Why does the groupby function not work? I've tried changing the axis and using the set_index() function instead, but I can't get it to work.
If someone knows another way to solve this problem I'd appreciate it.
Answer 1:
groupby simply returns a GroupBy object; you have to apply an aggregate function to that object to get a result, e.g.
census_df.groupby(by='STNAME').aggregate({'COUNTY': 'nunique'}).idxmax().iloc[0]
gives
'Texas'
See the pandas docs here for an introduction to grouping/aggregating.
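To make the point about the GroupBy object concrete, here is a minimal sketch. It uses a tiny made-up DataFrame in place of census.csv (the state names and county codes are hypothetical, only the column names SUMLEV, STNAME and COUNTY come from the question), and it counts rows per group with size(), which is one possible alternative to the nunique aggregation above.

```python
import pandas as pd

# Hypothetical miniature stand-in for census.csv; the values are invented.
# SUMLEV == 40 marks a state-level row, SUMLEV == 50 a county-level row.
census_df = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 40, 50, 50, 50],
    "STNAME": ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"],
    "COUNTY": [0, 1, 3, 0, 1, 3, 5],
})

# Keep only county-level rows so state summary rows are not counted.
counties = census_df[census_df["SUMLEV"] == 50]

# groupby alone only describes the grouping; nothing is computed yet.
grouped = counties.groupby("STNAME")
print(type(grouped).__name__)

# Applying an aggregation (here: number of rows per group) yields a Series
# indexed by STNAME.
counts = grouped.size()


def answer_five():
    # idxmax returns the index label (the state name) of the largest count.
    return counts.idxmax()


print(answer_five())
```

With the real file, the same pattern should apply unchanged once census_df is loaded via pd.read_csv('census.csv').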
Source: https://stackoverflow.com/questions/56883626/finding-the-index-with-maximum-number-of-rows