I am trying to create a histogram of a continuous-valued column, Trip_distance,
in a large (1.4M-row) pandas DataFrame. I wrote the following code:
fig =
Here's another way to plot the data. It involves turning the pickup datetime column into the index, which might also help you with slicing later on.
import pandas as pd
import matplotlib.pyplot as plt

# Convert the column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
# Use the datetime column as the index
trip_data.index = trip_data['lpep_pickup_datetime']
# Plot a histogram of the trip distances
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
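To illustrate the slicing that a datetime index makes possible, here is a minimal sketch. The frame `trips` and its values are made up for the example (hourly timestamps and random distances), not taken from the real trip data; only the column name Trip_distance comes from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 4 days of hourly pickups (assumed values).
rng = np.random.default_rng(0)
pickups = pd.date_range("2015-09-01", periods=96, freq="h")
trips = pd.DataFrame(
    {"Trip_distance": rng.exponential(scale=3.0, size=len(pickups))},
    index=pickups,
)

# Partial-string indexing: every row whose timestamp falls on one day.
one_day = trips.loc["2015-09-02"]

# Label-based slice between two timestamps; both endpoints are included.
morning = trips.loc["2015-09-02 06:00":"2015-09-02 11:00"]

print(len(one_day), len(morning))
```

Either selection can then be plotted the same way as the full column, for example with `one_day['Trip_distance'].plot(kind='hist')`.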
After your comments, it makes perfect sense why you don't get a separate bar for each distinct value. There are 1.4 million rows and only ten discrete buckets, so apparently each bucket holds almost exactly 10% of the rows (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
This plots absolutely fine.
The df.hist function comes with an optional keyword argument, bins=10, which buckets the data into discrete bins. With only 10 bins and a more or less homogeneous distribution of hundreds of thousands of rows, you might not be able to see the difference between the ten bins in a low-resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
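To see what changing bins does without drawing anything, the bucket counts can be computed directly. This is a rough sketch on synthetic data: the exponential distribution and the 100,000-row size are assumptions for illustration, not properties of the real Trip_distance column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Trip_distance (made-up values, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Trip_distance": rng.exponential(scale=3.0, size=100_000)})

# np.histogram returns the per-bin counts and the bin edges, with no plotting.
counts_10, edges_10 = np.histogram(df["Trip_distance"], bins=10)
counts_50, edges_50 = np.histogram(df["Trip_distance"], bins=50)

# More bins over the same range means narrower buckets and finer detail.
print("bin width with 10 bins:", edges_10[1] - edges_10[0])
print("bin width with 50 bins:", edges_50[1] - edges_50[0])
print("share of rows in the first of 10 bins:", counts_10[0] / len(df))
```

Both calls cover the same range of values, so the 50-bin version just splits each of the ten wide buckets into five narrower ones.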