Kaggle | ASHRAE The first step(EDA with Python)

可紊 提交于 2020-02-08 05:13:55

ASHRAE - The first step: EDA with Python

In this competition we will develop models to predict the energy usage in each building. The dataset contains 1450+ buildings information. Different buildings may have different types of meters.

Before we start to build a model, we should have some intuition about the data first. So using visualization tools such as Seaborn we can be more familiar with the data we can get.

If you find this notebook interesting or useful, please do not hesitate to vote for me!

1 Start from Here

1.1 Loading Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import *
import gc
sns.set()
PATH='/kaggle/ashrae-energy-prediction/'
train=pd.read_csv(PATH+'train.csv')
test=pd.read_csv(PATH+'test.csv')
weather_train=pd.read_csv(PATH+'weather_train.csv')
weather_test=pd.read_csv(PATH+'weather_test.csv')
building=pd.read_csv(PATH+'building_metadata.csv')
# merge the data

train = train.merge(building, on='building_id', how='left')
test = test.merge(building, on='building_id', how='left')
train = train.merge(weather_train, on=['site_id', 'timestamp'], how='left')
test = test.merge(weather_test, on=['site_id', 'timestamp'], how='left')
del weather_train, weather_test,building
gc.collect()
27

1.2 Check some Information about the Data

Before we start, let’s check the missing ratio of each column:

# check whether there are missing values

data_na=(train.isnull().sum()/len(train))*100
data_na=data_na.drop(data_na[data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'MissingRatio':data_na})
print(missing_data)
                    MissingRatio
floor_count            82.652772
year_built             59.990033
cloud_coverage         43.655131
precip_depth_1_hr      18.544739
wind_direction          7.167792
sea_level_pressure      6.092515
wind_speed              0.710701
dew_temperature         0.495348
air_temperature         0.478124

We can see that after merging the dataset some of the columns contain a large percentage of missing values, such as ‘floor_count’ and ‘year_bulit’. We can further check the information of the dataset.

train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20216100 entries, 0 to 20216099
Data columns (total 16 columns):
building_id           int16
meter                 int8
timestamp             datetime64[ns]
meter_reading         float64
site_id               int8
primary_use           category
square_feet           int32
year_built            float16
floor_count           float16
air_temperature       float32
cloud_coverage        float16
dew_temperature       float32
precip_depth_1_hr     float16
sea_level_pressure    float32
wind_direction        float16
wind_speed            float32
dtypes: category(1), datetime64[ns](1), float16(5), float32(4), float64(1), int16(1), int32(1), int8(2)
memory usage: 1.1 GB

We can save the store memory by transforming some of the data types.

# Saving the memory space

data_types = {'building_id': np.int16,
          'meter': np.int8,
          'site_id': np.int8,
          'square_feet': np.int32,
          'year_built': np.float16,
          'floor_count': np.float16,    
          'cloud_coverage': np.float16,
          'precip_depth_1_hr': np.float16,
           'wind_direction': np.float16,     
          'dew_temperature': np.float32,
          'air_temperature': np.float32,
          'sea_level_pressure': np.float32,
          'wind_speed': np.float32,
          'primary_use': 'category',}

for feature in data_types:
    train[feature] = train[feature].astype(data_types[feature])
    test[feature] = test[feature].astype(data_types[feature])
    
train["timestamp"] = pd.to_datetime(train["timestamp"])
test["timestamp"] = pd.to_datetime(test["timestamp"])
gc.collect();

Now let’s see the data sample.

train.head()
building_id meter timestamp meter_reading site_id primary_use square_feet year_built floor_count air_temperature cloud_coverage dew_temperature precip_depth_1_hr sea_level_pressure wind_direction wind_speed
0 0 0 2016-01-01 0.0 0 Education 7432 2008.0 NaN 25.0 6.0 20.0 NaN 1019.700012 0.0 0.0
1 1 0 2016-01-01 0.0 0 Education 2720 2004.0 NaN 25.0 6.0 20.0 NaN 1019.700012 0.0 0.0
2 2 0 2016-01-01 0.0 0 Education 5376 1991.0 NaN 25.0 6.0 20.0 NaN 1019.700012 0.0 0.0
3 3 0 2016-01-01 0.0 0 Education 23685 2002.0 NaN 25.0 6.0 20.0 NaN 1019.700012 0.0 0.0
4 4 0 2016-01-01 0.0 0 Education 116607 1975.0 NaN 25.0 6.0 20.0 NaN 1019.700012 0.0 0.0

2 Data Visualization

2.1 Boxplot

First let’s see the relationship between the usage of the building and the age of the building:

fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(x='primary_use', y='year_built', data=train)
plt.xticks(rotation=90)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

在这里插入图片描述

It seems that some kinds of buildings were built eariler, such as those used as technology building. What about the relationship between the building area and building types?

fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(x='primary_use', y='square_feet', data=train)
plt.xticks(rotation=90)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

在这里插入图片描述

The above is an intuitive connection, as different types of buildings may install different types of meters, we will the this factor into consideration.

fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(x='primary_use', y='square_feet', hue='meter',data=train)
plt.xticks(rotation=90)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

在这里插入图片描述

The result are clear. Some of the buildings only install two or three types of meters, and the corresponding building area may also have significant differences.

sample=pd.DataFrame(train,columns=['site_id','primary_use'])
sample.drop_duplicates(keep='first')
site_id primary_use
0 0 Education
6 0 Lodging/residential
9 0 Office
10 0 Entertainment/public assembly
41 0 Other
... ... ...
2098 15 Manufacturing/industrial
2129 15 Religious worship
2139 15 Lodging/residential
2153 15 Utility
2203 15 Healthcare

110 rows × 2 columns

We can see that the buildings with a same ‘site_id’ may have different primary usage.

2.2 View the trends of the time series data

In this competition the data is shown in time series form, each row has a corresponding Time Label. So it is important for us to visualize the trends of the data.

First of all, we will observe the change of data from a monthly perspective. Specifically, the data will be separated by ‘site_id’ and the meter types.

# We can see it by month
fig, axes = plt.subplots(8,2,figsize=(15, 30))
color_dic={'red':0,'blue':1,'orange':2,'purple':3}
for i in range(0,15):    
    for color,meter in color_dic.items():
        if(len(train[(train['site_id']==i)&(train['meter']==meter)])!=0):
            train[(train['site_id']==i)&(train['meter']==meter)][['timestamp', 'meter_reading']].set_index('timestamp').resample('M').mean()['meter_reading'].plot(ax=axes[i%8][i//8], alpha=0.9, label=str(meter), color=color).set_ylabel('Mean meter reading by month', fontsize=13)
        axes[i%8][i//8].legend();
        axes[i%8][i//8].set_title('site_id {}'.format(i), fontsize=13);
        plt.subplots_adjust(hspace=0.45)

在这里插入图片描述

From the visualization result, we can observe some rules of the data changing. For example, in most buildings the values of meter type 2 and 3 have a significant decrease during the summer, because type 2 provides steam and type 3 provides hot water. But the condition of site_id 13 have a little different.

We can also observe the data from a weekly perspective:

# We can also see it by week
fig, axes = plt.subplots(8,2,figsize=(15, 30))
color_dic={'red':0,'blue':1,'orange':2,'purple':3}
for i in range(0,15):    
    for color,meter in color_dic.items():
        if(len(train[(train['site_id']==i)&(train['meter']==meter)])!=0):
            train[(train['site_id']==i)&(train['meter']==meter)][['timestamp', 'meter_reading']].set_index('timestamp').resample('W').mean()['meter_reading'].plot(ax=axes[i%8][i//8], alpha=0.9, label=str(meter), color=color).set_ylabel('Mean meter reading by week', fontsize=13)
        axes[i%8][i//8].legend();
        axes[i%8][i//8].set_title('site_id {}'.format(i), fontsize=13);
        plt.subplots_adjust(hspace=0.45)

在这里插入图片描述

From the weekly perspective, some data show a cyclical pattern such as the condition of site_id 12.

And what about from a daily perspective?

# By day
fig, axes = plt.subplots(8,2,figsize=(15, 30))
color_dic={'red':0,'blue':1,'orange':2,'purple':3}
for i in range(0,15):    
    for color,meter in color_dic.items():
        if(len(train[(train['site_id']==i)&(train['meter']==meter)])!=0):
            train[(train['site_id']==i)&(train['meter']==meter)][['timestamp', 'meter_reading']].set_index('timestamp').resample('D').mean()['meter_reading'].plot(ax=axes[i%8][i//8], alpha=0.9, label=str(meter), color=color).set_ylabel('Mean meter reading by day', fontsize=13)
        axes[i%8][i//8].legend();
        axes[i%8][i//8].set_title('site_id {}'.format(i), fontsize=13);
        plt.subplots_adjust(hspace=0.45)

在这里插入图片描述

We can also further separate the data by hours:

# By hour
fig, axes = plt.subplots(8,2,figsize=(15, 30))
color_dic={'red':0,'blue':1,'orange':2,'purple':3}
for i in range(0,15):    
    for color,meter in color_dic.items():
        if(len(train[(train['site_id']==i)&(train['meter']==meter)])!=0):
            train[(train['site_id']==i)&(train['meter']==meter)][['timestamp', 'meter_reading']].set_index('timestamp').resample('H').mean()['meter_reading'].plot(ax=axes[i%8][i//8], alpha=0.9, label=str(meter), color=color).set_ylabel('Mean meter reading by hour', fontsize=13)
        axes[i%8][i//8].legend();
        axes[i%8][i//8].set_title('site_id {}'.format(i), fontsize=13);
        plt.subplots_adjust(hspace=0.45)

在这里插入图片描述

2.3 The Weather Condition

The weather condition may also have connections with the energy consumption, we can visualize it, too.

# the weather condition(by day)
fig, axes = plt.subplots(figsize=(20,8))
axes1=axes.twinx()
train[['timestamp', 'meter_reading']].set_index('timestamp').resample('D').mean()['meter_reading'].plot(ax=axes1,alpha=0.9, label='meter_reading', color='green')
train[['timestamp', 'air_temperature']].set_index('timestamp').resample('D').mean()['air_temperature'].plot(ax=axes,alpha=0.9, label='air_temperature', color='red')
train[['timestamp', 'dew_temperature']].set_index('timestamp').resample('D').mean()['dew_temperature'].plot(ax=axes,alpha=0.9, label='dew_temperature', color='blue')
plt.legend()

在这里插入图片描述

<matplotlib.legend.Legend at 0x233fe00b518>

The green line represents ‘meter_reading’, we may not find any correlation between the temperature and the meter_reading, and it is strange that the value come across a sudden drop in Jun.

We can see the cloud condition.

# the cloud condition(by day)
fig, axes = plt.subplots(figsize=(20,8))
axes1=axes.twinx()
train[['timestamp', 'meter_reading']].set_index('timestamp').resample('D').mean()['meter_reading'].plot(ax=axes1,alpha=0.9, label='meter_reading', color='green')
train[['timestamp', 'cloud_coverage']].set_index('timestamp').resample('D').mean()['cloud_coverage'].plot(ax=axes,alpha=0.9, label='cloud_coverage', color='cyan')
plt.legend()
<matplotlib.legend.Legend at 0x2340872ae10>

在这里插入图片描述

We can see the wind condition, too.

# the wind condition(by day)

fig, axes = plt.subplots(figsize=(20,8))
axes1=axes.twinx()
train[['timestamp', 'meter_reading']].set_index('timestamp').resample('D').mean()['meter_reading'].plot(ax=axes1,alpha=0.9, label='meter_reading', color='green')
train[['timestamp', 'wind_speed']].set_index('timestamp').resample('D').mean()['wind_speed'].plot(ax=axes,alpha=0.9, label='wind_speed', color='purple')
plt.legend()
<matplotlib.legend.Legend at 0x233fe380b00>

在这里插入图片描述

In the visualization report above, we can see that the weather conditions are almost random, and there seems no correlation between weather conditions and the meter_reading if we just take such a cursory look.

2.4 Check the correlation matrix

Using the heatmap in seaborn can benefit us a lot, we can observe the corrlation of various features in this map.

for i in range(0,4):
    corr = train[train.meter == i][['timestamp','meter_reading','square_feet','year_built','floor_count',
             'air_temperature','cloud_coverage','dew_temperature','sea_level_pressure','wind_direction','wind_speed']].corr()
    f, ax = plt.subplots(figsize=(18, 6))
    sns.heatmap(corr,annot=True,cmap='RdGy')

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

Now we have a basic understanding towards the data and the next step will be feature engineering based on the characteristics of the data and we will try a few types of models.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!