Question
How can I make the prediction below, and which algorithm is best suited to it?
Each employee has a work activity with Start_Date and End_Date columns. The sheet has a few other columns, such as Work_Complexity (High or Low) and the number of sub-tasks for each activity.
How do I predict the End_Date of a work activity given its Start_Date? Which ML algorithm should be used?
Can this be considered a realistic use case?
thanks!!!
Answer 1:
Yes, this is a realistic use case.
If you have labelled data, meaning a sheet where the start date and end date are known for existing tasks, and you now want to predict the end date for a new task, you can use linear regression with multiple variables (multiple linear regression). For more information, see: https://www.investopedia.com/terms/m/mlr.asp
Don't get too caught up in the theory, though. In simple terms, linear regression is an approach to modelling the relationship between variables (columns). Linear regression with one variable means predicting the end date from a single variable (column), i.e. the start date in your case. To predict the end date from more than one variable (column), i.e. start date, task complexity, number of sub-tasks, etc., you have to use linear regression with multiple variables. I will use a house price prediction model as the example.
Below is an implementation of linear regression with one variable in Python, where we predict the house price from a single variable:
import pandas as pd  # used for loading your dataset
import numpy as np  # for arrays
from sklearn import linear_model  # machine learning library used for the prediction
df = pd.read_csv('/content/MLPractical2 - Sheet1.csv')  # upload your own file and adjust this path
df
Output: The file I uploaded contains the following data
Area || Price
2600 || 555000
3000 || 565000
3200 || 610000
3600 || 680000
4000 || 725000
Let's predict the price of a house with an area of 3601:
reg = linear_model.LinearRegression()
reg.fit(df[['Area']], df.Price)
reg.predict([[3601]])
Output : array([669653.42465753])
We are predicting the price on the basis of only one variable (column), i.e. Area.
As you can see in the file I uploaded, the price of the house with area 3600 is 680000, and the price our model predicts for area 3601 is 669653.42465753, which is close (about 1.5% lower).
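To see where that number comes from, here is a minimal sketch that repeats the single-variable fit in plain Python using the ordinary least squares formulas for a line. The helper name fit_line is made up for this illustration; the data is copied from the Area/Price table above.

```python
# Data copied from the Area/Price table above.
areas = [2600, 3000, 3200, 3600, 4000]
prices = [555000, 565000, 610000, 680000, 725000]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)
predicted = intercept + slope * 3601
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
# prints: 132.88 191164.38 669653.42
```

So the model is just price = 191164.38 + 132.88 * area, and plugging in 3601 gives the same value that reg.predict returned.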
Now let's look at an implementation of linear regression with multiple variables in Python, where we use several variables to predict the house price:
import pandas as pd #same as above
import numpy as np
from sklearn import linear_model
df = pd.read_csv('/content/ML_Sheet_2.csv')
df
Output: The file I uploaded in this case contains the following data
Area || Bedroooms || Age || Price
2600 || 3.0 || 20 || 550000
3000 || 4.0 || 15 || 565000
3200 || 3.0 || 18 || 610000
3600 || 3.0 || 30 || 595000
4000 || 5.0 || 8 || 760000
Let's predict the price of a house with an area of 3500, 3 bedrooms, and an age of 10 years:
reg = linear_model.LinearRegression()
reg.fit(df[['Area', 'Bedroooms', 'Age']], df.Price)
reg.predict([[3500, 3, 10]])
Output: array([717775])
We are predicting the house price on the basis of three variables, i.e. area, number of bedrooms, and age of the house.
As you can see in the file I uploaded, the price of the house with area 3200, 3 bedrooms, and an age of 18 years is 610000. The price our model predicts for area 3500, 3 bedrooms, and an age of 10 years is 717775, which is plausible: the house has more area than 3200 and is newer than 18 years (a newer house costs more).
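If you want to check how each column contributes, one option is to look at the fitted coefficients. The sketch below rebuilds the table above inline instead of reading the CSV (so it runs without the file) and keeps the column name 'Bedroooms' exactly as it is spelled in the sheet. The variable names features, new_house, by_hand and by_model are only for this illustration.

```python
import pandas as pd
from sklearn import linear_model

# Same rows as the table above, built inline instead of read from the CSV.
df = pd.DataFrame({
    "Area": [2600, 3000, 3200, 3600, 4000],
    "Bedroooms": [3.0, 4.0, 3.0, 3.0, 5.0],
    "Age": [20, 15, 18, 30, 8],
    "Price": [550000, 565000, 610000, 595000, 760000],
})
features = ["Area", "Bedroooms", "Age"]

reg = linear_model.LinearRegression()
reg.fit(df[features], df["Price"])

# One coefficient per input column, plus a single intercept.
for name, coef in zip(features, reg.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {reg.intercept_:.2f}")

# The prediction is the intercept plus the sum of coefficient * value.
new_house = pd.DataFrame([[3500, 3, 10]], columns=features)
by_hand = reg.intercept_ + sum(c * v for c, v in zip(reg.coef_, new_house.iloc[0]))
by_model = reg.predict(new_house)[0]
print(by_hand, by_model)
```

Keep in mind that with only five rows and three input columns the coefficients are not very reliable; with a real sheet you would want many more rows.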
Similarly, you can prepare an Excel sheet of your existing data, save it in .csv format, and proceed as I did. I am using Google Colab to write my code, and I recommend you use the same:
https://colab.research.google.com/notebooks/intro.ipynb#recent=true
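For the employee sheet itself, here is a rough sketch of how the same approach might look. It makes some assumptions that are not in the question: the rows are made-up sample data, the sub-task column is called Sub_Tasks, and instead of feeding the dates directly into the model it predicts the task duration in days and then adds that to the new start date (regression needs numeric inputs, so the dates have to be converted one way or another).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up historical tasks; column names follow the question, Sub_Tasks is assumed.
tasks = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2020-01-06", "2020-01-13", "2020-02-03",
                                  "2020-02-10", "2020-03-02", "2020-03-09"]),
    "End_Date": pd.to_datetime(["2020-01-11", "2020-01-27", "2020-02-10",
                                "2020-03-01", "2020-03-05", "2020-03-25"]),
    "Work_Complexity": ["Low", "High", "Low", "High", "Low", "High"],
    "Sub_Tasks": [2, 5, 3, 8, 1, 6],
})

# Target: how many days each task took.
tasks["Duration_Days"] = (tasks["End_Date"] - tasks["Start_Date"]).dt.days
# Encode the High/Low text column as 1/0 so the model can use it.
tasks["Complexity_High"] = (tasks["Work_Complexity"] == "High").astype(int)

features = ["Complexity_High", "Sub_Tasks"]
model = LinearRegression()
model.fit(tasks[features], tasks["Duration_Days"])

# New task: high complexity, 4 sub-tasks, starting on 2020-04-06.
new_task = pd.DataFrame({"Complexity_High": [1], "Sub_Tasks": [4]})
predicted_days = float(model.predict(new_task)[0])
new_start = pd.Timestamp("2020-04-06")
predicted_end = new_start + pd.Timedelta(days=round(predicted_days))
print(round(predicted_days, 1), predicted_end.date())
```

Predicting the duration rather than the raw end date keeps the target small and comparable across tasks. If you think the start date itself matters (for example, tasks started in December take longer), you could add features derived from it, such as the month or the day of the week.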
Hope this helps you!
Source: https://stackoverflow.com/questions/61378919/how-to-predict-employee-task-end-date-through-machine-learning