Question
I have a telecom dataset with 7k records.
I want to split it into 4 ranges based on one column, tenure, which contains values from 1 to 72.
The whole dataset should be split on this column as follows:
1 to 18 [dataset 1], 19 to 36 [dataset 2], 37 to 54 [dataset 3], 55 to 72 [dataset 4]
Here is a sample of my dataset, from head(5):
out.head(5)
Out[51]:
customerID Date gender age region SeniorCitizen Partner \
0 9796-BPKIW 1/2/2008 1 57 1 1 0
1 4298-OYIFC 1/4/2008 1 50 2 0 1
2 9606-PBKBQ 1/6/2008 1 85 0 1 1
3 1704-NRWYE 1/9/2008 0 55 0 1 0
4 9758-MFWGD 1/6/2008 0 52 1 1 1
Dependents tenure PhoneService ... DeviceProtection TechSupport \
0 0 8 1 ... 0 0
1 0 15 1 ... 1 1
2 0 32 1 ... 0 0
3 0 9 1 ... 0 0
4 1 48 0 ... 0 0
StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod \
0 0 0 0 1 1
1 1 1 0 1 2
2 0 1 0 1 2
3 1 0 0 1 2
4 0 0 1 0 0
MonthlyCharges TotalCharges Churn
0 69.95 562.70 0
1 103.45 1539.80 0
2 85.00 2642.05 1
3 80.85 751.65 1
4 29.90 1388.75 0
Answer 1:
This is easy to do with pandas, using pd.cut to bin the tenure column:
import pandas as pd
df = pd.read_csv('your_dataset_file.csv', sep=',', header=0)
# Sort it according to tenure
df.sort_values(by=['tenure'], inplace=True)
# Create bin edges
step_size = int(df.tenure.max()/4)
bin_edges = list(range(0,df.tenure.max()+step_size, step_size))
lbls = ['a','b','c','d']
df['bin'] = pd.cut(df.tenure,bin_edges, labels= lbls)
# Create separate dataframes from it
df1 = df[df.bin == 'a']
df2 = df[df.bin == 'b']
df3 = df[df.bin == 'c']
df4 = df[df.bin == 'd']
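If you would rather not derive the edges from the data, a minimal sketch with the edges written out explicitly might look like the following. The small DataFrame, the `tenure_range` column name, and the label strings are made up for illustration; only the `tenure` column name comes from the question's sample output. By default `pd.cut` uses right-closed intervals, so edges of 0, 18, 36, 54, 72 give exactly the ranges 1-18, 19-36, 37-54, and 55-72 for integer tenure values.

```python
import pandas as pd

# Hypothetical stand-in for the telecom data; only "tenure" matters here.
df = pd.DataFrame({"tenure": [1, 8, 18, 19, 36, 37, 54, 55, 72]})

# Explicit edges. pd.cut is right-closed by default, so the bins are
# (0, 18], (18, 36], (36, 54], (54, 72].
bin_edges = [0, 18, 36, 54, 72]
labels = ["1-18", "19-36", "37-54", "55-72"]  # illustrative label names
df["tenure_range"] = pd.cut(df["tenure"], bins=bin_edges, labels=labels)

# One DataFrame per range, keyed by its label.
parts = {label: df[df["tenure_range"] == label] for label in labels}
```

Keeping the pieces in a dict (for example `parts["19-36"]`) avoids four separately named variables, and any tenure outside 1 to 72 would show up as NaN in `tenure_range` instead of silently landing in a bin.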
Answer 2:
I would create a list of datasets, one per range:
dflist = [df[df["tenure"].isin(range(i*18 + 1, (i+1)*18 + 1))] for i in range(4)]
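Note that `isin(range(...))` only matches whole-number values. As a sketch of an alternative, assuming the column is named `tenure` as in the question's sample output, `Series.between` (inclusive on both ends by default) does the same split and also handles non-integer values. The tiny DataFrame and the `bounds` list below are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the telecom data.
out = pd.DataFrame({"tenure": [1, 18, 19, 36, 37, 54, 55, 72]})

# (low, high) pairs for the four ranges; both ends are inclusive.
bounds = [(1, 18), (19, 36), (37, 54), (55, 72)]

# One DataFrame per range, in the same order as bounds.
dflist = [out[out["tenure"].between(low, high)] for low, high in bounds]
```

Here `dflist[0]` holds the rows with tenure 1 to 18, `dflist[1]` those with 19 to 36, and so on.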
Answer 3:
Here is a simple loop that is easy to follow:
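i = 1   # lower bound of the current range
m = 0   # index of the current output dataset
out["tenure"] = out["tenure"].astype(int)
df = [None]*4
# Each pass selects one 18-wide range: 1-18, 19-36, 37-54, 55-72
while i < 72:
    df[m] = out[(out["tenure"] >= i) & (out["tenure"] <= (i + 17))]
    m += 1
    i += 18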
Hope this solves your problem.
Source: https://stackoverflow.com/questions/49376781/how-to-split-the-whole-dataset-into-4-range-based-on-one-column-using-python