Cleveland heart disease dataset - can’t describe the class

前端 未结 3 1273
难免孤独
难免孤独 2021-01-21 13:00

I’m using the Cleveland Heart Disease dataset from UCI for classification but i don’t understand the target attribute.

The dataset description says that

相关标签:
3条回答
  • 2021-01-21 13:32
    • is this dataset meant to be a multiclass or a binary classification problem?

      Without changes, the dataset is ready to be used for a multi-class classification problem.

    • And must i group values 1-4 to a single class (presence of disease)?

      Yes, you must, as long as you are interested in using the dataset for a binary classification problem.

    0 讨论(0)
  • 2021-01-21 13:34

    If you are working on imbalanced dataset, you should use re-sampling technique to get better results. In case of imbalanced datasets the classifier always "predicts" the most common class without performing any analysis of the features.

    You should try SMOTE, it's synthesizing elements for the minority class, based on those that already exist. It works randomly picking a point from the minority class and computing the k-nearest neighbors for this point.

    I also used cross validation K-fold method along with SMOTE, Cross validation assures that model gets the correct patterns from the data.

    While measuring the performance of model, accuracy metric mislead, its shows high accuracy even though there are more False Positive. Use metric such as F1-score and MCC.

    References :

    https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

    0 讨论(0)
  • 2021-01-21 13:41

    It basically means that the presence of different heart diseases have been denoted by 1, 2, 3, 4 while the absence is simply denoted by 0. Now, most of the experiments that have been conducted on this dataset have been based on binary classification, i.e. presence(1, 2, 3, 4) vs absence(0). One reason for such behavior might the class imbalance problem(0 has about 160 sample and the rest 1, 2, 3 and 4 make up the other half) and small number of samples(only around 300 total samples). So, it makes sense to treat this data as binary classification problem instead of multi-class classification, given the constraints that we have.

    0 讨论(0)
提交回复
热议问题