train-test-split

Thoughts about train_test_split for machine learning

Submitted by 强颜欢笑 on 2020-05-09 16:00:16
Question: I just noticed that many people use train_test_split even before handling missing data, splitting the data at the very beginning, while plenty of others split the data right before the model-building step, after they have finished all the data cleaning, feature engineering, and feature selection. Those who split the data at the very start say it is to prevent data leakage. I am right now just so confused about the pipeline of
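
A common way to reconcile the two approaches is to split first and then fit every preprocessing step on the training portion only, for example inside a Pipeline. A minimal sketch of that idea (the DataFrame df, the "target" column name, and the imputation/scaling choices are illustrative assumptions, not taken from the question):

```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# df is assumed to be an already-loaded pandas DataFrame with a "target" column
X = df.drop(columns=["target"])
y = df["target"]

# Split before fitting anything, so test rows never influence imputation or scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Every fitted preprocessing step lives inside the pipeline, so it only sees X_train
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```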

Managing Train/Develop Splits with the spaCy command line trainer

Submitted by 狂风中的少年 on 2020-03-03 07:34:29
Question: I am training an NER model using the python -m spacy train command line tool. I use gold.docs_to_json to convert my annotated documents to the JSON-serializable format. The command line training tool uses both a training set and a development set. I'm not sure how much assistance the command line tools give me for managing train/dev splits. Is there a command line tool to create train/dev splits from a single set of data? Will the spaCy training command do cross-validation for me instead of
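
As far as I know, spacy train expects separate training and development files and does not create the split (or cross-validate) for you. A rough sketch of doing the split yourself before serialization, assuming spaCy 2.x (where docs_to_json lives in spacy.gold) and an existing list docs of annotated Doc objects:

```python
import json
import random

from spacy.gold import docs_to_json  # spaCy 2.x location of docs_to_json

# `docs` is assumed to be your list of annotated Doc objects (not shown here)
random.seed(0)
random.shuffle(docs)

split = int(len(docs) * 0.8)  # 80/20 train/dev; the ratio is an arbitrary choice
train_docs, dev_docs = docs[:split], docs[split:]

# Write one JSON file per split in the format `python -m spacy train` reads
with open("train.json", "w", encoding="utf8") as f:
    json.dump([docs_to_json(train_docs)], f)
with open("dev.json", "w", encoding="utf8") as f:
    json.dump([docs_to_json(dev_docs)], f)
```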

scikit-learn error: The least populated class in y has only 1 member

Submitted by 纵饮孤独 on 2020-02-04 04:04:17
Question: I'm trying to split my dataset into a training and a test set by using the train_test_split function from scikit-learn, but I'm getting this error: In [1]: y.iloc[:,0].value_counts() Out[1]: M2 38 M1 35 M4 29 M5 15 M0 15 M3 15 In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y) Out[2]: Traceback (most recent call last): File "run_ok.py", line 48, in <module> xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85
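
The error means that at least one class in the array passed to stratify occurs only once, so it cannot appear in both splits. Since the counts shown are all 15 or more, one plausible culprit (an assumption, not something the excerpt confirms) is that stratify=y is receiving something other than the single label column. A hedged sketch of checking and passing the label column explicitly:

```python
from sklearn.model_selection import train_test_split

# Stratification needs at least 2 samples of every class; inspect the exact
# array handed to `stratify`, not just the column you think it is.
labels = y.iloc[:, 0]          # assumption: the first column of y holds the label
print(labels.value_counts())   # every class should appear at least twice

xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=1/3, random_state=85, stratify=labels
)
```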

Order between using validation, training and test sets

Submitted by 谁说我不能喝 on 2020-01-27 03:12:06
Question: I am trying to understand the process of model evaluation and validation in machine learning, specifically in which order and how the training, validation, and test sets must be used. Let's say I have a dataset and I want to use linear regression, and I am hesitating among various polynomial degrees (hyper-parameters). This Wikipedia article seems to imply that the sequence should be: Split data into training set, validation set and test set Use the training set to fit the model (find the
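
A minimal sketch of that sequence with scikit-learn: fit one model per candidate degree on the training set, pick the degree with the best validation error, and evaluate the chosen model on the test set exactly once. The split ratios and candidate degrees below are arbitrary, and X and y stand in for the questioner's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X, y are placeholders for the dataset in the question
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_degree, best_err, best_model = None, np.inf, None
for degree in (1, 2, 3, 4, 5):
    # Training set: fit the model for this hyper-parameter value
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # Validation set: compare hyper-parameter values against each other
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_degree, best_err, best_model = degree, err, model

# Test set: used once, only for the final chosen model
print("chosen degree:", best_degree)
print("test MSE:", mean_squared_error(y_test, best_model.predict(X_test)))
```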

ImportError: cannot import name 'LatentDirichletAllocation'

Submitted by ↘锁芯ラ on 2020-01-05 04:13:06
Question: I'm trying to import the following: from sklearn.model_selection import train_test_split and got the following error; here's the stack trace: ImportError Traceback (most recent call last) <ipython-input-1-bdd2a2f20673> in <module> 2 import pandas as pd 3 from sklearn.model_selection import train_test_split ----> 4 from sklearn.tree import DecisionTreeClassifier 5 from sklearn.metrics import accuracy_score 6 from sklearn import tree ~/.local/lib/python3.6/site-packages/sklearn/tree/__init__.py in
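
An ImportError for one scikit-learn name while importing an unrelated submodule usually points to a broken or mixed installation (for example two versions on the path, or stale files) rather than to the importing code itself. A small diagnostic sketch; the reinstall suggestion in the comment is an assumption based on the traceback alone:

```python
# Check which scikit-learn is actually being imported and where it lives.
# If the version and path look inconsistent, reinstalling the package
# (e.g. `pip install --force-reinstall scikit-learn`) is the usual fix.
import sklearn

print(sklearn.__version__)
print(sklearn.__file__)  # should resolve to a single, expected site-packages path
```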

Singleton array array(<function train at 0x7f3a311320d0>, dtype=object) cannot be considered a valid collection

Submitted by 若如初见. on 2020-01-02 01:07:11
Question: Not sure how to fix this. Any help is much appreciated. I saw the Vectorization: Not a valid collection question but I'm not sure I understood it. train = df1.iloc[:,[4,6]] target =df1.iloc[:,[0]] def train(classifier, X, y): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33) classifier.fit(X_train, y_train) print ("Accuracy: %s" % classifier.score(X_test, y_test)) return classifier trial1 = Pipeline([ ('vectorizer', TfidfVectorizer()), ('classifier', MultinomialNB()),
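
The error message itself points at the cause: a function object named train is reaching train_test_split where data was expected, most likely because def train(...) overwrites the train DataFrame defined just above it. A sketch of renaming to avoid the collision (this is an inference from the excerpt, and the new names are illustrative):

```python
from sklearn.model_selection import train_test_split

# Keep the data and the helper under different names so the function object
# can never be passed where a DataFrame is expected.
features = df1.iloc[:, [4, 6]]   # was `train` in the excerpt
target = df1.iloc[:, [0]]

def fit_and_score(classifier, X, y):   # was `def train(...)`, shadowing the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=33
    )
    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier
```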

How to perform k-fold cross validation with tensorflow?

Submitted by 蹲街弑〆低调 on 2019-12-20 09:24:57
Question: I am following the IRIS example of TensorFlow. My case is that I have all the data in a single CSV file, not pre-separated, and I want to apply k-fold cross-validation on that data. I have data_set = tf.contrib.learn.datasets.base.load_csv(filename="mydata.csv", target_dtype=np.int) How can I perform k-fold cross-validation on this dataset with a multi-layer neural network, as in the IRIS example? Answer 1: I know this question is old, but in case someone is looking to do something similar, expanding on
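
tf.contrib.learn has since been removed, so here is a sketch of the same idea with scikit-learn's KFold supplying the fold indices and tf.keras building a fresh multi-layer network per fold; the column layout of mydata.csv, the layer sizes, and the epoch count are all assumptions:

```python
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import KFold

# Load the single CSV; assumes feature columns followed by an integer label column.
data = pd.read_csv("mydata.csv")
X = data.iloc[:, :-1].to_numpy(dtype=np.float32)
y = data.iloc[:, -1].to_numpy(dtype=np.int64)

def build_model(n_features, n_classes):
    # Small multi-layer network, roughly in the spirit of the IRIS example
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Build a fresh model for every fold so no weights leak between folds
    model = build_model(X.shape[1], len(np.unique(y)))
    model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)

print("mean CV accuracy:", np.mean(scores))
```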