categorical-data

Appending pandas DataFrame with MultiIndex with data containing new labels, but preserving the integer positions of the old MultiIndex

狂风中的少年 submitted on 2020-01-14 07:27:29
Question: Base scenario. For a recommendation service I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the matrix factorization model to yield the best results, I need to map my user and item IDs to a contiguous range of integer IDs starting at 0. I'm using a pandas DataFrame in the process, and I have found a MultiIndex to be extremely convenient for creating this mapping, like so: ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0}, {'user_id': 1, 'item_id':
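A minimal sketch of one way to build such a mapping without a MultiIndex, assuming pandas is available. It uses `pd.factorize` to assign integer positions, then extends the label set so that previously assigned positions stay fixed when new IDs arrive. The sample ratings and the helper name `extend_codes` are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Illustrative sample data (not the data from the question).
ratings = pd.DataFrame([
    {"user_id": 10, "item_id": 7, "rating": 1.0},
    {"user_id": 10, "item_id": 3, "rating": 0.5},
    {"user_id": 42, "item_id": 7, "rating": 1.0},
])

# factorize assigns 0-based integer codes in order of first appearance.
user_codes, user_labels = pd.factorize(ratings["user_id"])
item_codes, item_labels = pd.factorize(ratings["item_id"])
ratings["user_idx"] = user_codes
ratings["item_idx"] = item_codes


def extend_codes(labels, new_values):
    """Hypothetical helper: map new_values to integer positions,
    appending unseen labels after the existing ones so that old
    positions are preserved."""
    extended = labels.append(pd.Index(new_values)).unique()
    return extended.get_indexer(new_values), extended


# User 42 keeps position 1; the unseen user 99 gets the next position.
new_codes, extended_labels = extend_codes(user_labels, [42, 99])
```

Keeping the label `Index` around, rather than re-running `factorize` on the combined data, is what guarantees that positions already used by a trained model do not shift.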

Weird behaviour with groupby on ordered categorical columns

倖福魔咒の submitted on 2020-01-14 07:04:16
Question: MCVE: df = pd.DataFrame({ 'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'], 'ID': [1, 1, 1, 2, 2, 2] }) df.Cat = pd.Categorical( df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True) As you can see, I've defined an ordered categorical column on Cat. To verify, check: 0 SF 1 W 2 F 3 R64 4 SF 5 F Name: Cat, dtype: category Categories (4, object): [R64 < SF < F < W] I want to find the largest category per ID. Doing groupby + max works. df.groupby('ID').Cat.max() ID 1 W 2 F Name: Cat, dtype: object
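Note that the output above reports `dtype: object`, so the ordering information is lost in the result. A possible workaround, sketched here under the assumption that the goal is simply the largest category per ID with the categorical dtype retained, is to aggregate the integer category codes and convert back. The variable names `cat_type`, `max_codes`, and `largest` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Cat": ["SF", "W", "F", "R64", "SF", "F"],
    "ID": [1, 1, 1, 2, 2, 2],
})

# Ordered categorical dtype: R64 < SF < F < W
cat_type = pd.CategoricalDtype(["R64", "SF", "F", "W"], ordered=True)
df["Cat"] = df["Cat"].astype(cat_type)

# Each value's code is its position in the ordered category list,
# so the max code per group identifies the largest category.
max_codes = df["Cat"].cat.codes.groupby(df["ID"]).max()

# Convert the codes back to an ordered categorical Series.
largest = pd.Series(
    pd.Categorical.from_codes(max_codes.to_numpy(), dtype=cat_type),
    index=max_codes.index,
    name="Cat",
)
```

Whether plain `groupby(...).max()` preserves the categorical dtype depends on the pandas version, so this detour through codes is only needed where it does not.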

How to fit predefined offsets to models containing categorical variables in R

雨燕双飞 submitted on 2020-01-14 05:30:11
Question: Using the following data: http://pastebin.com/4wiFrsNg I am wondering how to fit a predefined offset to the raw relationship of another model, i.e. how to fit the estimates from Model A, thus: ModelA<-lm(Dependent1~Explanatory) to Model B, thus: ModelB<-lm(Dependent2~Explanatory) where the explanatory variable is either the variable "Categorical" in my dataset or the variable "Continuous". I got a useful answer related to a similar question on CV: https://stats.stackexchange.com/questions

How do I categorize my data for a datamining procedure?

给你一囗甜甜゛ submitted on 2020-01-13 11:49:10
Question: I am doing a data mining procedure using the apriori function. This function only works on categorical data, with text values rather than numbers. My dataset fulfills these requirements, as I have five categorical variables with text values only (for example, the variable 'sex' is categorized into 'female' and 'male'). If I now try the apriori() function, I get the following error: apriori(data) Error in asMethod(object) : column(s) 1, 2, 3, 4, 5 not logical or a factor. Use as.factor or

How to solve this problem in python jupyter notebook using deep learning

梦想与她 submitted on 2020-01-11 14:33:34
Question: I am trying to run my code, but this error occurs: TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType' Here is the code: data = np.asarray(data, dtype="float") / 255.0 labels = np.array(labels) print("Success") # partition the data into training and testing splits using 75% of # the data for training and the remaining 25% for testing (trainX, testX, trainY, testY) = train_test_split(data, labels, test_size=0.25, random_state=42) # this runs successfully but when

How Tensorflow handles categorical features with multiple inputs within one column?

我只是一个虾纸丫 submitted on 2020-01-11 04:06:07
Question: For example, I have data in the following csv format: csv col0 col1 col2 col3 1 A E|A|C 3 0 B D|F 2 2 C | 2 Each column, separated by a comma, represents one feature. Normally a feature is one-hot (e.g. col0, col1, col3), but in this case the feature for col2 has multiple inputs (separated by |). I'm sure TensorFlow can handle a one-hot feature with a sparse tensor, but I'm not sure whether it can handle features with multiple inputs like col2. How should it be represented in TensorFlow's sparse
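To make the representation question concrete, here is a minimal NumPy sketch of a multi-hot encoding for the `col2` values shown above, together with the (row, column) index pairs that a sparse representation would store. It does not use any TensorFlow API; the vocabulary `A` to `F` and the helper name `multi_hot` are assumptions made for illustration.

```python
import numpy as np

# Assumed vocabulary for col2 (not stated in the question).
vocab = ["A", "B", "C", "D", "E", "F"]
index = {token: i for i, token in enumerate(vocab)}


def multi_hot(cell):
    """Encode a '|'-separated cell as a 0/1 vector over the vocabulary."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for token in cell.split("|"):
        if token:  # skip the empty strings produced by a bare '|'
            vec[index[token]] = 1.0
    return vec


col2 = ["E|A|C", "D|F", "|"]
encoded = np.stack([multi_hot(cell) for cell in col2])

# (row, column) positions of the nonzero entries, i.e. the coordinate
# list that a sparse encoding of the same matrix would hold.
sparse_indices = np.argwhere(encoded)
```

A row with no tokens, such as the bare `|`, becomes an all-zero vector and contributes no entries to `sparse_indices`.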

How to run a regression which reports all factor variables?

寵の児 submitted on 2020-01-06 06:55:53
Question: I want to run a regression that calculates the estimated values for all levels of a factor variable. By default, Stata omits one dummy as a base level. When I use the allbaselevels option, it just shows a zero value for the base level: regress adjusted_volume i.rounded_time, allbaselevels SAS shows all the estimated values of categorical variables when the constant has been removed. How can I do the same thing in Stata? Answer 1: The option allbaselevels is one of several display options, which can

Generating a new binomial variable from existing variables

时光怂恿深爱的人放手 submitted on 2020-01-05 04:28:06
Question: Suppose I have the following data: Var1 = (1,1,0,1,0,1,0,1,1,0,1,1,0,0,0,1,0) Var2 = (1,0,0,1,1,0,0,1,0,1,0,1,1,1,0,1,1) Var3 = (0,0,0,1,1,1,0,0,1,0,1,0,0,0,1,0,0) Using if / else syntax in R, I need to create a new Var4, so that if var1=1 & var2=1 & var3=1 then var4=1, if var1=0 & var2=0 & var3=0 then var4=0, if var1=1 & var2=0 & var3=0 then var4=1, and so on. Basically, var4=0 only when all three variables are 0. Answer 1: We can cbind the 'var' vectors, get the rowSums and check if it is greater than 0,
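The row-sum logic described in the answer is not specific to R. As an illustration only, here is a Python/NumPy analogue of the same idea (this is not the R code from the answer): stack the three vectors as columns, sum each row, and flag the rows whose sum is greater than 0.

```python
import numpy as np

var1 = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0])
var2 = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1])
var3 = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0])

# Column-bind the vectors (the counterpart of cbind), sum each row
# (the counterpart of rowSums), and test whether the sum exceeds 0.
stacked = np.column_stack([var1, var2, var3])
var4 = (stacked.sum(axis=1) > 0).astype(int)
```

With these inputs, `var4` is 0 only at the two positions where all three variables are 0.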

Tensorflow embedding lookup with unequal sized lists

亡梦爱人 submitted on 2020-01-01 11:25:47
Question: Hej guys, I'm trying to project multi-labeled categorical data into a dense space using embeddings. Here's a toy example. Let's say I have four categories and want to project them into a 2D space. Furthermore, I have two instances, the first one belonging to category 0 and the second one to category 1. The code will look something like this: sess = tf.InteractiveSession() embeddings = tf.Variable(tf.random_uniform([4, 2], -1.0, 1.0)) sess.run(tf.global_variables_initializer()) y = tf.nn

Problems with a binary one-hot (one-of-K) coding in python

喜你入骨 submitted on 2019-12-31 10:03:53
Question: Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data-frame and I want to use this data to train a model with scikit-learn. I know two
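A minimal sketch of the coding described above, using `pandas.get_dummies`. The sample rows are illustrative, and this is not necessarily one of the two approaches the question goes on to discuss. The `color=unknown` column is dropped so that an unknown color is encoded as all zeros, matching the three-column layout in the example.

```python
import pandas as pd

# Illustrative sample data (not from the question).
df = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown", "red"]})

# One binary column per distinct value, named 'color=<value>'.
# astype(int) gives 0/1 integers regardless of the pandas default dtype.
dummies = pd.get_dummies(df["color"], prefix="color", prefix_sep="=").astype(int)

# Drop the 'unknown' indicator so unknown rows are all zeros.
dummies = dummies.drop(columns="color=unknown")
```

`get_dummies` orders the new columns by sorted category value, so the result here has the columns `color=blue`, `color=red`, `color=yellow`.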