categorical-data | 易学教程

Appending pandas DataFrame with MultiIndex with data containing new labels, but preserving the integer positions of the old MultiIndex

Question: Base scenario. For a recommendation service, I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the matrix factorization model to yield the best results, I need to map my user and item IDs to a contiguous range of integer IDs starting at 0. I'm using a pandas DataFrame in the process, and I have found a MultiIndex extremely convenient for creating this mapping, like so: ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0}, {'user_id': 1, 'item_id':
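
One way to keep previously assigned integer positions stable while new labels arrive is an append-only lookup table. The sketch below is a plain-Python illustration, not the MultiIndex approach the question describes; the helper name `extend_mapping` and the sample IDs are hypothetical.

```python
def extend_mapping(mapping, labels):
    """Assign the next free integer to every label not yet in mapping.

    Existing labels keep their integer; a new label receives
    len(mapping) at the moment it is first seen, so the assigned
    integers always form a contiguous range starting at 0.
    """
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


# Hypothetical user IDs from an initial batch of interactions.
user_ids = extend_mapping({}, [10, 42, 10, 7])
before = dict(user_ids)  # snapshot of the original positions

# A later batch contains one known ID (42) and two new ones (99, 3).
extend_mapping(user_ids, [42, 99, 3])

# Every ID from the first batch still maps to the same integer.
unchanged = all(user_ids[k] == v for k, v in before.items())
```

Because new labels are only ever appended, a model trained on the first batch can keep its existing rows, and only the new integers need fresh parameters.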

Weird behaviour with groupby on ordered categorical columns

Question: MCVE: df = pd.DataFrame({ 'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'], 'ID': [1, 1, 1, 2, 2, 2] }) df.Cat = pd.Categorical( df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True) As you can see, I've defined an ordered categorical column on Cat. To verify, check: 0 SF 1 W 2 F 3 R64 4 SF 5 F Name: Cat, dtype: category Categories (4, object): [R64 < SF < F < W] I want to find the largest category per ID. Doing groupby + max works: df.groupby('ID').Cat.max() ID 1 W 2 F Name: Cat, dtype: object
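
The setup in this excerpt can be laid out as a short script. This is a sketch that covers only the groupby + max step the question reports as working. The dtype of the result may differ between pandas versions, so only the values are inspected here.

```python
import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2],
})

# Declare the category order explicitly: R64 < SF < F < W.
df['Cat'] = pd.Categorical(
    df['Cat'], categories=['R64', 'SF', 'F', 'W'], ordered=True)

# Largest category per ID, using the declared order rather than
# alphabetical order.
largest = df.groupby('ID')['Cat'].max()
print(largest)
```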

How to fit predefined offsets to models containing categorical variables in R

Question: Using the following data: http://pastebin.com/4wiFrsNg I am wondering how to fit a predefined offset to the raw relationship of another model, i.e. how to fit the estimates from Model A, thus: ModelA<-lm(Dependent1~Explanatory) to Model B, thus: ModelB<-lm(Dependent2~Explanatory) where the explanatory variable is either the variable "Categorical" in my dataset or the variable "Continuous". I got a useful answer to a similar question on CV: https://stats.stackexchange.com/questions

How do I categorize my data for a datamining procedure?

Question: I am doing a data mining procedure using the apriori function. This function only works on categorical data, with text rather than numeric values. My dataset fulfills these requirements, as I have five categorical variables that contain only text (for example, the variable 'sex' takes the values 'female' and 'male'). If I now try the apriori() function, I get the following error: apriori(data) Error in asMethod(object) : column(s) 1, 2, 3, 4, 5 not logical or a factor. Use as.factor or

How to solve this problem in python jupyter notebook using deep learning

Question: I am trying to run the code below, but this error occurs: TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType' Here is the code: data = np.asarray(data, dtype="float") / 255.0 labels = np.array(labels) print("Success") # partition the data into training and testing splits using 75% of # the data for training and the remaining 25% for testing (trainX, testX, trainY, testY) = train_test_split(data, labels, test_size=0.25, random_state=42) #this is run successfully but when

How Tensorflow handles categorical features with multiple inputs within one column?

Question: For example, I have data in the following CSV format: col0 col1 col2 col3 1 A E|A|C 3 0 B D|F 2 2 C | 2 Each comma-separated column represents one feature. Normally, a feature is one-hot (e.g. col0, col1, col3), but in this case the feature for col2 has multiple inputs (separated by |). I'm sure TensorFlow can handle one-hot features with sparse tensors, but I'm not sure whether it can handle features with multiple inputs like col2. How should it be represented in Tensorflow's sparse
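
To make the multi-valued col2 concrete, the sketch below splits each cell on `|` and builds an (indices, values, dense_shape) triple, the same three pieces that `tf.SparseTensor` takes. It uses plain Python with no TensorFlow calls. The integer vocabulary is an assumption made for illustration, and the excerpt does not show whether the answer recommends this layout.

```python
# col2 values copied from the excerpt; "|" stands for an empty cell.
rows = ["E|A|C", "D|F", "|"]

vocab = {}      # token -> integer ID, assigned in order of appearance
indices = []    # (row, position) coordinates of every non-empty entry
values = []     # integer ID stored at the matching coordinate

for row_idx, cell in enumerate(rows):
    tokens = [t for t in cell.split("|") if t]  # drop empty pieces
    for pos, token in enumerate(tokens):
        token_id = vocab.setdefault(token, len(vocab))
        indices.append((row_idx, pos))
        values.append(token_id)

# The dense shape is (number of rows, longest token list).
max_len = max((pos + 1 for _, pos in indices), default=0)
dense_shape = (len(rows), max_len)
```

The third row contributes no coordinates at all, which is how a sparse layout represents an example with zero values for the feature.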

How to run a regression which report all factor variables?

Question: I want to run a regression that calculates the estimated values for all levels of a factor variable. By default, Stata omits one dummy as a base level. When I use the allbaselevels option, it just shows a zero value for the base level: regress adjusted_volume i.rounded_time, allbaselevels SAS shows all the estimated values of categorical variables when the constant has been removed. How can I do the same thing in Stata? Answer 1: The option allbaselevels is one of several display options, which can

Generating a new binomial variable from existing variables

Question: Suppose I have the following data: Var1 = (1,1,0,1,0,1,0,1,1,0,1,1,0,0,0,1,0) Var2 = (1,0,0,1,1,0,0,1,0,1,0,1,1,1,0,1,1) Var3 = (0,0,0,1,1,1,0,0,1,0,1,0,0,0,1,0,0) Using if / else syntax in R, I need to create a new Var4, so that if var1=1 & var2=1 & var3=1 then var4=1, if var1=0 & var2=0 & var3=0 then var4=0, if var1=1 & var2=0 & var3=0 then var4=1, and so on. Basically, var4=0 only when all three variables are 0. Answer 1: We can cbind the 'var' vectors, get the rowSums and check whether it is greater than 0,
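
The rule stated in the question (var4 is 0 only when all three inputs are 0) amounts to a row-wise "sum greater than 0" check. The sketch below is a Python rendering of that idea, not the R code from the answer; the data is copied from the excerpt.

```python
var1 = (1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0)
var2 = (1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1)
var3 = (0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0)

# For each position, add the three values and test whether the sum
# is positive; int() turns the resulting boolean into 0 or 1.
var4 = [int(a + b + c > 0) for a, b, c in zip(var1, var2, var3)]

# Positions where all three inputs are 0, and therefore var4 is 0.
zero_positions = [i for i, v in enumerate(var4) if v == 0]
```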

Tensorflow embedding lookup with unequal sized lists

Question: Hi guys, I'm trying to project multi-labeled categorical data into a dense space using embeddings. Here's a toy example. Let's say I have four categories and want to project them into a 2D space. Furthermore, I have two instances, the first one belonging to category 0 and the second one to category 1. The code will look something like this: sess = tf.InteractiveSession() embeddings = tf.Variable(tf.random_uniform([4, 2], -1.0, 1.0)) sess.run(tf.global_variables_initializer()) y = tf.nn

Problems with a binary one-hot (one-of-K) coding in python

Question: Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data frame and I want to use this data to train a model with scikit-learn. I know two
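
The coding described in this excerpt can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical, and this is not one of the two approaches the question goes on to mention (the excerpt is cut off before naming them). As in the excerpt's example, 'unknown' gets no column of its own, so it encodes as all zeros.

```python
def one_hot(values, categories, prefix):
    """Return one dict per value with a 0/1 entry for each category.

    Keys follow the 'prefix=category' naming used in the excerpt.
    A value that is not listed in categories yields all zeros.
    """
    return [
        {f"{prefix}={cat}": int(value == cat) for cat in categories}
        for value in values
    ]


colors = ['red', 'blue', 'yellow', 'unknown']
encoded = one_hot(
    colors, categories=['red', 'blue', 'yellow'], prefix='color')
```

Each resulting dict can serve as one row of the encoded table, with the column set fixed by `categories` rather than by whichever values happen to occur in the data.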