data-mining | 易学教程

package “fdapace” (R) - create a functional plot of the first principal component


Question: My question is about functional principal component analysis in R. I am working with a multi-dimensional time series looking something like this: My goal is to reduce the dimensions by applying functional PCA and then plot the first principal component like this: I have already used the FPCA function of the fdapace package on the dataset. Unfortunately, I don't understand how to interpret the resulting matrix of the FPCA estimates (xiEst). In my understanding the values of the Principal

Simplest feature selection algorithm


Question: I am trying to create my own simple feature selection algorithm. The data set that I am going to work with is here (a very famous data set). Can someone give me a pointer on how to do so? I am planning to write a feature-ranking algorithm for text classification. This is for sentiment analysis of movie reviews, classifying them as either positive or negative. So my question is how to write a simple feature selection method for a text data set. Answer 1: Feature selection methods are a big topic.
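As one possible starting point for the kind of feature ranking the question asks about, here is a minimal Python sketch. It is not the method from the answer (the excerpt is cut off before any method is named). It assumes a hypothetical scoring rule: a word is ranked by the absolute difference between the fraction of positive reviews and the fraction of negative reviews that contain it. The function name `rank_features`, the labels `"pos"`/`"neg"`, and the toy reviews are all made up for illustration.

```python
from collections import Counter


def rank_features(docs, labels, top_k=5):
    """Rank words by how unevenly they occur across the two classes.

    Assumes both classes are non-empty and labels are "pos" or "neg".
    """
    pos_df, neg_df = Counter(), Counter()
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos

    for text, y in zip(docs, labels):
        # Document frequency: count each word at most once per review.
        words = set(text.lower().split())
        (pos_df if y == "pos" else neg_df).update(words)

    scores = {}
    for word in set(pos_df) | set(neg_df):
        # Hypothetical score: gap between per-class document frequencies.
        scores[word] = abs(pos_df[word] / n_pos - neg_df[word] / n_neg)

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy data, invented for this sketch.
docs = [
    "a great moving film",
    "great acting and a great script",
    "a dull boring film",
    "boring plot and dull acting",
]
labels = ["pos", "pos", "neg", "neg"]

ranked = rank_features(docs, labels, top_k=3)
print(ranked)
```

Words that appear equally often in both classes (such as "film" here) score zero and drop to the bottom, which is the whole point of the ranking. Established scores such as chi-squared or mutual information follow the same pattern of scoring each word and keeping the top k.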

How do I create a new data table in Orange?


Question: I am using Orange (in Python) for some data mining tasks, more specifically for clustering. Although I have gone through the tutorial and read most of the documentation, I still have a problem. All the examples in the docs and tutorials assume that I have a tab-delimited table with data in it. However, nothing says how to create a new table from scratch. For example, I want to create a table of word frequencies across different documents. Maybe I am missing something

pandas pivot table rename columns


Question: How do I rename columns with multiple levels after a pandas pivot operation? Here's some code to generate test data:

    import pandas as pd
    df = pd.DataFrame({
        'c0': ['A','A','B','C'],
        'c01': ['A','A1','B','C'],
        'c02': ['b','b','d','c'],
        'v1': [1, 3,4,5],
        'v2': [1, 3,4,5]})
    print(df)

gives a test dataframe:

      c0 c01 c02  v1  v2
    0  A   A   b   1   1
    1  A  A1   b   3   3
    2  B   B   d   4   4
    3  C   C   c   5   5

Applying pivot

    df2 = pd.pivot_table(df, index=["c0"], columns=["c01","c02"], values=["v1","v2"])
    df2 = df2.reset_index()

gives
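The excerpt stops before the pivoted output and before any answer, so the following is only a sketch of one common way to get single-level column names, not necessarily what the original answer proposed. It rebuilds the question's dataframe and joins the parts of each multi-level column label with an underscore. The separator and the variable name `flat_names` are assumptions chosen for illustration.

```python
import pandas as pd

# Test data copied from the question.
df = pd.DataFrame({
    "c0": ["A", "A", "B", "C"],
    "c01": ["A", "A1", "B", "C"],
    "c02": ["b", "b", "d", "c"],
    "v1": [1, 3, 4, 5],
    "v2": [1, 3, 4, 5],
})

df2 = pd.pivot_table(df, index=["c0"], columns=["c01", "c02"], values=["v1", "v2"])
df2 = df2.reset_index()

# After the pivot every column label is a tuple such as ("v1", "A1", "b").
# The former index column becomes ("c0", "", ""), so empty parts are skipped.
df2.columns = [
    "_".join(str(part) for part in col if part != "")
    for col in df2.columns
]

flat_names = list(df2.columns)
print(flat_names)
```

Once the labels are plain strings, an ordinary `df2.rename(columns={...})` call can map any of them to a friendlier name.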

How to find frequent itemset irrespective of attribute name?


Question: I have a dataset (CSV file) in which I want to find frequent itemsets using the Apriori algorithm.

    col1, col2, col3
    bread, butter,?
    coke, bread, butter

I am using WEKA for this purpose. The output is in the following format:

    ...
    Large Itemsets L(2):
    col1=bread col2= butter 1
    col1=coke col2= bread 1
    col1=coke col3= butter 1
    col2= bread col3= butter 1
    ...

But the output that I want is:

    bread, butter 2

Basically, the desired output is independent of the column that the items belong to. How can I achieve this kind of
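To make the desired result concrete, here is a small Python sketch that treats each CSV row as an unordered set of items and counts pairs regardless of column. It is plain counting over the two rows from the question, not WEKA and not a full Apriori implementation (there is no support threshold or candidate pruning). The function name `count_itemsets` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Rows from the question; "?" marks a missing value.
rows = [
    ["bread", "butter", "?"],
    ["coke", "bread", "butter"],
]


def count_itemsets(rows, size=2, missing="?"):
    """Count itemsets of a given size, ignoring which column an item came from."""
    counts = Counter()
    for row in rows:
        # Drop missing values and column positions; sort so that
        # (bread, butter) and (butter, bread) map to the same key.
        items = sorted({value.strip() for value in row if value.strip() != missing})
        counts.update(combinations(items, size))
    return counts


pair_counts = count_itemsets(rows)
for itemset, count in pair_counts.most_common():
    print(", ".join(itemset), count)
```

The same idea carries over to a tool-based workflow: convert each row into a basket of items before mining, so that an item is identified by its value rather than by a `column=value` pair.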

When I convert a matrix into “transactions” for use with the arules package all of my values become 0


Question: I am trying to apply the apriori algorithm to a binary matrix, but all of my values are returned as 0. I ran the summary function on the matrix to confirm that it has non-zero values. I tried coercing it into the transactions form using: trans<-as(a,"transactions") and I tried applying apriori directly to the matrix using: test<-apriori(a,parameter=list(support=.02,confidence=0,minlen=3,maxlen=3)) In both cases I got the same result, seen below. Has anyone else experienced this? Thanks parameter

Drawbacks of K-Medoid (PAM) Algorithm


Question: I have read that the K-medoid algorithm (PAM) is a partition-based clustering algorithm and a variant of the K-means algorithm. It solves problems of K-means such as producing empty clusters and sensitivity to outliers/noise. However, the time complexity of K-medoid is O(n^2), unlike K-means (Lloyd's algorithm), which has a time complexity of O(n). I would like to ask whether there are other drawbacks of the K-medoid algorithm aside from its time complexity. Answer 1: The main disadvantage of K-Medoid

Compare two strings and find how closely they are related by meaning


Question: Problem: I have two strings, say, "Billie Jean" and "Thriller". I need to compare them programmatically and find how closely they are related. Both are songs by the same artist, so they should get a higher score (probability, percentage, etc.) than, say, "Brad Pitt" and "Jamaican Farewell". One way of doing this is an open source Java tool named WikipediaMiner, which compares them using the Wikipedia data dump, checking links, descriptions, etc. Question: Please suggest a better alternative,

How to create vector matrix of movie ratings using R project?


Question: Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73 It contains ratings in a file formatted as userID::movieID::rating::timestamp Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to a movie (if any). For example, if the data file contains

    1::1::1::10
    2::2::2::11
    1::2::3::12
    2::1::5::13
    3::3::4::14

then the output matrix would look like: UserID, Movie1,
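The question asks for R, and the excerpt is cut off before the expected matrix is shown. Purely to illustrate the reshaping step, here is a Python sketch that parses the five sample lines and builds a user-by-movie matrix, with `None` where a user has not rated a movie. The function name `build_rating_matrix` and the choice of `None` for missing ratings are assumptions, not part of the original post.

```python
# Sample lines from the question, in userID::movieID::rating::timestamp format.
lines = [
    "1::1::1::10",
    "2::2::2::11",
    "1::2::3::12",
    "2::1::5::13",
    "3::3::4::14",
]


def build_rating_matrix(lines):
    """Return (users, movies, matrix) with one row per user, one column per movie."""
    ratings = {}
    for line in lines:
        user, movie, rating, _timestamp = line.strip().split("::")
        ratings[(int(user), int(movie))] = float(rating)

    users = sorted({user for user, _ in ratings})
    movies = sorted({movie for _, movie in ratings})

    # Missing (user, movie) pairs are filled with None.
    matrix = [[ratings.get((user, movie)) for movie in movies] for user in users]
    return users, movies, matrix


users, movies, matrix = build_rating_matrix(lines)
for user, row in zip(users, matrix):
    print(user, row)
```

In R the same long-to-wide reshaping is usually done with a pivoting or cross-tabulation function after splitting each line on `::`.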

Does Data mining support other languages other than English?


Question: I am new to data mining. I would like to do some data mining on data that is not in English; the text is in Japanese or Chinese. Does data mining support these languages? If yes, how can we achieve this? Any tools or blogs would help. I would appreciate your help. Answer 1: The answer is, as usual: yes and no. While there are in fact no theoretical problems, there are some practical problems with Asian languages. A typical data mining pipeline for text consists of stemming (running -> run), removal of stop words