data-mining

Proposed NLP algorithm for text tagging

浪子不回头ぞ submitted on 2019-12-12 02:06:39
Question: I was looking for an open-source tool that can identify tags for any user post on social media and classify comments on that post as on-topic, off-topic, or spam. Even after looking for an entire day, I could not find a suitable tool or library. Here I propose my own algorithm for tagging user posts belonging to 7 categories (jobs, discussion, events, articles, services, buy/sell, talents). Initially, when a user makes a post, he tags it. Tags can be like marketing, suggestion,

R - twitteR package download of package ‘rjson’ failed

拥有回忆 submitted on 2019-12-12 01:49:35
Question: I am trying my hand at some data mining and attempting to retrieve data from Twitter. When I tried installing the package 'twitteR', I got the following warning:
Warning in install.packages : download of package ‘rjson’ failed
The rest of the packages load, however. Then, when I try to load the library:
> library(twitteR)
Loading required package: ROAuth
Loading required package: RCurl
Loading required package: bitops
Attaching package: ‘RCurl’
The following object is masked from ‘package:tm

How to do column wise intersection with itertools

爱⌒轻易说出口 submitted on 2019-12-12 01:47:27
Question: I am calculating the Jaccard similarity between every pair of my m training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), to form an m×m similarity matrix. I get a different outcome for the matrix. I have identified the source of the problem but do not have an optimized solution for it. A sample of the dataset is below:
ID AGE Occupation Gender Product_range Product_cat Product
1100 25-34 IT M 50-60 Gaming XPS 6610
1101 35-44
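One way to read "column-wise intersection" is to compare two records feature by feature, so that a value only counts as shared when it appears in the same column. The sketch below assumes that reading; the function names are hypothetical, and the second record in the usage note is invented for illustration because the sample in the question is cut off.

```python
from itertools import combinations


def jaccard_columnwise(a, b):
    """Jaccard similarity where a value only matches in the same column."""
    # Tag every value with its column index, e.g. (0, "25-34"), so that
    # identical strings in different columns are never treated as equal.
    sa = set(enumerate(a))
    sb = set(enumerate(b))
    return len(sa & sb) / len(sa | sb)


def similarity_matrix(rows):
    """Build the symmetric m x m matrix, computing each pair only once."""
    m = len(rows)
    sim = [[1.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        sim[i][j] = sim[j][i] = jaccard_columnwise(rows[i], rows[j])
    return sim


rows = [
    ("25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"),  # from the question
    ("25-34", "IT", "F", "50-60", "Office", "Latitude"),  # hypothetical row
]
sim = similarity_matrix(rows)
```

Here the two rows agree in three of six columns, so the intersection has 3 elements and the union has 9. `itertools.combinations` visits each unordered pair once, which halves the work compared with a full double loop.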

Cutting dendrogram at highest level of purity

坚强是说给别人听的谎言 submitted on 2019-12-11 23:34:03
Question: I am trying to create a program that clusters documents using hierarchical agglomerative clustering, and the output of the program depends on cutting the dendrogram at the level that gives maximum purity. The following is the algorithm I am working on right now:
Create dendrogram for the documents in the dataset
purity = 0
final_clusters
for all the levels, lvl, in the dendrogram
    clusters = cut dendrogram at lvl
    new_purity = calculate_purity_of(clusters)
    if new_purity > purity
        purity = new
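The loop in the question can be sketched in plain Python, assuming the cuts have already been produced by some clustering library and that every document has a known class label. `purity`, `best_cut`, and the toy data are hypothetical names and values used only for illustration.

```python
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        Counter(labels[doc] for doc in c).most_common(1)[0][1]
        for c in clusters
        if c  # skip empty clusters
    )
    return majority / total


def best_cut(cuts, labels):
    """Return (purity, clusters) for the first cut with the highest purity."""
    best_purity, best_clusters = 0.0, None
    for clusters in cuts:
        p = purity(clusters, labels)
        if p > best_purity:  # strict '>' keeps the earliest (coarsest) winner
            best_purity, best_clusters = p, clusters
    return best_purity, best_clusters


labels = {0: "a", 1: "a", 2: "b", 3: "b"}
cuts = [
    [[0, 1, 2, 3]],        # one cluster
    [[0, 1], [2, 3]],      # two clusters
    [[0], [1], [2], [3]],  # singletons
]
result = best_cut(cuts, labels)
```

Purity always reaches 1.0 when every document is its own cluster, so an unconstrained search over all levels is biased toward the finest cut. Iterating from the coarsest level and keeping the first maximum, as here, or capping the number of clusters, avoids that degenerate answer.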

Parallelization over for loop analyzing a data.frame

我只是一个虾纸丫 submitted on 2019-12-11 17:34:04
Question: These days I've been working with a data.frame of 8M records, and I need to speed up a loop that analyzes this data. I will describe each step of the problem I am trying to solve. First, I have to sort the whole data.frame in ascending order by three fields: ClientID, Date and Time. (Check) Then, using that sorted data.frame, I must compute the differences between consecutive observations, which can only be done when the ClientID is the same. For example: ClientID|Date(YMD)|Time

Python alternate way to find dendrogram

让人想犯罪 __ submitted on 2019-12-11 14:11:20
Question: I have data of dimension 8000x100. I need to cluster these 8000 items. I am more interested in the ordering of these items. I could get the desired result from the code below for small data, but for larger data I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object". Is there an alternate way to get the reordered column from "Z"?
from hcluster import pdist, linkage, dendrogram
import numpy
from numpy.random import rand
x = rand
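One recursion-free option is to walk the linkage matrix with an explicit stack. The sketch below assumes "Z" follows the usual linkage layout, where row i merges the clusters in its first two columns into a new cluster with id n + i and ids below n are the original items. `leaf_order` is a hypothetical helper, not part of hcluster.

```python
def leaf_order(Z):
    """Left-to-right leaf order of a linkage matrix, without recursion."""
    n = len(Z) + 1          # number of original observations
    order = []
    stack = [2 * n - 2]     # id of the root cluster (the last merge)
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)  # ids below n are original items
        else:
            left = int(Z[node - n][0])
            right = int(Z[node - n][1])
            # Push right first so the left child is processed first.
            stack.append(right)
            stack.append(left)
    return order


# Toy linkage: items 0 and 1 merge into cluster 3, then item 2 joins it.
Z = [[0, 1, 0.5, 2], [2, 3, 1.0, 3]]
order = leaf_order(Z)
```

Because the stack lives on the heap, the depth of the tree no longer matters. SciPy's `scipy.cluster.hierarchy.leaves_list` serves the same purpose if that library is available; raising the limit with `sys.setrecursionlimit` is another workaround, at the cost of deeper native stacks.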

Self-Join in SSAS

别来无恙 submitted on 2019-12-11 11:33:16
Question: I have a table like this:
PersonId  Job  City  ParentId
--------  ---  ----  --------
101       A    C1    105
102       B    C2    101
103       A    C1    102
I need to get the association rules between a person's job and the parent's city. I've used self-referencing and defined case/nested tables, but in the resulting dependency graph there is no difference between the person's job or city and the parent's job or city! What is the best solution for this problem in an SSAS project?
Answer 1: SSAS Hierarchies should address your problem.

Non overlapping pattern matching with gap constraint in python

廉价感情. submitted on 2019-12-11 10:38:44
Question: I want to find the total number of non-overlapping matches of a pattern in a sequence, with a gap constraint of 2. E.g., 2982f 2982l 2981l is a pattern found using some algorithm. I have to count how many times this pattern appears in a sequence such as 2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l … , where the max gap constraint is 2. A gap constraint of 2 means that a maximum of 2 other words is allowed between the items of the pattern 2982f 2982l 2981l. And the main thing is all these matches
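A minimal sketch of the counting step follows, under two assumptions that the truncated question does not confirm: the gap limit applies between each pair of consecutive pattern items, and "non-overlapping" means the spans of two matches may not share any position. Both function names are hypothetical.

```python
def earliest_end(seq, pattern, start, max_gap):
    """Smallest end index of a match that begins at `start`, or None."""
    if seq[start] != pattern[0]:
        return None
    positions = {start}  # indices where the previous pattern item can sit
    for item in pattern[1:]:
        reachable = set()
        for p in positions:
            # At most `max_gap` other words may sit between two items.
            for j in range(p + 1, min(p + max_gap + 2, len(seq))):
                if seq[j] == item:
                    reachable.add(j)
        if not reachable:
            return None
        positions = reachable
    return min(positions)


def count_non_overlapping(seq, pattern, max_gap=2):
    """Count matches whose spans do not overlap (earliest-finish selection)."""
    spans = []
    for i in range(len(seq)):
        end = earliest_end(seq, pattern, i, max_gap)
        if end is not None:
            spans.append((end, i))
    spans.sort()  # interval scheduling: always take the span ending first
    count, last_end = 0, -1
    for end, start in spans:
        if start > last_end:
            count += 1
            last_end = end
    return count


seq = "2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l".split()
pattern = "2982f 2982l 2981l".split()
total = count_non_overlapping(seq, pattern, max_gap=2)
```

Tracking every reachable position for each pattern item avoids the failure mode of a purely greedy scan, which can commit to an early occurrence and miss a valid later one. On the complete part of the example sequence this yields two matches: one ending at the first 2981l and one covering the last three words.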

R-convert transaction format dataset to basket format for sequence mining

左心房为你撑大大i submitted on 2019-12-11 08:34:31
Question: ORIGINAL TABLE
CELL NUMBER  ACTIVITY  TIME
001          call a    12.23
002          call b    01.00
002          call d    01.09
001          call b    12.25
003          call a    12.23
002          call a    02.07
003
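The target layout is cut off in the question, so the sketch below assumes that "basket format" means one entry per cell number holding its activities in time order. It is written in Python rather than R, uses only the complete rows of the table, and treats the TIME values as plain numbers, which is also an assumption. `to_baskets` is a hypothetical name.

```python
from collections import defaultdict


def to_baskets(rows):
    """Group (cell, activity, time) rows into time-ordered activity lists."""
    grouped = defaultdict(list)
    for cell, activity, time in rows:
        # Assumption: times such as "12.23" sort correctly as floats.
        grouped[cell].append((float(time), activity))
    return {
        cell: [activity for _, activity in sorted(events)]
        for cell, events in sorted(grouped.items())
    }


rows = [
    ("001", "call a", "12.23"),
    ("002", "call b", "01.00"),
    ("002", "call d", "01.09"),
    ("001", "call b", "12.25"),
    ("003", "call a", "12.23"),
    ("002", "call a", "02.07"),
]
baskets = to_baskets(rows)
```

Each value in `baskets` is an ordered sequence for one cell number, which is the shape most sequence-mining tools expect before any tool-specific encoding.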

Input arff file for Weka Apriori

≯℡__Kan透↙ submitted on 2019-12-11 07:22:14
Question: I am trying to do association mining on version history. My transaction data is in MySQL. The Weka Apriori algorithm requires an ARFF or CSV file in a certain format: it has to have a column for each item, with the value specified as TRUE or FALSE for each item in a transaction. I am looking for a way to create this file using Weka InstanceQuery. Also, what are the options if the transaction data is huge?
Answer 1: I can answer the second part: the options if the transaction data is huge. Weka is a
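For the first part of the question, one route that does not use Weka InstanceQuery is to export the transactions from the database and write the ARFF text directly. The sketch below assumes each transaction is already available as a list of item names; `transactions_to_arff` and the file names in the usage note are hypothetical.

```python
def transactions_to_arff(transactions, relation="transactions"):
    """Render transactions as ARFF text with one TRUE/FALSE column per item."""
    # One nominal attribute per distinct item, in a stable sorted order.
    items = sorted({item for t in transactions for item in t})
    lines = ["@relation " + relation, ""]
    for item in items:
        # Names are single-quoted; items containing quotes would need escaping.
        lines.append("@attribute '" + item + "' {TRUE,FALSE}")
    lines += ["", "@data"]
    for t in transactions:
        present = set(t)
        lines.append(
            ",".join("TRUE" if item in present else "FALSE" for item in items)
        )
    return "\n".join(lines) + "\n"


# Hypothetical commits, each listing the files changed together.
arff_text = transactions_to_arff([["a.java", "b.java"], ["b.java"]])
```

This dense layout needs one value per item per transaction, so the file grows with the product of the two counts, which is worth keeping in mind when the version history is large.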