data-mining

What is an intuitive explanation of the Expectation Maximization technique? [closed]

Submitted by 北慕城南 on 2019-12-29 10:08:51
Question (closed as needing more focus; not currently accepting answers): Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong about it not being a classifier. What is an intuitive explanation of this EM technique? What is the expectation here, and what is being maximized?

Answer 1: Note: the code behind this
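The two names in the question can be shown directly in a minimal sketch of EM for a 1-D mixture of two Gaussians. This is illustrative only: the data, initial values, and component count below are assumptions, not taken from the question or from the answer's code.

```python
import numpy as np

# Synthetic data: two overlapping 1-D Gaussian clusters (assumed example).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])

# Initial guesses for the means, variances, and mixing weights.
mu = np.array([0.5, 4.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(50):
    # E-step ("expectation"): compute soft responsibilities, i.e. the
    # probability that each point was generated by each component.
    dens = (pi / np.sqrt(2 * np.pi * var)) * np.exp(
        -(data[:, None] - mu) ** 2 / (2 * var))
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step ("maximization"): re-estimate the parameters that maximize
    # the expected complete-data log-likelihood under those responsibilities.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(data)
```

So EM is a soft-clustering/parameter-estimation method rather than a supervised classifier: the "expectation" is over the hidden cluster assignments, and what is "maximized" is the likelihood of the parameters given those expected assignments.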

clustering and matlab

Submitted by 被刻印的时光 ゝ on 2019-12-29 06:16:15
Question: I'm trying to cluster some data from the KDD 1999 Cup dataset. The output from the file looks like this: 0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal. There are 48 thousand different records in that format. I have cleaned the data up and removed the text, keeping only the numbers. The output looks like this now: I created a comma-delimited file in Excel, saved it as a CSV file, then created a
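As a language-agnostic illustration of the clustering step itself (in Python rather than MATLAB, and on synthetic blobs rather than the actual KDD records), a plain k-means loop over numeric rows can be sketched as:

```python
import numpy as np

# Illustrative k-means sketch; the data here is synthetic -- two
# well-separated 5-dimensional blobs -- not the cleaned KDD records.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(8, 1, (100, 5))])

k = 2
centers = X[[0, -1]].copy()  # one seed point from each blob (assumption)
for _ in range(20):
    # Assignment step: each row goes to its nearest center.
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    # Update step: each center moves to the mean of its assigned rows.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```

The same two alternating steps are what MATLAB's built-in clustering does internally once the CSV has been loaded into a numeric matrix.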

random unit vector in multi-dimensional space

Submitted by 左心房为你撑大大i on 2019-12-28 05:59:27
Question: I'm working on a data mining algorithm where I want to pick a random direction from a particular point in the feature space. If I pick a random number for each of the n dimensions from [-1, 1] and then normalize the vector to a length of 1, will I get an even distribution across all possible directions? I'm speaking only theoretically here, since computer-generated random numbers are not actually random.

Answer 1: One simple trick is to select each dimension from a Gaussian distribution, then
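The answer's trick works because the multivariate standard normal is rotationally symmetric, so normalizing a vector of N(0, 1) samples yields a direction uniform on the sphere; normalizing samples from the [-1, 1] cube instead biases directions toward the cube's corners. A short sketch:

```python
import numpy as np

def random_unit_vector(n, rng=None):
    # N(0, 1) in every coordinate is rotationally symmetric, so the
    # normalized vector is uniformly distributed over all directions.
    rng = rng or np.random.default_rng()
    v = rng.normal(size=n)
    return v / np.linalg.norm(v)

u = random_unit_vector(10)
```

(In principle a zero vector would need re-sampling, but with continuous Gaussian draws that event has probability zero.)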

convert http request to kdd cup data format with 41 parameters

Submitted by 戏子无情 on 2019-12-26 10:01:10
Question: Machine learning was done using the KDD Cup dataset, which formed a trained dataset. Now I have to check real-time requests against the trained dataset; for that I have to convert TCP dump data or HTTP requests to the KDD Cup dataset format (with 41 parameters). My question is: how can I do this conversion?

Answer 1: IIRC, the process by which the features of the flawed KDD Cup dataset were derived is not well documented. It does not reflect real attacks anyway. Running it on recent data does not

How to index with ELKI - OPTICS clustering

Submitted by 跟風遠走 on 2019-12-25 14:24:10
Question: I'm an ELKI beginner, and I've been using it to cluster around 10K lat/lon points from a .csv file. Once I get my settings correct, I'd like to scale up to 1 million points. I'm using the OPTICSXi algorithm with LngLatDistanceFunction. I keep reading about "enabling an R*-tree index with STR bulk loading" in order to see vast improvements in performance. The tutorials haven't helped me much. Any tips on how I can implement this feature?

Answer 1: The suggested parameters for using a spatial R* index on 2

How could we know the ColumnName /attribute of items generated in Rules

Submitted by 夙愿已清 on 2019-12-25 07:15:19
Question: Using the arules package, 'apriori' returns a 'rules' object. How can we query which exact column the item(s) in a rule's {lhs, rhs} come from? Example: I have some tabular data in the file "input.csv" and want to associate/interpret the returned rule itemsets with the column headers in the file. How can I possibly do that? Any pointers are appreciated. Thanks. A reproducible example, input.csv:

ABC,DEF,GHI,JKL,MNO
11,56789,1,0,10
12,57685,0,0,10
11,56789,0,1,11
10,57689,1,0
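The usual way to keep the column of origin is to label each item as "column=value" before mining, so every item in a rule carries its column name (this is the same labeling arules produces when a data frame of factors is coerced to transactions). A language-agnostic sketch of that preprocessing step in Python, using the first rows of the question's input.csv:

```python
import csv
import io

# Label every item as "column=value" so mined rules stay traceable to
# their source column. The data mirrors the question's input.csv.
raw = """ABC,DEF,GHI,JKL,MNO
11,56789,1,0,10
12,57685,0,0,10
11,56789,0,1,11
"""
reader = csv.DictReader(io.StringIO(raw))
transactions = [{f"{col}={val}" for col, val in row.items()} for row in reader]
print(sorted(transactions[0]))
# ['ABC=11', 'DEF=56789', 'GHI=1', 'JKL=0', 'MNO=10']
```

After this transformation an itemset such as {ABC=11, GHI=1} answers the question directly: the item "11" came from column ABC.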

Apriori Algorithm- frequent item set generation

Submitted by 时间秒杀一切 on 2019-12-25 06:57:58
Question: I am using the Apriori algorithm to identify a customer's frequent item sets. Based on the identified frequent item sets, I want to suggest items to the customer when the customer adds a new item to his shopping list. As the frequent item sets I got the following result:

[1], [3], [2], [5]
[2,3], [3,5], [1,3], [2,5]
[2,3,5]

My problem is: if I consider only the [2,3,5] set to make suggestions to the customer, am I wrong? I.e., if the customer adds item 3 to his shopping list, I would recommend item 2 and item 5.
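Restricting suggestions to the single maximal set [2,3,5] would miss candidates like item 1 (frequent with 3 via [1,3]); normally one scores companions using all frequent itemsets that contain the current basket. A sketch of that idea, where the support values are made up for illustration and are not from the question:

```python
# Hypothetical supports (assumed, not from the question) for the
# frequent itemsets listed above.
frequent = {
    frozenset([1]): 0.5, frozenset([2]): 0.6, frozenset([3]): 0.6,
    frozenset([5]): 0.6, frozenset([2, 3]): 0.4, frozenset([3, 5]): 0.4,
    frozenset([1, 3]): 0.3, frozenset([2, 5]): 0.4,
    frozenset([2, 3, 5]): 0.3,
}

def suggest(basket):
    # Score every item that co-occurs with the basket in some frequent
    # itemset, ranked by the best support seen for that item.
    scores = {}
    for items, support in frequent.items():
        if basket < items:  # basket is a proper subset of the itemset
            for extra in items - basket:
                scores[extra] = max(scores.get(extra, 0.0), support)
    return sorted(scores, key=scores.get, reverse=True)

print(suggest({3}))  # [2, 5, 1] -- 2 and 5 rank first, but 1 also appears
```

So recommending 2 and 5 for item 3 is reasonable, but deriving the ranking from all frequent itemsets (or from association rules with confidence thresholds) keeps weaker companions like item 1 available too.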

Easy way to fill in missing data

Submitted by 前提是你 on 2019-12-25 06:48:10
Question: I have a table with results from an optimization algorithm, for 100 runs. X represents the time and is only stored when an improvement is found, so I have missing x-es:

x1 ; y1  ; x2 ; y2
1  ; 100 ; 1  ; 150
4  ; 90  ; 2  ; 85
7  ; 85  ; 10 ; 60
10 ; 80  ;

This is just a CSV. I am looking for a method to easily process this, as I want to calculate averages at each x-value. So the average at x = 4 needs to take into account that for run 2, y at 4 is 85. Is there an easy way to do this with Excel? Or read it
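Since y only changes when an improvement is stored, each run is a step function: its value at any x is the last recorded y at or before x. One way to compute the average at a given x (sketched in Python rather than Excel, using the two runs from the table above):

```python
import bisect

# Each run as (x, y) pairs; y at time x is the last recorded y at or
# before x. Values taken from the question's table.
run1 = [(1, 100), (4, 90), (7, 85), (10, 80)]
run2 = [(1, 150), (2, 85), (10, 60)]

def value_at(run, x):
    xs = [p[0] for p in run]
    i = bisect.bisect_right(xs, x) - 1  # last index with xs[i] <= x
    return run[i][1]

runs = [run1, run2]
avg_at_4 = sum(value_at(r, 4) for r in runs) / len(runs)
print(avg_at_4)  # (90 + 85) / 2 = 87.5
```

In Excel the equivalent trick is to forward-fill each run onto a common x grid (e.g. with a LOOKUP for the largest stored x not exceeding the grid value) and then average across the filled columns.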

Pagerank Personalization vector , edge weights and dangling dictionary (teleportation vector)

Submitted by こ雲淡風輕ζ on 2019-12-25 06:24:04
Question: This is the pagerank function from networkx:

def pagerank(G, alpha=0.85, personalization=None, max_iter=100, tol=1.0e-6, nstart=None, weight='weight', dangling=None):

I am confused by personalization and weight. I understand that when the personalization vector is not provided, a uniform vector is used, and that when weight is not provided, an edge weight of 1 is used. I have been reading about edge weight personalization and node weight personalization: http://www.cs.cornell.edu/~bindel/present/2015-08
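What the personalization vector does can be seen in a bare power-iteration sketch (this is a minimal illustration, not networkx's implementation): it replaces the uniform teleportation distribution, so the (1 - alpha) random jumps, and by default the mass leaving dangling nodes, land on nodes in proportion to it rather than uniformly.

```python
def pagerank(adj, alpha=0.85, personalization=None, iters=100):
    # adj maps each node to its list of out-neighbors (unweighted here,
    # i.e. every edge has weight 1, matching the networkx default).
    nodes = list(adj)
    n = len(nodes)
    # Uniform teleportation when no personalization is given.
    p = personalization or {v: 1 / n for v in nodes}
    rank = {v: 1 / n for v in nodes}
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        # Mass sitting on dangling (no-outlink) nodes.
        dangle = sum(rank[v] for v in nodes if not adj[v])
        for v in nodes:
            for w in adj[v]:
                new[w] += alpha * rank[v] / len(adj[v])
        # Teleportation and dangling mass are both spread according to p.
        for v in nodes:
            new[v] += (1 - alpha) * p[v] + alpha * dangle * p[v]
        rank = new
    return rank

adj = {"a": ["b"], "b": ["a", "c"], "c": []}  # c is a dangling node
r = pagerank(adj, personalization={"a": 1.0, "b": 0.0, "c": 0.0})
```

With this personalization every random jump returns to node "a", so the ranking is biased toward "a" and its out-neighborhood; the separate `dangling` argument in networkx lets you redistribute the dangling mass with a different vector than `personalization`.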