data-mining

How to scrape all the text content of a website's home page?

Submitted by 早过忘川 on 2020-04-17 19:04:00
Question: I am new to web scraping, and I want to scrape all the text content of just the home page. This is my code, but it is not working correctly:

```python
from bs4 import BeautifulSoup
import requests

website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
```

When I print full_text it gives me a lot of HTML content, but not all of it: when I Ctrl+F for "traiteurcheminfaisant@hotmail.com", the email address is not found in the output.
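A likely fix, continuing the snippet above as a hedged sketch: BeautifulSoup's get_text() flattens the parsed document to its visible text. Note that anything injected by JavaScript after page load will still be missing, since requests fetches only the raw HTML.

```python
# get_text() returns only the human-readable text of the parsed page.
print(soup.get_text(separator=" ", strip=True))
```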

Finding key phrases using the tm package in R

Submitted by 南楼画角 on 2020-02-25 03:06:48
Question: I have a project requiring me to search the annual reports of various companies and find key phrases in them. I have converted the reports to text files, created and cleaned a corpus, and then built a document-term matrix. The tm_term_score function only seems to work for single words, not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent ones)? For example, I want to see how many times the phrase “supply chain finance” appears in each document in the corpus.
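One way to sidestep the single-word limitation is to count a fixed phrase directly on the raw text with base R. A minimal sketch (the documents and phrase below are toy stand-ins):

```r
# Count occurrences of a fixed phrase in each document of a character vector.
count_phrase <- function(docs, phrase) {
  sapply(docs, function(txt) {
    hits <- gregexpr(phrase, tolower(txt), fixed = TRUE)[[1]]
    if (hits[1] == -1) 0L else length(hits)
  })
}

docs <- c("Supply chain finance grew; supply chain finance is key.",
          "Nothing relevant here.")
count_phrase(docs, "supply chain finance")  # 2 0
```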

Refitting clusters around fixed centroids

Submitted by 雨燕双飞 on 2020-02-02 19:35:30
Question: A clustering/classification problem: I used k-means clustering to generate these clusters and centroids. This is the dataset with the cluster attribute added from the initial run:

```r
> dput(sampledata)
structure(list(Player = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), Metric.1 = c(0.3938961, 0.28062338,
0.32532626, 0.29239642, 0.25622558), Metric.2 = c(0.00763359,
0.01172354, 0.40550867, 0.04026846, 0.05976367), Metric.3 = c(0.50766075,
0.20345662, 0.06267444, 0.08661417, ...
```
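The title asks about re-assigning observations to centroids that are held fixed; kmeans() in R always re-estimates centroids, so a tiny hand-rolled assignment step is one option. A minimal sketch (the data and centroid values below are illustrative, not taken from the question):

```r
# Assign each row of X to its nearest fixed centroid (squared Euclidean distance).
assign_to_centroids <- function(X, centroids) {
  apply(X, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
}

X <- cbind(Metric.1 = c(0.39, 0.28, 0.33), Metric.2 = c(0.008, 0.012, 0.406))
centroids <- rbind(c(0.30, 0.05), c(0.35, 0.40))  # centroids fixed from the initial run
assign_to_centroids(X, centroids)                 # returns 1 1 2
```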

How to choose the right normalization method for the right dataset?

Submitted by 拜拜、爱过 on 2020-01-24 22:42:40
Question: There are several normalization methods to choose from: L1/L2 norm, z-score, min-max. Can anyone give some insight into how to choose the proper normalization method for a dataset? I didn't pay much attention to normalization before, but I just had a small project whose performance was heavily affected not by the parameters or the choice of ML algorithm, but by the way I normalized the data. That came as a surprise to me, but it may be a common problem in practice. So, could anyone…
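For concreteness, here are the three candidates on a toy vector; which one is appropriate depends on the downstream algorithm (distance-based methods want comparable feature scales, outlier-heavy features argue against min-max, and L2 normalization is usually applied per sample rather than per feature). A minimal sketch:

```r
x <- c(2, 5, 9, 14)

minmax <- (x - min(x)) / (max(x) - min(x))  # rescales to [0, 1]; sensitive to outliers
zscore <- (x - mean(x)) / sd(x)             # mean 0, sd 1; suits roughly Gaussian features
l2     <- x / sqrt(sum(x^2))                # unit Euclidean length; keeps direction only
```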

Import ARFF dataset using RWeka in RStudio (dependency error: rJava)

Submitted by 痞子三分冷 on 2020-01-24 22:13:09
Question: I am currently using R for Windows version 3.5.3 and RStudio version 1.2.1335. My goal is to import an ARFF dataset using the RWeka package in order to do some association analysis, more specifically, to apply the Apriori algorithm. I want to analyze a dataset (.ARFF) in R and, for convenience, I am using the RWeka package, as my goal is to apply the Apriori algorithm, one of the associators available in that package. That package requires some dependencies (RWekajars and rJava) and they…
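A common resolution path, as a hedged sketch: rJava failures on Windows are usually caused by a missing or mismatched Java installation (64-bit R needs a 64-bit JDK, and JAVA_HOME must point at it). The JDK path and ARFF filename below are examples, not prescriptions:

```r
# Point R at the JDK before rJava is loaded (adjust the path to your install).
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_201")

install.packages(c("rJava", "RWeka"))
library(RWeka)

dataset <- read.arff("dataset.arff")  # RWeka ships its own ARFF reader
rules   <- Apriori(dataset)           # Weka's Apriori associator
```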

Naive Bayes for topic detection using the “Bag of Words” approach

Submitted by 那年仲夏 on 2020-01-22 04:25:29
Question: I am trying to implement a naive Bayes approach to find the topic of a given document or stream of words. Is there a naive Bayes approach that I might be able to look up for this? Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones that are already mapped, I want to add them to the mappings, hence…
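To show the mechanics, a minimal hand-rolled multinomial naive Bayes scorer in R (the topics, priors, word likelihoods, and smoothing constant are all toy numbers for illustration):

```r
topics <- c("sports", "finance")
prior  <- c(sports = 0.5, finance = 0.5)
lik <- list(  # P(word | topic), hard-coded like the question's initial dictionary
  sports  = c(game = 0.30, team = 0.30, market = 0.01),
  finance = c(game = 0.01, team = 0.02, market = 0.40)
)

# Log-posterior (up to a constant) of a bag of words under one topic.
score_topic <- function(words, topic, unseen = 1e-3) {
  p <- lik[[topic]][words]
  p[is.na(p)] <- unseen          # crude smoothing for words not yet in the dictionary
  log(prior[[topic]]) + sum(log(p))
}

doc <- c("team", "game", "game")
sapply(topics, function(t) score_topic(doc, t))  # "sports" scores higher
```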

What is the difference between Gradient Descent and Newton's Gradient Descent?

Submitted by 有些话、适合烂在心里 on 2020-01-22 04:09:37
Question: I understand what gradient descent does: basically, it tries to move towards the local optimal solution by slowly moving down the curve. I am trying to understand the actual difference between plain gradient descent and Newton's method. From Wikipedia, I read this short line: "Newton's method uses curvature information to take a more direct route." What does this mean intuitively?

Answer 1: At a local minimum (or maximum) x, the derivative of the target function f vanishes: f'(x) = 0…
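To make the curvature remark concrete: gradient descent updates x ← x − α f′(x) with a fixed step size α, while Newton's method updates x ← x − f′(x) / f″(x), rescaling each step by the local curvature. A toy 1-D comparison in R (the function, starting point, and learning rate are chosen for illustration):

```r
f <- function(x) x^4 - 3 * x^3 + 2   # toy objective; minimum at x = 2.25
g <- function(x) 4 * x^3 - 9 * x^2   # f'
h <- function(x) 12 * x^2 - 18 * x   # f'' (curvature)

x_gd <- x_nt <- 3                    # identical starting points
for (i in 1:20) {
  x_gd <- x_gd - 0.01 * g(x_gd)      # gradient descent: fixed learning rate
  x_nt <- x_nt - g(x_nt) / h(x_nt)   # Newton: step divided by curvature
}
c(gradient_descent = x_gd, newton = x_nt)  # Newton sits at 2.25 after a few steps
```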

GBM R function: get variable importance separately for each class

Submitted by 白昼怎懂夜的黑 on 2020-01-20 16:48:06
Question: I am using the gbm function in R (gbm package) to fit stochastic gradient boosting models for multiclass classification. I am simply trying to obtain the importance of each predictor separately for each class, as in the figure on p. 382 of the Hastie book (The Elements of Statistical Learning). However, the function summary.gbm only returns the overall importance of the predictors (their importance averaged over all classes). Does anyone know how to get the relative importance values?
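A hand-rolled sketch of one possible approach: walk the individual trees and sum the error reduction credited to each split variable, per class. It assumes gbm stores one tree per class per boosting iteration, ordered class-by-class within each iteration — worth verifying against your gbm version before trusting the numbers (the column names here are generic placeholders):

```r
library(gbm)

per_class_importance <- function(fit, n.trees = fit$n.trees) {
  K   <- fit$num.classes
  imp <- matrix(0, nrow = length(fit$var.names), ncol = K,
                dimnames = list(fit$var.names, paste0("class_", seq_len(K))))
  for (i in seq_len(n.trees)) {
    for (k in seq_len(K)) {
      tree <- pretty.gbm.tree(fit, i.tree = (i - 1) * K + k)
      for (r in which(tree$SplitVar >= 0)) {   # SplitVar == -1 marks leaf nodes
        v <- tree$SplitVar[r] + 1              # 0-based -> 1-based variable index
        imp[v, k] <- imp[v, k] + tree$ErrorReduction[r]
      }
    }
  }
  imp
}
```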

Data mining term “fledged”?

Submitted by 一曲冷凌霜 on 2020-01-16 18:18:08
Question: Please tell me, what is the term "full fledged KI"? As I understand it, it is part of data mining for text analysis. Am I right? Some interesting and useful links would be appreciated. Thank you!

Answer 1: By "full fledged", he likely means "fully fledged", defined as "developed or matured to the fullest degree; of full rank or status" (source: thefreedictionary.com). Not sure about KI, but possibly it means: http://en.wikipedia.org/wiki/Knowledge_integration

Answer 2: My guess is that it is a typo of AI or a near-synonym,…

R, DMwR-package, SMOTE-function won't work

Submitted by 给你一囗甜甜゛ on 2020-01-13 10:08:16
Question: I need to apply the SMOTE algorithm to a data set, but I can't get it to work. Example:

```r
x <- c(12, 13, 14, 16, 20, 25, 30, 50, 75, 71)
y <- c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
frame <- data.frame(x, y)
library(DMwR)
smotedobs <- SMOTE(y ~ ., frame, perc.over = 300)
```

This gives the following error:

```
Error in scale.default(T, T[i, ], ranges) : subscript out of bounds
In addition: Warning messages:
1: In FUN(newX[, i], ...) : no non-missing arguments to max; returning -Inf
2: In FUN(newX[, i], ...) : no non-missing arguments…
```
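A hedged diagnosis: DMwR's SMOTE() expects the target to be a factor, and with only two minority-class rows ("0") the default of k = 5 nearest neighbours cannot be found. A sketch of a call that should run on this toy data (k = 1 is an illustrative choice for the tiny example, not a general recommendation):

```r
library(DMwR)

x <- c(12, 13, 14, 16, 20, 25, 30, 50, 75, 71)
y <- factor(c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1))   # target must be a factor
frame <- data.frame(x, y)

# Only two minority rows exist, so k must be smaller than the default of 5.
smotedobs <- SMOTE(y ~ ., frame, perc.over = 300, k = 1)
```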