data-analysis

Keyword search between two DataFrames using python pandas

Submitted by 喜夏-厌秋 on 2020-01-06 06:09:22
Question: Hi, I have two DataFrames like below.

DF1:

Alpha | Numeric | Special
and, or | 1,2,3,4,5 | @,$,&

DF2, with a single column:

Content
boy or girl
school @ morn

I want to check whether any column of DF1 contains any of the keywords found in the Content column of DF2, and the output should go into a new DataFrame output_DF:

output_column
Alpha
Special

Can someone help me with this?

Answer 1: The solution is a bit complicated, because for a multiple match (row 2) only the first matched column is needed. df1: df1 = pd.DataFrame({
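
One straightforward way to approach this (a minimal sketch, not the quoted answer; the frame and column names below mirror the question and the comma-separated keyword cells are an assumption) is to turn each DF1 column into a keyword set and test each Content row against those sets:

```python
import pandas as pd

# Hypothetical data matching the question's sketch.
df1 = pd.DataFrame({"Alpha": ["and, or"],
                    "Numeric": ["1,2,3,4,5"],
                    "Special": ["@,$,&"]})
df2 = pd.DataFrame({"Content": ["boy or girl", "school @ morn"]})

# Build one keyword set per DF1 column.
keywords = {col: {k.strip() for cell in df1[col].dropna()
                  for k in str(cell).split(",")}
            for col in df1.columns}

def matching_columns(text):
    """Return the DF1 columns whose keywords appear as tokens in `text`."""
    tokens = set(text.split())
    return [col for col, kws in keywords.items() if tokens & kws]

df2["output_column"] = df2["Content"].apply(matching_columns)
print(df2)
```

This simplified version lists every matching column per row; picking only the first match (as the quoted answer does) would be one extra step over that list.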

Weka prediction (percentage confidence) - what does it mean?

Submitted by 与世无争的帅哥 on 2020-01-06 03:00:10
Question: I've been teaching myself Weka and have learned how to build models and get predictions out of them (predictions using the CLI). When I run prediction on a data set with a previously built model, I get a column that is the "prediction", also known as the prediction confidence, for each predicted instance. I know what percent confidence means, but shouldn't all my predictions match the accuracy of my Weka model? I.e., if I have a J48 decision tree classifier with an accuracy of 90%, shouldn't every classified
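
The two numbers measure different things: accuracy is an aggregate over a whole test set, while the per-instance value is the probability the model assigns to its predicted class (for J48, roughly the class distribution at the leaf the instance reaches). A scikit-learn sketch of the same distinction (an analogy only, not Weka's API):

```python
# Overall accuracy is one aggregate number over the test set, while each
# prediction carries its own class probability ("confidence").
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
print("overall accuracy:", tree.score(X_te, y_te))                 # one number for the whole set
print("per-instance confidence:", tree.predict_proba(X_te[:3]).max(axis=1))  # varies per row
```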

Crawl only content from multiple different websites

Submitted by Deadly on 2020-01-05 04:37:22
Question: I am currently working on a project where I want to analyze articles from different blogs, magazines, etc. published online on their websites. I have already built a web crawler in Python that fetches every new article as HTML. Here is the point: I want to analyze the pure content (only the article, without comments, recommendations, etc.), but I can't access this content without defining a regular expression to extract it from the HTML response I get.
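
Rather than hand-written regexes per site, boilerplate-removal libraries try to extract the main article text from arbitrary HTML. A minimal sketch with newspaper3k (the library choice and the URL are assumptions, since the question only says the crawler returns HTML):

```python
# Sketch using the newspaper3k library (pip install newspaper3k), which
# applies generic boilerplate-removal heuristics instead of per-site regexes.
from newspaper import Article

url = "https://example.com/some-article"   # placeholder URL
article = Article(url)
article.download()      # or article.set_html(html) if the crawler already fetched the page
article.parse()

print(article.title)
print(article.text)     # main article body, without navigation or comments
```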

How do you test the speed of a sorting algorithm?

Submitted by 旧巷老猫 on 2020-01-04 09:08:12
Question: I want to do an empirical test of the speed of sorting algorithms. Initially I generated the data randomly, but this seems unfair and messes up some algorithms. For example, with quicksort the pivot selection is important: one method is to always pick the first element, another is to pick the median of the first, last, and middle elements. But if the array is already random it doesn't matter which pivot is selected, so in this sense it's unfair. How do you resolve this
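
One common resolution is to benchmark each algorithm on several input distributions rather than a single random one, so pivot-selection strategies can show both their strengths and their pathological cases. A minimal Python sketch (the distributions chosen here are just illustrative):

```python
import random
import timeit

def benchmark(sort_fn, n=10_000):
    """Time one sorting function on several input distributions."""
    base = list(range(n))
    cases = {
        "random":        random.sample(base, n),
        "sorted":        list(base),
        "reverse":       list(reversed(base)),
        "nearly sorted": base[:-10] + random.sample(base[-10:], 10),
        "many dups":     [random.randrange(10) for _ in range(n)],
    }
    for name, data in cases.items():
        # Copy the input each run so every repetition sorts the same data.
        t = timeit.timeit(lambda: sort_fn(list(data)), number=5)
        print(f"{name:>13}: {t:.4f}s")

benchmark(sorted)   # swap in your own quicksort variants here
```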

Detect significant changes in a data-set that gradually changes

Submitted by 谁说胖子不能爱 on 2020-01-03 05:13:23
Question: I have a list of data in Python that represents the amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by a significant change is a bit different from what I've read so far. For example, if I have a dataset like [10,15,17,20,30,40,50,70,80,60,40,20], I say a significant change happens when the data doubles or halves with respect to the previous normal. For example, since the list starts with 10, that is our starting
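
A minimal sketch of one reading of this rule (the interpretation of "previous normal" as the last value at which a significant change was recorded is an assumption, since the question is cut off before finishing its definition):

```python
def significant_changes(values):
    """Count points where a value doubles or halves relative to the
    current 'normal' (the last value at which a significant change occurred)."""
    if not values:
        return 0
    normal = values[0]
    count = 0
    for v in values[1:]:
        if v >= 2 * normal or v <= normal / 2:
            count += 1
            normal = v          # the new level becomes the reference point
    return count

print(significant_changes([10, 15, 17, 20, 30, 40, 50, 70, 80, 60, 40, 20]))
```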

R - Calculate difference (similarity measure) between similar datasets

Submitted by 时光怂恿深爱的人放手 on 2020-01-02 11:03:50
Question: I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this one, please mark this as a duplicate and point us to it. Scenario: we have a benchmark dataset and some imputation methods; we systematically delete values from the benchmark and apply two different imputation methods. Thus we have a benchmark, imputedData1, and imputedData2. Question: is there a function that can produce a number that represents the
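
The question asks about R, but the usual idea is language-agnostic: score each imputed dataset by a distance (e.g. RMSE or MAE) between its values and the benchmark's, restricted to the cells that were deliberately deleted; the smaller score is the closer imputation. A Python sketch of that idea (function and variable names are illustrative; in R the same arithmetic is a few vectorized operations):

```python
import numpy as np

def imputation_rmse(benchmark, imputed, deleted_mask):
    """RMSE between imputed values and the benchmark, restricted to the
    cells that were deliberately deleted (all inputs are NumPy arrays)."""
    diff = (imputed - benchmark)[deleted_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: a 3x3 benchmark with two cells deleted and re-imputed.
benchmark = np.array([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0],
                      [7.0, 8.0, 9.0]])
mask = np.zeros_like(benchmark, dtype=bool)
mask[0, 1] = mask[2, 0] = True

imputed1 = benchmark.copy()
imputed1[0, 1], imputed1[2, 0] = 2.4, 6.5
print(imputation_rmse(benchmark, imputed1, mask))   # lower = closer to the benchmark
```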

Pandas and Python Dataframes and Conditional Shift Function

Submitted by 馋奶兔 on 2020-01-01 17:44:07
Question: Is there a conditional "shift" parameter for data frames? For example, assume I own a used car lot and have data as follows:

SaleDate | Car
12/1/2016 | Wrangler
12/2/2016 | Camry
12/3/2016 | Wrangler
12/7/2016 | Prius
12/10/2016 | Prius
12/12/2016 | Wrangler

I want to find out two things from this list: 1) For each sale, when was the last day that a car was sold? This is simple in pandas, just a plain shift: df['PriorSaleDate'] = df['SaleDate'].shift() 2) For each sale, when was the prior date
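
For the second part, the usual trick is a group-wise shift: group on the car model and shift within each group, so each row sees the previous sale of the same model. A minimal sketch (assuming the cut-off sentence continues "when was the prior date that the same car was sold"):

```python
import pandas as pd

# Toy data mirroring the question (dates parsed as datetimes).
df = pd.DataFrame({
    "SaleDate": pd.to_datetime(["12/1/2016", "12/2/2016", "12/3/2016",
                                "12/7/2016", "12/10/2016", "12/12/2016"]),
    "Car": ["Wrangler", "Camry", "Wrangler", "Prius", "Prius", "Wrangler"],
})

# 1) previous sale of any car
df["PriorSaleDate"] = df["SaleDate"].shift()

# 2) previous sale of the *same* car: shift within each Car group
df["PriorSameCarSale"] = df.groupby("Car")["SaleDate"].shift()
print(df)
```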

Huge sparse dataframe to scipy sparse matrix without dense transform

Submitted by 末鹿安然 on 2020-01-01 13:26:12
Question: I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 different users). I want to one-hot encode this column and use the data in ML algorithms (xgboost, FFM, scikit). But due to the huge number of rows and unique user values the matrix will be ~1 million x 1500, so I need to do this in a sparse format (otherwise the data kills all the RAM). For me the convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format: df = pd.get_dummies(df,
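
One way to avoid the dense intermediate entirely (an alternative to get_dummies, sketched under the assumption that scikit-learn is available) is to let OneHotEncoder emit a scipy CSR matrix directly and hstack it with the remaining columns:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the 1M x 30 dataset.
df = pd.DataFrame({"user_id": [1, 2, 1, 3],
                   "feat_a": [0.5, 1.2, 0.7, 3.3]})

# One-hot encode user_id straight into a scipy sparse matrix (no dense step).
enc = OneHotEncoder(handle_unknown="ignore")
user_ohe = enc.fit_transform(df[["user_id"]])          # CSR matrix, shape (n_rows, n_users)

# Stack with the remaining numeric columns, still sparse.
X = sparse.hstack([user_ohe,
                   sparse.csr_matrix(df.drop(columns="user_id").values)],
                  format="csr")
print(X.shape, type(X))
```

xgboost and scikit-learn estimators accept scipy sparse matrices directly, so X can be fed to them without ever materializing the dense one-hot matrix.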

Value matching between two DataFrames using pandas in Python

Submitted by 只愿长相守 on 2019-12-30 12:54:49
Question: Hi, I have two DataFrames like below.

DF1:

Alpha | Numeric | Special
and | 1 | @
or | 2 | #
lol ok | 4 | &

DF2, with a single column:

Content
boy or girl
school @ morn
pyc LoL ok
student Chandra

I want to check whether any column of DF1 contains any of the keywords found in the Content column of DF2, and the output should go into a new DataFrame:

df11 = (df1.unstack()
          .reset_index(level=2, drop=True)
          .rename_axis(('col_order', 'col_name'))
          .dropna()
          .reset_index(name='val_low'))
df22 = (df2['Content'].str.split(expand

Processing a very big data set in Python - memory error

Submitted by 那年仲夏 on 2019-12-30 06:38:15
Question: I'm trying to process data obtained from a CSV file using the csv module in Python. There are about 50 columns and 401125 rows in it. I used the following code chunk to put that data into a list (note it is Python 2 style, with csv_file_object.next() and binary mode):

csv_file_object = csv.reader(open(r'some_path\Train.csv', 'rb'))
header = csv_file_object.next()
data = []
for row in csv_file_object:
    data.append(row)

I can get the length of this list using len(data) and it returns 401125. I can even get each individual record by calling list indices. But when I try to get the
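
If the full list of rows does not fit comfortably in memory, a common workaround is to stream the file in chunks instead of materializing every row at once. A pandas sketch (the path reuses the question's placeholder and the chunk size is arbitrary):

```python
import pandas as pd

# Read the file in chunks so only a slice is in memory at a time.
reader = pd.read_csv(r"some_path\Train.csv", chunksize=50_000)

row_count = 0
for chunk in reader:            # each chunk is an ordinary DataFrame
    row_count += len(chunk)
    # ... process or aggregate the chunk here ...
print("total rows:", row_count)
```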