data-cleaning

Python pandas groupby aggregate on multiple columns, then pivot

末鹿安然 submitted on 2019-11-28 21:52:30
Question: In Python, I have a pandas DataFrame similar to the following:

Item  | shop1 | shop2 | shop3 | Category
------------------------------------------
Shoes | 45    | 50    | 53    | Clothes
TV    | 200   | 300   | 250   | Technology
Book  | 20    | 17    | 21    | Books
phone | 300   | 350   | 400   | Technology

where shop1, shop2 and shop3 are the costs of every item in different shops. Now, I need to return a DataFrame, after some data cleaning, like this one:

Category (index) | size | sum | mean | std
------------------------------------
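One possible way to get per-category statistics is sketched below. The excerpt cuts off before the expected numbers, so this assumes the statistics are meant to cover every shop price within a category. The column names `shop` and `price` are illustrative choices, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Shoes", "TV", "Book", "phone"],
    "shop1": [45, 200, 20, 300],
    "shop2": [50, 300, 17, 350],
    "shop3": [53, 250, 21, 400],
    "Category": ["Clothes", "Technology", "Books", "Technology"],
})

# Reshape the wide shop columns into one long "price" column
# ("shop" and "price" are assumed names for illustration).
long_df = df.melt(
    id_vars=["Item", "Category"],
    value_vars=["shop1", "shop2", "shop3"],
    var_name="shop",
    value_name="price",
)

# One row per category, one column per statistic.
summary = long_df.groupby("Category")["price"].agg(["size", "sum", "mean", "std"])
print(summary)
```

Melting first means each category's statistics are computed over all of its prices at once, rather than per shop column.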

Fill in missing pandas data with previous non-missing value, grouped by key

谁都会走 submitted on 2019-11-28 20:40:00
I am dealing with pandas DataFrames like this:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1  NaN
5   2  NaN
6   1  300
7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1   20
5   2  200
6   1  300
7   1  300

Is there some slick way to do this without manually looping over rows?

You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
    df['x'] = df.groupby(['id'])['x'].ffill()
    print
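To see why the grouping matters, the following sketch compares a plain forward fill with the grouped one on the same data. The variable names `naive` and `grouped` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 1, 2, 1, 1],
    "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan],
})

# Plain forward fill ignores 'id': row 4 (id 1) picks up 200 from an id-2 row.
naive = df["x"].ffill()

# Grouped forward fill only looks back at rows with the same 'id'.
grouped = df.groupby("id")["x"].ffill()

print(pd.DataFrame({"id": df["id"], "naive": naive, "grouped": grouped}))
```

The two results differ only at row 4, where the previous row belongs to a different id.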

Find all columns of dataframe in Pandas whose type is float, or a particular type?

旧街凉风 submitted on 2019-11-28 18:28:32
Question: I have a dataframe, df, that has some columns of type float64, while the others are of type object. Because of the mixed types, I cannot use

    df.fillna('unknown')  # getting error "ValueError: could not convert string to float:"

as the error happens with the columns whose type is float64 (what a misleading error message!), so I wish I could do something like

    for col in df.columns[<dtype == object>]:
        df[col] = df[col].fillna("unknown")

So my question is whether there is any such filter expression that
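One way to filter columns by dtype is `DataFrame.select_dtypes`. The sketch below uses a small made-up frame (the column names `price` and `label` are hypothetical) and builds the text column with an explicit object dtype so that it matches the situation described in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one float64 column and one object column.
df = pd.DataFrame({
    "price": [1.5, np.nan, 3.0],
    "label": pd.Series(["a", None, "c"], dtype=object),
})

# Column names selected by dtype.
float_cols = df.select_dtypes(include=["float64"]).columns
object_cols = df.select_dtypes(include=["object"]).columns

# Fill only the object columns; the float column keeps its NaN.
df[object_cols] = df[object_cols].fillna("unknown")
print(df)
```

Passing `exclude=[...]` instead of `include=[...]` selects every column that is not of the listed dtypes.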

Data cleaning of dollar values and percentage in R

假如想象 submitted on 2019-11-28 12:01:47
Question: I've been searching for packages in R to help me convert dollar values to nice numerical values, but I can't seem to find one (in the plyr package, for example). The basic thing I'm looking for is simply removing the $ sign as well as translating "M" and "K" for millions and thousands respectively. To replicate, I can use the code below:

    require(XML)
    theurl <- "http://www.kickstarter.com/help/stats"
    html <- htmlParse(theurl)
    allProjects <- readHTMLTable(html)[[1]]
    names
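The conversion logic described here (strip the $ sign, then scale by a "K" or "M" suffix) can be sketched as follows. Note that this is written in Python rather than R, purely to illustrate the steps. The function name `parse_dollars` and the sample strings are hypothetical, and thousands separators are assumed to be commas.

```python
def parse_dollars(text: str) -> float:
    """Convert strings such as '$1.5M' or '$250K' to plain numbers."""
    # Drop the currency sign and any thousands separators (assumed to be commas).
    cleaned = text.strip().replace("$", "").replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = cleaned[-1:].upper()
    if suffix in multipliers:
        # Scale the numeric part by the suffix.
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)


for sample in ["$1.5M", "$250K", "$1,234"]:
    print(sample, parse_dollars(sample))
```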

dplyr pipes - How to change the original dataframe

血红的双手。 submitted on 2019-11-28 11:14:04
When I don't use a pipe, I can change the original dataframe using these commands:

    df <- slice(df, -c(1:3))                  # delete top 3 rows
    df <- select(df, -c(Col1, Col50, Col51))  # delete specific columns

How would one do this with a pipe? I tried this, but the slice and select functions don't change the original dataframe.

    df %>% slice(-c(1:3)) %>% select(-c(Col1, Col50, Col51))

I'd like to change the original df.

You can definitely do the assignment by using an idiom such as df <- df %>% ... or df %>% ... -> df. But you could also avoid redundancy (i.e., stating df twice) by using the magrittr compound assignment

How can I turn part of the Excel data to columns to get a desired output?

痴心易碎 submitted on 2019-11-28 02:16:38
For example, say I have data in the following format - Current Format. I need the data to be in the following format for ease of use - Required Format. Of course the data contains a lot more records; I'm looking for an easy way to transpose data in this way for large sets of data. Any help will be appreciated :)

This is very easy with PowerQuery. It is built into Excel 2016 and is a freely available add-in for versions 2010 to 2013. You would set your data up as a table, excluding the first row which contains the text Number of Cases (Ctrl + T will bring up the window to create a table

Looping grepl() through data.table (R)

浪尽此生 submitted on 2019-11-28 00:31:09
Question: I have a dataset stored as a data.table DT that looks like this:

    print(DT)
                 category industry
    1:     administration    admin
    2: nurse practitioner    truck
    3:           trucking    truck
    4:     administration    admin
    5:        warehousing    nurse
    6:        warehousing    admin
    7:           trucking    truck
    8: nurse practitioner    nurse
    9: nurse practitioner    truck

I would like to reduce the table to only rows where the industry matches the category. My general approach is to use grepl() to regex match the string '^{{INDUSTRY}}[a-z ]+$' and each row of DT
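The row-wise matching idea can be sketched as below. This is a Python/pandas illustration rather than R data.table code, shown only to make the per-row pattern concrete. The helper name `category_matches` is hypothetical, and the industry value is escaped on the assumption that it should be treated as literal text.

```python
import re

import pandas as pd

dt = pd.DataFrame({
    "category": ["administration", "nurse practitioner", "trucking",
                 "administration", "warehousing", "warehousing",
                 "trucking", "nurse practitioner", "nurse practitioner"],
    "industry": ["admin", "truck", "truck", "admin", "nurse",
                 "admin", "truck", "nurse", "truck"],
})


def category_matches(row) -> bool:
    # Build '^<industry>[a-z ]+$' from this row's industry value.
    pattern = "^" + re.escape(row["industry"]) + "[a-z ]+$"
    return re.search(pattern, row["category"]) is not None


# Keep only the rows whose category starts with their own industry.
matched = dt[dt.apply(category_matches, axis=1)]
print(matched)
```

Because the pattern depends on each row's own industry, it has to be built once per row instead of once for the whole column.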

Fill in missing pandas data with previous non-missing value, grouped by key

痴心易碎 submitted on 2019-11-27 13:03:08
Question: I am dealing with pandas DataFrames like this:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1  NaN
5   2  NaN
6   1  300
7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1   20
5   2  200
6   1  300
7   1  300

Is there some slick way to do this without manually looping over rows?

Answer 1: You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'id': [1,1,2

dplyr pipes - How to change the original dataframe

怎甘沉沦 submitted on 2019-11-27 06:09:20
Question: When I don't use a pipe, I can change the original dataframe using these commands:

    df <- slice(df, -c(1:3))                  # delete top 3 rows
    df <- select(df, -c(Col1, Col50, Col51))  # delete specific columns

How would one do this with a pipe? I tried this, but the slice and select functions don't change the original dataframe.

    df %>% slice(-c(1:3)) %>% select(-c(Col1, Col50, Col51))

I'd like to change the original df.

Answer 1: You can definitely do the assignment by using an idiom such as df <- df %>% ... or df %>% ... ->

How can I turn part of the Excel data to columns to get a desired output?

倖福魔咒の submitted on 2019-11-26 22:09:58
Question: For example, say I have data in the following format - Current Format. I need the data to be in the following format for ease of use - Required Format. Of course the data contains a lot more records; I'm looking for an easy way to transpose data in this way for large sets of data. Any help will be appreciated :)

Answer 1: This is very easy with PowerQuery. It is built into Excel 2016 and is a freely available add-in for versions 2010 to 2013. You would set your data up as a table