data-cleaning

pandas read_csv and setting na_values to any string in the csv file [duplicate]

Submitted by 不羁的心 on 2020-02-07 06:11:06
Question: This question already has answers here: Pandas: Converting to numeric, creating NaNs when necessary (4 answers). Closed 2 years ago.

data.csv:

1, 22, 3432
1, 23, \N
2, 24, 54335
2, 25, 3928

I have a csv file of data collected from a device. Every now and then the device doesn't relay information and outputs '\N'. I want to treat these as NaN, which I did with read_csv('data.csv', na_values=['\\N']), and that worked fine. However, I would prefer to have not only this string turned to NaN […]
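A minimal sketch of both approaches, using an in-memory copy of the sample data (the file path in the original post is assumed): na_values handles the known '\N' marker, and pd.to_numeric with errors='coerce' turns any remaining non-numeric string into NaN.

```python
import io
import pandas as pd

csv_text = "1, 22, 3432\n1, 23, \\N\n2, 24, 54335\n2, 25, 3928\n"

# na_values catches the known '\N' marker; skipinitialspace drops the
# space after each comma so the marker matches exactly.
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 na_values=['\\N'], skipinitialspace=True)

# To turn *any* non-numeric string into NaN, coerce every column:
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
```

The coerce step is the more general answer the asker was after: it makes no assumptions about which strings can appear.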

How to drop rows that are not exact duplicates but contain no new information (more NaN)

Submitted by 不羁岁月 on 2020-01-25 09:17:06
Question: My goal is to collapse the below table into one single column. For this question specifically, I am asking how I can intelligently delete the yellow row, because it is a duplicate of the gray row but with less information. The table has three categorical variables and six analysis/quantitative variables. Columns C1 and C2 are the only variables that need to match for a successful join; all of the […]. All blank cells are NaNs, and Python code for copying the data is below. Question 1 (Yellow): All of […]
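One common way to drop such "subsumed" rows is to count NaNs per row and, within each key group, keep the most complete row. The frame below is hypothetical, modeled on the question's description (C1/C2 as join keys), since the real table is only shown as a picture:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: C1/C2 are the join keys, v1/v2 are quantitative
# columns that may contain NaN. Row 0 repeats row 1 with extra NaNs.
df = pd.DataFrame({
    'C1': ['a', 'a', 'b'],
    'C2': ['x', 'x', 'y'],
    'v1': [1.0, 1.0, 3.0],
    'v2': [np.nan, 2.0, np.nan],
})

# Within each (C1, C2) group, keep the row with the fewest NaNs -- a row
# that only repeats another row's values plus NaNs adds no information.
df['_nan_count'] = df.isna().sum(axis=1)
deduped = (df.sort_values('_nan_count')
             .drop_duplicates(subset=['C1', 'C2'], keep='first')
             .drop(columns='_nan_count')
             .sort_index())
print(deduped)
```

Note this is a heuristic: it assumes the less-complete row really is a subset of the more complete one, which matches the yellow-vs-gray case described but would silently discard conflicting values otherwise.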

Reordering columns in data frame once again

Submitted by 纵然是瞬间 on 2020-01-23 02:43:45
Question: I want to re-order the columns in my data frame, but what I have found so far is not satisfactory. My data frame looks like:

cnt <- as.factor(c("Country 1", "Country 2", "Country 3", "Country 1", "Country 2", "Country 3"))
bnk <- as.factor(c("bank 1", "bank 2", "bank 3", "bank 1", "bank 2", "bank 3"))
mayData <- data.frame(age = c(10,12,13,10,11,15), Country = cnt, Bank = bnk,
                      q10 = c(1,1,1,2,2,2), q11 = c(1,1,1,2,2,2), q1 = c(1,1,1,2,2,2),
                      q9 = c(1,1,1,2,2,2), q6 = c(1,1,1,2,2,2),
                      year = c(1950,1960,1970,1980,1990 […]
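The question is about R, but the underlying idea carries over directly to pandas, shown here as an illustrative sketch (the frame below only mirrors a few of mayData's columns): reorder by indexing with the column names in the order you want, sorting the q-columns numerically since lexicographic order would put q10 before q9.

```python
import pandas as pd

# Hypothetical pandas version of a few mayData columns from the question.
df = pd.DataFrame({
    'age': [10, 12, 13], 'Country': ['Country 1', 'Country 2', 'Country 3'],
    'q10': [1, 1, 2], 'q1': [1, 2, 2], 'year': [1950, 1960, 1970],
})

# Sort the q-columns by their number (lexicographic sort would give
# 'q1', 'q10', 'q2', ...), then select all columns in the desired order.
q_cols = sorted([c for c in df.columns if c.startswith('q')],
                key=lambda c: int(c[1:]))
df = df[['Country', 'age'] + q_cols + ['year']]
print(df.columns.tolist())
```

In R the same pattern is mayData[, desired_order], so the real work in both languages is building the ordered name vector.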

How to transfer negative value at current row to previous row in a data frame?

Submitted by 被刻印的时光 ゝ on 2020-01-13 10:36:10
Question: I want to transfer the negative values at the current row to the previous row, by adding them to the previous row, within each group. Following is the sample raw data I have:

raw_data <- data.frame(GROUP = rep(c('A','B','C'), each = 6),
                       YEARMO = rep(201801:201806, 3),
                       VALUE = c(100,-10,20,70,-50,30,20,60,40,-20,-10,50,0,10,-30,50,100,-100))

> raw_data
   GROUP YEARMO VALUE
1      A 201801   100
2      A 201802   -10
3      A 201803    20
4      A 201804    70
5      A 201805   -50
6      A 201806    30
7      B 201801    20
8      B 201802    60
9      B 201803 […]
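The question is posed in R; as an illustration, here is the same transfer rule sketched in pandas on group A of the sample data: walk each group from the bottom up, add every negative value to the row above it, and zero it out.

```python
import pandas as pd

# Hypothetical pandas copy of group A from the question's raw_data.
df = pd.DataFrame({
    'GROUP': ['A'] * 6,
    'YEARMO': list(range(201801, 201807)),
    'VALUE': [100, -10, 20, 70, -50, 30],
})

def roll_up_negatives(s):
    """Add each negative value to the row above it (within one group).

    Single-pass sketch: it does not cascade further if adding a negative
    drives the previous row negative in turn.
    """
    vals = s.tolist()
    for i in range(len(vals) - 1, 0, -1):
        if vals[i] < 0:
            vals[i - 1] += vals[i]
            vals[i] = 0
    return pd.Series(vals, index=s.index)

df['VALUE'] = df.groupby('GROUP')['VALUE'].transform(roll_up_negatives)
print(df['VALUE'].tolist())  # [90, 0, 20, 20, 0, 30]
```

Whether a negative first row (which has no previous row to absorb it) should stay negative or propagate elsewhere is left unspecified in the question, so this sketch leaves it in place.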

python pandas: split comma-separated column into new columns - one per value

Submitted by ≡放荡痞女 on 2020-01-11 03:54:26
Question: I have a dataframe like this:

data = np.array([["userA", "event2, event3"],
                 ["userB", "event3, event4"],
                 ["userC", "event2"]])
data = pd.DataFrame(data)

       0                 1
0  userA  "event2, event3"
1  userB  "event3, event4"
2  userC          "event2"

Now I would like to get a dataframe like this:

       0 event2 event3 event4
0  userA      1      1
1  userB             1      1
2  userC      1

Can anybody help please?

Answer 1: It seems you need get_dummies, then replace 0 with an empty string:

df = data[[0]].join(data[1].str.get_dummies(', ').replace(0, ''))
print(df) […]
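The accepted answer's one-liner, made runnable end to end: str.get_dummies splits column 1 on ', ' and builds one 0/1 indicator column per distinct event, and replace(0, '') reproduces the blank cells in the desired output.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([["userA", "event2, event3"],
                              ["userB", "event3, event4"],
                              ["userC", "event2"]]))

# One indicator column per comma-separated value, blanks instead of 0s.
df = data[[0]].join(data[1].str.get_dummies(', ').replace(0, ''))
print(df)
```

If the 0/1 columns are meant for further computation, skip the replace step and keep them as integers; the blanks are purely cosmetic.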

How to extract certain rows under a specific condition in pandas? (Sentiment analysis)

Submitted by 自古美人都是妖i on 2020-01-07 08:25:18
Question: The picture shows what my dataframe looks like. I have user_name, movie_name and time columns. I want to extract only the rows from the first day each movie appears. For example, if movie a's first date in the time column is 2018-06-27, I want all the rows on that date, and if movie b's first date in the time column is 2018-06-12, I only want those rows. How would I do that with pandas?

Answer 1: I assume that the time column is of datetime type. If not, convert this column by calling pd.to_datetime. Then run: df […]
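The answer is cut off, but the standard pattern it points at is a groupby-transform filter: compute each movie's earliest date and keep the rows whose date equals it. The sample frame is hypothetical, since the original data is only shown as a picture:

```python
import pandas as pd

# Hypothetical data shaped like the question's frame.
df = pd.DataFrame({
    'user_name': ['u1', 'u2', 'u3', 'u4'],
    'movie_name': ['a', 'a', 'b', 'b'],
    'time': pd.to_datetime(['2018-06-27 10:00', '2018-06-28 09:00',
                            '2018-06-12 12:00', '2018-06-12 15:00']),
})

# Normalize to midnight so rows compare by calendar day, then keep rows
# whose day equals the earliest day seen for that movie.
dates = df['time'].dt.normalize()
first_day = dates.groupby(df['movie_name']).transform('min')
result = df[dates == first_day]
print(result)
```

transform('min') broadcasts each group's minimum back to every row, which is what lets the whole filter stay a single boolean mask.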

Extract specific columns form a text file to make a dataframe in scala

Submitted by 时光毁灭记忆、已成空白 on 2020-01-07 04:14:05
Question: I need to clean some data in Scala. I have the following raw data in a text file:

06:36:15.718068 IP 10.0.0.1.5001 > 10.0.0.2.41516: Flags [.], ack 346, win 163, options [nop,nop,TS val 1654418 ecr 1654418], length 0
06:36:15.718078 IP 10.0.0.2.41516 > 10.0.0.1.5001: Flags [.], seq 1:65161, ack 0, win 58, options [nop,nop,TS val 1654418 ecr 1654418], length 65160

I need to have all of them in a dataframe in the following way:

+----------------+-----------+----------+------- […]
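The question targets Scala; as a language-agnostic sketch, here is the core extraction step in Python: a regular expression that pulls timestamp, source, destination, and packet length out of each tcpdump-style line. The same pattern would drop into Scala's Regex or Spark's regexp_extract unchanged. The chosen field set is an assumption, since the desired schema is cut off in the post:

```python
import re

# Assumed fields: time, src, dst, length (the post's target schema is
# truncated). '.*' skips the flags/options section of each line.
LINE_RE = re.compile(
    r'^(?P<time>\S+) IP (?P<src>\S+) > (?P<dst>[^:]+):.*length (?P<length>\d+)$'
)

def parse_line(line):
    """Return a dict of fields for one tcpdump line, or None if unmatched."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    row['length'] = int(row['length'])
    return row

line = ('06:36:15.718068 IP 10.0.0.1.5001 > 10.0.0.2.41516: Flags [.], '
        'ack 346, win 163, options [nop,nop,TS val 1654418 ecr 1654418], length 0')
print(parse_line(line))
```

Parsing each line to a dict (or case class in Scala) first, then building the dataframe from the parsed rows, keeps the regex testable in isolation.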

Reformat and Collapse Data Frame Based on Corresponding Column Identifier Code R

Submitted by 别说谁变了你拦得住时间么 on 2020-01-06 06:09:07
Question: I'm trying to reshape a two-column data frame by collapsing the rows that share a ticker symbol in column 2 into one row per unique ticker, turning the field values in column 1 into their own columns. See below for my example with a small sample, since the real data frame has 500 tickers and 4 fields:

# Closed End Fund Selector
url <- "https://www.cefconnect.com/api/v3/DailyPricing?props=Ticker,Name […]
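This is a long-to-wide reshape. The question is in R (where tidyr's pivot_wider does this); as an illustrative sketch in pandas, with entirely hypothetical data, the one wrinkle is that a two-column frame has no explicit field-name column, so one has to be derived from the repeating row order before pivoting:

```python
import pandas as pd

# Hypothetical long-format frame: each ticker repeats once per field,
# always in the same order. Deriving the 'field' label from that order
# is an assumption about the data's structure.
long_df = pd.DataFrame({
    'value':  ['Fund A', 10.5, 'Fund B', 9.8],
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
})
long_df['field'] = (long_df.groupby('ticker').cumcount()
                           .map({0: 'Name', 1: 'Price'}))

# Long-to-wide: one row per ticker, one column per field.
wide = (long_df.pivot(index='ticker', columns='field', values='value')
               .reset_index())
print(wide)
```

If the field order ever varies between tickers, the cumcount trick silently mislabels values, so a real pipeline should carry an explicit field column from the source instead.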

Parsing out First and Last name from Excel field

Submitted by 烂漫一生 on 2019-12-25 04:47:11
Question: I have a field (column) in Excel in the format "LastName, FirstName MiddleInitial", with a space between the comma after the last name and the first name, and a second space between the first name and the middle initial (no comma after the first name). Is there a way to identify which cells have a middle initial on the right-hand side and then eliminate the middle initial from all such cells, so that the output looks like "LastName, FirstName"? Thanks!

Answer 1: What you want to do is to be able […]
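The question asks for an Excel technique; as an illustration of the rule itself, here it is in Python: a regex that strips a trailing single-letter middle initial (with optional period) and leaves names without one untouched.

```python
import re

def drop_middle_initial(name):
    """Turn 'LastName, FirstName M' into 'LastName, FirstName'.

    Names without a trailing single-letter initial are returned as-is.
    """
    return re.sub(r'^([^,]+,\s*\S+)\s+[A-Za-z]\.?$', r'\1', name)

print(drop_middle_initial('Smith, John Q'))  # Smith, John
print(drop_middle_initial('Smith, John'))    # Smith, John (unchanged)
```

In Excel itself the same condition-then-trim logic can be built from LEN, FIND, and LEFT, detecting the second space from the right before cutting.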