data-cleaning

pandas read_csv and setting na_values to any string in the csv file [duplicate]

Submitted by 不羁的心 on 2020-02-07 06:11:06
Question: This question already has answers here: Pandas: Converting to numeric, creating NaNs when necessary (4 answers). Closed 2 years ago.

data.csv:

1, 22, 3432
1, 23, \N
2, 24, 54335
2, 25, 3928

I have a csv file of data collected from a device. Every now and then the device doesn't relay information and outputs '\N'. I want to treat these as NaN, which I did with read_csv('data.csv', na_values=['\\N']), and that worked fine. However, I would prefer to have not only this string turned to NaN […]
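A minimal sketch of both approaches, using an in-memory copy of the sample data (the file path in the original post is assumed): na_values handles the known '\N' marker, and pd.to_numeric with errors='coerce' turns any remaining non-numeric string into NaN.

```python
import io
import pandas as pd

csv_text = "1, 22, 3432\n1, 23, \\N\n2, 24, 54335\n2, 25, 3928\n"

# na_values catches the known '\N' marker; skipinitialspace drops the
# space after each comma so the marker matches exactly.
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 na_values=['\\N'], skipinitialspace=True)

# To turn *any* non-numeric string into NaN, coerce every column:
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
```

The coerce step is the more general answer the asker was after: it makes no assumptions about which strings can appear.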

How to drop rows that are not exact duplicates but contain no new information (more NaN)

Submitted by 不羁岁月 on 2020-01-25 09:17:06
Question: My goal is to collapse the below table into one single column. For this question specifically, I am asking how I can intelligently delete the yellow row, because it is a duplicate of the gray row but with less information. The table has three categorical variables and six analysis/quantitative variables. Columns C1 and C2 are the only variables that need to match for a successful join; all of the […]. All blank cells are NaNs, and Python code for copying the data is below. Question 1 (Yellow): All of […]
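One common way to drop such "subsumed" rows is to count NaNs per row and, within each key group, keep the most complete row. The frame below is hypothetical, modeled on the question's description (C1/C2 as join keys), since the real table is only shown as a picture:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: C1/C2 are the join keys, v1/v2 are quantitative
# columns that may contain NaN. Row 0 repeats row 1 with extra NaNs.
df = pd.DataFrame({
    'C1': ['a', 'a', 'b'],
    'C2': ['x', 'x', 'y'],
    'v1': [1.0, 1.0, 3.0],
    'v2': [np.nan, 2.0, np.nan],
})

# Within each (C1, C2) group, keep the row with the fewest NaNs -- a row
# that only repeats another row's values plus NaNs adds no information.
df['_nan_count'] = df.isna().sum(axis=1)
deduped = (df.sort_values('_nan_count')
             .drop_duplicates(subset=['C1', 'C2'], keep='first')
             .drop(columns='_nan_count')
             .sort_index())
print(deduped)
```

Note this is a heuristic: it assumes the less-complete row really is a subset of the more complete one, which matches the yellow-vs-gray case described but would silently discard conflicting values otherwise.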

Reordering columns in data frame once again

Submitted by 纵然是瞬间 on 2020-01-23 02:43:45
Question: I want to re-order the columns in my data frame, but what I have found so far is not satisfactory. My data frame looks like:

cnt <- as.factor(c("Country 1", "Country 2", "Country 3", "Country 1", "Country 2", "Country 3"))
bnk <- as.factor(c("bank 1", "bank 2", "bank 3", "bank 1", "bank 2", "bank 3"))
mayData <- data.frame(age = c(10,12,13,10,11,15), Country = cnt, Bank = bnk,
                      q10 = c(1,1,1,2,2,2), q11 = c(1,1,1,2,2,2), q1 = c(1,1,1,2,2,2),
                      q9 = c(1,1,1,2,2,2), q6 = c(1,1,1,2,2,2),
                      year = c(1950,1960,1970,1980,1990 […]
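The question is about R, but the underlying idea carries over directly to pandas, shown here as an illustrative sketch (the frame below only mirrors a few of mayData's columns): reorder by indexing with the column names in the order you want, sorting the q-columns numerically since lexicographic order would put q10 before q9.

```python
import pandas as pd

# Hypothetical pandas version of a few mayData columns from the question.
df = pd.DataFrame({
    'age': [10, 12, 13], 'Country': ['Country 1', 'Country 2', 'Country 3'],
    'q10': [1, 1, 2], 'q1': [1, 2, 2], 'year': [1950, 1960, 1970],
})

# Sort the q-columns by their number (lexicographic sort would give
# 'q1', 'q10', 'q2', ...), then select all columns in the desired order.
q_cols = sorted([c for c in df.columns if c.startswith('q')],
                key=lambda c: int(c[1:]))
df = df[['Country', 'age'] + q_cols + ['year']]
print(df.columns.tolist())
```

In R the same pattern is mayData[, desired_order], so the real work in both languages is building the ordered name vector.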

How to transfer negative value at current row to previous row in a data frame?

Submitted by 被刻印的时光 ゝ on 2020-01-13 10:36:10
Question: I want to transfer the negative values at the current row to the previous row, by adding them to the previous row, within each group. Following is the sample raw data I have:

raw_data <- data.frame(GROUP = rep(c('A','B','C'), each = 6),
                       YEARMO = rep(201801:201806, 3),
                       VALUE = c(100,-10,20,70,-50,30,20,60,40,-20,-10,50,0,10,-30,50,100,-100))

> raw_data
   GROUP YEARMO VALUE
1      A 201801   100
2      A 201802   -10
3      A 201803    20
4      A 201804    70
5      A 201805   -50
6      A 201806    30
7      B 201801    20
8      B 201802    60
9      B 201803 […]
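The question is posed in R; as an illustration, here is the same transfer rule sketched in pandas on group A of the sample data: walk each group from the bottom up, add every negative value to the row above it, and zero it out.

```python
import pandas as pd

# Hypothetical pandas copy of group A from the question's raw_data.
df = pd.DataFrame({
    'GROUP': ['A'] * 6,
    'YEARMO': list(range(201801, 201807)),
    'VALUE': [100, -10, 20, 70, -50, 30],
})

def roll_up_negatives(s):
    """Add each negative value to the row above it (within one group).

    Single-pass sketch: it does not cascade further if adding a negative
    drives the previous row negative in turn.
    """
    vals = s.tolist()
    for i in range(len(vals) - 1, 0, -1):
        if vals[i] < 0:
            vals[i - 1] += vals[i]
            vals[i] = 0
    return pd.Series(vals, index=s.index)

df['VALUE'] = df.groupby('GROUP')['VALUE'].transform(roll_up_negatives)
print(df['VALUE'].tolist())  # [90, 0, 20, 20, 0, 30]
```

Whether a negative first row (which has no previous row to absorb it) should stay negative or propagate elsewhere is left unspecified in the question, so this sketch leaves it in place.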

python pandas: split comma-separated column into new columns - one per value

Submitted by ≡放荡痞女 on 2020-01-11 03:54:26
Question: I have a dataframe like this:

data = np.array([["userA", "event2, event3"],
                 ["userB", "event3, event4"],
                 ["userC", "event2"]])
data = pd.DataFrame(data)

       0                 1
0  userA  "event2, event3"
1  userB  "event3, event4"
2  userC          "event2"

Now I would like to get a dataframe like this:

       0 event2 event3 event4
0  userA      1      1
1  userB             1      1
2  userC      1

Can anybody help please?

Answer 1: It seems you need get_dummies, then replace 0 with an empty string:

df = data[[0]].join(data[1].str.get_dummies(', ').replace(0, ''))
print(df) […]
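The accepted answer's one-liner, made runnable end to end: str.get_dummies splits column 1 on ', ' and builds one 0/1 indicator column per distinct event, and replace(0, '') reproduces the blank cells in the desired output.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([["userA", "event2, event3"],
                              ["userB", "event3, event4"],
                              ["userC", "event2"]]))

# One indicator column per comma-separated value, blanks instead of 0s.
df = data[[0]].join(data[1].str.get_dummies(', ').replace(0, ''))
print(df)
```

If the 0/1 columns are meant for further computation, skip the replace step and keep them as integers; the blanks are purely cosmetic.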

How to extract certain rows under a specific condition in pandas? (Sentiment analysis)

Submitted by 自古美人都是妖i on 2020-01-07 08:25:18
Question: The picture shows what my dataframe looks like. I have user_name, movie_name and time columns. I want to extract only the rows from the first day each movie appears. For example, if movie a's first date in the time column is 2018-06-27, I want all the rows on that date, and if movie b's first date in the time column is 2018-06-12, I only want those rows. How would I do that with pandas?

Answer 1: I assume that the time column is of datetime type. If not, convert this column by calling pd.to_datetime. Then run: df […]
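The answer is cut off, but the standard pattern it points at is a groupby-transform filter: compute each movie's earliest date and keep the rows whose date equals it. The sample frame is hypothetical, since the original data is only shown as a picture:

```python
import pandas as pd

# Hypothetical data shaped like the question's frame.
df = pd.DataFrame({
    'user_name': ['u1', 'u2', 'u3', 'u4'],
    'movie_name': ['a', 'a', 'b', 'b'],
    'time': pd.to_datetime(['2018-06-27 10:00', '2018-06-28 09:00',
                            '2018-06-12 12:00', '2018-06-12 15:00']),
})

# Normalize to midnight so rows compare by calendar day, then keep rows
# whose day equals the earliest day seen for that movie.
dates = df['time'].dt.normalize()
first_day = dates.groupby(df['movie_name']).transform('min')
result = df[dates == first_day]
print(result)
```

transform('min') broadcasts each group's minimum back to every row, which is what lets the whole filter stay a single boolean mask.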

Extract specific columns form a text file to make a dataframe in scala

Submitted by 时光毁灭记忆、已成空白 on 2020-01-07 04:14:05
Question: I need to clean some data in Scala. I have the following raw data in a text file:

06:36:15.718068 IP 10.0.0.1.5001 > 10.0.0.2.41516: Flags [.], ack 346, win 163, options [nop,nop,TS val 1654418 ecr 1654418], length 0
06:36:15.718078 IP 10.0.0.2.41516 > 10.0.0.1.5001: Flags [.], seq 1:65161, ack 0, win 58, options [nop,nop,TS val 1654418 ecr 1654418], length 65160

I need to have all of them in a dataframe in the following way:

+----------------+-----------+----------+------- […]
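The question targets Scala; as a language-agnostic sketch, here is the core extraction step in Python: a regular expression that pulls timestamp, source, destination, and packet length out of each tcpdump-style line. The same pattern would drop into Scala's Regex or Spark's regexp_extract unchanged. The chosen field set is an assumption, since the desired schema is cut off in the post:

```python
import re

# Assumed fields: time, src, dst, length (the post's target schema is
# truncated). '.*' skips the flags/options section of each line.
LINE_RE = re.compile(
    r'^(?P<time>\S+) IP (?P<src>\S+) > (?P<dst>[^:]+):.*length (?P<length>\d+)$'
)

def parse_line(line):
    """Return a dict of fields for one tcpdump line, or None if unmatched."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    row['length'] = int(row['length'])
    return row

line = ('06:36:15.718068 IP 10.0.0.1.5001 > 10.0.0.2.41516: Flags [.], '
        'ack 346, win 163, options [nop,nop,TS val 1654418 ecr 1654418], length 0')
print(parse_line(line))
```

Parsing each line to a dict (or case class in Scala) first, then building the dataframe from the parsed rows, keeps the regex testable in isolation.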

Reformat and Collapse Data Frame Based on Corresponding Column Identifier Code R

Submitted by 别说谁变了你拦得住时间么 on 2020-01-06 06:09:07
Question: I'm trying to reshape a two-column data frame by collapsing the rows that share a ticker symbol in column 2 into one row per unique ticker, turning the field values in column 1 into their own columns. See below for my example with a small sample, since the real data frame has 500 tickers and 4 fields:

# Closed End Fund Selector
url <- "https://www.cefconnect.com/api/v3/DailyPricing?props=Ticker,Name […]
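This is a long-to-wide reshape. The question is in R (where tidyr's pivot_wider does this); as an illustrative sketch in pandas, with entirely hypothetical data, the one wrinkle is that a two-column frame has no explicit field-name column, so one has to be derived from the repeating row order before pivoting:

```python
import pandas as pd

# Hypothetical long-format frame: each ticker repeats once per field,
# always in the same order. Deriving the 'field' label from that order
# is an assumption about the data's structure.
long_df = pd.DataFrame({
    'value':  ['Fund A', 10.5, 'Fund B', 9.8],
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
})
long_df['field'] = (long_df.groupby('ticker').cumcount()
                           .map({0: 'Name', 1: 'Price'}))

# Long-to-wide: one row per ticker, one column per field.
wide = (long_df.pivot(index='ticker', columns='field', values='value')
               .reset_index())
print(wide)
```

If the field order ever varies between tickers, the cumcount trick silently mislabels values, so a real pipeline should carry an explicit field column from the source instead.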

Parsing out First and Last name from Excel field

Submitted by 烂漫一生 on 2019-12-25 04:47:11
Question: I have a field (column) in Excel in the format "LastName, FirstName MiddleInitial", with a space between the comma after the last name and the first name, and a second space between the first name and the middle initial (no comma after the first name). Is there a way to identify which cells have a middle initial on the right-hand side and then eliminate the middle initial from all such cells, so that the output looks like "LastName, FirstName"? Thanks!

Answer 1: What you want to do is to be able […]
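The question asks for an Excel technique; as an illustration of the rule itself, here it is in Python: a regex that strips a trailing single-letter middle initial (with optional period) and leaves names without one untouched.

```python
import re

def drop_middle_initial(name):
    """Turn 'LastName, FirstName M' into 'LastName, FirstName'.

    Names without a trailing single-letter initial are returned as-is.
    """
    return re.sub(r'^([^,]+,\s*\S+)\s+[A-Za-z]\.?$', r'\1', name)

print(drop_middle_initial('Smith, John Q'))  # Smith, John
print(drop_middle_initial('Smith, John'))    # Smith, John (unchanged)
```

In Excel itself the same condition-then-trim logic can be built from LEN, FIND, and LEFT, detecting the second space from the right before cutting.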