data-cleaning | 易学教程

Python Pandas replace multiple columns zero to Nan

A list of persons' attributes is loaded into a pandas DataFrame, df2. For cleanup, I want to replace the value zero (0 or '0') with np.nan.

df2.dtypes

    ID           object
    Name         object
    Weight      float64
    Height      float64
    BootSize     object
    SuitSize     object
    Type         object
    dtype: object

Working code to set the value zero to np.nan:

    df2.loc[df2['Weight'] == 0, 'Weight'] = np.nan
    df2.loc[df2['Height'] == 0, 'Height'] = np.nan
    df2.loc[df2['BootSize'] == '0', 'BootSize'] = np.nan
    df2.loc[df2['SuitSize'] == '0', 'SuitSize'] = np.nan

I believe this can be done in a similar, shorter way: df2[["Weight","Height","BootSize","SuitSize"]].astype
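The snippet stops before any shorter version is shown. One possible way to handle all four columns in a single statement is DataFrame.replace with a list of values to match. This is only a sketch: the sample rows below are invented for illustration, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df2 = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Weight": [0.0, 72.5, 80.0],
    "Height": [180.0, 0.0, 175.0],
    "BootSize": ["0", "42", "44"],
    "SuitSize": ["50", "0", "52"],
})

cols = ["Weight", "Height", "BootSize", "SuitSize"]

# Replace numeric 0 and the string '0' with NaN in the selected columns only.
df2[cols] = df2[cols].replace([0, "0"], np.nan)

print(df2)
```

Restricting the replacement to `cols` leaves columns such as ID or Name untouched, even if they happen to contain a '0'.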

Fill missing Values by a ratio of other values in Pandas

Question: I have a column in a pandas DataFrame with around 78% missing values. The remaining 22% of the values are divided between three labels (SC, ST, GEN) in the following ratios:

    SC  - 16%
    ST  - 8%
    GEN - 76%

I need to replace the missing values with these three labels so that the ratio of all the elements stays the same as above. The assignment can be random as long as the ratio is preserved. How do I accomplish this?

Answer 1: Starting with this DataFrame (only to create something similar to yours):
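The answer's own code is cut off in this listing. A minimal sketch of one way to approach it, not the original answer's code, is to measure the label frequencies among the non-missing values and draw random labels with those probabilities. The Series `s`, its size, and the seed are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 220 observed labels (about 16% / 8% / 76%) and 780 missing.
s = pd.Series(["SC"] * 35 + ["ST"] * 18 + ["GEN"] * 167 + [None] * 780, dtype="object")

rng = np.random.default_rng(0)  # fixed seed only to make the example repeatable

# Relative frequency of each label among the non-missing values.
probs = s.value_counts(normalize=True)

# Draw one label per missing entry, weighted by the observed frequencies.
missing = s.isna()
filled = s.copy()
filled.loc[missing] = rng.choice(
    probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
)

print(filled.value_counts(normalize=True))
```

Because the draw is random, the resulting ratios match the observed ones only approximately. If the proportions must match exactly, one alternative is to build a list with the exact label counts and shuffle it before assigning.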

Sum variable by group and append result

Dataset HAVE is a tibble edgelist of phone call data from the characters of Recess:

    Student   Friend       nCalls
    TJ        Spinelli     3
    TJ        Gretchen     7
    TJ        Gus          6
    TJ        Vince        8
    TJ        King Bob     1
    TJ        Mikey        2
    Spinelli  TJ           3
    Spinelli  Vince        2
    Randall   Ms. Finster  17

Dataset NEED includes all original columns from HAVE but includes a new variable, nCallsPerStudent, that is exactly what it sounds like:

    Student   Friend       nCalls  nCallsPerStudent
    TJ        Spinelli     3       27
    TJ        Gretchen     7       27
    TJ        Gus          6       27
    TJ        Vince        8       27
    TJ        King Bob     1       27
    TJ        Mikey        2       27
    Spinelli  TJ           3       5
    Spinelli  Vince        2       5
    Randall   Ms. Finster  17      17

How do I get from HAVE to NEED? We

Cleansing string / input in Coldfusion 9

I have been working with ColdFusion 9 lately (my background is primarily in PHP), and I am scratching my head trying to figure out how to clean/sanitize user-submitted input strings. I want to make the input HTML-safe and eliminate any JavaScript or SQL injection, the usual. I am hoping I've overlooked some kind of function that already comes with CF9. Can someone point me in the proper direction?

Stephen Moretti

This is an addition to Kyle's suggestions, not an alternative answer, but the comments panel is a bit rubbish for links. Take a look at the ColdFusion string functions. You've got

Splitting a single column into multiple observation using R

I am working on HCUP data, which has a range of values in a single column that needs to be split into multiple rows. Below is the HCUP data frame for reference:

    code         label
    61000-61003  excision of CNS
    0169T-0169T  ventricular shunt

The desired output should be:

    code   label
    61000  excision of CNS
    61001  excision of CNS
    61002  excision of CNS
    61003  excision of CNS
    0169T  ventricular shunt

My approach to this problem uses the package splitstackshape and this code:

    library(data.table)
    library(splitstackshape)
    cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]

This

How to split a column into alphabetic values and numeric values from a column in a Pandas dataframe?

I have a dataframe:

      Name   Section
    1 James  P3
    2 Sam    2.5C
    3 Billy  T35
    4 Sarah  A85
    5 Felix  5I

How do I split the numeric values into a separate column called Section_Number and the alphabetic values into Section_Letter?

Desired results:

      Name   Section  Section_Number  Section_Letter
    1 James  P3       3               P
    2 Sam    2.5C     2.5             C
    3 Billy  T35      35              T
    4 Sarah  A85      85              A
    5 Felix  5I       5               I

Use str.replace with str.extract by [A-Z]+ for all uppercase strings (in pandas 2.0 and later, str.replace treats the pattern literally unless regex=True is passed):

    df['Section_Number'] = df['Section'].str.replace('([A-Z]+)', '', regex=True)
    df['Section_Letter'] = df['Section'].str.extract('([A-Z]+)')
    print (df)

    Name Section Section_Number Section
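Since the printed output is truncated in this listing, here is a self-contained sketch of the same idea that can be run directly. The DataFrame is rebuilt from the question's table, with `regex=True` and `expand=False` passed explicitly so that the pattern is treated as a regular expression and str.extract returns a Series.

```python
import pandas as pd

# Sample frame rebuilt from the question's table.
df = pd.DataFrame(
    {
        "Name": ["James", "Sam", "Billy", "Sarah", "Felix"],
        "Section": ["P3", "2.5C", "T35", "A85", "5I"],
    },
    index=range(1, 6),
)

# Remove the uppercase letters to leave the numeric part (still a string).
df["Section_Number"] = df["Section"].str.replace(r"[A-Z]+", "", regex=True)

# Capture the uppercase letters; expand=False returns a Series, not a DataFrame.
df["Section_Letter"] = df["Section"].str.extract(r"([A-Z]+)", expand=False)

print(df)
```

Section_Number remains text after this step; if numbers are needed, `pd.to_numeric(df["Section_Number"])` would convert it.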

How to record bad lines skipped by pandas

Question: I'm reading a CSV file with pandas using error_bad_lines=False. A warning is printed when a bad line is encountered. However, I want to keep a record of all the bad line numbers to feed into another program. Is there an easy way of doing that?

I thought about iterating over the file with chunksize=1 and catching the CParserError that ought to be thrown for each bad line encountered. When I do this, though, no CParserError is thrown for bad lines, so I can't catch them.

Answer 1: Warnings are printed
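The answer is cut off in this listing. As a hedged alternative for newer pandas versions, not the original answer: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 in favour of on_bad_lines, which from pandas 1.4 accepts a callable when engine="python" is used. The callable receives the fields of each bad line, so their contents can be recorded, but it does not receive the line number. The CSV text and the helper name `remember_bad_line` below are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV text: the third line has one field too many.
csv_text = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

bad_rows = []


def remember_bad_line(fields):
    """Record the fields of a bad line, then skip it by returning None."""
    bad_rows.append(fields)
    return None


# A callable for on_bad_lines is only supported by the python engine.
df = pd.read_csv(
    io.StringIO(csv_text), engine="python", on_bad_lines=remember_bad_line
)

print(bad_rows)
print(df)
```

If line numbers rather than contents are needed, the recorded fields could be matched back against the raw file in a separate pass.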