data-cleaning

Blocking '0000-00-00' from MySQL Date Fields

Submitted by 廉价感情. on 2019-11-30 05:57:02
Question: I have a database where old code likes to insert '0000-00-00' in DATE and DATETIME columns instead of a real date, so I have the following two questions: Is there anything I can do at the database level to block this? I know that I can set a column to NOT NULL, but that does not seem to block these zero values. What is the best way to detect the existing zero values in date fields? I have about a hundred tables with 2-3 date columns each and I don't want to query them individually.
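
Only the question is included in this excerpt. For the detection half, a minimal sketch of one common approach: let information_schema enumerate the DATE/DATETIME columns and loop over them instead of writing a hundred queries by hand. The connection details, the schema name mydb, and the use of mysql-connector-python are assumptions for illustration, not part of the original thread; blocking the inserts themselves is usually handled by enabling the NO_ZERO_DATE / strict sql_mode settings.

    import mysql.connector  # assumption: the mysql-connector-python package

    SCHEMA = "mydb"  # hypothetical database name

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database=SCHEMA)
    cur = conn.cursor()

    # Enumerate every DATE/DATETIME column in the schema.
    cur.execute(
        "SELECT table_name, column_name FROM information_schema.columns "
        "WHERE table_schema = %s AND data_type IN ('date', 'datetime')",
        (SCHEMA,),
    )
    date_columns = cur.fetchall()

    # Count zero dates per column; casting to CHAR sidesteps DATE vs DATETIME literals.
    for table, column in date_columns:
        cur.execute(
            f"SELECT COUNT(*) FROM `{table}` "
            f"WHERE CAST(`{column}` AS CHAR) LIKE '0000-00-00%'"
        )
        count = cur.fetchone()[0]
        if count:
            print(f"{table}.{column}: {count} zero value(s)")

    cur.close()
    conn.close()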

multi-column factorize in pandas

Submitted by 六月ゝ 毕业季﹏ on 2019-11-30 05:09:35
The pandas factorize function assigns each unique value in a series a sequential, 0-based index and calculates which index each series entry belongs to. I'd like to accomplish the equivalent of pandas.factorize on multiple columns:

    import pandas as pd
    df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
    pd.factorize(df)[0]  # would like [0, 1, 2, 2, 1, 0]

That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to. factorize only works on a single column…
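
The answers are not included in this excerpt. As a sketch, two common ways to get per-row codes from several columns, both reproducing the [0, 1, 2, 2, 1, 0] the asker wants:

    import pandas as pd

    df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

    # 1) Factorize the row tuples directly.
    codes, uniques = pd.factorize(pd.Series(list(zip(df['x'], df['y']))))
    print(codes.tolist())        # [0, 1, 2, 2, 1, 0]

    # 2) Group by both columns and number the groups in order of appearance.
    codes2 = df.groupby(['x', 'y'], sort=False).ngroup()
    print(codes2.tolist())       # [0, 1, 2, 2, 1, 0]

Both variants number unique tuples in order of first appearance, which matches factorize's behaviour on a single series.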

Python pandas groupby aggregate on multiple columns, then pivot

Submitted by こ雲淡風輕ζ on 2019-11-30 04:43:29
In Python, I have a pandas DataFrame similar to the following:

    Item  | shop1 | shop2 | shop3 | Category
    ------|-------|-------|-------|-----------
    Shoes |    45 |    50 |    53 | Clothes
    TV    |   200 |   300 |   250 | Technology
    Book  |    20 |    17 |    21 | Books
    phone |   300 |   350 |   400 | Technology

where shop1, shop2 and shop3 are the costs of every item in different shops. Now, I need to return a DataFrame, after some data cleaning, like this one:

    Category (index) | size | sum | mean | std
    -----------------|------|-----|------|-----

where size is the number of items in each Category and sum, mean and std are related to the same…
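
The excerpt cuts off mid-sentence, so exactly which values feed sum/mean/std is partly guesswork. A sketch of the usual melt-then-groupby approach, assuming those statistics are taken over the per-shop costs and size counts items per category:

    import pandas as pd

    df = pd.DataFrame({
        'Item': ['Shoes', 'TV', 'Book', 'phone'],
        'shop1': [45, 200, 20, 300],
        'shop2': [50, 300, 17, 350],
        'shop3': [53, 250, 21, 400],
        'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
    })

    # Reshape the shop columns into one long 'cost' column.
    long_df = df.melt(id_vars=['Item', 'Category'],
                      value_vars=['shop1', 'shop2', 'shop3'],
                      value_name='cost')

    # Aggregate the costs per category, then add the item count per category.
    result = long_df.groupby('Category')['cost'].agg(['sum', 'mean', 'std'])
    result['size'] = df.groupby('Category').size()
    result = result[['size', 'sum', 'mean', 'std']]
    print(result)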

Find all columns of dataframe in Pandas whose type is float, or a particular type?

Submitted by 大兔子大兔子 on 2019-11-29 22:53:20
I have a dataframe, df, that has some columns of type float64, while the others are of type object. Due to the mixed nature, I cannot use

    df.fillna('unknown')  # raises "ValueError: could not convert string to float:"

as the error comes from the columns whose type is float64 (what a misleading error message!), so I'd wish I could do something like

    for col in df.columns[<dtype == object>]:
        df[col] = df[col].fillna('unknown')

So my question is whether there is any such filter expression that I can use with df.columns. I guess alternatively, less elegantly, I could do:

    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].fillna('unknown')
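
No answer is included in this excerpt. As a sketch, the filter expression the asker is after is commonly written with select_dtypes; the toy frame below is made up for illustration:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, 'z']})

    # select_dtypes picks out the object-typed columns, so only those get filled.
    obj_cols = df.select_dtypes(include=['object']).columns
    df[obj_cols] = df[obj_cols].fillna('unknown')
    print(df)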

Data cleaning of dollar values and percentage in R

Submitted by ≯℡__Kan透↙ on 2019-11-29 17:23:57
I've been searching for a number of packages in R to help me convert dollar values to plain numeric values, but I don't seem to be able to find one (in the plyr package, for example). The basic thing I'm looking for is simply removing the $ sign as well as translating "M" and "K" into millions and thousands, respectively. To replicate, I can use the code below:

    require(XML)
    theurl <- "http://www.kickstarter.com/help/stats"
    html <- htmlParse(theurl)
    allProjects <- readHTMLTable(html)[[1]]
    names(allProjects) <- c("Category", "LaunchedProjects", "TotalDollars",
                            "SuccessfulDollars", "UnsuccessfulDollars"…
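
The question itself is about R, but the cleaning it describes is a plain string transformation. As an illustration of that logic in Python (the language used for the other sketches in this digest), with made-up sample values; in R the same idea is usually a gsub() to drop the symbols plus a lookup for the K/M multiplier:

    def dollars_to_number(s):
        """Turn strings like '$29M', '$1.2K' or '$450' into plain numbers."""
        multipliers = {'M': 1e6, 'K': 1e3}
        s = s.strip().lstrip('$').replace(',', '')
        if s and s[-1].upper() in multipliers:
            return float(s[:-1]) * multipliers[s[-1].upper()]
        return float(s)

    print(dollars_to_number('$29M'))   # 29000000.0
    print(dollars_to_number('$1.2K'))  # 1200.0
    print(dollars_to_number('$450'))   # 450.0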

How to record bad lines skipped by pandas

Submitted by 旧街凉风 on 2019-11-29 16:24:17
I'm reading a CSV file with pandas using error_bad_lines=False, so a warning is printed when a bad line is encountered. However, I want to keep a record of all the bad line numbers to feed into another program. Is there an easy way of doing that? I thought about iterating over the file with chunksize=1 and catching the CParserError that ought to be thrown for each bad line encountered, but when I do this no CParserError is thrown for the bad lines, so I can't catch them.

Answer: Warnings are printed on the standard error channel. You can capture them to a file by redirecting the sys.stderr output…
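
The answer's code is cut off in this excerpt. A minimal sketch of the redirect it describes, assuming an older pandas where error_bad_lines/warn_bad_lines still exist, that the parser's "Skipping line N" messages go through sys.stderr, and with 'data.csv' as a placeholder path:

    import re
    import sys
    from io import StringIO

    import pandas as pd

    captured = StringIO()
    original_stderr, sys.stderr = sys.stderr, captured    # capture the warnings
    try:
        df = pd.read_csv('data.csv', error_bad_lines=False, warn_bad_lines=True)
    finally:
        sys.stderr = original_stderr                       # always restore stderr

    # Warnings look like "Skipping line 12: expected 5 fields, saw 7"
    bad_lines = [int(n) for n in re.findall(r'Skipping line (\d+)', captured.getvalue())]
    print(bad_lines)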

Removing non-English words from text using Python

Submitted by 末鹿安然 on 2019-11-29 12:30:45
Question: I am doing a data-cleaning exercise in Python, and the text I am cleaning contains Italian words that I would like to remove. I have been searching online for whether I can do this in Python using a toolkit like nltk. For example, given the text "Io andiamo to the beach with my amico." I would like to be left with "to the beach with my". Does anyone know how this could be done? Any help would be much appreciated.

Answer 1: You can use the words corpus from NLTK:
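
The rest of the answer's code is cut off in this excerpt. A sketch of what filtering against the NLTK words corpus typically looks like; note that short foreign words that happen to appear in the English word list can still slip through:

    import nltk
    nltk.download('words')            # one-time download of the word list
    from nltk.corpus import words

    english_vocab = set(w.lower() for w in words.words())

    text = "Io andiamo to the beach with my amico."
    cleaned = ' '.join(tok for tok in nltk.wordpunct_tokenize(text)
                       if tok.lower() in english_vocab)
    print(cleaned)                    # roughly: "to the beach with my"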

Looping grepl() through data.table (R)

Submitted by ◇◆丶佛笑我妖孽 on 2019-11-29 07:03:53
I have a dataset stored as a data.table DT that looks like this:

    print(DT)
                 category industry
    1:     administration    admin
    2: nurse practitioner    truck
    3:           trucking    truck
    4:     administration    admin
    5:        warehousing    nurse
    6:        warehousing    admin
    7:           trucking    truck
    8: nurse practitioner    nurse
    9: nurse practitioner    truck

I would like to reduce the table to only the rows where the industry matches the category. My general approach is to use grepl() to regex-match the string '^{{INDUSTRY}}[a-z ]+$' against each row of DT$category, with each corresponding row of DT$industry inserted in place of {{INDUSTRY}} in the regex…
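
The excerpt stops before any answer, and the question itself is about data.table and grepl() in R. Purely as an illustration of the row-wise matching logic it describes (build a pattern from each row's industry and test it against the same row's category), a sketch in Python, the language used for the other snippets here, with the same toy rows:

    import re
    import pandas as pd

    DT = pd.DataFrame({
        'category': ['administration', 'nurse practitioner', 'trucking',
                     'administration', 'warehousing', 'warehousing',
                     'trucking', 'nurse practitioner', 'nurse practitioner'],
        'industry': ['admin', 'truck', 'truck', 'admin', 'nurse',
                     'admin', 'truck', 'nurse', 'truck'],
    })

    # Keep rows where the category matches the '^{{INDUSTRY}}[a-z ]+$' template
    # built from that row's industry value.
    mask = [bool(re.match(rf'^{ind}[a-z ]+$', cat))
            for cat, ind in zip(DT['category'], DT['industry'])]
    print(DT[mask])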

Blocking '0000-00-00' from MySQL Date Fields

Submitted by 你离开我真会死。 on 2019-11-29 06:28:39
I have a database where old code likes to insert '0000-00-00' in DATE and DATETIME columns instead of a real date, so I have the following two questions: Is there anything I can do at the database level to block this? I know that I can set a column to NOT NULL, but that does not seem to block these zero values. What is the best way to detect the existing zero values in date fields? I have about a hundred tables with 2-3 date columns each and I don't want to query them individually. Followup: the default is already set to NULL; a long time ago the default was '0000-00-00', and some code…
