data-munging | 易学教程

Transform a folder of CSV files the same way, then output multiple dataframes with python

Question: I've got a folder of CSV files that I need to transform and clean up, outputting a dataframe that I can then continue to work with. I'd like one uniquely named dataframe per CSV file. I wrote code that transforms a single CSV file the way I'd like, with a clean dataframe at the end, but I'm getting tripped up when I try to iterate through the folder and transform all of the CSV files, ending with one dataframe per CSV. Here's the code I've
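The asker's own transformation code is cut off in this excerpt, so the following is only a minimal sketch of the looping part. It assumes the single-file cleanup can be wrapped in a function; `clean_one`, `load_folder`, and the sample file names are hypothetical placeholders, and the cleanup shown (normalizing column names, dropping empty rows) is a stand-in rather than the asker's logic. Instead of creating one separately named variable per file, the sketch keeps the dataframes in a dict keyed by file name.

```python
import tempfile
from pathlib import Path

import pandas as pd


def clean_one(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the single-file cleanup step."""
    df = df.copy()
    # Normalize header names: strip surrounding spaces, lower-case.
    df.columns = [str(c).strip().lower() for c in df.columns]
    # Drop rows in which every value is missing.
    return df.dropna(how="all")


def load_folder(folder) -> dict:
    """Apply clean_one to every *.csv in folder, keyed by file stem."""
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        frames[path.stem] = clean_one(pd.read_csv(path))
    return frames


# Small demonstration with two throwaway files (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sales_may.csv").write_text(" ID ,Quantity\n1,20\n2,23\n")
    Path(tmp, "sales_jun.csv").write_text(" ID ,Quantity\n5,8\n")
    frames = load_folder(tmp)

print(sorted(frames))  # one entry per CSV file
```

Each cleaned dataframe is then available by name, for example `frames["sales_may"]`.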

Unexpected results of min() and max() methods of Pandas series made of Timestamp objects

Question: I encountered this behaviour when doing basic data munging, like in this example:

    In [55]: import pandas as pd
    In [56]: import numpy as np
    In [57]: rng = pd.date_range('1/1/2000', periods=10, freq='4h')
    In [58]: lvls = ['A','A','A','B','B','B','C','C','C','C']
    In [59]: df = pd.DataFrame({'TS': rng, 'V' : np.random.randn(len(rng)), 'L' : lvls})
    In [60]: df
    Out[60]:
       L                  TS         V
    0  A 2000-01-01 00:00:00 -1.152371
    1  A 2000-01-01 04:00:00 -2.035737
    2  A 2000-01-01 08:00:00 -0.493008
    3  B 2000-01-01 12:00

How to efficiently rearrange pandas data as follows?

I need some help with a concise and, above all, efficient formulation in pandas of the following operation. Given a data frame of the format

    id   a   b   c   d
    1    0  -1   1   1
    42   0   1   0   0
    128  1  -1   0   1

construct a data frame of the format:

    id   one_entries
    1    "c d"
    42   "b"
    128  "a d"

That is, the column "one_entries" contains the concatenated names of the columns for which the entry in the original frame is 1.

Here's one way, using a boolean mask and applying a lambda function.

    In [58]: df
    Out[58]:
        id  a  b  c  d
    0    1  0 -1  1  1
    1   42  0  1  0  0
    2  128  1 -1  0  1

    In [59]: cols = list('abcd')

    In [60]: (df[cols] > 0).apply(lambda x: ' '
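The answer above is cut off in this excerpt, so the following is a separate, self-contained sketch of the same idea rather than a reconstruction of it. It assumes the example frame from the question and tests for equality with 1 (`eq(1)`) instead of `> 0`; for this data the two give the same mask. The variable names `mask`, `one_entries`, and `result` are illustrative.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {
        "id": [1, 42, 128],
        "a": [0, 0, 1],
        "b": [-1, 1, -1],
        "c": [1, 0, 0],
        "d": [1, 0, 1],
    }
)

cols = list("abcd")

# Boolean mask: True where the entry equals 1.
mask = df[cols].eq(1)

# For each row, join the names of the columns flagged True.
one_entries = mask.apply(
    lambda row: " ".join(col for col, flag in row.items() if flag),
    axis=1,
)

result = pd.DataFrame({"id": df["id"], "one_entries": one_entries})
print(result)
```

A row-wise `apply` is easy to read but is not vectorized, so for very large frames it may be worth benchmarking against alternatives.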

Expanding pandas Data Frame rows based on number and group ID (Python 3).

I have been struggling to find a way to expand/clone observation rows based on a predetermined number and a grouping variable (id). For context, here is an example data frame using pandas and numpy (Python 3).

    df = pd.DataFrame([[1, 15], [2, 20]], columns = ['id', 'num'])
    df
    Out[54]:
       id  num
    0   1   15
    1   2   20

I want to expand/clone the rows by the number given in the "num" variable based on their ID group. In this case, I would want 15 rows for id = 1 and 20 rows for id = 2. This is probably an easy question, but I am struggling to make this work. I've been messing around with reindex and np
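One possible approach, sketched here under the assumption that "num" holds non-negative integers, is to repeat each index label as many times as its "num" value and then select those labels with `.loc`. This is an illustration, not necessarily the approach the asker was attempting with reindex; the name `expanded` is made up for the example.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame([[1, 15], [2, 20]], columns=["id", "num"])

# Repeat each index label "num" times, then pull the matching rows.
expanded = df.loc[df.index.repeat(df["num"])].reset_index(drop=True)

print(len(expanded))                    # total number of rows
print(expanded["id"].value_counts())    # rows per id
```

`reset_index(drop=True)` replaces the repeated labels with a fresh 0..n-1 index; leave it out if the original label is needed to trace each clone back to its source row.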

Strip white spaces from CSV file

I need to strip the white spaces from a CSV file that I read:

    import csv

    aList=[]
    with open(self.filename, 'r') as f:
        reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
        for row in reader:
            aList.append(row)
        # I need to strip the extra white space from each string in the row
        return(aList)

Answer (CaraW): There's also the embedded formatting parameter skipinitialspace (the default is False): http://docs.python.org/2/library/csv.html#csv-fmt-params

    aList=[]
    with open(self.filename, 'r') as f:
        reader = csv.reader(f, skipinitialspace=True, delimiter=',', quoting=csv.QUOTE_NONE)
        for row in reader:
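A caveat worth illustrating: `skipinitialspace=True` only drops whitespace that immediately follows a delimiter, so trailing spaces inside a field survive. The sketch below combines the parameter with an explicit `str.strip()` on every cell. The function name `read_stripped` and the sample data are hypothetical, and an in-memory `io.StringIO` stands in for the opened file.

```python
import csv
import io


def read_stripped(lines):
    """Parse CSV lines and strip whitespace on both sides of each cell."""
    reader = csv.reader(lines, delimiter=",", skipinitialspace=True)
    return [[cell.strip() for cell in row] for row in reader]


# In-memory stand-in for a file with stray spaces around the values.
sample = io.StringIO("name , city\n alice ,  paris \n")
rows = read_stripped(sample)
print(rows)
```

With a real file, the same function can be called as `read_stripped(f)` inside a `with open(...) as f:` block.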

Pandas merge two dataframes with different columns

I'm surely missing something simple here. I'm trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.

    >df_may
       id  quantity  attr_1  attr_2
    0   1        20       0       1
    1   2        23       1       1
    2   3        19       1       1
    3   4        19       0       0

    >df_jun
       id  quantity  attr_1  attr_3
    0   5         8       1       0
    1   6        13       0       1
    2   7        20       1       1
    3   8        25       1       1

I've tried joining with an outer join:

    mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

    Left data columns not unique: Index([....

I've also specified a single column to join on (on = "id", e.g.), but that
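Since the two frames hold different rows (ids 1 to 4 and 5 to 8) rather than matching keys, stacking them may be closer to the intent than joining. The sketch below, which assumes the sample data shown in the question and unique column names in each frame, uses `pd.concat` to take the union of the columns and fill the gaps with NaN. It also shows an outer `merge` on the shared columns for comparison. The names `combined` and `merged` are illustrative. The "columns not unique" error in the question suggests the real data may contain duplicated column names, which would need fixing first; that is an assumption, as the full traceback is not shown.

```python
import pandas as pd

# Sample frames from the question.
df_may = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "quantity": [20, 23, 19, 19],
        "attr_1": [0, 1, 1, 0],
        "attr_2": [1, 1, 1, 0],
    }
)
df_jun = pd.DataFrame(
    {
        "id": [5, 6, 7, 8],
        "quantity": [8, 13, 20, 25],
        "attr_1": [1, 0, 1, 1],
        "attr_3": [0, 1, 1, 1],
    }
)

# Stack the rows; columns missing from one frame become NaN.
combined = pd.concat([df_may, df_jun], ignore_index=True, sort=False)
print(combined)

# For comparison: an outer merge on the columns both frames share.
merged = df_may.merge(df_jun, how="outer")
print(merged)
```

Columns that receive NaN are upcast to float, so `attr_2` and `attr_3` will no longer be integer columns in the result.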

How to convert a python datetime.datetime to excel serial date number

I need to convert dates into Excel serial numbers for a data munging script I am writing. By playing with dates in my OpenOffice Calc workbook, I was able to deduce that '1-Jan 1899 00:00:00' maps to the number zero. I wrote the following function to convert from a Python datetime object into an Excel serial number:

    import datetime as dt

    def excel_date(date1):
        temp=dt.datetime.strptime('18990101', '%Y%m%d')
        delta=date1-temp
        total_seconds = delta.days * 86400 + delta.seconds
        return total_seconds

However, when I try some sample dates, the numbers are different from those I get when I format the date as a number in
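Two things in the function above are likely to explain the mismatch: it returns a count of seconds, whereas serial date numbers count days, and it measures from 1 January 1899. The sketch below is one possible correction, not the thread's confirmed answer. It assumes day zero is 30 December 1899, which is the default null date in OpenOffice Calc and agrees with Excel's 1900 date system for dates from 1 March 1900 onward. The constant name `EPOCH` is made up for the example.

```python
import datetime as dt

# Assumed day zero: 1899-12-30 (see the note above).
EPOCH = dt.datetime(1899, 12, 30)


def excel_date(date1: dt.datetime) -> float:
    """Return the serial date number as days (with fraction) since EPOCH."""
    delta = date1 - EPOCH
    # Dividing two timedeltas yields a float number of days.
    return delta / dt.timedelta(days=1)


print(excel_date(dt.datetime(2000, 1, 1)))      # 36526.0
print(excel_date(dt.datetime(2000, 1, 1, 12)))  # 36526.5
```

Dates before 1 March 1900 need extra care in Excel itself, because its 1900 date system treats 1900 as a leap year.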