Question
I am trying to read a CSV file with Pandas, but the first column contains a first name and a last name separated by a comma. As a result, Pandas sees 5 columns instead of 4, so the last column has no header and cannot be selected.
The file looks like this:
CustomerName,ClientID,EmailDate,EmailAddress
FNAME1,LNAME1,100,2019-01-13 00:00:00.000,FNAME1@HOTMAIL.COM
FNAME2,LNAME2,100,2019-01-13 00:00:00.000,FNAME2@GMAIL.COM
FNAME3,LNAME3,100,2019-01-13 00:00:00.000,FNAME3@AOL.COM
FNAME4,LNAME4,100,2019-01-13 00:00:00.000,FNAME40@GMAIL.COM
FNAME5,LNAME5,100,2019-01-13 00:00:00.000,FNAME5@AOL.COM
What my code looks like now:
import os

import pandas as pd


def convert_ftp_data():
    file = os.getcwd() + "/data.csv"
    data = pd.read_csv(file, index_col=False)
    data["first_name"] = data["CustomerName"].str.split().str[0].str.title()
    data["email"] = data["EmailAddress"]
    clean_data = data.drop(columns=["CustomerName", "ClientID", "EmailDate", "EmailAddress"])
    print(clean_data)
Using my code I get the following output:
first_name email
0 FNAME1 2019-01-13 00:00:00.000
1 FNAME1 2019-01-13 00:00:00.000
2 FNAME1 2019-01-13 00:00:00.000
3 FNAME1 2019-01-13 00:00:00.000
4 FNAME1 2019-01-13 00:00:00.000
I only need to select the FNAME and EmailAddress fields. What would be the best way to do this?
Answer 1:
Why not just skip the header row and set the column names correctly after import?
data = pd.read_csv(file, index_col=False, header=None, skiprows=1)
data.columns = 'CustomerFirstName,CustomerName,ClientID,EmailDate,EmailAddress'.split(',')
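For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same idea. It assumes an in-memory copy of the first two sample rows (the `raw` string stands in for data.csv), and the `clean_data` step at the end is my own addition to get the two fields the question asks for, not part of the answer itself.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for data.csv (first two rows of the sample in the question).
raw = """CustomerName,ClientID,EmailDate,EmailAddress
FNAME1,LNAME1,100,2019-01-13 00:00:00.000,FNAME1@HOTMAIL.COM
FNAME2,LNAME2,100,2019-01-13 00:00:00.000,FNAME2@GMAIL.COM
"""

# Skip the four-name header row so every data row is read as five plain fields.
data = pd.read_csv(StringIO(raw), index_col=False, header=None, skiprows=1)

# Assign five column names that match the actual number of fields.
data.columns = "CustomerFirstName,CustomerName,ClientID,EmailDate,EmailAddress".split(",")

# Keep only the two fields the question asks for.
clean_data = pd.DataFrame({
    "first_name": data["CustomerFirstName"].str.title(),
    "email": data["EmailAddress"],
})
print(clean_data)
```

Building a new frame from the two wanted columns avoids having to list every column to drop.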
Answer 2:
Read the headers separately.
With pd.read_csv, you can use the nrows, skiprows and names parameters:
from io import StringIO

import numpy as np
import pandas as pd

x = """CustomerName,ClientID,EmailDate,EmailAddress
FNAME1,LNAME1,100,2019-01-13 00:00:00.000,FNAME1@HOTMAIL.COM
FNAME2,LNAME2,100,2019-01-13 00:00:00.000,FNAME2@GMAIL.COM
FNAME3,LNAME3,100,2019-01-13 00:00:00.000,FNAME3@AOL.COM
FNAME4,LNAME4,100,2019-01-13 00:00:00.000,FNAME40@GMAIL.COM
FNAME5,LNAME5,100,2019-01-13 00:00:00.000,FNAME5@AOL.COM"""
headers = pd.read_csv(StringIO(x), nrows=0).columns
headers = np.hstack((['FirstName', 'LastName'], headers[1:]))
df = pd.read_csv(StringIO(x), header=None, skiprows=[0], names=headers)
print(df)
# FirstName LastName ClientID EmailDate EmailAddress
# 0 FNAME1 LNAME1 100 2019-01-13 00:00:00.000 FNAME1@HOTMAIL.COM
# 1 FNAME2 LNAME2 100 2019-01-13 00:00:00.000 FNAME2@GMAIL.COM
# 2 FNAME3 LNAME3 100 2019-01-13 00:00:00.000 FNAME3@AOL.COM
# 3 FNAME4 LNAME4 100 2019-01-13 00:00:00.000 FNAME40@GMAIL.COM
# 4 FNAME5 LNAME5 100 2019-01-13 00:00:00.000 FNAME5@AOL.COM
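Answer 3:
Try the following. It relies on pandas using the first field as the index when the data rows have one more field than the header, so the first name ends up in the index: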
pd.read_csv(file, usecols=['EmailAddress']).reset_index().rename(columns={'index': 'first_name', 'EmailAddress': 'email'})
OUTPUT:
first_name email
0 FNAME1 FNAME1@HOTMAIL.COM
1 FNAME2 FNAME2@GMAIL.COM
2 FNAME3 FNAME3@AOL.COM
3 FNAME4 FNAME40@GMAIL.COM
4 FNAME5 FNAME5@AOL.COM
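The one-liner above takes the first name from the index that pandas creates for the extra leading field. If you would rather not depend on that, a possible variant is to select the two fields by position instead. This is only a sketch under the assumption that the first name is always field 0 and the email address is always field 4, as in the sample; the `raw` string is an in-memory stand-in for the file.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for data.csv (first two rows of the sample in the question).
raw = """CustomerName,ClientID,EmailDate,EmailAddress
FNAME1,LNAME1,100,2019-01-13 00:00:00.000,FNAME1@HOTMAIL.COM
FNAME2,LNAME2,100,2019-01-13 00:00:00.000,FNAME2@GMAIL.COM
"""

# Ignore the mismatched header row and read only two fields by position:
# 0 = first name, 4 = email address (assumed layout, taken from the sample).
df = pd.read_csv(StringIO(raw), header=None, skiprows=1, usecols=[0, 4])

# With header=None the columns are labelled by position, so rename them.
df.columns = ["first_name", "email"]
df["first_name"] = df["first_name"].str.title()
print(df)
```

The trade-off is that positional selection breaks silently if the file layout changes, so it is worth checking the field count on a new file first.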
Source: https://stackoverflow.com/questions/54092614/pandas-read-csv-where-one-header-is-missing