TypeError: unsupported operand type(s) for -: 'str' and 'str' in python 3.x Anaconda

Anonymous (unverified), submitted 2019-12-03 01:00:01

Question:

I am trying to count instances per hour in a large dataset. The code below works fine on Python 2.7, but I had to upgrade to the latest Python 3.x with all packages updated on Anaconda. When I run the program, I get the following str error.

Code:

import pandas as pd
from datetime import datetime,time
import numpy as np

fn = r'00_input.csv'
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.ix[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

#r.to_csv('results.csv', index=False)
#print(r[r.LogCount > 0])
#print (r['StartTime'], r['EndTime'], r['Day'], r['LogCount'], r['UniqueIDCount'])

rout = r[['Date', 'StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount']]
#print rout
rout.to_csv('o_1_hour.csv', index=False, header=False)

Where do I need to make changes to get an error-free execution?

Error:

File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py", line 686, in <lambda>
    lambda x: op(x, rvalues))
TypeError: unsupported operand type(s) for -: 'str' and 'str'

I appreciate the help. Thanks in advance.

Answer 1:

I think you need to pass header=0 so that the first row of the file is treated as the header. The column names from that row are then replaced by the list cols.

If the problem persists, you need to_numeric, because some values in StartTime and StopTime are strings. With errors='coerce' these are parsed to NaN, which is then replaced by 0, and finally the column is converted to int:

cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv('canada_mini_unixtime.csv', header=0, names=cols)
#print (df)

df['StartTime'] = pd.to_numeric(df['StartTime'], errors='coerce').fillna(0).astype(int)
df['StopTime'] = pd.to_numeric(df['StopTime'], errors='coerce').fillna(0).astype(int)
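To make the effect of header=0 and to_numeric concrete, here is a minimal sketch. The CSV text is made-up sample data (the asker's real file is not shown), and it assumes that the file begins with a header row, which is what this answer suspects.

```python
import io

import pandas as pd

# Hypothetical sample data; the real 00_input.csv is not shown in the question.
csv_text = (
    "UserId,UserMAC,HotspotID,StartTime,StopTime\n"
    "1,aa:bb,10,1500000000,1500003600\n"
    "2,cc:dd,11,1500001800,1500007200\n"
)
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']

# header=None keeps the header row as data, so the time columns hold strings.
as_data = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

# header=0 consumes the first row as the header; names=cols then replaces it.
as_header = pd.read_csv(io.StringIO(csv_text), header=0, names=cols)

# errors='coerce' turns values that cannot be parsed into NaN,
# which fillna(0) replaces before the cast to int.
coerced = pd.to_numeric(as_data['StartTime'], errors='coerce')
cleaned = coerced.fillna(0).astype(int)

print(len(as_data), len(as_header))   # 3 rows versus 2 rows
print(as_header['StartTime'].dtype)   # an integer dtype
print(cleaned.tolist())               # the header text has become 0
```

Note that fillna(0) turns every unparseable value into a timestamp of 0 (1970-01-01), so it may be worth checking how many rows were coerced before relying on the result.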

This part stays unchanged:

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

ix is deprecated in recent versions of pandas, so use loc instead, with the column name inside the []:

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df.loc[np.abs(df.m - 2*row.start - interval) < df.d + interval, 'UserId']
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

print (r)
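The code above targets the pandas release that was current when the answer was written. As a hedged sketch, the same hourly overlap count on a newer pandas might look like the following. It assumes that pd.datetime and .dt.weekday_name are no longer available, so pd.Timestamp and .dt.day_name() are used instead, and it passes the frequency as a Timedelta rather than the '1H' string. The three sessions in df are made-up sample data, not the asker's file.

```python
import numpy as np
import pandas as pd

# Hypothetical sessions in Unix seconds; not taken from the asker's file.
df = pd.DataFrame({
    'UserId': [1, 2, 1],
    'StartTime': [1500000000, 1500001800, 1500010000],
    'StopTime': [1500003600, 1500007200, 1500010500],
})
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# Midnight before the first session and midnight after the last one.
start = pd.to_datetime(df.StartTime.min(), unit='s').normalize()
end = pd.to_datetime(df.StopTime.max(), unit='s').normalize() + pd.Timedelta(days=1)

# One row per hour; a Timedelta is used as the frequency.
idx = pd.date_range(start, end, freq=pd.Timedelta(hours=1))
r = pd.DataFrame(index=idx)
# pd.Timestamp stands in for the older pd.datetime alias.
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second, as in the original code.
interval = 60 * 60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # Same interval overlap test as in the answer, selecting with .loc.
    overlaps = np.abs(df.m - 2 * row['start'] - interval) < df.d + interval
    u = df.loc[overlaps, 'UserId']
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

# day_name() is used here in place of the weekday_name attribute.
r['Day'] = pd.to_datetime(r['start'], unit='s').dt.day_name().str[:3]

print(r[r.LogCount > 0])
```

With this sample data, only the hours starting at 02:00, 03:00, 04:00, and 05:00 on 2017-07-14 have a nonzero LogCount, and the 03:00 row counts two sessions from two distinct users.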


Answer 2:

df['d'] = df.StopTime - df.StartTime is attempting to subtract a string from another string. I don't know what your data looks like, but chances are that you want to parse StopTime and StartTime as dates. Try

df = pd.read_csv(fn, header=None, names=cols, parse_dates=[3,4]) 

instead of df = pd.read_csv(fn, header=None, names=cols).
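The failure described in this answer can be reproduced in isolation. The sketch below uses made-up string values in place of the real columns. It also suggests why the df['m'] line did not fail first: + on strings concatenates silently, while - has no meaning for strings.

```python
import pandas as pd

# Hypothetical values standing in for columns that were read as text.
start = pd.Series(['100', '200'], dtype=object)
stop = pd.Series(['160', '290'], dtype=object)

# '+' on strings concatenates, so this line does not fail.
print((stop + start).tolist())   # ['160100', '290200']

# '-' is not defined for strings, so pandas raises a TypeError.
try:
    stop - start
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Once the values are numeric, the subtraction works as intended.
duration = pd.to_numeric(stop) - pd.to_numeric(start)
print(duration.tolist())   # [60, 90]
```

Checking df.dtypes right after read_csv is a quick way to see whether StartTime and StopTime were read as numbers or as text.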


