I have a CSV file that uses three different delimiters, '|', ',' and ';', between its columns.
How can I parse this CSV using Python?
My sample data looked like this:
2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:19:30Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1583e83882b8e7Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:19:26Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|05:21:59+0000|TRANSACTIONDelim_secondVIEW_PRIVACY_POLICYDelim_firstCONSUMERIDDelim_secondnullDelim_firstTRANSACTIONDATEDelim_second17-01-24 05:21:59Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|05:59:25+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1586a2aa4bc18fDelim_firstTRANSACTIONDATEDelim_second17-01-24 05:59:21Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|05:59:36+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_second1583e83882b8e7Delim_firstTRANSACTIONDATEDelim_second17-01-24 05:59:31Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second24-Jan-2017|**
2017-01-24|06:04:25+0000|TRANSACTIONDelim_secondUSER_LOGOUTDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:04:24Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second|**
2017-01-24|06:05:07+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:05:07Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondnullDelim_firstAIRINGDATEDelim_second|**
2017-01-24|06:05:07+0000|TRANSACTIONDelim_secondUSER_LOGINDelim_firstCONSUMERIDDelim_secondc4115f53-3798-4c9e-9bfd-506c842aff96Delim_firstTRANSACTIONDATEDelim_second17-01-24 06:05:07Delim_firstCHANNELIDDelim_secondDelim_firstSHOWIDDelim_secondDelim_firstEPISODEIDDelim_secondDelim_firstBUSINESSUNITDelim_secondbu002Delim_firstAIRINGDATEDelim_second|**
So it used '|', 'Delim_first' and 'Delim_second' as delimiters.
I needed the data to be split at all three delimiters.
I created a pandas DataFrame from the data and then used:
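    # Column 2 holds the packed key/value fields.
    # Split it at 'Delim_first' into columns 6..13.
    i = 0
    while i < 8:
        df10[i+6] = (df10[2].str.split('Delim_first').apply(pd.Series).astype(str))[i]
        i = i + 1
    # Split each of those columns at 'Delim_second' into a key column and a value column.
    j = 0
    while j < 8:
        df10[2*j+14] = (df10[j+6].str.split('Delim_second').apply(pd.Series).astype(str))[0]
        df10[2*j+15] = (df10[j+6].str.split('Delim_second').apply(pd.Series).astype(str))[1]
        j = j + 1
    # Strip the trailing '+0000' offset from the time column.
    j = 0
    for i in df10[1]:
        df10.loc[j, 1] = i[:-5]
        j = j + 1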
Sticking with the standard library, re.split() can split a line at any of these delimiters:
    import re

    with open(file_name) as fobj:
        for line in fobj:
            line_data = re.split('Delim_first|Delim_second|[|]', line)
            print(line_data)
This will split at the delimiters |, Delim_first, and Delim_second.
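As a rough sketch of how this plays out on one record, the snippet below runs the same pattern on a shortened line in the question's format and also pairs the keys with their values. The helper name parse_line and the abbreviated sample line are illustrative assumptions, not part of the original data or code.

```python
import re

# Shortened sample line in the question's format (most fields omitted).
line = ("2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGIN"
        "Delim_firstCONSUMERIDDelim_secondnull"
        "Delim_firstAIRINGDATEDelim_second|**")


def parse_line(line):
    """Hypothetical helper: turn one record into a dict of date, time and key/value fields."""
    date, time, payload = line.rstrip("\n").split("|")[:3]
    record = {"date": date, "time": time}
    # 'Delim_first' separates the pairs, 'Delim_second' separates key from value.
    for pair in payload.split("Delim_first"):
        key, _, value = pair.partition("Delim_second")
        record[key] = value
    return record


# Flat split at all three delimiters, as in the answer above.
flat = re.split(r"Delim_first|Delim_second|[|]", line)
parsed = parse_line(line)
print(flat)
print(parsed)
```

The flat split keeps empty strings for empty values (such as AIRINGDATE here), so the positions stay aligned across lines; the dict form may be easier to work with if you need the fields by name.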
Or with pandas:
import pandas as pd
    df = pd.read_csv('multi_delim.csv', sep='Delim_first|Delim_second|[|]',
                     engine='python', header=None)
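To try the same call without the file, here is a minimal sketch on an in-memory sample. The two shortened rows are made up in the question's format, and keep_default_na=False is an extra assumption so that the literal string "null" is not converted to NaN.

```python
import io

import pandas as pd

# Two shortened rows in the question's format (illustrative, not the full data).
sample = (
    "2017-01-24|05:19:30+0000|TRANSACTIONDelim_secondUSER_LOGIN"
    "Delim_firstBUSINESSUNITDelim_secondnull\n"
    "2017-01-24|05:19:26+0000|TRANSACTIONDelim_secondUSER_LOGOUT"
    "Delim_firstBUSINESSUNITDelim_secondbu002\n"
)

# A regular-expression separator requires the Python parsing engine.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"Delim_first|Delim_second|[|]",
    engine="python",
    header=None,
    keep_default_na=False,  # keep "null" as text instead of NaN
)
print(df)
```

With header=None the columns are numbered from 0, so the keys and values alternate in columns 2, 3, 4, 5 and so on.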
One easy way to achieve what you want is to use the pandas package. Here's a small example:
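    import pandas as pd
    from io import StringIO  # Python 3; the Python 2 StringIO module no longer exists

    data = StringIO("""a;b|c;
    2016-09-05 10:47:00|1,foo;
    2016-09-06 10:47:00;2;foo2;
    2016-09-07 10:47:00;3;foo3;""")

    # A character class splits at any one of ';', ',' or '|'.
    # skipinitialspace drops leading blanks before each field.
    df = pd.read_csv(data, sep='[;,|]', engine='python', skipinitialspace=True)
    for c in ['a', 'b', 'c']:
        print('-' * 80)
        print(df[c])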