I had an issue with defining multiple separators when reading a file. It was originally solved in my previous post, reading-files-with-multiple-delimiter-in-column-headers-and-skipping-some-rows, thanks to @piRsquared.
When I looked at my real data in detail, I realized that some columns contain values with .cd or .dvd extensions. When I applied the solution above, these extensions were also split off into a new column, and the solution stopped working.
b.txt
skip1
A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9
END123
Some other data starts from here
I read this b.txt file using the solution above:
from io import StringIO
import pandas as pd

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
    StringIO(txt),
    sep=r'\W+',
    skiprows=1, index_col=False, engine='python')
A1 A2 A3 A4 A5 A6 A7 A8 A9
0 1 2 3 4 5 cd 6 7 8
1 1 2 3 4 5 cd 6 7 8
2 1 2 3 4 5 cd 6 7 8
The A5 column should contain:
5.cd
5.cd
5.cd
and likewise the A8 column should contain:
8.dvd
8.dvd
8.dvd
The A9 column should contain 9 in every row, but its values are lost because of this conflict.
EDIT:
Here is a sample that closely resembles my real data:
skip rows
A1| A2| A3|A4# A5# | A6 | A7 , A8, A9 | A10 |
1 | 2 | 3 |4 # 5 # | 6.cd|7.dvd, , | 10 |
1 | 2 | 3 |4 # 5 # | 6.cd| , , 9 | 10 |
1 | 2 | 3 |4 # 5 # | |7.dvd, , | 10 |
END123
Some other data starts from here
and I tried:
import re
from io import StringIO
import pandas as pd

txt = open('real_dat.txt').read().split('\nEND')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())
df = pd.read_csv(
    StringIO(txt),
    names=names, skiprows=1, index_col=False,
    engine='python')
and got this output!
Updated Answer
It was just easier to get rid of spaces... Let me know if this works
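from io import StringIO
import pandas as pd

# Keep the text before END, remove spaces and any pipe directly before a newline, then drop the first line
txt = open('b.txt').read().split('\nEND')[0] \
    .replace(' ', '').replace('|\n', '\n').split('\n', 1)[1]
pd.read_csv(
    StringIO(txt),
    sep=r'#\||\||#|,',
    engine='python')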
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
0 1 2 3 4 5 6.cd 7.dvd NaN NaN 10
1 1 2 3 4 5 6.cd NaN NaN 9.0 10
2 1 2 3 4 5 NaN 7.dvd NaN NaN 10
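One detail worth checking on your own file: if it looks exactly like the sample in the question, the last data row is no longer followed by a newline once the text is cut at '\nEND', so .replace('|\n', '\n') leaves that row's trailing pipe in place. Whether that matters depends on how the parser treats a row with one extra empty field. Below is a minimal stdlib-only sketch that strips the trailing pipe line by line instead; `sample`, `SEP` and `normalize` are names made up for illustration, and the sample text is copied from the question, not from the real file.

```python
import re

# Sample copied from the question; the real file may differ.
sample = (
    "skip rows\n"
    "A1| A2| A3|A4# A5# | A6 | A7 , A8, A9 | A10 |\n"
    "1 | 2 | 3 |4 # 5 # | 6.cd|7.dvd, , | 10 |\n"
    "1 | 2 | 3 |4 # 5 # | 6.cd| , , 9 | 10 |\n"
    "1 | 2 | 3 |4 # 5 # | |7.dvd, , | 10 |\n"
    "END123\n"
    "Some other data starts from here\n"
)

SEP = re.compile(r'#\||\||#|,')  # same separator as in the read_csv call


def normalize(text):
    """Return header and data lines with spaces and one trailing pipe removed."""
    body = text.split('\nEND')[0]
    lines = body.split('\n')[1:]  # drop the first ("skip rows") line
    cleaned = []
    for line in lines:
        line = line.replace(' ', '')
        if line.endswith('|'):
            line = line[:-1]  # strip exactly one trailing pipe
        cleaned.append(line)
    return cleaned


# What the chained .replace() calls leave behind for the last row
chained = sample.split('\nEND')[0].replace(' ', '').replace('|\n', '\n')
last_chained = chained.split('\n')[-1]

rows = [SEP.split(line) for line in normalize(sample)]
header, data = rows[0], rows[1:]

print(len(SEP.split(last_chained)))  # 11 fields: one more than the header
print([len(r) for r in rows])        # [10, 10, 10, 10]
```

If the field counts come out uneven for your file, '\n'.join(normalize(...)) can be wrapped in io.StringIO and passed to pd.read_csv with the same sep and engine='python'.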
Old Answer
I used \W+ as a fast and easy way to parse what you showed. Below I use something more specific to the actual delimiters you need.
from io import StringIO
import pandas as pd

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
    StringIO(txt),
    sep=r'[\|, ,#,\,]+',
    skiprows=1, index_col=False, engine='python')
A1 A2 A3 A4 A5 A6 A7 A8 A9
0 1 2 3 4 5.cd 6 7 8.dvd 9
1 1 2 3 4 5.cd 6 7 8.dvd 9
2 1 2 3 4 5.cd 6 7 8.dvd 9
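To see why \W+ broke the values while the explicit character class does not, the two patterns can be compared with plain re.split, without pandas. The dot in 5.cd is a non-word character, so \W+ treats it as a delimiter. This is only a sketch on one header line and one row from b.txt; the variable names are illustrative.

```python
import re

header = "A1| A2 |A3 |A4# A5# A6 A7| A8 , A9"
row = "1,2,3,4,5.cd,6,7,8.dvd,9"

loose = r'\W+'             # any run of non-word characters, including '.'
strict = r'[\|, ,#,\,]+'   # only pipe, comma, space and hash

names = re.split(strict, header.strip())
loose_fields = re.split(loose, row)
strict_fields = re.split(strict, row)

print(len(names))      # 9
print(loose_fields)    # ['1', '2', '3', '4', '5', 'cd', '6', '7', '8', 'dvd', '9']
print(strict_fields)   # ['1', '2', '3', '4', '5.cd', '6', '7', '8.dvd', '9']
```

With \W+ each row has 11 fields for 9 column names, which is consistent with the question's output, where cd lands in its own column and the last values are missing.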
However, I still think the following is a cleaner way to do it. Here, I separate the parsing of the header from the parsing of the rest of the data. That way, I can assume the data uses only , as a separator.
import re
from io import StringIO
import pandas as pd

txt = open('b.txt').read().split('\nEND')[0]
_, h, txt = txt.split('\n', 2)  # skipped line, header line, remaining data
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())
pd.read_csv(
    StringIO(txt),
    names=names, header=None,
    engine='python')
A1 A2 A3 A4 A5 A6 A7 A8 A9
0 1 2 3 4 5.cd 6 7 8.dvd 9
1 1 2 3 4 5.cd 6 7 8.dvd 9
2 1 2 3 4 5.cd 6 7 8.dvd 9
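The same idea (split the header with a regex, read the data as ordinary comma-separated text) can also be sketched with only the standard library, which makes it easy to check the header/data split before handing anything to pandas. This is an illustration under the assumption that the file matches the b.txt sample; `raw` and `records` are hypothetical names.

```python
import csv
import io
import re

# Contents of b.txt as shown in the question
raw = (
    "skip1\n"
    "A1| A2 |A3 |A4# A5# A6 A7| A8 , A9\n"
    "1,2,3,4,5.cd,6,7,8.dvd,9\n"
    "1,2,3,4,5.cd,6,7,8.dvd,9\n"
    "1,2,3,4,5.cd,6,7,8.dvd,9\n"
    "END123\n"
    "Some other data starts from here\n"
)

body = raw.split('\nEND')[0]                  # everything before the END marker
_, header_line, data = body.split('\n', 2)    # skipped line, header, data rows

# The header uses mixed delimiters; the data rows use commas only
names = re.split(r'[\|, ,#,\,]+', header_line.strip())
records = [dict(zip(names, fields)) for fields in csv.reader(io.StringIO(data))]

print(records[0]['A5'], records[0]['A8'], records[0]['A9'])  # 5.cd 8.dvd 9
```

All values stay strings here; pandas would additionally infer numeric types for the columns that contain only numbers.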
Source: https://stackoverflow.com/questions/45699077/multiple-delimiters-in-column-headers-also-separates-the-row-values