Question
I have a huge flat file: 100K records, each spanning 3000 columns. I need to remove a segment of the data from position 300 to position 500 before archiving. This is a sensitive part of the data that needs to be wiped before I can archive. I am looking for an awk or sed or any similar command that can do the trick for me.
Sample file
003133780 MORNING GLORY DR SOUTHAMPTON PA18966780 MORNING GLORY DR
0054381303 MADISON ST RADFORD VA241411303 MADISON ST
00586728 CONESTOGA COURT CHADDS FORD PA1931728 CONESTOGA COURT
1852921800 SAMER RD MILAN MI481601800 SAMER RD
192717175 EVERGREEN CIRCLE HENDERSONVILLE TN37075175 EVERGREEN CIRCLE
213673217 EAST BRANCH LONGVIEW TX75604217 EAST BRANCH
2490423205 NOTTAGE LANE FALLS CHURCH VA220423205 NOTTAGE LANE
249357344 BALOGH PLACE LONGWOOD FL32750344 BALOGH PLACE
2502811224 WILFORD HOLLOW ROAD VINTON VA241791224 WILFORD HOLLOW ROAD
277634210 AMANDA CT WHITEHOUSE TX7579119726 COPPER OAKS DRIVE
282482507 B ST. CHESAPEAKE VA23324507 B ST.
Expected output
003133780 MORNING GLORY DR SOUTHAMPTON PA780 MORNING GLORY DR
0054381303 MADISON ST RADFORD VA1303 MADISON ST
00586728 CONESTOGA COURT CHADDS FORD PA28 CONESTOGA COURT
1852921800 SAMER RD MILAN MI1800 SAMER RD
192717175 EVERGREEN CIRCLE HENDERSONVILLE TN175 EVERGREEN CIRCLE
213673217 EAST BRANCH LONGVIEW TX217 EAST BRANCH
2490423205 NOTTAGE LANE FALLS CHURCH VA3205 NOTTAGE LANE
249357344 BALOGH PLACE LONGWOOD FL344 BALOGH PLACE
2502811224 WILFORD HOLLOW ROAD VINTON VA1224 WILFORD HOLLOW ROAD
277634210 AMANDA CT WHITEHOUSE TX19726 COPPER OAKS DRIVE
282482507 B ST. CHESAPEAKE VA507 B ST.
Here I removed the characters between positions 89 and 95. One small change: I also need to write the changed content back to the same file.
Below is the script I have so far. I am looping through all the files, splitting each into files of at most 20,000 rows, and then removing the characters from position X to position Y before archiving.
for currentfilename in *.[tT][xX][tT]
do
echo $currentfilename
tempfilename=${currentfilename%%.*}
awk -v A="$tempfilename" '{filename = A "Part" int((NR-1)/20000) ".txt"; print >> filename}' $currentfilename
awk '{print substr($0,1,522) substr($0,953) >> filename}' $currentfilename
mv $currentfilename $APP_ROOT/Archive
done
Answer 1:
Assuming "position" means "character":
awk '{print substr($0,1,299) substr($0,501)}' file
If it doesn't then edit your question to add some REPRESENTATIVE sample input and expected output (e.g. 5 lines of 6 columns each, not thousands of lines of thousands of columns).
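Since the question also asks to write the result back to the same file, and awk has no in-place flag in POSIX, one common pattern is to write to a temporary file and move it over the original. A small sketch (the file name and the positions are shrunk-down placeholders, not the real 300–500 range):

```shell
# Build a one-line sample: 10 chars to keep, 5 to wipe, 5 to keep.
printf 'AAAAAAAAAABBBBBCCCCC\n' > file.txt

# Remove characters at positions 11-15 (1-based, inclusive):
# keep chars 1..10, then resume printing at char 16.
awk '{print substr($0,1,10) substr($0,16)}' file.txt > file.txt.tmp &&
mv file.txt.tmp file.txt

cat file.txt   # AAAAAAAAAACCCCC
```

The `&&` ensures the original is only replaced if awk succeeded, so a failure leaves the source file intact.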
Answer 2:
Assuming that position means column, you can use cut to select the columns you want.
cut -f 1-299,501-3000 CutMe.txt
If your data is delimited by commas instead of tabs, then use -d.
cut -d, -f 1-299,501-3000 CutMe.txt
If position means character, you can do the same with cut -c.
cut -c 1-299,501-3000 CutMe.txt
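Like awk, cut has no in-place option, so the archiving step would again go through a temporary file. A quick check of the character form on a toy input (positions shrunk for the demo; CutMe.txt is a placeholder name):

```shell
printf '1234567890\n' > CutMe.txt

# Keep characters 1-3 and 7-10, dropping positions 4-6.
cut -c 1-3,7-10 CutMe.txt > CutMe.txt.tmp &&
mv CutMe.txt.tmp CutMe.txt

cat CutMe.txt   # 1237890
```

Note that with `cut -c` you can also write an open-ended range like `1-299,501-`, which avoids hard-coding the 3000-column width.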
Answer 3:
Using sed:
sed -r -i.bak 's/(.{299}).{201}/\1/' file
The -r option enables extended regex; if you need the command to be portable, you can drop that option and escape the parentheses and braces instead. The -i option makes the changes in place. I have added the .bak extension to safeguard against any mess-up; you can remove it if you don't need to keep a backup of the original.
For the solution, we capture the first 299 characters in a capture group and match the next 201 characters that need to be removed (positions 300 through 500, inclusive). We then substitute the entire match with just the captured group.
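The capture-group approach is easy to sanity-check on a small scale before running it on the 100K-record file. A demo with shrunk counts (GNU sed assumed for -r; demo.txt is a placeholder name):

```shell
printf 'AAAAAAAAAABBBBBCCCCC\n' > demo.txt

# Capture the first 10 chars, match (and discard) the next 5.
sed -r -i.bak 's/(.{10}).{5}/\1/' demo.txt

cat demo.txt       # AAAAAAAAAACCCCC  (edited in place)
cat demo.txt.bak   # AAAAAAAAAABBBBBCCCCC  (original preserved)
```

On BSD/macOS sed, use -E instead of -r; the -i.bak backup syntax works on both.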
Source: https://stackoverflow.com/questions/25293481/i-need-to-delete-string-from-position-x-to-position-y-on-each-line-in-a-text-fil