问题

sample fileI receive large CSV files delimited with (comma or | or ^) with millions of records.
Some of the fields have non-printable character like CR|LF which translated as end of field. This is in windows10.

I need to write python to go thru the file and remove CR|LF in the fields. However, I cant remove all because then lines will be merged.

I have gone thru several postings on here on how to remove non-printable. My thought to write a panda dataframe, then check every field for CR|LF and remove it. It seems a bit complicated. If you have a quick code how to do this, it will be great help.

Thanks in advance.

Sample file:

record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF

Sample output file should be:

record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF

CR = carriage Return = x0d LF = Line Feed = x0a

回答1:

EDIT -- Wed Sep 18, 23.46 UTC+2

NOTE:

record1|111. texta|textb|111CR|LF
record2|111. teCR|LF 
xta|text|111CR|LF
record3|111. texta|textb|111CR|LF

This is the file that we are going to analyze

Due to the fact that we have a csv file, we can be quite sure that the datatype will be consistent among the rows for a given columns.

Due to this assumption, we can regexp the delimiter in scope (CL|RF) with a regexp (\|\d+CR\|LF).

If the regexp is not matched, we can remove the carriage return, cause is not the end of the line.

import pandas as pd
from io import StringIO
import re

# Verify that the pattern `|ARBITRARY NUMBER + CR|LF`
pattern = re.compile("\|\d+CR\|LF")
# Open the file and read the content
with open("a.txt") as f:
    data = f.readlines()
not_parsed = data.copy()
_max = len(data)
i = 0
parsed_data = []
# Iterate the data
while i < _max:
    # Remove unnecessary new line
    line = data[i].strip()
    # If the pattern does not match, we need to strip the carriage return
    if not pattern.search(line) and i + 1 < _max:
        line = line.replace("CR|LF", "").strip()
        line = line + data[i + 1].strip()
        i += 1
    line = line
    if line != "":
        parsed_data.append(line)
    i += 1
# Comment
data = [line.replace("CR|LF", "") for line in parsed_data]
# Load the csv using pandas

print("DATA BEFORE -> {}".format("".join(not_parsed)))
print("DATA NOW -> {}".format("\n".join(data)))
DATA = StringIO("\n".join(data))
df = pd.read_csv(DATA, delimiter="|")

How does this remove the unwanted CL|RF, but leave the wanted ones?

The file will not be modified, instead it will be saved as a list of single ('line by line'). Then we are going to replace the carriage return only when the regexp not match and load as a dataframe

NOTE:
Tested on Linux that use \n as new line

回答2:

Run this script (e.g. name it fix_csv.py) on your file to sanitize it:

#!/usr/bin/env python3

import sys
import os

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields
nf = 3
# set the delimiter
delim = ','

inpf = sys.argv[1]
outf = sys.argv[2]

newline = os.linesep

with open(inpf, 'r') as inf, open(outf, 'w') as of:
    cache = []
    for line in inf:
        line = line.strip()
        ls = line.split(delim)
        if len(ls) < nf or cache:
            if not cache:
                cache = cache + ls
            elif cache:
                cache[-1] += ls[0]
                cache = cache + ls[1:]
            if len(cache) == nf:
                of.write(f'{delim}'.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)

Call it like

./fix_csv input.dat output.dat

Output:

record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb

来源：https://stackoverflow.com/questions/57997338/remove-unwanted-non-printable-characters-from-large-csv-files-with-millions-of-r

标签

python

pandas

csv

non-printing-characters

Remove unwanted non-printable characters from large CSV files with millions of records -in Python 3 or 2.7

问题

回答1:

EDIT -- Wed Sep 18, 23.46 UTC+2

回答2: