Converting mixed-format .DAT to .CSV (or anything else)

后端 未结 3 384
无人及你
无人及你 2021-01-21 16:14

I have a large collection of DAT files that need to be converted (eventually to a unique file type). The DAT\'s have a mixed amount of whitespace between fields, and the column

相关标签:
3条回答
  • 2021-01-21 16:46

    This is answer for "Python - download and convert .dat to .csv [duplicate]". I couldn't post there so FYI you can get you exact output from here.

    import urllib2
    import csv
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
    response = urllib2.urlopen(url)
    readData = response.read()
    strObj = filter(None,readData.splitlines())
    
    strObj = [w.replace('\t', '  ') for w in strObj]
    
    listB = []
    for i in strObj:
        listB.append(filter(None,i.split("  ")))
    with open(r'c:/data2.csv','a') as f:
        writer = csv.writer(f)
        writer.writerows(listB)
    
    0 讨论(0)
  • 2021-01-21 16:54

    It looks like you can combine the header rows dynamically based on a word's position in the line. You can skip the first two lines, and combine the next two. If you do it right, you will be left with an iterator over a file stream that you can use to process the remainder of the data as you wish. You can convert it to a different format, or even import it into a pandas DataFrame directly.

    To get the headers:

    import re
    
    def get_words_and_positions(line):
        return [(match.start(), match.group()) in re.finditer(r'[\w.]+', line)]
    
    with open('file.dat') as file:
        iterator = iter(file)
        # Skip two lines
        next(iterator)
        next(iterator)
        # Get two header lines
        header = get_words_and_positions(next(iterator)) + \
                 get_words_and_positions(next(iterator))
        # Sort by positon
        header.sort()
        # Extract words
        header = [word for pos, word in header]
    

    You can now convert the file to a true CSV, or do something else with it. The important thing here is that you have iterator pointing to the actual data in the file stream, and a bunch of dynamically loaded column headers.

    To write the remainder to a CSV file, without having to load the entire thing into memory at once, use csv.writer and the iterator from above:

     import csv
     ...
     with ...:
     ...
        with open('outfile.csv', 'w') as output:
            writer = csv.writer(output)
            writer.writerow(header)
            for line in iterator:
                writer.writerow(re.split(r'\s+', line))
    

    You can combine the nested output with and the outer input with into a single outer block to reduce the nesting levels:

    with open('file.dat') as file, open('outputfile.csv', 'w') as output:
        ....
    

    To read in a pandas DataFrame, you can just pass the file object to pandas.read_csv. Since the file stream is past the headers at this point, it will not give you any issues:

    import pandas as pd
    ...
    with ...:
        ...
        df = pd.read_csv(file, sep=r'\s'+, header=None, names=header)
    
    0 讨论(0)
  • 2021-01-21 16:55

    Treat those header lines in the input file with all the disdain they deserve. (Or, in other words, read them and discard them.)

    headers='Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith'
    with open ( 'temp.dat') as input_file:
        with open ('temp_2.csv', 'w') as output_file:
            output_file.write('"%s"\n'%'","'.join(headers.split()))
            for count, line in enumerate(input_file):
                if count<4: continue
                outLine = ','.join(line.split())
                output_file.write(outLine + '\n')
    
    0 讨论(0)
提交回复
热议问题