Combine a folder of text files into a CSV with each file's content in a cell

感动是毒 2020-12-20 01:22

I have a folder containing several thousand .txt files. I'd like to combine them into one big .csv following this model: one row per file, with the file name in the first column and the file's entire content in the second.

I found an R script that was supposed to do this.

3 Answers
  • 2020-12-20 01:35

    The following Python script works for me (where path_of_directory is replaced by the path of the directory your files are in, and output_file.csv is the path of the file you want to create or overwrite):

    #! /usr/bin/python

    import csv
    import os

    dirpath = 'path_of_directory'
    output = 'output_file.csv'

    # newline='' stops the csv module from writing extra blank lines on Windows
    with open(output, 'w', newline='') as outfile:
        csvout = csv.writer(outfile)
        csvout.writerow(['FileName', 'Content'])

        for filename in os.listdir(dirpath):
            # one row per file: the file name and its full content
            with open(os.path.join(dirpath, filename)) as afile:
                csvout.writerow([filename, afile.read()])
    

    Note that this assumes everything in the directory is a file.
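
    If the directory might also contain subdirectories or other file types, a small variation (a sketch along the same lines, reusing the dirpath and output names above) keeps only regular files ending in .txt:

    import csv
    import os

    dirpath = 'path_of_directory'
    output = 'output_file.csv'

    with open(output, 'w', newline='') as outfile:
        csvout = csv.writer(outfile)
        csvout.writerow(['FileName', 'Content'])
        for filename in sorted(os.listdir(dirpath)):
            fullpath = os.path.join(dirpath, filename)
            # skip subdirectories and anything that is not a .txt file
            if os.path.isfile(fullpath) and filename.endswith('.txt'):
                with open(fullpath) as afile:
                    csvout.writerow([filename, afile.read()])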

  • 2020-12-20 01:44

    This can be written slightly more compactly using pathlib.

    >>> import csv
    >>> import os
    >>> from pathlib import Path
    >>> os.chdir('c:/scratch/folder to process')
    >>> with open('big.csv', 'w', newline='') as out_file:
    ...     csv_out = csv.writer(out_file)
    ...     csv_out.writerow(['FileName', 'Content'])
    ...     for fileName in Path('.').glob('*.txt'):
    ...         csv_out.writerow([str(fileName), open(str(fileName.absolute())).read().strip()])
    

    The Path objects yielded by this glob give access to both the full pathname and the bare filename, so there is no need for string concatenation.
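
    For instance, in the same session (the matched file name here is made up):

    >>> p = next(Path('.').glob('*.txt'))  # hypothetical first match
    >>> p.name
    'example.txt'
    >>> p.absolute()
    WindowsPath('c:/scratch/folder to process/example.txt')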

    EDIT: I've examined one of the text files and found that one of the characters that chokes processing looks like 'fi' but is actually those two letters combined into a single ligature character. Given the likely practical use this csv will be put to, I suggest the following processing, which ignores odd characters like that one. I also strip out line endings, because I suspect embedded newlines make downstream csv processing more complicated; that's a possible topic for another question.

    import csv
    from pathlib import Path

    with open('big.csv', 'w', encoding='Latin-1', newline='') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for fileName in Path('.').glob('*.txt'):
            lines = []
            # read raw bytes and decode line by line, stripping line endings
            with open(str(fileName.absolute()), 'rb') as one_text:
                for line in one_text.readlines():
                    lines.append(line.decode(encoding='Latin-1', errors='ignore').strip())
            # one row per file: the stripped lines joined back together with spaces
            csv_out.writerow([str(fileName), ' '.join(lines)])
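
    If you'd rather keep a readable 'fi' than drop characters, an alternative sketch (not from the original answer, and assuming the source files are UTF-8) is to apply Unicode compatibility normalization, which expands the single ligature character U+FB01 into the plain letters 'fi':

    import csv
    import unicodedata
    from pathlib import Path

    with open('big.csv', 'w', encoding='utf-8', newline='') as out_file:
        csv_out = csv.writer(out_file)
        csv_out.writerow(['FileName', 'Content'])
        for fileName in Path('.').glob('*.txt'):
            # assumption: source files are UTF-8; adjust the encoding if not
            text = fileName.read_text(encoding='utf-8', errors='ignore')
            # NFKC expands compatibility characters, e.g. the 'fi' ligature -> 'fi'
            text = unicodedata.normalize('NFKC', text)
            # collapse all whitespace (including newlines) into single spaces
            csv_out.writerow([str(fileName), ' '.join(text.split())])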
    
  • 2020-12-20 01:45

    If your txt files are not in table format, you might be better off using readLines(). This is one way to do it in base R:

    setwd("~/your/file/path/to/txt_files_dir")
    txt_files <- list.files(pattern = "\\.txt$")
    list_of_reads <- lapply(txt_files, readLines)
    # readLines() returns one element per line, so collapse each file
    # into a single string before building the data frame
    df_of_reads <- data.frame(file_name = txt_files,
                              contents  = sapply(list_of_reads, paste, collapse = " "),
                              stringsAsFactors = FALSE)
    write.csv(df_of_reads, "one_big_CSV.csv", row.names = FALSE)
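
    Collapsing with paste() keeps each file in a single cell even when files have different line counts; binding the raw readLines() vectors with rbind() would recycle values whenever those counts differ.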
    