Extracting columns containing a certain name

鱼传尺愫 2021-01-18 04:38

I'm trying to use it to manipulate data in large txt-files.

I have a txt-file with more than 2000 columns, and about a third of these have a title which contains th

3 Answers
  • 2021-01-18 05:03

    You can use the pandas DataFrame.filter method to select the columns whose names match a regular expression:

    data_filtered = data.filter(regex='net')
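
    As a small illustration of how filter behaves, here is a sketch on a made-up DataFrame (the column names and values are assumptions chosen for the example, not the asker's data). It also shows the like= argument, which does a plain substring match and may be enough when no regular expression is needed.

```python
import pandas as pd

# Hypothetical data; the column names are invented for this example
data = pd.DataFrame({'a': [0, 0], 'b': [0, 0], 'c_net': [1, 1],
                     'd': [0, 0], 'e_net': [1, 1]})

# regex= keeps the columns whose name matches the pattern anywhere
by_regex = data.filter(regex='net')

# like= keeps the columns whose name contains the plain substring
by_substring = data.filter(like='net')

print(list(by_regex.columns))
print(list(by_substring.columns))
```

    For a DataFrame, filter works on the column labels by default, so no axis argument is needed here.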
    
  • 2021-01-18 05:05

    One way to do this without installing third-party modules such as numpy or pandas is as follows. Given an input file called "input.csv" that looks like this:

    a,b,c_net,d,e_net

    0,0,1,0,1

    0,0,1,0,1

    (remove the blank lines in between, they are just for formatting the content in this post)

    The following code does what you want.

    import csv

    input_filename = 'input.csv'
    output_filename = 'output.csv'

    # newline='' lets the csv module handle line endings itself
    with open(input_filename, newline='') as infile, \
            open(output_filename, 'w', newline='') as outfile:
        # Instantiate a CSV reader; check that the delimiter matches your file
        reader = csv.reader(infile, delimiter=',')

        # Create a CSV writer to store the columns you want to keep
        writer = csv.writer(outfile, delimiter=',')

        # Get the first row (assuming this row contains the header);
        # reader.next() only exists in Python 2, next(reader) works in both
        input_header = next(reader)

        # Store the index of every column you want to keep
        columns_to_keep = [i for i, name in enumerate(input_header)
                           if 'net' in name]

        # Write the header of the output file
        writer.writerow([input_header[i] for i in columns_to_keep])

        # Iterate over the remainder of the input file, construct a row
        # with the columns you want to keep and write it to the output file
        for row in reader:
            writer.writerow([row[i] for i in columns_to_keep])
    

    Note that there is no error handling. At least two cases should be handled. The first is checking that the input file exists (hint: look at the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns.
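
    The two cases mentioned above could be handled along these lines. This is only a sketch under stated assumptions: the function name filter_columns, the keyword parameter, and the decision to skip malformed rows and count them (rather than stop with an error) are choices made for this example, not part of the answer.

```python
import csv
import os


def filter_columns(input_filename, output_filename, keyword='net'):
    """Copy the columns whose header contains `keyword`.

    Returns the number of rows that were skipped (hypothetical helper).
    """
    # Case 1: fail early with a clear error if the input file is missing
    if not os.path.isfile(input_filename):
        raise FileNotFoundError(input_filename)

    skipped = 0
    with open(input_filename, newline='') as infile, \
            open(output_filename, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        header = next(reader, None)
        if header is None:
            return skipped  # empty input file, nothing to copy

        keep = [i for i, name in enumerate(header) if keyword in name]
        writer.writerow([header[i] for i in keep])

        for row in reader:
            # Case 2: blank lines come back as [], short or long rows have
            # a different length than the header; skip both kinds
            if len(row) != len(header):
                skipped += 1
                continue
            writer.writerow([row[i] for i in keep])
    return skipped
```

    Calling filter_columns('input.csv', 'output.csv') then writes the filtered file and reports how many rows were dropped, which makes silent data loss easier to notice.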

  • 2021-01-18 05:13

    This could be done, for instance, with pandas:

    import pandas as pd
    
    df = pd.read_csv('path_to_file.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning
    print(df.columns)  # check that the columns are parsed correctly
    selected_columns = [col for col in df.columns if "net" in col]
    df_filtered = df[selected_columns]
    df_filtered.to_csv('new_file.txt', index=False)  # index=False leaves out the row index
    

    Of course, since we don't know the structure of your text file, you will have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).

    This loads the whole file into memory and then filters out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, you can load only specific columns with the usecols argument of read_csv.
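
    A minimal sketch of the usecols approach follows. It assumes a whitespace-separated file, and the in-memory sample and its column names are made up so that the example runs without the real file. usecols accepts a callable that is evaluated against each column name, so the "net" columns do not have to be listed by hand.

```python
import io

import pandas as pd

# Hypothetical whitespace-separated sample standing in for the real file
sample = io.StringIO(
    "a b c_net d e_net\n"
    "0 0 1 0 1\n"
    "0 0 1 0 1\n"
)

# Only the columns for which the callable returns True are parsed,
# so the unwanted columns are never stored in the DataFrame
df_filtered = pd.read_csv(sample, sep=r'\s+',
                          usecols=lambda name: 'net' in name)
print(list(df_filtered.columns))
```

    With a real file, the io.StringIO object would be replaced by the file path; the rest of the call stays the same.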
