Read specific columns in csv using python

耶瑟儿~ 2021-02-06 06:27

I have a CSV file that looks like this:

+-----+-----+-----+-----+-----+-----+-----+-----+
| AAA | bbb | ccc | DDD | eee | FFF | GGG | hhh |
+-----+-----+-----+-----+---         


        
7 answers
  • 2021-02-06 06:52

    If your files and requirements are relatively simple and fixed, then once you know the desired columns, I would use split() to divide each data line into a list of column entries:

    alist = aline.split('|')
    

    I would then use the desired column indices to get the column entries from the list, process each with strip() to remove the whitespace, convert it to the desired format (it looks like your data has integer values), and create the tuples.

    As I said, I am assuming that your requirements are relatively fixed. The more complicated they are, or the more likely they are to change, the more it will be worth your time to pick up a library made for manipulating this type of data.
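
    The split-and-strip steps described above could look roughly like the sketch below. The helper name parse_line, the sample lines, and the column positions are assumptions made for illustration, since the table in the question is truncated:

```python
def parse_line(line, indices, sep="|"):
    """Split one table line and return the chosen columns as a tuple of ints."""
    # Strip the outer separators first so they do not produce empty entries.
    cells = [cell.strip() for cell in line.strip().strip(sep).split(sep)]
    return tuple(int(cells[i]) for i in indices)

# Hypothetical data lines in the pipe-delimited layout shown in the question.
sample_lines = [
    "|  1  |  2  |  3  |  4  | 50  |  3  | 20  |  4  |",
    "|  2  |  1  |  3  |  5  | 24  |  2  | 23  |  5  |",
]
wanted = [0, 3, 5, 6]  # assumed positions of AAA, DDD, FFF, GGG
rows = [parse_line(line, wanted) for line in sample_lines]
print(rows)
```

    Border lines such as +-----+ and the header line would need to be skipped before calling int(), which is one reason a dedicated library becomes attractive as the format grows more complicated.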

  • 2021-02-06 06:57

    I realize the answer has been accepted, but if you really want to read specific named columns from a csv file, you should use a DictReader (if you're not using Pandas that is).

    import csv
    from io import StringIO
    
    columns = 'AAA,DDD,FFF,GGG'.split(',')
    
    
    testdata ='''\
    AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
    1,2,3,4,50,3,20,4
    2,1,3,5,24,2,23,5
    4,1,3,6,34,1,22,5
    2,1,3,5,24,2,23,5
    2,1,3,5,24,2,23,5
    '''
    
    reader = csv.DictReader(StringIO(testdata))
    
    desired_cols = (tuple(row[col] for col in columns) for row in reader)
    

    Output:

    >>> list(desired_cols)
    [('1', '4', '3', '20'),
     ('2', '5', '2', '23'),
     ('4', '6', '1', '22'),
     ('2', '5', '2', '23'),
     ('2', '5', '2', '23')]
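
    DictReader returns every value as a string. If integer tuples are wanted, the conversion can happen in the same expression. This is a minimal sketch that assumes every selected cell holds a valid integer; the name int_rows is illustrative:

```python
import csv
from io import StringIO

columns = ["AAA", "DDD", "FFF", "GGG"]

# Shortened copy of the test data used above.
testdata = """\
AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
1,2,3,4,50,3,20,4
2,1,3,5,24,2,23,5
"""

reader = csv.DictReader(StringIO(testdata))
# int() raises ValueError on a non-numeric cell, so this assumes clean data.
int_rows = [tuple(int(row[col]) for col in columns) for row in reader]
print(int_rows)
```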
    
  • 2021-02-06 07:00

    The other answers are good, but I think it is better not to load all the data at once, because the CSV file could be huge. I suggest using a generator.

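    import csv
    
    def read_csv(f, cols):
        reader = csv.reader(f)
        for row in reader:
            # A line without commas is parsed as a single field;
            # split it on whitespace to get the columns.
            if len(row) == 1:
                columns = row[0].split()
                yield tuple(columns[c] for c in cols)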
    

    It can then be used in a for loop:

    with open('path/to/test.csv', newline='') as f:
        for bbb, ccc in read_csv(f, [1, 2]):
            print(bbb, ccc)
    

    Of course, you can enhance this function to accept column names instead of indices. To do so, combine Brad M's answer with mine.
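
    A name-based variant of the generator might look like the sketch below. It assumes a comma-delimited file whose first row holds the column names; the function name read_csv_by_name and the sample data are hypothetical:

```python
import csv
from io import StringIO

def read_csv_by_name(f, names):
    """Lazily yield one tuple per data row, restricted to the named columns."""
    reader = csv.reader(f)
    header = next(reader, None)  # assumed: first row holds the column names
    if header is None:
        return  # empty input: nothing to yield
    # header.index raises ValueError if a requested name is missing.
    indices = [header.index(name) for name in names]
    for row in reader:
        yield tuple(row[i] for i in indices)

# Hypothetical sample data standing in for a real file object.
sample = StringIO("AAA,bbb,ccc,DDD\n1,2,3,4\n2,1,3,5\n")
selected = list(read_csv_by_name(sample, ["bbb", "ccc"]))
print(selected)
```

    Because rows are produced one at a time, only the current row is held in memory, which keeps the approach usable for very large files.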

  • 2021-02-06 07:01
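    import csv
    from collections import namedtuple
    
    def read_csv(file, columns, type_name="Row"):
      try:
        row_type = namedtuple(type_name, columns)
      except ValueError:
        # Column names are not valid field names: fall back to plain tuples
        row_type = lambda *values: values
      rows = iter(csv.reader(file))
      header = next(rows)  # the first row holds the column names
      mapping = [header.index(x) for x in columns]
      for row in rows:
        row = row_type(*[row[i] for i in mapping])
        yield row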
    

    Example:

    >>> import csv
    >>> from collections import namedtuple
    >>> from io import StringIO
    >>> def read_csv(file, columns, type_name="Row"):
    ...   try:
    ...     row_type = namedtuple(type_name, columns)
    ...   except ValueError:
    ...     row_type = lambda *values: values
    ...   rows = iter(csv.reader(file))
    ...   header = next(rows)
    ...   mapping = [header.index(x) for x in columns]
    ...   for row in rows:
    ...     row = row_type(*[row[i] for i in mapping])
    ...     yield row
    ... 
    >>> testdata = """\
    ... AAA,bbb,ccc,DDD,eee,FFF,GGG,hhh
    ... 1,2,3,4,50,3,20,4
    ... 2,1,3,5,24,2,23,5
    ... 4,1,3,6,34,1,22,5
    ... 2,1,3,5,24,2,23,5
    ... 2,1,3,5,24,2,23,5
    ... """
    >>> testfile = StringIO(testdata)
    >>> for row in read_csv(testfile, "AAA GGG DDD".split()):
    ...   print(row)
    ... 
    Row(AAA='1', GGG='20', DDD='4')
    Row(AAA='2', GGG='23', DDD='5')
    Row(AAA='4', GGG='22', DDD='6')
    Row(AAA='2', GGG='23', DDD='5')
    Row(AAA='2', GGG='23', DDD='5')
    
  • 2021-02-06 07:12
    import csv
    
    DESIRED_COLUMNS = ('AAA', 'DDD', 'FFF', 'GGG')
    
    headers = None
    results = []
    with open("myfile.csv", newline="") as f:
        reader = csv.reader(f)
        for row in reader:
            if headers is None:
                # First row: store the indices of the columns of interest
                headers = [i for i, col in enumerate(row) if col in DESIRED_COLUMNS]
            else:
                results.append(tuple(row[i] for i in headers))
    
    print(results)
    
  • 2021-02-06 07:12

    Context: For this type of work you should use the amazing Python petl library. It will save you a lot of work and potential frustration compared with doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.). That is fine, but if you plan to work with a lot of data from various strange sources in your career, learning something like petl is one of the best investments you can make. Getting started should take only about 30 minutes once you've run pip install petl. The documentation is excellent.

    Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.

    from petl import fromcsv, look, cut, tocsv
    
    # Load the table
    table1 = fromcsv('table1.csv')
    # Keep only the desired columns
    table2 = cut(table1, 'Song_Name', 'Artist_ID')
    # Have a quick look to make sure things are OK; prints a nicely formatted table to your console
    print(look(table2))
    # Save to a new file
    tocsv(table2, 'new.csv')
    