Use Python requests to download CSV

再見小時候 2020-11-29 18:34

Here is my code:

import csv
import requests
with requests.Session() as s:
    s.post(url, data=payload)
    download = s.get(\'url that directly download a          


        
8 Answers
  • 2020-11-29 19:02

    From a little searching, I understand that the file should be opened in universal newline mode, which (I guess) you cannot do directly with the response content.

    To finish the task, you can either save the downloaded content to a temporary file, or process it in memory.

    Save as file:

    import requests
    import csv
    import os
    
    temp_file_name = 'temp_csv.csv'
    url = 'http://url.to/file.csv'
    download = requests.get(url)
    
    # download.content is bytes, so write it in binary mode
    with open(temp_file_name, 'wb') as temp_file:
        temp_file.write(download.content)
    
    # newline='' lets the csv module handle line endings itself
    # (the old 'rU' mode was removed in Python 3.11)
    with open(temp_file_name, 'r', newline='') as temp_file:
        csv_reader = csv.reader(temp_file, dialect=csv.excel_tab)
        for line in csv_reader:
            print(line)
    
    # delete the temp file after processing
    os.remove(temp_file_name)
    

    In memory:

    (To be updated)
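A minimal in-memory sketch, not from the original answer: it assumes the response body has already been decoded to text (for example with download.content.decode('utf-8')) and wraps that text in io.StringIO with newline='' so the csv module sees the original line endings. The helper name rows_from_text and the sample data are made up for illustration.

```python
import csv
import io


def rows_from_text(text, dialect=csv.excel_tab):
    """Parse CSV text held in memory; no temporary file is needed."""
    # newline='' keeps \r, \n and \r\n untranslated so that csv.reader
    # can deal with the line endings itself.
    buffer = io.StringIO(text, newline='')
    return list(csv.reader(buffer, dialect=dialect))


# Hypothetical tab-separated sample mixing all three line-ending styles.
sample = 'name\tcity\r\nAnn\tOslo\rBob\tRome\n'
rows = rows_from_text(sample)
for row in rows:
    print(row)
```

With a real download, the text passed in would come from the response instead of the sample string.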

  • 2020-11-29 19:05

    To simplify these answers and reduce memory use when downloading a large file, the following may work a bit more efficiently.

    import requests
    from contextlib import closing
    import csv
    
    url = "http://download-and-process-csv-efficiently/python.csv"
    
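    with closing(requests.get(url, stream=True)) as r:
        # iter_lines() yields bytes; csv.reader needs str on Python 3 (UTF-8 assumed)
        lines = (line.decode('utf-8') for line in r.iter_lines())
        reader = csv.reader(lines, delimiter=',', quotechar='"')
        for row in reader:
            print(row)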
    

    By setting stream=True in the GET request, the lines produced by r.iter_lines() reach csv.reader() through a generator rather than a fully built list. This lets csv.reader() iterate lazily over each line in the response with for row in reader.

    This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
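To see the lazy behaviour without touching the network, here is a small sketch. fake_iter_lines is a hypothetical stand-in for r.iter_lines() that records every line it hands out, so you can check that csv.reader only pulls what it needs.

```python
import csv


def fake_iter_lines(pulled):
    """Stand-in for r.iter_lines(): yields byte lines and logs each pull."""
    sample = [
        b'street,city',
        b'3526 HIGH ST,SACRAMENTO',
        b'51 OMAHA CT,SACRAMENTO',
    ]
    for number, raw in enumerate(sample):
        pulled.append(number)
        yield raw


pulled = []
lines = (raw.decode('utf-8') for raw in fake_iter_lines(pulled))
reader = csv.reader(lines)

header = next(reader)               # pulls only the first line
pulled_after_header = list(pulled)  # snapshot of what was consumed so far
remaining = list(reader)            # the other lines are pulled on demand
```

After the first next(reader), only one line has been taken from the source; the rest are read as the loop asks for them.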

  • 2020-11-29 19:10

    I like the answers from The Aelfinn and aheld. I can improve on them only by shortening them a bit more, removing superfluous pieces, using a real data source, making the code 2.x- and 3.x-compatible, and keeping the high level of memory efficiency seen elsewhere:

    import csv
    import requests
    
    CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
    
    with requests.get(CSV_URL, stream=True) as r:
        lines = (line.decode('utf-8') for line in r.iter_lines())
        for row in csv.reader(lines):
            print(row)
    

    Too bad 3.x is less flexible CSV-wise: the iterator must emit Unicode strings, while requests yields bytes. The 2.x-only version, for row in csv.reader(r.iter_lines()):, is more Pythonic (shorter and easier to read). Anyhow, note that the 2.x/3.x solution above won't handle the situation described by the OP, where a NEWLINE is found unquoted in the data read.
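A related limitation that is easy to check locally: iterators that strip line terminators lose any newline embedded inside a quoted field. The sketch below uses str.splitlines() as a stand-in for r.iter_lines() (an assumption for the demo: both hand back lines without their terminators) and compares it with splitlines(keepends=True). The sample text is made up.

```python
import csv

# Hypothetical sample: the second field of the data row contains a newline.
text = 'id,note\r\n1,"first line\nsecond line"\r\n'

# Terminators stripped: the newline inside the quoted field is gone.
stripped = list(csv.reader(text.splitlines()))

# Terminators kept: csv.reader can rebuild the multi-line field.
kept = list(csv.reader(text.splitlines(keepends=True)))

print(stripped[1])
print(kept[1])
```

If your data can contain quoted multi-line fields, feeding csv.reader lines that still carry their terminators is the safer choice.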

    For the part of the OP's question regarding downloading (vs. processing) the actual CSV file, here's another script that does that, 2.x & 3.x-compatible, minimal, readable, and memory-efficient:

    import os
    import requests
    
    CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
    
    with open(os.path.split(CSV_URL)[1], 'wb') as f, \
            requests.get(CSV_URL, stream=True) as r:
        # iter_lines() strips line endings, so copy raw chunks instead
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
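A saved file should be byte-for-byte identical to the response body, line endings included. Here is a rough sketch of a chunk writer you can test offline; save_chunks and fake_chunks are hypothetical names, and with a real response you would pass something like r.iter_content(chunk_size=8192) instead.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Hypothetical chunks standing in for a streamed response body.
fake_chunks = [b'street,city\r\n3526 HIGH', b' ST,SACRAMENTO\r\n', b'']
target = os.path.join(tempfile.mkdtemp(), 'sample.csv')
total = save_chunks(fake_chunks, target)

with open(target, 'rb') as f:
    saved = f.read()
```

Because the chunks are written as-is, a row split across two chunks still ends up intact in the file.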
    
  • 2020-11-29 19:13

    This should help:

    import csv
    import requests
    
    CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
    
    
    with requests.Session() as s:
        download = s.get(CSV_URL)
    
        decoded_content = download.content.decode('utf-8')
    
        cr = csv.reader(decoded_content.splitlines(), delimiter=',')
        my_list = list(cr)
        for row in my_list:
            print(row)
    

    Output sample:

    ['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type', 'sale_date', 'price', 'latitude', 'longitude']
    ['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879']
    ['51 OMAHA CT', 'SACRAMENTO', '95823', 'CA', '3', '1', '1167', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68212', '38.478902', '-121.431028']
    ['2796 BRANCH ST', 'SACRAMENTO', '95815', 'CA', '2', '1', '796', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68880', '38.618305', '-121.443839']
    ['2805 JANETTE WAY', 'SACRAMENTO', '95815', 'CA', '2', '1', '852', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '69307', '38.616835', '-121.439146']
    [...]
    

    Related question with answer: https://stackoverflow.com/a/33079644/295246


    Edit: Other answers are useful if you need to download large files (i.e. stream=True).

  • 2020-11-29 19:15

    If the file is very large, you can update the accepted answer to use the iter_lines method of requests:

    import csv
    import requests
    
    CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
    
    with requests.Session() as s:
        # stream=True defers the body download so iter_lines() can read it lazily
        download = s.get(CSV_URL, stream=True)
    
        # iter_lines() yields bytes by default; decode each line once
        line_iterator = (x.decode('utf-8') for x in download.iter_lines())
    
        cr = csv.reader(line_iterator, delimiter=',')
        for row in cr:
            print(row)
    
  • 2020-11-29 19:16

    The following approach worked well for me. I also did not need to use the csv.reader() or csv.writer() functions, which I feel makes the code cleaner. The code is compatible with Python 2 and Python 3.

    from six.moves import urllib
    
    DOWNLOAD_URL = "https://raw.githubusercontent.com/gjreda/gregreda.com/master/content/notebooks/data/city-of-chicago-salaries.csv"
    # raw string keeps the backslash literal; the datasets folder must already exist
    DOWNLOAD_PATH = r"datasets\city-of-chicago-salaries.csv"
    urllib.request.urlretrieve(DOWNLOAD_URL, DOWNLOAD_PATH)
    

    Note: six is a package that helps in writing code that is compatible with both Python 2 and Python 3. For additional details regarding six, see the question "What does from six.moves import urllib do in Python?"
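If you only need Python 3, the standard library's urllib.request.urlretrieve can be used without six. The sketch below is one possible variant, not the answer's own code: download_to is a hypothetical helper that also creates the target folder, and the demo uses a local file:// URL so it runs without network access.

```python
import os
import pathlib
import tempfile
from urllib.request import urlretrieve


def download_to(url, dest_path):
    """Fetch url into dest_path, creating the parent folder if needed."""
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    saved_path, _headers = urlretrieve(url, dest_path)
    return saved_path


# Local demonstration: a temporary file served through a file:// URL.
work_dir = tempfile.mkdtemp()
source = pathlib.Path(work_dir) / 'source.csv'
source.write_bytes(b'name,salary\r\nAnn,100\r\n')

dest = os.path.join(work_dir, 'datasets', 'copy.csv')
result = download_to(source.as_uri(), dest)
```

For a real download, pass the http(s) URL of the CSV and the path you want to save it to.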
