Load CSV to .mdb using pyodbc and pandas

Posted by a 夏天 on 2020-08-26 04:58:07

Question


Background story: I work in finance (not a developer, so help is very appreciated). My department currently relies heavily on Excel and VBA to automate as many of our tasks as possible. The company just validated a Python distribution and we're now allowed to use it, so I thought I'd give it a try.

Challenge: My first challenge was to load a CSV file into an MS Access database (because not all of us are tech-savvy enough to work purely with dev tools and databases, so I need to make things easy for everybody).

I pieced together bits of different people's code from around the internet. It works, but it turned out to be a Frankenstein.

What it's doing and why:

  1. Load the CSV into a variable
  2. Strip out the first rows (because the source file is not really a CSV; it has rubbish rows at the start of the file)
  3. Export to a CSV in the temp directory (because I could not figure out how to load pandas from a variable)
  4. Load the CSV into SQLite using pandas (because pandas can infer the data type of each column)
  5. Export the CREATE TABLE statement to a variable
  6. Create the table in the .mdb file using pyodbc
  7. Load data into the .mdb table row by row (it's very slow)

TL;DR:
The current code is a patchwork of different snippets; it's ugly and slow. What would you change to make it more efficient / to optimize it?

The goal is code that loads a CSV into a .mdb, ideally using the correct data type to create the table.
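As an aside, steps 1–4 above can usually collapse into a single `pandas.read_csv` call: `skiprows` drops the rubbish lines, so no temp file is needed, and `io.StringIO` shows that pandas can also read directly from an in-memory string. A minimal sketch with made-up data (the real file would use its path and `;` delimiter):

```python
import io
import pandas

# stand-in for a raw file with two rubbish lines before the real header
raw = "junk line 1\njunk line 2\nDate;Amount\n2020-01-02;10,5\n2020-01-03;7,1\n"

# skiprows=2 strips the rubbish; StringIO stands in for the real file path
df = pandas.read_csv(io.StringIO(raw), sep=';', skiprows=2)

print(list(df.columns))  # ['Date', 'Amount']
print(len(df))           # 2
```

With the real file you would call `pandas.read_csv(csv_path, sep=';', skiprows=skip_rows)` and skip the temp-file round trip entirely.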

import csv
import os
import pyodbc
import pandas
import sqlite3
import tempfile
import time


def load_csv_to_access(access_path, table_name, csv_path, skip_rows):

    # Open the CSV file, drop the first non-CSV rows, write the rest to a temp file
    with open(csv_path) as csv_file:
        lines = csv_file.readlines()[skip_rows:]  # skip the rubbish rows at the start
    temp_filename = time.strftime("%y%m%d%H%M%S") + '.csv'
    temp_filepath = os.path.join(tempfile.gettempdir(), temp_filename)
    with open(temp_filepath, 'w') as temp_file:
        temp_file.writelines(lines)  # create temp csv
    print("1: temp file created: " + temp_filepath)

    # Use pandas and SQLite to infer the data type of each CSV field
    df = pandas.read_csv(temp_filepath, delimiter=';', index_col=0, engine='python')
    df.columns = df.columns.str.replace(' ', '_')
    con = sqlite3.connect(':memory:')  # in-memory DB: only the CREATE TABLE DDL is needed
    df.to_sql(table_name, con, if_exists='replace')
    create_table_tuple = con.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table_name,)).fetchone()
    con.close()
    create_table_string = create_table_tuple[0]
    print("2: Data type inferred")

    # Connect to the Access DB and create the table
    access_string = ("DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
                     "DBQ=" + access_path + ";")
    print(access_string)
    con = pyodbc.connect(access_string)
    cur = con.cursor()
    cur.execute(create_table_string)
    con.commit()
    print("3: MS Access table created: " + table_name)

    print("4: Loading data rows:")
    with open(temp_filepath, 'r') as f:
        reader = csv.reader(f, delimiter=';')
        columns = next(reader)
        # Replace spaces with underscores so column names match the created table
        query = "INSERT INTO {} ({}) VALUES ({})".format(
            table_name,
            ','.join(columns).replace(' ', '_'),
            ','.join('?' * len(columns)))
        for index, data in enumerate(reader):
            cur.execute(query, data)  # insert row by row (slow)
            print(index)  # for debugging
        cur.commit()
    con.close()
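The row-by-row insert in step 7 is the main bottleneck: each `cur.execute` is a separate round trip. `cursor.executemany` takes the whole list of rows in one call. A sketch using the stdlib `sqlite3` module so it runs standalone (same `?` qmark placeholders; a pyodbc cursor exposes the same `executemany` method, and the table/column names here are made up):

```python
import sqlite3

# rows as they would come from csv.reader after the header row
rows = [["2020-01-02", "10.5"], ["2020-01-03", "7.1"]]

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE demo (TradeDate TEXT, Amount TEXT)")

# one executemany call replaces the per-row execute loop
query = "INSERT INTO demo (TradeDate, Amount) VALUES (?, ?)"
cur.executemany(query, rows)
con.commit()

print(cur.execute("SELECT COUNT(*) FROM demo").fetchone()[0])  # prints 2
```

With pyodbc against Access this is still one insert per row on the wire, but it removes most of the Python-level overhead of the loop.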

Thanks; you guys are much better at this than me, so I'd appreciate any suggestions.


Answer 1:


MS Access can directly query CSV files and run a make-table query to produce the resulting table. However, some cleaning is needed to remove the rubbish rows. The code below opens two files, one for reading and the other for writing. Assuming the rubbish is in the first column of the CSV, the if logic writes out any line that has data in its second column (adjust as needed):

import os
import csv
import pyodbc

# TEXT FILE CLEAN (raw strings avoid backslash-escape issues in Windows paths)
with open(r'C:\Path\To\Raw.csv', 'r') as reader, \
     open(r'C:\Path\To\Clean.csv', 'w', newline='') as writer:
    read_csv = csv.reader(reader)
    write_csv = csv.writer(writer)

    for line in read_csv:
        if len(line[1]) > 0:
            write_csv.writerow(line)

# DATABASE CONNECTION
access_path = r"C:\Path\To\Access\DB.mdb"
con = pyodbc.connect("DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={};"
                     .format(access_path))

# RUN QUERY (make-table query reading the cleaned CSV via the text driver)
strSQL = r"SELECT * INTO [TableName] FROM [text;HDR=Yes;FMT=Delimited(,);" \
         r"Database=C:\Path\To\Folder].Clean.csv;"
cur = con.cursor()
cur.execute(strSQL)
con.commit()

con.close()                            # CLOSE CONNECTION
os.remove(r'C:\Path\To\Clean.csv')     # DELETE CLEAN TEMP FILE

(Screenshots in the original answer: the raw CSV, the cleaned CSV, and the resulting MS Access table.)

Notice that Access can infer column types, such as the Date in the first column.



Source: https://stackoverflow.com/questions/41430458/load-csv-to-mdb-using-pyodbc-and-pandas
