How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file? Are there any ready Python libraries for the task?
Column headers would be sanitized into column names, and the datatype of each column would be estimated from the contents of the spreadsheet column.
Use the xlrd module; start here. [Disclaimer: I'm the author]. xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").
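For example, a minimal sketch of walking a sheet and reporting each cell's type (the file name and sheet index are placeholders):

    import xlrd

    # Open the workbook; datemode is needed to interpret date serial numbers.
    book = xlrd.open_workbook("data.xls")
    sheet = book.sheet_by_index(0)

    # Map xlrd's cell type codes to readable names.
    CTYPE_NAMES = {
        xlrd.XL_CELL_EMPTY: "empty",
        xlrd.XL_CELL_TEXT: "text",
        xlrd.XL_CELL_NUMBER: "number",
        xlrd.XL_CELL_DATE: "date",
        xlrd.XL_CELL_BOOLEAN: "boolean",
        xlrd.XL_CELL_ERROR: "error",
        xlrd.XL_CELL_BLANK: "blank",
    }

    for row_idx in range(sheet.nrows):
        for col_idx in range(sheet.ncols):
            cell = sheet.cell(row_idx, col_idx)
            kind = CTYPE_NAMES[cell.ctype]
            if cell.ctype == xlrd.XL_CELL_DATE:
                # Convert the Excel date serial to a (y, m, d, h, min, s) tuple.
                value = xlrd.xldate_as_tuple(cell.value, book.datemode)
            else:
                value = cell.value
            print(row_idx, col_idx, kind, value)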
The job of programming some code to wade through user-entered data and decide what DB datatype to use for each column is not something that can be easily automated. You should be able to eyeball the data, assign types like integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (it can look OK in the GUI). You need a strategy to handle cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace multiple whitespaces by a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode in UTF-8 to put it in your database.
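A small helper along these lines (the function name is just for illustration) keeps values in Unicode, collapses whitespace, and encodes to UTF-8 only at the database boundary:

    import re

    def normalize_text(value):
        """Normalize a text cell: strip edges, collapse internal whitespace.

        Works on Unicode throughout; encode to UTF-8 only when handing
        the value to the database driver.
        """
        return re.sub(r"\s+", " ", value.strip())

    # Example: encode only at the point of insertion, if the driver needs bytes.
    utf8_bytes = normalize_text("  Ex\u00e9cutive   Summary \t").encode("utf-8")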
As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.
If you do decide to write a type-inference/conversion tool, here are a couple of issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one then infer that a column holds dates as opposed to some other numerical data (population, for example)?
(2) For text fields, do you just make the columns of type varchar(n), where n is the length of the longest entry in that column, or do you make it an unbounded char field if one of the entries is longer than some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any decimal places?
Obviously, this doesn't mean that you won't be able to write one (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
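If you do attempt it, here is a very rough sketch of what such an inference pass over CSV columns could look like; the function name, thresholds, and the single date format it tries are arbitrary assumptions, and Excel date serials would still need the cell-format check described in the xlrd answer:

    from datetime import datetime
    from decimal import Decimal, InvalidOperation

    def infer_sql_type(values, varchar_limit=255):
        """Rough heuristic: guess a MySQL column type from a column of strings."""
        non_empty = [v.strip() for v in values if v and v.strip()]
        if not non_empty:
            return "TEXT"

        # Integers first, then exact decimals (Decimal avoids float rounding).
        try:
            for v in non_empty:
                int(v)
            return "INT"
        except ValueError:
            pass
        try:
            decimals = [Decimal(v) for v in non_empty]
            if all(d.is_finite() for d in decimals):
                # Keep the largest scale seen so no decimal places are lost.
                scale = max(max(-d.as_tuple().exponent for d in decimals), 0)
                int_digits = max(len(str(abs(int(d)))) for d in decimals)
                return "DECIMAL(%d,%d)" % (int_digits + scale, scale)
        except InvalidOperation:
            pass

        # Dates typed as text; Excel date serials need the cell format instead.
        try:
            for v in non_empty:
                datetime.strptime(v, "%Y-%m-%d")
            return "DATE"
        except ValueError:
            pass

        longest = max(len(v) for v in non_empty)
        return "VARCHAR(%d)" % longest if longest <= varchar_limit else "TEXT"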
Just for (my) reference, I documented below what I did:
(1) I saved the Excel data as CSV so that I could use LOAD DATA INFILE.
(2) CREATE TABLE with all columns as TEXT, except for the primary key.
(3) LOAD DATA LOCAL INFILE, loading all the CSV data into the TEXT fields.
(4) Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the right types and lengths (sketched below, after this list). PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found that useful later for normalization. Eye-balling 200 columns was a breeze with PROCEDURE ANALYSE. Output from phpMyAdmin's "Propose table structure" was junk.
(5) For normalization, I ran SELECT DISTINCT on columns and INSERTed the results into separate tables. I first added an FK column to the old table; just after each INSERT, I got its ID and UPDATEd the FK column. When the loop finished, I dropped the old column, leaving only the FK column. Similarly with multiple dependent columns. It was much faster than I expected.
(6) I ran python manage.py inspectdb, copied the output to models.py and added all those ForeignKey fields, as FKs do not exist on MyISAM. Wrote a little Python: views.py, urls.py, a few templates... TADA.
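A rough sketch of steps (2)-(4), driven from Python with pymysql (my choice here, not necessarily how the steps above were run); the table, column and file names are invented, and note that PROCEDURE ANALYSE was removed in MySQL 8.0, so this only applies up to 5.7:

    import pymysql

    # Connect with local_infile enabled so LOAD DATA LOCAL INFILE is allowed.
    conn = pymysql.connect(host="localhost", user="user", password="secret",
                           database="staging_db", local_infile=True)
    cur = conn.cursor()

    # (2) Staging table: everything as TEXT except a surrogate primary key.
    cur.execute("""
        CREATE TABLE staging (
            id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            col_a TEXT, col_b TEXT, col_c TEXT
        )
    """)

    # (3) Bulk-load the CSV into the TEXT columns, skipping the header row.
    cur.execute("""
        LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE staging
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        (col_a, col_b, col_c)
    """)

    # (4) Let MySQL propose types, then apply the ones you settle on.
    cur.execute("SELECT col_a, col_b, col_c FROM staging PROCEDURE ANALYSE(10, 256)")
    for row in cur.fetchall():
        print(row)  # eyeball Optimal_fieldtype and adjust

    cur.execute("ALTER TABLE staging MODIFY col_b DECIMAL(10,2), MODIFY col_c DATE")

    conn.commit()
    conn.close()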
Pandas can return a schema:
pandas.read_csv('data.csv').dtypes
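For instance, a quick sketch that turns those dtypes into a CREATE TABLE statement; the dtype-to-MySQL mapping here is a simplified assumption:

    import pandas as pd

    # Simplified, illustrative dtype-to-MySQL mapping.
    DTYPE_TO_MYSQL = {
        "int64": "BIGINT",
        "float64": "DOUBLE",
        "bool": "TINYINT(1)",
        "datetime64[ns]": "DATETIME",
        "object": "TEXT",
    }

    df = pd.read_csv("data.csv")
    columns_sql = ",\n  ".join(
        "`%s` %s" % (name, DTYPE_TO_MYSQL.get(str(dtype), "TEXT"))
        for name, dtype in df.dtypes.items()
    )
    print("CREATE TABLE imported (\n  %s\n);" % columns_sql)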