Quickly set up MySQL databases from large static datafiles (MySQL Community Server)
Multiple datafiles can be inserted and connected through foreign keys, all in one step.
For each datafile, the following input is needed:
- A chunked pandas DataFrame, generated by one of pandas' many reader functions
  (each chunk is processed by one insert statement)
- A dictionary with table specifications (sketched below):
  - keys: column names (only specified columns are inserted)
  - values: datatypes + additional specifications (like "PRIMARY KEY")
  - nested dictionaries (optional): hold foreign key constraints
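For illustration, the two inputs could look roughly like this. The file name, column names, and the exact dictionary layout (including the nested foreign key entry) are hypothetical placeholders, not the module's fixed API; see the demo notebook for the real format:

```python
import pandas as pd

# Chunked reader: with chunksize, pd.read_csv returns an iterator of
# DataFrames, so each chunk can later be sent in a single INSERT statement.
chunks = pd.read_csv("ratings.csv", chunksize=100_000)

# Table specification: keys select the columns to insert, values carry the
# MySQL datatype plus optional extras such as "PRIMARY KEY". The nested
# dictionary sketches how a foreign key constraint might be attached
# (hypothetical layout).
ratings_spec = {
    "rating_id": "INT PRIMARY KEY",
    "movie_id": "INT",
    "rating": "DECIMAL(2,1)",
    "foreign_keys": {"movie_id": "movies(movie_id)"},  # hypothetical nesting
}
```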
Check out the Jupyter notebook "demo" to see the module in action.
The pandas library already provides the convenient pd.DataFrame.to_sql() method. It uses SQLAlchemy and allows bulk inserts (the full dataframe at once, or chunkwise). However, when selecting "mysqlconnector" as the driver in the SQLAlchemy engine, I found the bulk insert to be slow. Using the MySQLCursor.executemany() method directly from mysql.connector seems to work a lot faster. Since I wanted to stick with this particular driver, here's my attempt at creating a custom bulk insert function.
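As a minimal sketch of the underlying idea (connection parameters, table name, and column names are placeholders; the actual module adds table creation and foreign key handling on top of this):

```python
import mysql.connector
import pandas as pd

# Hypothetical connection parameters for a local MySQL Community Server.
conn = mysql.connector.connect(
    host="localhost", user="root", password="your_password", database="demo"
)
cursor = conn.cursor()

insert_stmt = (
    "INSERT INTO ratings (rating_id, movie_id, rating) VALUES (%s, %s, %s)"
)

# One executemany() call per chunk: for INSERT statements, mysql.connector
# rewrites the batch into a single multi-row INSERT, which is where the
# speed advantage over row-by-row inserts comes from.
for chunk in pd.read_csv("ratings.csv", chunksize=100_000):
    # astype(object) turns numpy scalars into plain Python values,
    # which mysql.connector can serialize.
    rows = (
        chunk[["rating_id", "movie_id", "rating"]]
        .astype(object)
        .values.tolist()
    )
    cursor.executemany(insert_stmt, rows)
    conn.commit()

cursor.close()
conn.close()
```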