mysql_pd_bulk (MySQL bulk insert from pandas dataframes)

Quickly set up MySQL databases from large static datafiles (MySQL Community Server)

Multiple datafiles can be inserted and connected through Foreign Keys, all in one step.

For each datafile, the following input is needed (a sketch follows this list):

  • A chunked pandas dataframe, generated by one of pandas' many reader functions
    (each chunk is processed by one insert statement)
  • A dictionary with table specifications:
    • keys: column names (only the specified columns are inserted)
    • values: datatypes plus additional specifications (like "PRIMARY KEY")
    • nested dictionaries (optional): hold foreign key constraints
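
A minimal sketch of what these two inputs might look like. The file, table, and column names are illustrative, and the exact layout of the specification dictionary (in particular the key holding the nested foreign key dictionary) is an assumption; see the "demo" notebook for the format the module actually expects.

```python
import pandas as pd

# Chunked reader: each chunk is later inserted with a single statement.
# File name and chunksize are illustrative.
order_chunks = pd.read_csv("orders.csv", chunksize=50_000)

# Hypothetical table specification, mirroring the structure described above:
# column names map to datatypes plus extra clauses, and an optional nested
# dictionary holds foreign key constraints.
orders_spec = {
    "order_id": "INT PRIMARY KEY",
    "customer_id": "INT NOT NULL",
    "amount": "DECIMAL(10, 2)",
    "foreign_keys": {  # assumed key name, not necessarily the module's
        "customer_id": "REFERENCES customers (customer_id)",
    },
}
```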

Check out the Jupyter notebook "demo" to see the module in action.

Note:

The pandas library already provides the convenient pd.DataFrame.to_sql() method, which uses SQLAlchemy and supports bulk inserts (the full dataframe at once, or chunkwise). However, with "mysqlconnector" selected as the driver in the SQLAlchemy engine, I found the bulk insert to be slow. Using the MySQLCursor.executemany() method directly from mysql.connector seems to work a lot faster. Since I wanted to stick with this particular driver, here's my attempt at a custom bulk insert function.
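
To illustrate the underlying technique (not the module's public API), here is a minimal sketch of a chunkwise insert via MySQLCursor.executemany(); the connection parameters, table, and file names are placeholders.

```python
import mysql.connector
import pandas as pd

# Placeholder connection parameters.
conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="demo"
)
cursor = conn.cursor()

for chunk in pd.read_csv("orders.csv", chunksize=50_000):
    cols = ", ".join(chunk.columns)
    placeholders = ", ".join(["%s"] * len(chunk.columns))
    sql = f"INSERT INTO orders ({cols}) VALUES ({placeholders})"
    # executemany() batches the whole chunk into one multi-row INSERT,
    # which is where the speedup over to_sql() comes from.
    cursor.executemany(sql, list(chunk.itertuples(index=False, name=None)))
    conn.commit()

cursor.close()
conn.close()
```

Committing after each chunk keeps transactions small; committing once after the loop is an alternative trade-off.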
