My task here was a research project on employees of this corporation from the 1980s and 1990s. All that remained of the database of employees from this period are six CSV files.
I designed tables to hold data in the CSVs, imported the CSVs into a SQL database, and queried the data using Python/Pandas/Jupyter Notebook.
After inspecting the CSVs, I sketched out an entity relationship diagram (ERD) of the tables
Using the ERD, I created a table schema for each of the CSV files. I specified data types, primary keys, foreign keys, and other constraints. I then imported each CSV file into its corresponding SQL table.
The following are a sample of the queries run:
- Listing the employee number, last name, first name, gender, and salary of each employee
- Listing the manager of each department with specific information
- Listing all employees in Sales, including their employee number, last name, first name, and department name
- In descending order, listing the frequency count of employee last names
I imported the SQL database into Pandas. Using the Python SQL toolkit and Object Relational Mapper "SQLAlchemy", I used the create_engine
function to interact directly with the database using Jupyter Notebook.
I used SELECT
statements to call in the "salaries" and "titles" tables from the SQL database,
then combined the dataframes in order to better aggregate the data.
I then found the mean salary of each job title,
before using the Python plotting library Matplotlib
to create a bar graph visualization.