Turn Your Excel Workbook Into a SQLite Database

Transforming spreadsheets into queryable database tables

Scott A. Adams
Towards Data Science
8 min read · May 18, 2020


Introduction

A relational database is a collection of data tables — sets of rows and columns that store individual pieces of data — that can be connected to each other. In this way, a relational database is not totally dissimilar from an Excel workbook with related datasets stored across multiple worksheets. With that thought in mind, this post moves through an example using Python to transform an Excel spreadsheet into a database that can be queried using Structured Query Language (SQL).

Data

The example in this post uses data from the Superstore-Sales dataset, which can be found here. This dataset is stored as an Excel workbook and contains example sales transaction data stored across the following four sheets: sales, orders, customers, and products.

Reading The Data In Python

First, if you have not installed the xlrd package using conda or pip, do so before starting a Python session; otherwise you will run into an ImportError when trying to read in the Excel file (even with pandas installed).

Now let’s start a Python session and import pandas and sqlite3 using import pandas and import sqlite3. We will read the data from each Excel spreadsheet into a separate pandas data frame using read_excel, as shown in the code below.
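A minimal sketch of this step, assuming the workbook is saved as data/superstore.xls and using the sheet names listed above, looks like this:

import pandas
import sqlite3

# Read each worksheet of the workbook into its own pandas data frame;
# header=0 treats the first row of each sheet as the column headers.
sales = pandas.read_excel('data/superstore.xls', sheet_name='sales', header=0)
orders = pandas.read_excel('data/superstore.xls', sheet_name='orders', header=0)
customers = pandas.read_excel('data/superstore.xls', sheet_name='customers', header=0)
products = pandas.read_excel('data/superstore.xls', sheet_name='products', header=0)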

In each code chunk, the first parameter in read_excel identifies the name of the file to be processed. If necessary, include the path in which the file resides. For example, when I ran this code my data was stored in the data directory, but my working directory was one directory up. The second parameter, sheet_name, specifies the name of the spreadsheet in the workbook to be processed. The last parameter, header=0, tells Python that the first row in the spreadsheet being read contains the column headers. Remember that Python is zero-indexed, which is why we use 0 and not 1 to identify the first row. By explicitly identifying the header row, the values in the first row of each spreadsheet will be treated as the column names for its respective data frame.

Let’s take a look at the first few rows of each data frame just to make sure everything looks correct.
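For instance, calling head() on each data frame displays its first five rows:

# Preview the first few rows of each data frame.
print(sales.head())
print(orders.head())
print(customers.head())
print(products.head())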

Great! Now let’s make that database.

Creating a SQLite Database

There are multiple relational database products available and the specific product we will use here is SQLite. This is a lightweight SQL database engine that can be used to create databases stored as files on one’s personal computer. We can initiate a new SQLite database connection object and assign this object to a variable. Below, I named this variable db_conn.
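A minimal sketch of that step, assuming the database file should live at data/superstore.db, is a single call to sqlite3.connect:

# Connect to the SQLite database file, creating it if it does not yet exist.
db_conn = sqlite3.connect('data/superstore.db')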

When this code above is executed, a new file named superstore.db will be created in the data directory (assuming superstore.db does not already exist in this directory). The newly created superstore.db file is an empty SQLite database at this point (i.e., it has no tables). db_conn is also pointing to the superstore.db database and can be thought of as an interface for working with this database. Cool! But wait…how do we actually run the SQL code that will create the data tables? Glad you asked. We now need to establish a cursor object, which is a tool that executes SQL code against the database of interest. Here, the cursor is assigned to a variable I named c.
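Creating the cursor is one call on the connection object:

# Create a cursor for executing SQL statements against the database.
c = db_conn.cursor()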

The next step is to create the tables that will be included in the database, though it is important to emphasize that this next step of creating the tables will result in empty tables. Later on, we will populate our tables with the data from the four previously-created data frames. Before we move further, however, let’s take a look at the actual SQL code that will be used to create the sales table to better understand the table creation process.
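The statement below is a sketch of that SQL; the data types assigned to each column are assumptions based on the column descriptions that follow.

CREATE TABLE IF NOT EXISTS sales (
    SalesID INTEGER,
    OrderID INTEGER,
    ProductID TEXT,
    Sales REAL,
    Quantity INTEGER,
    Discount REAL,
    Profit REAL,
    PRIMARY KEY(SalesID),
    FOREIGN KEY(OrderID) REFERENCES orders(OrderID),
    FOREIGN KEY(ProductID) REFERENCES products(ProductID));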

Line #1 in the code above provides a command to create a table named sales and lines #2–8 create 7 new columns in sales: SalesID, OrderID, ProductID, Sales, Quantity, Discount, and Profit. The data type for each column is specified to the right of the respective column name. More information on SQLite datatypes can be found here. Note that the order of the columns matches the order of the columns in the associated data frame.

The ordering of the columns in the CREATE TABLE statement is intentional as this ensures that the appropriate values from the data frame go to the intended column in the database table. For example, had I made OrderID the first column and SalesID the second, SalesID values would be written to the OrderID column in the sales table and the SalesID column in this table would contain OrderID values.

Line #9 establishes the table’s primary key, which is a column that contains values that uniquely identify each row. In the sales table, SalesID satisfies the requirements for a primary key because no two rows have the same SalesID value.

Snapshot of the “sales” table

Line #10 in the previous set of code establishes the first of two foreign keys for sales. A foreign key is a column in one table that is the primary key in another table. For instance, notice that different rows can share the same OrderID value in the sales table, which disqualifies OrderID from being the primary key in this table. However, in the orders table each row does contain a unique OrderID value. As such, OrderID can serve as the primary key for orders.

Snapshot of the “orders” table

Let’s return to the actual code in line #10 in the previous CREATE TABLE statement. The first part of this code, FOREIGN KEY(OrderID), establishes that the OrderID column in sales is a foreign key. The second part, REFERENCES orders(OrderID), then specifies the table and primary key to which OrderID refers. Line #11 from the CREATE TABLE statement follows the same logic for the ProductID column.

By specifying the primary and foreign keys, we are able to create a “map” that shows how the tables in the database relate to each other. For example, someone who is familiar with SQL concepts but unfamiliar with this particular data could take a look at the CREATE TABLE statement that creates the sales table and recognize that data from the sales and orders tables can be brought together, or joined, by matching on values in the OrderID columns from both tables. To briefly illustrate how joining works, consider how we can add OrderDate to the sales table by “joining” on OrderID, as illustrated below.

A basic visual demonstration of how joining works
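In SQL terms, the same result could come from a join along these lines (this assumes, as the snapshot above suggests, that orders contains an OrderDate column):

SELECT sales.*, orders.OrderDate
FROM sales
JOIN orders ON sales.OrderID = orders.OrderID;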

Finishing the Database

With a basic overview of the table creation process established, let’s create all the database tables. This is where we get to put the cursor, c, to work. We can run c.execute() and pass the desired SQL code as a string in the parentheses to run that SQL against the database to which we are currently connected (i.e., data/superstore.db).
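A sketch of that step for the sales table is below; the orders, customers, and products tables are created with analogous c.execute() calls, whose exact column definitions are not spelled out in this post.

# Pass the CREATE TABLE statement for sales to the cursor as a string.
c.execute("""CREATE TABLE IF NOT EXISTS sales (
                 SalesID INTEGER,
                 OrderID INTEGER,
                 ProductID TEXT,
                 Sales REAL,
                 Quantity INTEGER,
                 Discount REAL,
                 Profit REAL,
                 PRIMARY KEY(SalesID),
                 FOREIGN KEY(OrderID) REFERENCES orders(OrderID),
                 FOREIGN KEY(ProductID) REFERENCES products(ProductID));""")
# Repeat with the corresponding statements for orders, customers, and products.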

It is now time to populate the tables in the database with the relevant data from the four data frames we previously created. Fortunately, this step can be accomplished with ease using the pandas to_sql method on each data frame (see more details on to_sql here). The code in the block below transfers the data from each of the four data frames into the appropriate tables in the database.
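A sketch of that transfer, using the data frame names from earlier:

# Append the rows from each data frame to its corresponding database table.
sales.to_sql('sales', db_conn, if_exists='append', index=False)
orders.to_sql('orders', db_conn, if_exists='append', index=False)
customers.to_sql('customers', db_conn, if_exists='append', index=False)
products.to_sql('products', db_conn, if_exists='append', index=False)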

The first parameter in each line of code above identifies the name of the database table to which the values from the data frame will be written and the following parameter specifies the actual database connection. Next, if_exists='append' tells to_sql that if the table already exists, which it does in this case, then the values from the data frame should be inserted into the table. Lastly, with index=False we tell to_sql not to include the index as an additional column in the table.

Conclusion

Now we have a SQLite database that is ready to go! You can query the superstore.db database file with the SQLite command line shell or other database software that supports SQLite, such as DbVisualizer. You can also run queries on the database using read_sql from pandas (see more information here). For instance, if we want to see the first five rows from sales we could run the code shown below in our Python shell or notebook.
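A query along these lines would do it:

# Read the first five rows of the sales table back into a data frame.
pandas.read_sql('SELECT * FROM sales LIMIT 5;', db_conn)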

When you are finished working on this database in Python you can close the connection by running db_conn.close(). If you want to work with this database again after closing, just reestablish the database connection and run queries using read_sql. There is no need to establish a new cursor if you are not writing data to the database.
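Reestablishing the connection is the same one-line call as before:

# Reconnect to the existing database file and query it with read_sql.
db_conn = sqlite3.connect('data/superstore.db')
pandas.read_sql('SELECT * FROM sales LIMIT 5;', db_conn)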

One parting word of clarification before we close. At the start of this post, I mentioned that the structure of an Excel workbook is similar to a SQL database in that both can consist of multiple related tables. To that end, this exercise is intended to demonstrate how an Excel workbook can easily be turned into a SQLite database file. Nonetheless, we have just scratched the surface of database design, and I want to stress that a well-designed database is more than just a collection of tables. Indeed, the process of normalization, which aims to remove data redundancies and promote data integrity, is very important in relational database design. Be aware that the database we created could be normalized further. If you are interested in learning more about relational databases and the rationale behind storing data across multiple tables, I do recommend reading more about normalization.

Alright, that will do it for this post. Thanks for reading and please reach out if you have any questions and/or constructive feedback.
