Step-by-Step: Reading and Querying Data in SQLite using Python
Regardless of whether you’re an analyst, scientist, engineer, or developer, knowing how to query from SQL databases and tables is a valuable skill. By pairing SQL with Python, we get the extra benefit to automate tasks or go to data visualization rather quickly.
In this article we’ll focus on reading data from SQLite using Python. This should help you with your Python and SQL skills.
Step 1: Check you’ve already PIP installed the Packages
For this project, we’ll use SQLite3 and Pandas.
$ pip install sqlite3
$ pip install pandas
Step 2: Download Data, Connect to SQLite, and Query the Tables
Let’s download survey data about Mental Health in the Tech Industry, which is on Kaggle.com (Kaggle is a great resource for amazing open-source datasets!). Go to the link and click the button “Download (6MB)”. You may need to sign up before you can download the data. I simply downloaded the data to my Downloads folder and unzipped the file. After unzipping, notice the “.sqlite” extension.
Now that we have the data, let’s connect to the SQLite database and query the tables.
To select all of the data in one table in SQL, we simply run: “select * from database_name.table_name”. The * means all/ everything.
import sqlite3
import pandas as pd# You may need to update the directory path ("../...") below.
db_connection = sqlite3.connect("../Downloads/mental_health.sqlite")# We're selecting and printing all from the Answer table.
answer_table = pd.read_sql("select * from Answer", db_connection)
print(answer_table)# answer_table Output #
AnswerText SurveyID UserID QuestionID
0 37 2014 1 1
1 44 2014 2 1
2 32 2014 3 1
3 31 2014 4 1
4 31 2014 5 1
... ... ... ... ...
236893 Other 2016 2689 117
236894 Support 2016 2690 117
236895 Back-end Developer 2016 2691 117
236896 DevOps/SysAdmin 2016 2692 117
236897 Other 2016 2693 117# We're selecting and printing all from the Question table.
question_table = pd.read_sql("select * from Question", db_connection)
print(question_table)# question_table Output #
questiontext questionid
0 What is your age? 1
1 What is your gender? 2
2 What country do you live in? 3
3 If you live in the United States, which state ... 4
4 Are you self-employed? 5
.. ... ...
100 Do you think that team members/co-workers woul... 114
101 If yes, what condition(s) have you been diagno... 115
102 If maybe, what condition(s) do you believe you... 116
103 Which of the following best describes your wor... 117
104 Do you work remotely? 118# We're selecting and printing all from the Survey table.
survey_table = pd.read_sql("select * from Survey", db_connection)
print(survey_table)# survey_table Output # SurveyID Description
0 2014 mental health survey for 2014
1 2016 mental health survey for 2016
2 2017 mental health survey for 2017
3 2018 mental health survey for 2018
4 2019 mental health survey for 2019
We can answer some simple questions using our Answer table, such as: counting the number of unique UserIDs/ how many people took the survey.
sql_query = "select count(distinct UserID) as unique_ids from Answer"
distinct_userid_in_answer = pd.read_sql(sql_query, db_connection)print(distinct_userid_in_answer)# distinct_userid_in_answer Output #
unique_ids
0 4218
Let’s discuss some of the sql commands from this query:
- count(): Means we’re going to count whatever is inside count().
- Inside count() is distinct UserID. distinct means only get one occurrence for each value of (in this case) UserID.
- as unique_ids: Changes the column name output to (in this case) unique_ids.
So we now know 4,218 unique users answered the survey. But to answer more questions, we need to join our tables together, so let’s try that!
Step 3: Entity-Relationship Diagrams (ERDs)
Before we get into how to join tables in SQL, let’s first visualize how our tables relate to each other using Entity-Relationship Diagrams (ERDs).
Creating ERDs is a useful skill to have because they make it easier to explain how tables interact with each other.
Now let’s explain what some of these ERD signs and symbols mean.
- BoldColumnName (Pk): The Primary key (Pk) for each table. The Pk is helpful because it provides a unique value that can identify a specific row in a table. For example, in our code at the bottom of Step 2, we used UserID Pk to get the number of distinct/unique users.
- ColumnName (Fk): The Foreign key (Fk) for a table. The Fk of one table is the Pk of another table. As such, the Fk gives us the ability to link (or join) two or more tables.
- One to One Relationship: These are shown to the right of the survey_table and the left side of the question_table. For example, the One to One relationship between the answer_table and the survey_table means that each SurveyID in the answer_table can be tied back to one and only one SurveyID in the survey_table. This makes logical sense because the SurveyID is the primary key in the survey_table.
- One to Many Relationship: This can be seen on either side of the answer_table. The One to Many relationship between the question_table and the answer_table means that each questionid in the question_table must be in the answer_table at least once and up to many times. This makes logical sense because the same question will be asked in each survey.
Now that we’re more familiar with ERDs, let’s code the ERD above using SQL!
Step 4: Joining Tables
After looking at the tables again, I think it makes the most sense to join answer_table and question_table (because survey_table does not provide any new data. SurveyID is just the date of the survey). To join those two tables together, we simply run:
sql_query = """
select *
from Question
inner join Answer on Question.questionid = Answer.QuestionID
"""question_answer_table = pd.read_sql(sql_query, db_connection)
print(question_answer_table)
Let’s discuss some of the sql commands from this query:
- inner join: Combines rows that have matching values (You can learn more about joins in the links below).
- on Question.questionid = Answer.QuestionID: This explains the values (questionid and QuestionID) we wish to match.
Now that we can join tables together, let’s practice filtering them.
Step 5: Filtering Queries
Filtering queries helps you pull a subset of data. This is helpful if your data has millions (or billions) of rows and/or you’re only curious about specific columns of data.
To practice filtering our data, let’s answer these two questions:
- How many survey takers were 30 years old or younger during each survey year?
- How many survey takers work in tech companies as opposed to non-tech?
How many survey takers were 30 years old or younger during each survey year?
sql_query = """
select SurveyID, count(*) as surveyors_less_than_30
from Question
inner join Answer on Question.questionid = Answer.QuestionID
where questiontext in ('What is your age?')
and AnswerText < 30
group by SurveyID"""less_than_30_each_year = pd.read_sql(sql_query, db_connection)
print(less_than_30_each_year)# less_than_30_each_year Output # SurveyID count(*)
0 2014 521
1 2016 450
2 2017 205
3 2018 117
4 2019 99
Let’s discuss some of the sql commands from this query:
- where questiontext in (‘What is your age?’) and AnswerText < 30: Where statements are used to filter data. In this case, we wanted only questions that ask surveyors age and if their age was less than 30.
- group by SurveyID: Group by will group the same rows based on a column (SurveyID). This way we can aggregate up know how many surveys are from each other.
How many survey takers work in tech companies as opposed to non-tech?
sql_query = """
select AnswerText, count(*) as tech_or_nontech
from Question
inner join Answer on Question.questionid = Answer.QuestionID
where questiontext like ('Is your employer primarily a tech company/orga%')
group by AnswerText"""tech_or_nontech = pd.read_sql(sql_query, db_connection)
print(tech_or_nontech)
Let’s discuss some of the sql commands from this query:
- where questiontext like (‘Is your employer primarily a tech company/orga%’): This where statement filters questiontext for the “Is your employer primarily a tech company/orga…” question. The like operator means “is similar to.” The % wildcard at the end of the string means that the beginning text must match specifically, but the text after company/orga can be anything.
Your Turn to Explore More!
- Learn more about SQL using this Cheatsheet!
- Query the data to answer more questions related to gender, country of origin, or self-employed workers.
- Use the Seaborn data visualization library for analysis.
Thanks for reading! Feel free to comment and let me know if you want a YouTube tutorial.
Want to stay in touch? Connect with me on LinkedIn!