How Innovation in Tech is Empowering the App Development Sector? 1

How to Query Your Pandas Dataframe

Source Node: 1017694

How to Query Your Pandas Dataframe

A Data Scientist’s perspective on SQL-like Python functions.


By Matthew Przybyla, Senior Data Scientist at Favor Delivery



Photo by Bruce Hong on Unsplash [1].

Table of Contents

 
 

  1. Introduction
  2. Multiple Conditions
  3. Merging On Multiple, Specific Columns
  4. Summary
  5. References

Introduction

 
 
Whether you are transitioning from a data engineer/data analyst or wanting to become a more efficient data scientist, querying your dataframe can prove to be quite a useful method of returning specific rows that you want. It is important to note that there is a specific query function for pandas, appropriately named, query. However, I will instead be discussing the other ways that you can mimic querying, filtering, and merging your data. We will present common scenarios or questions that you would ask to your data, and rather than SQL, we will do it with Python. In the paragraphs below, I will outline some simple ways of querying rows for your pandas dataframe with the Python programming language.

Multiple Conditions

 
 



Sample data. Screenshot from Author [2].

 

As data scientists or data analysts, we want to return specific rows of data. One of these scenarios is where you want to apply multiple conditions, all in the same line of code. In order to display my example, I have created some fake sample data of a first and last name, as well as their respective gender and birthdate. This data is displayed above in the screenshot.

The example multiple conditions will essentially answer a specific question, just like when you use SQL. The question is, what percent of our data is Male gender OR a person who was born between 2010 and 2021.

Here is the code that will solve that question (there are a few ways to answer this question, but here is my specific way of doing it):

print(“Percent of data who are Males OR were born between 2010 and 2021:”, 100*round(df[(df[‘Gender’] == ‘M’) | (df[‘Birthdate’] >= ‘2010–01–01’) & (df[‘Birthdate’] <= ‘2021–01–01’)][‘Gender’].count()/df.shape [0],4), “%”)

To better visualize this code, I have also included this screenshot of that same code from above, along with the output/result. You can also apply these conditions to return the actual rows instead of getting the fraction or percent of rows out of the total rows.



Conditions code. Screenshot by Author [3].

 

Here is the order of commands we performed:

  • Return rows with Male Gender
  • Include the OR function |
  • Return the rows of Birthdate > 2010 and 2021
  • Combine those all, and then divide by the total amount of rows

As you can see, this code is similar to something you would see in SQL. I personally think it is easier in pandas because it can be less code, while also being able to visually see all the code in one easy spot, without having to scroll up and down (but this format is just my preference).

Merging On Multiple, Specific Columns

 
 



Merged dataframe result. Screenshot by Author [4].

 

We have probably seen how to merge dataframes together in other tutorials, so I wanted to add a unique approach that I have not really seen out there, which is merging on multiple, specific columns. In this scenario, we want to join two dataframes where two fields are shared between them. You could tell that if there are even more columns, this method could be even more useful.

We have our first dataframe, which is df, then we are merging our columns on a second dataframe, df2. Here is that code to achieve our expected result:

merged_df = df.merge(df2, how=’inner’, left_on=cols, right_on=cols )

To better visualize this merging and code, I have presented the screenshot below. You see what the second dataframe looks like below, with the First and Last names, just like they are in the first dataframe, but with a new column, Numeric. Then, we have out specific columns that we wanted to merge on, while returning columns GenderBirthdate, and the new Numeric column as well. The columns are a list of columns, which is named cols.



Merging dataframe. Screenshot by Author [5].

 

As you can see, this way of merging dataframes is a simple way to achieve the same results that you would get from a SQL query.

Summary

 
 
In this tutorial, we saw two common questions or queries that you would perform in SQL, but instead, have performed them with pandas dataframes in Python.

To summarize, here are the two scenarios we worked with:

  1. Returning the percent of rows out of the total dataset from multiple conditions
  2. Merging on multiple, specific columns to return a final dataframe with a new column

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these methods. Why or why not? These can certainly be clarified even further, but I hope I was able to shed some light on some of the ways you can use pandas and Python instead of SQL. Thank you for reading!

Please feel free to check out my profile, Matt Przybyla, and other articles, as well as reach out to me on LinkedIn.

I am not affiliated with these companies.

 
References

[1] Photo by Bruce Hong on Unsplash, (2018)

[2] M. Przybyla, sample data screenshot, (2021)

[3] M. Przybyla, conditions code screenshot, (2021)

[4] M. Przybyla, merged dataframe result screenshot, (2021)

[5] M. Przybyla, merging dataframe screenshot, (2021)

 
Bio: Matthew Przybyla is Senior Data Scientist at Favor Delivery, and a freelance technical writer, especially in data science.

Original. Reposted with permission.

Related:

Source: https://www.kdnuggets.com/2021/08/query-pandas-dataframe.html

Time Stamp:

More from KDnuggets