Pandas DataFrames Tutorial

By: Dylan Manchester

Datasource: https://www.kaggle.com/lava18/google-play-store-apps


Introduction

Pandas is an open source Python package that makes working with data more intuitive and faster.

The primary Pandas objects are Series and DataFrames.

Series are 1D arrays where the elements have index labels.

DataFrames are 2D arrays with labeled columns and indexed rows, comparable to an SQL table but with more features.

This tutorial focuses on the Pandas DataFrame, so we first need to import pandas.
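The conventional import aliases the package as pd:

    import pandas as pd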


Creating DataFrames

DataFrame Method

Syntax: DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

No Data

The simplest way to create a DataFrame is to just provide the structure without any data.
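For example, a minimal sketch where only the row index and column labels are supplied (the labels here are arbitrary):

    # Structure only: every cell is filled with NaN
    df = pd.DataFrame(index=['r1', 'r2', 'r3'], columns=['a', 'b'])
    print(df)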

List of Lists

When providing a list of lists, each internal list is interpreted as a row in the DataFrame.

It is important to note that all of the lists must be the same length for this to work.
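A small sketch with made-up values:

    # Each inner list becomes one row
    rows = [[1, 2, 3],
            [4, 5, 6]]
    df = pd.DataFrame(rows, columns=['a', 'b', 'c'])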

Dictionary of Lists

If the lists we have represent columns rather than rows, we can use a dictionary of lists where the key for each list becomes the column label.

This method also requires the lists to be the same length.
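The same small table built column-wise (values are again made up):

    # Each key becomes a column label, each list a column of values
    data = {'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}
    df = pd.DataFrame(data)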

List of Dictionaries

For less structured data, a list of dictionaries may be more appropriate.

Each dictionary represents a row and each key in the dictionary represents a column.

The dictionaries may have any number of keys and they are not required to match.
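A sketch showing how mismatched keys are handled:

    # Missing keys are filled with NaN in the resulting DataFrame
    records = [{'a': 1, 'b': 2},
               {'b': 5, 'c': 6}]
    df = pd.DataFrame(records)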

Reading Data

More often than not, though, data comes from external sources.

There are a few different input methods to create DataFrames from specific types of sources.

Flat Files

Flat files include text or csv files that use a delimiter to separate the columns.

Commonly used methods are read_table, which assumes a tab delimiter, and read_csv, which assumes a comma.

There is a plethora of optional parameters to explore when working with different datasets; they can be found in the pandas documentation.

Note the .head(n=5) method can be called on DataFrames to only show the first n rows.
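For instance, reading the two files from the Kaggle dataset above (the file names follow the dataset; the variable names apps and app_reviews are choices carried through the rest of this tutorial):

    # read_csv assumes a comma delimiter
    apps = pd.read_csv('googleplaystore.csv')
    app_reviews = pd.read_csv('googleplaystore_user_reviews.csv')
    apps.head()  # shows the first 5 rows by default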

HTML

Another practical datasource is an HTML file.

The read_html method will extract a list of tables from within an HTML source like a URL.

The tables found within the webpage are returned as a list.
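A small sketch (the URL is just an illustration, and read_html needs an HTML parser such as lxml installed):

    # Every <table> found in the page becomes one DataFrame in the list
    tables = pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)')
    first_table = tables[0]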


Accessing Data

Now that we have covered how to get data into a DataFrame, we need to understand how to access data within a DataFrame.

Basic Indexing

This indexing method is the most convenient when looking to access a specific column or list of columns.
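For example, with the apps DataFrame from earlier:

    ratings = apps['Rating']          # a single column returns a Series
    subset = apps[['App', 'Rating']]  # a list of columns returns a DataFrame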

Boolean Indexing

With boolean indexing, denoted by square brackets, a subset of rows is returned based on a Series of True/False values of the same length.

This is very useful for extracting rows that meet certain criteria.

We can see that 23,998 of the reviews were marked as positive.

This can also be done in one line with no extra variable.

Boolean indexing can also be used to select a single index; note that a DataFrame is returned rather than a Series.
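A sketch of all three forms, assuming the app_reviews DataFrame from earlier:

    # Build the mask as a separate variable...
    positive = app_reviews['Sentiment'] == 'Positive'
    positive.sum()  # counts the True values
    app_reviews[positive]

    # ...or do it in one line with no extra variable
    app_reviews[app_reviews['Sentiment'] == 'Positive']

    # Selecting a single index this way still returns a DataFrame
    app_reviews[app_reviews.index == 0]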

Loc Method

The loc method can be used to access a single or set of specified rows and columns.

Another convenient feature of the loc method is that it can add rows: assigning to an index label that does not exist creates it.
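A self-contained sketch with made-up labels:

    df = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
    df.loc['a', 'x']         # a single cell by row and column label
    df.loc[['a', 'b'], 'x']  # a set of rows for one column
    df.loc['c'] = [3]        # assigning to a new label adds a row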

Iloc Method

If we do not want to use the index labels, we can instead access rows and columns by position using the iloc method.

However, this cannot be used to add new rows to the DataFrame.
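The positional equivalents, using the same small DataFrame:

    df.iloc[0]         # first row by position
    df.iloc[0:2, 0:1]  # positional slices exclude the endpoint, unlike loc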


Exploring DataFrames

With a basic understanding of DataFrames, we can start to explore them more in depth.

Overview

The first exploratory task when working with new data is to get an overall picture of it.

The .dtypes property shows us the datatype for each column of a DataFrame.

The .shape property of a DataFrame returns the size as a tuple of (# of rows, # of columns).
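For example:

    apps.dtypes        # datatype of each column
    apps.shape         # (# of rows, # of columns)
    app_reviews.shape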

We can see that there are 10,841 different apps and a total of 64,295 reviews.

The apps DataFrame has 13 columns, the app_reviews DataFrame has 5, and all of the columns have either an object or float64 datatype.

However, some of the columns of the app data such as Reviews and Price were assigned to the object datatype when they appeared to be numerical.

Let's figure out why:
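A minimal sketch of a helper like the floatTest used below; the name and implementation are assumptions, but the idea is to report the first value in a column that cannot be cast to float:

    def floatTest(series):
        # Return the first value that float() rejects, if any
        for value in series:
            try:
                float(value)
            except (ValueError, TypeError):
                return value
        return None

    floatTest(apps['Reviews'])
    floatTest(apps['Price'])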

When we try to cast the Reviews column to a float, we can see that 3.0M is a value in the column.

For the Price column, the first non-float value we find is $4.99.

This is important to know and we will resolve these values before casting the columns.

Before we start the cleaning stage, we will look for columns with a low number of unique values so we can cast them to the category datatype.

The category datatype is not always the right choice, but if used correctly it can reduce memory usage, increase performance and allow for custom sorting.
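A quick scan along those lines (the threshold of 50 matches the discussion below):

    # Object columns with few unique values are category candidates
    for df in (apps, app_reviews):
        for col in df.columns:
            if df[col].dtype == object and df[col].nunique() < 50:
                print(col, df[col].nunique())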

From the output, we can see that Category, Installs, Type, Content Rating, and Android Ver are all object datatype and have less than 50 unique values.

These columns could be stored more efficiently as a category datatype.

The Sentiment column only has 3 unique values so it would definitely benefit from being converted into a category datatype.


Data Cleaning

The next step in the data science lifecycle is to clean our data for analysis.

Modifying Values

First we will focus on the Price column.

Remember our test to find values that couldn't be converted to floats?

It seems that some Price values have a $ character at the beginning.

We can use the replace method combined with regular expressions (regex) to remove the $ character.
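One way to write that, assuming the apps DataFrame (note the \$ escape, since $ is a regex anchor):

    # Strip a leading $ from every Price value
    apps['Price'] = apps['Price'].str.replace(r'^\$', '', regex=True)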

Now that we solved that issue, let's retry the floatTest to see if there are other non-float values in the Price column.

It looks like one of the values in the Price column is Everyone which definitely does not seem right.

We can use boolean indexing to find the culprit row.

We can see that the values are shifted over by one column, probably due to this app not having a Category.

It also happens that for this row the Reviews column contains the value 3.0M which is the same issue that the floatTest found.

Let's quickly fix it by shifting the values over one column and replacing the missing Category with a null value.
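A sketch of that fix; boolean indexing locates the row, and numpy supplies the null value:

    import numpy as np

    # Find the row whose Price is 'Everyone'
    bad = apps[apps['Price'] == 'Everyone'].index[0]

    # Shift everything from Category onward right by one column,
    # then mark the missing Category as null
    cols = apps.columns
    apps.loc[bad, cols[2:]] = apps.loc[bad, cols[1:-1]].values
    apps.loc[bad, 'Category'] = np.nan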

Let's retry the floatTest again for both columns...

Sweet! It looks like our fixes worked!

Changing Column Types

Finally we can cast the columns using the astype method.

Using the memory_usage method before and after casting the columns shows how much memory we save.
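A sketch covering both steps; the column lists come from the exploration above:

    before = apps.memory_usage(deep=True).sum()

    # Cast the cleaned numeric columns
    apps['Reviews'] = apps['Reviews'].astype(float)
    apps['Price'] = apps['Price'].astype(float)
    apps['Rating'] = apps['Rating'].astype(float)  # in case the row fix above left it as object

    # Cast the low-cardinality columns to category
    for col in ['Category', 'Installs', 'Type', 'Content Rating', 'Android Ver']:
        apps[col] = apps[col].astype('category')
    app_reviews['Sentiment'] = app_reviews['Sentiment'].astype('category')

    after = apps.memory_usage(deep=True).sum()
    print(before, after)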

Wow! The Apps DataFrame uses about half the memory after casting and the Sentiment column uses about 1/8th!

Removing Null Values

The next cleaning process to undertake is to remove null values.

First we use the isna method to count the null values in each column to see how we should address them.
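For example:

    apps.isna().sum()         # null count per column
    app_reviews.isna().sum()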

Some of the columns only have a few null values so we will remove the rows containing those.

However, the Rating column has numerous null values, so we will create a separate DataFrame with these values removed.

Now we have two DataFrames:

Many of the review rows seem to have all null values except for the app name, so we will remove all rows with null values.
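A sketch of those three steps; the name apps_rated is an assumption:

    # Drop the few rows with nulls in columns other than Rating
    other_cols = [c for c in apps.columns if c != 'Rating']
    apps = apps.dropna(subset=other_cols)

    # A second DataFrame with the Rating nulls removed as well
    apps_rated = apps.dropna(subset=['Rating'])

    # Drop reviews that are null everywhere but the app name
    app_reviews = app_reviews.dropna()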


Writing Data

Since our data is more usable now, we should create files to save it for future access.

To CSV

Our first option is to write to a csv file with the added option of compressing it to reduce storage.
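For instance (the file name is a choice):

    # Compression can be set explicitly or inferred from the extension
    apps.to_csv('apps_clean.csv.gz', compression='gzip', index=False)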

To HDF

Another storage format is an HDF file where we can write multiple DataFrames to the same file using a key.
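A sketch, with the file and key names as assumptions:

    # Both DataFrames share one file, distinguished by key; the table
    # format supports category dtypes (requires the PyTables package)
    apps_rated.to_hdf('playstore.h5', key='apps', format='table')
    app_reviews.to_hdf('playstore.h5', key='reviews', format='table')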

There are many other storage formats and optional parameters that can be explored in the pandas IO documentation.


Manipulation Methods

Another important feature of Pandas DataFrames is the set of manipulation methods.

Concatenate

The concat method is for combining DataFrames either by column or by row.

Multiple DataFrames can be concatenated at a time, and there are many join-type and indexing options.
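A small self-contained sketch:

    left = pd.DataFrame({'a': [1, 2]})
    right = pd.DataFrame({'b': [3, 4]})

    pd.concat([left, right], axis=1)            # combine by column
    pd.concat([left, left], ignore_index=True)  # stack by row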

Append

The append method can easily concatenate rows, providing a subset of the concat method's functionality.
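Note that append was deprecated in pandas 1.4 and removed in 2.0, so concat is now the durable spelling of the same operation:

    new_row = pd.DataFrame({'a': [3]})
    # On older pandas: left = left.append(new_row, ignore_index=True)
    left = pd.concat([left, new_row], ignore_index=True)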

Merge

The merge method can be used to combine two DataFrames based on column values, similar to the SQL join function.

The left and right parameters are the respective DataFrames while the left_on and right_on parameters dictate which columns will be used for the join.
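A sketch using the cleaned DataFrames, joining on their shared App column:

    merged = pd.merge(left=apps_rated, right=app_reviews,
                      left_on='App', right_on='App')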

Join

The join method adds the columns of one DataFrame to another based on index values, providing a subset of the functionality of merge.
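A small sketch; both frames are indexed by App so the columns line up:

    prices = apps_rated.set_index('App')[['Price']]
    ratings = apps_rated.set_index('App')[['Rating']]
    combined = prices.join(ratings)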


Analysis

We will walk through two examples of useful analyses that can be performed on the cleaned data.

Most Expensive Category

This first analysis will find the 5 app categories with the highest average price.

We can start by reading from the compressed csv.

Next we can group by Category and take the mean of each group.

The groupby method is an essential DataFrame tool that is used quite often.

Sort the results by Price in descending order.

And output the top 5.

Since each step uses a method based on the datatype returned by the prior step, an identical analysis can also be done in one line.

The analysis results can be written to an html file.
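Putting the whole first analysis together (file names carried over from the writing step; treat it as a sketch):

    apps = pd.read_csv('apps_clean.csv.gz', compression='gzip')

    # Step by step: group, aggregate, sort, take the top 5
    by_category = apps.groupby('Category')[['Price']].mean()
    by_category = by_category.sort_values('Price', ascending=False)
    top5 = by_category.head(5)

    # The identical analysis in one line
    top5 = (apps.groupby('Category')[['Price']].mean()
                .sort_values('Price', ascending=False)
                .head(5))

    top5.to_html('most_expensive_categories.html')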

Well Liked Apps

The second analysis will determine all apps that have a rating of 4.5 or higher and rank them by Sentiment.

We can start by reading from the HDF file.

Next we can use boolean indexing to select only the apps with ratings of at least 4.5.

We can replace the Sentiment category with numerical values.

Then we can merge the DataFrames to have all the necessary information in one DataFrame.

We can now find the mean of the Sentiment grouped by App.

The results will be sorted by average Sentiment score and then by Rating.

Another way we could find the mean of the Sentiment grouped by App is to use the pivot_table method.

Using App as the index selects it as the group-by variable, and since the default aggregation function is mean, we do not have to specify it.

Using the eq and all methods together, we can check if every cell is the same for both DataFrames.

We can write the completed analysis back to the HDF file.
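A sketch of the whole second analysis; the numeric Sentiment scale (1, 0, -1) and the file and key names are assumptions:

    apps_rated = pd.read_hdf('playstore.h5', key='apps')
    reviews = pd.read_hdf('playstore.h5', key='reviews')

    # Only apps rated at least 4.5
    liked = apps_rated[apps_rated['Rating'] >= 4.5]

    # Replace the Sentiment categories with numbers
    reviews['Sentiment'] = reviews['Sentiment'].astype(object).replace(
        {'Positive': 1, 'Neutral': 0, 'Negative': -1}).astype(float)

    # Merge, then average Sentiment (and Rating) per app
    merged = pd.merge(left=liked, right=reviews, left_on='App', right_on='App')
    scores = merged.groupby('App')[['Sentiment', 'Rating']].mean()

    # The same grouping via pivot_table; mean is the default aggregation
    pivot = merged.pivot_table(index='App', values=['Sentiment', 'Rating'])
    print(scores.eq(pivot).all().all())  # True when every cell matches

    # Rank by average Sentiment, then Rating, and save under a new key
    ranked = scores.sort_values(['Sentiment', 'Rating'], ascending=False)
    ranked.to_hdf('playstore.h5', key='well_liked')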


Conclusion

Congrats! You know enough now to perform your own analysis using Pandas!

Here are some ideas for further analysis of this dataset to get your feet wet:

Working with these ideas will get you into the mindset of using Pandas, but I think you are ready to take it to the next level!

The following resources will be useful as you continue using Pandas:

Good luck and happy coding!