Datasource: https://www.kaggle.com/lava18/google-play-store-apps
Pandas is an open source Python package that makes working with data more intuitive and faster.
The primary Pandas objects are Series and DataFrames.
Series are 1D arrays where the elements have index labels.
DataFrames are 2D arrays with labeled columns and indexed rows, comparable to an SQL table but with more features.
This tutorial focuses on the Pandas DataFrame, so we first need to import pandas:
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C'])
print(df)
df
Empty DataFrame
Columns: [A, B, C]
Index: []
A | B | C |
---|---|---|
df = pd.DataFrame(columns=['A','B','C'], index=['W', 'X','Y','Z'])
df
A | B | C | |
---|---|---|---|
W | NaN | NaN | NaN |
X | NaN | NaN | NaN |
Y | NaN | NaN | NaN |
Z | NaN | NaN | NaN |
When providing a list of lists, each internal list is interpreted as a row in the DataFrame.
It is important to note that all of the lists must be the same length for this to work.
data=[
["A",1,3],
["B",6,8],
["C",3,4],
["D",9,12],]
df = pd.DataFrame(data)
df
0 | 1 | 2 | |
---|---|---|---|
0 | A | 1 | 3 |
1 | B | 6 | 8 |
2 | C | 3 | 4 |
3 | D | 9 | 12 |
If the lists we have represent a column rather than a row, we can use a dictionary of the lists where the key for each list becomes the column label.
This method also requires the lists to be the same length.
data={
"Name":["John","Jo","Sam","April"],
"Year":["Senior","Junior","Freshman","Junior"],
"GPA":[3.5,3.0,4.0,3.8]
}
df = pd.DataFrame(data)
df
Name | Year | GPA | |
---|---|---|---|
0 | John | Senior | 3.5 |
1 | Jo | Junior | 3.0 |
2 | Sam | Freshman | 4.0 |
3 | April | Junior | 3.8 |
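The index parameter from earlier works with the dictionary constructor as well. A minimal sketch, using hypothetical row labels:

df = pd.DataFrame(data, index=["s1", "s2", "s3", "s4"])
df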
For less structured data, a list of dictionaries may be more appropriate.
Each dictionary represents a row and each key in the dictionary represents a column.
The dictionaries may have any number of keys and they are not required to match.
data=[
{"Name":"John","GPA":3.5},
{},
{"Name":"Jo","Year":"Junior","Job":"Waiter","GPA":3.0},
{"Name":"Sam","Job":"Data Analyst","GPA":4.0},
{"Name":"April","Year":"Junior","GPA":3.8}]
df = pd.DataFrame(data)
df
Name | GPA | Year | Job | |
---|---|---|---|---|
0 | John | 3.5 | NaN | NaN |
1 | NaN | NaN | NaN | NaN |
2 | Jo | 3.0 | Junior | Waiter |
3 | Sam | 4.0 | NaN | Data Analyst |
4 | April | 3.8 | Junior | NaN |
More often than not, though, data comes from external sources. Pandas provides a few different input methods to create DataFrames from specific types of sources.
Flat files include text or CSV files that use a delimiter to separate the columns. Commonly used methods are read_table, which assumes a tab delimiter, and read_csv, which assumes a comma. There is a plethora of optional parameters to explore when working with different datasets, documented in the pandas IO guide. Note that the .head(n=5) method can be called on a DataFrame to show only the first n rows.
apps = pd.read_table("data/googleplaystore.csv",delimiter=',')
apps.head()
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
app_reviews = pd.read_csv("data/googleplaystore_user_reviews.csv")
app_reviews.head()
App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity | |
---|---|---|---|---|---|
0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.00 | 0.533333 |
1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 | 0.288462 |
2 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.40 | 0.875000 |
4 | 10 Best Foods for You | Best idea us | Positive | 1.00 | 0.300000 |
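Before moving on, it is worth seeing a few of the more common read_csv options in action. A hedged sketch with hypothetical parameter choices for this dataset:

apps_sample = pd.read_csv(
    "data/googleplaystore.csv",
    usecols=["App", "Category", "Rating"],   # load only these columns
    nrows=100,                               # read only the first 100 rows
    na_values=["Varies with device"])        # treat this string as a null value
apps_sample.head()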
Another practical data source is an HTML file. The read_html method extracts the tables found within an HTML source, such as a URL, and returns them as a list of DataFrames.
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")
df[0].head()
Country/Area | UN continentalregion[4] | UN statisticalsubregion[4] | Population(1 July 2018) | Population(1 July 2019) | Change | |
---|---|---|---|---|---|---|
0 | China[a] | Asia | Eastern Asia | 1427647786 | 1433783686 | +0.43% |
1 | India | Asia | Southern Asia | 1352642280 | 1366417754 | +1.02% |
2 | United States | Americas | Northern America | 327096265 | 329064917 | +0.60% |
3 | Indonesia | Asia | South-eastern Asia | 267670543 | 270625568 | +1.10% |
4 | Pakistan | Asia | Southern Asia | 212228286 | 216565318 | +2.04% |
A subset of columns can be selected by indexing the DataFrame with a list of column labels; here we combine it with head to keep the output short.
app_reviews.head()[["App"]]
App | |
---|---|
0 | 10 Best Foods for You |
1 | 10 Best Foods for You |
2 | 10 Best Foods for You |
3 | 10 Best Foods for You |
4 | 10 Best Foods for You |
app_reviews.head()[["Sentiment","Sentiment_Polarity","Sentiment_Subjectivity"]]
Sentiment | Sentiment_Polarity | Sentiment_Subjectivity | |
---|---|---|---|
0 | Positive | 1.00 | 0.533333 |
1 | Positive | 0.25 | 0.288462 |
2 | NaN | NaN | NaN |
3 | Positive | 0.40 | 0.875000 |
4 | Positive | 1.00 | 0.300000 |
With boolean indexing, denoted by square brackets, a subset of rows is returned based on a series of True/False values of the same length as the DataFrame. This is very useful for extracting rows that meet certain criteria.
bools = app_reviews["Sentiment"]=="Positive"
print(bools)
app_reviews[bools]
0         True
1         True
2        False
3         True
4         True
         ...
64290    False
64291    False
64292    False
64293    False
64294    False
Name: Sentiment, Length: 64295, dtype: bool
App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity | |
---|---|---|---|---|---|
0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.000000 | 0.533333 |
1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.250000 | 0.288462 |
3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.400000 | 0.875000 |
4 | 10 Best Foods for You | Best idea us | Positive | 1.000000 | 0.300000 |
5 | 10 Best Foods for You | Best way | Positive | 1.000000 | 0.300000 |
... | ... | ... | ... | ... | ... |
64217 | Housing-Real Estate & Property | I able set range 1cr, scroll space 0-1cr range... | Positive | 0.233333 | 0.550000 |
64221 | Housing-Real Estate & Property | Everything old stuff neither clear sold proper... | Positive | 0.021591 | 0.259470 |
64222 | Housing-Real Estate & Property | Most ads older many agents ..not much owner po... | Positive | 0.173333 | 0.486667 |
64223 | Housing-Real Estate & Property | If photos posted portal load, fit purpose. I'm... | Positive | 0.225000 | 0.447222 |
64227 | Housing-Real Estate & Property | I property business got link SMS happy perform... | Positive | 0.800000 | 1.000000 |
23998 rows × 5 columns
We can see that 23,998 of the reviews were marked as positive.
This can also be done in a single line, with no extra variable:
app_reviews[app_reviews["Sentiment"]=="Positive"]
App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity | |
---|---|---|---|---|---|
0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.000000 | 0.533333 |
1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.250000 | 0.288462 |
3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.400000 | 0.875000 |
4 | 10 Best Foods for You | Best idea us | Positive | 1.000000 | 0.300000 |
5 | 10 Best Foods for You | Best way | Positive | 1.000000 | 0.300000 |
... | ... | ... | ... | ... | ... |
64217 | Housing-Real Estate & Property | I able set range 1cr, scroll space 0-1cr range... | Positive | 0.233333 | 0.550000 |
64221 | Housing-Real Estate & Property | Everything old stuff neither clear sold proper... | Positive | 0.021591 | 0.259470 |
64222 | Housing-Real Estate & Property | Most ads older many agents ..not much owner po... | Positive | 0.173333 | 0.486667 |
64223 | Housing-Real Estate & Property | If photos posted portal load, fit purpose. I'm... | Positive | 0.225000 | 0.447222 |
64227 | Housing-Real Estate & Property | I property business got link SMS happy perform... | Positive | 0.800000 | 1.000000 |
23998 rows × 5 columns
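Multiple conditions can also be combined with the element-wise operators & (and), | (or), and ~ (not), with each condition wrapped in parentheses. A minimal sketch:

# Positive reviews with a strongly positive polarity score
strong_positive = app_reviews[(app_reviews["Sentiment"]=="Positive") & (app_reviews["Sentiment_Polarity"]>0.5)]
strong_positive.head()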
Boolean indexing can also be used to select a single row by its index; note that a DataFrame, rather than a Series, is returned.
apps[apps.index==1057]
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1057 | Rabo Banking | FINANCE | 3.4 | 31906 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Finance | July 19, 2018 | 5.16.0 | 4.0 and up |
The loc accessor can be used to access a single row or a specified set of rows and columns.
apps.loc[[1057]]
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1057 | Rabo Banking | FINANCE | 3.4 | 31906 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Finance | July 19, 2018 | 5.16.0 | 4.0 and up |
app_reviews.loc[:,["App"]]
App | |
---|---|
0 | 10 Best Foods for You |
1 | 10 Best Foods for You |
2 | 10 Best Foods for You |
3 | 10 Best Foods for You |
4 | 10 Best Foods for You |
... | ... |
64290 | Houzz Interior Design Ideas |
64291 | Houzz Interior Design Ideas |
64292 | Houzz Interior Design Ideas |
64293 | Houzz Interior Design Ideas |
64294 | Houzz Interior Design Ideas |
64295 rows × 1 columns
Unlike regular Python slices, slices with loc include both endpoints, so 1057:1060 returns four rows.
apps.loc[1057:1060]
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1057 | Rabo Banking | FINANCE | 3.4 | 31906 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Finance | July 19, 2018 | 5.16.0 | 4.0 and up |
1058 | Capitec Remote Banking | FINANCE | 4.3 | 20672 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Finance | May 8, 2018 | Varies with device | Varies with device |
1059 | Itau bank | FINANCE | 4.2 | 957973 | 40M | 10,000,000+ | Free | 0 | Everyone | Finance | July 30, 2018 | 6.5.7 | 4.2 and up |
1060 | Nubank | FINANCE | 4.7 | 130582 | 24M | 5,000,000+ | Free | 0 | Everyone | Finance | August 2, 2018 | Varies with device | Varies with device |
Another convenient feature of the loc accessor is that it can add rows by creating index labels that do not already exist.
df = pd.DataFrame(columns=['Name','Year','GPA'])
df.loc[0]=["John","Senior",3.5]
df.loc[1]=["Jo","Junior",3.0]
df.loc['a']=["Sam","Freshman",4.0]
df.loc['b']=["April","Junior",3.8]
df.loc['x','GPA']=4.0
df.loc['x','Year']="Freshman"
df
Name | Year | GPA | |
---|---|---|---|
0 | John | Senior | 3.5 |
1 | Jo | Junior | 3.0 |
a | Sam | Freshman | 4.0 |
b | April | Junior | 3.8 |
x | NaN | Freshman | 4.0 |
If we do not want to use the index labels, we can instead access rows and columns by integer position using the iloc accessor.
df.iloc[4,0] = "Alex"
df
Name | Year | GPA | |
---|---|---|---|
0 | John | Senior | 3.5 |
1 | Jo | Junior | 3.0 |
a | Sam | Freshman | 4.0 |
b | April | Junior | 3.8 |
x | Alex | Freshman | 4.0 |
However, iloc cannot be used to add new rows to the DataFrame:
df.iloc[5]=["James","Senior",1.0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-20-d3d23e3ccd8d> in <module>
----> 1 df.iloc[5]=["James","Senior",1.0]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __setitem__(self, key, value)
    687         key = com.apply_if_callable(key, self.obj)
    688         indexer = self._get_setitem_indexer(key)
--> 689         self._has_valid_setitem_indexer(key)
    690
    691         iloc = self if self.name == "iloc" else self.obj.iloc

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_setitem_indexer(self, indexer)
   1399             elif is_integer(i):
   1400                 if i >= len(ax):
-> 1401                     raise IndexError("iloc cannot enlarge its target object")
   1402             elif isinstance(i, dict):
   1403                 raise IndexError("iloc cannot enlarge its target object")

IndexError: iloc cannot enlarge its target object
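If a new row genuinely needs to be added, the loc accessor shown earlier can enlarge the DataFrame:

# loc, unlike iloc, creates the index label if it does not exist
df.loc[5] = ["James", "Senior", 1.0]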
With a basic understanding of DataFrames, we can start to explore them more in depth. The first exploratory task when working with new data is to get an overall picture of it. The .dtypes property shows the datatype of each column of a DataFrame, and the .shape property returns its size as a tuple of (# of rows, # of columns).
print("App DataFrame Summary:")
print(apps.dtypes)
print(apps.shape)
App DataFrame Summary:
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
(10841, 13)
print("App Reviews DataFrame Summary:")
print(app_reviews.dtypes)
print(app_reviews.shape)
App Reviews DataFrame Summary:
App                        object
Translated_Review          object
Sentiment                  object
Sentiment_Polarity        float64
Sentiment_Subjectivity    float64
dtype: object
(64295, 5)
We can see that there are 10,841 different apps and a total of 64,295 reviews. The apps DataFrame has 13 columns, the app_reviews DataFrame has 5, and every column has either an object or float64 datatype.
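As an aside, the info method reports the column datatypes, non-null counts, and approximate memory usage in a single call:

apps.info()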
However, some columns of the app data, such as Reviews and Price, were assigned the object datatype even though they appear to be numerical. Let's figure out why:
def floatTest(df, column):
    '''
    input:
        df: DataFrame to test
        column: string of column name to be tested
    '''
    # Test to make sure column is actually a column in df
    if column in df:
        try:
            # Loop over the elements of the column and try to cast each to a float
            for i in df[column]:
                x = i
                float(i)
            print("Casting " + column + " to float is possible")
        except ValueError:
            # If a conversion fails, print the value that caused the error
            print("First non-float value in the " + column + " column: " + x)
    else:
        print(column + " is not a column in the DataFrame provided")
floatTest(apps,"Reviews")
floatTest(apps,"Price")
First non-float value in the Reviews column: 3.0M
First non-float value in the Price column: $4.99
When we try to cast the Reviews column to a float, we can see that 3.0M is a value in the column. For the Price column, the first non-float value we find is $4.99. This is important to know, and we will resolve these values before casting the columns.
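As an alternative to the custom floatTest helper, the built-in pd.to_numeric with errors='coerce' converts unparseable entries to NaN, which makes them easy to locate. A minimal sketch:

# Rows whose Reviews value cannot be parsed as a number
coerced = pd.to_numeric(apps["Reviews"], errors="coerce")
apps.loc[coerced.isna(), ["App", "Reviews"]]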
Before we start the cleaning stage, we will look for columns with a low number of unique values so we can cast them to the category datatype. The category datatype is not always the right choice, but if used correctly it can reduce memory usage, increase performance, and allow for custom sorting.
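As a quick illustration of custom sorting, an ordered categorical sorts by its declared category order rather than alphabetically. A minimal sketch with hypothetical data:

# Declare an explicit ordering for the categories
year_type = pd.CategoricalDtype(categories=["Freshman", "Junior", "Senior"], ordered=True)
years = pd.Series(["Senior", "Junior", "Freshman"], dtype=year_type)
print(years.sort_values())  # sorts Freshman, Junior, Senior rather than alphabetically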
def uniqueTable(df):
    '''
    input:
        df: DataFrame
    output:
        DataFrame with datatype and number of unique values per column
    '''
    # Create empty DataFrame with 2 columns: "Data Type" and "Unique Values"
    uniques = pd.DataFrame(columns=["Data Type", "Unique Values"])
    for i in df.columns:
        # Add the datatype and number of unique values as a row indexed by the column name
        uniques.loc[i] = [df[i].dtype, df[i].nunique()]
    return uniques
print("Number of unique values in each column of apps DataFrame:")
uniqueTable(apps)
Number of unique values in each column of apps DataFrame:
Data Type | Unique Values | |
---|---|---|
App | object | 9660 |
Category | object | 34 |
Rating | float64 | 40 |
Reviews | object | 6002 |
Size | object | 462 |
Installs | object | 22 |
Type | object | 3 |
Price | object | 93 |
Content Rating | object | 6 |
Genres | object | 120 |
Last Updated | object | 1378 |
Current Ver | object | 2832 |
Android Ver | object | 33 |
From the output, we can see that Category, Installs, Type, Content Rating, and Android Ver all have the object datatype and fewer than 50 unique values. These columns could be stored more efficiently as the category datatype.
print("Number of unique values in each column of app_reviews DataFrame:")
uniqueTable(app_reviews)
Number of unique values in each column of app_reviews DataFrame:
Data Type | Unique Values | |
---|---|---|
App | object | 1074 |
Translated_Review | object | 27994 |
Sentiment | object | 3 |
Sentiment_Polarity | float64 | 5410 |
Sentiment_Subjectivity | float64 | 4474 |
The Sentiment column has only 3 unique values, so it would definitely benefit from being converted to the category datatype.
Let's start the cleaning with the Price column.
floatTest(apps,"Price")
First non-float value in the Price column: $4.99
It seems that the Price values have a $ character at the beginning in some cases. We can use the replace method combined with a regular expression (regex) to remove the $ character.
apps["Price"] = apps["Price"].replace(r'\$', '', regex=True)
Now that we have solved that issue, let's rerun floatTest to see if there are any other non-float values in the Price column.
floatTest(apps,"Price")
First non-float value in the Price column: Everyone
It looks like one of the values in the Price column is Everyone, which definitely does not seem right. We can use boolean indexing to find the culprit row.
apps[apps["Price"]=="Everyone"]
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
We can see that the values are shifted over by one column probably due to this app not having a Category.
It also happens that for this row the Reviews column contains the value 3.0M, which is the same issue that floatTest found. Let's quickly fix it by shifting the values over one column and replacing the missing Category with a null value.
apps.loc[10472,"Rating":"Android Ver"] = apps.loc[10472,"Category":"Current Ver"].to_numpy()
apps.loc[10472,"Category"] = None
apps.loc[[10472]]
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10472 | Life Made WI-Fi Touchscreen Photo Frame | None | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up |
Let's rerun floatTest for both columns...
floatTest(apps,"Price")
floatTest(apps,"Reviews")
Casting Price to float is possible
Casting Reviews to float is possible
Sweet! It looks like our fixes worked!
Finally, we can cast the columns using the astype method. Calling the memory_usage method before and after the cast shows how much memory we save.
print("Memory usage of Apps DataFrame before cast: " + str(apps.memory_usage().sum()))
Memory usage of Apps DataFrame before cast: 1127592
casts = {
"Category":"category",
"Rating":"float64",
"Reviews":"float64",
"Installs":"category",
"Type":"category",
"Price": "float64",
"Content Rating":"category",
"Genres":"category",
"Android Ver":"category"}
apps = apps.astype(casts)
apps.dtypes
App                 object
Category          category
Rating             float64
Reviews            float64
Size                object
Installs          category
Type              category
Price              float64
Content Rating    category
Genres            category
Last Updated        object
Current Ver         object
Android Ver       category
dtype: object
print("Memory usage of Apps Data after cast: " + str(apps.memory_usage().sum()))
Memory usage of Apps Data after cast: 681130
print("Memory usage of Sentiment Column before cast: " + str(app_reviews["Sentiment"].memory_usage()) + '\n')
app_reviews = app_reviews.astype({"Sentiment":'category'})
print("Memory usage of Sentiment Column after cast: " + str(app_reviews["Sentiment"].memory_usage())+'\n')
app_reviews.dtypes
Memory usage of Sentiment Column before cast: 514488

Memory usage of Sentiment Column after cast: 64555
App                         object
Translated_Review           object
Sentiment                 category
Sentiment_Polarity         float64
Sentiment_Subjectivity     float64
dtype: object
Wow! The Apps DataFrame uses about half the memory after casting, and the Sentiment column uses about one eighth!
The next step in the cleaning process is to remove null values. First, we use the isna method to count the null values in each column to see how we should address them.
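As an aside, the same per-column counts are available in a single expression:

apps.isna().sum()

For output formatted consistently with the rest of this tutorial, we will use a small helper instead.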
def nulls(df):
    '''
    input:
        df: DataFrame
    '''
    # Loop over columns
    for col in df.columns:
        # Print the column name and the number of null values in it
        n = df[df[col].isna()]
        print(str(col) + " : " + str(len(n)))
nulls(apps)
App : 0
Category : 1
Rating : 1474
Reviews : 0
Size : 0
Installs : 0
Type : 1
Price : 0
Content Rating : 0
Genres : 1
Last Updated : 0
Current Ver : 8
Android Ver : 2
Some of the columns have only a few null values, so we will remove the rows containing them. However, the Rating column has numerous null values, so we will create a separate DataFrame with those rows removed.
apps.dropna(subset=["Category","Type","Genres","Current Ver","Android Ver"],inplace=True)
apps_with_rating=apps.dropna()
nulls(apps)
App : 0
Category : 0
Rating : 1469
Reviews : 0
Size : 0
Installs : 0
Type : 0
Price : 0
Content Rating : 0
Genres : 0
Last Updated : 0
Current Ver : 0
Android Ver : 0
nulls(apps_with_rating)
App : 0
Category : 0
Rating : 0
Reviews : 0
Size : 0
Installs : 0
Type : 0
Price : 0
Content Rating : 0
Genres : 0
Last Updated : 0
Current Ver : 0
Android Ver : 0
Now we have two DataFrames: apps, which keeps the rows with null Rating values, and apps_with_rating, which excludes them.
nulls(app_reviews)
App : 0
Translated_Review : 26868
Sentiment : 26863
Sentiment_Polarity : 26863
Sentiment_Subjectivity : 26863
Many of these rows seem to have all null values except for the app name, so we will remove all rows with null values.
app_reviews.dropna(inplace=True)
With the cleaning done, the apps DataFrame can be written to a compressed CSV file with the to_csv method.
apps.to_csv("data/cleaned/googleplaystoreclean.csv",mode='w',compression='zip')
Another storage format is an HDF file, where multiple DataFrames can be written to the same file under different keys.
apps_with_rating.to_hdf("data/cleaned/appdata.hd5",key='ratings',mode='w',format='table')
app_reviews.to_hdf("data/cleaned/appdata.hd5",key='reviews',format='table')
There are many other storage formats and optional parameters that can be explored in the pandas IO documentation.
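For example, the columnar Parquet format is a popular choice. A hedged sketch, assuming the optional pyarrow (or fastparquet) dependency is installed:

# Parquet stores the data column-wise with efficient compression
apps.to_parquet("data/cleaned/googleplaystoreclean.parquet")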
To demonstrate how DataFrames can be combined, let's first create three small DataFrames.
df1 = pd.DataFrame([
[0,1,2],
[1,2,3],
[2,3,4]])
df2 = pd.DataFrame([
[4,5,6],
[6,7,8],
[8,9,10]])
df3 = pd.DataFrame([
[11,12,13],
[13,14,15],
[15,16,17]])
The concat method stacks a list of DataFrames along an axis; with axis=1 the frames are placed side by side, and ignore_index=True renumbers the resulting columns.
pd.concat([df1,df2,df3],axis=1,ignore_index=True)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 4 | 5 | 6 | 11 | 12 | 13 |
1 | 1 | 2 | 3 | 6 | 7 | 8 | 13 | 14 | 15 |
2 | 2 | 3 | 4 | 8 | 9 | 10 | 15 | 16 | 17 |
With axis=0 (the default), the rows are stacked instead, and passing keys creates a hierarchical index that records which DataFrame each row came from.
pd.concat([df1,df2,df3],axis=0,keys=["A","B","C"])
0 | 1 | 2 | ||
---|---|---|---|---|
A | 0 | 0 | 1 | 2 |
1 | 1 | 2 | 3 | |
2 | 2 | 3 | 4 | |
B | 0 | 4 | 5 | 6 |
1 | 6 | 7 | 8 | |
2 | 8 | 9 | 10 | |
C | 0 | 11 | 12 | 13 |
1 | 13 | 14 | 15 | |
2 | 15 | 16 | 17 |
The append method can easily concatenate rows, providing a subset of the concat method's functionality. (In newer versions of pandas, append has been deprecated in favor of concat.)
df1.append([df2,df3],ignore_index=True)
0 | 1 | 2 | |
---|---|---|---|
0 | 0 | 1 | 2 |
1 | 1 | 2 | 3 |
2 | 2 | 3 | 4 |
3 | 4 | 5 | 6 |
4 | 6 | 7 | 8 |
5 | 8 | 9 | 10 |
6 | 11 | 12 | 13 |
7 | 13 | 14 | 15 |
8 | 15 | 16 | 17 |
The merge method combines two DataFrames based on column values, similar to a SQL join. The left and right parameters are the respective DataFrames, while the left_on and right_on parameters dictate which columns will be used for the join.
data1 = {
"Student":["John","Chris","James"],
"Class":["A","A","C"],
"Midterm_Grade":[50,60,80],
"Final":[60,60,80],
}
data2 = {
"Class":["A","B","C","D","E"],
"Student Average":[77,88,99,87,78],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
  Student Class  Midterm_Grade  Final
0    John     A             50     60
1   Chris     A             60     60
2   James     C             80     80

  Class  Student Average
0     A               77
1     B               88
2     C               99
3     D               87
4     E               78
Here how="outer" keeps classes even when no student matches, and validate="m:1" asserts that the relationship is many-to-one.
pd.merge(left=df1,right=df2,left_on="Class",right_on="Class",how="outer",validate="m:1")
Student | Class | Midterm_Grade | Final | Student Average | |
---|---|---|---|---|---|
0 | John | A | 50.0 | 60.0 | 77 |
1 | Chris | A | 60.0 | 60.0 | 77 |
2 | James | C | 80.0 | 80.0 | 99 |
3 | NaN | B | NaN | NaN | 88 |
4 | NaN | D | NaN | NaN | 87 |
5 | NaN | E | NaN | NaN | 78 |
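For comparison, an inner join would keep only the classes that appear in both DataFrames, dropping the B, D, and E rows. A minimal sketch:

pd.merge(left=df1, right=df2, on="Class", how="inner")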
The join method adds the columns of one DataFrame to another based on index values, providing a subset of the functionality of merge.
df1.set_index("Class").join(df2.set_index("Class"),how="outer")
Student | Midterm_Grade | Final | Student Average | |
---|---|---|---|---|
Class | ||||
A | John | 50.0 | 60.0 | 77 |
A | Chris | 60.0 | 60.0 | 77 |
B | NaN | NaN | NaN | 88 |
C | James | 80.0 | 80.0 | 99 |
D | NaN | NaN | NaN | 87 |
E | NaN | NaN | NaN | 78 |
df = pd.read_csv("data/cleaned/googleplaystoreclean.csv",index_col=[0],compression='zip',usecols=["Category","Price"])
Next we can group by Category and take the mean price of each group. The groupby method is an essential DataFrame tool that is used quite often.
category_avg_price = df.groupby(by=["Category"],axis='index').mean()
category_avg_price
Price | |
---|---|
Category | |
ART_AND_DESIGN | 0.093281 |
AUTO_AND_VEHICLES | 0.158471 |
BEAUTY | 0.000000 |
BOOKS_AND_REFERENCE | 0.520739 |
BUSINESS | 0.402761 |
COMICS | 0.000000 |
COMMUNICATION | 0.214832 |
DATING | 0.134316 |
EDUCATION | 0.115128 |
ENTERTAINMENT | 0.053557 |
EVENTS | 1.718594 |
FAMILY | 1.236682 |
FINANCE | 7.925765 |
FOOD_AND_DRINK | 0.066772 |
GAME | 0.251136 |
HEALTH_AND_FITNESS | 0.197478 |
HOUSE_AND_HOME | 0.000000 |
LIBRARIES_AND_DEMO | 0.011786 |
LIFESTYLE | 6.180288 |
MAPS_AND_NAVIGATION | 0.196715 |
MEDICAL | 3.110065 |
NEWS_AND_MAGAZINES | 0.014064 |
PARENTING | 0.159667 |
PERSONALIZATION | 0.390949 |
PHOTOGRAPHY | 0.400627 |
PRODUCTIVITY | 0.591816 |
SHOPPING | 0.021077 |
SOCIAL | 0.054136 |
SPORTS | 0.260417 |
TOOLS | 0.316599 |
TRAVEL_AND_LOCAL | 0.193605 |
VIDEO_PLAYERS | 0.059771 |
WEATHER | 0.395366 |
Sort the results by Price in descending order.
sorted_avg_price = category_avg_price.sort_values(by=["Price"],axis='index',ascending=False)
sorted_avg_price
Price | |
---|---|
Category | |
FINANCE | 7.925765 |
LIFESTYLE | 6.180288 |
MEDICAL | 3.110065 |
EVENTS | 1.718594 |
FAMILY | 1.236682 |
PRODUCTIVITY | 0.591816 |
BOOKS_AND_REFERENCE | 0.520739 |
BUSINESS | 0.402761 |
PHOTOGRAPHY | 0.400627 |
WEATHER | 0.395366 |
PERSONALIZATION | 0.390949 |
TOOLS | 0.316599 |
SPORTS | 0.260417 |
GAME | 0.251136 |
COMMUNICATION | 0.214832 |
HEALTH_AND_FITNESS | 0.197478 |
MAPS_AND_NAVIGATION | 0.196715 |
TRAVEL_AND_LOCAL | 0.193605 |
PARENTING | 0.159667 |
AUTO_AND_VEHICLES | 0.158471 |
DATING | 0.134316 |
EDUCATION | 0.115128 |
ART_AND_DESIGN | 0.093281 |
FOOD_AND_DRINK | 0.066772 |
VIDEO_PLAYERS | 0.059771 |
SOCIAL | 0.054136 |
ENTERTAINMENT | 0.053557 |
SHOPPING | 0.021077 |
NEWS_AND_MAGAZINES | 0.014064 |
LIBRARIES_AND_DEMO | 0.011786 |
COMICS | 0.000000 |
BEAUTY | 0.000000 |
HOUSE_AND_HOME | 0.000000 |
And output the top 5.
analysis1 = sorted_avg_price.iloc[0:5] # identical to sorted_avg_price.head()
analysis1
Price | |
---|---|
Category | |
FINANCE | 7.925765 |
LIFESTYLE | 6.180288 |
MEDICAL | 3.110065 |
EVENTS | 1.718594 |
FAMILY | 1.236682 |
Since each step calls a method on the object returned by the prior step, the identical analysis can also be done in one line.
pd.read_csv("data/cleaned/googleplaystoreclean.csv",index_col=[0],usecols=["Category","Price"],compression='zip').groupby(by=["Category"],axis='index').mean().sort_values(by=["Price"],axis='index',ascending=False).iloc[0:5]
Price | |
---|---|
Category | |
FINANCE | 7.925765 |
LIFESTYLE | 6.180288 |
MEDICAL | 3.110065 |
EVENTS | 1.718594 |
FAMILY | 1.236682 |
The analysis results can be written to an html file.
analysis1.to_html("data/cleaned/Top5MostExpensiveCategories.html")
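Before moving on, note that groupby is not limited to a single statistic; the agg method can compute several at once. A minimal sketch with a hypothetical choice of statistics:

# Mean, maximum, and count of Price per Category
df.groupby(by=["Category"])["Price"].agg(["mean", "max", "count"]).sort_values(by="mean", ascending=False).head()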
The second analysis will determine all apps that have a rating of 4.5 or higher and rank them by Sentiment.
We can start by reading from the HDF file.
rating = pd.read_hdf("data/cleaned/appdata.hd5",key="ratings",columns=["App","Rating"])
review = pd.read_hdf("data/cleaned/appdata.hd5",key="reviews",columns=["App","Sentiment"])
Next we can use boolean indexing to select only the apps with ratings of at least 4.5.
rating = rating[rating["Rating"]>=4.5]
We can replace the Sentiment categories with numerical values.
review["Sentiment"].dtypes
CategoricalDtype(categories=['Negative', 'Neutral', 'Positive'], ordered=False)
review.replace({"Positive":1,"Neutral":0,"Negative":-1},inplace=True)
Then we can merge the DataFrames to have all the necessary information in one DataFrame.
rating_review = rating.merge(right=review,how="inner",left_on="App",right_on="App")
rating_review
App | Rating | Sentiment | |
---|---|---|---|
0 | Colorfit - Drawing & Coloring | 4.7 | 1 |
1 | Colorfit - Drawing & Coloring | 4.7 | -1 |
2 | Colorfit - Drawing & Coloring | 4.7 | 1 |
3 | Colorfit - Drawing & Coloring | 4.7 | -1 |
4 | Colorfit - Drawing & Coloring | 4.7 | 1 |
... | ... | ... | ... |
26878 | A+ Gallery - Photos & Videos | 4.5 | 1 |
26879 | A+ Gallery - Photos & Videos | 4.5 | 1 |
26880 | A+ Gallery - Photos & Videos | 4.5 | 1 |
26881 | A+ Gallery - Photos & Videos | 4.5 | 0 |
26882 | A+ Gallery - Photos & Videos | 4.5 | 1 |
26883 rows × 3 columns
We can now find the mean of the Sentiment grouped by App.
The results will be sorted by average Sentiment score and then by Rating.
analysis2 = rating_review.groupby(by="App",sort=False).mean().sort_values(by=["Sentiment","Rating"],ascending=False)
analysis2
Rating | Sentiment | |
---|---|---|
App | ||
Down Dog: Great Yoga Anywhere | 4.9 | 1.000000 |
GPS Speedometer and Odometer | 4.8 | 1.000000 |
Brightest Flashlight Free ® | 4.7 | 1.000000 |
Calculator - unit converter | 4.7 | 1.000000 |
Daniel Tiger for Parents | 4.7 | 1.000000 |
... | ... | ... |
Cooking Fever | 4.5 | -0.177778 |
Call Blocker | 4.6 | -0.333333 |
Free Live Talk-Video Call | 4.7 | -1.000000 |
Discover Mobile | 4.6 | -1.000000 |
Fruit Block - Puzzle Legend | 4.6 | -1.000000 |
263 rows × 2 columns
Another way to find the mean of the Sentiment grouped by App is to use the pivot_table method. Using App as the index selects it as the group-by variable, and since the default aggregation function is mean, we do not need to specify it. Using the eq and all methods together, we can check that every cell is the same in both DataFrames.
analysis2_pivot = rating_review.pivot_table(index="App").sort_values(by=["Sentiment","Rating"],ascending=False)
analysis2_pivot.eq(analysis2).all()
Rating       True
Sentiment    True
dtype: bool
We can write the completed analysis back to the HDF file.
analysis2.to_hdf("data/cleaned/appdata.hd5",key='analysis')
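As a quick sanity check, the analysis can be read back from the file:

pd.read_hdf("data/cleaned/appdata.hd5", key="analysis").head()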
Congrats! You now know enough to perform your own analysis using Pandas! Here are some ideas for further analysis of this dataset to get your feet wet:
Working with these ideas will get you into the mindset of using Pandas, but I think you are ready to take it to the next level!
The following resources will be useful as you continue using Pandas:
Good luck and happy coding!