Pandas is one of the most powerful libraries used for data analysis in Python. It provides two primary data structures, Series (1-dimensional) and DataFrame (2-dimensional), which represent data intuitively and support fast operations. Pandas is built on top of NumPy, another powerful and popular Python library for scientific computing.
Import the pandas library. Here it is aliased as 'pd' for ease of use. Make sure the import runs without errors.
import pandas as pd
from matplotlib import pyplot as plt # required for plots
plt.style.use('default') # use matplotlib's default plot style
Pandas offers several functions for reading a dataset. Here read_csv is used to load a semicolon-separated file.
df=pd.read_csv('cars.csv', sep=';')
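As a minimal sketch of what the `sep` argument does, the snippet below reads a small semicolon-separated table from an in-memory string instead of a file. The rows and the names `sample_csv` and `sample_df` are made up for illustration and are not taken from cars.csv.

```python
import io

import pandas as pd

# Hypothetical sample data (not the real cars.csv), with ';' between fields
sample_csv = "Car;MPG;Origin\nCar A;18.0;US\nCar B;31.5;Japan\nCar C;26.0;Europe\n"

# sep=';' tells read_csv to split fields on semicolons instead of commas
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=';')

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred by pandas
```

Without `sep=';'`, pandas would split on commas and each row would most likely end up in a single column.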
We can use the following functions to get a basic understanding of the dataset (assume 'df' contains the data).
Let's look at these one by one.
Retrieve the shape of the dataset (number of rows, number of columns)
df.shape
Retrieve the column names from the data
df.columns.tolist()
Retrieve some of the rows from the beginning of the data. 'n' is the number of rows to be retrieved (default=5)
df.head()
Retrieve some of the rows from the end of the data. 'n' is the number of rows to be retrieved (default=5)
df.tail()
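To make the role of 'n' concrete, here is a small sketch on a made-up single-column frame (the name `toy` and its values are hypothetical). Called without an argument, head() and tail() return 5 rows; passing n changes that.

```python
import pandas as pd

# Hypothetical data with 7 rows, just enough to show the default of 5
toy = pd.DataFrame({"MPG": [18.0, 15.0, 31.5, 26.0, 22.0, 40.5, 12.0]})

first_rows = toy.head()   # no argument: first 5 rows
last_two = toy.tail(n=2)  # explicit n: last 2 rows

print(first_rows)
print(last_two)
```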
Generate summary statistics for the dataset
df.describe()
isnull() checks for missing values in the dataset. It returns True for every value that is missing and False otherwise. Missing values need to be handled carefully in order to maintain consistency in the data.
df.isnull()
# any() returns True for a column if at least one of its values is truthy
df.isnull().any()
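Once missing values are found, they have to be counted and handled somehow. The sketch below shows two common options, dropping incomplete rows and filling gaps with the column mean, on a made-up frame (`toy` and its values are hypothetical). Which option is appropriate depends on the dataset; this is only an illustration, not a recommendation for the cars data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values (NaN)
toy = pd.DataFrame({
    "MPG": [18.0, np.nan, 31.5, 26.0],
    "Horsepower": [130.0, 165.0, np.nan, np.nan],
})

# Count missing values per column: True counts as 1 when summed
missing_per_column = toy.isnull().sum()

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill each missing value with the mean of its column
filled = toy.fillna(toy.mean())

print(missing_per_column)
print(dropped)
print(filled)
```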
Data manipulation is required to process the data in order to uncover hidden relationships between the features. Following are some basic operations used for data manipulation in pandas.
# Get Column Values
# Syntax: df[column_name]
# Return: Column values
df["Origin"][:10]
# Get multiple columns
# Syntax: df[list of column_names]
# Return: dataframe of specified columns
df[["Car", "Model", "Origin"]]
# Another way
col_list=["Car", "Model", "Origin"]
df[col_list]
# Maths with DataFrame columns. The mathematical operation applies to all the rows of the selected columns
# Multiplication
df["Weight"]*0.001
df[["MPG", "Displacement"]]*2
#Division
df["MPG"]/2
df[["MPG", "Displacement"]]/2
#Comparisons
df["MPG"][:10] > 30 # returns True if the MPG value > 30, otherwise False
#Indexing using comparison
df[df["MPG"]>40]
# Get selected columns for the rows that satisfy a condition on a column
df[df["MPG"]>40][["Weight","Car"]]
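The row filter and the column selection above can also be written as a single `.loc` call, which takes the boolean mask and the column list together. The sketch below compares both forms on a made-up frame (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical data, loosely shaped like the cars dataset
toy = pd.DataFrame({
    "Car": ["Car A", "Car B", "Car C", "Car D"],
    "MPG": [18.0, 43.1, 41.5, 26.0],
    "Weight": [3504, 1985, 2144, 2720],
})

# Boolean mask: one True/False value per row
mask = toy["MPG"] > 40

# Two-step form: filter rows first, then pick columns
chained = toy[mask][["Weight", "Car"]]

# Single-step form: rows and columns selected in one .loc call
with_loc = toy.loc[mask, ["Weight", "Car"]]

print(with_loc)
```

Both forms return the same rows here. The `.loc` form is usually the safer choice when the selection is later used for assignment.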
# Slicing Dataframe
df[1:5]
df["MPG"][:5]
# Group the data according to one of the categorical columns
# In the cars dataset, Cylinders, Model and Origin are the categorical variables
data=df.groupby("Origin") # returns a DataFrameGroupBy object; a list of columns can also be used
list(data)
data=df.groupby("Origin")["MPG"].mean() # returns a Series with the mean MPG for each Origin
list(data)
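To see what a grouped object holds, it can help to iterate over it and to compute several aggregates at once. The following is a small sketch on made-up data (`toy` and its values are hypothetical), using `agg` with a list of aggregation names.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column
toy = pd.DataFrame({
    "Origin": ["US", "US", "Japan", "Japan", "Europe"],
    "MPG": [18.0, 16.0, 32.0, 30.0, 26.0],
})

grouped = toy.groupby("Origin")

# Iterating yields (group key, sub-DataFrame) pairs
for origin, group in grouped:
    print(origin, len(group))

# One aggregate per group: a Series indexed by Origin
mean_mpg = grouped["MPG"].mean()

# Several aggregates per group: a DataFrame with one column per aggregate
stats = grouped["MPG"].agg(["mean", "max", "count"])

print(mean_mpg)
print(stats)
```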
# Data aggregation methods (can be used independently or along with 'groupby')
# Mean
df.mean(numeric_only=True) # Returns the mean of each numeric column
# Max
df.max() # Returns the max value of each column
# Min
df.min() # Returns the min value of each column
# Count
df.count() # Returns the number of non-null values in each column
# Sum
df.sum(numeric_only=True) # Returns the sum of each numeric column
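When a DataFrame mixes text and numeric columns, aggregations such as mean() may raise an error or behave unexpectedly on the text columns, depending on the pandas version. A minimal sketch, on made-up data (`toy` is hypothetical), of restricting the aggregation to numeric columns with the `numeric_only` argument:

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns
toy = pd.DataFrame({
    "Car": ["Car A", "Car B", "Car C"],
    "MPG": [18.0, 30.0, 24.0],
    "Weight": [3500, 2000, 2500],
})

# numeric_only=True skips the text column 'Car'
means = toy.mean(numeric_only=True)

# count() works on every column: it counts non-null values
counts = toy.count()

print(means)
print(counts)
```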
# Drop columns that are not required
dfNew=df.drop("Origin", axis=1)
print(df.columns.tolist())
print(dfNew.columns.tolist())
Pandas provides several plot options that we can use to visualize our dataset. Visualizations help us understand the relationships between the various features in the dataset. The following plot options can be used with a pandas DataFrame.
df.groupby("Origin")["Model"].count().plot.pie()
df.groupby("Cylinders")["MPG"].mean().plot.bar()
plt.ylabel("Mean MPG") # for a vertical bar plot the mean values are on the y-axis
df.groupby("Cylinders")["Horsepower"].mean().plot.barh()
plt.xlabel("Mean Horsepower")
df.groupby("Cylinders")["Acceleration"].median().plot.line()
plt.legend(["Median Acceleration"])
df.groupby("Cylinders")["Weight"].plot.kde()
plt.legend()
plt.xlabel("Weight")
df.groupby("Cylinders")["Weight"].mean().plot.line()
df.plot.scatter(x="Weight", y="MPG")
df.plot.scatter(x="Displacement", y="MPG")
df.hist(xrot=30,figsize=(14,10))
df.groupby("Origin").hist(xrot=30,figsize=(14,10))
pd.plotting.scatter_matrix(df,figsize=(14,10)) # scatter_matrix lives in pd.plotting in current pandas versions
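The plot methods return a matplotlib Axes object, which can be used to label the plot directly instead of going through plt. Here is a minimal sketch for the grouped bar plot, on made-up data (`toy` and its values are hypothetical). The "Agg" backend is an assumption made so the snippet runs without a display; in a notebook it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data: two cars for each cylinder count
toy = pd.DataFrame({
    "Cylinders": [4, 4, 6, 6, 8, 8],
    "MPG": [30.0, 28.0, 20.0, 22.0, 14.0, 16.0],
})

# Mean MPG per cylinder count, drawn as one vertical bar per group
mean_mpg = toy.groupby("Cylinders")["MPG"].mean()
ax = mean_mpg.plot.bar()

# Label the axes through the returned Axes object
ax.set_xlabel("Cylinders")
ax.set_ylabel("Mean MPG")
plt.tight_layout()
```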