Tutorial 1 : Data Analysis with Pandas¶

In this tutorial we will explore 'cars.csv' dataset using basic data manipulation and visualisation functions provided by pandas library¶

You will learn :¶

To read dataset using pandas
To get quick summary of data
To manipulate dataset columns for data processing
To plot the porcessed data for insights

We assume you know:¶

Basic python
Basic understanding of plots

Resources:¶

Pandas Documentation
Cars Dataset
Find source code of this tutorial here

Let's Begin.....¶

What is Pandas?¶

Pandas is one of the most powerfull libraries that are used for data analysis in python. Pandas provide two primary data structures Series (1- dimentional) and Dataframe( 2 dimensional) which can be used to express the data intuitively and perform fast operations. Pandas is built on top of numpy which is another powerful and popular library used in python for scientific computations.

Import Library¶

Import the pandas library. Here I renamed it to 'pd' just for the ease of use. Make sure you don't get errors in import

import pandas as pd

from matplotlib import pyplot as plt # required for plots
plt.style.use= 'default'# To display plots

Read Dataset¶

Following are some of the functions available in pandas for reading a dataset

read_csv() (Press 'Shift+tab' to see the function arguments )
- filepath
- sep
- names
- parse_dates
- index_col
read_excel()
read_json()
read_pickle()
read_sql()

df=pd.read_csv('cars.csv', sep=';')

Know Your Dataset¶

We can use following functions to get basic understanding of the dataset: (Assume 'df' contains data)

df.shape
df.columns
df.head()
df.tail()
df.describe()

Let's look at these one by one

#### df.shape

Retrieves Shape of the dataset

df.shape

(406, 9)

#### df.columns

Retrive the column names from data

df.columns.tolist()

['Car',
 'MPG',
 'Cylinders',
 'Displacement',
 'Horsepower',
 'Weight',
 'Acceleration',
 'Model',
 'Origin']

#### df.head(n)

Retrieve some of the rows from the begining of the data.'n'is the number of rows to be retrived (default=6)

df.head()

#### df.tail(n)

Retrieve some of the rows from the end of the data.'n'is the number of rows to be retrived (default=6)

df.tail()

#### df.describe()

Generate Summary statistics form the dataset

df.describe()

df.isnull()

This function checks if there are any missing values in the dataset. Returns true if the value is missing. The missing values in the dataset need to be handled carefully in order to maintain consistancy in data

df.isnull()

# any() returns true if atleast one of the values is truthy 
df.isnull().any()

Car             False
MPG             False
Cylinders       False
Displacement    False
Horsepower      False
Weight          False
Acceleration    False
Model           False
Origin          False
dtype: bool

Data Manipulation¶

Data manipulation is required to process the data in order to get hidden relationships beteen the features. Following are some basic functions used for data manipulation in pandas

# Get Column Values
# Syntax: df[column_name]
# Return: Column values
df["Origin"][:10]

0    US
1    US
2    US
3    US
4    US
5    US
6    US
7    US
8    US
9    US
Name: Origin, dtype: object

# Get multiple columns
# Syntax: df[list of column_names]
# Return: dataframe of specified columns

df[["Car", "Model", "Origin"]]

# Another way 
col_list=["Car", "Model", "Origin"]
df[col_list]

# Maths with data frame columns. The mathematical operation applies to all the rows of selected rows
# Multiplication
df["Weight"]*0.001
df[["MPG", "Displacement"]]*2

#Division
df["MPG"]/2
df[["MPG", "Displacement"]]/2

#Comparisons
df["MPG"][:10] >30 # returns True for if MPG value > 10 otherwise returns False

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: MPG, dtype: bool

#Indexing using comparison
df[df["MPG"]>40]

#Get the columns with condition on any column
df[df["MPG"]>40][["Weight","Car"]]

# Slicing Dataframe
df[1:5]
df["MPG"][:5]

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: MPG, dtype: float64

# Group the data according to one of the categorical columns
# In cars data set Cylinders, Model, Origin are the categorical variables

data=df.groupby("Origin") # returns pandas object. Can use list of columns
list(data)

data=df.groupby("Origin")["MPG"].mean() # returns pandas object. Can use list of columns
list(data)

[26.745205479452057, 30.450632911392397, 19.688188976377948]

# Data Aggregation methods (Can be used independently or along with 'groupby')

#Mean
df.mean()  #Returns mean value of all the columns

#Max 
df.max()   #Returns max value of all the columns 

#Min
df.min()   #Returns min value of all the columns 

#Count
df.count() # Returns the count of column values 

#sum 
df.sum()  # Returns the sum of all the columns

Car             Chevrolet Chevelle MalibuBuick Skylark 320Plym...
MPG                                                        9358.8
Cylinders                                                    2223
Displacement                                              79080.5
Horsepower                                                  42033
Weight                                                1.20964e+06
Acceleration                                                 6301
Model                                                       30824
Origin          USUSUSUSUSUSUSUSUSUSEuropeUSUSUSUSUSUSUSUSUSJa...
dtype: object

# Drop columns that are not required
dfNew=df.drop("Origin", axis=1)
print(df.columns.tolist())
print(dfNew.columns.tolist())

['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model', 'Origin']
['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model']

Data Visualization¶

Pandas provide several plot options that we can use to visualize our dataset. Visualisations help to understand the relations between the various features from the dataset. Following are some of the plot options that we can use.

*df.plot.**

Following plot options are also available to be used with pandas dataframe

df.plot.area
df.plot.barh
df.plot.density
df.plot.hist
df.plot.line
df.plot.scatter
df.plot.bar
df.plot.box
df.plot.hexbin
df.plot.kde
df.plot.pie

For more details on plots visit: Pandas Visualization ¶

Analysis¶

df.groupby("Origin")["Model"].count().plot.pie()

<matplotlib.axes._subplots.AxesSubplot at 0x7ff92dc26160>

The above pie-chart shows that mejority of the models are originated in US¶

df.groupby("Cylinders")["MPG"].mean().plot.bar()
plt.xlabel("Mean MPG")

<matplotlib.text.Text at 0x7ff93646f908>

The above bar chart representation shows that cars having 4 and 5 cylinders gave the highest mean MPG compared to others¶

df.groupby("Cylinders")["Horsepower"].mean().plot.barh()
plt.xlabel("Mean Horsepower")

<matplotlib.text.Text at 0x7ff92deb3320>

The above chart clearly shows that the cars with 8 cylinders produced highest horspower compared to others¶

df.groupby("Cylinders")["Acceleration"].median().plot.line()
plt.legend(["Mean Acceleration"])

<matplotlib.legend.Legend at 0x7ff92d8bc908>

df.groupby("Cylinders")["Weight"].plot.kde()
plt.legend()
plt.xlabel("Weight")

<matplotlib.text.Text at 0x7ff92d868eb8>

df.groupby("Cylinders")["Weight"].mean().plot.line()

<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d793630>

The above plots show relation ship between weight and the cylinders. It shows that the cars with higher number of cylinders will weigh more than the one with less number of cylinders¶

df.plot.scatter(y="MPG" ,x="Weight")

<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d7bf940>

The above chart shows the relationship between MPG and Weight of the car. We can see that MPG tends to decrease as Weight of the vehicle increases¶

df.plot.scatter(x="Displacement" ,y="MPG")

<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d6934e0>

df.plot.scatter(y='MPG',x='Displacement')

<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d71ea20>

The above chart shows the relationship between MPG and Displacement. We can see that MPG tends to decrease with increase in Displacement¶

Some more functions for visualization:¶

histogram plot for all numerical features¶

df.hist()

df.hist(xrot=30,figsize=(14,10))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d6c1518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d5a93c8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d571e80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d52dcf8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d4f5e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d4bb588>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d404d68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d3c5588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d394710>]], dtype=object)

Histogram plots for all numerical variables grouped by categorical variable "Origin":¶

df.groupby("Origin").hist(xrot=30,figsize=(14,10))

Origin
Europe    [[Axes(0.125,0.684722;0.215278x0.215278), Axes...
Japan     [[Axes(0.125,0.684722;0.215278x0.215278), Axes...
US        [[Axes(0.125,0.684722;0.215278x0.215278), Axes...
dtype: object

Combined representation of all the numerical features using scatter plots (one for each combination of 2 variables) and histogram plots individual variables¶

pd.scatter_matrix(df)

pd.scatter_matrix(df,figsize=(14,10))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c821cf8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c4f6da0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c4bbdd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c400fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c3d0320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c38b470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c359240>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c310470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c2df780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c2a75c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c26bba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c234dd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c1f6160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c141cf8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c17c5c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c0c9e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c114860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c04dc88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c016da0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927fde198>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927fa6b70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927f634e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927f2c400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927eed4e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927eb6f98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927e52c50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927e3f7f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927d8e2e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927d4f0b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927d175c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927ccfd68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927c9ca58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927ceab00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927c27780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927bf17b8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927bb0860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927b00278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927b37ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927affac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927a40ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927a0e6a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927a28908>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927993eb8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9279629b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927922780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9278ecc88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9278a8470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927875160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9278419e8>]], dtype=object)

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
401	Ford Mustang GL	27.0	4	140.0	86.0	2790.0	15.6	82	US
402	Volkswagen Pickup	44.0	4	97.0	52.0	2130.0	24.6	82	Europe
403	Dodge Rampage	32.0	4	135.0	84.0	2295.0	11.6	82	US
404	Ford Ranger	28.0	4	120.0	79.0	2625.0	18.6	82	US
405	Chevy S-10	31.0	4	119.0	82.0	2720.0	19.4	82	US

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model
count	406.000000	406.000000	406.000000	406.000000	406.000000	406.000000	406.000000
mean	23.051232	5.475369	194.779557	103.529557	2979.413793	15.519704	75.921182
std	8.401777	1.712160	104.922458	40.520659	847.004328	2.803359	3.748737
min	0.000000	3.000000	68.000000	0.000000	1613.000000	8.000000	70.000000
25%	17.000000	4.000000	105.000000	75.000000	2226.500000	13.700000	73.000000
50%	22.350000	4.000000	151.000000	93.500000	2822.500000	15.500000	76.000000
75%	29.000000	8.000000	302.000000	129.000000	3618.250000	17.175000	79.000000
max	46.600000	8.000000	455.000000	230.000000	5140.000000	24.800000	82.000000

	Weight	Car
251	1985.0	Volkswagen Rabbit Custom Diesel
316	2144.0	Volkswagen Rabbit
329	2110.0	Mazda GLC
331	2110.0	Datsun 210
332	2085.0	Volkswagen Rabbit C (Diesel)
333	2335.0	Volkswagen Dasher (diesel)
336	1850.0	Honda Civic 1500 gl
337	1835.0	Renault Lecar Deluxe
402	2130.0	Volkswagen Pickup

	Car	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model	Origin
0	Chevrolet Chevelle Malibu	18.0	8	307.0	130.0	3504.0	12.0	70	US
1	Buick Skylark 320	15.0	8	350.0	165.0	3693.0	11.5	70	US
2	Plymouth Satellite	18.0	8	318.0	150.0	3436.0	11.0	70	US
3	AMC Rebel SST	16.0	8	304.0	150.0	3433.0	12.0	70	US
4	Ford Torino	17.0	8	302.0	140.0	3449.0	10.5	70	US

	Car	Model	Origin
0	Chevrolet Chevelle Malibu	70	US
1	Buick Skylark 320	70	US
2	Plymouth Satellite	70	US
3	AMC Rebel SST	70	US
4	Ford Torino	70	US
5	Ford Galaxie 500	70	US
6	Chevrolet Impala	70	US
7	Plymouth Fury iii	70	US
8	Pontiac Catalina	70	US
9	AMC Ambassador DPL	70	US
10	Citroen DS-21 Pallas	70	Europe
11	Chevrolet Chevelle Concours (sw)	70	US
12	Ford Torino (sw)	70	US
13	Plymouth Satellite (sw)	70	US
14	AMC Rebel SST (sw)	70	US
15	Dodge Challenger SE	70	US
16	Plymouth 'Cuda 340	70	US
17	Ford Mustang Boss 302	70	US
18	Chevrolet Monte Carlo	70	US
19	Buick Estate Wagon (sw)	70	US
20	Toyota Corolla Mark ii	70	Japan
21	Plymouth Duster	70	US
22	AMC Hornet	70	US
23	Ford Maverick	70	US
24	Datsun PL510	70	Japan
25	Volkswagen 1131 Deluxe Sedan	70	Europe
26	Peugeot 504	70	Europe
27	Audi 100 LS	70	Europe
28	Saab 99e	70	Europe
29	BMW 2002	70	Europe
...	...	...	...
376	Chevrolet Cavalier Wagon	82	US
377	Chevrolet Cavalier 2-door	82	US
378	Pontiac J2000 SE Hatchback	82	US
379	Dodge Aries SE	82	US
380	Pontiac Phoenix	82	US
381	Ford Fairmont Futura	82	US
382	AMC Concord DL	82	US
383	Volkswagen Rabbit l	82	Europe
384	Mazda GLC Custom l	82	Japan