Tutorial 1 : Data Analysis with Pandas

In this tutorial we will explore 'cars.csv' dataset using basic data manipulation and visualisation functions provided by pandas library

You will learn :

  • To read dataset using pandas
  • To get quick summary of data
  • To manipulate dataset columns for data processing
  • To plot the porcessed data for insights

We assume you know:

  • Basic python
  • Basic understanding of plots

Resources:

Let's Begin.....

What is Pandas?

Pandas is one of the most powerfull libraries that are used for data analysis in python. Pandas provide two primary data structures Series (1- dimentional) and Dataframe( 2 dimensional) which can be used to express the data intuitively and perform fast operations. Pandas is built on top of numpy which is another powerful and popular library used in python for scientific computations.

Import Library

Import the pandas library. Here I renamed it to 'pd' just for the ease of use. Make sure you don't get errors in import

In [56]:
import pandas as pd

from matplotlib import pyplot as plt # required for plots
plt.style.use= 'default'# To display plots

Read Dataset

Following are some of the functions available in pandas for reading a dataset

  • read_csv() (Press 'Shift+tab' to see the function arguments )
    • filepath
    • sep
    • names
    • parse_dates
    • index_col
  • read_excel()
  • read_json()
  • read_pickle()
  • read_sql()
In [57]:
df=pd.read_csv('cars.csv', sep=';')

Know Your Dataset

We can use following functions to get basic understanding of the dataset: (Assume 'df' contains data)

  • df.shape
  • df.columns
  • df.head()
  • df.tail()
  • df.describe()

Let's look at these one by one

  • #### df.shape

Retrieves Shape of the dataset

In [58]:
df.shape
Out[58]:
(406, 9)
  • #### df.columns

Retrive the column names from data

In [59]:
df.columns.tolist()
Out[59]:
['Car',
 'MPG',
 'Cylinders',
 'Displacement',
 'Horsepower',
 'Weight',
 'Acceleration',
 'Model',
 'Origin']
  • #### df.head(n)

Retrieve some of the rows from the begining of the data.'n'is the number of rows to be retrived (default=6)

In [60]:
df.head()
Out[60]:
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
0 Chevrolet Chevelle Malibu 18.0 8 307.0 130.0 3504.0 12.0 70 US
1 Buick Skylark 320 15.0 8 350.0 165.0 3693.0 11.5 70 US
2 Plymouth Satellite 18.0 8 318.0 150.0 3436.0 11.0 70 US
3 AMC Rebel SST 16.0 8 304.0 150.0 3433.0 12.0 70 US
4 Ford Torino 17.0 8 302.0 140.0 3449.0 10.5 70 US
  • #### df.tail(n)

Retrieve some of the rows from the end of the data.'n'is the number of rows to be retrived (default=6)

In [61]:
df.tail()
Out[61]:
Car MPG Cylinders Displacement Horsepower Weight Acceleration Model Origin
401 Ford Mustang GL 27.0 4 140.0 86.0 2790.0 15.6 82 US
402 Volkswagen Pickup 44.0 4 97.0 52.0 2130.0 24.6 82 Europe
403 Dodge Rampage 32.0 4 135.0 84.0 2295.0 11.6 82 US
404 Ford Ranger 28.0 4 120.0 79.0 2625.0 18.6 82 US
405 Chevy S-10 31.0 4 119.0 82.0 2720.0 19.4 82 US
  • #### df.describe()

Generate Summary statistics form the dataset

In [62]:
df.describe()
Out[62]:
MPG Cylinders Displacement Horsepower Weight Acceleration Model
count 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000 406.000000
mean 23.051232 5.475369 194.779557 103.529557 2979.413793 15.519704 75.921182
std 8.401777 1.712160 104.922458 40.520659 847.004328 2.803359 3.748737
min 0.000000 3.000000 68.000000 0.000000 1613.000000 8.000000 70.000000
25% 17.000000 4.000000 105.000000 75.000000 2226.500000 13.700000 73.000000
50% 22.350000 4.000000 151.000000 93.500000 2822.500000 15.500000 76.000000
75% 29.000000 8.000000 302.000000 129.000000 3618.250000 17.175000 79.000000
max 46.600000 8.000000 455.000000 230.000000 5140.000000 24.800000 82.000000
  • df.isnull()

This function checks if there are any missing values in the dataset. Returns true if the value is missing. The missing values in the dataset need to be handled carefully in order to maintain consistancy in data

In [63]:
df.isnull()

# any() returns true if atleast one of the values is truthy 
df.isnull().any()
Out[63]:
Car             False
MPG             False
Cylinders       False
Displacement    False
Horsepower      False
Weight          False
Acceleration    False
Model           False
Origin          False
dtype: bool

Data Manipulation

Data manipulation is required to process the data in order to get hidden relationships beteen the features. Following are some basic functions used for data manipulation in pandas

In [64]:
# Get Column Values
# Syntax: df[column_name]
# Return: Column values
df["Origin"][:10]
Out[64]:
0    US
1    US
2    US
3    US
4    US
5    US
6    US
7    US
8    US
9    US
Name: Origin, dtype: object
In [65]:
# Get multiple columns
# Syntax: df[list of column_names]
# Return: dataframe of specified columns

df[["Car", "Model", "Origin"]]

# Another way 
col_list=["Car", "Model", "Origin"]
df[col_list]
Out[65]:
Car Model Origin
0 Chevrolet Chevelle Malibu 70 US
1 Buick Skylark 320 70 US
2 Plymouth Satellite 70 US
3 AMC Rebel SST 70 US
4 Ford Torino 70 US
5 Ford Galaxie 500 70 US
6 Chevrolet Impala 70 US
7 Plymouth Fury iii 70 US
8 Pontiac Catalina 70 US
9 AMC Ambassador DPL 70 US
10 Citroen DS-21 Pallas 70 Europe
11 Chevrolet Chevelle Concours (sw) 70 US
12 Ford Torino (sw) 70 US
13 Plymouth Satellite (sw) 70 US
14 AMC Rebel SST (sw) 70 US
15 Dodge Challenger SE 70 US
16 Plymouth 'Cuda 340 70 US
17 Ford Mustang Boss 302 70 US
18 Chevrolet Monte Carlo 70 US
19 Buick Estate Wagon (sw) 70 US
20 Toyota Corolla Mark ii 70 Japan
21 Plymouth Duster 70 US
22 AMC Hornet 70 US
23 Ford Maverick 70 US
24 Datsun PL510 70 Japan
25 Volkswagen 1131 Deluxe Sedan 70 Europe
26 Peugeot 504 70 Europe
27 Audi 100 LS 70 Europe
28 Saab 99e 70 Europe
29 BMW 2002 70 Europe
... ... ... ...
376 Chevrolet Cavalier Wagon 82 US
377 Chevrolet Cavalier 2-door 82 US
378 Pontiac J2000 SE Hatchback 82 US
379 Dodge Aries SE 82 US
380 Pontiac Phoenix 82 US
381 Ford Fairmont Futura 82 US
382 AMC Concord DL 82 US
383 Volkswagen Rabbit l 82 Europe
384 Mazda GLC Custom l 82 Japan
385 Mazda GLC Custom 82 Japan
386 Plymouth Horizon Miser 82 US
387 Mercury Lynx l 82 US
388 Nissan Stanza XE 82 Japan
389 Honda Accord 82 Japan
390 Toyota Corolla 82 Japan
391 Honda Civic 82 Japan
392 Honda Civic (auto) 82 Japan
393 Datsun 310 GX 82 Japan
394 Buick Century Limited 82 US
395 Oldsmobile Cutlass Ciera (diesel) 82 US
396 Chrysler Lebaron Medallion 82 US
397 Ford Grenada l 82 US
398 Toyota Celica GT 82 Japan
399 Dodge Charger 2.2 82 US
400 Chevrolet Camaro 82 US
401 Ford Mustang GL 82 US
402 Volkswagen Pickup 82 Europe
403 Dodge Rampage 82 US
404 Ford Ranger 82 US
405 Chevy S-10 82 US

406 rows × 3 columns

In [66]:
# Maths with data frame columns. The mathematical operation applies to all the rows of selected rows
# Multiplication
df["Weight"]*0.001
df[["MPG", "Displacement"]]*2

#Division
df["MPG"]/2
df[["MPG", "Displacement"]]/2

#Comparisons
df["MPG"][:10] >30 # returns True for if MPG value > 10 otherwise returns False
Out[66]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: MPG, dtype: bool
In [67]:
#Indexing using comparison
df[df["MPG"]>40]

#Get the columns with condition on any column
df[df["MPG"]>40][["Weight","Car"]]
Out[67]:
Weight Car
251 1985.0 Volkswagen Rabbit Custom Diesel
316 2144.0 Volkswagen Rabbit
329 2110.0 Mazda GLC
331 2110.0 Datsun 210
332 2085.0 Volkswagen Rabbit C (Diesel)
333 2335.0 Volkswagen Dasher (diesel)
336 1850.0 Honda Civic 1500 gl
337 1835.0 Renault Lecar Deluxe
402 2130.0 Volkswagen Pickup
In [68]:
# Slicing Dataframe
df[1:5]
df["MPG"][:5]
Out[68]:
0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: MPG, dtype: float64
In [69]:
# Group the data according to one of the categorical columns
# In cars data set Cylinders, Model, Origin are the categorical variables

data=df.groupby("Origin") # returns pandas object. Can use list of columns
list(data)

data=df.groupby("Origin")["MPG"].mean() # returns pandas object. Can use list of columns
list(data)
Out[69]:
[26.745205479452057, 30.450632911392397, 19.688188976377948]
In [70]:
# Data Aggregation methods (Can be used independently or along with 'groupby')

#Mean
df.mean()  #Returns mean value of all the columns

#Max 
df.max()   #Returns max value of all the columns 

#Min
df.min()   #Returns min value of all the columns 

#Count
df.count() # Returns the count of column values 

#sum 
df.sum()  # Returns the sum of all the columns
Out[70]:
Car             Chevrolet Chevelle MalibuBuick Skylark 320Plym...
MPG                                                        9358.8
Cylinders                                                    2223
Displacement                                              79080.5
Horsepower                                                  42033
Weight                                                1.20964e+06
Acceleration                                                 6301
Model                                                       30824
Origin          USUSUSUSUSUSUSUSUSUSEuropeUSUSUSUSUSUSUSUSUSJa...
dtype: object
In [71]:
# Drop columns that are not required
dfNew=df.drop("Origin", axis=1)
print(df.columns.tolist())
print(dfNew.columns.tolist())
['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model', 'Origin']
['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model']

Data Visualization

Pandas provide several plot options that we can use to visualize our dataset. Visualisations help to understand the relations between the various features from the dataset. Following are some of the plot options that we can use.

  • *df.plot.**

Following plot options are also available to be used with pandas dataframe

  • df.plot.area
  • df.plot.barh
  • df.plot.density
  • df.plot.hist
  • df.plot.line
  • df.plot.scatter
  • df.plot.bar
  • df.plot.box
  • df.plot.hexbin
  • df.plot.kde
  • df.plot.pie

For more details on plots visit: Pandas Visualization

Analysis

In [72]:
df.groupby("Origin")["Model"].count().plot.pie()
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff92dc26160>

The above pie-chart shows that mejority of the models are originated in US

In [73]:
df.groupby("Cylinders")["MPG"].mean().plot.bar()
plt.xlabel("Mean MPG")
Out[73]:
<matplotlib.text.Text at 0x7ff93646f908>

The above bar chart representation shows that cars having 4 and 5 cylinders gave the highest mean MPG compared to others

In [74]:
df.groupby("Cylinders")["Horsepower"].mean().plot.barh()
plt.xlabel("Mean Horsepower")
Out[74]:
<matplotlib.text.Text at 0x7ff92deb3320>

The above chart clearly shows that the cars with 8 cylinders produced highest horspower compared to others

In [75]:
df.groupby("Cylinders")["Acceleration"].median().plot.line()
plt.legend(["Mean Acceleration"])
Out[75]:
<matplotlib.legend.Legend at 0x7ff92d8bc908>
In [76]:
df.groupby("Cylinders")["Weight"].plot.kde()
plt.legend()
plt.xlabel("Weight")
Out[76]:
<matplotlib.text.Text at 0x7ff92d868eb8>
In [77]:
df.groupby("Cylinders")["Weight"].mean().plot.line()
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d793630>

The above plots show relation ship between weight and the cylinders. It shows that the cars with higher number of cylinders will weigh more than the one with less number of cylinders

In [78]:
df.plot.scatter(y="MPG" ,x="Weight")
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d7bf940>

The above chart shows the relationship between MPG and Weight of the car. We can see that MPG tends to decrease as Weight of the vehicle increases

In [79]:
df.plot.scatter(x="Displacement" ,y="MPG")
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d6934e0>
In [80]:
df.plot.scatter(y='MPG',x='Displacement')
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff92d71ea20>

The above chart shows the relationship between MPG and Displacement. We can see that MPG tends to decrease with increase in Displacement

Some more functions for visualization:

histogram plot for all numerical features

  • df.hist()
In [81]:
df.hist(xrot=30,figsize=(14,10))
Out[81]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d6c1518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d5a93c8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d571e80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d52dcf8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d4f5e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d4bb588>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d404d68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d3c5588>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92d394710>]], dtype=object)

Histogram plots for all numerical variables grouped by categorical variable "Origin":

In [82]:
df.groupby("Origin").hist(xrot=30,figsize=(14,10))
Out[82]:
Origin
Europe    [[Axes(0.125,0.684722;0.215278x0.215278), Axes...
Japan     [[Axes(0.125,0.684722;0.215278x0.215278), Axes...
US        [[Axes(0.125,0.684722;0.215278x0.215278), Axes...
dtype: object

Combined representation of all the numerical features using scatter plots (one for each combination of 2 variables) and histogram plots individual variables

  • pd.scatter_matrix(df)
In [83]:
pd.scatter_matrix(df,figsize=(14,10))
Out[83]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c821cf8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c4f6da0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c4bbdd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c400fd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c3d0320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c38b470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c359240>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c310470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c2df780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c2a75c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c26bba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c234dd8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c1f6160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c141cf8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c17c5c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c0c9e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c114860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c04dc88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff92c016da0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927fde198>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927fa6b70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927f634e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927f2c400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927eed4e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927eb6f98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927e52c50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927e3f7f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927d8e2e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927d4f0b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927d175c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927ccfd68>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927c9ca58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927ceab00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927c27780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927bf17b8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927bb0860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927b00278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927b37ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927affac8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927a40ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927a0e6a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927a28908>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7ff927993eb8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9279629b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927922780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9278ecc88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9278a8470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff927875160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7ff9278419e8>]], dtype=object)