Pandas

An important Python library for Machine Learning.

Deep Patel
Feb 17, 2021

Pandas is an open-source Python library for high-performance data manipulation and analysis. This tutorial is designed for both beginners and professionals.

Key Features of Pandas

  • It provides a fast and efficient DataFrame object that makes indexing easy.
  • It is used for reshaping and pivoting data sets.
  • It groups data for aggregations and transformations.
  • It handles data alignment and missing data in an integrated way.
  • It provides time series functionality.
  • It processes data sets in a variety of formats, such as matrix data, heterogeneous tabular data, and time series.
  • It handles many operations on data sets, such as subsetting, slicing, filtering, grouping, re-ordering, and re-shaping.
  • It integrates with other libraries such as SciPy and scikit-learn.
  • It is fast, and if you want to speed it up even more, you can use Cython (an optimizing static compiler for Python).
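
A few of the features listed above (grouping, pivoting, and filtering) can be sketched in a handful of lines. The sales table and its column names are invented for illustration and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120],
})

# Group by: total revenue per region
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Pivot: one row per region, one column per quarter
revenue_table = sales.pivot(index="region", columns="quarter", values="revenue")

# Filtering: keep only the rows where revenue exceeds 90
large_sales = sales[sales["revenue"] > 90]

print(revenue_by_region)
print(revenue_table)
print(large_sales)
```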

Benefits of Pandas

The benefits of using pandas rather than another language are as follows:

  • Data Representation: It represents the data in a form that is suited for data analysis through its DataFrame and Series.
  • Clear code: The clear pandas API lets you focus on the core part of your code, so what you write stays clear and concise.

Similar to a matrix in NumPy, pandas stores tabular data in a structure known as a DataFrame.

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Here, I will be discussing the most important features of the pandas library that you will surely use somewhere in your project.

#Creating DataFrame

# Import pandas library
import pandas as pd
data = {'a': [11, 21, 31], 'b': [12, 22, 32]} #named data so the built-in dict is not shadowed
#Create dataframe with dictionary
df = pd.DataFrame(data)
print(df) #df is a pandas.core.frame.DataFrame
Output:
    a   b
0  11  12
1  21  22
2  31  32
This is what a DataFrame looks like. It contains two columns, a and b, and three rows indexed 0, 1, and 2.
#Create a dataframe by importing csv file
df = pd.read_csv('path_of_file.csv', header=None)
#header=None means the file has no header row, so pandas numbers the columns 0, 1, 2, ...
#Knowing the shape (no. of rows, no. of columns) of the dataframe
df.shape
# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe")
df.head(5)
#get complete information of dataframe
df.info()
#assigning names to columns via a list (one name per column; the list is elided here)
df.columns = ['a', 'b', 'c'....]
#Convert "?" to NaN (?, nan are common forms of null values)
import numpy as np
df.replace("?", np.nan, inplace = True)
#drop rows with nan (null values)
df.dropna(inplace=True)
#dropping rows with null values in specific columns
df.dropna(subset=['a', 'b'], axis=0, inplace=True)
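#Replacing nan (null) values
#the result is assigned back to the column, because inplace=True on a selected column may not update df in recent pandas versions
#a. replace it by the mean of the column
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)
#b. replace it by the most frequent value
most_frequent = df['num-of-doors'].value_counts().idxmax() #"four" in this example
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, most_frequent)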
#sum of total null values present in each column
df.isnull().sum()
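
The missing-value steps above can be tried end to end on a tiny table. This is a minimal sketch under stated assumptions: the four rows are invented, and it uses fillna(), the dedicated pandas method for filling nulls, in place of replace(np.nan, ...).

```python
import numpy as np
import pandas as pd

# Hypothetical car data; "?" marks a missing value, as in the snippets above
cars = pd.DataFrame({
    "normalized-losses": ["164", "?", "158", "?"],
    "num-of-doors": ["four", "two", "?", "four"],
})

# Turn the "?" placeholders into real NaN values
cars = cars.replace("?", np.nan)

# Count the missing values in each column
missing_per_column = cars.isnull().sum()

# Fill the numeric column with its mean (NaN is skipped when averaging)
cars["normalized-losses"] = cars["normalized-losses"].astype("float")
mean_loss = cars["normalized-losses"].mean()
cars["normalized-losses"] = cars["normalized-losses"].fillna(mean_loss)

# Fill the categorical column with its most frequent value
most_frequent = cars["num-of-doors"].value_counts().idxmax()
cars["num-of-doors"] = cars["num-of-doors"].fillna(most_frequent)

print(missing_per_column)
print(cars)
```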
#finding the mean of a particular column; sum and mode work the same way
p = df['Item_Weight'].mean()
#for the sum, replace mean() with sum()
#finding correlation between all columns
#DataFrame.corr() finds the pairwise correlation of the columns in the dataframe. Null values are automatically excluded. Since pandas 2.0, non-numeric columns are no longer skipped by default, so pass numeric_only=True to ignore them.
df.corr(numeric_only=True)
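
As a small illustration of how corr() treats a mixed table, the sketch below builds an invented three-column frame (the column names are assumptions) and restricts the calculation to the numeric columns. The numeric_only argument assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical item data, invented for illustration
items = pd.DataFrame({
    "weight": [1.0, 2.0, 3.0, 4.0],
    "price": [10.0, 20.0, 30.0, 40.0],
    "kind": ["a", "b", "a", "b"],  # non-numeric column
})

# numeric_only=True restricts the calculation to the numeric columns
corr_matrix = items.corr(numeric_only=True)
print(corr_matrix)
```

Because price is an exact multiple of weight here, their correlation comes out as 1, and the text column does not appear in the result.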
#drop or remove particular column
df.drop("Item_Weight", axis = 1, inplace=True)
#finding all unique values in particular column
df['Item_Type'].unique()
#replacing values of cells in a particular column
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(['low fat','LF','reg'],['Low Fat','Low Fat','Regular'])
#Creating dummy variables
#Because most models cannot train directly on categorical (text) data
#The get_dummies() function converts a categorical variable into dummy/indicator variables, one column per category.
dummy_variable_1 = pd.get_dummies(df["Item_Type"])
df = pd.concat([df, dummy_variable_1], axis=1)
df.drop("Item_Type", axis = 1, inplace=True)
df.head()
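
To see what the dummy-variable steps produce, here is a self-contained sketch on a three-row table. The rows and the Price column are invented for illustration; only Item_Type comes from the snippet above.

```python
import pandas as pd

# Hypothetical items; Price is an invented column
items = pd.DataFrame({
    "Item_Type": ["Dairy", "Snacks", "Dairy"],
    "Price": [50.0, 20.0, 45.0],
})

# One indicator column per distinct Item_Type value
dummies = pd.get_dummies(items["Item_Type"])

# Attach the indicator columns and drop the original text column
items = pd.concat([items, dummies], axis=1).drop("Item_Type", axis=1)

print(items.columns.tolist())
print(items)
```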
#Another approach to encoding categorical values is to use a technique called label encoding. Label encoding is simply converting each value in a column to a number.
from sklearn.preprocessing import LabelEncoder
categorical_column = ['Gender','Age','City_Category','Stay_In_Current_City_Years']
le = LabelEncoder()
#x_data is assumed to be a dataframe that contains these columns
for i in categorical_column:
    x_data[i] = le.fit_transform(x_data[i])
x_data.head()
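
If scikit-learn is not available, a similar integer encoding can be sketched with pandas alone by converting a column to the category dtype and reading its codes. This is an alternative to the LabelEncoder call above, not the same API, and the shopper rows are invented for illustration.

```python
import pandas as pd

# Hypothetical shopper data, invented for illustration
shoppers = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "City_Category": ["B", "A", "C", "A"],
})

for column in ["Gender", "City_Category"]:
    # Categories are sorted, then each value is replaced by its position
    shoppers[column] = shoppers[column].astype("category").cat.codes

print(shoppers)
```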
#Changing data-type of particular column
df['Product_Category_2'] = df['Product_Category_2'].astype(int)
#Copying a column of the dataframe (use df.copy() to copy the whole dataframe)
y_data = df['Purchase'].copy()
#Add or append new column in existing dataframe
df_submission['Purchase'] = predictions
#Set particular column as index
df.set_index('Month', inplace=True)
#Slice by position to select rows and columns (here: all rows, columns 5 to 32)
df_questions = df.iloc[:,5:33]
#DESCRIBE - statistical summary of each column, such as count, column mean value, column standard deviation
df.describe()
#Data NORMALIZATION
#Target: normalize the variables so their values range from 0 to 1 (this holds for non-negative data; the largest value becomes 1)

# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
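
Dividing by the maximum maps the largest value to 1 but does not map the smallest value to 0. The sketch below, on three invented lengths, compares it with min-max scaling, which covers the full 0 to 1 range. The new column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical lengths, invented for illustration
cars = pd.DataFrame({"length": [150.0, 175.0, 200.0]})

# Simple feature scaling: divide by the maximum value
cars["length_scaled"] = cars["length"] / cars["length"].max()

# Min-max scaling: maps the smallest value to 0 and the largest to 1
length_range = cars["length"].max() - cars["length"].min()
cars["length_minmax"] = (cars["length"] - cars["length"].min()) / length_range

print(cars)
```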
#Binning - Binning is the process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis. Look at the following example:
# with bins = [2, 6.5, 8], values in (2, 6.5] are labelled bad and values in (6.5, 8] are labelled good
# for whole-number scores, 6 or below is bad and 7 or above is good; values outside (2, 8] become NaN

bins = [2, 6.5, 8]
print(bins)
group_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
print(df['quality'])
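
The binning step can be checked on a few sample scores. This sketch reuses the bins and labels from above; the scores and the quality_label column name are invented so the original column stays visible next to its label.

```python
import pandas as pd

# Hypothetical quality scores, invented for illustration
wine = pd.DataFrame({"quality": [3, 5, 6, 7, 8]})

bins = [2, 6.5, 8]
group_names = ["bad", "good"]

# Each score falls into (2, 6.5] or (6.5, 8] and receives the matching label
wine["quality_label"] = pd.cut(wine["quality"], bins=bins, labels=group_names)

print(wine)
print(wine["quality_label"].value_counts())
```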
#Selecting particular row with index
df.iloc[0]
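
Selection by position (iloc) and by label (loc) is easier to compare side by side. The following is a small sketch on an invented monthly table; only the Month index name comes from the snippets above.

```python
import pandas as pd

# Hypothetical monthly figures, invented for illustration
sales = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "units": [10, 20, 30],
    "revenue": [100, 200, 300],
}).set_index("Month")

first_row = sales.iloc[0]          # by position: the first row
feb_row = sales.loc["Feb"]         # by label: the row whose index is "Feb"
revenue_only = sales.iloc[:, 1:2]  # all rows, the column at position 1

print(first_row)
print(feb_row)
print(revenue_only)
```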

In future posts, I will walk through complete projects that use all of these pandas functions in detail.

For further reference on pandas go here.

