Label Encoding in Python

by Online Tutorials Library

An Introduction

Before we learn categorical variable encoding, let us first understand the basics of data types and their scales, since the rest of the topic builds on them. Data is information formatted in a specific manner, and we can categorize it into three types: structured data, semi-structured data, and unstructured data.

Data organized as a matrix of rows and columns is called structured data. It can be stored as a table in an SQL database, as rows and columns in an Excel sheet, or as delimiter-separated values in a CSV file.

Data that is not organized as a matrix is either semi-structured or unstructured. Semi-structured data is generally stored in formats such as XML and JSON, whereas unstructured data takes the form of images, e-mails, videos, log data, and free text.
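To make the distinction concrete, here is a minimal sketch that reads the same kind of information once as structured CSV rows and once as a semi-structured JSON document. The sample values and field names (name, city, address, phones) are made up for illustration only.

```python
import csv
import io
import json

# Structured data: every row has the same columns (hypothetical sample).
csv_text = "name,city\nAsha,Delhi\nRavi,Mumbai\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: fields can be nested or vary in length (hypothetical sample).
json_text = '{"name": "Asha", "address": {"city": "Delhi"}, "phones": ["111", "222"]}'
person = json.loads(json_text)

print(rows[1]["city"])            # value from the second CSV row
print(person["address"]["city"])  # value from a nested JSON field
```

The CSV rows fit directly into a table, while the JSON document needs to be flattened before it can be used as rows and columns.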

Consider a business problem in machine learning or data science. Even if we deal with structured data only, the data gathered is usually a combination of continuous and categorical variables, and most machine learning algorithms cannot work with categorical variables directly.

In other words, most machine learning algorithms expect numeric input, and they generally perform better in accuracy and other performance metrics when the data is presented to the model in numeric rather than categorical form for training and testing.

Hence, the categorical data must be encoded into numbers before using it to fit and evaluate a model.
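The idea of encoding categories as numbers can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the column values are hypothetical, and assigning codes in alphabetical order is an assumption made here for determinism (scikit-learn's LabelEncoder likewise assigns codes to the sorted class labels).

```python
# Hypothetical sample column; the values are illustrative only.
cities = ["Delhi", "Mumbai", "Chennai", "Delhi", "Bangalore", "Mumbai"]

# Map each distinct category to an integer code.
# Sorting makes the assignment deterministic (alphabetical order).
categories = sorted(set(cities))
city_to_code = {name: code for code, name in enumerate(categories)}

# Replace every category in the column with its integer code.
encoded = [city_to_code[name] for name in cities]

print(city_to_code)  # {'Bangalore': 0, 'Chennai': 1, 'Delhi': 2, 'Mumbai': 3}
print(encoded)       # [2, 3, 1, 2, 0, 3]
```

The encoded list can be passed to a model in place of the original strings, and the mapping should be kept so that the same codes are applied to the test data.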

Some machine learning algorithms, such as tree-based ones (Decision Tree, Random Forest), handle categorical variables better than others. Even so, the best practice in any data science project is to convert categorical data to numerical data.

Now that the objective is clear, let us look at the types of categorical data before we start encoding them into numerical form and building statistical, machine learning, or deep learning models.

Understanding Nominal Scale

A nominal scale consists of variables whose values are names only. The values serve as labels: the categories do not overlap with each other, and they have no numerical significance.

Note: A nominal scale applies only to variables whose values merely name a category.

Here are some examples of data on the nominal scale. Once we collect the data, we generally have to assign a numerical code to each value of a nominal variable.

What is the person’s gender? Male, Female
What is the person’s marital status? Single, Married, Divorced, Widowed
In which city does the person reside? Delhi, Mumbai, Chennai, Bangalore
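Assigning numerical codes to a nominal variable such as marital status might look like the sketch below. The survey records and the particular codes are assumptions chosen for illustration; because nominal values only name categories, any other assignment of distinct codes would serve equally well.

```python
# Hypothetical survey records using the nominal variables listed above.
records = [
    {"gender": "Male", "marital_status": "Single", "city": "Delhi"},
    {"gender": "Female", "marital_status": "Married", "city": "Mumbai"},
    {"gender": "Female", "marital_status": "Widowed", "city": "Chennai"},
]

# Hand-picked codes (an assumption): the numbers are labels only
# and do not imply that one status ranks above another.
marital_codes = {"Single": 0, "Married": 1, "Divorced": 2, "Widowed": 3}
code_to_marital = {code: label for label, code in marital_codes.items()}

# Encode the column, then decode it again to recover the original names.
coded = [marital_codes[r["marital_status"]] for r in records]
decoded = [code_to_marital[c] for c in coded]

print(coded)    # [0, 1, 3]
print(decoded)  # ['Single', 'Married', 'Widowed']
```

Keeping the reverse mapping makes it possible to translate model inputs or outputs back into readable category names.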