Aim: To apply data pre-processing techniques on a given dataset using Python, including attribute selection, handling missing values, data discretization, and the detection and elimination of outliers to prepare data for machine learning models.
import pandas as pd
dataset = {
    'Roll#': [4301, 4302, 4303, 4304, 4305, 4306],
    'Name': ['Akhil', 'Pavan', 'Yuvan', 'Rajesh', 'Suresh', 'Mahesh'],
    'StudyHrs': [0.2, 2, None, 3, None, 1],
    'ExamScore': [48, 79, 'FAIL', 87, 'FAIL', 62],
    'Age': [18, 17, 18, 18, 17, 19],
}
data=pd.DataFrame(dataset)
print('\n',data)
# Attribute selection: drop the columns that are not needed for the analysis
# axis=0 refers to rows (the default); axis=1 refers to columns
data_selected = data.drop(['Roll#', 'Name'], axis=1)
print("After Attribute Selection:\n", data_selected)
# Handling missing values: NaN entries are replaced with the mean of the corresponding column
# The 'errors' argument of pd.to_numeric decides what happens when a value cannot be parsed as a number:
# 'coerce' --> replaces unparseable values (here the string 'FAIL') with NaN, so execution continues
# 'raise'  --> raises an error and stops execution (this is the default)
# 'ignore' --> returns the input unchanged, so strings stay strings ('1'+'1' = '11', not 2); deprecated since pandas 2.2
data_selected['ExamScore'] = pd.to_numeric(data_selected['ExamScore'], errors='coerce')
data_selected['StudyHrs'] = data_selected['StudyHrs'].fillna(data_selected['StudyHrs'].mean()) # Fill StudyHrs NaNs (Not a Number) with the column mean
data_selected['ExamScore'] = data_selected['ExamScore'].fillna(data_selected['ExamScore'].mean()) # Fill ExamScore NaNs (previously 'FAIL') with the column mean
print('After Handling Missing Values:\n', data_selected)
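# Discretization: convert the continuous ExamScore values into labelled ranges (bins)
# pd.cut uses right-closed intervals by default: (0, 40], (40, 50], (50, 60], (60, 80], (80, 100]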
bins = [0, 40, 50, 60, 80, 100] # 6 Edges = 5 Bins
labels = ['Fail', 'Third Class', 'Second Class', 'First Class', 'Distinction'] # 5 Labels (Must match the number of ranges)
data_selected['Grade'] = pd.cut(data_selected['ExamScore'], bins=bins, labels=labels)
print('After Discretization: \n',data_selected)
# Removing outliers: discard rows with an unusually low StudyHrs value, since extreme values can skew the analysis
# Here a fixed threshold is used: only rows where StudyHrs > 0.5 are kept
data_final = data_selected[data_selected['StudyHrs'] > 0.5]
print("Final Data (Outlier Removed):\n", data_final)
What is data pre-processing in machine learning?
Data pre-processing is the step of cleaning and transforming raw data into a suitable format for machine learning models. It improves data quality and model performance.
What is attribute (feature) selection and why is it needed?
Attribute selection is the process of choosing relevant features and removing irrelevant or redundant ones. It reduces dimensionality, shortens training time, and helps reduce overfitting.
Name two methods used for attribute selection.
Filter methods (correlation, chi-square) and wrapper methods (forward selection, backward elimination) are commonly used for attribute selection.
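A filter method can be sketched as follows. This is a minimal illustration, not part of the experiment above: the toy data, the choice of ExamScore as the target, and the cut-off of 0.5 are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; ExamScore is treated as the target attribute.
df = pd.DataFrame({
    'StudyHrs':  [1, 2, 3, 4, 5, 6],
    'Age':       [18, 17, 18, 18, 17, 19],
    'ExamScore': [52, 58, 66, 71, 80, 85],
})

# Filter method: score every feature by its absolute Pearson correlation with the target.
features = df.drop(columns='ExamScore')
corr_with_target = features.corrwith(df['ExamScore']).abs()

# Keep only the features whose score exceeds an assumed cut-off of 0.5.
threshold = 0.5
selected = corr_with_target[corr_with_target > threshold].index.tolist()

print(corr_with_target)
print('Selected features:', selected)
```

With this data StudyHrs is kept and Age is dropped, because Age is only weakly correlated with the score.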
What are missing values in a dataset?
Missing values occur when data for an attribute is not recorded or unavailable. They are usually represented as NaN, NULL, or None.
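As a short sketch, missing entries can be located and counted with pandas before deciding how to handle them. The StudyHrs and Age values are taken from the program above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'StudyHrs': [0.2, 2, None, 3, None, 1],   # None is stored as NaN in a numeric column
    'Age':      [18, 17, 18, 18, 17, 19],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```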
How can missing values be handled in Python?
Missing values can be handled by deletion (for example, dropna() in pandas) or by imputation (fillna()) using the mean, median, mode, or a constant value.
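The options listed above might look like this in pandas. This is a sketch only: the StudyHrs values come from the program above, while the grades Series is a hypothetical categorical column added to show mode imputation.

```python
import pandas as pd

study_hrs = pd.Series([0.2, 2, None, 3, None, 1], name='StudyHrs')

# Deletion: drop the entries that are missing.
dropped = study_hrs.dropna()

# Imputation with the median of the observed values.
median_filled = study_hrs.fillna(study_hrs.median())

# Imputation with a constant value.
constant_filled = study_hrs.fillna(0)

# Imputation with the mode suits categorical data (hypothetical column).
grades = pd.Series(['A', 'B', None, 'B'], name='Grade')
mode_filled = grades.fillna(grades.mode()[0])

print(dropped.tolist())
print(median_filled.tolist())
print(constant_filled.tolist())
print(mode_filled.tolist())
```

The median is often preferred over the mean when the column contains extreme values, because it is less affected by them.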
What is data discretization?
Data discretization is the process of converting continuous numerical attributes into discrete intervals or categories. It simplifies the data and can improve interpretability.
Mention two discretization techniques.
Equal-width binning and equal-frequency (quantile) binning are common discretization techniques.
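The difference between the two techniques can be sketched with pd.cut and pd.qcut. The scores below are hypothetical and deliberately skewed, and four bins is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical scores with one value far from the rest.
scores = pd.Series([40, 42, 45, 47, 50, 52, 55, 95])

# Equal-width binning: the range is split into 4 intervals of the same width.
equal_width = pd.cut(scores, bins=4)

# Equal-frequency binning: the edges are quantiles, so each bin holds about the same number of values.
equal_freq = pd.qcut(scores, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

On skewed data, equal-width bins can end up very unevenly filled (here most scores fall into the first interval), while equal-frequency bins stay balanced at the cost of unequal widths.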
What is an outlier in a dataset?
An outlier is a data point that significantly deviates from the normal pattern of the dataset. It may arise due to errors or rare events.
How are outliers detected in data pre-processing?
Outliers can be detected using statistical methods such as Z-score, IQR (Interquartile Range), or visualization techniques like box plots.
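A minimal sketch of the IQR and Z-score methods is given below. The data are hypothetical; the 1.5 multiplier is the conventional IQR rule, and the Z-score cut-off of 2 is an assumption chosen because the sample is very small (a cut-off of 3 is more common for larger datasets).

```python
import pandas as pd

# Hypothetical study hours with one extreme value.
hours = pd.Series([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 20.0])

# IQR method: flag values beyond 1.5 * IQR from the quartiles.
q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = hours[(hours < lower) | (hours > upper)]

# Z-score method: flag values far from the mean in units of the sample standard deviation.
z_scores = (hours - hours.mean()) / hours.std()
z_outliers = hours[z_scores.abs() > 2]   # assumed cut-off of 2 for this small sample

# Elimination: keep only the values inside the IQR fences.
hours_clean = hours[(hours >= lower) & (hours <= upper)]

print(iqr_outliers.tolist())
print(z_outliers.tolist())
print(hours_clean.tolist())
```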
Why should outliers be eliminated or treated before training a model?
Outliers can distort model training and reduce accuracy. Eliminating or treating them leads to more reliable and stable machine learning models.
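The distortion can be seen even in a simple statistic such as the mean. The sketch below uses hypothetical study hours in which the last value is an extreme entry.

```python
import pandas as pd

# Hypothetical study hours; the last value is an extreme entry.
hours = pd.Series([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 20.0])
hours_without = hours.iloc[:-1]   # same data with the extreme value left out

mean_with = hours.mean()
mean_without = hours_without.mean()
median_with = hours.median()
median_without = hours_without.median()

print('Mean with / without outlier:  ', mean_with, mean_without)
print('Median with / without outlier:', median_with, median_without)
```

A single extreme value shifts the mean noticeably while the median barely changes. Models that rely on means and squared errors are affected in a similar way.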