- Handling Missing Data
- Identify missing values within categorical columns.
- Depending on the context, handle missing data by either:
- Imputing missing values with a placeholder (e.g., ‘Unknown’ or ‘Other’).
- Dropping rows or columns with missing values if the proportion of missing data is low or removing them doesn’t affect the model’s performance.
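The two options above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the DataFrame, its column names (`color`, `size`), and the `"Unknown"` placeholder are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "color": ["Red", None, "Blue", "Green", None],
    "size": ["S", "M", None, "L", "M"],
})

# Identify missing values: share of missing entries per column.
missing_share = df.isna().mean()

# Option 1: impute missing values with a placeholder category.
imputed = df.fillna("Unknown")

# Option 2: drop every row that contains at least one missing value.
dropped = df.dropna()
```

Inspecting `missing_share` first helps decide between the two options: imputing keeps every row, while dropping is usually reasonable only when few rows are affected.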
- Label Encoding
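- If a categorical variable has an inherent order (ordinal), assign each category an integer that preserves that order. For example, for a variable such as “Rating” with categories “Low”, “Medium”, and “High”, you can encode them as 0, 1, and 2 respectively. Define the mapping explicitly, because encoders that number categories alphabetically would not keep this order.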
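As a small sketch of the “Rating” example, an explicit dictionary mapping keeps the intended order. The Series contents and the name `rating_order` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of the "Rating" variable.
ratings = pd.Series(["Low", "High", "Medium", "Low"], name="Rating")

# Explicit mapping so that Low < Medium < High is preserved.
rating_order = {"Low": 0, "Medium": 1, "High": 2}
encoded = ratings.map(rating_order)
```

Any value that is absent from the mapping becomes `NaN` after `map`, which makes unexpected categories easy to spot.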
- One-Hot Encoding
- For nominal categorical variables without an inherent order (e.g., “Color” with categories “Red”, “Blue”, “Green”), apply one-hot encoding to transform them into binary columns (one column for each category).
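For the “Color” example, one possible sketch uses `pandas.get_dummies`. The DataFrame is made up for illustration; `dtype=int` is chosen here only so the columns hold 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical data for the "Color" variable.
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# One binary column per category; each row has exactly one 1.
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)
```

The resulting columns are named `Color_Blue`, `Color_Green`, and `Color_Red`, and the original `Color` column is removed.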
- Handling Rare Categories
- For categorical features with categories that appear infrequently, either:
- Combine rare categories into an ‘Other’ or ‘Unknown’ category.
- Remove categories that contribute to noise or have very few instances in the dataset.
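Combining rare categories could look like the sketch below. The data and the threshold `min_count = 2` are assumptions; a suitable cutoff depends on the dataset.

```python
import pandas as pd

# Hypothetical categorical feature.
cities = pd.Series(["Paris", "Paris", "Rome", "Paris", "Oslo", "Rome", "Lima"])

# Assumed threshold: categories seen fewer than 2 times count as rare.
min_count = 2
counts = cities.value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories, replace rare ones with "Other".
grouped = cities.where(~cities.isin(rare), "Other")
```

To remove rare categories instead of grouping them, the same `rare` index can be used to filter rows, for example `cities[~cities.isin(rare)]`.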
- Encoding High Cardinality Features
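- If a categorical variable has many unique categories (high cardinality), one-hot encoding produces a very wide, sparse feature matrix. In that case it may be helpful to use advanced encoding methods such as target encoding, which replaces each category with the mean of the target variable for that category. Compute these means on the training data only, so that information from the target does not leak into validation or test data.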
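A bare-bones sketch of target encoding follows. The `train` and `test` frames, the column names, and the choice to fall back to the global target mean for unseen categories are all assumptions for illustration; practical implementations often add smoothing or cross-fitting, which this sketch omits.

```python
import pandas as pd

# Hypothetical training data with a binary target.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0],
})
# Hypothetical unseen data; "D" never appears in training.
test = pd.DataFrame({"city": ["A", "C", "D"]})

# Mean of the target per category, computed on training data only.
category_means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

train["city_encoded"] = train["city"].map(category_means)
# Assumed fallback: unseen categories get the global training mean.
test["city_encoded"] = test["city"].map(category_means).fillna(global_mean)
```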
- Feature Scaling (if necessary)
- If categorical features have been encoded as numerical values (e.g., through ordinal or target encoding), consider normalizing or standardizing them when they are used in distance-based or otherwise scale-sensitive models (e.g., k-NN, SVM).
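Standardization can be sketched directly in pandas. The column names below are hypothetical encoded features; in a real pipeline the mean and standard deviation would be computed on the training data and reused for new data.

```python
import pandas as pd

# Hypothetical numerically encoded categorical features.
encoded = pd.DataFrame({
    "rating": [0, 1, 2, 1],
    "city_target_mean": [0.5, 1.0, 0.0, 0.5],
})

# Standardize each column to zero mean and unit variance
# (population standard deviation, ddof=0).
standardized = (encoded - encoded.mean()) / encoded.std(ddof=0)
```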
- Outlier Handling
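- For categorical data, outliers are usually unexpected or invalid labels (e.g., typos or inconsistent capitalization) rather than extreme numbers. Consider correcting such values or grouping them into a single category to prevent model distortion.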
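One way to correct or group unexpected labels is sketched below. The list of valid categories, the `corrections` dictionary, and the sample values are hypothetical and would come from domain knowledge in practice.

```python
import pandas as pd

# Hypothetical raw labels with stray whitespace, mixed case, and a typo.
colors = pd.Series([" red", "Red", "BLUE", "gren", "Green", "purple?"])

valid = ["Red", "Blue", "Green"]      # assumed set of expected categories
corrections = {"Gren": "Green"}       # assumed known typo fixes

# Normalize whitespace and capitalization, then apply known corrections.
cleaned = colors.str.strip().str.title().replace(corrections)

# Anything still outside the expected categories is grouped as "Other".
cleaned = cleaned.where(cleaned.isin(valid), "Other")
```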