Machine Learning - Feature Scaling & Encoding
Scaling
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
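X_num = ...  # numeric feature matrix, shape (n_samples, n_features)
y = ...  # target vector, shape (n_samples,)
# RobustScaler centers on the median and scales by the IQR, so outliers distort it less than StandardScaler
pipe = make_pipeline(RobustScaler(), LinearRegression())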
pipe.fit(X_num, y)
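To see why the scaler choice matters, here is a minimal sketch on a made-up column (the values are hypothetical, chosen only for illustration): five typical values plus one extreme outlier, scaled both ways.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical toy column: five typical values and one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

std_scaled = StandardScaler().fit_transform(x)  # uses mean and standard deviation
rob_scaled = RobustScaler().fit_transform(x)    # uses median and interquartile range

# Spread of the five typical values after each scaling.
typical_spread_standard = float(std_scaled[:5].max() - std_scaled[:5].min())
typical_spread_robust = float(rob_scaled[:5].max() - rob_scaled[:5].min())

print("StandardScaler spread of typical values:", typical_spread_standard)
print("RobustScaler spread of typical values:  ", typical_spread_robust)
```

With StandardScaler the outlier inflates the mean and standard deviation, so the typical values end up squeezed into a very narrow band; RobustScaler keeps them spread out, although the outlier itself is still extreme after scaling.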
Encoding (ColumnTransformer)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
num_cols = ["age", "income"]
cat_cols = ["country", "segment"]
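pre = ColumnTransformer([
    # numeric: mean-impute (SimpleImputer default), then standardize
    ("num", Pipeline([("imp", SimpleImputer()), ("sc", StandardScaler())]), num_cols),
    # categorical: impute the mode, then one-hot encode (unseen categories become all zeros)
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])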
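A preprocessor like this is usually chained with an estimator so that imputation, scaling, and encoding are fit on training data only. The sketch below is one possible way to do that; the toy DataFrame, the target values, the LinearRegression estimator, and the unseen country "BR" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["age", "income"]
cat_cols = ["country", "segment"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer()), ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Hypothetical training data with a few missing entries.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [30_000, 52_000, 61_000, np.nan, 88_000, 45_000],
    "country": ["US", "DE", "US", np.nan, "FR", "DE"],
    "segment": ["a", "b", "a", "b", np.nan, "a"],
})
y_train = np.array([1.0, 2.0, 2.5, 4.0, 3.5, 2.2])  # made-up target

# Preprocessing and model in one estimator: fit() learns imputation values,
# scaling statistics, and categories from the training rows only.
model = Pipeline([("pre", pre), ("reg", LinearRegression())])
model.fit(train, y_train)

# A new row with a country never seen during fit ("BR").
new = pd.DataFrame({"age": [40], "income": [70_000],
                    "country": ["BR"], "segment": ["a"]})
pred = model.predict(new)
encoded_new = model.named_steps["pre"].transform(new)
print(pred, encoded_new.shape)
```

Because of handle_unknown="ignore", the unseen country does not raise an error at predict time; its one-hot block is simply all zeros.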
Target encoding (leakage-safe)
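Use cross-fold (out-of-fold) target encoding, or a library that implements it, so that each row's encoding is computed without that row's own target value. Never compute encodings on the full dataset before splitting: doing so leaks target information into the validation and test rows.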
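A minimal sketch of out-of-fold mean target encoding, assuming a single categorical column and a numeric target. The helper name oof_target_encode, its parameters, and the toy data are hypothetical; recent scikit-learn versions (1.3 and later) also provide sklearn.preprocessing.TargetEncoder, whose fit_transform applies a similar cross-fitting scheme.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cat, y, n_splits=5, seed=0):
    """Encode each row with the category mean of y computed on the OTHER folds."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Statistics come only from rows outside the fold being encoded.
        fold_means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        fold_prior = y.iloc[fit_idx].mean()  # fallback for categories unseen in fit rows
        values = cat.iloc[enc_idx].map(fold_means).fillna(fold_prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Hypothetical toy data: category "z" occurs once, with an extreme target.
cat = ["a", "a", "a", "b", "b", "b", "z"]
y = [1, 1, 0, 0, 0, 1, 100]

encoded = oof_target_encode(cat, y, n_splits=3)
print(encoded.tolist())
```

The single "z" row never sees its own target of 100: when it is encoded, "z" is absent from the fitting rows, so it falls back to the mean of the other rows. A naive full-dataset encoding would have assigned it exactly 100. For a held-out test set, one common choice is to map categories using means computed on the whole training set.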