Machine Learning - Data Preprocessing

Splitting data

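from sklearn.model_selection import train_test_split

# X: feature DataFrame, y: labels. Hold out 20% for testing; stratify=y keeps class proportions the same in both splits.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)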

Pipeline with ColumnTransformer

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Column names in X, grouped by the preprocessing they need.
num_cols = ["age", "income"]
cat_cols = ["country", "segment"]

pre = ColumnTransformer([
  ("num", Pipeline([("imp", SimpleImputer(strategy="median")), ("sc", StandardScaler())]), num_cols),
  ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")), ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

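# Chain preprocessing and model so both are fit on the training split only.
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(Xtr, ytr)
# f1_score defaults to average="binary"; pass e.g. average="macro" for multiclass targets.
print({"f1": f1_score(yte, pipe.predict(Xte))})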

Notes

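  • Fit transformers (imputers, scalers, encoders) on the training split only. Wrapping them in a Pipeline does this automatically and keeps test-set statistics from leaking into training.
  • Keep preprocessing and model in a single Pipeline object so the exact same transformations are applied at training, evaluation, and deployment time.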
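
Both notes can be illustrated with a minimal sketch. The data below is synthetic and purely hypothetical (only the column names mirror the ones above), and the file name pipe.joblib is an arbitrary choice. The first part relies on cross_val_score refitting a fresh clone of the whole pipeline on each training fold, so imputation medians and scaling statistics never see the held-out fold. The second part saves and reloads the fitted pipeline with joblib as one possible way to ship preprocessing and model together.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical; not from any real dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "income": rng.normal(50_000, 15_000, n),
    "country": rng.choice(["US", "DE", "IN"], n),
    "segment": rng.choice(["a", "b"], n),
})
X.loc[::10, "income"] = np.nan          # inject some missing values
y = (X["age"] > 45).astype(int)         # simple binary target

num_cols = ["age", "income"]
cat_cols = ["country", "segment"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

# Note 1: each fold refits a clone of the full pipeline on that fold's
# training part only, so preprocessing statistics cannot leak.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print("f1 per fold:", scores.round(3))

# Note 2: persist preprocessing + model as one object and reload it.
pipe.fit(X, y)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipe.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = np.array_equal(pipe.predict(X), restored.predict(X))
print("round-trip predictions identical:", same)
```

Passing the pipeline itself to cross_val_score, rather than pre-transformed arrays, is what keeps the evaluation leak-free. Only load joblib files from sources you trust, since loading can execute arbitrary code.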