An end-to-end churn prediction pipeline that identifies at-risk customers, attributes outcomes across marketing channels, and allocates retention spend to the segments where it matters most.
The pipeline runs on a realistic synthetic e-commerce dataset modeled on the Kaggle E-Commerce Customer Churn benchmark,
covering behavioral, satisfaction, demographic, and multi-channel marketing signals.
Class imbalance (~12.3% churn) is handled with `class_weight='balanced'` and threshold tuning on the precision-recall (PR) curve.
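
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weighting formula scikit-learn documents for that option, `n_samples / (n_classes * class_count)`. The label vector is a hypothetical stand-in with 123 churners per 1,000 customers, not the project's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights following scikit-learn's 'balanced' heuristic:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels: 12.3% churners (1) and 87.7% retained customers (0)
labels = [1] * 123 + [0] * 877
weights = balanced_class_weights(labels)

for cls, w in sorted(weights.items()):
    print(f"class {cls}: weight {w:.3f}")
```

With this weighting each class contributes the same total weight to the loss, so at a 12.3% churn rate a churner counts roughly seven times as much as a retained customer.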
| Category | Key Features |
|---|---|
| Behavioral | Tenure, OrderCount, DaySinceLastOrder, CouponUsed, CashbackAmount |
| Satisfaction | SatisfactionScore, Complain |
| Demographics | Gender, MaritalStatus, CityTier, NumberOfAddress |
| Marketing Channels | EmailOpens, EmailClicks, PushNotifClicked, SocialAdClicked, RetargetingExposed, AcquisitionChannel |
| Engineered | EmailEngagementRate, MultiChannelEngagement |
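
The exact definitions of the engineered features are not given here, so the following is only a sketch under assumed definitions: `EmailEngagementRate` as clicks per open (0 when nothing was opened) and `MultiChannelEngagement` as the number of channels with any engagement. The helper name `add_engagement_features` and the sample rows are hypothetical.

```python
import pandas as pd


def add_engagement_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add two engineered columns under ASSUMED definitions (see lead-in)."""
    out = df.copy()

    # Assumed: clicks per opened email; customers with no opens get 0.0
    has_opens = out["EmailOpens"] > 0
    rate = out["EmailClicks"].div(out["EmailOpens"])
    out["EmailEngagementRate"] = rate.where(has_opens, 0.0)

    # Assumed: count of distinct channels with at least one engagement
    channel_cols = ["EmailClicks", "PushNotifClicked",
                    "SocialAdClicked", "RetargetingExposed"]
    out["MultiChannelEngagement"] = (out[channel_cols] > 0).sum(axis=1)
    return out


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "EmailOpens": [10, 0, 4],
    "EmailClicks": [2, 0, 4],
    "PushNotifClicked": [1, 0, 0],
    "SocialAdClicked": [0, 0, 1],
    "RetargetingExposed": [1, 0, 1],
})
features = add_engagement_features(sample)
print(features[["EmailEngagementRate", "MultiChannelEngagement"]])
```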
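Four classifiers were trained with class-imbalance correction. Decision thresholds were tuned on precision-recall curves rather than left at the default 0.5 cutoff. Logistic Regression achieves the best AUC (0.739), F1, and precision and is recommended for production deployment; Random Forest has the highest recall (0.791) but the lower precision that comes with it.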
| Model | AUC ↑ | F1 | Precision | Recall |
|---|---|---|---|---|
| Logistic Regression | 0.739 | 0.369 | 0.260 | 0.633 |
| Random Forest | 0.708 | 0.334 | 0.212 | 0.791 |
| CatBoost | 0.702 | 0.349 | 0.245 | 0.604 |
| XGBoost | 0.628 | 0.288 | 0.195 | 0.547 |
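
Threshold tuning on the PR curve could look like the sketch below. It assumes the threshold is chosen to maximize F1 on a validation split, which is one common criterion and not necessarily the one used for the table above. The data comes from `make_classification` as a stand-in with about 12% positives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the project's dataset), ~12% positive class
X, y = make_classification(n_samples=4000, n_features=12, n_informative=5,
                           weights=[0.88, 0.12], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, proba)
# precision/recall carry one extra trailing point (1, 0) with no threshold
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best = int(np.argmax(f1))
best_threshold = float(thresholds[best])
y_pred = (proba >= best_threshold).astype(int)

f1_default = f1_score(y_val, (proba >= 0.5).astype(int))
f1_tuned = f1_score(y_val, y_pred)
print(f"threshold={best_threshold:.3f}  "
      f"F1@0.5={f1_default:.3f}  F1@tuned={f1_tuned:.3f}")
```

Tuning on the same split used for reporting gives optimistic numbers, so a separate validation fold for threshold selection is the safer design.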
SHAP TreeExplainer is applied to the CatBoost model to identify which features drive churn predictions and in which direction. The results are actionable: each top driver maps directly to a retention lever.
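
Once a SHAP value matrix is available (for a tree model, typically from `shap.TreeExplainer(model).shap_values(X)`), ranking drivers and reading off their direction is a small post-processing step. The sketch below assumes such a matrix already exists and uses fabricated stand-in arrays with generic feature names, so its output says nothing about the project's actual drivers. The helper `summarize_shap` is hypothetical.

```python
import numpy as np


def summarize_shap(shap_values, X, feature_names):
    """Rank features by mean |SHAP|; direction is the sign of the correlation
    between a feature's value and its SHAP contribution."""
    importance = np.abs(shap_values).mean(axis=0)
    rows = []
    for j in np.argsort(importance)[::-1]:
        x, s = X[:, j], shap_values[:, j]
        if x.std() == 0 or s.std() == 0:
            corr = 0.0  # constant column: direction is undefined
        else:
            corr = float(np.corrcoef(x, s)[0, 1])
        if corr > 0:
            direction = "higher value raises predicted churn"
        elif corr < 0:
            direction = "higher value lowers predicted churn"
        else:
            direction = "no clear direction"
        rows.append((feature_names[j], float(importance[j]), direction))
    return rows


# Stand-in arrays for illustration only (not real SHAP output)
rng = np.random.default_rng(0)
names = ["feat_a", "feat_b", "feat_c"]
X = rng.normal(size=(500, 3))
shap_values = np.column_stack([-0.8 * X[:, 0], 0.5 * X[:, 1], 0.2 * X[:, 2]])

summary = summarize_shap(shap_values, X, names)
for name, score, direction in summary:
    print(f"{name:8s} mean|SHAP|={score:.3f}  {direction}")
```

A correlation sign is only a coarse summary; for non-monotonic effects a SHAP dependence plot is the more faithful view.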
Four actionable customer personas were identified via K-Means clustering. Each segment has a distinct risk profile and a targeted retention strategy.
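
A minimal version of the segmentation step is sketched below. The choice of clustering features, the scaling step, and the random stand-in data are all assumptions, so the printed profiles are illustrative and do not reproduce the four personas.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 1000

# Synthetic stand-in columns (assumed feature subset):
# Tenure, OrderCount, DaySinceLastOrder, SatisfactionScore
X = np.column_stack([
    rng.gamma(2.0, 6.0, n),
    rng.poisson(3, n),
    rng.integers(0, 30, n),
    rng.integers(1, 6, n),
]).astype(float)
churn = rng.binomial(1, 0.123, n)  # stand-in churn labels

# K-Means is distance-based, so features are standardized first
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=7)
segments = kmeans.fit_predict(X_scaled)

# Profile each segment: size, churn rate, and mean of the raw features
for k in range(4):
    mask = segments == k
    print(f"segment {k}: n={int(mask.sum())}, "
          f"churn_rate={churn[mask].mean():.3f}, "
          f"means={X[mask].mean(axis=0).round(1)}")
```

Profiling on the unscaled features keeps the segment descriptions readable for marketing and CRM teams.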
Each recommendation is grounded in a specific model finding, with supporting data observations and a concrete action plan for marketing and CRM teams.