Data & Reports

All project deliverables, datasets, trained models, and documentation from the Mortgage Servicing Analytics Platform. Built on 992,289 Freddie Mac loans spanning 11 origination years (2014 to 2024).

Technical Documentation

Complete project architecture, pipeline design, SQL query catalog, ML model specifications, and deployment guide. Covers all seven stages of the build process.

Documentation

Cleaned Datasets

Seven preprocessed CSV files ready for analysis. Includes portfolio summary, delinquency distribution, roll rates, risk segments, vintage comparison, geographic breakdown, and full loan-level detail (992K rows).

CSV Downloads

Data Quality Report

Automated quality assessment with 19 checks across completeness, validity, consistency, and distribution profiling. Every check passed on the full dataset.

HTML Report

ML Executive Report

Eight-section automated report with embedded Plotly charts, portfolio KPIs, vintage analysis, roll rate matrix, risk segments, model summary, and threshold-driven recommendations.

HTML Report

Trained Models

Two trained scikit-learn models saved as .joblib files: OriginRisk (logistic regression, origination features only) and SegmentIQ (random forest, behavioral segmentation). Load with joblib.load() in Python.

Model Artifacts
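
A minimal loading sketch, assuming the artifacts are scikit-learn estimators fitted on a pandas DataFrame; the file names below are illustrative, not the actual artifact names in the download:

```python
# Hedged sketch: load a saved model and score a loan file.
import joblib
import pandas as pd

origin_risk = joblib.load("originrisk.joblib")    # hypothetical file name
loans = pd.read_csv("loan_level_detail.csv")      # hypothetical file name

# feature_names_in_ is set when a scikit-learn estimator is fit on a DataFrame.
features = loans[origin_risk.feature_names_in_]
loans["p_delinquent"] = origin_risk.predict_proba(features)[:, 1]
```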

Mortgage Domain Cheatsheet

Quick reference guide covering mortgage servicing terminology, delinquency metrics, roll rate definitions, Freddie Mac data fields, and key financial concepts used throughout the project.

Reference Guide

Enhancement Roadmap

Documented paths for improving model performance, expanding data coverage, and adding new analytical capabilities. Each enhancement builds on the existing pipeline architecture without requiring structural changes.

Model Enhancement

Calibration Layer for Dollar Loss Estimation

The current OriginRisk model outputs a probability score, for example "this loan has a 12% chance of becoming delinquent." The servicing team, however, cares about more than probability: it cares about dollar exposure. A $500K loan with 5% risk carries $25K in expected loss, while a $100K loan with 15% risk carries only $15K.

A calibration layer would adjust the raw probability outputs so they accurately reflect real-world default rates, then multiply each score by the loan balance to produce an expected dollar loss per loan. This transforms the risk score from a ranking tool into a business decision tool. For example: "These 500 loans represent $2.3M in expected losses. Allocate loss mitigation resources here."
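
A minimal sketch of that two-step idea using scikit-learn's isotonic regression; the toy arrays stand in for a held-out validation set and are not the project's actual data:

```python
# Hedged sketch: calibrate raw scores against observed outcomes,
# then convert calibrated probabilities into expected dollar loss.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy validation data: raw model scores and observed outcomes (0/1).
scores_val = np.array([0.02, 0.05, 0.10, 0.20, 0.40])
y_val = np.array([0, 0, 0, 1, 1])

# Step 1: learn a monotonic map from raw scores to observed default rates.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores_val, y_val)

# Step 2: expected dollar loss = calibrated probability x loan balance.
scores_new = np.array([0.05, 0.15])
balances_new = np.array([500_000, 100_000])
expected_loss = iso.predict(scores_new) * balances_new
```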

Model Enhancement

Gradient Boosting as a Third Model Candidate

The current pipeline trains Logistic Regression and Random Forest; both achieved similar performance (AUC 0.77). Random Forest builds 200 independent decision trees and averages their predictions. Gradient boosting methods such as XGBoost or LightGBM work differently: they build trees one after another, with each new tree focusing on the loans the previous trees got wrong. This iterative correction typically produces better results on structured tabular data.

The pipeline already has HistGradientBoosting registered in train.py but unused; enabling it requires a one-line configuration change. Expected improvement: 2 to 3 additional AUC points, plus a three-model comparison table in the evaluation report.
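
A sketch of what the enabled registry could look like, assuming a dictionary-style model registry as implied above; the names and hyperparameters are illustrative, not the actual train.py:

```python
# Hedged sketch of a model registry with the third candidate enabled.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

MODEL_REGISTRY = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    # The one-line change: include the already-registered booster.
    "hist_gradient_boosting": HistGradientBoostingClassifier(
        max_iter=300,
        class_weight="balanced",  # counteracts the ~1% positive rate; needs scikit-learn >= 1.2
    ),
}
```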

Data Expansion

Processing the Full 128 File Freddie Mac Dataset

The current pipeline processes 11 selected files (one pool per vintage year, 2015 to 2025), totaling 992,289 loans. Freddie Mac publishes approximately 128 files in each quarterly release. The remaining files contain additional loan pools from the same origination years, representing roughly 5 to 8 million total loans.

More data means more statistically robust risk segments (many current segments have only 30 to 50 loans), better geographic coverage per state, and slightly improved model accuracy. The pipeline architecture already handles this without code changes. Simply add more raw files to the data directory and rerun the ETL.
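
A rough sketch of that scale-out path, assuming the raw files follow Freddie Mac's pipe-delimited, headerless layout; the directory and file pattern are illustrative:

```python
# Hedged sketch: ingest every raw file found in the data directory.
from pathlib import Path
import pandas as pd

raw_dir = Path("data/raw")                         # hypothetical layout
frames = [
    pd.read_csv(path, sep="|", header=None, low_memory=False)
    for path in sorted(raw_dir.glob("*.txt"))      # all pools, all vintages
]
loans = pd.concat(frames, ignore_index=True)
print(f"Loaded {len(loans):,} loan records from {len(frames)} files")
```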

The fundamental constraints remain the same: single snapshot (not time series), same feature set, and similar class imbalance ratio of approximately 1% delinquent loans.

Data Expansion

Incorporating Fannie Mae Monthly Performance Data

Fannie Mae publishes a Single-Family Loan Performance dataset (20 to 50 GB per vintage) with a structure similar to Freddie Mac's. However, it includes one critical element that Freddie Mac does not provide: monthly reporting-period performance records. Each loan appears once per month with its current status, creating true time-series data.

With monthly snapshots, the model could be trained on data from month N and tested on whether it correctly predicts delinquency at month N+3. This is genuine forward-looking prediction rather than the current snapshot-based scoring approach. The expected AUC improvement would be significant, likely reaching 0.82 to 0.88, because temporal features capture how a loan's payment behavior changes over time.

Implementation would require modifying the ETL pipeline to reshape monthly performance records into a time-series format, as sketched below. The ML pipeline and downstream reporting would remain unchanged.
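
A sketch of that reshape, assuming tidy monthly rows with a loan ID, reporting month, and numeric delinquency status; all column and file names are assumptions about the Fannie Mae extract, not its actual schema:

```python
# Hedged sketch: build a forward-looking label from monthly rows.
import pandas as pd

perf = pd.read_csv("fannie_monthly_performance.csv")   # hypothetical extract
perf = perf.sort_values(["loan_id", "reporting_month"])

# For each loan, look up its delinquency status three months ahead.
perf["status_in_3m"] = perf.groupby("loan_id")["dlq_status"].shift(-3)

# Keep only rows with a full 3-month lookahead, then label them:
# train on month-N features, predict delinquency at month N+3.
perf = perf.dropna(subset=["status_in_3m"])
perf["target"] = (perf["status_in_3m"] >= 1).astype(int)
```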

Research Insight

Adding Local Economic Indicators (BLS Unemployment Data)

A study of 12 million residential mortgages across seven European countries found that local economic conditions, specifically the unemployment rate and house price indices, were among the most important variables explaining loan default. The study used the same feature types (credit score, LTV, interest rate) and the same ML models (logistic regression, random forest), and addressed the same prediction problem as this project.

The practical enhancement: join publicly available state-level unemployment data from the Bureau of Labor Statistics (BLS) to the loan dataset. This is a simple CSV download containing state, month, and unemployment rate. Joining on property_state would add one or two features capturing local economic stress, which the current model cannot detect from loan-level attributes alone.
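
A sketch of the join, assuming the BLS file has been saved as a tidy CSV with state, month, and unemployment rate columns; the file and column names are assumptions:

```python
# Hedged sketch: attach state-level unemployment to each loan.
import pandas as pd

loans = pd.read_csv("loan_level_detail.csv")       # hypothetical file name
bls = pd.read_csv("bls_state_unemployment.csv")    # state, month, unemployment_rate

loans = loans.merge(
    bls.rename(columns={"state": "property_state"}),
    on=["property_state", "month"],   # month = the loan's snapshot or origination month
    how="left",
)
```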

Read the paper: Forecasting Loan Default in Europe with Machine Learning

Research Insight

AutoML and Leakage-Aware Modeling (Fannie Mae Study)

A January 2026 study on Fannie Mae loan data focused on three challenges that directly overlap with this project: ambiguity in default labeling, severe class imbalance, and information leakage from temporal structure and post-event variables. Its leakage-aware approach validates the dual-model design used here, where payment history variables are excluded from the origination model to prevent the target variable from leaking into the input features.

The study compared Logistic Regression, Random Forest, XGBoost, LightGBM, and AutoGluon (an automated ML framework). AutoGluon achieved the strongest AUROC by automatically combining multiple models and optimizing their configurations. This suggests that adding automated model selection to the pipeline could extract additional performance without manual tuning.
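
A minimal sketch of what that could look like with AutoGluon's tabular API; the label column and split files are placeholders, not the project's schema:

```python
# Hedged sketch: let AutoGluon search, ensemble, and rank models.
import pandas as pd
from autogluon.tabular import TabularPredictor

train_df = pd.read_csv("train_origination.csv")   # hypothetical split files
test_df = pd.read_csv("test_origination.csv")

predictor = TabularPredictor(label="is_delinquent", eval_metric="roc_auc").fit(
    train_data=train_df,   # origination features only, preserving leakage control
    time_limit=3600,       # cap the search at one hour
)
print(predictor.leaderboard(test_df))   # per-model AUROC comparison
```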

Read the paper: Predicting Mortgage Default with ML, AutoML, Class Imbalance, and Leakage Control