Hey, I'm

Amjad Ali

Data & AI Engineer focused on building reliable pipelines, intelligent systems, and real-world solutions.

Explore my projects, experience, and ongoing journey.

New York


Featured Project

ML Risk Model: Results Explorer

Dual-model approach: OriginRisk (origination-only predictor with a leakage-free AUC) + SegmentIQ (behavioral segmentation and risk ranking)

For complete methodology and architecture details, see the Technical Documentation.

AUC ROC (OriginRisk)
Capture at Top 10%
Capture at Top 20%
Risk Segments Scored
This chart shows the cumulative lift curve comparing the OriginRisk model against random selection. The gold line represents the model's performance: it plots the percentage of all delinquent loans captured when reviewing a given percentage of the portfolio, ranked by predicted risk score. The red dashed line represents random guessing, where reviewing 10% of loans would catch roughly 10% of delinquencies. OriginRisk is a logistic regression model trained exclusively on origination-time features such as credit score, loan-to-value ratio, debt-to-income ratio, interest rate, and loan age. No payment history or behavioral data is used, so no information derived from the delinquency outcome leaks into the input features. The more steeply the gold curve rises above the red baseline, the more effectively the model concentrates risk. In this portfolio, reviewing the top 20% of risk-ranked loans captures approximately 59% of all delinquencies, nearly three times the 20% that random selection would catch.
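The capture and lift figures quoted above can be reproduced with a few lines of code. The following is a minimal sketch, not the project's implementation: the function names and the ten-loan toy portfolio are assumptions made for illustration.

```python
def capture_at_top(scores, labels, fraction):
    """Share of all delinquent loans found in the top `fraction` by score."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("labels contain no delinquent loans")
    # Rank loans from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    captured = sum(label for _, label in ranked[:cutoff])
    return captured / total_positives


def lift_at_top(scores, labels, fraction):
    """Capture rate divided by the random-selection baseline (= fraction)."""
    return capture_at_top(scores, labels, fraction) / fraction


# Hypothetical toy portfolio: 10 loans, 2 delinquent (label 1).
scores = [0.91, 0.85, 0.40, 0.33, 0.30, 0.22, 0.15, 0.10, 0.08, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

capture = capture_at_top(scores, labels, 0.2)  # top 2 loans hold 1 of 2 delinquencies
lift = lift_at_top(scores, labels, 0.2)
print(f"capture={capture:.2f}, lift={lift:.2f}")
```

Applied to the figures on this page, a 59% capture at the top 20% corresponds to a lift of 0.59 / 0.20 = 2.95.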
This chart displays the relative importance of each feature used by the OriginRisk model to predict delinquency. Feature importance is measured by how much each variable contributes to the model's ability to distinguish delinquent loans from current ones. Higher values indicate stronger predictive power. Credit score at origination is the single strongest signal, followed by the number of borrowers on the loan (single-borrower loans carry more risk), origination interest rate (higher rates mean larger monthly payments), and debt-to-income ratio (borrowers who are already financially stretched default more often). These results are consistent with established credit risk theory and align with findings from large-scale European mortgage studies. The model uses only information available at the time of lending, meaning these risk drivers can inform underwriting decisions before a loan is funded.
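The page does not say which importance measure was used. One common choice for a logistic regression is the absolute value of the coefficients fitted on standardized features, sketched below. Everything here is an assumption for illustration: the data is synthetic, the feature names are placeholders, and the hand-rolled gradient descent stands in for whatever library the project actually used.

```python
import math
import random


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column)) or 1.0
    return [(x - mean) / std for x in column]


def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Plain batch gradient descent on the logistic log-loss."""
    weights, bias, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic data (hypothetical, not the portfolio's data).
random.seed(0)
n_loans = 500
raw = {
    "credit_score": [random.gauss(700, 50) for _ in range(n_loans)],
    "dti": [random.gauss(35, 8) for _ in range(n_loans)],
    "noise": [random.gauss(0, 1) for _ in range(n_loans)],
}
scaled = {name: standardize(col) for name, col in raw.items()}
names = list(scaled)
rows = [[scaled[name][i] for name in names] for i in range(n_loans)]

# Simulated outcome: low credit score and high DTI raise delinquency odds.
labels = []
for credit_z, dti_z, _ in rows:
    p = 1.0 / (1.0 + math.exp(-(-2.0 - 1.5 * credit_z + 0.5 * dti_z)))
    labels.append(1 if random.random() < p else 0)

weights, _ = fit_logistic(rows, labels)
# Rank features by the magnitude of their standardized coefficient.
ranking = sorted(zip(names, weights), key=lambda item: abs(item[1]), reverse=True)
for name, coef in ranking:
    print(f"{name:>12}  coef={coef:+.2f}")
```

Standardizing first matters: without it, coefficient magnitudes reflect each feature's units rather than its predictive strength.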
This table shows the highest-risk loan segments identified by the SegmentIQ model, which uses all available features, including payment history, to score current portfolio risk. Each row represents a unique combination of credit score band, loan-to-value bucket, interest rate range, and origination vintage. The "Risk Score" column shows the model's average predicted delinquency probability for that segment, while "Actual DLQ" shows the observed delinquency rate. Segments where both values are high represent concentrated pockets of risk where loss mitigation resources would have the greatest impact. Subprime and Fair credit borrowers from the 2022 and 2023 vintages with elevated interest rates consistently appear at the top, confirming that the combination of weaker credit profiles and rate environment stress produces the highest delinquency concentrations.
Credit Band | LTV | Rate | Vintage | Loans | Risk Score | Actual DLQ
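The segment table described above amounts to a group-by over four bucketed attributes. The sketch below shows one way to build it; the field names (`credit_band`, `ltv_bucket`, `rate_range`, `vintage`, `predicted_prob`, `delinquent`) and the four sample loans are hypothetical and not taken from the project.

```python
from collections import defaultdict

SEGMENT_KEYS = ("credit_band", "ltv_bucket", "rate_range", "vintage")


def score_segments(loans, keys=SEGMENT_KEYS):
    """Aggregate loan-level scores into one row per segment, riskiest first."""
    buckets = defaultdict(list)
    for loan in loans:
        buckets[tuple(loan[k] for k in keys)].append(loan)

    rows = []
    for segment, members in buckets.items():
        n = len(members)
        rows.append({
            "segment": segment,
            "loans": n,
            # Average predicted delinquency probability ("Risk Score").
            "risk_score": sum(m["predicted_prob"] for m in members) / n,
            # Observed delinquency rate ("Actual DLQ").
            "actual_dlq": sum(m["delinquent"] for m in members) / n,
        })
    rows.sort(key=lambda row: row["risk_score"], reverse=True)
    return rows


# Hypothetical sample loans.
loans = [
    {"credit_band": "Subprime", "ltv_bucket": "90+", "rate_range": "7%+",
     "vintage": 2023, "predicted_prob": 0.30, "delinquent": 1},
    {"credit_band": "Subprime", "ltv_bucket": "90+", "rate_range": "7%+",
     "vintage": 2023, "predicted_prob": 0.20, "delinquent": 0},
    {"credit_band": "Prime", "ltv_bucket": "<80", "rate_range": "<5%",
     "vintage": 2021, "predicted_prob": 0.01, "delinquent": 0},
    {"credit_band": "Prime", "ltv_bucket": "<80", "rate_range": "<5%",
     "vintage": 2021, "predicted_prob": 0.03, "delinquent": 0},
]

table = score_segments(loans)
for row in table:
    print(row["segment"], row["loans"],
          f"{row['risk_score']:.2f}", f"{row['actual_dlq']:.2f}")
```

Comparing `risk_score` with `actual_dlq` per row is also a quick calibration check: large gaps suggest the model over- or under-predicts for that segment.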
Single snapshot limitation.
The dataset is a single monthly snapshot, not a time series. True forward prediction requires sequential monthly data. The origination model predicts current delinquency status from the borrower profile, which is valuable for risk stratification but is not a deployment-ready forecasting tool.
Class imbalance (99:1).
Only about 1% of loans are delinquent, so precision is inherently low. This is addressed with balanced class weights, and the model is evaluated using lift metrics rather than raw precision.
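To make the imbalance concrete, the sketch below computes balanced class weights with the usual heuristic, n_samples / (n_classes * class_count), and shows why precision stays low even with good lift. The helper names are assumptions, and the 59%, 20%, and 1% inputs are the approximate figures quoted on this page.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def precision_at_top(capture, base_rate, fraction):
    """Share of reviewed loans that are delinquent, given a capture rate."""
    return capture * base_rate / fraction


# 99 current loans for every delinquent one.
labels = [0] * 99 + [1]
weights = balanced_class_weights(labels)
print(weights)  # the minority class gets 99x the weight of the majority

# Approximate figures from this page: 59% capture at top 20%, ~1% base rate.
precision = precision_at_top(capture=0.59, base_rate=0.01, fraction=0.20)
print(f"precision at top 20%: {precision:.2%}")
```

With these weights both classes contribute equally to the training loss. The precision result, roughly 3%, shows why lift is the more informative metric here: it is about three times the 1% base rate even though the absolute value looks small.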
What would improve it.
Monthly time series data, additional features (employment, payment amounts, forbearance history), gradient boosting ensembles, and a calibration layer for dollar loss estimation.
Why the dual-model design matters.
The first attempt produced an AUC of 1.0 because of data leakage. The redesign separates OriginRisk (origination-only features, AUC ~0.77), which gives a leakage-free prediction, from SegmentIQ (behavioral features), which is used for segmentation. This distinction is what mortgage analytics teams value.
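An AUC of exactly 1.0 is a strong hint that some input encodes the target. The page does not describe how the leakage was found; the sketch below shows one simple screen, which computes the AUC of each feature on its own and flags any that separate the classes almost perfectly. The feature names (`months_past_due`, `credit_score`) and values are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: chance a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def flag_leaky_features(columns, labels, threshold=0.99):
    """Return features whose single-feature AUC is suspiciously near 0 or 1."""
    flagged = []
    for name, values in columns.items():
        score = auc(values, labels)
        if max(score, 1.0 - score) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical snapshot: the last two loans are delinquent.
labels = [0, 0, 0, 0, 1, 1]
columns = {
    # Behavioral field that restates the target: perfect separation.
    "months_past_due": [0, 0, 0, 0, 2, 3],
    # Origination field: informative but far from perfect.
    "credit_score": [720, 650, 700, 610, 640, 690],
}

print(flag_leaky_features(columns, labels))
```

A flagged feature is not automatically leakage, but it deserves a check against when the value becomes known relative to the outcome. The pairwise loop is quadratic, so a sort-based AUC would be preferable on a full portfolio.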

Other Projects

Career & Education

July 2025 — Present
Research Assistant (Volunteering)
University of New Haven · West Haven, CT
LLM-assisted software security research involving vulnerability detection, multilingual code analysis across 6 languages, and explainable automated repair workflows for real codebases.
August 2023 — May 2025
MS in Data Science (STEM)
University of New Haven
AWS (Athena, Glue, S3, Lambda), Power BI, NLP, Math for Data Scientists.
June 2024 — August 2024
Solutions Engineering Intern
Bitwise Inc. · Schaumburg, IL
Built a Python GenAI service with LangChain orchestration, multi-LLM routing, and a SQL-to-PySpark conversion pipeline. Led the cross-team final presentation across frontend, backend, and GenAI tracks.
August 2021 — August 2023
Senior Analyst
Capgemini
Fortune 50 licensing analytics with 50+ Snowflake SQL scripts, ETL migration from Informatica to AWS/Snowflake, and end-to-end data validation across S3, DB2, and Snowflake.
March 2021 — May 2021
Data Analyst Intern
DevTown (Shape AI)
Supervised ML models on labeled datasets achieving 89.5% classification accuracy and 92.9% fraud recall. Deployed regression model as interactive web application.
August 2017 — May 2021
BE in Electronics & Telecommunication
University of Pune
AI, Machine Learning, Data Structures & Algorithms, OOP, SQL.

Credentials & Badges

Professional Certifications
OCI Generative AI Professional
OCI AI Vector Search Professional
OCI Autonomous Database Professional
Azure Fundamentals (AZ-900)
Azure Data Fundamentals (DP-900)
Azure AI Fundamentals (AI-900)
HackerRank SQL (Advanced) Certificate
Platform Achievements
CodeSignal Streak: 244 days
Snowflake Data Engineering Bootcamp
Snowflake Gen AI Bootcamp
Microsoft AI Skills Fest (GWR)
Coding Badges
LeetCode Pandas 15
HackerRank SQL Gold
HackerRank 30 Days Gold
Tools & Technologies
Practiced on CodeSignal