
Ensuring Consistency and Reliability in Scoring Models: A Python Guide to Monotonicity and Stability Checks

Last updated: 2026-05-03 20:15:43 · Data Science

Introduction

Scoring models are vital tools in risk assessment, but their predictive power depends on the quality of the input variables. Two key properties to validate are monotonicity—whether the relationship between a variable and the risk outcome is consistently directional—and stability—whether that relationship holds over time. In this article, we explore how to leverage Python to test these properties and ensure your variables tell a consistent risk story.

Ensuring Consistency and Reliability in Scoring Models: A Python Guide to Monotonicity and Stability Checks
Source: towardsdatascience.com

Why Monotonicity Matters

Monotonicity ensures that as a variable increases (or decreases), the predicted risk changes in a predictable direction. For instance, in credit scoring, higher income should monotonically lower default risk. A non-monotonic pattern—where risk increases, then decreases—can signal data issues or model misspecification. Validating monotonicity builds stakeholder trust and supports regulatory compliance.

Testing Monotonicity in Python

You can assess monotonicity by binning a continuous variable and computing the average risk per bin. Use pandas to create quantile bins, then check whether the mean outcome (e.g., default rate) increases or decreases consistently across bins. Back this up with a non-parametric test such as the Spearman rank correlation (scipy.stats.spearmanr) or an isotonic regression fit (sklearn.isotonic).

import pandas as pd
import numpy as np
from scipy.stats import spearmanr

# Assume df has columns 'var' (continuous) and 'risk_flag' (0/1 outcome)
binned = pd.qcut(df['var'], q=10, duplicates='drop')
mean_risk = df.groupby(binned, observed=True)['risk_flag'].mean()

# Check monotonicity: rank-correlate bin order with the mean risk per bin
corr, pval = spearmanr(range(len(mean_risk)), mean_risk)
print(f'Spearman rho: {corr:.3f}, p-value: {pval:.3f}')

A high absolute correlation (e.g., |rho| > 0.9) together with a low p-value indicates strong monotonicity. For more rigorous checks, consider the scorecardpy package, whose woebin function supports monotonic WOE-based binning.
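The isotonic-regression route mentioned above can be sketched as follows. The data here is simulated purely for illustration (the names `var` and `risk_flag` mirror the snippet above); the idea is to fit a monotone curve to the bin means and treat its R² as a monotonicity score:

```python
import numpy as np
import pandas as pd
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)
df = pd.DataFrame({'var': rng.normal(size=5000)})
# Simulated outcome: default probability falls as 'var' rises
df['risk_flag'] = (rng.random(5000) < 1 / (1 + np.exp(df['var']))).astype(int)

# Bin-level mean risk, exactly as in the Spearman check
binned = pd.qcut(df['var'], q=10, duplicates='drop')
mean_risk = df.groupby(binned, observed=True)['risk_flag'].mean().to_numpy()

# Fit a monotone curve to the bin means; an R^2 close to 1 means a
# monotone function explains almost all of the bin-level pattern
iso = IsotonicRegression(increasing='auto')
fitted = iso.fit_transform(np.arange(len(mean_risk)), mean_risk)
ss_res = np.sum((mean_risk - fitted) ** 2)
ss_tot = np.sum((mean_risk - mean_risk.mean()) ** 2)
iso_r2 = 1 - ss_res / ss_tot
print(f'Isotonic R^2: {iso_r2:.3f}')
```

Unlike Spearman, this score degrades gracefully: a single out-of-order bin lowers R² in proportion to how far it deviates from the monotone fit.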

The Importance of Stability

Stability ensures that variable definitions or relationships do not shift over time, which could degrade model performance. Population Stability Index (PSI) is a common metric that quantifies distributional drift between a baseline and current sample. A PSI below 0.1 signals stable variables; above 0.25 suggests significant shift.

Calculating PSI in Python

Compute PSI by binning the baseline distribution, then comparing expected and actual proportions:

def calculate_psi(expected, actual, bins=10):
    # Bin the baseline sample, then apply the SAME edges to the current
    # sample -- quantile-binning each sample separately would put ~1/bins
    # of observations in every bin of both and always yield a PSI near zero.
    expected_binned, edges = pd.qcut(expected, q=bins, labels=False,
                                     retbins=True, duplicates='drop')
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range values
    actual_binned = pd.cut(actual, bins=edges, labels=False)
    psi = 0.0
    for i in range(len(edges) - 1):
        # Guard against empty bins before taking the log
        p_i = max(np.mean(expected_binned == i), 0.001)
        q_i = max(np.mean(actual_binned == i), 0.001)
        psi += (q_i - p_i) * np.log(q_i / p_i)
    return psi

Apply this across time periods using rolling windows. Visualize trends with matplotlib to spot sudden jumps.


Putting It All Together: A Validation Workflow

Step 1: Data Preparation

Load your historical data with a timestamp column (e.g., observation_date). Split into baseline (e.g., first year) and current (last quarter).
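A minimal sketch of that split, assuming a column named `observation_date` and simulated data standing in for a real portfolio:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 8000
# Two years of simulated observations with a timestamp column
df = pd.DataFrame({
    'observation_date': pd.to_datetime('2023-01-01')
                        + pd.to_timedelta(rng.integers(0, 730, n), unit='D'),
    'income': rng.lognormal(10, 0.5, n),
})

# Baseline = first year of history, current = most recent quarter
cutoff_baseline = df['observation_date'].min() + pd.DateOffset(years=1)
cutoff_current = df['observation_date'].max() - pd.DateOffset(months=3)
baseline = df[df['observation_date'] < cutoff_baseline]
current = df[df['observation_date'] >= cutoff_current]
print(len(baseline), len(current))
```

Keeping the two windows disjoint avoids understating drift: overlapping samples share observations and pull the PSI toward zero.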

Step 2: Monotonicity Check

For each numeric variable, compute the monotonicity score (e.g., Spearman rho). Flag variables where |rho| < 0.8 or p-value > 0.05.
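That flagging rule can be sketched as a loop over candidate variables. The data below is simulated so that `income` drives the outcome while `noise` does not; the thresholds (0.8 and 0.05) are the ones stated above:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n = 5000
df = pd.DataFrame({'income': rng.normal(size=n),
                   'noise': rng.normal(size=n)})
# Outcome depends on income only; 'noise' should fail the check
df['risk_flag'] = (rng.random(n) < 1 / (1 + np.exp(df['income']))).astype(int)

flags = {}
for col in ['income', 'noise']:
    binned = pd.qcut(df[col], q=10, duplicates='drop')
    mean_risk = df.groupby(binned, observed=True)['risk_flag'].mean()
    rho, pval = spearmanr(range(len(mean_risk)), mean_risk)
    flags[col] = bool(abs(rho) < 0.8 or pval > 0.05)  # True = review needed
print(flags)
```

Flagged variables are candidates for re-binning or exclusion rather than automatic removal; a genuine U-shaped risk pattern may simply need a transformed encoding.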

Step 3: Stability Check

For each variable, calculate PSI over consecutive time windows. Create a heatmap using seaborn to visualize PSI across variables and time periods.
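One way to assemble that variables-by-quarters PSI table, using a self-contained PSI helper that bins the baseline once and reuses its edges, and simulated quarterly samples in which only `debt_ratio` drifts in Q4:

```python
import numpy as np
import pandas as pd

def psi(expected, actual, bins=10):
    # Bin the baseline, reuse its edges for the comparison sample
    exp_codes, edges = pd.qcut(expected, q=bins, labels=False,
                               retbins=True, duplicates='drop')
    edges[0], edges[-1] = -np.inf, np.inf
    act_codes = pd.cut(actual, bins=edges, labels=False)
    k = len(edges) - 1
    p = np.maximum([np.mean(exp_codes == i) for i in range(k)], 1e-4)
    q = np.maximum([np.mean(act_codes == i) for i in range(k)], 1e-4)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(3)
variables = ['income', 'age', 'debt_ratio']
baseline = {v: rng.normal(0, 1, 5000) for v in variables}
# Simulated quarters; only debt_ratio shifts in Q4
current = {(v, t): rng.normal(0.4 if (v == 'debt_ratio' and t == 4) else 0.0,
                              1, 5000)
           for v in variables for t in range(1, 5)}

psi_table = pd.DataFrame(
    {f'Q{t}': [psi(baseline[v], current[(v, t)]) for v in variables]
     for t in range(1, 5)},
    index=variables,
)
print(psi_table.round(3))
```

Passing this table to `seaborn.heatmap(psi_table, annot=True)` then renders the variables-by-periods grid described above, with the drifting cell standing out.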

Step 4: Reporting

Generate an HTML report with tables (<table>) listing monotonicity scores and PSI values, and use internal anchor links to jump between sections.
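A bare-bones sketch of such a report using pandas' `to_html` and `id` anchors. The numbers are illustrative: income's ρ=0.98 and debt_ratio's PSI=0.18 match the example results quoted in the sections below, while the other values are made up for the demo:

```python
import pandas as pd

mono = pd.DataFrame({'variable': ['income', 'age'],
                     'spearman_rho': [0.98, 0.71]})
psi = pd.DataFrame({'variable': ['income', 'debt_ratio'],
                    'psi_last_quarter': [0.04, 0.18]})

# Anchor links at the top jump to the id'd section headings below
html = f"""<html><body>
<h1>Variable Validation Report</h1>
<p><a href="#mono">Monotonicity</a> | <a href="#psi">Stability</a></p>
<h2 id="mono">Monotonicity</h2>
{mono.to_html(index=False)}
<h2 id="psi">Stability (PSI)</h2>
{psi.to_html(index=False)}
</body></html>"""

with open('validation_report.html', 'w') as f:
    f.write(html)
```

The same tables can be styled or color-coded against the 0.8/0.1/0.25 thresholds with `DataFrame.style` before exporting, if a richer report is needed.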

Monotonicity Details

Below are results for key variables in an example credit model. Variable income shows ρ=0.98, confirming monotonicity. Variable age shows a mild non-monotonic pattern at younger ages, requiring binning adjustment.

Stability Analysis

PSI values for all variables over four quarters: most remain below 0.1. Variable debt_ratio shows PSI=0.18 in the last quarter, warranting investigation into data collection changes.

Conclusion

By systematically checking monotonicity and stability in Python, you can build more robust scoring models. These validations help maintain model performance over time and satisfy regulatory expectations. Start integrating these checks into your model development pipeline today.