Case Study 2

Explaining Sequential Decision-Making in Reinforcement Learning Systems

Making an RL trading policy auditable without giving up performance.

The challenge

Reinforcement learning can discover powerful trading strategies, but in regulated finance an opaque policy is a non-starter. Risk teams and regulators need to know why the policy chose buy, sell, or hold, which signals were decisive, what minimal changes would have led to a different action, and whether explanations can be documented and audited over time.

This case study explores how a high-performing RL portfolio strategy can be made transparent and auditable — not just accurate.

Our approach

We built an end-to-end pipeline around a PPO agent trained on daily Dow Jones data:

  1. Market state representation: structured state with prices, holdings, cash, and indicators (RSI, moving averages, Bollinger bands, CCI, ADX, and others).
  2. Reinforcement learning policy: PPO allocated capital across DJIA assets; in rolling tests (2014–2023) it beat buy-and-hold, suggesting the policy learned more than passive market exposure.
  3. Integrated Gradients attribution: decomposed each action into contributions from cash, holdings, prices, and indicators, yielding a sparse evidence profile.
  4. IG-guided transport counterfactuals: learned BUY/SELL regions, computed transport vectors constrained to IG-important features, and searched bidirectionally for minimal flips.
  5. Structured audit reports: combined attribution and counterfactuals into a one-page record for each trade (action, supporting/opposing evidence, market narrative, validated what-if).
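
Step 3 can be illustrated with a minimal sketch of Integrated Gradients. This is not the pipeline's code: the toy `policy_score` function, its weights, the feature names, and the all-zeros baseline are assumptions chosen so the example runs without a trained PPO network. It approximates the IG path integral with a midpoint Riemann sum and checks the completeness property (attributions sum to the change in score between baseline and state).

```python
import numpy as np

def policy_score(x, w, b):
    """Toy stand-in for a policy's SELL score (hypothetical, not the PPO network)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def policy_score_grad(x, w, b):
    """Analytic gradient of the toy score with respect to the state vector x."""
    s = policy_score(x, w, b)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    # Points on the straight line from the baseline to the actual state.
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([policy_score_grad(p, w, b) for p in path])
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical five-feature state; the values are illustrative only.
feature_names = ["cash", "holding", "price", "rsi", "cci"]
w = np.array([-1.2, 0.8, 0.05, 0.9, -0.6])
b = 0.1
x = np.array([0.3, 0.7, 0.5, 0.8, -0.2])
baseline = np.zeros_like(x)  # assumed baseline; IG results depend on this choice

attributions = integrated_gradients(x, baseline, w, b)

# Completeness: attributions should sum to score(x) - score(baseline).
completeness_gap = attributions.sum() - (
    policy_score(x, w, b) - policy_score(baseline, w, b)
)

for name, value in zip(feature_names, attributions):
    print(f"{name:8s} {value:+.4f}")
```

With a real policy network the analytic gradient would be replaced by automatic differentiation, and the number of steps would be chosen by monitoring the completeness gap.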

Example audit deliverables

For a single trade — selling GS on 13 September 2018 — the framework produces the following artefacts that can be reviewed by model risk and audit teams.

Figure: Feature attribution and decision balance for the SELL decision in GS, showing the top features driving and opposing the action.
Figure: Counterfactual analysis showing the minimal changes that would have flipped SELL → BUY.
Figure: Audit-style summary combining attribution, signals, and a market narrative with context.

Key findings

1) Performance and explainability can coexist

The PPO policy outperformed the benchmark across most test years while producing stable, daily IG explanations.

2) Cash and technical indicators dominate decisions

Cash and indicators accounted for most of the attributed influence, while prices contributed little. Cash (liquidity) was the dominant driver in several years (e.g., 2017).
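
One way to arrive at group-level statements like this is to sum absolute attributions per feature group and normalise. The sketch below assumes that approach; the group layout, the `group_shares` helper, and the attribution values are placeholders for illustration and are not results from the case study.

```python
import numpy as np

def group_shares(attributions, groups):
    """Share of total absolute attribution carried by each feature group."""
    abs_attr = np.abs(attributions)
    total = abs_attr.sum()
    shares = {}
    for name, idx in groups.items():
        shares[name] = float(abs_attr[idx].sum() / total) if total > 0 else 0.0
    return shares

# Placeholder attributions for one decision (not taken from the case study).
attributions = np.array([-0.40, 0.15, 0.02, 0.25, -0.18])

# Hypothetical mapping from group name to feature indices in the state vector.
groups = {"cash": [0], "holdings": [1], "prices": [2], "indicators": [3, 4]}

shares = group_shares(attributions, groups)
for name, share in shares.items():
    print(f"{name:10s} {share:.0%}")
```

Averaging these shares over all trading days in a year would give a per-year view of which group dominated.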

3) BUY and SELL behave differently

SELL explanations converged with fewer IG steps but required more features, suggesting that the sell-side policy boundary is locally sharper yet spread across more features.

4) Decision boundaries are sparse and structured

Successful flips typically touched ~5 features: usually cash + holdings, with indicators shaping direction and prices rarely needing to move.
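
The idea of a sparse flip can be sketched with a deliberately simplified search. This is a stand-in, not the transport-based method used in the case study: it assumes a toy linear policy, a hypothetical BUY prototype state as the target, and a bisection that is only valid when the action changes once along the search segment. Only the `k` features with the largest absolute importance are allowed to move.

```python
import numpy as np

def action(x, w, b):
    """Toy linear policy (hypothetical): BUY if the score is positive, else SELL."""
    return "BUY" if x @ w + b > 0 else "SELL"

def minimal_flip(x, target, importance, w, b, k=2, tol=1e-6):
    """Move only the k most important features toward `target` and bisect
    for the smallest step that changes the action. Returns None if even the
    full step does not flip the decision."""
    mask = np.zeros_like(x)
    mask[np.argsort(-np.abs(importance))[:k]] = 1.0
    direction = mask * (target - x)
    start = action(x, w, b)
    if action(x + direction, w, b) == start:
        return None
    lo, hi = 0.0, 1.0  # invariant: no flip at lo, flip at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if action(x + mid * direction, w, b) == start:
            lo = mid
        else:
            hi = mid
    return x + hi * direction

# Illustrative values only; none of these numbers come from the case study.
w = np.array([1.0, -0.5, 0.1, 0.6])
b = -0.2
x = np.array([0.1, 0.6, 0.5, -0.3])            # current state, classified SELL
target = np.array([0.9, 0.2, 0.5, 0.4])        # assumed BUY prototype state
importance = np.array([0.10, -0.30, 0.05, -0.18])  # e.g. IG attributions

cf = minimal_flip(x, target, importance, w, b, k=2)
print(action(x, w, b), "->", action(cf, w, b))
print("features changed:", int(np.sum(~np.isclose(cf, x))))
```

Any counterfactual found this way still needs the practitioner validation described under the limitations, since a state that flips the policy is not necessarily a plausible market state.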

5) Audit reports are feasible and informative

Combining action, evidence, and market context produces a narrative suitable for model-risk review and oversight.
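
A per-trade record of this kind could be represented as a small data structure. The sketch below is one possible shape, not the report format used in the case study: the `AuditRecord` fields, the sign convention (positive attribution supports the action), and all attribution values and text are assumptions or placeholders. Only the ticker, date, and action echo the GS example above.

```python
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    ticker: str
    date: str
    action: str
    supporting: dict = field(default_factory=dict)  # features pushing toward the action
    opposing: dict = field(default_factory=dict)    # features pushing against it
    narrative: str = ""
    what_if: str = ""
    what_if_validated: bool = False

def build_record(ticker, date, action, attributions, narrative, what_if, validated):
    """Split signed attributions into supporting and opposing evidence."""
    supporting = {k: v for k, v in attributions.items() if v > 0}
    opposing = {k: v for k, v in attributions.items() if v < 0}
    return AuditRecord(ticker, date, action, supporting, opposing,
                       narrative, what_if, validated)

def render(record):
    """Render the record as a short plain-text report."""
    def fmt(evidence):
        ranked = sorted(evidence.items(), key=lambda kv: -abs(kv[1]))
        return ", ".join(f"{name} ({value:+.2f})" for name, value in ranked) or "none"
    status = "validated" if record.what_if_validated else "unvalidated"
    return "\n".join([
        f"{record.date} {record.ticker} {record.action}",
        f"Supporting: {fmt(record.supporting)}",
        f"Opposing: {fmt(record.opposing)}",
        f"Narrative: {record.narrative}",
        f"What-if ({status}): {record.what_if}",
    ])

# Placeholder evidence; these numbers are not results from the case study.
record = build_record(
    "GS", "2018-09-13", "SELL",
    {"cash": 0.40, "rsi": 0.25, "holdings": -0.15, "price": -0.02},
    narrative="placeholder narrative",
    what_if="placeholder counterfactual",
    validated=False,
)
report = render(record)
print(report)
```

Keeping the validation status as an explicit field makes it visible in the record whether a counterfactual has been reviewed.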

6) Limitations are explicit

IG remains baseline-sensitive and non-causal; counterfactuals follow the learned geometry and require practitioner validation.
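
The baseline sensitivity follows directly from the standard definition of Integrated Gradients. In the notation below (introduced here for illustration, not taken from the case study), F is the policy output being explained, x the observed state, and x' the chosen baseline state:

```latex
\mathrm{IG}_i(x; x') = (x_i - x'_i) \int_0^1
  \frac{\partial F\bigl(x' + \alpha (x - x')\bigr)}{\partial x_i} \, d\alpha,
\qquad
\sum_i \mathrm{IG}_i(x; x') = F(x) - F(x').
```

The attributions therefore explain the change in output relative to x', so a different baseline generally yields a different evidence profile for the same trade.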

Why it matters

RL policies don't need to remain black boxes. They can provide stable, repeatable explanations, align with governance expectations, and reveal structure that traditional backtests miss.

For trading, risk, and other sequential decision systems, this is a blueprint for turning a policy network into something that can withstand model-risk committees and regulators.

Outcome

The final deliverable is an auditable decision layer: PPO performance, IG-based evidence for each action, validated counterfactual alternatives, and a compact explanation report tied to recognisable market conditions.

It turns “the agent decided to sell” into: “The agent sold because cash and short-term momentum signalled limited upside; here is the minimal state change that would have led it to buy instead.”