Explaining Sequential Decision-Making in Reinforcement Learning Systems
Making an RL trading policy auditable without giving up performance.
The challenge
Reinforcement learning can discover powerful trading strategies, but in regulated finance an opaque policy is a non-starter. Risk teams and regulators need to know why the policy chose buy, sell, or hold, which signals were decisive, what minimal changes would have led to a different action, and whether explanations can be documented and audited over time.
This case study explores how a high-performing RL portfolio strategy can be made transparent and auditable — not just accurate.
Our approach
We built an end-to-end pipeline around a PPO agent trained on daily Dow Jones data (illustrative code sketches for each component follow the list):
- Market state representation: structured state with prices, holdings, cash, and indicators (RSI, moving averages, Bollinger bands, CCI, ADX, and others).
- Reinforcement learning policy: a PPO agent allocated capital across DJIA constituents; in rolling tests (2014–2023) it outperformed buy-and-hold, indicating that the policy learned non-trivial allocation behaviour.
- Integrated Gradients attribution: decomposed each action into contributions from cash, holdings, prices, and indicators, yielding a sparse evidence profile.
- IG-guided transport counterfactuals: learned BUY/SELL regions, computed transport vectors constrained to IG-important features, and searched bidirectionally along them for the smallest state change that flips the action.
- Structured audit reports: combined attribution and counterfactuals into a one-page record for each trade (action, supporting/opposing evidence, market narrative, validated what-if).
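The sketches that follow are illustrative rather than the project's actual code. First, the state construction: a minimal sketch assuming hypothetical indicator column names and a simple [cash, holdings, prices, indicators] layout.

```python
import numpy as np
import pandas as pd

# Assumed indicator column names; the real pipeline's naming and scaling may differ.
INDICATORS = ["rsi_14", "sma_20", "boll_ub", "boll_lb", "cci_20", "adx_14"]

def build_state(day: pd.DataFrame, cash: float, holdings: np.ndarray) -> np.ndarray:
    """Flatten one trading day (one row per DJIA constituent) into a state vector:
    [cash, holdings per asset, close prices, technical indicators per asset]."""
    prices = day["close"].to_numpy(dtype=np.float32)
    indicators = day[INDICATORS].to_numpy(dtype=np.float32).flatten()
    return np.concatenate(([cash], holdings, prices, indicators)).astype(np.float32)
```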
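A sketch of the policy-training step, assuming the stable-baselines3 PPO implementation and a stand-in environment; the real DJIA trading environment, reward shaping, and rolling retraining schedule are not reproduced here.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO

class ToyTradingEnv(gym.Env):
    """Stand-in environment: random returns, action = target portfolio weights."""
    def __init__(self, n_assets: int = 5, horizon: int = 252):
        super().__init__()
        self.n_assets, self.horizon = n_assets, horizon
        self.observation_space = spaces.Box(-np.inf, np.inf, (n_assets + 1,), np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, (n_assets,), np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.prices = 0, np.ones(self.n_assets, dtype=np.float32)
        return self._obs(), {}

    def _obs(self):
        return np.concatenate(([1.0], self.prices)).astype(np.float32)  # [cash, prices]

    def step(self, action):
        returns = self.np_random.normal(0.0, 0.01, self.n_assets).astype(np.float32)
        self.prices *= 1.0 + returns
        self.t += 1
        reward = float(np.dot(action, returns))          # P&L of the chosen weights
        return self._obs(), reward, self.t >= self.horizon, False, {}

model = PPO("MlpPolicy", ToyTradingEnv(), seed=0, verbose=0)
model.learn(total_timesteps=10_000)   # rolling tests would retrain per window
```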
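For the attribution step, a minimal Integrated Gradients sketch over a PyTorch policy head (a library such as Captum could be used instead). The zero baseline and the number of path steps are assumptions; as noted in the limitations below, the baseline choice materially affects the result.

```python
import torch
import torch.nn as nn

def integrated_gradients(policy: nn.Module, state: torch.Tensor, action_idx: int,
                         baseline=None, steps: int = 64) -> torch.Tensor:
    """Attribute the chosen action's score to each state feature."""
    baseline = torch.zeros_like(state) if baseline is None else baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (state - baseline)        # straight-line path baseline -> state
    path.requires_grad_(True)
    scores = policy(path)[:, action_idx]                 # chosen action's score at each path point
    grads = torch.autograd.grad(scores.sum(), path)[0]
    return (state - baseline) * grads.mean(dim=0)        # Riemann approximation of the IG integral

# Toy usage: a 12-feature state and a 3-action (buy/hold/sell) policy head.
policy = nn.Sequential(nn.Linear(12, 32), nn.Tanh(), nn.Linear(32, 3))
state = torch.randn(12)
attributions = integrated_gradients(policy, state, action_idx=2)   # explain the SELL score
```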
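For the counterfactual step, a simplified sketch in the spirit of the IG-guided transport search: only the features with the largest IG magnitude are allowed to move toward a target state, and the search stops at the first perturbation that changes the greedy action. The one-directional scan and the choice of target are simplifications of the bidirectional method described above.

```python
import torch

def minimal_flip(policy, state, attributions, target_state, current_action,
                 k: int = 5, max_steps: int = 50):
    """Move only the top-k IG features along the transport direction towards
    target_state; return the first state whose greedy action differs, or None."""
    mask = torch.zeros_like(state)
    mask[attributions.abs().topk(k).indices] = 1.0       # restrict movement to IG-important features
    direction = (target_state - state) * mask            # transport vector
    with torch.no_grad():
        for step in range(1, max_steps + 1):
            candidate = state + direction * step / max_steps
            action = policy(candidate.unsqueeze(0)).argmax(dim=-1).item()
            if action != current_action:
                return candidate, action
    return None, current_action

# Usage, continuing the IG example above (the target point is purely illustrative):
# flipped, new_action = minimal_flip(policy, state, attributions,
#                                    target_state=state + 1.0, current_action=2)
```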
Example audit deliverables
For a single trade, a SELL of GS on 13 September 2018, the framework produces a one-page set of artefacts (the chosen action, supporting and opposing evidence, a market narrative, and a validated what-if) that model risk and audit teams can review.
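As an illustration of the record itself, a minimal sketch of how such a per-trade entry might be structured in code; the field names and report layout are assumptions, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TradeAuditRecord:
    date: str                           # e.g. "2018-09-13"
    ticker: str                         # e.g. "GS"
    action: str                         # "BUY", "SELL", or "HOLD"
    supporting_evidence: dict           # feature -> positive IG contribution
    opposing_evidence: dict             # feature -> negative IG contribution
    market_narrative: str               # plain-language summary of the day's conditions
    counterfactual: dict | None = None  # validated minimal state change that flips the action

    def to_report(self) -> str:
        lines = [
            f"{self.date} {self.ticker}: {self.action}",
            "Supporting: " + ", ".join(f"{k} ({v:+.3f})" for k, v in self.supporting_evidence.items()),
            "Opposing: " + ", ".join(f"{k} ({v:+.3f})" for k, v in self.opposing_evidence.items()),
            "Context: " + self.market_narrative,
        ]
        if self.counterfactual:
            lines.append("What-if: " + str(self.counterfactual))
        return "\n".join(lines)
```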
Key findings
1) Performance and explainability can coexist
The PPO policy outperformed the benchmark across most test years while producing stable, daily IG explanations.
2) Cash and technical indicators dominate decisions
Cash and technical indicators accounted for most of the attributed influence, while prices contributed little; in several years (e.g., 2017), liquidity alone was the dominant driver.
3) BUY and SELL behave differently
SELL explanations converged in fewer IG steps but required more features, indicating a sell-side policy boundary that is sharper (faster attribution convergence) yet spread across more of the state.
4) Decision boundaries are sparse and structured
Successful flips typically touched ~5 features: usually cash + holdings, with indicators shaping direction and prices rarely needing to move.
5) Audit reports are feasible and informative
Combining action, evidence, and market context produces a narrative suitable for model-risk review and oversight.
6) Limitations are explicit
IG remains baseline-sensitive and non-causal; counterfactuals follow the learned geometry and require practitioner validation.
Why it matters
RL policies don't need to remain black boxes. They can provide stable, repeatable explanations, align with governance expectations, and reveal structure that traditional backtests miss.
For trading, risk, and other sequential decision systems, this is a blueprint for turning a policy network into something that can withstand model-risk committees and regulators.
Outcome
The final deliverable is an auditable decision layer: PPO performance, IG-based evidence for each action, validated counterfactual alternatives, and a compact explanation report tied to recognisable market conditions.
It turns “the agent decided to sell” into: “The agent sold because cash and short-term momentum signalled limited upside; here is the minimal state change that would have led it to buy instead.”