High-Trust Knee X-rays
Long-horizon agentic reasoning over multi-view studies with RLHF-aligned foundation models.
Kalyan Sivasailam & 5C Network Research
CEO & Founder, 5C Network
Abstract
Knee radiography is a deceptively difficult diagnostic task for AI systems. Many normal studies are easy, and many obviously abnormal studies are easy, but the clinically meaningful edge cases sit in the low-signal regime: mild osteoarthritis, small avulsion or chip fractures, tibial spine injuries, and contradictory evidence across views.
This paper presents an agentic knee X-ray interpretation system built for that regime — a long-horizon inference graph over an RLHF-aligned multimodal foundation model, organized around study assembly, per-view analysis, cross-view reconciliation, structured evidence reduction, triggered specialists, skeptical downgrade logic, and constrained report synthesis.
Internal Research Draft · 5C Network Research
TL;DR: Knee X-ray AI should not be a single-prompt image-to-report model. It should be a long-horizon agentic reasoning system that decomposes the study into bounded subproblems, preserves view identity, reconciles evidence across projections, reduces everything to a canonical evidence ledger, and constrains the final report to write only from that ledger. RLHF-aligned foundation models are the substrate, not the answer. The hardest regime — subtle abnormalities — is exactly why this architecture exists.
Why a Systems Paper, Not a Validation Trial
Knee X-rays are among the most common musculoskeletal imaging studies, yet robust automated interpretation remains challenging. The task is not difficult only because fractures exist or because osteoarthritis exists. It is difficult because the clinically important cases sit between clearly normal and clearly abnormal. A normal study may be straightforward. A severely osteoarthritic knee or grossly displaced fracture may also be straightforward. But the practical failure zone lies in subtle abnormalities: early degenerative change, tiny avulsion fragments, tibial spine injuries, occult fracture cues, and equivocal soft-tissue findings that require caution rather than optimism.
In clinical practice, radiologists do not solve this by looking once and speaking once. They reason through the study. They determine which projections are available. They understand which findings are best evaluated on frontal, lateral, or sunrise views. They compare projections. They revise weak impressions when the second view does not corroborate them. They mentally maintain an evidence ledger even if they never call it that.
This paper is motivated by the thesis that high-trust radiology AI should be built to behave more like that workflow. Rather than asking a single model prompt to directly read a knee study and produce a report, we construct a long-horizon agentic system that decomposes the task into bounded reasoning problems and then reassembles the results through explicit reduction and synthesis. The resulting architecture is designed not only to improve report quality, but to make failure modes measurable and tunable.
8 · Sub-problems a one-shot model collapses into a single uncontrolled generation
3 · Regimes the system handles distinctly: normal, obviously abnormal, subtle abnormal
4 · States in the evidence model: present · absent · indeterminate · not_assessable
From the paper. We frame this as a systems paper, not a clinical validation study or claim of radiologist equivalence.
Three Claims
The paper makes three architectural claims about how radiology AI should be built in the RLHF era.
Reasoning over evidence, not generation over pixels
Knee radiograph AI is better modeled as structured reasoning over evidence than as single-pass image-to-report generation. The hidden reasoning steps radiologists already perform — view inference, cross-view corroboration, weak-finding skepticism — should be externalized into explicit pipeline stages.
Why it matters
Failure modes become locatable: you can ask which stage missed a finding, not just whether the model was wrong.
RLHF is a substrate, not a solution
RLHF-era foundation models become substantially more useful for radiology when embedded inside explicit decomposition, verification, and skepticism mechanisms — rather than used monolithically. Alignment gives you instruction-following and revision; it does not give you radiology.
Why it matters
Vendor-neutral. The architecture works on top of any RLHF-aligned multimodal model. Swap the substrate, keep the orchestration.
Trust is engineering, not an adjective
Product-grade reliability depends on treating view structure, uncertainty state, and expert review feedback as first-class research objects — inspectable, regression-testable, and revisable. Not as hidden activations inside a single black-box prompt.
Why it matters
When a study is misread, the trace tells you whether the failure lives in a per-view agent, the reconciler, the ledger, or the report writer — and you can fix it there.
The Reasoning Graph
The system is organized as a long-horizon, stateful inference graph. Each stage converts model outputs into structured intermediate state before any final report is written.
Study Ingestion
Accepts one or more DICOM or image files for a knee study. Validates files, decodes DICOM, extracts laterality, generates previews.
Per-Image Preprocessing
View inference, composite-view splitting (AP/LAT composites are split into virtual per-view entries), and study-level grouping.
Study Assembly
A lateral view remains lateral; a frontal view stays frontal. View identity is preserved, not flattened. Multi-file studies stay as one object.
Core Per-View Agents
Bounded tasks per anatomy/technique: projection availability, alignment, osseous abnormality, joint space, soft tissue, patellar tracking, fragment vs. fabella, postoperative change, safety screening.
Cross-View Reconciliation
Evidence reconciliation, not pixel fusion. Each finding is marked corroborated, view-limited but plausible, contradicted by a stronger view, indeterminate, or not assessable.
Evidence Ledger Reduction
All outputs collapse into a canonical ledger Λk per finding: state, confidence, supporting evidence, opposing evidence, limitations, location, grade.
Triggered Specialists
Specialist agents (fracture, alignment, effusion, OA grading, aggressive osseous, patellar tracking, postop) run only when the ledger warrants it. Selective spend, not unconditional expansion.
Skeptic Pass
Deterministic downgrade and rescue rules. Downgrades effusion without suprapatellar distention. Rescues grade-1 OA when explicit degenerative language exists. Raises fracture from absent to indeterminate when secondary signs appear.
Constrained Report Synthesis
The report writer reads only from the final ledger. It does not re-read the image. This is deliberate: an unconstrained second reader would reintroduce findings that earlier stages had already rejected.
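The constraint is simple to state in code: the writer's only argument is the ledger, so there is no code path through which rejected findings can re-enter. The function below is a minimal sketch under assumed field names (`finding`, `state`, `grade`); the real synthesis stage is model-driven, but it reads from the same kind of structure.

```python
def write_report(ledger: list[dict]) -> str:
    """Constrained synthesis: the writer's only input is the final ledger.

    Hypothetical sketch. It never re-reads pixels, so findings that
    earlier stages rejected cannot reappear in the report.
    """
    positives = [r for r in ledger if r["state"] == "present"]
    uncertain = [r for r in ledger if r["state"] == "indeterminate"]
    lines = []
    for row in positives:
        grade = f" ({row['grade']})" if row.get("grade") else ""
        lines.append(f"{row['finding']}{grade}: present.")
    for row in uncertain:
        lines.append(f"{row['finding']}: indeterminate; correlate clinically.")
    if not lines:
        # Nothing survived reduction and skepticism: a concise normal report.
        lines.append("No acute osseous abnormality identified.")
    return "\n".join(lines)
```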
This architecture reflects how 5C Network's Bionic AI engine is built — agentic, stateful, and inspectable rather than monolithic. Read more about Generalised Medical AI and Hybrid Intelligence.
The Evidence Ledger
The ledger is the system's main scientific object. It is what specialists read, what the skeptic challenges, what the report writer is constrained to use, and what engineers inspect when something goes wrong.
For each target finding yk, the ledger stores seven fields. One row per finding, one ledger per study.
sk · state
present · absent · indeterminate · not_assessable
qk · confidence
Calibrated confidence for the asserted state.
E+k · supporting evidence
Per-view positive evidence with provenance.
E−k · opposing evidence
Per-view contradicting evidence with provenance.
Lk · limitations
View-availability and image-quality caveats.
ℓk · location
Anatomical localization when relevant.
gk · grade
Grade or severity where clinically meaningful (e.g. KL grade for OA).
The report writer reads only this. Traceability by construction.
Three Regimes, One Architecture
The system handles three regimes distinctly. The subtle-abnormal regime is the one that motivates almost all of the design choices.
Normal studies
Converges quickly. Projection and quality agents confirm adequacy, regional agents return absent or low-signal findings, the skeptic suppresses weak positives, the ledger collapses to a concise normal report.
Obviously abnormal studies
Advanced osteoarthritis, gross deformity, displaced fracture. Multiple agents emit concordant positives, cross-view reconciliation reinforces them, specialists characterize them, synthesis is straightforward.
Subtle abnormal studies
Mild or early osteoarthritis. Intercondylar eminence injuries. Small chip or avulsion fractures. Subtle patellar fractures. Equivocal effusion-related trauma cues. Almost every design choice exists to handle this regime without increasing false positives elsewhere.
Early failure pattern that motivated tuning
Early versions of the system occasionally let subtle osseous findings collapse into false-normal reports. The response was not to make the model more aggressive — that would have increased false positives across the board. Instead, two targeted changes were made: (1) stronger prompts for subtle osseous targets like patellar fracture, tibial spine injury, and tiny avulsion fragments; (2) cheap, deterministic rescue logic in the ledger and skeptic layers so secondary evidence prevents a false-normal collapse.
The Skeptic Pass
A deterministic regularization layer over the ledger. It controls overcall and undercall simultaneously without adding new full model stages.
Downgrade rules
- Downgrade effusion when no direct suprapatellar or capsular distention sign exists.
- Downgrade soft-tissue "effusions" confounded by skin-shadow or superficial overlap language.
Rescue rules
- Promote osteoarthritis when multiple direct degenerative components are present, even if the composite OA target was initially absent.
- Rescue grade-1 or early OA when explicit mild degenerative language exists.
- Raise fracture from absent to indeterminate when secondary signs suggest occult or subtle injury.
These refinements were implemented without adding extra full model stages. They live in the reduction and skepticism layer and therefore improve subtle-abnormal sensitivity without materially increasing read time.
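Because the skeptic is deterministic, it can be expressed as plain rule code over the ledger. The sketch below illustrates three of the rules above under an assumed dict-based ledger; the field names (`supporting`, `secondary_signs`) and the keyword matches are hypothetical stand-ins for the real evidence representation.

```python
def skeptic_pass(ledger: dict[str, dict]) -> dict[str, dict]:
    """Deterministic downgrade and rescue rules over the ledger.

    Illustrative only: each rule is one cheap evaluation, not a new
    model stage. Returns an updated copy; the input is left unchanged.
    """
    out = {k: dict(v) for k, v in ledger.items()}

    # Downgrade: effusion asserted without a direct distention sign.
    eff = out.get("effusion")
    if eff and eff["state"] == "present":
        if not any("suprapatellar" in e or "capsular distention" in e
                   for e in eff.get("supporting", [])):
            eff["state"] = "indeterminate"

    # Rescue: raise fracture from absent to indeterminate on secondary signs.
    frac = out.get("fracture")
    if frac and frac["state"] == "absent" and frac.get("secondary_signs"):
        frac["state"] = "indeterminate"

    # Rescue: promote OA when explicit degenerative language exists.
    oa = out.get("osteoarthritis")
    if oa and oa["state"] == "absent":
        if any("degenerative" in e for e in oa.get("supporting", [])):
            oa["state"] = "present"
            oa["grade"] = oa.get("grade") or "KL-1"
    return out
```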
Long-horizon does not mean slow
A clinically useful workstation cannot double or triple read latency every time sensitivity is tuned. The system follows three latency principles.
Triggered specialization, not unconditional expansion
Extra model calls are activated only when the ledger warrants them. Most studies never run the deep specialists.
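The trigger decision itself is cheap: a handful of predicates over ledger state. The sketch below assumes a dict-based ledger and illustrative trigger conditions; the specialist names and thresholds are hypothetical.

```python
def specialists_to_run(ledger: dict[str, dict]) -> list[str]:
    """Decide which specialist families to activate from ledger state alone.

    Illustrative trigger conditions. The point is selective spend:
    no specialist runs unconditionally, and a clean ledger runs none.
    """
    triggers = {
        "fracture": lambda l: l.get("fracture", {}).get("state")
                    in ("present", "indeterminate"),
        "oa_grading": lambda l: l.get("osteoarthritis", {}).get("state")
                    == "present",
        "effusion": lambda l: l.get("effusion", {}).get("state")
                    in ("present", "indeterminate"),
    }
    return [name for name, fires in triggers.items() if fires(ledger)]
```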
Cheap post-reduction rescue rules
Many subtle-OA and subtle-fracture rescues were implemented in the skepticism layer rather than as new full agent stages. Cost: one rule evaluation. Benefit: better subtle-abnormal recall.
View-aware routing
The system uses view identity to aim the right prompt at the right image, rather than asking every agent to reason over every rendering blindly. Patellar tracking goes to the sunrise view; effusion to the lateral.
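View-aware routing amounts to a lookup table from agent to preferred projections. The table below is a hypothetical sketch consistent with the examples in the text (patellar tracking to sunrise, effusion to lateral); the agent names and view lists are assumptions.

```python
# Hypothetical routing table: which projections each agent should read.
VIEW_ROUTING: dict[str, list[str]] = {
    "patellar_tracking": ["sunrise"],
    "effusion": ["lateral"],
    "joint_space": ["ap", "lateral"],
    "osseous_abnormality": ["ap", "lateral", "sunrise"],
}

def route(agents: list[str], available_views: set[str]) -> dict[str, list[str]]:
    """Aim each agent at the views it is meant to read.

    Agents whose preferred views are missing are skipped rather than
    run blindly over whatever renderings happen to exist.
    """
    plan: dict[str, list[str]] = {}
    for agent in agents:
        views = [v for v in VIEW_ROUTING.get(agent, []) if v in available_views]
        if views:
            plan[agent] = views
    return plan
```

A two-view study without a sunrise projection therefore simply never invokes the patellar-tracking agent, and its absence is recorded as a view-availability limitation rather than a negative finding.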
"Long-horizon does not have to mean slow. It means staged, selective, and stateful." — from the paper.
By the Numbers
9 · stages in the reasoning graph
14 · target finding families covered
7 · fields per ledger row
4 · states in the evidence model
7 · specialist families, triggered selectively
All figures describe the architecture of the system as documented in the paper. This is a systems paper, not a clinical validation study — see the limitations section below.
Limitations
The paper is explicit about what it does not claim.
Systems paper, not a clinical validation trial
No claim of radiologist-level performance or diagnostic equivalence. The contribution is architectural.
The orchestration layer is not a separately pretrained RL policy
The system leverages an RLHF-aligned foundation model plus deterministic control logic and targeted prompting.
Some heuristics remain explicit
Several safety and subtle-abnormal rescue rules are still implemented as reduction logic rather than learned policies. That is a deliberate engineering trade-off — auditability today, learnability tomorrow.
Performance remains most uncertain in the subtle-abnormal regime
That regime is exactly why the system exists, but it also remains the hardest place to make strong claims without a carefully curated evaluation set.
Multi-view quality and view availability still matter
No architecture can fully remove the information limits of projection radiography. A study with only a frontal view limits what any system, human or AI, can conclude about the lateral compartment.
Future Work
Six directions follow naturally from the architecture.
Prospective evaluation
Curated subtle-abnormal knee cohorts with sign-off outcomes.
Study-level calibration
Explicit calibration of system confidence against radiologist sign-off outcomes.
Learned trigger policies
Replace deterministic triggers for specialist activation with learned routing.
Ledger distillation
Distill the ledger and trace data into smaller domain-specific models.
Cross-joint extension
Apply the same evidence-led architecture to other musculoskeletal joints.
Expert feedback integration
Tighter integration of expert review into supervised and preference-style optimization loops.
Work with us under IRB approval
5C Network is partnering with leading radiology AI researchers and hospitals to validate the agentic knee X-ray system on independently curated cohorts. Bring your anonymised knee X-ray dataset and we will run it as a fully unbiased external evaluation — with co-authored publications, prospective reader studies, and IRB-approved validation as the collaboration formats.
Fully unbiased evaluation
You bring an anonymised dataset we have never seen. We evaluate on it. Results, traces, and ledgers are shared back in full.
Publication-grade rigor
Co-authored manuscripts, prospective reader studies, IRB-approved protocols, and independent benchmarks — not marketing decks.
Data stays yours
Fully anonymised under DTA. No PHI. No re-use beyond the agreed study. Federated evaluation possible on request.
Read the Complete Paper
"Toward High-Trust Knee Radiograph Interpretation: Long-Horizon Agentic Reasoning over Multi-View Studies with RLHF-Aligned Foundation Models"
By Kalyan Sivasailam & 5C Network Research