Systems Paper

High-Trust Knee X-rays

Long-horizon agentic reasoning over multi-view studies with RLHF-aligned foundation models.

11 pages · 9-stage reasoning graph · 14 finding families · 4-state evidence model

Abstract

Knee radiography is a deceptively difficult diagnostic task for AI systems. Many normal studies are easy, and many obviously abnormal studies are easy, but the clinically meaningful edge cases sit in the low-signal regime: mild osteoarthritis, small avulsion or chip fractures, tibial spine injuries, and contradictory evidence across views.

This paper presents an agentic knee X-ray interpretation system built for that regime — a long-horizon inference graph over an RLHF-aligned multimodal foundation model, organized around study assembly, per-view analysis, cross-view reconciliation, structured evidence reduction, triggered specialists, skeptical downgrade logic, and constrained report synthesis.

Internal Research Draft · 5C Network Research

TL;DR: Knee X-ray AI should not be a single-prompt image-to-report model. It should be a long-horizon agentic reasoning system that decomposes the study into bounded subproblems, preserves view identity, reconciles evidence across projections, reduces everything to a canonical evidence ledger, and constrains the final report to write only from that ledger. RLHF-aligned foundation models are the substrate, not the answer. The hardest regime — subtle abnormalities — is exactly why this architecture exists.

By Kalyan Sivasailam & 5C Network Research · Systems paper · 11 pages · Agentic radiology architecture

Why a Systems Paper, Not a Validation Trial

Knee X-rays are among the most common musculoskeletal imaging studies, yet robust automated interpretation remains challenging. The task is not difficult only because fractures exist or because osteoarthritis exists. It is difficult because the clinically important cases sit between clearly normal and clearly abnormal. A normal study may be straightforward. A severely osteoarthritic knee or grossly displaced fracture may also be straightforward. But the practical failure zone lies in subtle abnormalities: early degenerative change, tiny avulsion fragments, tibial spine injuries, occult fracture cues, and equivocal soft-tissue findings that require caution rather than optimism.

In clinical practice, radiologists do not solve this by looking once and speaking once. They reason through the study. They determine which projections are available. They understand which findings are best evaluated on frontal, lateral, or sunrise views. They compare projections. They revise weak impressions when the second view does not corroborate them. They mentally maintain an evidence ledger even if they never call it that.

This paper is motivated by the thesis that high-trust radiology AI should be built to behave more like that workflow. Rather than asking a single model prompt to directly read a knee study and produce a report, we construct a long-horizon agentic system that decomposes the task into bounded reasoning problems and then reassembles the results through explicit reduction and synthesis. The resulting architecture is designed not only to improve report quality, but to make failure modes measurable and tunable.

8 · Sub-problems a one-shot model collapses into a single uncontrolled generation

3 · Regimes the system handles distinctly: normal, obviously abnormal, subtle abnormal

4 · States in the evidence model: present · absent · indeterminate · not_assessable

From the paper. We frame this as a systems paper, not a clinical validation study or claim of radiologist equivalence.

Three Claims

The paper makes three architectural claims about how radiology AI should be built in the RLHF era.

Claim 01

Reasoning over evidence, not generation over pixels

Knee radiograph AI is better modeled as structured reasoning over evidence than as single-pass image-to-report generation. The hidden reasoning steps radiologists already perform — view inference, cross-view corroboration, weak-finding skepticism — should be externalized into explicit pipeline stages.

Why it matters

Failure modes become locatable: you can ask which stage missed a finding, not just whether the model was wrong.

Claim 02

RLHF is a substrate, not a solution

RLHF-era foundation models become substantially more useful for radiology when embedded inside explicit decomposition, verification, and skepticism mechanisms — rather than used monolithically. Alignment gives you instruction-following and revision; it does not give you radiology.

Why it matters

Vendor-neutral. The architecture works on top of any RLHF-aligned multimodal model. Swap the substrate, keep the orchestration.

Claim 03

Trust is engineering, not adjective

Product-grade reliability depends on treating view structure, uncertainty state, and expert review feedback as first-class research objects — inspectable, regression-testable, and revisable. Not as hidden activations inside a single black-box prompt.

Why it matters

When a study is misread, the trace tells you whether the failure lives in a per-view agent, the reconciler, the ledger, or the report writer — and you can fix it there.

The Reasoning Graph

The system is organized as a long-horizon, stateful inference graph. Each stage converts model outputs into structured intermediate state before any final report is written.

[Figure 1 diagram: four phases, Input & Assembly (01–03) → Per-View & Cross-View (04–05) → Evidence & Specialists (06–08) → Synthesis (09), ending in the final radiology report, structured and traceable to the ledger. Solid arrows mark the main flow; dashed arrows mark conditional specialist activation.]
Figure 1 — The 9-stage agentic reasoning graph. Per-view agents (04) and specialists (07) call the foundation model; every other stage is deterministic control logic operating over the structured ledger.
01

Study Ingestion

Accepts one or more DICOM or image files for a knee study. Validates files, decodes DICOM, extracts laterality, generates previews.

02

Per-Image Preprocessing

View inference, composite-view splitting (AP/LAT composites are split into virtual per-view entries), and study-level grouping.
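Composite-view splitting can be illustrated with a small sketch. This is not the paper's implementation; the entry schema, view labels, and region names are assumptions for illustration.

```python
# Hypothetical: a composite AP/LAT film is split into two virtual per-view
# entries so that every downstream agent always sees a single projection.
def split_composite(entry: dict) -> list[dict]:
    """Split a composite-view study entry into virtual per-view entries."""
    if entry.get("inferred_view") != "ap_lat_composite":
        return [entry]  # already a single projection; pass through unchanged
    return [
        {**entry, "inferred_view": "ap", "virtual": True, "region": "left_half"},
        {**entry, "inferred_view": "lateral", "virtual": True, "region": "right_half"},
    ]
```

The virtual entries inherit the source file's metadata, so provenance back to the original DICOM is preserved through the split.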

03

Study Assembly

A lateral view remains lateral; a frontal stays frontal. View identity is preserved, not flattened. Multi-file studies stay as one object.

04

Core Per-View Agents

Bounded tasks per anatomy/technique: projection availability, alignment, osseous abnormality, joint space, soft tissue, patellar tracking, fragment vs. fabella, postoperative change, safety screening.

05

Cross-View Reconciliation

Evidence reconciliation, not pixel fusion. Each finding is marked corroborated, view-limited but plausible, contradicted by a stronger view, indeterminate, or not assessable.
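The five reconciliation outcomes can be sketched as a deterministic function over per-view calls. This is a minimal illustration, not the paper's logic: the preferred-view table, the call vocabulary, and the tie-breaking order are all assumptions.

```python
from enum import Enum

class Reconciliation(Enum):
    CORROBORATED = "corroborated"
    VIEW_LIMITED = "view_limited_but_plausible"
    CONTRADICTED = "contradicted_by_stronger_view"
    INDETERMINATE = "indeterminate"
    NOT_ASSESSABLE = "not_assessable"

# Hypothetical: which projection is authoritative for a given finding.
PREFERRED_VIEW = {"effusion": "lateral", "patellar_tracking": "sunrise"}

def reconcile(finding: str, per_view: dict[str, str]) -> Reconciliation:
    """Reconcile per-view calls ('present'/'absent'/'indeterminate') for one finding."""
    if not per_view:
        return Reconciliation.NOT_ASSESSABLE
    preferred = PREFERRED_VIEW.get(finding)
    positives = [v for v, call in per_view.items() if call == "present"]
    if preferred in per_view and per_view[preferred] == "absent" and positives:
        # The authoritative view actively contradicts a weaker positive.
        return Reconciliation.CONTRADICTED
    if len(positives) >= 2 or (preferred and preferred in positives):
        return Reconciliation.CORROBORATED
    if positives:
        # Positive somewhere, but the authoritative view is unavailable.
        return Reconciliation.VIEW_LIMITED
    if any(call == "indeterminate" for call in per_view.values()):
        return Reconciliation.INDETERMINATE
    # Concordant absence across views: treat as corroborated (absent).
    return Reconciliation.CORROBORATED
```

The point of the sketch is the shape of the decision, not its thresholds: reconciliation consumes structured calls with view provenance, never pixels.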

06

Evidence Ledger Reduction

All outputs collapse into a canonical ledger Λk per finding: state, confidence, supporting evidence, opposing evidence, limitations, location, grade.

07

Triggered Specialists

Specialist agents (fracture, alignment, effusion, OA grading, aggressive osseous, patellar tracking, postop) run only when the ledger warrants it. Selective spend, not unconditional expansion.
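Conditional activation can be expressed as a small rule table over ledger rows. The trigger conditions below are illustrative stand-ins, not the paper's actual rules.

```python
# Hypothetical trigger rules: a specialist runs only when the ledger already
# contains evidence that warrants the extra model call.
TRIGGERS = {
    "fracture_specialist": lambda row: row["finding"].startswith("fracture")
        and row["state"] in ("present", "indeterminate"),
    "oa_grading": lambda row: row["finding"] == "osteoarthritis"
        and row["state"] == "present",
    "effusion_specialist": lambda row: row["finding"] == "effusion"
        and row["state"] != "absent",
}

def select_specialists(ledger: list[dict]) -> set[str]:
    """Return the specialists warranted by the current ledger; most normal
    studies trigger none, so their latency stays flat."""
    return {name for name, rule in TRIGGERS.items()
            for row in ledger if rule(row)}
```

Because selection reads only the ledger, the spend decision is itself inspectable: a trace shows exactly which row activated which specialist.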

08

Skeptic Pass

Deterministic downgrade and rescue rules. Downgrades effusion without suprapatellar distention. Rescues grade-1 OA when explicit degenerative language exists. Raises fracture from absent to indeterminate when secondary signs appear.

09

Constrained Report Synthesis

The report writer reads only from the final ledger. It does not re-read the image. This is deliberate: an unconstrained second reader would reintroduce findings that earlier stages had already rejected.
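The constraint in stage 09 can be made concrete with a sketch of a ledger-only report writer. The phrasing templates and row schema are assumptions for illustration; the load-bearing property is that the function's only input is the ledger.

```python
def write_report(ledger: list[dict]) -> str:
    """Constrained synthesis: every sentence must be derivable from a ledger
    row. The writer never re-reads the image, so findings rejected upstream
    cannot reappear in the report."""
    findings, negatives = [], []
    for row in ledger:
        name = row["finding"].replace("_", " ")
        if row["state"] == "present":
            grade = f" ({row['grade']})" if row.get("grade") else ""
            findings.append(f"{name.capitalize()}{grade}.")
        elif row["state"] == "indeterminate":
            findings.append(f"Indeterminate for {name}; correlate clinically.")
        elif row["state"] == "absent":
            negatives.append(name)
    body = "\n".join(findings) if findings else "No acute osseous abnormality identified."
    if negatives:
        body += "\nNo evidence of: " + ", ".join(negatives) + "."
    return body
```

In a production system the templates would be far richer, but the interface stays the same: ledger in, report out, nothing else.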

This architecture reflects how 5C Network's Bionic AI engine is built — agentic, stateful, and inspectable rather than monolithic. Read more about Generalised Medical AI and Hybrid Intelligence.

The Evidence Ledger

The ledger is the system's main scientific object. It is what specialists read, what the skeptic challenges, what the report writer is constrained to use, and what engineers inspect when something goes wrong.

Λₖ = (sₖ, qₖ, E⁺ₖ, E⁻ₖ, Lₖ, ℓₖ, gₖ)

For each target finding yₖ, the ledger stores seven fields. One row per finding, one ledger per study.

sₖ · state

present · absent · indeterminate · not_assessable

qₖ · confidence

Calibrated confidence for the asserted state.

E⁺ₖ · supporting evidence

Per-view positive evidence with provenance.

E⁻ₖ · opposing evidence

Per-view contradicting evidence with provenance.

Lₖ · limitations

View-availability and image-quality caveats.

ℓₖ · location

Anatomical localization when relevant.

gₖ · grade

Grade or severity where clinically meaningful (e.g. KL grade for OA).

The report writer reads only this. Traceability by construction.
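A ledger row maps naturally onto a small record type. This is a minimal sketch assuming a Python representation; the paper specifies the symbols, not a schema, so the field names here are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

# Allowed values of sₖ: "present" | "absent" | "indeterminate" | "not_assessable"
State = str

@dataclass
class LedgerRow:
    """One row of the evidence ledger Λₖ: the seven fields from the paper."""
    finding: str                                          # target finding yₖ
    state: State                                          # sₖ
    confidence: float                                     # qₖ, calibrated confidence in sₖ
    supporting: list[str] = field(default_factory=list)   # E⁺ₖ, evidence with view provenance
    opposing: list[str] = field(default_factory=list)     # E⁻ₖ, evidence with view provenance
    limitations: list[str] = field(default_factory=list)  # Lₖ, view/quality caveats
    location: Optional[str] = None                        # ℓₖ, anatomical localization
    grade: Optional[str] = None                           # gₖ, e.g. a KL grade for OA
```

Making the row an explicit type is what lets every downstream consumer — specialists, skeptic, report writer, regression tests — share one contract.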

Three Regimes, One Architecture

The system handles three regimes distinctly. The subtle-abnormal regime is the one that motivates almost all of the design choices.

Normal studies

Converges quickly. Projection and quality agents confirm adequacy, regional agents return absent or low-signal findings, the skeptic suppresses weak positives, the ledger collapses to a concise normal report.

Obviously abnormal studies

Advanced osteoarthritis, gross deformity, displaced fracture. Multiple agents emit concordant positives, cross-view reconciliation reinforces them, specialists characterize them, synthesis is straightforward.

The hard regime

Subtle abnormal studies

Mild or early osteoarthritis. Intercondylar eminence injuries. Small chip or avulsion fractures. Subtle patellar fractures. Equivocal effusion-related trauma cues. Almost every design choice exists to handle this regime without increasing false positives elsewhere.

Early failure pattern that motivated tuning

True normal → usually correct
Obvious abnormal → usually correct
Subtle abnormal → sometimes collapsed to normal

The response was not to make the model more aggressive — that would have increased false positives across the board. Instead, two targeted changes were made: (1) stronger prompts for subtle osseous targets like patellar fracture, tibial spine injury, and tiny avulsion fragments; (2) cheap, deterministic rescue logic in the ledger and skeptic layers so secondary evidence prevents a false-normal collapse.

The Skeptic Pass

A deterministic regularization layer over the ledger. It controls overcall and undercall simultaneously without adding new full model stages.

Downgrade rules

  • Downgrade effusion when no direct suprapatellar or capsular distention sign exists.
  • Downgrade soft-tissue "effusions" confounded by skin-shadow or superficial overlap language.

Rescue rules

  • Promote osteoarthritis when multiple direct degenerative components are present, even if the composite OA target was initially absent.
  • Rescue grade-1 or early OA when explicit mild degenerative language exists.
  • Raise fracture from absent to indeterminate when secondary signs suggest occult or subtle injury.

These refinements were implemented without adding extra full model stages. They live in the reduction and skepticism layer and therefore improve subtle-abnormal sensitivity without materially increasing read time.
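The downgrade and rescue rules above can be sketched as one deterministic pass over the ledger. The evidence phrases and row schema here are illustrative assumptions, not the system's actual rule set.

```python
def skeptic_pass(ledger: list[dict]) -> list[dict]:
    """Deterministic downgrade/rescue pass. Each row is a dict with
    'finding', 'state', and 'supporting' (evidence strings with view
    provenance). Rule phrasing is illustrative."""
    out = []
    for row in ledger:
        row = dict(row)  # never mutate the caller's ledger
        ev = " ".join(row.get("supporting", [])).lower()
        f, s = row["finding"], row["state"]
        # Downgrade: effusion without a direct suprapatellar/capsular sign.
        if f == "effusion" and s == "present" \
                and "suprapatellar" not in ev and "capsular" not in ev:
            row["state"] = "indeterminate"
        # Rescue: early OA when explicit degenerative language exists.
        if f == "osteoarthritis" and s == "absent" and "degenerative" in ev:
            row["state"] = "present"
        # Rescue: secondary signs raise fracture from absent to indeterminate.
        if f.startswith("fracture") and s == "absent" and "joint effusion" in ev:
            row["state"] = "indeterminate"
        out.append(row)
    return out
```

Each rule costs one string check per row, which is why this layer can be tuned aggressively without touching read latency.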

Long-horizon does not mean slow

A clinically useful workstation cannot double or triple read latency every time sensitivity is tuned. The system follows three latency principles.

1

Triggered specialization, not unconditional expansion

Extra model calls are activated only when the ledger warrants them. Most studies never run the deep specialists.

2

Cheap post-reduction rescue rules

Many subtle-OA and subtle-fracture rescues were implemented in the skepticism layer rather than as new full agent stages. Cost: one rule evaluation. Benefit: better subtle-abnormal recall.

3

View-aware routing

The system uses view identity to aim the right prompt at the right image, rather than asking every agent to reason over every rendering blindly. Patellar tracking goes to the sunrise view; effusion to the lateral.
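View-aware routing reduces, in the simplest case, to a lookup table from agent to preferred projections. The table below is a hypothetical sketch; the real routing presumably covers all fourteen finding families.

```python
# Hypothetical routing table: aim the right agent prompt at the right view
# instead of running every agent on every rendering.
VIEW_ROUTES = {
    "patellar_tracking": ["sunrise"],
    "effusion": ["lateral"],
    "joint_space": ["ap", "lateral"],
    "alignment": ["ap"],
}

def route(agent: str, available_views: list[str]) -> list[str]:
    """Views this agent should read. An empty result means the target is not
    assessable on this study and the agent is skipped (and the ledger can
    record a view-availability limitation)."""
    preferred = VIEW_ROUTES.get(agent, available_views)
    return [v for v in preferred if v in available_views]
```

Routing on view identity is also what makes the not_assessable state honest: when the required projection is missing, the agent is skipped rather than forced to guess.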

"Long-horizon does not have to mean slow. It means staged, selective, and stateful." — from the paper.

By the Numbers

9

stages in the reasoning graph

14

target finding families covered

7

fields per ledger row

4

states in the evidence model

7

specialist families, triggered selectively

All figures describe the architecture of the system as documented in the paper. This is a systems paper, not a clinical validation study — see the limitations section below.

Limitations

The paper is explicit about what it does not claim.

01

Systems paper, not a clinical validation trial

No claim of radiologist-level performance or diagnostic equivalence. The contribution is architectural.

02

The orchestration layer is not a separately pretrained RL policy

The system leverages an RLHF-aligned foundation model plus deterministic control logic and targeted prompting.

03

Some heuristics remain explicit

Several safety and subtle-abnormal rescue rules are still implemented as reduction logic rather than learned policies. That is a deliberate engineering trade-off — auditability today, learnability tomorrow.

04

Performance remains most uncertain in the subtle-abnormal regime

That regime is exactly why the system exists, but it also remains the hardest place to make strong claims without a carefully curated evaluation set.

05

Multi-view quality and view availability still matter

No architecture can fully remove the information limits of projection radiography. A study with only a frontal view limits what any system, human or AI, can conclude about the lateral compartment.

Future Work

Six directions follow naturally from the architecture.

Prospective evaluation

Curated subtle-abnormal knee cohorts with sign-off outcomes.

Study-level calibration

Explicit calibration of system confidence against radiologist sign-off outcomes.

Learned trigger policies

Replace deterministic triggers for specialist activation with learned routing.

Ledger distillation

Distill the ledger and trace data into smaller domain-specific models.

Cross-joint extension

Apply the same evidence-led architecture to other musculoskeletal joints.

Expert feedback integration

Tighter integration of expert review into supervised and preference-style optimization loops.

Now accepting research collaborations

Work with us under IRB approval

5C Network is partnering with leading radiology AI researchers and hospitals to validate the agentic knee X-ray system on independently curated cohorts. Bring your anonymised knee X-ray dataset and we will run it as a fully unbiased external evaluation — with co-authored publications, prospective reader studies, and IRB-approved validation as the collaboration formats.

Unbiased by design

You bring an anonymised dataset we have never seen. We evaluate on it. Results, traces, and ledgers are shared back in full.

Publication-grade rigor

Co-authored manuscripts, prospective reader studies, IRB-approved protocols, and independent benchmarks — not marketing decks.

Data stays yours

Fully anonymised under DTA. No PHI. No re-use beyond the agreed study. Federated evaluation possible on request.

Get in touch

For radiology AI researchers, academic radiologists, hospital research offices, and AI labs.


PDF · 11 pages

Read the Complete Paper

"Toward High-Trust Knee Radiograph Interpretation: Long-Horizon Agentic Reasoning over Multi-View Studies with RLHF-Aligned Foundation Models"

By Kalyan Sivasailam & 5C Network Research · No email required. No paywall.