High-Trust Knee X-rays
Long-horizon agentic reasoning over multi-view studies with RLHF-aligned foundation models.
Kalyan Sivasailam & 5C Network Research
CEO & Founder, 5C Network
Abstract
Knee radiography is a deceptively difficult diagnostic task for AI systems. Many normal studies are easy, and many obviously abnormal studies are easy, but the clinically meaningful edge cases sit in the low-signal regime: mild osteoarthritis, small avulsion or chip fractures, tibial spine injuries, and contradictory evidence across views.
This paper presents an agentic knee X-ray interpretation system built for that regime — a long-horizon inference graph over an RLHF-aligned multimodal foundation model, organized around study assembly, per-view analysis, cross-view reconciliation, structured evidence reduction, triggered specialists, skeptical downgrade logic, and constrained report synthesis.
Internal Research Draft · 5C Network Research
TL;DR: Knee X-ray AI should not be a single-prompt image-to-report model. It should be a long-horizon agentic reasoning system that decomposes the study into bounded subproblems, preserves view identity, reconciles evidence across projections, reduces everything to a canonical evidence ledger, and constrains the final report to write only from that ledger. RLHF-aligned foundation models are the substrate, not the answer. The hardest regime — subtle abnormalities — is exactly why this architecture exists.
Why a Systems Paper, Not a Validation Trial
Knee X-rays are among the most common musculoskeletal imaging studies, yet robust automated interpretation remains challenging. The task is not difficult only because fractures exist or because osteoarthritis exists. It is difficult because the clinically important cases sit between clearly normal and clearly abnormal. A normal study may be straightforward. A severely osteoarthritic knee or grossly displaced fracture may also be straightforward. But the practical failure zone lies in subtle abnormalities: early degenerative change, tiny avulsion fragments, tibial spine injuries, occult fracture cues, and equivocal soft-tissue findings that require caution rather than optimism.
In clinical practice, radiologists do not solve this by looking once and speaking once. They reason through the study. They determine which projections are available. They understand which findings are best evaluated on frontal, lateral, or sunrise views. They compare projections. They revise weak impressions when the second view does not corroborate them. They mentally maintain an evidence ledger even if they never call it that.
This paper is motivated by the thesis that high-trust radiology AI should be built to behave more like that workflow. Rather than asking a single model prompt to directly read a knee study and produce a report, we construct a long-horizon agentic system that decomposes the task into bounded reasoning problems and then reassembles the results through explicit reduction and synthesis. The resulting architecture is designed not only to improve report quality, but to make failure modes measurable and tunable.
8 · Sub-problems a one-shot model collapses into a single uncontrolled generation
3 · Regimes the system handles distinctly: normal, obviously abnormal, subtle abnormal
4 · States in the evidence model: present · absent · indeterminate · not_assessable
From the paper. We frame this as a systems paper, not a clinical validation study or claim of radiologist equivalence.
Three Claims
The paper makes three architectural claims about how radiology AI should be built in the RLHF era.
Reasoning over evidence, not generation over pixels
Knee radiograph AI is better modeled as structured reasoning over evidence than as single-pass image-to-report generation. The hidden reasoning steps radiologists already perform — view inference, cross-view corroboration, weak-finding skepticism — should be externalized into explicit pipeline stages.
Why it matters
Failure modes become locatable: you can ask which stage missed a finding, not just whether the model was wrong.
RLHF is a substrate, not a solution
RLHF-era foundation models become substantially more useful for radiology when embedded inside explicit decomposition, verification, and skepticism mechanisms — rather than used monolithically. Alignment gives you instruction-following and revision; it does not give you radiology.
Why it matters
Vendor-neutral. The architecture works on top of any RLHF-aligned multimodal model. Swap the substrate, keep the orchestration.
Trust is engineering, not an adjective
Product-grade reliability depends on treating view structure, uncertainty state, and expert review feedback as first-class research objects — inspectable, regression-testable, and revisable. Not as hidden activations inside a single black-box prompt.
Why it matters
When a study is misread, the trace tells you whether the failure lives in a per-view agent, the reconciler, the ledger, or the report writer — and you can fix it there.
The Reasoning Graph
The system is organized as a long-horizon, stateful inference graph. Each stage converts model outputs into structured intermediate state before any final report is written.
Study Ingestion
Accepts one or more DICOM or image files for a knee study. Validates files, decodes DICOM, extracts laterality, generates previews.
Per-Image Preprocessing
View inference, composite-view splitting (AP/LAT composites are split into virtual per-view entries), and study-level grouping.
Study Assembly
A lateral view remains lateral; a frontal view stays frontal. View identity is preserved, not flattened. Multi-file studies stay as one object.
Core Per-View Agents
Bounded tasks per anatomy/technique: projection availability, alignment, osseous abnormality, joint space, soft tissue, patellar tracking, fragment vs. fabella, postoperative change, safety screening.
Cross-View Reconciliation
Evidence reconciliation, not pixel fusion. Each finding is marked corroborated, view-limited but plausible, contradicted by a stronger view, indeterminate, or not assessable.
Evidence Ledger Reduction
All outputs collapse into a canonical ledger Λk per finding: state, confidence, supporting evidence, opposing evidence, limitations, location, grade.
Triggered Specialists
Specialist agents (fracture, alignment, effusion, OA grading, aggressive osseous, patellar tracking, postop) run only when the ledger warrants it. Selective spend, not unconditional expansion.
Skeptic Pass
Deterministic downgrade and rescue rules. Downgrades effusion without suprapatellar distention. Rescues grade-1 OA when explicit degenerative language exists. Raises fracture from absent to indeterminate when secondary signs appear.
Constrained Report Synthesis
The report writer reads only from the final ledger. It does not re-read the image. This is deliberate: an unconstrained second reader would reintroduce findings that earlier stages had already rejected.
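The constraint is simple to state in code: the writer's only argument is the ledger, so there is no code path through which rejected findings can re-enter. The function below is a minimal sketch under assumed field names (`finding`, `state`, `grade`); the real synthesis stage is model-driven, but it reads from the same kind of structure.

```python
def write_report(ledger: list[dict]) -> str:
    """Constrained synthesis: the writer's only input is the final ledger.

    Hypothetical sketch. It never re-reads pixels, so findings that
    earlier stages rejected cannot reappear in the report.
    """
    positives = [r for r in ledger if r["state"] == "present"]
    uncertain = [r for r in ledger if r["state"] == "indeterminate"]
    lines = []
    for row in positives:
        grade = f" ({row['grade']})" if row.get("grade") else ""
        lines.append(f"{row['finding']}{grade}: present.")
    for row in uncertain:
        lines.append(f"{row['finding']}: indeterminate; correlate clinically.")
    if not lines:
        # Nothing survived reduction and skepticism: a concise normal report.
        lines.append("No acute osseous abnormality identified.")
    return "\n".join(lines)
```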
This architecture reflects how 5C Network's Bionic AI engine is built — agentic, stateful, and inspectable rather than monolithic. Read more about Generalised Medical AI and Hybrid Intelligence.
The Evidence Ledger
The ledger is the system's main scientific object. It is what specialists read, what the skeptic challenges, what the report writer is constrained to use, and what engineers inspect when something goes wrong.
For each target finding yk, the ledger stores seven fields. One row per finding, one ledger per study.
sk · state
present · absent · indeterminate · not_assessable
qk · confidence
Calibrated confidence for the asserted state.
E+k · supporting evidence
Per-view positive evidence with provenance.
E−k · opposing evidence
Per-view contradicting evidence with provenance.
Lk · limitations
View-availability and image-quality caveats.
ℓk · location
Anatomical localization when relevant.
gk · grade
Grade or severity where clinically meaningful (e.g. KL grade for OA).
The report writer reads only this. Traceability by construction.
Three Regimes, One Architecture
The system handles three regimes distinctly. The subtle-abnormal regime is the one that motivates almost all of the design choices.
Normal studies
Converges quickly. Projection and quality agents confirm adequacy, regional agents return absent or low-signal findings, the skeptic suppresses weak positives, the ledger collapses to a concise normal report.
Obviously abnormal studies
Advanced osteoarthritis, gross deformity, displaced fracture. Multiple agents emit concordant positives, cross-view reconciliation reinforces them, specialists characterize them, synthesis is straightforward.
Subtle abnormal studies
Mild or early osteoarthritis. Intercondylar eminence injuries. Small chip or avulsion fractures. Subtle patellar fractures. Equivocal effusion-related trauma cues. Almost every design choice exists to handle this regime without increasing false positives elsewhere.
Early failure pattern that motivated tuning
Early versions of the system occasionally let subtle osseous findings collapse into false-normal reports. The response was not to make the model more aggressive — that would have increased false positives across the board. Instead, two targeted changes were made: (1) stronger prompts for subtle osseous targets like patellar fracture, tibial spine injury, and tiny avulsion fragments; (2) cheap, deterministic rescue logic in the ledger and skeptic layers so secondary evidence prevents a false-normal collapse.
The Skeptic Pass
A deterministic regularization layer over the ledger. It controls overcall and undercall simultaneously without adding new full model stages.
Downgrade rules
- Downgrade effusion when no direct suprapatellar or capsular distention sign exists.
- Downgrade soft-tissue "effusions" confounded by skin-shadow or superficial overlap language.
Rescue rules
- Promote osteoarthritis when multiple direct degenerative components are present, even if the composite OA target was initially absent.
- Rescue grade-1 or early OA when explicit mild degenerative language exists.
- Raise fracture from absent to indeterminate when secondary signs suggest occult or subtle injury.
These refinements were implemented without adding extra full model stages. They live in the reduction and skepticism layer and therefore improve subtle-abnormal sensitivity without materially increasing read time.
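Because the skeptic is deterministic, it can be expressed as plain rule code over the ledger. The sketch below illustrates three of the rules above under an assumed dict-based ledger; the field names (`supporting`, `secondary_signs`) and the keyword matches are hypothetical stand-ins for the real evidence representation.

```python
def skeptic_pass(ledger: dict[str, dict]) -> dict[str, dict]:
    """Deterministic downgrade and rescue rules over the ledger.

    Illustrative only: each rule is one cheap evaluation, not a new
    model stage. Returns an updated copy; the input is left unchanged.
    """
    out = {k: dict(v) for k, v in ledger.items()}

    # Downgrade: effusion asserted without a direct distention sign.
    eff = out.get("effusion")
    if eff and eff["state"] == "present":
        if not any("suprapatellar" in e or "capsular distention" in e
                   for e in eff.get("supporting", [])):
            eff["state"] = "indeterminate"

    # Rescue: raise fracture from absent to indeterminate on secondary signs.
    frac = out.get("fracture")
    if frac and frac["state"] == "absent" and frac.get("secondary_signs"):
        frac["state"] = "indeterminate"

    # Rescue: promote OA when explicit degenerative language exists.
    oa = out.get("osteoarthritis")
    if oa and oa["state"] == "absent":
        if any("degenerative" in e for e in oa.get("supporting", [])):
            oa["state"] = "present"
            oa["grade"] = oa.get("grade") or "KL-1"
    return out
```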
Long-horizon does not mean slow
A clinically useful workstation cannot double or triple read latency every time sensitivity is tuned. The system follows three latency principles.
Triggered specialization, not unconditional expansion
Extra model calls are activated only when the ledger warrants them. Most studies never run the deep specialists.
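The trigger decision itself is cheap: a handful of predicates over ledger state. The sketch below assumes a dict-based ledger and illustrative trigger conditions; the specialist names and thresholds are hypothetical.

```python
def specialists_to_run(ledger: dict[str, dict]) -> list[str]:
    """Decide which specialist families to activate from ledger state alone.

    Illustrative trigger conditions. The point is selective spend:
    no specialist runs unconditionally, and a clean ledger runs none.
    """
    triggers = {
        "fracture": lambda l: l.get("fracture", {}).get("state")
                    in ("present", "indeterminate"),
        "oa_grading": lambda l: l.get("osteoarthritis", {}).get("state")
                    == "present",
        "effusion": lambda l: l.get("effusion", {}).get("state")
                    in ("present", "indeterminate"),
    }
    return [name for name, fires in triggers.items() if fires(ledger)]
```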
Cheap post-reduction rescue rules
Many subtle-OA and subtle-fracture rescues were implemented in the skepticism layer rather than as new full agent stages. Cost: one rule evaluation. Benefit: better subtle-abnormal recall.
View-aware routing
The system uses view identity to aim the right prompt at the right image, rather than asking every agent to reason over every rendering blindly. Patellar tracking goes to the sunrise view; effusion to the lateral.
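View-aware routing amounts to a lookup table from agent to preferred projections. The table below is a hypothetical sketch consistent with the examples in the text (patellar tracking to sunrise, effusion to lateral); the agent names and view lists are assumptions.

```python
# Hypothetical routing table: which projections each agent should read.
VIEW_ROUTING: dict[str, list[str]] = {
    "patellar_tracking": ["sunrise"],
    "effusion": ["lateral"],
    "joint_space": ["ap", "lateral"],
    "osseous_abnormality": ["ap", "lateral", "sunrise"],
}

def route(agents: list[str], available_views: set[str]) -> dict[str, list[str]]:
    """Aim each agent at the views it is meant to read.

    Agents whose preferred views are missing are skipped rather than
    run blindly over whatever renderings happen to exist.
    """
    plan: dict[str, list[str]] = {}
    for agent in agents:
        views = [v for v in VIEW_ROUTING.get(agent, []) if v in available_views]
        if views:
            plan[agent] = views
    return plan
```

A two-view study without a sunrise projection therefore simply never invokes the patellar-tracking agent, and its absence is recorded as a view-availability limitation rather than a negative finding.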
"Long-horizon does not have to mean slow. It means staged, selective, and stateful." — from the paper.
By the Numbers
9 · stages in the reasoning graph
14 · target finding families covered
7 · fields per ledger row
4 · states in the evidence model
7 · specialist families, triggered selectively
All figures describe the architecture of the system as documented in the paper. This is a systems paper, not a clinical validation study — see the limitations section below.
Limitations
The paper is explicit about what it does not claim.
Systems paper, not a clinical validation trial
No claim of radiologist-level performance or diagnostic equivalence. The contribution is architectural.
The orchestration layer is not a separately pretrained RL policy
The system leverages an RLHF-aligned foundation model plus deterministic control logic and targeted prompting.
Some heuristics remain explicit
Several safety and subtle-abnormal rescue rules are still implemented as reduction logic rather than learned policies. That is a deliberate engineering trade-off — auditability today, learnability tomorrow.
Performance remains most uncertain in the subtle-abnormal regime
That regime is exactly why the system exists, but it also remains the hardest place to make strong claims without a carefully curated evaluation set.
Multi-view quality and view availability still matter
No architecture can fully remove the information limits of projection radiography. A study with only a frontal view limits what any system, human or AI, can conclude about the lateral compartment.
Future Work
Six directions follow naturally from the architecture.
Prospective evaluation
Curated subtle-abnormal knee cohorts with sign-off outcomes.
Study-level calibration
Explicit calibration of system confidence against radiologist sign-off outcomes.
Learned trigger policies
Replace deterministic triggers for specialist activation with learned routing.
Ledger distillation
Distill the ledger and trace data into smaller domain-specific models.
Cross-joint extension
Apply the same evidence-led architecture to other musculoskeletal joints.
Expert feedback integration
Tighter integration of expert review into supervised and preference-style optimization loops.
Work with us under IRB approval
5C Network is partnering with leading radiology AI researchers and hospitals to validate the agentic knee X-ray system on independently curated cohorts. Bring your anonymised knee X-ray dataset and we will run it as a fully unbiased external evaluation — with co-authored publications, prospective reader studies, and IRB-approved validation as the collaboration formats.
Fully unbiased evaluation
You bring an anonymised dataset we have never seen. We evaluate on it. Results, traces, and ledgers are shared back in full.
Publication-grade rigor
Co-authored manuscripts, prospective reader studies, IRB-approved protocols, and independent benchmarks — not marketing decks.
Data stays yours
Fully anonymised under DTA. No PHI. No re-use beyond the agreed study. Federated evaluation possible on request.
Read the Complete Paper
"Toward High-Trust Knee Radiograph Interpretation: Long-Horizon Agentic Reasoning over Multi-View Studies with RLHF-Aligned Foundation Models"
By Kalyan Sivasailam & 5C Network Research