Healthtech

HIPAA-aligned OCR and extraction pipeline for a clinical lab network

A multi-tenant document intelligence platform that ingests scanned lab test requisitions, extracts structured data with OCR and LLM-assisted mapping, and returns auditable, HIPAA-aligned records to the client's lab information system.

TypeScript, React, Python, Django, Celery, PostgreSQL, Redis, AWS Textract, AWS S3, AWS KMS

Client

Clinical laboratory group (US)

Year

2025

Duration

9 months

Outcome

HIPAA-aligned pipeline handling requisitions across every US state

The challenge

A clinical laboratory network was manually keying hundreds of scanned test requisitions per day into its lab information system. Every form came from a different clinic on a different template, and quality varied wildly: handwritten notes, faxes, photos taken on a phone at a patient's bedside. Staff turnover and transcription errors were pushing operational cost up and making every HIPAA audit harder to pass.

The network needed a platform that could ingest any document, extract structured data reliably, and produce an auditable record for every field decision, all without sending PHI anywhere it wasn't supposed to go.

Our approach

Designed a multi-tenant architecture where every client organisation is fully isolated at the database, storage and audit-log layer.
Built a staged pipeline: preprocessing (deskew, denoise, normalise), OCR extraction via AWS Textract, LLM-assisted field mapping through a managed inference layer, and structured output validation with schema-level constraints.
Wrapped every step in a Celery job with retry, idempotency and per-document audit trails. Every field that lands in the lab system carries a trace back to the source region of the scanned document.
Hardened the platform in line with the HIPAA Security Rule from day one: encryption at rest via KMS, signed-URL S3 access, tenant-scoped access control, tamper-evident audit logging, and a documented data retention and deletion process.
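
To make the retry and idempotency idea concrete, here is a minimal sketch in plain Python. It is not the production code: `StageRunner`, its fields and the event names are hypothetical, the in-memory dictionary stands in for a database table, and in a Celery deployment the retry loop would normally be handled by the task's own retry mechanism. The sketch only shows the principle that a redelivered job for the same document, stage and input returns the stored result instead of running again, and that every attempt leaves an audit event.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StageRunner:
    """Runs pipeline stages with bounded retries and idempotent results.

    Hypothetical illustration only; names and storage are assumptions.
    """
    max_attempts: int = 3
    # Completed results keyed by idempotency key (a database table in practice).
    results: Dict[str, bytes] = field(default_factory=dict)
    # Per-document audit trail as (document id, stage, event) tuples.
    audit: List[Tuple[str, str, str]] = field(default_factory=list)

    def run(self, doc_id: str, stage: str, payload: bytes,
            fn: Callable[[bytes], bytes]) -> bytes:
        # The same document, stage and input always produce the same key,
        # so a duplicate delivery is detected before any work is repeated.
        key = hashlib.sha256(
            b"\x00".join([doc_id.encode(), stage.encode(), payload])
        ).hexdigest()
        if key in self.results:
            self.audit.append((doc_id, stage, "skipped_duplicate"))
            return self.results[key]

        last_error = None
        for attempt in range(1, self.max_attempts + 1):
            try:
                output = fn(payload)
            except Exception as exc:  # sketch only; real code would narrow this
                last_error = exc
                self.audit.append((doc_id, stage, f"failed_attempt_{attempt}"))
                continue
            self.results[key] = output
            self.audit.append((doc_id, stage, "succeeded"))
            return output
        raise RuntimeError(
            f"{stage} failed after {self.max_attempts} attempts"
        ) from last_error
```

Keying on the input bytes as well as the document id means a re-scanned or corrected document is processed again, while an exact redelivery is not.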

Architecture highlights

Django REST API + React/TypeScript frontend
Celery + Redis for durable, retryable document processing
PostgreSQL with per-tenant schema isolation and row-level access checks
AWS Textract + managed LLM APIs behind a pluggable inference layer
Append-only audit log with per-document traceability
Infrastructure as code, environment parity from dev to prod
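
The case study does not say how the audit log is made tamper-evident. One common technique is hash chaining, where each entry stores the hash of the previous one, so editing or deleting an earlier entry breaks every hash after it. The sketch below assumes that technique; `AuditLog`, `GENESIS` and the entry fields are hypothetical names, and the list stands in for an append-only table.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Canonical JSON so the same body always hashes to the same value.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def append(self, tenant: str, document_id: str, event: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {
            "tenant": tenant,
            "document_id": document_id,
            "event": event,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry makes this False."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash:
                return False
            if self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

A chain like this detects tampering but does not prevent it, so in practice it would sit alongside storage-level controls such as restricted write permissions.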

Outcome

HIPAA-aligned architecture reviewed by the client's compliance team and operating in production
Every extracted field is traceable back to the scanned source region and signed off by either a user or an automated rule
Throughput capacity matched the lab network's sustained daily volume, with headroom for seasonal spikes
Staff time on manual data entry dropped dramatically; the operations team moved from keying to exception handling
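
As an illustration of what field-level traceability and sign-off could look like, the sketch below models an extracted field that carries its source region and records who approved it. Everything in it is an assumption made for the example: the class and function names, the normalised confidence score, and the 0.95 threshold are hypothetical and do not reflect the engagement's real rules or numbers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourceRegion:
    page: int
    # Bounding box as fractions of the page size: (left, top, width, height).
    bbox: Tuple[float, float, float, float]


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # hypothetical score normalised to 0.0 to 1.0
    region: SourceRegion
    signed_off_by: Optional[str] = None


AUTO_APPROVE_THRESHOLD = 0.95  # illustrative value only


def route(f: ExtractedField) -> str:
    """Auto-approve confident, non-empty fields; send the rest to a person."""
    if f.value.strip() and f.confidence >= AUTO_APPROVE_THRESHOLD:
        f.signed_off_by = "rule:auto_approve"
        return "accepted"
    return "exception_queue"


def sign_off(f: ExtractedField, user_id: str,
             corrected: Optional[str] = None) -> None:
    """Record a human decision, optionally with a corrected value."""
    if corrected is not None:
        f.value = corrected
    f.signed_off_by = f"user:{user_id}"
```

For example, a field read at confidence 0.62 would land in the exception queue, and the reviewer's `sign_off` call would leave `signed_off_by` set to that user while the source region stays attached to the record.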

(Specific accuracy and throughput numbers are redacted per engagement; reference calls available on request.)
