Monitoring

Scryon emits enough telemetry to run a small SLO programme out of the box. This page suggests concrete dashboards and alerts to start with.

Health budget

| SLI | Target | Notes |
| --- | --- | --- |
| Pipeline success rate | ≥ 99% | 1 - sum(rate(scryon_calls_failed_total[5m])) / sum(rate(scryon_calls_uploaded_total[5m])) |
| Time-to-complete p95 | ≤ 60s for 5-minute calls | histogram_quantile(0.95, sum by (le) (rate(scryon_pipeline_total_duration_seconds_bucket[5m]))) |
| HTTP error rate | < 1% 5xx | sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) |
| Worker stuck rate | < 0.1% | rate(scryon_calls_swept_total[5m]) |
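
To make the 99% pipeline success target concrete, the sketch below turns it into an error budget: how many failed calls a window can absorb before the target is missed. The function name and the sample volumes are illustrative and not part of Scryon; in practice the two counts would come from increase() over the SLO window on scryon_calls_uploaded_total and scryon_calls_failed_total.

```python
def error_budget(total_calls: int, failed_calls: int, target: float = 0.99) -> dict:
    """Summarise error-budget usage for a success-rate SLO.

    total_calls and failed_calls are counts over the same window.
    The names and the default target are assumptions for illustration.
    """
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    allowed = total_calls * (1 - target)  # failures the target tolerates
    success_rate = 1 - failed_calls / total_calls
    return {
        "success_rate": success_rate,
        "allowed_failures": allowed,
        "budget_remaining": allowed - failed_calls,
        "within_target": success_rate >= target,
    }


# Hypothetical window: 20,000 uploaded calls, 150 of them failed.
report = error_budget(total_calls=20_000, failed_calls=150)
print(report)
```

With these made-up numbers the budget is roughly 200 failed calls, so 150 failures still leaves headroom. A negative budget_remaining means the target was missed for that window.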

Prometheus scrape

Point Prometheus / Grafana Cloud / Datadog at /actuator/prometheus on a 15-30s cadence.

Example Prometheus job:

- job_name: scryon
  scrape_interval: 30s
  metrics_path: /actuator/prometheus
  static_configs:
    - targets: ["scryon.internal:8080"]

Suggested dashboards

1. Pipeline overview

| Panel | Query |
| --- | --- |
| Calls per hour | sum(rate(scryon_calls_uploaded_total[5m])) * 3600 |
| Pipeline success rate | 1 - sum(rate(scryon_calls_failed_total[5m])) / sum(rate(scryon_calls_uploaded_total[5m])) |
| p50 / p95 / p99 e2e duration | histogram_quantile(q, sum by (le) (rate(scryon_pipeline_total_duration_seconds_bucket[5m]))), one query each for q = 0.5, 0.95, 0.99 |
| Failed reasons | topk(5, sum by (reason) (rate(scryon_calls_failed_total[5m]))) |

2. Stage breakdown

Per-stage timers (all are *_duration_seconds):

| Stage | Metric |
| --- | --- |
| Audio preprocessing | scryon_audio_preprocessing_duration_seconds |
| Diarization | scryon_diarization_duration_seconds |
| Transcription | scryon_transcription_duration_seconds |
| Alignment | scryon_transcript_alignment_duration_seconds |
| Normalization | scryon_transcript_normalization_duration_seconds |
| Voice match | scryon_voice_embedding_provider_duration_seconds |
| Analysis | scryon_analysis_duration_seconds |

Plot p50 / p95 of each as a stacked area to see where time goes.
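
The per-stage panel queries can be generated rather than typed by hand. This is a minimal sketch, assuming each stage timer is exported as a Prometheus histogram with a _bucket series (this page only shows that for scryon_pipeline_total_duration_seconds); the function name and the 5m window are illustrative choices.

```python
# Stage names and metric names are taken from the table above.
STAGE_METRICS = {
    "Audio preprocessing": "scryon_audio_preprocessing_duration_seconds",
    "Diarization": "scryon_diarization_duration_seconds",
    "Transcription": "scryon_transcription_duration_seconds",
    "Alignment": "scryon_transcript_alignment_duration_seconds",
    "Normalization": "scryon_transcript_normalization_duration_seconds",
    "Voice match": "scryon_voice_embedding_provider_duration_seconds",
    "Analysis": "scryon_analysis_duration_seconds",
}


def stage_quantile_query(metric: str, quantile: float, window: str = "5m") -> str:
    """Build a histogram_quantile query for one stage timer.

    Assumes the timer publishes histogram buckets as <metric>_bucket.
    """
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (rate({metric}_bucket[{window}])))"
    )


# Print one p50 and one p95 query per stage, ready to paste into a panel.
for stage, metric in STAGE_METRICS.items():
    for q in (0.5, 0.95):
        print(f"{stage} p{round(q * 100)}: {stage_quantile_query(metric, q)}")
```

If the timers turn out to be exported without buckets, these queries return no data and the histogram publishing would need to be enabled first.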

3. Provider health

| Panel | Query |
| --- | --- |
| pyannote fallback rate | rate(scryon_diarization_fallback_total[5m]) |
| Voice match outcomes | sum by (outcome) (rate(scryon_voice_match_outcome_total[5m])) |
| Lemonfox 4xx rate | rate(http_client_requests_seconds_count{host="api.lemonfox.ai",status=~"4.."}[5m]) |
| OpenAI 429 rate | rate(http_client_requests_seconds_count{host="api.openai.com",status="429"}[5m]) |

4. Voice profile usage

| Panel | Query |
| --- | --- |
| Profiles created (7d) | increase(scryon_voice_profile_created_total[7d]) |
| Match outcomes by status | sum by (outcome) (rate(scryon_voice_match_outcome_total[1h])) |

Suggested alerts

| Alert | Condition | Why |
| --- | --- | --- |
| Pipeline failure spike | sum(rate(scryon_calls_failed_total[5m])) / sum(rate(scryon_calls_uploaded_total[5m])) > 0.1 | More than 10% of uploaded calls failed over 5 min. |
| All calls failing | sum(rate(scryon_calls_failed_total[2m])) > 0 and sum(rate(scryon_calls_completed_total[2m])) == 0 | Total outage. |
| Stuck jobs | rate(scryon_calls_swept_total[15m]) > 0 | Sweeper had to clean up; investigate. |
| Sentry rate | external | Handled by Sentry's own alerting. |
| Voice match always ambiguous | sum(rate(scryon_voice_match_outcome_total{outcome="ambiguous"}[1h])) / sum(rate(scryon_voice_match_attempted_total[1h])) > 0.5 | More than half of match attempts are ambiguous; profile probably stale or low-quality. |
| Lemonfox 5xx | rate(http_client_requests_seconds_count{host="api.lemonfox.ai",status=~"5.."}[5m]) > 0.05 | Provider degraded. |

Logs

We recommend writing logs to stdout and letting your platform's collector ship them. Useful filters:

| Filter | Query |
| --- | --- |
| All pipeline events for a call | event=PIPELINE callId=f0a1d2e3-... |
| Failed stages | event=PIPELINE status=FAILED |
| Voice match | event=VOICE_MATCH_* |
| Speaker resolution | event=SPEAKER_RESOLUTION |
| HTTP request access log | event=HTTP_REQUEST_COMPLETED |
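
The filter semantics above (every key=value term must match, and * acts as a wildcard) can be sketched in a few lines. This assumes each log record has already been parsed into a dict of fields; the actual log format and your collector's query syntax may differ. The sample records, the event name VOICE_MATCH_COMPLETED, and the callId value are made up for illustration.

```python
from fnmatch import fnmatchcase


def parse_filter(query: str) -> dict:
    """Split 'event=PIPELINE status=FAILED' into {'event': 'PIPELINE', ...}."""
    return dict(term.split("=", 1) for term in query.split())


def matches(record: dict, query: str) -> bool:
    """True if every key=value term matches; values may use * wildcards."""
    return all(
        key in record and fnmatchcase(str(record[key]), pattern)
        for key, pattern in parse_filter(query).items()
    )


# Hypothetical, already-parsed log records.
records = [
    {"event": "PIPELINE", "status": "FAILED", "callId": "call-1"},
    {"event": "VOICE_MATCH_COMPLETED", "callId": "call-1"},
    {"event": "HTTP_REQUEST_COMPLETED", "status": "200"},
]

failed = [r for r in records if matches(r, "event=PIPELINE status=FAILED")]
voice = [r for r in records if matches(r, "event=VOICE_MATCH_*")]
print(len(failed), len(voice))
```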

Sentry

When SENTRY_DSN is set, Sentry receives:

  • Every unhandled exception in HTTP handlers.
  • Every pipeline-stage RuntimeException (ScryonErrorReporter).

Request bodies, sensitive headers, and transcript text are scrubbed from these events.

Alert recommendations in Sentry:

  • New issue notification (immediate).
  • Spike: > 50 events / hour on any release.
  • Release health degraded: crash-free sessions < 99%.

Synthetic checks

Run a synthetic check once a minute from your tool of choice:

curl -sf https://api.scryon.app/api/health > /dev/null
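
If your checker runs scripts rather than shell one-liners, the same probe can be written with the Python standard library. This is a rough equivalent of the curl command, not an official client: the function names, the 5-second timeout, and the injectable opener parameter (there only to make the function testable without network access) are assumptions.

```python
import urllib.error
import urllib.request

HEALTH_URL = "https://api.scryon.app/api/health"  # same URL as the curl check


def check_health(url: str = HEALTH_URL, timeout: float = 5.0,
                 opener=urllib.request.urlopen) -> bool:
    """Return True when the endpoint answers with a 2xx status in time."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        # urlopen raises HTTPError (a URLError) for 4xx/5xx responses,
        # and URLError/OSError for DNS, TLS, or connection failures.
        return False


def main() -> int:
    """Exit code for a scheduler: 0 when healthy, 1 otherwise."""
    return 0 if check_health() else 1
```

A scheduled job would call sys.exit(main()) so that a failed probe surfaces as a non-zero exit code, much as curl -f does.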

For a deeper probe, run a daily end-to-end test that uploads a 30-second fixture call against a staging Firebase project and asserts the analysis comes back.