Speaker resolution
Speaker resolution is the step that turns raw SPEAKER_00 / SPEAKER_01 diarization labels into real names like Praveen / Ravi. It is deterministic, text-only, and conservative — when evidence is weak, the speaker stays UNKNOWN rather than receiving a wrong name.
Important. Voice analysis happens in a separate step (Voice embedding). This service is intentionally non-biometric.
Rule ordering
The resolver applies rules in order of evidence strength. The first matching rule wins.
-
Voice match — if
VoiceMatchServicepre-labelled a speakerrole=USERwithlabelSource=VOICE_EMBEDDING, the resolver fills in the user's display name (without touching the voice matcher's confidence). -
Greeting match — speaker says
"hi <name>"/"hello <name>"/"good morning <name>"etc. within ≤ 2 tokens of a name variant. The speaker who greets is labelled as the other party.HIGHconfidence when within the first 2 segments AND first 30 s;MEDIUMotherwise. -
Name mention asymmetry — Speaker A mentions a name and Speaker B never does → A takes the opposite role at
MEDIUM. If both speakers mention both names → ambiguous; nothing assigned. -
By-elimination — strict two-speaker call, one resolved at HIGH/MEDIUM → the other takes the opposite role at
MEDIUM. Three-or-more-speaker calls do not auto-resolve untouched speakers. -
Direction tiebreaker —
INCOMING/OUTGOINGdirection can promote aMEDIUMgreeting match toHIGH. Direction alone never assigns a label. -
Answering-pattern detection — scans the first 4 segments (within 20 s) for short answering phrases. The speaker who opens with such a phrase answered the ringing phone — they are the
CONTACT; the other becomesUSERby elimination.MEDIUMconfidence, taggedANSWERING_PATTERN. Emits warninganswering_pattern_used.Recognised answering phrases include:
hello,yes,speaking,go ahead,who is this,yeah,haan,bolo,hallo, and common variants. -
Unresolved — leave UNKNOWN — if no rule above assigned either speaker, both remain
role=UNKNOWN. No direction-based guess is made. Aspeaker_roles_unresolvedquality warning is emitted so the client can surface an honest message rather than a confident but wrong label. -
Phone fallback —
contactNamemissing butphoneNumberpresent → CONTACT becomes"Contact ending NNNN"(last 4 digits only). The rest of the number never enters a transcript field.
Why call direction is not used for role assignment
An earlier version of the resolver used call direction (INCOMING / OUTGOING) as a last-resort heuristic: "if OUTGOING, the callee answers first → first speaker = CONTACT." This was removed because:
- Direction ≠ who spoke first in the audio. The recording may start mid-conversation, Android's CallLog direction can lag, and the callee/caller might not be the first audible voice in every scenario.
- Silent failures are worse than honest uncertainty. A wrong but confident label caused the summary to attribute statements to the wrong person — harder to spot than an explicit
UNKNOWN.
Direction is still used as a tiebreaker to promote an already-resolved MEDIUM greeting match to HIGH (rule 5), but it never initiates a role assignment on its own.
Segment sort determinism
Speakers are assigned stable IDs (spk_1, spk_2, …) by chronological first-appearance order in the transcript. When two segments share identical timestamps (common at diarization turn boundaries due to floating-point rounding), the sort uses (startSeconds, endSeconds, canonicalSpeakerId) as a compound key. The string comparison of SPEAKER_00 / SPEAKER_01 ensures the assignment is fully deterministic across identical runs.
Quality warnings
| Warning | Meaning |
|---|---|
answering_pattern_used | Speaker role assigned via opening phrase detection |
speaker_roles_unresolved | No evidence resolved either speaker; both are UNKNOWN |
positional_fallback_used | (Legacy — no longer emitted by the current resolver) |
speaker_identity_ambiguous | Both speakers mentioned both names; resolver refused to guess |
multi_party_call_not_auto_resolved | More than 2 real speakers; untouched speakers stay UNKNOWN |
Token-boundary matching
Name matches respect word boundaries:
"ram"does not match"program"."alice"does not match"alicia".- Common honorifics (
mr,mrs,ms,sir,madam,shri,dr) are stripped before extracting the first name to search.
Privacy
The resolver emits a privacy-safe speakerResolution telemetry block on the transcript:
{
"strategyVersion": 2,
"usedUserDisplayName": true,
"usedContactName": true,
"usedDirection": false,
"usedPhoneFallback": false,
"voiceEmbeddingEnabled": false,
"voiceProfileUsed": false,
"voiceMatchStatus": "DISABLED",
"totalSpeakerCount": 2,
"resolvedSpeakerCount": 2,
"warnings": ["answering_pattern_used"]
}
No raw names, transcript fragments, or phone numbers appear in this block — only counts, booleans, and short warning codes.
What this service is NOT
- Not an LLM. No prompts, no creative interpretation.
- Not a speaker identification system. We never store voiceprints in this service. (See Voice embedding for the opt-in feature that does.)
- Not the place for user corrections. A future user-correction endpoint will write back with
labelSource=MANUAL.
Code map
| Service | File |
|---|---|
SpeakerNameResolutionService | Resolver — all 8 rules in order. |
AnsweringPatternResolver | Detects answering phrases in opening segments. |
NameMatcher | Token-boundary-aware name matching. |
CallContext | Metadata bundle (display name, contact, phone, direction). |
LabelSource | Enum of evidence sources (includes ANSWERING_PATTERN). |
SpeakerRole / SpeakerConfidence | Role and confidence enums. |
TranscriptNormalizationService | Assigns spk_1 / spk_2 IDs by deterministic sort order. |