6 min · on-device · privacy · audio

Speaker Recognition: How dijin Knows Who's Talking

In a meeting with five people, a raw transcript is almost useless without knowing who said what. dijin solves this with on-device speaker diarization.

ℹ
Speaker diarization answers the question "who said what?", turning a wall of text into an attributed conversation with names, timestamps, and context.

How Voice Enrollment Works

1. Record a Short Sample

Speak for about 10 seconds. dijin captures your unique vocal characteristics using the on-device microphone.

2. Voice Embedding Generated

A compact mathematical representation of your voice is created entirely on-device. No server involved.

3. Stored Locally

The embedding stays on your device. Not uploaded anywhere. Hardware-encrypted in Keychain.

4. Recognition Across Sessions

Days, weeks, or months later, dijin matches the voice automatically. No re-enrollment needed.
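
As a rough illustration of steps 2 and 3, the sketch below averages per-frame embedding vectors into one unit-length voiceprint and saves it to a local file. Everything in it is an assumption made for illustration: the function names, the averaging strategy, and the JSON file, which only stands in for hardware-backed Keychain storage. dijin's actual model and storage code are not described in this post.

```python
import json
import math
from pathlib import Path
from typing import Dict, List

Vector = List[float]


def l2_normalize(vector: Vector) -> Vector:
    """Scale a vector to unit length so voiceprints compare well by cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def build_voiceprint(frame_embeddings: List[Vector]) -> Vector:
    """Average per-frame embeddings (hypothetical model output) into one voiceprint."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    count = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    mean = [sum(frame[i] for frame in frame_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)


def load_voiceprints(store_path: Path) -> Dict[str, Vector]:
    """Read all stored voiceprints; an absent file means nobody is enrolled yet."""
    if not store_path.exists():
        return {}
    return json.loads(store_path.read_text(encoding="utf-8"))


def save_voiceprint(store_path: Path, name: str, voiceprint: Vector) -> None:
    """Persist a voiceprint locally. A plain JSON file is a stand-in for Keychain."""
    store = load_voiceprints(store_path)
    store[name] = voiceprint
    store_path.write_text(json.dumps(store), encoding="utf-8")
```

The voiceprint is a short fixed-length list of numbers no matter how long the enrollment recording was, which is why it is cheap to keep on the device and reload in later sessions.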

The Diarization Pipeline

Audio Input
→ VAD (Voice Activity Detection): split into segments
→ Embedding Computation: per-segment voice vector
→ Clustering: group similar voices
→ Matching: compare against enrolled voiceprints
Enrollment: ~10 seconds
Processing: 100% on-device
Persistence: Cross-session
Privacy: Embeddings never uploaded
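
The first three pipeline stages (VAD, embedding, clustering) can be sketched in a few lines. This is a minimal toy version under stated assumptions, not dijin's implementation: the VAD is a simple energy threshold, the embedding model is passed in as a hypothetical `embed` callable, clustering is a greedy cosine-similarity pass, and all thresholds are arbitrary illustrative values. Matching against enrolled voiceprints is left out here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vad_segments(samples: Sequence[float], frame_len: int = 160,
                 energy_threshold: float = 0.01) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of consecutive frames above the energy threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= energy_threshold:
            if start is None:
                start = i  # speech begins
        elif start is not None:
            segments.append((start, i))  # speech ended at this silent frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


def cluster(embeddings: List[Vector], threshold: float = 0.75) -> List[int]:
    """Greedy clustering: join the most similar existing cluster, else open a new one.

    The first member of each cluster acts as its representative.
    """
    representatives: List[Vector] = []
    labels: List[int] = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, rep in enumerate(representatives):
            sim = cosine(emb, rep)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            representatives.append(list(emb))
            best = len(representatives) - 1
        labels.append(best)
    return labels


def diarize(samples: Sequence[float],
            embed: Callable[[Sequence[float]], Vector]) -> List[Tuple[int, int, int]]:
    """Run VAD, embed each segment, cluster, and return (start, end, cluster_id) triples."""
    segments = vad_segments(samples)
    labels = cluster([embed(samples[s:e]) for s, e in segments])
    return [(s, e, label) for (s, e), label in zip(segments, labels)]
```

A production system would replace the energy threshold and the `embed` callable with trained models, but the data flow is the same: audio goes in, and labeled time ranges come out.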

Privacy by Design

🔒
Voice embeddings are one-way mathematical representations. They cannot be reversed into audio. Even if someone accessed the embedding data, they could not reconstruct your voice. Embeddings never leave your device.

Cross-Session Memory

Because dijin stores embeddings locally, speaker identity persists across sessions. Meeting with Ayse on Monday and again on Thursday? dijin recognizes Ayse in both, with no re-enrollment needed.

| Scenario | What Happens |
| --- | --- |
| First meeting with Ayse | dijin clusters the voice; you label it "Ayse" |
| Meeting 3 days later | dijin auto-recognizes Ayse from the stored embedding |
| New unknown speaker | Labeled "Speaker 2" until you assign a name |
| Multiple speakers at once | Each speaker gets a separate embedding and label |
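
The scenarios above come down to one decision per voice cluster: is it close enough to a stored voiceprint, or does it get a placeholder label? The sketch below shows one way that could work. The function names, the cosine-similarity measure, the 0.7 threshold, and the rule that a placeholder number follows the cluster's position in the meeting are all assumptions for illustration, not dijin's documented behavior.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_clusters(cluster_embeddings: List[List[float]],
                   enrolled: Dict[str, List[float]],
                   threshold: float = 0.7) -> List[str]:
    """Name each cluster after its best-matching enrolled voiceprint.

    Clusters with no match at or above the threshold get a placeholder
    "Speaker N", where N is the cluster's 1-based position in the meeting.
    """
    labels: List[str] = []
    for position, embedding in enumerate(cluster_embeddings, start=1):
        best_name, best_score = None, threshold
        for name, voiceprint in enrolled.items():
            score = cosine_similarity(embedding, voiceprint)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append(best_name if best_name is not None else f"Speaker {position}")
    return labels


# Example: Ayse was enrolled in an earlier session; the second voice is new.
enrolled = {"Ayse": [1.0, 0.0]}
print(label_clusters([[0.95, 0.05], [0.0, 1.0]], enrolled))
```

Assigning a name to the placeholder speaker would then amount to storing that cluster's embedding under the new name, so the next meeting matches it directly.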

This separates a transcription tool from a memory system. dijin doesn't just record words; it remembers who said them.
