Software System for Verification and Identification of Speakers

The Avatar software package identifies speakers and verifies their identity based on the individual statistical characteristics of the vowel sounds in their speech.

Avatar's automatic speaker recognition module analyses individual phonemic patterns quickly, precisely and without bias. The system is also completely context independent and can be applied to any language or language group.

Avatar User Guide v1.0

Key Points

Effective speaker identification and verification regardless of the speaker’s language and gender

High precision for audio recordings of 10 s and longer

Reliable results for audio recordings of 30 ms and longer

Available as a standalone desktop application

Speaker Identification Workflow

Several audio files contain voice recordings, and you need to determine whether they were made by the same speaker or by different speakers.

Upload the source files into Avatar’s directory and run an automatic comparative analysis focused on one of the basic sounds.

Get a downloadable graph that compares the speakers’ speech patterns and shows the error probability as a percentage.

Speaker Verification Workflow

There is a large database of phonograms produced by various speakers. Some of the speakers may already be identified, while others are not. The phonograms need to be sorted by the proximity of the speakers’ individual characteristics for further analysis or identification.

Upload the source files into Avatar’s directory and run automatic audio sorting for the sound [a:], the sound [i:], or both.

Get a list of files sorted by the probability of an error of the first kind. This list can be saved, printed or exported as a Word document.
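
The sorting step can be pictured with a minimal sketch. It is not Avatar's implementation: the file names and probabilities are invented, the function name rank_by_error is hypothetical, and ascending order (lowest error probability first) is an assumption, since the order used by Avatar is not stated here.

```python
# Hypothetical per-file results: file name -> probability of an error of the
# first kind. The values are made up for illustration.
first_kind_error = {
    "speaker_03.wav": 0.041,
    "speaker_01.wav": 0.002,
    "speaker_02.wav": 0.017,
}


def rank_by_error(probabilities):
    """Return (file, probability) pairs, lowest error probability first."""
    return sorted(probabilities.items(), key=lambda item: item[1])


ranked = rank_by_error(first_kind_error)
for name, probability in ranked:
    # One tab-separated line per file, with the probability as a percentage.
    print(f"{name}\t{probability:.1%}")
```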

Contact us to get a free 3- to 6-month trial of one of the Silentium tools.


Avatar’s speaker recognition model follows the mechanics of human hearing. Humans do not hear “whole” sounds. Instead, they perceive certain combinations of “atomic” components for each sound they recognize.

Moreover, the set of “atomic” components for each sound may partially overlap with the set of “atomic” components for other sounds. For example, the set of component structures for sound [a:] overlaps with the set of structures for sound [o:]. Similarly, the sound [i] overlaps with the sound [i:].

In a spectral representation, these structures can be interpreted as the averaged spectra of the corresponding sounds for each particular speaker. These averaged spectra are then used in the applied model of Avatar’s Phonemic Machine to distinguish the structural components of individual speech sound patterns.
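
As a rough illustration of an averaged spectrum, the sketch below takes several magnitude spectra measured on short frames of one sound from one speaker and averages them bin by bin. The function name average_spectrum and the sample values are hypothetical, and how Avatar actually builds its averages is not described here.

```python
from statistics import fmean


def average_spectrum(frame_spectra):
    """Element-wise mean of equally sized magnitude spectra."""
    if not frame_spectra:
        raise ValueError("need at least one spectrum")
    n_bins = len(frame_spectra[0])
    if any(len(spectrum) != n_bins for spectrum in frame_spectra):
        raise ValueError("all spectra must have the same number of bins")
    # zip(*...) groups the values that belong to the same frequency bin.
    return [fmean(bin_values) for bin_values in zip(*frame_spectra)]


# Made-up magnitude spectra for three short frames of one speaker's [a:].
frames_a = [
    [0.0, 2.0, 4.0, 1.0],
    [0.0, 4.0, 2.0, 1.0],
    [0.0, 3.0, 3.0, 1.0],
]
avg_a = average_spectrum(frames_a)
```

Each entry of avg_a is the mean of one frequency bin across the frames, so frame-to-frame variation is smoothed out while the overall spectral shape is kept.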

Avatar’s automatic speaker recognition system does not use generally accepted notions such as pitch frequency and speech formants. Still, these parameters are functions of the averaged integral characteristics of the spectra of “atomic” sounds, so they are taken into account indirectly in the identification process. They are also shown in a number of the system’s graphical interfaces.

A significant feature of Avatar’s voice recognition system is its analysis of short time intervals of 10 to 30 ms. The actual interval lengths are determined adaptively and depend on a number of factors. The spectral characteristics of the “atomic” components are calculated with a frequency resolution of 1 Hz regardless of interval length, which is achieved using non-orthogonal wavelet transforms based on the Morlet wavelet.
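
To make the idea of a Morlet-based analysis on a short interval concrete, here is a minimal sketch in plain Python. It is not Avatar's transform: the function names, the n_cycles parameter, the 8000 Hz sampling rate and the 440 Hz test tone are all assumptions chosen for illustration. The sketch correlates a 30 ms segment with a complex Morlet wavelet centred on the segment and repeats this for analysis frequencies spaced 1 Hz apart.

```python
import cmath
import math


def morlet_response(samples, sample_rate, freq_hz, n_cycles=6.0):
    """Magnitude of the correlation between `samples` and a complex Morlet
    wavelet of centre frequency `freq_hz`, placed at the segment centre."""
    sigma_t = n_cycles / (2.0 * math.pi * freq_hz)  # Gaussian width, seconds
    centre = (len(samples) - 1) / 2.0
    acc = 0j
    norm = 0.0
    for n, x in enumerate(samples):
        t = (n - centre) / sample_rate
        envelope = math.exp(-0.5 * (t / sigma_t) ** 2)
        acc += x * envelope * cmath.exp(-2j * math.pi * freq_hz * t)
        norm += envelope
    return abs(acc) / norm


def morlet_spectrum(samples, sample_rate, f_min, f_max, step_hz=1):
    """Evaluate the wavelet response on a grid of integer frequencies."""
    return {
        f: morlet_response(samples, sample_rate, f)
        for f in range(f_min, f_max + 1, step_hz)
    }


# Assumed test signal: a 30 ms pure tone sampled at 8000 Hz.
sample_rate = 8000
n_samples = int(sample_rate * 0.030)
tone_hz = 440
segment = [
    math.sin(2.0 * math.pi * tone_hz * i / sample_rate)
    for i in range(n_samples)
]

spectrum = morlet_spectrum(segment, sample_rate, 300, 600)
peak_hz = max(spectrum, key=spectrum.get)
```

Because the wavelets at neighbouring frequencies overlap, the analysis frequencies do not form an orthogonal basis, which is what allows them to be placed on an arbitrarily fine grid. In this sketch the 1 Hz step is only the spacing of that grid.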

This interpretation of the spectra is fully consistent with the classical concepts, up to the frequency resolution and the specific form of the time window functions. The Avatar system can analyse the voice characteristics of speech recordings lasting from 30 ms to several hundred milliseconds.


At the core of the system is the built-in Phonemic Machine, which is based on deep neural networks. It analyses each speaker’s individual set of vowel phonemes and compares a wide range of their characteristics with other speakers’ speech patterns.

Since speakers are identified by the closeness of the characteristics of their pronunciation of vowels, the analysis is context independent and provides reliable results regardless of speakers’ native language, age, and gender.

The Avatar software system also includes a number of specially built databases that provide solutions for the verification and identification of speakers, depending on the duration of the source phonograms and speaker’s language group.

The probabilistic identification of speakers is based on graphs of errors of the first and second kind. The probabilities of these errors are used directly in solving speaker identification and verification problems. All probability characteristics were determined using special test sets based on prepared experimental data.
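
The sketch below shows how such error probabilities can be estimated from a test set. It assumes a common convention that is not stated in this guide: an error of the first kind is the rejection of a pair of recordings that really come from the same speaker, and an error of the second kind is the acceptance of a pair from different speakers. The distance values, the thresholds and the function name error_rates are hypothetical.

```python
def error_rates(same_speaker_distances, different_speaker_distances, threshold):
    """Return (first_kind, second_kind) error rates at a distance threshold.

    A pair is accepted as "same speaker" when its distance <= threshold.
    """
    false_rejections = sum(1 for d in same_speaker_distances if d > threshold)
    false_acceptances = sum(
        1 for d in different_speaker_distances if d <= threshold
    )
    return (
        false_rejections / len(same_speaker_distances),
        false_acceptances / len(different_speaker_distances),
    )


# Made-up distances between pairs of recordings in a labelled test set.
same = [0.10, 0.15, 0.20, 0.45]
different = [0.30, 0.50, 0.60, 0.70, 0.90]

# One point of each error graph per threshold.
curve = [(t, *error_rates(same, different, t)) for t in (0.2, 0.4, 0.6)]
for threshold, first_kind, second_kind in curve:
    print(f"threshold={threshold}: first kind={first_kind}, second kind={second_kind}")
```

Raising the threshold lowers the first-kind error rate and raises the second-kind error rate, which is why the two are usually plotted together.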

Compared to existing systems, the system has at least an order of magnitude higher frequency resolution when evaluating any spectral parameter of a phonogram. This is achieved using non-orthogonal spectral transformations.

Operational speed: verification of 1000 audio recordings takes several minutes on a PC with a dual-core processor.

Interface localizations: English, Russian and Ukrainian.

Custom localization is available upon request.


The effectiveness of the Avatar automatic speaker recognition system depends on the length of the source recordings. The feasible audio file duration starts from 30 ms, and the longer the audio is, the lower the probability of errors.

The probability of errors is calculated from graphs of errors of the first and second kind built on large data sets of audio recordings with a sampling frequency of at least 8000 Hz.

The use of a special technology for comparing the full spectra of the speakers’ vowel sounds at small time intervals provides highly precise results.

Depending on the length of the audio files, the probability of an error in speaker identification is 0.2% or less.

Source Audio Requirements:
  • Optimal duration: 200 ms and longer
  • Minimum duration: 30 ms (for sounds [a:] or [i:])
  • Maximum duration: unlimited
  • Supported file format: .wav (at least 8,000 Hz)
Hardware Requirements:
  • Processor: from 2 GHz
  • RAM: at least 4 GB (preferably 8 GB)
  • OS: Windows 10 (64 bit)
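
The source audio requirements above can be checked before files are uploaded. The following is a minimal sketch using Python's standard wave module. It is not part of Avatar, and the function names check_wav and make_silent_wav are hypothetical helpers introduced for illustration.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 8000   # from "Supported file format"
MIN_DURATION_MS = 30        # from "Minimum duration"
OPTIMAL_DURATION_MS = 200   # from "Optimal duration"


def check_wav(source):
    """Check a .wav file (path or binary file object) against the limits."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        duration_ms = 1000.0 * wav.getnframes() / rate
    return {
        "duration_ms": duration_ms,
        "sample_rate_ok": rate >= MIN_SAMPLE_RATE_HZ,
        "duration_ok": duration_ms >= MIN_DURATION_MS,
        "duration_optimal": duration_ms >= OPTIMAL_DURATION_MS,
    }


def make_silent_wav(rate, n_frames):
    """Build an in-memory mono 16-bit .wav file of silence for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer


# 800 frames at 8000 Hz is a 100 ms recording: usable, but below optimal.
report = check_wav(make_silent_wav(8000, 800))
too_short = check_wav(make_silent_wav(8000, 160))
```

For a file on disk, pass its path instead of the in-memory buffer, for example check_wav("recording.wav"), where the file name is a placeholder.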