Forensics Audio – Speaker Recognition


Photo: StopFraudCo

Since I joined the Central Audio team at Microsoft, audio innovation has become a huge part of what we do. It has completely changed my view of the role and impact audio can have in the world we live in today. Audio is impactful not just in the HoloLens or in augmented and virtual reality, but in our actual reality as well. I think this is important to note because we often associate audio with entertainment, whether that means listening to music from an artist’s album, being in awe of the loud explosions or monster roars in a movie, or creating an immersive sonic environment within the virtual world of a video game. What fascinates me about forensics audio, though, is that audio can be a driving factor in our real-world judicial system; it can save the lives of the innocent or help convict the guilty of their crimes.

911 Calls

911 calls can be a crucial source of evidence in a case, and that evidence consists solely of audio. Take, for instance, the George Zimmerman case in Florida in 2012.

Zimmerman, a neighborhood watch volunteer, shot Trayvon Martin, an unarmed African American teenager, and claimed he had acted in self-defense. Take a listen to these 911 calls. What audio picture do they paint of what happened at the scene? While we can’t get a 100% conclusive picture of what happened, we can learn a great deal from the audio.

Speaker Recognition

So what do people who work in forensics audio actually do? As you can hear from the previous example, it’s important to be able to identify who is speaking in a recording. Voice ID and speaker recognition play an important part in audio forensics. Just as fingerprints are used to validate someone’s identity, voice prints can also help identify who you are. SpeechPro’s SIS 2 analyzes two audio sources and compares them on criteria such as pitch, fusion, SF (spectral flux pattern recognition), and GMM (Gaussian mixture model) to estimate the probability that the speaker in source 1 matches the speaker in source 2.
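To make one of those criteria concrete, here is a minimal sketch of a pitch comparison between two sources. It is not how SIS 2 works internally (the post does not document that); it is a toy illustration only. The "voices" are synthetic tones rather than real speech, and the function names, the 8 kHz sample rate, and the 60–400 Hz search range are all assumptions made for this example.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed value for this toy example


def synth_voice(f0_hz, duration_s=0.1):
    """Synthetic stand-in for a voiced sound: a fundamental plus two weaker harmonics."""
    n = int(SAMPLE_RATE * duration_s)
    samples = []
    for t in range(n):
        phase = 2 * math.pi * f0_hz * t / SAMPLE_RATE
        samples.append(
            math.sin(phase) + 0.5 * math.sin(2 * phase) + 0.25 * math.sin(3 * phase)
        )
    return samples


def estimate_pitch(samples, fmin_hz=60.0, fmax_hz=400.0):
    """Estimate the fundamental frequency as the lag with the highest autocorrelation."""
    min_lag = int(SAMPLE_RATE / fmax_hz)  # shortest period we search for
    max_lag = int(SAMPLE_RATE / fmin_hz)  # longest period we search for
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag


def semitone_gap(f_a_hz, f_b_hz):
    """Distance between two pitches in semitones (12 semitones = one octave)."""
    return abs(12 * math.log2(f_a_hz / f_b_hz))


# Hypothetical "known" and "unknown" sources with different fundamentals.
known_pitch = estimate_pitch(synth_voice(100.0))
unknown_pitch = estimate_pitch(synth_voice(200.0))
print(f"known: {known_pitch:.1f} Hz, unknown: {unknown_pitch:.1f} Hz")
print(f"gap: {semitone_gap(known_pitch, unknown_pitch):.1f} semitones")
```

A large pitch gap on its own proves nothing, since the same person can speak higher or lower, which is presumably why a tool like SIS 2 combines several criteria instead of relying on one.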


Take, for instance, this example involving Donald Trump.


In May, the Washington Post published an article questioning whether Trump had pretended to be his own publicist to brag about himself on a phone call back in 1991. One important thing to remember when analyzing audio for forensics is to stay unbiased. No matter what your opinions are of one speaker or the other, it’s important to stay objective in evaluating the audio evidence. Forensics audio expert Ed Primeau used SIS 2 to analyze and compare the two voices, and the result was that the speakers’ voice ID traits had a more than 97.5% mismatch.

Photo: Ed Primeau

For voice ID to succeed, the recording usually needs to contain at least 20 words to get an accurate read of the person’s voice. Audio forensics experts will then often request an exemplar: a sample of the suspected person’s voice in which they read exactly what was said in the recording. This allows a direct word-for-word comparison. Other common things experts listen for in recordings are the accent, the pronunciation of consonants (e.g., how do they say their “s”?) and vowels (e.g., how do they say their long “a”?), the pitch of the voice, the emphasis on words, the rhythm or pattern of speech, and pauses.
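The two checks described above (enough words, and an exemplar that matches the recording word for word) can be sketched on transcripts. This is only an illustration under my own assumptions: the function names and the sample sentences are invented for this example, the 20-word threshold is the rule of thumb quoted above rather than a formal standard, and real casework compares the audio itself, not just the text.

```python
import re

MIN_WORDS = 20  # rule of thumb from the paragraph above, not a formal standard


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def check_exemplar(evidence_transcript, exemplar_transcript, min_words=MIN_WORDS):
    """Check recording length and list where the exemplar departs from it."""
    evidence = words(evidence_transcript)
    exemplar = words(exemplar_transcript)
    # Pair words by position; any differing pair breaks the word-for-word match.
    mismatches = [
        (i, a, b) for i, (a, b) in enumerate(zip(evidence, exemplar)) if a != b
    ]
    return {
        "enough_words": len(evidence) >= min_words,
        "same_length": len(evidence) == len(exemplar),
        "mismatches": mismatches,
    }


# Invented sentences, used only to exercise the function.
evidence = (
    "Please leave the package by the side door and call me back before five "
    "so that I can confirm that everything arrived safely today"
)
exemplar = evidence.replace("five", "six")

report = check_exemplar(evidence, exemplar)
print(report)
```

If the mismatch list is not empty, the exemplar would need to be re-recorded before the two clips can be lined up word for word.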

In general, this topic is very fascinating to me. I spent almost a full day just reading about it and listening to many examples.  My next several posts will dive deeper into other aspects of forensics audio, so stay tuned! 🙂
