Ever since we hacked together our first rough proof-of-concept of an app estimating the ratio of men’s vs women’s voices in meetings (waaay back in 2016) we’ve been unable to let the concept go. In January, when Google announced they’d chosen Keras as the first official high-level library to be added to TensorFlow core, we were so inspired we decided we had to give it a shot. What could possibly go wrong, right?
Looking for data
All successful machine learning projects follow the same basic formula: get data, build a model, test the model.
Everyone knows this. Thus, the first problem in our machine learning endeavor was data. Training a model takes data, and lots of it.
Like any sane person looking for something, we went ahead and binged it. Nah, I’m joking of course. We googled it. I’m sure we would’ve gotten the same result with Bing though. Eventually. Anyway, up turned an article about classifying people’s gender by voice. Using a variety of ML techniques, its authors found a model with 99% accuracy on their data. What’s more, their model was very simple and boiled down to this decision tree:
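In code, a tree like that keys on just two acoustic features: the mean fundamental frequency (f0) and the interquartile range (IQR) of the frequency distribution. Here is a minimal sketch of such a two-feature classifier in Swift — note that the exact shape of the tree and every threshold below are illustrative placeholders, not the article’s actual figures:

```swift
enum Gender {
    case male
    case female
}

struct VoiceFeatures {
    let meanF0: Double   // mean fundamental frequency, kHz
    let iqr: Double      // interquartile range of the frequency distribution, kHz
}

func classify(_ v: VoiceFeatures) -> Gender {
    // A low fundamental frequency is the primary male indicator;
    // the IQR split resolves the ambiguous middle band.
    // All thresholds are placeholders.
    if v.meanF0 < 0.14 {
        return .male
    }
    if v.meanF0 > 0.18 {
        return .female
    }
    return v.iqr > 0.07 ? .female : .male
}
```

The appeal of a tree this small is that it runs in constant time per sample, which matters once you try to do it live on a phone.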
Is there a lib for that?
Onto the next part of the famous machine learning auto-success formula: testing. We had to implement the model in Swift on iOS to verify that it worked. The algorithm itself is stupid simple, but extracting features from audio in Swift proved a different matter. There is, as far as we can tell, no near-omnipotent audio analysis lib for Swift—like seewave for R—that we could use to extract f0 and IQR from audio in real time. After much googling and some hair-pulling, we found the pitch detection lib Beethoven which looked like it might suit our purpose. We cloned it, ripped out the parts we didn’t need and added the missing stuff that we did. I ♡ the MIT license.
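The IQR part, at least, is plain statistics once a pitch tracker like Beethoven has handed you a series of f0 estimates. A self-contained sketch, assuming `pitches` is a non-empty array of detected pitch values:

```swift
// Interquartile range of a series of detected pitch values,
// using linear interpolation between sorted samples.
func interquartileRange(_ pitches: [Double]) -> Double {
    let sorted = pitches.sorted()
    // Quartile at fraction q (0.25 = Q1, 0.75 = Q3).
    func quartile(_ q: Double) -> Double {
        let pos = q * Double(sorted.count - 1)
        let lower = Int(pos)
        let upper = min(lower + 1, sorted.count - 1)
        let frac = pos - Double(lower)
        return sorted[lower] * (1 - frac) + sorted[upper] * frac
    }
    return quartile(0.75) - quartile(0.25)
}
```

The hard part is everything upstream of this function: getting clean, real-time f0 estimates out of a phone mic in a noisy meeting room.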
With a spring in our step and a smile on our face we started implementing the model using the modified Beethoven library. We were hoping to achieve an accuracy of at least 85-90% on our own test data. Naturally, that didn’t happen. We had to go back to the drawing board.
A few mostly sleepless nights followed. Finally, by adjusting the thresholds in the original model and performing various black magicks, we felt confident enough to proceed with building the actual app.
Someone messed with the iPhone mic
Once at the “works on my device”-stage, we started testing on different devices and with more voices. To our dismay, we found that something had changed in the recording system between the iPhone 6 and later models.
It seems the iPhone 6, as part of its built-in noise reduction, has a low shelf filter that cannot be disabled and that newer devices such as the 6S, SE and 7 lack. Several sources claimed this filter could be turned off by setting the AVAudioSession mode to AVAudioSessionModeMeasurement. We found this to be false.
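For reference, the configuration those sources pointed at looks roughly like this. The API calls are real AVFoundation calls and run fine; they just didn’t disable the filter in our tests:

```swift
import AVFoundation

// Measurement mode is documented to minimize system-supplied
// signal processing on input — but in our tests it did not
// disable the iPhone 6's low shelf filter.
let session = AVAudioSession.sharedInstance()
do {
    try session.setCategory(AVAudioSessionCategoryPlayAndRecord)
    try session.setMode(AVAudioSessionModeMeasurement)
    try session.setActive(true)
} catch {
    print("Failed to configure audio session: \(error)")
}
```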
In addition to this filter, the mics of these newer phones are more sensitive and record at a higher level than those of older ones. These two factors together were bad news for both our fundamental frequency extraction algorithm and our interquartile range calculator. The results were bad. Very bad.
The only solution seemed to be the good old device detection trick we’ve used a bazillion times when building websites, knowing all along it’s a bad idea. And presumably because it’s a bad idea, Apple has made it as difficult as possible to find out the device model in any more detail than “iPhone”. Thanks, Apple. Luckily, we had code lying around for this precise case in an old project. Using it, we were able to detect the device model and adjust the input level threshold and apply our own low shelf filter on newer devices to make them behave.
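The trick boils down to asking the kernel for the raw hardware identifier via `uname`, since the public APIs only tell you “iPhone”. A sketch of the usual approach (the model check at the end is a hypothetical example, not our exact tuning logic):

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Returns the raw hardware identifier, e.g. "iPhone9,3" for an
// iPhone 7 (or "x86_64" in the simulator).
func deviceModelIdentifier() -> String {
    var systemInfo = utsname()
    uname(&systemInfo)
    // utsname.machine is a fixed-size C char tuple; walk it with
    // Mirror and stop appending at the null terminator.
    let mirror = Mirror(reflecting: systemInfo.machine)
    var identifier = ""
    for child in mirror.children {
        guard let value = child.value as? Int8, value != 0 else { continue }
        identifier.append(Character(UnicodeScalar(UInt8(value))))
    }
    return identifier
}

// Hypothetical use: branch the input level threshold per model family.
let isNewerDevice = deviceModelIdentifier().hasPrefix("iPhone9")
```

It is exactly as fragile as it looks: every new device generation means a new identifier to handle, which is why it is a bad idea on the web and only a slightly less bad one on iOS.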
We promise we’re not recording your top secret meetings
The reason we couldn’t avoid this whole mess and just release our old version of the app to the public is that the gender classification was done server-side in a fairly inefficient manner. The server could barely handle one connected device at a time. I don’t even want to think about what would happen if more than ten people tried to use the app at the same time.
For this concept to work at scale, the classification has to be done directly on the device. That means we do not have to send any audio data whatsoever from the device. And we don’t. Cross my heart. What we do send are minimal statistics for each recorded session to be able to deliver some sort of analytics platform in the future. Plus they’re fun to look at on the web site.
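To give a feel for how little that is, here is a sketch of what such a per-session payload could look like. The struct and its field names are our guesses for illustration, not the app’s actual schema; the point is that no raw audio appears anywhere in it:

```swift
import Foundation

// Hypothetical per-session statistics payload — aggregate numbers
// only, no audio data of any kind.
struct SessionStats: Codable {
    let sessionLengthSeconds: Double
    let maleSpeechSeconds: Double
    let femaleSpeechSeconds: Double
}

let stats = SessionStats(sessionLengthSeconds: 1800,
                         maleSpeechSeconds: 1100,
                         femaleSpeechSeconds: 520)
let payload = try! JSONEncoder().encode(stats)
```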
What happens next?
This app is not perfect. I’ll be the first to acknowledge that. There are still features to add, issues to fix and problems to solve.
First, the algorithm makes mistakes even under optimal conditions, and the more noise there is, the worse it gets. This cocktail party problem is well known, and while some people may soon solve it, we are not those people.
Second, since we can only look at the physical characteristics of a voice we cannot possibly know whether the person to whom it belongs identifies with their biological gender or not. This is unfortunate, because the last thing we want is for people to feel excluded and marginalized.
And finally, a mere app cannot solve the problem of gender inequality. But maybe, just maybe, it can help.