A Dutchman, a doctor and a computer walk into a bar

Speech is one of the things we consider deeply human: most of our interaction is based on it. So it’s no surprise that we humans have been trying to teach others to talk – be it animals or machines.

“Speech is a more natural interface than typing”, summarizes Peter Smit, a doctoral candidate in Aalto University’s Centre of Excellence in Computational Inference. He has a strong background in machine learning, and for several years now speech recognition has been the application of machine learning he has focused on.

And he has a solid point. Regardless of how we use our fingers nowadays, they didn’t evolve for typing. And to be able to communicate with us better, computers will need to learn to understand speech.

However, anyone who has tried to ask Siri to call a friend has their doubts about the state of speech recognition – especially because the field is not exactly new: as early as the 1960s, machines were able to recognize digits and other simple words.

Machine learning driving better speech recognition

During the last decade, we have seen a turning point. Where speech recognition systems used to be built to model and understand each language separately, the rapid development of deep learning has changed the game entirely. Speech recognition is nowadays really an example application of deep neural networks.
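To make that concrete, here is a minimal sketch of what “speech recognition as a deep neural network” looks like in code: a tiny network that turns slices of audio (spectrogram frames) into probabilities over letters. The architecture, sizes and alphabet below are illustrative assumptions, not any particular production system.

```python
# Illustrative only: a toy acoustic model, not any real recognizer.
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz '"  # 28 output symbols (assumed)
N_MELS = 80                                # features per spectrogram frame (assumed)

class TinyAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A recurrent layer reads the audio frames in order...
        self.rnn = nn.LSTM(input_size=N_MELS, hidden_size=128, batch_first=True)
        # ...and a linear layer turns each frame into symbol scores (+1 for a CTC blank).
        self.out = nn.Linear(128, len(ALPHABET) + 1)

    def forward(self, frames):                    # frames: (batch, time, N_MELS)
        hidden, _ = self.rnn(frames)
        return self.out(hidden).log_softmax(-1)   # per-frame symbol log-probabilities

model = TinyAcousticModel()
fake_audio = torch.randn(1, 200, N_MELS)          # ~2 seconds of dummy features
print(model(fake_audio).shape)                    # torch.Size([1, 200, 29])
```

A real recognizer is of course far larger and is trained on thousands of hours of speech, but the core idea is the same: audio features in, symbol probabilities out.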

Developing these new technologies depends heavily on data, and this has led to a situation where industries and markets are actually starting to drive the research: they have access to more and better data.

And even more interestingly, this means that you don’t have to know a language to analyze it. Given enough data, the same methods can be scaled to recognize a new language quite fast. This is something a human couldn’t do, but technology can.

Peter, for example, is a native Dutch speaker and fluent in English, too. But he has analyzed more than 10 languages, including Finnish and Northern Sami.

“All speech recognition techniques will become more language independent in the future. In five years, the whole area of speech recognition will look totally different”, Peter believes.


Understanding computers  

There are a few things that we still need to clarify to make this computer-human relationship work.

First, people have the tendency to overestimate how much machines actually understand.

Speech is such a natural interface for us humans that we assume recognizing it also means understanding the concepts and ideas behind it. But speech recognition is different from natural language processing: machines are able to recognize words, but not the meanings behind them.

Second, computers are very tech-savvy.

People often believe that, for example, background noise, sound quality or strong accents are critical obstacles to speech recognition. In reality, these aren’t problems at all: this side of the technology is very good.

Third, people are very context-dependent.

“It’s in the nature of humans to adapt to different situations. Unfortunately, we can’t necessarily say the same thing about computers”, Peter explains.  “Context is everything. The same sound “too” can mean very different things in different contexts: go to, two sets, me too. To be able to type in the right word, the computer needs to analyze the context.”
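Peter’s example can be turned into a toy sketch. Below, simple bigram counts gathered from a tiny made-up corpus are used to pick between the homophones “to”, “two” and “too” based on the neighbouring words – an illustration of the idea of using context, not the method any real recognizer uses.

```python
# Toy illustration of context: choose the homophone that best fits its neighbours.
from collections import Counter

corpus = (
    "go to the store . me too . two sets of notes . "
    "i want to sleep . she has two dogs . come too"
).split()

# Count how often each pair of adjacent words occurs in the tiny corpus.
bigrams = Counter(zip(corpus, corpus[1:]))

def best_candidate(prev_word, next_word, candidates=("to", "two", "too")):
    """Return the candidate word that fits the surrounding words best."""
    def score(word):
        return bigrams[(prev_word, word)] + bigrams[(word, next_word)]
    return max(candidates, key=score)

print(best_candidate("go", "the"))     # -> "to"
print(best_candidate("me", "."))       # -> "too"
print(best_candidate("has", "dogs"))   # -> "two"
```

Real systems use far more sophisticated language models trained on vastly more text, but the principle is the same: the word that fits its context best wins.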


This is not only true of individual words; focus helps in other ways, too. As mentioned before, quality data is very valuable in speech recognition, and the more closely the data matches a specific focus area, the better the end result.

“This is often seen as a flaw, but people have different focus areas, too. Not everyone can understand legal text or medical dictations.”

And medical dictation and transcription is the field Peter is focusing on next. He has joined the team of Inscripta, a startup that aims to stop wasting people’s time on medical dictation. A recent study shows that in Finland, medical professionals (not just doctors but also nurses and other staff) spend more than 50 % of their working hours on compulsory back-office tasks, of which typing in patient notes is by far the most time-consuming. The same goes for similar healthcare systems elsewhere in Europe and in North America.

To put this another way: of the approximately 20 000 medical doctors in Finland, it is as if only half were able to actually focus on patient work while the other half wasted their expertise on routine back-office work.


The medical field as an application area

Medical dictation is a very good field in which to apply speech recognition. It’s fairly easy to gather a dataset of hundreds of hours of speech, along with texts that cover most medical-specific terms, and that’s enough to get the deep learning process going.

Transcription also plays a vital role in healthcare, for both patient safety and legal reasons. There’s a lot of transcribing to get done, and using speech recognition could free healthcare staff’s time to do their actual job and meet more patients.

As in many other applications of machine learning, humans would still be needed to review the work done by computers. But it’s far more efficient than where we are now.

And of course, we should not forget: “People make mistakes, too. Not all words transcribed by humans are totally correct, and it will not take long for algorithms to get as good as or even better than humans at transcribing in specific areas.”