Natural Language Processing ft. Siri

Komal Saini
5 min read · Sep 14, 2019

“Hey Siri, what’s the weather like?” If you are an iPhone user, chances are you have experimented with Apple’s very own virtual assistant — literally at your fingertips — Siri. However, how exactly does Siri work?

Siri uses a variety of advanced machine learning technologies to be able to understand your command and return a response — primarily natural language processing (NLP) and speech recognition.


NLP primarily focuses on allowing computers to understand and communicate in human language. For the purposes of processing, language is analyzed at three levels: syntax, semantics, and pragmatics. Syntax describes the structure and composition of phrases, semantics assigns meaning to those syntactic elements, and pragmatics refers to the context in which a phrase is used and how that context shapes its meaning.

For most of its history, natural language processing relied on hand-crafted, rule-based approaches; sentences were generated using syntactic information drawn from a database. In the late 1980s, however, machine learning algorithms began to replace these brittle and inefficient rule-based systems.

NLP systems depend on statistical models to produce reliable responses. Deep neural networks, a powerful class of statistical models, are behind many of NLP's recent successes, including question answering, text generation, and automatic summarization.

But how exactly does all of this work? Let’s use a toddler as an example:

Daisy is a 2-year-old who loves to bite her brother, Jack. However, when Jack says, “Daisy, stop biting me,” she seems to stop. How does Daisy know what Jack is asking her to do? As Jack tells her to stop, he also pulls his arm away. Because this has happened numerous times, Daisy is able to attribute “stop biting me” to the fact that Jack does not want to be bitten. She is able to process that data and store it for the future, so she knows what to do when she hears that phrase. Daisy is being trained.

Similarly, in the world of NLP, Daisy would be considered a trained model: she has learned from data. One way of processing text in a comparable fashion (though there are many others) is Named Entity Recognition (NER). This process attaches semantic labels to elements of the input and allows the machine to understand the subject of the given task.
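
As a rough illustration (not Siri's internal pipeline), here is what NER looks like with the open-source spaCy library; the phrase and model name below are just example choices:

```python
# A minimal NER sketch using the open-source spaCy library; this only
# illustrates the idea and is not how Siri is implemented internally.
import spacy

# Assumes the small English model has been installed:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Hey Siri, what's the weather like in Toronto tomorrow?")

# Each recognized entity carries a semantic label the machine can act on.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Toronto -> GPE, tomorrow -> DATE
```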

Though this process may seem simple, it involves a wide variety of text-processing substages, such as dependency parsing, stemming and lemmatization, and TF-IDF, to name a few.
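
To make two of those substages concrete, here is a small sketch using common open-source tools (NLTK and scikit-learn); Siri's actual preprocessing is not public, so treat this purely as an illustration:

```python
# Illustrative sketch of stemming, lemmatization, and TF-IDF using NLTK and
# scikit-learn; these are stand-ins for whatever Siri does internally.
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Stemming crudely chops suffixes; lemmatization maps to a dictionary form.
# (The lemmatizer needs `nltk.download("wordnet")` the first time.)
print(PorterStemmer().stem("studies"))           # studi
print(WordNetLemmatizer().lemmatize("studies"))  # study

# TF-IDF weighs each word by how distinctive it is across a small corpus.
docs = [
    "hey siri what is the weather like",
    "hey siri set a timer for ten minutes",
    "what is the weather forecast for today",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)           # one weighted vector per document
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```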

To get a better understanding of how NLP is applied, let's delve deeper into how Siri works. Siri's ability to detect "Hey Siri" relies primarily on a recurrent neural network, together with multi-style and curriculum learning. By leveraging techniques from speaker recognition, Siri can avoid unintended activations, such as when the user says a similar-sounding phrase. Speaker recognition allows Siri to identify the primary user by his or her voice, and it involves two stages: user enrolment and recognition.

During the user enrolment stage, the primary user is asked to say a few phrases, which are used to build a statistical model of the user’s voice. In the recognition stage, the device compares an incoming speech input against this primary-user model and decides whether to accept or reject it.

System diagram of personalized “Hey Siri” (Apple 2018)

The boxes in green represent the feature-extraction stages, in which the “Hey Siri” input is converted into a fixed-length speaker vector. In the first stage, the speech input is converted to a fixed-length speech supervector, which captures information about the phonetic content, the background recording environment, and the identity of the user.
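
To make this first stage more concrete, here is a speculative sketch of one way to build such a fixed-length supervector using librosa; the 26-MFCC-by-17-segment layout mirrors the 442 dimensions mentioned below, but the exact recipe Apple uses is not public:

```python
# A speculative sketch of turning a variable-length utterance into a
# fixed-length speech supervector. The 26 MFCCs x 17 segments layout matches
# the 442 dimensions mentioned later; Apple's exact recipe is not public.
import numpy as np
import librosa

def speech_supervector(wav_path, n_mfcc=26, n_segments=17):
    y, sr = librosa.load(wav_path, sr=16000)                # mono audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (26, n_frames)
    # Split the frames into equal segments and average each one, so every
    # utterance, short or long, maps to the same 26 * 17 = 442 values.
    segments = np.array_split(mfcc, n_segments, axis=1)
    return np.concatenate([seg.mean(axis=1) for seg in segments])

# supervector = speech_supervector("hey_siri.wav")  # hypothetical recording
```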

In the second step, the speech vector is transformed to place emphasis on the characteristics of the user’s voice rather than environmental factors. This allows a user’s utterance of “Hey Siri” to be recognized in various environments.

Devices with the “Hey Siri” feature enabled store a user profile consisting of speaker vectors; after the enrolment process, the profile contains five of them. In the model comparison stage, a speaker vector is extracted for each test utterance and its cosine score is computed against each of the speaker vectors already in the profile. If the average of these scores surpasses a preset threshold (λ), the device processes the user’s command. The latest accepted speaker vectors continue to be added to the profile until it holds a total of forty.
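
In code, this model-comparison stage might look roughly like the NumPy sketch below; the threshold value is a placeholder and the profile-update policy is simplified from the description above:

```python
# Sketch of the model-comparison stage: average the cosine similarity of a
# test speaker vector against the profile and accept if it clears a threshold.
import numpy as np

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accepts(profile, test_vector, threshold=0.7):  # threshold (lambda) is a placeholder value
    scores = [cosine_score(v, test_vector) for v in profile]
    return np.mean(scores) > threshold

def update_profile(profile, test_vector, max_vectors=40):
    # Accepted vectors keep being added until the profile holds forty in total.
    if accepts(profile, test_vector) and len(profile) < max_vectors:
        profile.append(test_vector)
    return profile
```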

The most important component of any speaker recognition system is the speaker transform, which minimizes within-speaker variability while maximizing between-speaker variability. Siri’s speech vector is 442-dimensional (26 MFCCs × 17).

The very first version of the speaker transform used for Siri was trained using Linear Discriminant Analysis (LDA). It used data from 800 production users with 100+ utterances each, producing a 150-dimensional speaker vector.
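
For intuition, an LDA speaker transform can be sketched with scikit-learn; the arrays below are random placeholders standing in for real supervectors labelled by speaker:

```python
# Rough sketch of an LDA speaker transform with scikit-learn. Random arrays
# stand in for real supervectors labelled by speaker (~800 users, 100+ each).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((4_000, 442))        # placeholder speech supervectors
speaker_ids = np.repeat(np.arange(200), 20)  # placeholder speaker labels

# Project 442-dimensional supervectors down to 150-dimensional speaker vectors,
# choosing directions that best separate the speakers from one another.
lda = LinearDiscriminantAnalysis(n_components=150)
speaker_vectors = lda.fit_transform(X, speaker_ids)  # shape: (4000, 150)
```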

The speaker transform was further improved by using explicit enrolment data, enhancing the front-end speech vector, and switching to a non-linear discriminative technique in the form of deep neural networks (DNNs). A DNN was trained on data from 16,000 users, each speaking roughly 150 utterances. The network structure consisted of a:

  • 100-neuron hidden layer with a sigmoid activation
  • 100-neuron hidden layer with a linear activation
  • Softmax layer consisting of 16,000 output nodes

After the DNN is trained (using the speech vector as an input), the softmax layer is removed. The output of the linear activation layer is then used as a speaker vector. Here is a visual representation of the process:

Deep neural network training & speaker vector generation (Apple 2018)

In more recent experiments, however, a network of four 256-neuron hidden layers with sigmoid activations, followed by a 100-neuron linear layer, has achieved the best results.
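
A hedged Keras sketch of the original network looks something like this; the layer sizes follow the article, while the optimizer, loss, and training data are placeholder assumptions (swapping the hidden layers for four 256-neuron sigmoid layers gives the improved variant):

```python
# Keras sketch of the speaker-vector DNN described above. Everything beyond
# the layer sizes (optimizer, loss, data) is a placeholder assumption.
from tensorflow.keras import layers, models

SPEECH_VECTOR_DIM = 442         # input speech vector
NUM_TRAINING_SPEAKERS = 16_000  # softmax targets, one per training user

inputs = layers.Input(shape=(SPEECH_VECTOR_DIM,))
x = layers.Dense(100, activation="sigmoid")(inputs)      # hidden sigmoid layer
speaker_vector = layers.Dense(100, activation="linear",  # hidden linear layer
                              name="speaker_vector")(x)
outputs = layers.Dense(NUM_TRAINING_SPEAKERS, activation="softmax")(speaker_vector)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(train_supervectors, train_speaker_ids, ...)   # training data not shown

# After training, the softmax head is dropped; the linear layer's output is
# kept as the 100-dimensional speaker vector.
embedding_model = models.Model(inputs, speaker_vector)
```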

Key Takeaways:

  • Natural language processing allows computers to answer questions, extract information, analyze sentiment, and make predictions based on the context of the input phrase
  • Machine learning and natural language processing technologies are at the core of voice assistants
  • Though revolutionary advancements have been made in recent years, the performance of speaker recognition systems still has room for improvement
  • The integration of deep neural networks can significantly enhance the speaker vector generation process


Here’s something to think about: How can we train computers to understand sarcasm?

Feel free to reach me at komalsaini@live.com or connect with me on LinkedIn! :)
