Building Jarvis – NLP Hot Word Detection

Have you ever wished you had your own Jarvis? An artificial intelligence system tending to your every need, solving complex tasks for you and pouring coffee whenever you need it?

As stated by Mr Stark himself, Jarvis started as a small Natural Language Processing network. While we are not S.T.A.R.K. Industries, we could still start building our Jarvis out of the many open-source components available.

There are, however, two disadvantages to doing so. The first is that we will not learn anything. The second is that we will be stuck with a black box we cannot extend or improve. So, obviously, we will go with the manual approach.

The first step in building our AI may sound like a simple problem but, as you will see in a bit, it is the opposite. What I am talking about here is called “hot word” detection (a.k.a. keyword spotting, trigger word detection or wake word detection). This technique is meant to save processing power, so that Jarvis only processes our commands when we need it to.

Disclaimer: I will be using Keras for simplicity, and in this article you will probably find the only readable description of a Keras keyword spotting network architecture (I know I couldn’t find any). Also, if you are not familiar with the basic concepts of neural networks, I would suggest reading my previous articles before you continue.

They are already here!

If you are using Siri, Alexa or Google Home, you have already calibrated and used a small neural network designed for wake word detection. The real Jarvis, on the other hand, had unlimited resources and could continuously listen to Mr Stark’s voice.

Unfortunately, constantly pumping audio streams into a sophisticated Natural Language Processing network isn’t a viable option for us. Many giants including Google and Apple have no such resources to spare either (at least not per device). All the major players currently on the market use (to a certain degree) two-stage voice processing.

The first stage happens offline on the device and is supposed to establish, at least to a certain degree of confidence, that it is worth passing the device’s audio stream to the second, online stage. This online stage is where the actual processing of that stream happens. For example, your mobile device constantly listens to you, and when it detects a “Hey Jarvis” phrase with 80% certainty, it decides it is worth sending the audio stream to the cloud. In the cloud, a more sophisticated algorithm verifies that the “Hey Jarvis” phrase was spoken and, if so, starts processing further commands spoken after our hot phrase.
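
The two-stage flow can be sketched in a few lines of Python. This is only an illustration of the idea, not how any vendor implements it: `on_device_score` and `cloud_confirms` are hypothetical callables standing in for the small local model and the cloud verifier, and the 0.8 threshold is simply the example value from the paragraph above.

```python
from typing import Callable, Sequence

WAKE_THRESHOLD = 0.8  # example certainty from the text, not a tuned value


def process_audio(
    chunk: Sequence[float],
    on_device_score: Callable[[Sequence[float]], float],  # hypothetical local model
    cloud_confirms: Callable[[Sequence[float]], bool],    # hypothetical cloud check
    threshold: float = WAKE_THRESHOLD,
) -> str:
    """Return which stage made the final decision for this audio chunk."""
    # Stage 1 (offline, on the device): a cheap score for the wake phrase.
    if on_device_score(chunk) < threshold:
        return "ignored_on_device"
    # Stage 2 (online): only now is the audio sent for the expensive check.
    if not cloud_confirms(chunk):
        return "rejected_in_cloud"
    return "wake_word_confirmed"
```

The point of the design is that the second callable is never invoked for the vast majority of audio, which is what keeps the cloud bill and the battery drain under control.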

This on-device, offline stage is the problem we’ll be working on here. Think about it! If we wanted to continuously send a data stream to one of the available speech-to-text services, we would be spending about $1,000 every month per device!
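
To see where a figure of that order comes from, here is a back-of-the-envelope calculation. The per-minute price is an assumption chosen for illustration only; real rates differ between providers and change over time.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_MONTH = 30

# Assumption: an illustrative price of $0.024 per minute of streamed audio.
ASSUMED_PRICE_PER_MINUTE_USD = 0.024


def monthly_streaming_cost(price_per_minute: float, days: int = DAYS_PER_MONTH) -> float:
    """Cost in USD of streaming audio around the clock for `days` days."""
    minutes = MINUTES_PER_DAY * days  # 43,200 minutes in a 30-day month
    return round(minutes * price_per_minute, 2)


print(monthly_streaming_cost(ASSUMED_PRICE_PER_MINUTE_USD))
```

With the assumed rate, a 30-day month of non-stop streaming lands a little above $1,000 for a single device.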

So now you hopefully understand why wake word detection is so important. It allows us to detect a certain phrase with a certain degree of probability, and when we detect it, we can pass the audio stream to our cloud services. They operate on much faster hardware and use less power-efficient neural networks to perform deeper analysis.

This is exactly what we’ll do here. We’ll design a neural network with a few key requirements in mind: speed, accuracy and power efficiency. We need efficiency as we will be designing for the IoT world (Raspberry Pi, mobile phones, Arc Reactor powered exoskeleton armour, things like that). You may say there are some ready-made products we could use, like PicoVoice or Snowboy.AI, but these are shipped as black boxes that cannot teach us a lot.

What does Jarvis eat?

Before we get to the network architecture, we need to prepare some training data. Without good data, we cannot expect to train a good network. Training data is to a neural network what food is to a human: it is extremely important and should be balanced. There is a well-known phrase among data scientists describing this: “garbage in – garbage out”. Going overboard or not providing enough quality and quantity of proteins can push one to be as skinny as young Steve Rogers or as fat as “Bro Thor”. This could pose a challenge, as we need samples of our wake phrase recorded by many people against a variety of background noises…

Or do we? Well, yes, we do – but we don’t have to know thousands of people and spend all our time recording samples. Instead, we can cheat a bit and use existing text-to-speech solutions like Amazon Polly or Google Text To Speech. They both provide a web console where you can test different phrases and configurations for free in your browser.

Both Google and Amazon support SSML (Speech Synthesis Markup Language). It’s a simple markup language that allows for speech tuning – making little modifications to what the output voice will sound like. It should be relatively easy to write a script, using the available client SDKs, that generates many variations of our wake phrase. Variations are constructed by combining different speech synthesiser settings via SSML. All the settings we can use for Google TTS, like “Pitch” or “Emphasis level”, are well described here, and the Amazon equivalent can be found here.

Above you can see 12 different voice samples all saying “Hey Jarvis” described in the SSML.
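
Generating such variations can be sketched as a small script. This is a minimal example that only builds the SSML strings; sending them to a text-to-speech service is left out. The specific pitch, rate and emphasis values are illustrative assumptions, and not every voice or provider supports every tag, so check the provider’s SSML reference before relying on them.

```python
from itertools import product
from typing import List

PHRASE = "Hey Jarvis"
# Illustrative values only (assumptions, not settings from the article).
PITCHES = ["-2st", "+0st", "+2st"]
RATES = ["slow", "medium"]
EMPHASIS_LEVELS = ["moderate", "strong"]


def build_ssml(phrase: str, pitch: str, rate: str, emphasis: str) -> str:
    """Wrap the phrase in standard SSML prosody and emphasis tags."""
    return (
        "<speak>"
        f'<prosody pitch="{pitch}" rate="{rate}">'
        f'<emphasis level="{emphasis}">{phrase}</emphasis>'
        "</prosody>"
        "</speak>"
    )


def ssml_variations(phrase: str = PHRASE) -> List[str]:
    """One SSML document per combination of the settings above."""
    return [
        build_ssml(phrase, pitch, rate, emphasis)
        for pitch, rate, emphasis in product(PITCHES, RATES, EMPHASIS_LEVELS)
    ]
```

Three pitches, two rates and two emphasis levels give 12 documents per voice; repeating the loop over several voices multiplies the number of samples again.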

Let’s mix things up

This should allow us to generate a couple of hundred voice samples. The only thing left is to mix our samples with some background noise. For this, you can simply leave your voice recorder on in the place or places where you will most likely be using the detector (house, park, office, any other place in the vicinity of the nine realms). You can use any sound processing library to mix your samples with random clips from the background noise audio file. I have used NAudio, a simple yet powerful library that, among other things, allows mixing audio streams.
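
I did the mixing with NAudio, but the idea is easy to show in Python with NumPy. The sketch below is an illustration rather than the code used for this project. It assumes both recordings are already loaded as mono float arrays in the range [-1, 1] at the same sample rate, and the signal-to-noise ratio parameter (`snr_db`) is my own addition for controlling how loud the background is.

```python
import numpy as np


def mix_with_noise(speech, noise, snr_db, rng):
    """Overlay `speech` on a random slice of `noise` at the requested SNR (dB)."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if len(noise) < len(speech):
        raise ValueError("noise recording must be at least as long as the speech clip")

    # Pick a random window of background noise the same length as the sample.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise_clip = noise[start:start + len(speech)]

    # Scale the noise so that speech power / noise power matches the target SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise_clip ** 2)
    if noise_power == 0 or speech_power == 0:
        return speech.copy()
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + gain * noise_clip

    # Avoid clipping when the result is written back to a WAVE file.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

Calling it several times per sample with `rng = np.random.default_rng()` and different `snr_db` values yields the “varying degrees of background noise” the dataset needs.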

As with any neural network, it is not easy to say how many samples are a good amount – it depends on the network architecture and the working environment but for our project, I can suggest (based on my own experiments) a starting dataset of:

  • 200 positive samples recorded over varying degrees of background noise.
  • 100 positive samples recorded over silence.
  • 200 negative samples of random words recorded over varying degrees of background noise.
  • 100 negative samples recorded over silence.

Now that we have our training data, we can start thinking about the network design and how we are going to train it with our samples. We need to decide on the shape of the network as well as the number of layers and neurons. We also need to think about any pre-processing we may want to do on the files to make them more easily digestible by our network. Don’t worry! It’s not going to be as scary as it sounds.

Tip for Lazy Scientists: If you don’t want to spend time preparing your own data, you can use part or all of the Google Speech Commands Dataset. It consists of over 100 000 WAVE files of different people saying different commands.

Hammer time – Network Architecture Design

Selecting the right architecture for language processing is not an easy task. What we usually do is start with a couple of well-described architectures. They are not hard to find: search for “wake word detection neural networks”. Pick one, evaluate it, then start changing its hyperparameters and layers to see if the results improve.

You may want to spend a bit of cash on a cloud TPU (special hardware built for ML, available remotely). This will be helpful, as your local setup may not allow you to experiment as fast as cloud TPUs do. Cloud TPUs can be hundreds of times quicker than regular CPUs.

Data science on a budget tip: If you prefer to run your experiments locally, I would recommend one of the dedicated ML-enabled Nvidia Turing line of GPUs. They are relatively cheap ($350+), an order of magnitude cheaper than Nvidia Teslas, which can go for $10,000+, and yet in a lot of scenarios both chips will perform at a similar level. Think of it as the community and enterprise versions of a piece of software – there is often not much of a difference except for premium support and some advanced features you may never use.

Our selected base network model architecture is described in the Stanford University research paper “Speech Command Recognition with Convolutional Neural Network”.

What this paper describes, and what the above diagram shows, is a regular convolutional neural network. Our model uses the dropout technique to improve learning and the pooling technique, in the form of a MaxPool layer, to stabilize the data coming through the network. Both techniques are well known and widely used in other similar networks. The convolutional segment is complemented with fully connected layers, which in the paper end with a SoftMax function that gives the prediction (our binary variant, shown in the code further down, ends with a single sigmoid unit instead).
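
If pooling and dropout are new to you, this NumPy sketch shows what each one does to a small array. It is a simplified illustration, not the Keras implementation: it pools over non-overlapping windows and crops any edge that does not fit, whereas Keras with `padding='same'` pads instead, and the dropout shown is the common “inverted” variant.

```python
import numpy as np


def max_pool_2d(x, pool_h, pool_w):
    """Non-overlapping max pooling over a 2-D array (edges that don't fit are cropped)."""
    h = (x.shape[0] // pool_h) * pool_h
    w = (x.shape[1] // pool_w) * pool_w
    # Split the array into (pool_h x pool_w) blocks and keep each block's maximum.
    blocks = x[:h, :w].reshape(h // pool_h, pool_h, w // pool_w, pool_w)
    return blocks.max(axis=(1, 3))


def dropout(x, rate, rng):
    """Inverted dropout: zero roughly a fraction `rate` of values, rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)
```

Pooling shrinks the feature map while keeping the strongest activations, and dropout randomly silences activations during training so the network cannot lean too heavily on any single one.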

So far, nothing out of the ordinary here. As you will hopefully learn, great networks do not have to be overly complex; they do, however, have to be fine-tuned and fed the right training data.

One interesting component of our network is the use of a Mel filter (MFCC, short for mel-frequency cepstral coefficients). MFCC in our case is a pre-processing step done on the audio data, so that it is easier for the network to detect patterns during training. Since convolutional networks work best with images and large multidimensional arrays, we will convert our audio stream into something more suitable for our convolutional layer: we will convert the audio wave to an image. This is where spectrograms and the MFCC (Mel filter) come in handy.

Something to note here: MFCC is not part of our model. The network starts at Conv1, and the MFCC step happens when our WAVE files are loaded, as you will see in the code below. The Mel filter is commonly used and amounts to turning a sound wave (which is basically an array of floating-point values) into a spectrogram-like image.

Spectrograms, as you can see on the image below, are a different representation of sound – they are better suited for voice analysis.

Above you can see the input waves in blue and then the colourful rectangles representing the same sound after we apply the Mel filter. I will soon show you how to code this transformation. We could use a regular spectrogram or any other spectrogram-generating function, but MFCCs are a well-established choice for speech, because the mel scale approximates how humans perceive pitch.
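
To make the “wave to image” idea concrete, here is a plain log-magnitude spectrogram in NumPy. Note that this is the simpler cousin of MFCC, not MFCC itself: the mel filter bank and the cepstral step are left out, and in this article they are handled by a library. The frame and hop lengths (400 and 160 samples, i.e. 25 ms and 10 ms at 16 kHz) are assumptions chosen for illustration.

```python
import numpy as np


def log_spectrogram(signal, frame_length=400, hop_length=160):
    """Log-magnitude spectrogram: rows are frequency bins, columns are time frames."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    window = np.hanning(frame_length)
    # Cut the wave into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length:i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; keep only the magnitude of each frequency bin.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10).T
```

The result is a 2-D array that can be treated like a greyscale image, which is exactly the kind of input a convolutional layer is good at.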

As for the other parameters shown on the diagram, they all have their counterparts in the Keras framework, which we will be using to create this model in code. In our experiment, we will create a model with just under 1 million parameters (parameters are, roughly speaking, the moving parts of our network that influence one another). The important thing to know about them is that 1 million is a relatively small number, which makes our model suitable for mobile-device levels of performance.

Coding the solution

I am not sure what it is with data scientists and object-oriented programming – it appears the community avoids it at all costs, or it’s just my bad luck. Either way, to understand something fully and to be able to extend code easily, we must build our codebases with OOP and SOLID principles in mind. The whole codebase for the solution is available on my GitHub here.

The structure of the project is as follows:

  • “dataset” directory – contains training and evaluation wave files
  • “output_model” directory – contains the output of our code: in each subdirectory, we will find a single-run result in the form of an h5 file containing our trained model and a configuration.json file containing the serialized parameters of our network plus some optional notes. This is purely so that we can later come back to the results and compare the changes we made, like removing a dropout layer or changing the normalization method. File names are created from the current date and the model accuracy (ACC). As you can see, my model reached 94.44% in its best run. I dare you to do better and share your changes in the comments!
  • py – is just a high-level orchestrator for our experiment; it only executes other Services and Factory methods (if a class creates something it’s a Factory – if it acts on some object it is a Service)
  • py – this is responsible for loading our Training and Evaluation data sets into the nice numpy arrays which Keras will expect as the training input parameters.
  • py – this class encapsulates every single Wave file that we will use for our training and evaluation. In this class, we set the binary class label, 0 or 1, stating whether it is a negative or a positive sample. More importantly, in this class we execute our Mel filter code with the following method:

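# Requires `import librosa` and `import numpy as np` at module level.
def extract_audio_features(self):
    # Load the WAVE file and resample it to the configured sample rate.
    audio, sample_rate = librosa.load(self.file_path, sr=self.sample_rate, res_type='kaiser_best')
    # 40 MFCC coefficients per frame: shape (40, n_frames).
    self.mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    # Average each coefficient over time: one vector of 40 values per file.
    self.mfccScaled = np.mean(self.mfcc.T, axis=0)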

As you can see, there isn’t much to it – the above is pretty much taken from the documentation of a very popular audio processing library called librosa. More details on this process can be found here.
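
One detail the snippet does not cover: the MFCC matrix has shape (40, n_frames), where n_frames depends on the clip length, while the convolutional model later in this article declares a fixed input shape of [40, 32, 1]. How the two are reconciled is not shown here, so the helper below is an assumption on my part. It shows one common approach, padding short clips with zeros and cropping long ones; the function name `to_fixed_frames` is hypothetical.

```python
import numpy as np


def to_fixed_frames(mfcc, n_frames=32):
    """Pad with zeros or crop along time so the result is (n_mfcc, n_frames, 1)."""
    n_mfcc, current = mfcc.shape
    if current >= n_frames:
        fixed = mfcc[:, :n_frames]
    else:
        fixed = np.pad(mfcc, ((0, 0), (0, n_frames - current)), mode="constant")
    # Trailing channel axis, as expected by a 2-D convolutional layer.
    return fixed[..., np.newaxis]
```

Stacking the outputs for all files with `np.stack` then gives a single array of shape (n_samples, 40, 32, 1) that can be passed to training.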

  • – here we build our model using very useful Keras factory methods
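from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from keras.initializers import TruncatedNormal

# p_pooling_in_time and q_pooling_in_frequency must be defined before this point (not shown in this listing).
model = Sequential()
# First convolution over the 40 x 32 single-channel MFCC "image".
model.add(Convolution2D(input_shape=[40,32,1], filters=64, kernel_size=[20, 8], strides=[1, 1],
    padding='same', kernel_initializer=TruncatedNormal(stddev=0.05), activation='relu'))

model.add(MaxPooling2D(pool_size=[p_pooling_in_time, q_pooling_in_frequency],
    strides=[p_pooling_in_time, q_pooling_in_frequency], padding='same'))

model.add(Convolution2D(filters=64, kernel_size=[10, 4], strides=[1, 1], padding='same',
    kernel_initializer=TruncatedNormal(stddev=0.01), activation='relu'))

# Assumption: the listing appears to skip a step here. Dense layers need a
# flattened input to yield one prediction per sample, so Flatten is added.
model.add(Flatten())

model.add(Dense(32, activation='relu', kernel_initializer=TruncatedNormal(stddev=0.01)))
model.add(Dense(128, activation='relu', kernel_initializer=TruncatedNormal(stddev=0.01)))
# Single sigmoid unit: the probability that the wake phrase was spoken.
model.add(Dense(1, activation='sigmoid', kernel_initializer=TruncatedNormal(stddev=0.01)))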

The architecture of this network is based on the cnn-one-fpool3 network described in the Stanford University paper I mentioned above. You can also read about it in this Google Research paper, where it’s compared to a couple of alternative architectures. These alternative architectures may give you ideas for improvements.
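
If you wonder where the parameters of such a model come from, the convolutional layers are easy to count by hand: each filter has one weight per kernel cell per input channel, plus one bias. The short calculation below uses the kernel sizes and filter counts from the model code in this article. The dense layers are deliberately left out, because their size depends on the flattened output of the convolutional part, which in turn depends on pooling values not shown here.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a standard 2-D convolution."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


# First convolution: 20 x 8 kernel, 1 input channel, 64 filters.
conv1 = conv2d_params(20, 8, in_channels=1, filters=64)
# Second convolution: 10 x 4 kernel, 64 input channels, 64 filters.
conv2 = conv2d_params(10, 4, in_channels=64, filters=64)
print(conv1, conv2, conv1 + conv2)
```

The two convolutions account for 10,304 and 163,904 parameters respectively, so most of the remaining budget sits in the first dense layer. Calling `model.summary()` in Keras prints the exact per-layer numbers for a built model.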

  • – this service contains methods related to the training and evaluation of our model. This is also where you will find code responsible for saving our network to the disk including serialization of the configuration file.
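
The bookkeeping part of that service can be sketched with the standard library alone. The exact folder naming pattern and the JSON fields below are assumptions made for illustration; they follow the description of the “output_model” directory but are not copied from the project code.

```python
import json
from datetime import datetime
from pathlib import Path


def run_directory_name(accuracy, when):
    """Folder name built from the run date and accuracy, e.g. 2024-01-31_acc_94.44."""
    return f"{when:%Y-%m-%d}_acc_{accuracy * 100:.2f}"


def save_run_config(output_root, accuracy, params, notes="", when=None):
    """Create the run folder and write configuration.json into it."""
    when = when or datetime.now()
    run_dir = Path(output_root) / run_directory_name(accuracy, when)
    run_dir.mkdir(parents=True, exist_ok=True)
    config = {"accuracy": accuracy, "parameters": params, "notes": notes}
    (run_dir / "configuration.json").write_text(json.dumps(config, indent=2))
    return run_dir
```

The trained model itself would be saved into the same folder (in Keras, `model.save(...)` with an `.h5` file name), so each run keeps its weights and its settings side by side.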

Wrapping up

So, this is pretty much it. As you can see, it is not hugely complicated. We have:

  • Prepared our training data using available text-to-speech technology
  • Researched a base for our neural network architecture
  • Pre-processed our wave files with MFCC
  • Trained the network
  • Seen that it was good and enjoyed the fruits of our labour!


I hope this article gave you some inspiration. Feel free to experiment with the number of layers and their shape – machine learning is all about experimentation and acquiring “intuition” based on previous experiments. One interesting alteration to the above experiment could be using GANs to create a larger set of training data for us. You can read more about GANs in my article here.

Also, if you need some inspiration of what your own Jarvis could do, have a look here at the open-source project Jasper, which can serve as a base or an inspiration for your own Iron butler!