CMU Sphinx – Speech Recognition System

A speech interface to the computer is the next big step that computer science needs to take for general users, and speech recognition will play an important role in taking technology to them. Speech synthesis and speech recognition together form a speech interface. A speech synthesizer converts text into speech, so it can read out the textual contents of the screen. A speech recognizer understands spoken words and converts them into text. Our goal is to create speech recognition software that can recognize words.


In today's world, computers have become indispensable, especially for the people of urban India. But for the country as a whole to develop, where most people live in rural areas, the technology has to reach them as well. Computer accessories such as the keyboard and the mouse require a certain level of expertise from the user, which cannot be expected from rural people. They also require the user to be proficient in English. In a country like India, where the literacy rate is quite low, these constraints have to be removed for the technology to reach the grass-roots level. This is where speech recognition becomes useful. The two main components of a speech interface are the speech synthesiser and the speech recogniser. The speech synthesiser transforms written text into speech, whereas the speech recogniser understands spoken words and changes them into text.

Existing Systems
Although speech recognition systems are already available, most of them have been built for the English language. The language model and the language dictionary used in these systems are in English, and hence they are of little use to rural people. However, there has been some significant work towards developing a speech recognition system in Konkani. Many speech recognition and language processing tools have been developed for this purpose. Sphinx is one of the most commonly used open-source speech recognition packages.

Speech Recognition – Definition and Issues

Speech recognition is the process of taking an input acoustic signal (audio in the form of spoken words) and recognising the words contained in it. These recognised words can be the final result, or they may serve as command and control input for a question answering system. In short, a speech recognition system takes audio as input and generates text as output.

Speech recognition involves different steps:
1. Voice Recording
2. Word boundary detection
3. Feature extraction
4. Recognition using language models

Design of the System

1. Sound recording and word detection component: This component takes the input from the audio recorder, preferably a microphone, and identifies the words in the input signal. Word detection is usually done using the energy and the zero crossing rate of the signal. The output of this component is then sent to the feature extraction module.

2. Feature extractor component: This component is responsible for generating the feature vectors for the audio signals passed to it from the word detection component. It generates the Mel Frequency Cepstrum Coefficients (MFCC), which are used later to identify the audio signal uniquely.

3. Recognition system: This is the most important part of the design. It is a continuous, multi-dimensional Hidden Markov Model (HMM) based component that takes as input the feature vectors generated by the feature extractor component and then finds the best match from the Knowledge Model.

4. Knowledge Model: This component, along with the language dictionary, is used for identifying the sound signal.

3 Structure of Speech

Speech is a continuous audio stream in which stable states are mixed with dynamically changing states. Within this stream, classes of more or less similar sounds, called phones, can be defined, and words are considered to be built from phones. Factors such as phone context, style of speech, and the speaker affect the acoustic properties of the waveform corresponding to a phone. Often, three states per phone are selected for HMM recognition: the first part depends on the previous phone, the middle part is stable, and the last part depends on the subsequent phone.
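To make the three-state idea concrete, here is a minimal sketch of Viterbi alignment for one phone modelled as a left-to-right HMM. It is an illustration only, not the Sphinx implementation: the function names, the one-dimensional Gaussian emissions, and the single shared `stay_prob` are all assumptions chosen to keep the example short (real systems use multi-dimensional feature vectors and trained parameters). It also assumes there are at least as many observations as states.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian (a stand-in for a real emission model)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_left_to_right(observations, means, variances, stay_prob=0.6):
    """Align observations to a left-to-right HMM; return (state path, log score).

    Each state may either repeat or advance to the next state. The path is
    forced to start in the first state and end in the last one.
    """
    n_states = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)
    neg_inf = float("-inf")

    # Initialisation: only the first state is allowed at time 0.
    score = [neg_inf] * n_states
    score[0] = log_gaussian(observations[0], means[0], variances[0])
    back = []  # back[t-1][s] = best predecessor of state s at time t

    for x in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            stay = score[s] + log_stay
            move = score[s - 1] + log_move if s > 0 else neg_inf
            if move > stay:
                best, prev = move, s - 1
            else:
                best, prev = stay, s
            new_score.append(best + log_gaussian(x, means[s], variances[s]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Backtrace from the last state.
    state = n_states - 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, score[-1]
```

Called with eight observations and state means of 0, 5, and 10, the function assigns the early frames to the first state, the middle frames to the stable state, and the final frames to the last state.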

Non-linguistic sounds are usually referred to as fillers (breath, um, uh, cough, etc.). Words and these non-linguistic sounds form utterances, which are separate chunks of audio between pauses.

3.1 Sound Recording and Word Detection

This component is responsible for capturing the input from a microphone and forwarding it to the feature extraction component of the system. Before forwarding, it also identifies the segments of the sound that contain words, that is, it performs word detection. This module is also responsible for saving the audio files in the .WAV format, which is required for further training of the model.

3.2 Sound Recorder
Voice recording is the first step in speech recognition. The sound recorder takes the input from the microphone, saves the audio in the .WAV format, and finally forwards it to the next module. It also supports changing parameters such as the sampling rate, the number of channels, and the sample size. The sound recorder has two classes implemented in it: the Sound Reader class and the Sound Recorder class.

1. The Sound Reader class receives the input audio from the microphone; its parameters are the sampling rate, the number of channels, and the sample size. It has three basic functions: Open, Read, and Close. The Open function opens the device in read mode, the Read function checks whether there is valid sound present in the audio signal and then returns the content, and the Close function releases the device.
2. The Sound Recorder class converts the input audio signal from its raw format to the .WAV format.
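The raw-to-WAV conversion step can be sketched with Python's standard `wave` module. This is only an illustration of the idea, not the classes described above: the function name `raw_pcm_to_wav` and the default values (16 kHz, mono, 16-bit samples) are assumptions.

```python
import io
import wave


def raw_pcm_to_wav(raw_bytes, sample_rate=16000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container and return the WAV file bytes.

    sample_width is in bytes per sample (2 means 16-bit audio).
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(raw_bytes)
    return buffer.getvalue()
```

Writing the returned bytes to a file with a .wav extension gives an audio file whose header records the sampling rate, channel count, and sample size mentioned above.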

3.3 Word Detector
In a speech recognition system, detection of words in sound signals is very important. The word detector is able to detect the silence region, and anything other than silence is considered to be a word. Usually, the zero crossing rate is used to detect the region of silence.
For word detection, the audio signal is sampled over a particular time interval, for which the energy and the zero crossing rate are calculated. The zero crossing rate is defined as the number of times the wave goes from a positive value to a negative value and vice versa. The first 100 milliseconds are considered to be a region of silence, during which the average zero crossing rate is calculated to determine the background noise. The upper threshold for the zero crossing rate is fixed at 2 times this value, while the lower threshold is set to 0.75 times the upper threshold.
During word detection, if the zero crossing rate goes above the upper threshold and stays there for three consecutive sample periods, it is assumed that a word is present. Recording then starts and continues until the zero crossing rate falls below the lower threshold and stays there continuously for a minimum of 30 milliseconds.
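The thresholding procedure described here can be sketched as a small state machine. The sketch below is an assumption-laden illustration: the function names, the 10 ms frame length (used as the "sample period"), and the 8 kHz sampling rate are hypothetical choices, and only the zero crossing rate is used, not the energy. It also assumes the signal starts with at least 100 ms of background noise.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def detect_words(samples, sample_rate=8000, frame_ms=10):
    """Return (start_frame, end_frame) pairs, end exclusive, for detected words."""
    frame_len = sample_rate * frame_ms // 1000
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    zcr = [zero_crossings(frame) for frame in frames]

    # The first 100 ms are treated as silence to estimate background noise.
    noise_frames = 100 // frame_ms
    background = sum(zcr[:noise_frames]) / noise_frames
    upper = 2 * background
    lower = 0.75 * upper

    start_run = 3               # consecutive frames above the upper threshold
    end_run = 30 // frame_ms    # 30 ms below the lower threshold

    words = []
    in_word = False
    above = below = 0
    start = 0
    for i, rate in enumerate(zcr):
        if not in_word:
            above = above + 1 if rate > upper else 0
            if above == start_run:
                in_word = True
                start = i - start_run + 1
                below = 0
        else:
            below = below + 1 if rate < lower else 0
            if below == end_run:
                words.append((start, i - end_run + 1))
                in_word = False
                above = 0
    if in_word:
        # The signal ended while a word was still being recorded.
        words.append((start, len(zcr)))
    return words
```

Using two thresholds instead of one gives hysteresis: a brief dip in the zero crossing rate in the middle of a word does not end the recording.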

4 Feature Extraction

The feature extraction module is one of the most important modules in the speech recognition system. To identify a spoken word, the system has to identify and differentiate between different sounds in the same way as human beings do. The important point is that although the same word, spoken by different people, produces different sound waves, human beings are still able to understand that the spoken word is the same. This is primarily because these words, although spoken by different people, tend to have some common features, which make recognition possible. These features therefore have to be extracted by the feature extractor module for efficient speech recognition.

4.1 Mel Frequency Cepstrum Coefficients Computation

Different temporal and spectral analyses are performed on the sound signal to extract useful features, the most important of them being the Mel Frequency Cepstrum Coefficients (MFCC).
MFCCs are designed to approximate the frequency response of the human ear. To generate them, the sound signal is first passed through a high-pass filter (often called pre-emphasis), which makes the higher frequencies more distinct relative to the lower frequencies. MFCCs are the coefficients that collectively make up a mel-frequency cepstrum (MFC). They are derived from a cepstral representation of the audio clip.
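Two building blocks of MFCC computation are easy to show in isolation: the high-pass filtering step and the mel scale itself. The sketch below is not a complete MFCC pipeline (framing, windowing, FFT, filterbank energies, and the cosine transform are omitted). The function names and the coefficient 0.97 are assumptions; the mel formula is the commonly used 2595 * log10(1 + f / 700) variant.

```python
import math


def pre_emphasis(samples, coefficient=0.97):
    """First-order high-pass filter: y[n] = x[n] - coefficient * x[n-1]."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - coefficient * samples[n - 1]
                           for n in range(1, len(samples))]


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz, high_hz, n_filters):
    """Centre frequencies (Hz) of filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]
```

Because the filters are evenly spaced in mel rather than in Hz, they are packed closely at low frequencies and spread out at high frequencies, which mirrors how the ear resolves pitch.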

5 Knowledge Models

Before the speech recognition system is able to recognise input, it has to be trained. During training, the system generates the acoustic and language models from the data.

5.1 Acoustic Model

Features extracted from the input sound by the extraction module have to be compared with some predefined model to identify the spoken word. This model is known as the Acoustic Model.

5.2 Word Model
In the word model, words are modelled as a whole. During recognition, the input sound is matched against each word present in the model, and the best match is considered to be the spoken word.

5.3 Phone Model

In the phone model, only parts of words, called phones, are modelled instead of the word as a whole. Instead of matching the sound against each word, we match the sound against these parts and recognise them. The recognised parts are then put together to form a word. In both the word model and the phone model, silence and filler words have to be modelled. Filler words are the sounds that human beings produce between two words.

5.4 Language Model

The main idea of the language model is to give the speech recognition system a fair idea of the context and of the words that can occur in that context. It describes the different words that are possible in the language and the sequences in which these words may occur.
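One simple way to capture which word sequences are likely is a bigram model, which estimates the probability of each word given the previous one. The sketch below is an illustration under assumptions, not the model format used by Sphinx: the function names, the `<s>` and `</s>` boundary markers, and the add-one smoothing are choices made for brevity.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count word histories and word pairs in a list of training sentences."""
    unigrams = Counter()   # how often each word appears as a history
    bigrams = Counter()    # how often each (previous, word) pair appears
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    vocabulary = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, vocabulary


def bigram_probability(model, previous, word):
    """P(word | previous) with add-one smoothing for unseen pairs."""
    unigrams, bigrams, vocabulary = model
    return (bigrams[(previous, word)] + 1) / (unigrams[previous] + len(vocabulary))


def sentence_log_probability(model, sentence):
    """Sum of log bigram probabilities over the whole sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(bigram_probability(model, prev, word))
               for prev, word in zip(words, words[1:]))
```

Trained on a few command phrases such as "open file" and "save file", the model scores "open file" higher than "file open", which is how a recognizer can prefer a plausible word order over an implausible one with similar acoustics.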

5.5 Phone set

A phoneme is the basic, or smallest, unit of sound in any language. The phone set is the collection of phonemes used to develop the speech recognition system for the Konkani language.

5.6 Dictionary
A dictionary (also known as the pronunciation lexicon) specifies the pronunciation of each word as a linear sequence of phonemes, in such a way that each line in the dictionary contains exactly one pronunciation specification.
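A dictionary in this one-pronunciation-per-line layout is straightforward to load. The sketch below assumes the convention used by CMU-style dictionaries, where the first token on a line is the word, the remaining tokens are phones, and alternative pronunciations carry a numeric suffix such as `HELLO(2)`. The function name and the sample entries are illustrative, not taken from the Konkani dictionary.

```python
import re
from collections import defaultdict


def parse_pronunciation_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry, *phones = line.split()
        # Strip an alternate-pronunciation marker such as "(2)".
        word = re.sub(r"\(\d+\)$", "", entry)
        lexicon[word].append(phones)
    return dict(lexicon)


SAMPLE_DICTIONARY = """
HELLO HH AH L OW
HELLO(2) HH EH L OW
WORLD W ER L D
"""

lexicon = parse_pronunciation_dictionary(SAMPLE_DICTIONARY)
```

After parsing, `lexicon["HELLO"]` holds two phone sequences and `lexicon["WORLD"]` holds one, so a word can keep several pronunciations while each line still specifies exactly one.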

6 Fundamentals to Speech Recognition
Speech recognition is basically the science of talking to a computer and having the speech correctly recognized. To elaborate on this, we have to understand the following terms.

6.1 Utterances

When the user says something, that is an utterance. In other words, speaking a word or a combination of words that means something to the computer is called an utterance. Utterances are then sent to the speech engine to be processed.

6.2 Pronunciation  

To process a word, a speech recognition engine uses its pronunciation, which represents what the speech engine thinks the word should sound like. A word can have multiple pronunciations associated with it.

6.3 Grammar

A grammar uses a particular set of rules to define the words and phrases that are going to be recognized by the speech engine. More concisely, a grammar defines the domain within which the speech engine works. A grammar can be as simple as a list of words, or flexible enough to support various degrees of variation.

6.4 Accuracy

The performance of a speech recognition system is measurable: the ability of the recognizer can be measured by calculating its accuracy. Accuracy describes how well the recognizer identifies an utterance.
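A common way to put a number on recognition accuracy is the word error rate (WER): the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The text above does not name a specific metric, so treating accuracy as WER is an assumption here, as is the function name. The sketch assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For the reference "open the file" and the recognizer output "open file", one word was dropped out of three, so the word error rate is 1/3. A perfect transcript gives 0.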

6.5 Vocabularies

A vocabulary is the list of words that can be recognized by the speech recognition engine. Generally, smaller vocabularies are easier for a speech recognition engine to handle, while a large list of words is a more difficult task for the engine.

6.6 Training

Training can be used by users who have difficulty speaking or pronouncing certain words; a speech recognition system that supports training should be able to adapt to them.


  • Ability to write text through both keyboard and voice input.
  • Voice recognition of different notepad commands such as open, save, and clear.
  • Significant help for people with disabilities.
  • Control of the system based on voice input.
