Converting speech to text is an important application feature, and one that is used in an increasingly wide variety of contexts. Voice input is used to control smartphones, to automate input handling in help desk applications, and to assist people with disabilities, to mention a few examples.
Speech consists of a complex audio stream. Sounds can be split into phones, which are short segments of similar sound; pairs of adjacent phones are called diphones. Utterances consist of words and the various types of pauses between them.
The essence of the conversion process is to split the sound stream at the silences between utterances and then match each utterance to the words it most closely sounds like. This matching can be difficult for many reasons: words are pronounced differently depending on their context, the speaker's regional dialect, the quality of the recording, and other factors.
The matching process is quite involved and often uses several models. An acoustic model matches acoustic features to sounds, a phonetic model maps phones to words, and a language model restricts the word search to a given language. None of these models is entirely accurate, and each contributes to the inaccuracies found in the recognition process.
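To give a feel for what the language model contributes, the toy sketch below (our own illustration, not CMUSphinx code; the class name, the bigram table, and the probabilities are all invented) scores candidate word sequences with a tiny bigram table, so that a plausible sequence scores higher than an implausible one:

```java
import java.util.Map;

public class BigramDemo {
    // A toy bigram "language model": the probabilities are invented
    // purely for illustration.
    static final Map<String, Double> BIGRAMS = Map.of(
            "mary had", 0.9,
            "marry had", 0.1,
            "had a", 0.8);
    static final double UNSEEN = 0.001; // back-off value for unseen pairs

    // Score a sentence as the product of its bigram probabilities.
    public static double score(String sentence) {
        String[] words = sentence.split(" ");
        double p = 1.0;
        for (int i = 0; i + 1 < words.length; i++) {
            p *= BIGRAMS.getOrDefault(words[i] + " " + words[i + 1], UNSEEN);
        }
        return p;
    }

    public static void main(String[] args) {
        // The model prefers the more plausible word sequence.
        System.out.println(score("mary had a"));
        System.out.println(score("marry had a"));
    }
}
```

A real language model works on the same principle but is trained on large text corpora and combined with the acoustic scores during the search.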
We will be using CMUSphinx 4 to illustrate this process.
Audio processed by CMUSphinx must be in Pulse Code Modulation (PCM) format. PCM is a technique that samples analog data, such as an analog wave representing speech, and produces a digital version of the signal. FFmpeg (https://ffmpeg.org/) is a free tool that can convert between audio formats if needed.
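PCM itself is easy to illustrate in code. The following sketch (our own illustration; the PcmDemo class, the 440 Hz tone, and the 16 kHz sampling rate are choices made here, though CMUSphinx's default English models expect 16 kHz, 16-bit, mono input) samples an analog sine wave at a fixed rate and quantizes each measurement to a 16-bit value:

```java
public class PcmDemo {
    // Sample an analog sine wave and quantize it into 16-bit PCM values.
    public static short[] samplePcm(double frequencyHz, int sampleRate,
                                    int numSamples) {
        short[] samples = new short[numSamples];
        for (int i = 0; i < numSamples; i++) {
            double t = (double) i / sampleRate;                 // time in seconds
            double analog = Math.sin(2 * Math.PI * frequencyHz * t); // in [-1, 1]
            // Quantize the analog value to a signed 16-bit sample
            samples[i] = (short) Math.round(analog * Short.MAX_VALUE);
        }
        return samples;
    }

    public static void main(String[] args) {
        // One second of a 440 Hz tone sampled at 16 kHz
        short[] samples = samplePcm(440.0, 16_000, 16_000);
        System.out.println("Number of samples: " + samples.length);
    }
}
```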
You will need to create sample audio files using the PCM format. These files should be fairly short and can contain numbers or words. It is recommended that you run the examples with different files to see how well the speech recognition works.
We set up the basic framework for the conversion with a try-catch block to handle exceptions. Within it, we create an instance of the Configuration class, which configures the recognizer to recognize standard US English. The configuration's models and dictionary need to be changed to handle other languages:
try {
    Configuration configuration = new Configuration();
    String prefix = "resource:/edu/cmu/sphinx/models/en-us/";
    configuration.setAcousticModelPath(prefix + "en-us");
    configuration.setDictionaryPath(prefix + "cmudict-en-us.dict");
    configuration.setLanguageModelPath(prefix + "en-us.lm.bin");
    ...
} catch (IOException ex) {
    // Handle exceptions
}
The StreamSpeechRecognizer class is then created using the configuration instance. This class processes speech from an input stream. In the following code, we create an instance of the StreamSpeechRecognizer class and an InputStream from the speech file:
StreamSpeechRecognizer recognizer =
        new StreamSpeechRecognizer(configuration);
InputStream stream = new FileInputStream(new File("filename"));
To start speech processing, the startRecognition method is invoked. The getResult method returns a SpeechResult instance that holds the result of the processing, and we use its getHypothesis method to get the best result. We stop the processing using the stopRecognition method:
recognizer.startRecognition(stream);
SpeechResult result;
while ((result = recognizer.getResult()) != null) {
    out.println("Hypothesis: " + result.getHypothesis());
}
recognizer.stopRecognition();
When this is executed, we get the following, assuming the speech file contained this sentence:
Hypothesis: mary had a little lamb
When speech is interpreted, there may be more than one possible word sequence. We can obtain the best ones using the getNbest method, whose argument specifies how many possibilities should be returned. The following demonstrates this method:
Collection<String> results = result.getNbest(3);
for (String sentence : results) {
    out.println(sentence);
}
One possible output follows:
<s> mary had a little lamb </s>
<s> marry had a little lamb </s>
<s> mary had a a little lamb </s>
This gives us the basic results. However, we will probably want to do something with the actual words. The technique for getting the words is explained next.
The individual words of the results can be extracted using the getWords method, as shown next. The method returns a list of WordResult instances, each of which represents one word:
List<WordResult> words = result.getWords();
for (WordResult wordResult : words) {
    out.print(wordResult.getWord() + " ");
}
The output for this code sequence follows. The <sil> token reflects a silence found at the beginning of the speech:
<sil> mary had a little lamb
We can extract more information about the words using various methods of the WordResult class. In the sequence that follows, we will return the confidence and time frame associated with each word.
The getConfidence method returns the confidence expressed as a log value. To convert it, we use the SpeechResult class's getResult method to get an instance of the Result class, whose getLogMath method returns a LogMath instance. When the confidence value is passed to the logToLinear method, it returns a real number between 0 and 1.0 inclusive, where a larger value reflects greater confidence.
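The conversion itself is just exponentiation: LogMath's default log base is very close to 1 (1.0001, as best we understand; treat that constant as an assumption here), which is why raw log-domain scores are large negative numbers. The following sketch (the LogDemo class is ours; real code should call logToLinear on the recognizer's own LogMath instance) shows the idea:

```java
public class LogDemo {
    // CMUSphinx's LogMath defaults to a log base of 1.0001 (an assumption
    // for illustration; obtain the real LogMath from the Result in practice).
    static final double LOG_BASE = 1.0001;

    // Convert a log-domain score back to a linear value.
    public static double logToLinear(double logValue) {
        return Math.pow(LOG_BASE, logValue);
    }

    public static void main(String[] args) {
        // A log score of 0 corresponds to a linear value of 1.0;
        // large negative scores approach 0.
        System.out.println(logToLinear(0.0));
        System.out.println(logToLinear(-20000.0));
    }
}
```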
The getTimeFrame method returns a TimeFrame instance whose toString method returns two integer values separated by a colon, reflecting the beginning and end times of the word:
for (WordResult wordResult : words) {
    out.printf("%s Confidence: %.3f Time Frame: %s ",
            wordResult.getWord(),
            result.getResult().getLogMath()
                    .logToLinear((float) wordResult.getConfidence()),
            wordResult.getTimeFrame());
}
One possible output follows:
<sil> Confidence: 0.998 Time Frame: 0:430
mary Confidence: 0.998 Time Frame: 440:900
had Confidence: 0.998 Time Frame: 910:1200
a Confidence: 0.998 Time Frame: 1210:1340
little Confidence: 0.998 Time Frame: 1350:1680
lamb Confidence: 0.997 Time Frame: 1690:2170
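The time frame values appear to be in milliseconds. If you need them as numbers rather than a string, the start:end form is easy to parse. A small sketch (the TimeFrameDemo class and durationMs method are our own names, not part of the Sphinx API) that computes a word's duration from it:

```java
public class TimeFrameDemo {
    // Parse a "start:end" time frame string (assumed milliseconds)
    // and return the word's duration.
    public static long durationMs(String timeFrame) {
        String[] parts = timeFrame.split(":");
        long start = Long.parseLong(parts[0]);
        long end = Long.parseLong(parts[1]);
        return end - start;
    }

    public static void main(String[] args) {
        // "440:900" was the time frame reported for "mary" above
        System.out.println(durationMs("440:900")); // prints 460
    }
}
```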
Now that we have examined how sound can be processed, we will turn our attention to image processing.