Man has constantly tried
to create machines that resemble him. Perhaps
it is an inner urge to replicate himself that
prompts him to make machines that are very similar
to him. On the other hand, man has also been striving
to make products that help reduce his
workload when dealing with machines that
cannot reproduce his skills and finer capacities.
Since computers are among the most widely used machines
today, it is only natural that man has been trying
to implement features that help him interact
with the computer more effectively [Ross,
2004]. Speech recognition software has been a
major breakthrough that has given us the ability
to “talk” with computers. Speech recognition
is a very powerful tool indeed, considering the
fact that speech is one of the main characteristics
that separates humans from animals. Every language
follows a gradual pattern of growth. It starts
off with humble beginnings but soon establishes
itself as a complex system of rules and vocabulary
that ultimately enriches the language. Old rules
are discarded for new ones that are more practical
and better suited for the current conditions in
which the language is used. Speech recognition
has to use various forms of the language and incorporate
many standard as well as non-standard features
to be effective. However, language is just one
of the aspects that can make speech recognition
software effective or useless. This paper
analyzes the problems with speech recognition
software.
Analysis
The basic assumptions of speech recognition
It would not be wrong to say that speech recognition
is one aspect of Artificial Intelligence. AI is
broadly divided into Statistical AI and Classical
AI. Statistical AI, arising from machine learning,
tends to be more concerned with “inductive”
thought: given a set of patterns, induce the trend.
Classical AI, on the other hand, is more concerned
with “deductive” thought: given a
set of constraints, deduce a conclusion. A system
cannot be truly intelligent without displaying
properties of both inductive and deductive thought,
and would have to be a synthesis of both types
of AI [John McCarthy, 1996; Computing Canada,
2001; Edge, 2000].
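The contrast between the two styles of reasoning can be made concrete with a small sketch. It is only an illustration under stated assumptions: the pitch samples, the rule that a high pitch with a rising tone signals a question, and the function names `induce_threshold` and `deduce` are all hypothetical and do not come from the cited sources.

```python
from statistics import mean

def induce_threshold(examples):
    """Inductive step: infer a decision boundary from labelled examples.

    `examples` is a list of (value, label) pairs with labels "low" or "high".
    The boundary is the midpoint between the two class means.
    """
    lows = [v for v, label in examples if label == "low"]
    highs = [v for v, label in examples if label == "high"]
    return (mean(lows) + mean(highs)) / 2

def deduce(facts, rules):
    """Deductive step: apply if-then rules until no new fact follows."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# Induction: learn what counts as a "high" pitch from hypothetical samples (Hz).
samples = [(110, "low"), (130, "low"), (210, "high"), (230, "high")]
threshold = induce_threshold(samples)

# Deduction: combine the induced fact with hand-written, hypothetical rules.
rules = [
    (("pitch_high", "rising_tone"), "question"),
    (("question",), "add_question_mark"),
]
observed_pitch = 220
facts = {"rising_tone"}
if observed_pitch > threshold:
    facts.add("pitch_high")
conclusions = deduce(facts, rules)
```

Neither half is sufficient alone in this sketch: the rules cannot fire until the induced threshold classifies the pitch, and the threshold by itself says nothing about punctuation.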
There are numerous branches of AI, and the list
keeps expanding; many of them are directly or indirectly
related to speech recognition. Some of them are
problems while some are techniques. Most of these
branches are mere concepts that form the basis
for research. Logical AI, search, pattern recognition,
representation, inference, common sense knowledge
and reasoning, learning from experience, epistemology,
ontology, heuristics and genetic programming are
some of the main branches. Common sense knowledge
and reasoning is the area in which AI is farthest
from human-level performance, even though a lot of research has
been carried out since the 1950s [John McCarthy,
1996].
Epistemology is a study of the kinds of knowledge
that are required for solving problems in the
world. Ontology is the study of the kinds of things
that exist. In AI, the programs and sentences
deal with various kinds of objects, and we study
what these kinds are and what their basic properties
are. A heuristic is a way of trying to discover
something, or an idea embedded in a program. The
term is used variously in AI. AI is already being
used in varying degrees in fields such as computer
games, speech recognition, expert systems for
specific tasks, heuristic classification and so
on. This is where the argument of strong AI and
weak AI comes in. Strong AI makes the bold claim
that computers can be made to think on a level
that is equal to humans and possibly even be conscious
of themselves. Weak AI simply states that though
some “thinking-like” features can
be added to computers to make them more useful
tools, computers can never truly think and imitate
humans.
The advocates of strong AI believe that computers
are capable of true intelligence. They argue that
what humans perceive as consciousness is strictly
algorithmic, i.e. a program running in a complex,
but predictable, system of electro-chemical components
or neurons. They are of the view that the brain
is a bigger and better computer and that with
sufficient technology, it will someday be possible
to create machines that enjoy the same type of
consciousness as humans. Some supporters of strong
AI expect that it will someday be possible to
represent the brain using formal mathematical
constructs [Martin and Oscar, 1987]. These arguments
and counterarguments are very important in speech
recognition because speech recognition is not
just about deciphering language in a mathematical
fashion. It is also about deciphering the tone,
mode and even emotion of the speaker in many different
ways [Author not known, 2001].
Problems with speech recognition
Language problems
When speech recognition software first came into
existence, people believed that they could converse
with computers as easily as they had seen it
done in popular films. The prospect of talking
with the computer was an exciting one because
people who were already hooked on chat and the
various tools on the Internet saw it as an
easy way to spend their idle time. Moreover,
the thrill of talking to someone who could be
“told” anything in any manner was
also a very interesting proposition for many.
However, when the software entered
the market, people soon realized that getting
the computer to understand facts through speech
was a very difficult task.
A very interesting observation explains why
it is difficult to get the computer to understand
what we speak. It is relatively easy for us to
get the computer to speak the words that we feed
into it as text. There is a lot of software available
that can easily convert text to speech. On the
contrary, perhaps no software can satisfactorily
convert speech to text or respond effectively
to voice commands. The simple reason is that there
is huge variation in the way that different people
speak the same language. Elements of speech like
frequency, pitch, amplitude and stress are
just some of the factors that we use, often without
being aware of it, in our speech. For example, even
English, the universal language, is spoken in different
dialects and accents across the world. Hence the language
that is spoken in one corner of the world will
not be similar to the language that is spoken
in another part of the world. This is in fact
one of the major hurdles that stand in the way
of speech recognition. We must understand that
computers do not have the capability to think
for themselves. Hence, they will not be able to
reconcile the different pronunciations of the word "schedule"
as it is spoken in the United States, the United
Kingdom or Asian countries. Hence inconsistency
in language is one of the major constraints that
speech recognition software suffers from.
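The "schedule" problem can be sketched as a lookup in a pronunciation lexicon. This is a minimal illustration only: the phoneme spellings are rough approximations invented for the example rather than entries from any real lexicon, and `lookup` is a hypothetical helper.

```python
# Phoneme spellings below are rough, illustrative approximations.
US_ONLY_LEXICON = {
    ("S", "K", "EH", "JH", "UW", "L"): "schedule",   # common US form
}

MULTI_ACCENT_LEXICON = {
    ("S", "K", "EH", "JH", "UW", "L"): "schedule",   # common US form
    ("SH", "EH", "JH", "UW", "L"): "schedule",       # common UK form
}

def lookup(phonemes, lexicon):
    """Return the word for an exact phoneme sequence, or None if unknown."""
    return lexicon.get(tuple(phonemes))

uk_utterance = ["SH", "EH", "JH", "UW", "L"]

print(lookup(uk_utterance, US_ONLY_LEXICON))       # not recognized
print(lookup(uk_utterance, MULTI_ACCENT_LEXICON))  # recognized
```

Each accent the software is expected to handle adds entries of this kind for a large share of the vocabulary, which is one way to see why covering regional variation is costly.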
Phonetic and semantic problems are other issues
that are associated with speech recognition software.
For example, not all people are well versed in
the accurate pronunciation of English words.
Further, the word may be subject to colloquial
usage, which may be very different from the original
pronunciation of the word. In addition, many regional
differences may creep in that can make the pronunciation
of the word very different. However, when human
beings from different backgrounds converse with
each other, they may be able to make out the differences
and understand the subtle changes in words and
their meaning fast enough. These skills of human
beings cannot be incorporated into speech recognition
software because trying to do so would increase
the development time of the product by many times.
At most, what can be done is to localize the software,
that is, to have one edition for a specific area
and another for another area. Even in that case,
the cost of development would be too high. Hence,
most speech recognition software uses only standard
words and usages that are available in the standard
language. To go beyond the standard language is
beset with problems that many developers try to
avoid. The end result is that the word that appears
on the screen when a US national utters a word
will be very different from what appears on the
screen if a Japanese or other Asian national utters
it.
A very amusing example of the practical problems
which one has to encounter while using speech
recognition software in everyday life is described
by Merritt (2004) in his article. He shows that
even basic sentences appear awkward and quite
unusable on the screen:
For many years, however, the Holy Grail of speech
recognition was the accurate translation of speech
to text. A disarmingly simple concept, speech-to-text
translation has proven to be a technologically
vexing endeavor. And, judging from the September
issue of Macworld, it still is. In Macworld's
Feedback section, a reader pokes fun at his Windows-based
voice-recognition software thusly: "Eye am
using a new ViaVoice my IBM speech program. It
works quite well as you can see by the water.
I think you for an Fiat's Stanton article that
made by the Senate in." To which Macworld
responds: "I'm happy two here that you are
using ViaVoice sucks S fully. I myself use beach-recognition
software for Windows, which is at least a generation
a head of Max. Queerly, speech recognition is
the technology of today and 2 Maura!" Merritt
(2004)
As the above example shows, there are many issues
that are difficult for speech recognition software
to solve. For example, how does it differentiate
between spellings used in American and British
English? How can software distinguish between
"right" and "write"? How is
it going to introduce grammatical elements like
punctuation into the text? All these are problems
that need to be sorted out. If speakers need
to specifically say the punctuation marks into their
computer’s mic, that would create a very
difficult problem for many, since we are not used
to speaking punctuation in real life. This in
turn means that the usability and ease of use of the
software are greatly reduced, which can
be considered a negative aspect of the software.
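One common way to approach the "right" versus "write" question is to let the preceding word decide. The sketch below assumes a toy table of word-pair counts; the numbers in `BIGRAM_COUNTS` are made up for illustration, and `pick_spelling` and `transcribe` are hypothetical names, not part of any product discussed here.

```python
# Words that sound alike and therefore cannot be separated by sound alone.
HOMOPHONES = {
    "right": ["right", "write"],
    "write": ["right", "write"],
}

# Hypothetical counts of how often a word follows the previous word.
BIGRAM_COUNTS = {
    ("to", "write"): 50, ("to", "right"): 5,
    ("turn", "right"): 40, ("turn", "write"): 0,
    ("the", "right"): 60, ("the", "write"): 1,
}

def pick_spelling(previous_word, heard_word):
    """Choose the spelling that most often follows the previous word."""
    candidates = HOMOPHONES.get(heard_word, [heard_word])
    return max(candidates,
               key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))

def transcribe(words):
    """Resolve homophones left to right using the word chosen just before."""
    result = []
    for word in words:
        previous = result[-1] if result else ""
        result.append(pick_spelling(previous, word))
    return result

print(transcribe(["i", "want", "to", "right", "a", "memo"]))
print(transcribe(["please", "turn", "right"]))
```

The sketch also shows the limit of the approach: when the table has no evidence for either spelling, the choice falls back to whichever candidate is listed first, which is no better than a guess.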
It must be said that speech recognition software
is not at all user friendly, even though you only
need to take your mic and speak words into it.
This is because speech recognition software
cannot distinguish between background sounds and
the voice of the user. Hence speech recognition
software becomes very impractical in an office
that has more than two people sitting close together
in a room. The software may catch the voices of
others speaking in the room, or may catch the
sound of computer keystrokes, the sound of other
electrical goods, etc., and type words on the screen
that have no relation to what the user might be
trying to say. Almost all speech recognition
software requires operators to sit in quiet rooms
with minimal disturbances. This means that it
cannot be used in most offices, where there is
always a minimum level of ambient sound. This makes the
software exclusive because it can be used only
by CEOs or other officials who may be given a
separate room and facilities in the office.
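Why background sound is so troublesome can be seen with a deliberately naive detector that decides "someone is speaking" from loudness alone. This is an assumed simplification for illustration, not a description of how any particular product works; the sample values and the `threshold` of 0.1 are hypothetical.

```python
import math

def rms(frame):
    """Root-mean-square loudness of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def naive_voice_gate(frames, threshold=0.1):
    """Mark a frame as speech whenever its loudness exceeds the threshold."""
    return [rms(frame) > threshold for frame in frames]

# Synthetic frames with hypothetical amplitudes in the range -1..1.
silence = [0.01, -0.01, 0.02, -0.02]
user = [0.4, -0.5, 0.45, -0.35]
keystroke = [0.6, -0.1, 0.05, -0.02]  # short loud click, not speech

print(naive_voice_gate([silence, user, keystroke]))
```

The quiet frame is rejected, but the keystroke passes the gate just as the user's voice does, so whatever follows the gate would try to turn the click into words.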
Another disadvantage is that the software has
to be spoken to very slowly. Often people who
talk fast will see gibberish appearing on their
screen because the software cannot decipher the
words spoken in such haste. And in situations
where a memo has to be written really fast, the
software may as well be kept aside, and the typist
summoned. Often speaking slowly will make people
lose the tempo with which ideas flow into their
minds. Hence, speaking slowly will kill their
flow of ideas and may significantly affect their
ability to construct proper sentences and coherent
ideas. Another aspect that is a disadvantage with
speech recognition software is that it cannot
measure the emotions with which words are spoken
into it. For example, an operator may have to
ask the software to underline or capitalize words
for effect. This is easier said than done
because the emotional state of the person
may not allow him to dictate each and every word.
In contrast, a human secretary may be able
to judge the emotion and act accordingly. Similarly,
the software cannot be used by a person who is
temperamental since it needs great patience to
review each word that is written on the screen
because the chances for errors are very high.
A very important drawback with voice recognition
packages is that the software has to be trained
for each user [Wikipedia, 2004]. This means that
in an office where an employee has to work on
different machines at different times, training
each machine for voice input would be a very costly
affair in terms of man-hours wasted on
repetitive work. In addition, after
virus attacks or other similar conditions when
the hard disk of the machine has to be formatted,
the software has to be retrained on every affected machine.
Most vendors say that their software learns progressively
from the input that is given to it from time to
time. This means that a person who uses a machine
for one year will have better performance from
the software than a person who uses it for one
month. This also means that to obtain optimum
performance from the software, users have to wait
for a definite period, which is not worth the
money that is being paid for these tools. It simply
means that the software cannot be used productively as soon
as it is bought. The time that is required to
make it productive is the same as that needed when employing a person,
which is a far better option considering the versatility
of a human being and the rather restrictive approach
of the machine. Another drawback is that the same
"trained" software cannot be used for
different people since their personal style of
speech may be very different. For example, consider
a scenario where an employee has left an organization.
A new employee cannot use the speech recognition
software as efficiently as the
previous employee did, because the software
has to be trained to suit the voice of the new
employee. This means that the office systems and
all the associated processes of the office become
dependent on one employee and his skills. If the
employee leaves the office, the office will require
considerable time and effort to regain its productivity.
It may be seen that issuing voice commands to
make one’s machine do specific tasks does
not run into as much trouble as dictating to one’s
computer to make it write something on screen.
This is because the number of words that are used
in predefined commands in most software is limited
and, since the concept of programming on a common
platform has gained wide acceptance, a set of
commands that is used in one program is acceptable
to another program as well.
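The advantage of a small, predefined command set can be sketched with Python's standard `difflib` module: a slightly misrecognized phrase can still be snapped to the nearest known command. The command list, the `cutoff` of 0.6 and the helper name `match_command` are assumptions made for this example.

```python
import difflib

# A hypothetical, deliberately small command vocabulary.
COMMANDS = ["open file", "save file", "close window", "print document"]

def match_command(recognized_text, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        recognized_text.lower(), COMMANDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

print(match_command("safe file"))   # a near miss is still usable
print(match_command("xyzzy"))       # unrelated input is rejected
```

Free dictation has no such short list to fall back on, since any word of the language may legitimately come next, which is why the same recognition error is far more damaging there.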
Hardware and technical problems
Speech recognition software is also affected by
hardware and software problems. Such software
is not universally available for all operating
systems and so if a person decides to install
it, he may have to spend a lot of money to change
his existing operating systems. In addition, the
costs that are involved in changing other software
that would be incompatible with the new operating
system may make the endeavor worthless. Similarly,
voice recognition software is sensitive to issues
like the quality of the sound card, electrical
disturbances in the circuit, loose connections
in the circuitry, etc. In addition, the quality
of the microphone and the ambient sound in the
room are also very important for getting good performance
from the software. Hence, all these issues need
to be resolved before a good working system can
be set up and made fully functional.
Voice recognition software can also be a drain
on the resources of the machine. It needs higher
processing power, and a lot of memory to convert
sound to text in real time. Upgrading the systems
would therefore incur additional costs [RDS
Business & Industry, 2002].