Voice System Technologies & Architecture
By Roger Byford
Overview
This white paper reviews some of the major
technology and architecture decisions
designers must make in creating voice-powered
solutions for warehousing and
industrial applications. We assume in all
cases that the voice system includes
wearable voice computers, communicating
with their operators via microphone-equipped
headsets, and with remote
information systems via a radio frequency
local area network (RF LAN).
We start by reviewing information flow from
the voice system to other systems. All
voice-powered systems must communicate
with customer-owned software packages to
transfer information. This section
concentrates on warehouse applications, for
which the relevant software package is the
warehouse management system (WMS) or
its equivalent. Next we discuss the following
technologies, each critical to the success of
any application:
• Speech recognition, which enables
a voice computer to convert its
operator’s speech into text.
• Speech synthesis, which enables a
voice computer to talk to its
operator.
• RF LAN technology, which allows a
wearable voice computer to
communicate in real time with
remote information systems.
Finally we review thick client and thin client
architectures. In a thick client architecture
most processing takes place on the wearable
computer. In a thin client architecture much
of the processing takes place on a central server.
Information Flow
In this section we concentrate on warehouse
operations in general, and order selection in
particular. We use the abbreviation WMS
(warehouse management system) to refer to
whatever software system (or systems)
provides the information necessary to
perform order selection. This may be a true
WMS or an order entry system working in
conjunction with a simple stock locator
program, or some other combination of
programs. We use order selection as our
application example because it is the most
commonly implemented today. The ideas
that follow, however, can readily be
extended to other warehouse applications.
The information a voice directed order
selection system requires includes two
components:
• What products are to be selected –
the information contained on a
paper pick list, and
• How the selection process should
be performed – priorities, splitting
and combining of orders to make up
effective pick trips, assignment of
some orders to particular operators,
etc. This information determines
what pick trip will be assigned to an
operator who notifies the system
that he or she is ready to work.
All WMSs can provide the first category of
information, but not all can provide the
second. In many cases, where a WMS
prints a stack of pick lists at the beginning of
a shift, the second category is managed by
distribution center staff, who make decisions
about priorities and work assignments, and
perform operations such as splitting and
combining orders. Although in theory it
would be possible to continue to perform this
work manually after installing a voice
directed selection system, in practice that
would negate many of the benefits. The
solution is to add a “middleware” component
to the system, i.e., a software package that
joins two systems that cannot communicate
directly. This software
package must accept the pick list
information from the WMS, and must be
able to make the decisions (perhaps with
human assistance and/or override
capability) that will allow it to provide the
second category of information to the
operators’ voice terminals.
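
As a rough illustration of the second category of information, the
sketch below shows one way a middleware package might combine pick
lists into a pick trip when an operator reports ready. It is a minimal
sketch, not a description of any vendor's product: the names Order and
TripPlanner, the priority convention, and the case-count capacity
limit are all assumptions made for the example, and order splitting is
left out.

```python
from dataclasses import dataclass, field
import heapq


@dataclass(order=True)
class Order:
    priority: int                       # lower number = more urgent (assumed convention)
    order_id: str = field(compare=False)
    cases: int = field(compare=False)   # size of the order, in cases


class TripPlanner:
    """Holds pick lists received from the WMS and builds pick trips on demand."""

    def __init__(self, max_cases_per_trip):
        self.max_cases = max_cases_per_trip
        self._queue = []                # heap ordered by priority

    def accept_pick_list(self, order):
        heapq.heappush(self._queue, order)

    def next_trip(self):
        """Called when an operator reports ready: combine the most urgent
        orders that still fit into one trip."""
        trip, load, skipped = [], 0, []
        while self._queue:
            order = heapq.heappop(self._queue)
            if load + order.cases <= self.max_cases:
                trip.append(order)
                load += order.cases
            else:
                skipped.append(order)   # does not fit; keep for a later trip
        for order in skipped:
            heapq.heappush(self._queue, order)
        return trip
```

A real package would also need the human override capability described
above, for example a way for a supervisor to change an order's
priority or assign it to a particular operator.
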
We can refer to WMSs that can provide both
categories of information as supporting real-time
selection. Those that cannot are often
called batch systems.
Similar issues arise when considering a
voice directed system for other warehousing
operations. Can the WMS directly provide
all the information required, or must another
piece of software be added?
A second issue is the transfer, from the
voice system to the WMS, of information
concerning what products have been
selected. For a batch WMS using paper
pick lists, this information is usually keyed in
manually on an exception basis (only items
that were not selected as expected).
Depending on the volume of information to
be entered, and therefore the clerical hours
required, it may make sense to continue to
enter this information manually, from
printouts created by the middleware
package. Alternatively, depending on the
capabilities of the WMS, the middleware
package can either generate data files to be
imported by the WMS, or transfer the
information directly.
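
The exception-based reporting described above can be sketched as
follows. This is an illustration under stated assumptions: the
function names and the comma-separated layout are hypothetical and do
not correspond to any real WMS import format.

```python
def pick_exceptions(expected, picked):
    """Return {product: (expected_qty, picked_qty)} for every line of the
    pick list that was not selected as expected."""
    exceptions = {}
    for product, expected_qty in expected.items():
        picked_qty = picked.get(product, 0)
        if picked_qty != expected_qty:
            exceptions[product] = (expected_qty, picked_qty)
    return exceptions


def to_import_lines(exceptions):
    """Format the exceptions as comma-separated lines for a data file.
    The column order (product, expected, picked) is an assumption."""
    return [f"{product},{exp},{act}"
            for product, (exp, act) in sorted(exceptions.items())]
```

The same exception records could equally be printed for manual entry
or sent over a direct link; only the last step changes.
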
A prospective customer for a voice directed
warehouse operations system must
therefore consider its own WMS as well as
the voice system. Questions to be
answered (by some combination of the
customer, the WMS provider, and the voice
system provider) include the following:
• Does the WMS support real-time
order selection?
• If so, have the voice system and
WMS vendors already developed a
direct link between their systems?
• If not, what will be the terms and
technological means under which
they will connect? From the
technology viewpoint, several
connection mechanisms are
possible, including a direct link
(generally using sockets), terminal
emulation (in which the voice
terminals pretend to be
screen/keyboard terminals), and file
transfers (mostly used for rapid
prototyping). Of these, by far the
most flexible and reliable is the
direct link.
• If a middleware component is
required, is the voice system
provider’s offering flexible enough to
meet the facility’s requirements?
And does it meet the standard
criteria of demonstrated reliability,
capability, etc. for systems of the
size and scope being
contemplated?
Vocollect’s Design Decisions
Vocollect offers a variety of direct-to-WMS
connection schemes, including sockets,
terminal emulation, and file transfers. In
addition, we offer a capable and very flexible
middleware package called Pick Manager,
which currently runs on a Microsoft®
Windows NT®/SQL Server platform. Pick
Manager is in use today at multiple sites,
supporting in some cases as many as 200
wearable devices.
Technologies
Speech Recognition
Effective speech recognition (transcribing
human speech into text) is critical to the
success of any voice powered industrial
system. The critical measure of a speech
recognizer’s performance is accuracy –
does it correctly transcribe what it hears?
And getting a computer to perform speech
recognition as accurately as a human does
under all circumstances remains an
unsolved problem.
So if we want a computer to achieve high
recognition accuracy, we must simplify the
problem. Fortunately, there are a number of
ways designers can do that while still
meeting the needs of the users. The
simplifications that designers choose are
based on the intended application of the
system. In almost all cases, however, they
are making trade-offs between constraining
the problem and accuracy.
Another possible design trade-off is
accuracy versus time. We make the
assumption here that the designer must
create a system that can, in effect, hold a
real-time conversation with its users.
Processing for five minutes to transcribe two
seconds of speech is acceptable only in a
very limited number of applications.
In what follows we use the word
“understand” to mean, “be able to transcribe
from speech into text.”
Large Vocabulary and Small Vocabulary
The words that a speech recognition system
is expected to transcribe comprise its
vocabulary. Human beings have a very
large vocabulary. We can understand many
thousands of words. A speech recognizer
capable of taking dictation on a computer
must also have a vocabulary of thousands of
words. Such a recognizer is called a large
vocabulary system. At the other end of the
spectrum is a recognizer designed only to
tell whether a user has responded to a
question by saying “yes” or “no” – clearly a
small vocabulary issue. The distinction
between small vocabulary and large is
arbitrary, not rigid, but one thousand words
is a reasonable dividing line. Systems with
vocabularies of between a few hundred and
a couple of thousand words are sometimes
described as medium vocabulary.
In an ideal world, designers would create
only large vocabulary speech recognition
systems. But there are two trade-offs in
doing so. Most importantly, it is much more
difficult to create a high accuracy large
vocabulary speech recognizer than to create
a small vocabulary one. Correctly
recognizing one word from a vocabulary of
fifty thousand or more is far harder than
distinguishing “yes” from “no.” Second, a
large vocabulary recognizer generally
requires much more computing horsepower
(and memory) than does a small vocabulary
recognizer. So, for example, it may be
difficult to incorporate a large vocabulary
recognizer into a portable or wearable
device.
Fortunately, industrial speech recognition
systems do not generally require large
vocabularies. A typical warehouse
application requires a vocabulary of fewer
than one hundred words, while an inspection
application may require up to about one
thousand.
One other vocabulary issue is often
important to industrial users. A small
vocabulary may be perfectly acceptable, but
a fixed small vocabulary is not. System
design, and creating a positive experience
for the system’s operators, is much harder if
the speech recognizer places constraints on
choice of words: “You can’t use this word –
you must use that one.”
Continuous and Discrete
Human beings easily recognize speech in
which individual words are run together, with
no audible gaps between them, to form a
phrase or sentence. To do this we must
not only understand the individual
words, but also decide where the
boundaries between them lie. Sometimes
we must use very high-level knowledge to
perform this task. Consider these two
sentences:
“Six teen idols were cavorting on stage.”
“Sixteen idols were cavorting on stage.”
Only a deep understanding of the generally
accepted use of the word “idol” allows us to
understand that the first sentence is much
more likely to be correct than the second.
Early speech recognizers functioned only
with discrete speech. Users had to pause
perceptibly between each word. Today this
constraint is generally applied only in very
inexpensive recognizers (e.g., for toys), or to
make very hard problems easier (e.g.,
recognizing one of many thousands of
company names to provide stock quotes).
In either case, the system must make it clear
to the user that a single word is called for:
“Say the name of the company.”
For industrial applications, a discrete
recognizer is not acceptable. Having to
pause between words (for example, when
entering a sequence of digits) is slow and
frustrating. All speech recognizers for
industrial applications should be capable of
understanding continuous speech. Although
this makes the recognizer’s task more
difficult (i.e., makes it harder to keep
accuracy high) it is possible today to
create very high accuracy continuous
speech recognizers for industrial
applications.
Speaker Dependent and Speaker Independent
It is easier to understand someone’s speech
if the listener knows who is speaking and is
used to hearing him or her talk – particularly
if the speaker has an unusual speech
pattern or a strong accent. That statement
is even truer for computer-based speech
recognizers than for people, and for some
applications, we can make use of this fact to
improve recognition accuracy dramatically.
In some applications, it may not be possible
to require users to identify themselves to the
system, or to give the system time to
practice listening to them, as when calling
an automated telephone attendant. In the
warehouse, we can do both. In fact for a
small vocabulary recognizer we can make
use of a computer’s perfect memory to
improve performance dramatically for every
user, even those with unusual speech
patterns or strong accents. We can store
each user’s voice patterns, for every word
the recognizer will be required to
understand. Although Judy will not say
“one” in exactly the same way every time, if
the recognizer knows that Judy is speaking,
and has access to her personal voice
patterns, it will be able to transcribe her
speech much more accurately than if it tried
to compare her speech to all ways of saying
“one,” or to an average of how most people
say it. Such a system is referred to as
speaker dependent, meaning that it depends
for its accuracy on knowing who is talking to
it. A speaker independent system does not
make use of that knowledge, and is
therefore inherently less accurate.
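
The idea of comparing speech only against the known speaker's stored
patterns can be sketched in a few lines. This is a toy illustration,
not how any production recognizer works: the class name is
hypothetical, each "voice pattern" is reduced to a short list of
made-up numbers, and the comparison is a simple nearest-match search,
whereas real recognizers compare sequences of acoustic features with
far more sophisticated models.

```python
import math


class SpeakerDependentRecognizer:
    """Toy recognizer that stores per-user patterns for each vocabulary word."""

    def __init__(self):
        self._patterns = {}   # user -> {word: [pattern, ...]}

    def train(self, user, word, pattern):
        """Store one spoken example of `word` for `user`."""
        self._patterns.setdefault(user, {}).setdefault(word, []).append(list(pattern))

    def recognize(self, user, pattern):
        """Return the word whose stored example, for this user only,
        lies closest to the incoming pattern."""
        best_word, best_distance = None, math.inf
        for word, examples in self._patterns[user].items():
            for example in examples:
                distance = math.dist(example, pattern)
                if distance < best_distance:
                    best_word, best_distance = word, distance
        return best_word
```

Because the search is restricted to one user's own examples, the same
incoming pattern can map to "one" for one user and to "uno" for
another, depending only on what each user said during training.
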
The process of allowing a speech recognizer
to “practice” with a user is called training.
Speaker dependent recognizers are
therefore sometimes called trained systems,
and speaker independent recognizers are
referred to as untrained.
For a small vocabulary recognizer the
training process generally consists of having
the user speak to the system (one or more
times) all of the words in the recognizer’s
vocabulary. This is sometimes referred to
as enrollment training. For a large
vocabulary system speaking every word
during training is not practical. Speaker
dependent large vocabulary systems
generally use an adaptation process, in
which the user reads known passages of
speech to the system, and the system draws
conclusions about the user’s speech
patterns.
Note that a speaker dependent system must
allow for storing users’ voice patterns, and,
for a system involving multiple wearable
computers, for retrieving them on demand
so that any user can log on to any of the
computers. Today this is easy to do, so we
do not consider it much of a design trade-off.
How does a designer choose between
creating a speaker dependent and a
speaker independent system? The
trade-off is between the time the user must
take to train the system and the time he or
she will gain from the increased accuracy
that training brings. For a small vocabulary
industrial speech recognizer, this trade-off
calculation is very clear. If training the
system requires an investment of about
fifteen minutes (typical for a warehouse
application), and the user will operate the
system for perhaps two thousand hours in
the course of a year, even a tiny
improvement in performance as a result of
training will pay for itself very quickly in the
form of increased productivity through
increased accuracy. And it is generally
accepted that a trained recognizer will
typically have at least twice the accuracy
(half the error rate) of an untrained one.
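
The trade-off calculation can be made explicit. The sketch below uses
the figures given above (fifteen minutes of training, two thousand
hours of use per year); the function names are hypothetical, and the
one percent gain used in the note that follows is purely an
illustrative assumption, not a measured result.

```python
def break_even_gain(training_minutes, hours_per_year):
    """Fractional productivity gain at which the training time is
    repaid within one year of use."""
    return (training_minutes / 60.0) / hours_per_year


def hours_saved(gain, hours_per_year):
    """Working hours recovered per year for a given fractional gain."""
    return gain * hours_per_year
```

With these figures the break-even gain is 0.25 / 2000, or 0.0125
percent. If training were to yield a gain of just one percent, it
would recover about twenty hours a year in return for a quarter of an
hour of training.
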
A major advantage of a trained recognizer
for industrial applications is that a trained
system does not care about unusual speech
patterns, accents, or even language. It is
simply comparing the speech patterns it
recorded during the training process with the
ones it hears during use. A speaker
independent system, however, just cannot
accept (for example) “uno” as another form
of the word “one.” Users must conform to
the system’s expectations of their speech
patterns. For the workforce on a factory or
warehouse floor, where a wide variety of
accents and even languages is the norm,
this may not be practical.
Casual or Full-Time Users
Another set of design decisions for speech
recognizers revolves around casual versus
full-time users. For casual users, the speed
of data entry is less important than coping
with extraneous speech (speech the user
wants the recognizer to ignore), while for
full-time users the reverse is true.
A speech application designed for casual
use may, for example, require the user to
start and end each utterance with a specific
word (“ready, 1, 2, 3, enter”), and will reject
any user speech that is not in exactly the
right format. The same application designed
for full-time use would expect an utterance
of the form “1, 2, 3,” with the trade-off that the
recognizer is more likely to misinterpret an extraneous
utterance as a sequence of digits. In
industrial and warehousing applications,
where the user may be entering hundreds of
digit strings per hour for ten or more hours
per day, the 40% reduction in the amount of
speech required to enter the data is vastly
more important than a modest improvement
in the ability to reject extraneous utterances.
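
The two utterance formats, and the 40% figure, can be illustrated with
a short sketch. The function names are hypothetical, and each
utterance is assumed to arrive as a list of already-recognized words.

```python
def parse_casual(words):
    """Casual-user format: accept only 'ready <digits...> enter'."""
    if len(words) >= 3 and words[0] == "ready" and words[-1] == "enter":
        digits = words[1:-1]
        if all(w.isdigit() for w in digits):
            return "".join(digits)
    return None   # anything else is rejected as extraneous speech


def parse_full_time(words):
    """Full-time format: accept any non-empty sequence of digits."""
    if words and all(w.isdigit() for w in words):
        return "".join(words)
    return None


def speech_reduction(casual_words, full_time_words):
    """Fraction of spoken words saved by the full-time format."""
    return (len(casual_words) - len(full_time_words)) / len(casual_words)
```

For the example above, the casual format takes five words and the
full-time format three, so the reduction is (5 - 3) / 5 = 40%. The
cost is that the full-time parser accepts any digit sequence it
hears, including one spoken in a sidebar conversation.
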
A discrete speech recognizer may also be
appropriate for casual users (see the
discussion of discrete and continuous
recognizers above). The fact that each word
must be surrounded by silence allows the
recognizer to reject any utterance that
consists of multiple words. Again, however,
for any full-time user, the inability to enter a
sequence of words spoken continuously
would be a productivity killer and a source of
unbearable frustration.
A speaker independent recognizer may also
be more appropriate for casual users, if it is
not possible for them to devote the fifteen
minutes or so required to train a speaker
dependent recognizer. But for both casual
and full-time users, since a speaker
independent recognizer must correctly
detect and process a wide range of accents,
dialects, and speech properties, it is
naturally more prone to misinterpreting
extraneous speech (or other sounds) as
words that should be recognized.
An option that can provide very nearly “best
of both worlds” performance is to allow the
operator to change the mode of the
recognizer with simple, intuitive phrases. In
Vocollect's case, for example, our
recognizer typically operates in full-time user
mode, minimizing the amount of user time
and speech required to enter data. By
simply saying, “Talkman, sleep,” however,
the user can put the Talkman into a special
casual user mode, in which the only phrase
that returns it to normal operation is
“Talkman, wake up,” with that phrase
preceded and followed by brief silences.
While it is “asleep” the recognizer is almost
totally immune to any accidental activation
from extraneous speech or other outside
noise.
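
The mode switch can be pictured as a two-state machine. The sketch
below is only an illustration of the behavior described above, not
Vocollect's implementation: the class name is hypothetical, utterances
are assumed to arrive as already-transcribed text, and the requirement
for surrounding silence is modeled by demanding that the wake phrase
be the entire utterance.

```python
class RecognizerMode:
    """Toy two-state model of the sleep / wake behavior."""

    SLEEP_PHRASE = "talkman, sleep"
    WAKE_PHRASE = "talkman, wake up"

    def __init__(self):
        self.asleep = False

    def hear(self, utterance):
        """Return the utterance if it should be passed to the dialogue,
        or None if it was a mode command or was ignored."""
        phrase = utterance.strip().lower()
        if self.asleep:
            # Only the exact wake phrase, on its own, has any effect.
            if phrase == self.WAKE_PHRASE:
                self.asleep = False
            return None
        if phrase == self.SLEEP_PHRASE:
            self.asleep = True
            return None
        return utterance
```
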
Vocollect’s Design Decisions
Vocollect’s products use a continuous,
small (and variable) vocabulary, speaker
dependent recognizer. Users can speak
naturally, without pauses, because the
recognizer is continuous. The recognizer is
small vocabulary because that offers higher
performance, and industrial applications do
not require a large vocabulary. The
vocabulary is variable, and can therefore be
modified for each application (or even by
each user). And the recognizer is speaker
dependent because the small investment in
training a speaker dependent recognizer is
paid for many times over by the improved
productivity, and the user satisfaction
created by the increased accuracy (and
extraneous speech rejection) that training
the recognizer provides.
Finally, our products and applications are
clearly designed and built for full-time, not
casual, users. We strongly emphasize
productivity and ease of use (reduction in
the amount of speech required of the user).
While it is possible, for example, to create
Talkman dialogues that require specific
words to start and end each utterance, we
very rarely recommend doing so. At the
same time, our speech recognizer uses
numerous techniques to reject both non-speech
and extraneous speech sounds with
high confidence. And the ability to put the
recognizer to sleep (see above) and wake it
up with simple phrases provides a simple,
intuitive and virtually bulletproof mechanism
for permitting sidebar conversations. For
the rare occasions when these techniques
fail (and for the much more common ones
when the user mis-speaks!), a combination
of built-in Talkman features and dialogue
design techniques permits easy editing of
erroneously entered data.
Speech Synthesis
As speech recognizers make human speech
intelligible and meaningful to computers,
speech synthesis technology permits
computers to speak to humans. There are
two distinct speech synthesis techniques
available to systems designers.
Digitized Speech (also called Record and
Playback)
Digitized speech is what we hear when an
automated attendant or answering machine
speaks to us over the telephone. The
computer is essentially acting like a tape
recorder. At some time in the past a human
spoke into a microphone, and the speech
was converted to numbers (digitized) so the
computer could store it. On demand, the
computer recovers the digitized speech
samples and reconstitutes them into sound.
Digitized speech can be of very high quality.
However, someone must record, and the
computer must store, every word or phrase
the computer will have to speak during
operation. This may present a storage
problem for large applications, and it
invariably presents a maintenance concern.
If the application is to be modified, is the
original speaker still available to record new
words and phrases? And if not, will multiple
voices be acceptable, or must the entire
application be re-recorded? Also, creating
high quality voice recordings generally
requires both sophisticated equipment and,
perhaps more difficult, a professional
announcer.
A significant limitation of digitized speech in
some applications is that the computer can
only speak phrases that have been pre-recorded
(or that can be created through
concatenation). It is therefore functionally
impossible to create an application in which
the computer speaks, for example, product
descriptions, or in which the computer can
speak to its operator unpredicted text
messages that are sent to it from another
machine (e.g., a supervisor typing in a
message to be spoken to an operator).
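
The limitation is easy to see in a sketch of record and playback with
concatenation. The class name is hypothetical, and the recordings are
stand-in byte strings rather than real audio.

```python
class DigitizedSpeech:
    """Toy record-and-playback synthesizer with word-level concatenation."""

    def __init__(self, recordings):
        # Maps a lower-case word to its pre-recorded audio data.
        self.recordings = dict(recordings)

    def speak(self, text):
        """Return the concatenated audio for `text`, or raise KeyError
        if any word was never recorded."""
        parts = []
        for word in text.lower().split():
            if word not in self.recordings:
                raise KeyError(f"no recording for {word!r}")
            parts.append(self.recordings[word])
        return b"".join(parts)
```

A prompt built from recorded words ("aisle five") plays back without
trouble, while an unpredicted supervisor message fails at the first
word that nobody recorded.
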
Text-to-Speech
A computer with text-to-speech (TTS)
software can convert computer text directly
into spoken sounds. TTS removes all the
constraints and maintenance headaches of
digitized speech, as the computer can speak
any text presented to it (e.g., this document)
with no prior knowledge, and it is not
necessary to have anyone record or
maintain speech phrases. A computer using
TTS, however, does sound like a computer
speaking. It is clearly not human. In some
applications this may present a problem. It
is not always easy to understand someone
with a new accent the first time you hear him
or her. But as listeners we humans are
extremely adaptive. Give us a few minutes
and we can easily decipher even very strong
accents. The mild accent of a computer
using TTS is very easy to understand,
especially for industrial applications, in
which users typically hear very similar
phrases many times each day.
Vocollect’s Design Decisions
Vocollect offers both text-to-speech and
digitized speech in its products. We strongly
believe, however, that the advantages of
TTS far outweigh the slight loss in speech
quality. We therefore recommend that our
customers employ the TTS option, and
today every one of our many customers
does so.
Radio Frequency Local Area Networks (RF LANs)
RF LANs allow portable or wearable
computers to communicate wirelessly, at
high speed, and in real time with remote
information systems. An RF LAN is the
wireless equivalent of an Ethernet wired
computer network.
RF LANs operate like miniature cellular
telephone systems. Throughout a large
facility, multiple access points (like cell
phone towers) are mounted. As a portable
computer equipped with an RF network card
moves around the facility, it is automatically
handed off from one access point to
another, just as a cell phone in a moving
automobile is transparently handed off from
one cell tower to the next.
For many years there were multiple
competing RF LAN technologies, and
devices from different manufacturers could
not communicate with one another. Today
there is a single emerging standard, and one
other technology that may be considered for
some applications.
802.11b
The new technology standard is known as
802.11b (pronounced “eight oh two dot
eleven bee”), a designation that comes from the
Institute of Electrical and Electronics
Engineers (IEEE) 802.11 working group that
created it.
Radios using the 802.11b standard operate
in the 2.4 GHz range, and use direct
sequence spread spectrum (DSSS) technology to
achieve high data bandwidth. A single
portable device, or access point, has a
theoretical maximum bandwidth of 11
megabits per second (Mbps). Since all
portable devices communicating with a
single access point must share this
bandwidth, congestion may occur under
some circumstances. Multiple access points
can, however, be configured on up to three
non-interfering channels to cover a single area.
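
A back-of-the-envelope estimate shows how sharing affects each device.
The sketch below is deliberately optimistic: it uses the theoretical
11 Mbps maximum, ignores protocol overhead and contention, and assumes
devices are spread evenly across the access points covering the area.
The function name is hypothetical.

```python
def per_device_mbps(devices, access_points=1, access_point_mbps=11.0):
    """Upper bound on the bandwidth available to each device when
    `devices` share `access_points` non-interfering access points."""
    if access_points > 3:
        raise ValueError("only three non-interfering channels are available")
    return access_point_mbps * access_points / devices
```

For example, 22 devices sharing one access point have at most 0.5 Mbps
each; 33 devices spread over three access points have at most 1 Mbps
each. Real throughput will be lower.
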
We believe that 802.11b will remain the
standard of choice for industrial RF LANs for
at least the next three to five years.
Other Standards
Another RF communications standard that is
receiving a lot of press is Bluetooth™.
Bluetooth radios are designed to offer
moderate bandwidth, very short-range
communications for personal area (or
perhaps home) networks. Bluetooth is in
effect a replacement for the much sold but
little used infrared links incorporated into
many computers and some printers.
In the industrial world, Bluetooth may
eventually lead to wireless personal
peripherals – for example, a wireless
headset communicating with a belt-worn
voice computer. But two major
issues must be resolved before such
devices become reality. The first is that
Bluetooth and 802.11b operate in the same
frequency band, and they do interfere with
one another. So a wearable computer that
incorporates both 802.11b and Bluetooth
radios will need to be designed to avoid this
interference. An IEEE committee is working
on the interference issue, but has not as yet
released any recommendations.
The second concern with wireless personal
peripherals is that although they have the
desirable effect of eliminating wires, they
replace those wires with additional batteries.
A wireless headset must have a battery.
And living with an extra battery may be more
difficult overall than living with a wire. So it
is far from clear today that wireless personal
peripherals, whether driven by Bluetooth or
any other technology, will be attractive for
industrial applications.
Vocollect’s Design Decisions
Vocollect’s products support any RF network
cards that come in the standard PC card
form factor, and for which the required
software drivers are available. In practice,
this means all vendors’ networks. We
generally recommend assembling a single-vendor
802.11b solution, with access points
and the portable device PC cards coming
from the same manufacturer.
With respect to Bluetooth, we have been
monitoring the technology for some time,
but, for the reasons listed above, do not
expect it to appear in our products any time
soon.
System Architecture: Thick Client or Thin Client?
The intelligence (data processing) in an
application involving users who are remote
from the main computer system is always
distributed between the user’s computer (the
client) and the remote system (the server).
The term “thin client” is used to describe a
client device that does little data processing
– a traditional wired or RF data terminal is
the thinnest possible client (usually called a
dumb terminal). A thick client does a great
deal of processing locally, and uses the
remote server primarily as a data storage
device.
In an industrial voice system there are two
thick or thin decisions designers must make:
where the speech processing takes place,
and how much operating logic resides in the
client.
Speech Processing
All available industrial voice systems
perform speech synthesis on the client. For
speech recognition, however, the designers
have made different choices. In a system
using a server-based speech recognizer, the
speech signal from the wearable computer’s
microphone is pre-processed on the
wearable device, and then transmitted over
the RF network to a server that performs the
speech recognition work for many wearable
devices simultaneously. In a system using a
client-based speech recognizer, all of that
work takes place in the terminal, with no
server and no data transmission. In theory
the greater computing power available on
the server allows it to run more powerful
speech recognition algorithms than could
the wearable client. In practice, given the
great advances over the past few years in
the computing power that can be built into a
wearable device, and given the constrained
nature of the speech recognition problem for
industrial systems (see the Speech
Recognition discussion earlier in this
document), there is no real advantage today
in a server-based recognizer. There are
some significant disadvantages, however.
First is scaling. How many wearable
devices can a single speech recognition
server support? Even if the server is ten
times more powerful than a wearable device
could be, that still suggests the need for
multiple servers to support the one hundred
or more wearable devices often in use in a
warehouse. Can the system support using
multiple servers? Is it practical to support a
large “server farm” in the typical warehouse
environment?
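
The scaling argument is simple arithmetic. The sketch below assumes
the worst case, in which every wearable device needs recognition at
the same moment, and takes the "ten times more powerful" supposition
to mean that one server can serve ten devices at once. The function
name is hypothetical.

```python
def servers_needed(devices, devices_per_server=10):
    """Number of recognition servers required if each server can handle
    `devices_per_server` devices simultaneously (rounded up)."""
    return -(-devices // devices_per_server)   # ceiling division
```

Under these assumptions a warehouse with one hundred wearable devices
needs ten servers, before any allowance for redundancy.
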
Second is processing delay, or latency.
Users are very sensitive to delays in
response from a speech recognizer. What
will happen when multiple wearable
terminals send data to the server at the
same time? The RF network compounds
the latency issue. RF data networks, unlike
telephone networks, are not designed to
minimize delay times. They are designed to
guarantee data delivery. With multiple
wearable terminals in a single zone, there
are delays in transmitting data (and this
problem is compounded because
transmitting even pre-processed speech
adds dramatically to the amount of data that
must be moved over the network). This
issue is particularly important for high-speed
piece-pick operations. Vocollect used to
receive negative reactions from customers
to response delays of even about two-thirds
of a second (and as a result we
reduced those delays to less than one-quarter
of a second). A server-based
system simply cannot respond in this kind of
time frame. The much-vaunted “sub-second”
response of RF networks is about
an order of magnitude too slow.
Third is network coverage. The coverage of
industrial RF LANs today is generally
excellent. In fact, one leading vendor
guarantees continuous connectivity.
However, temporary issues, such as
seasonal movement of products within a
warehouse or failure of part of the network
backbone (access point, hub, wiring, power,
etc.) may create coverage dead spots. A
wearable computer using a server-based
speech recognizer will be useless under
these conditions.
A final concern for a server-based speech
recognizer is that the server becomes a
single point of failure for the entire system.
The server(s) must therefore be treated as a
mission-critical device, with full redundancy
and complete hot fail-over capability. Such
systems are not inexpensive.
Vocollect’s Design Decisions
All Vocollect products perform all speech
processing on the wearable device. They
are therefore thick client designs. We
devote considerable research and
development effort to optimizing our
extremely sophisticated speech recognition
algorithms to run very effectively on our
wearable devices. And we do this because
we firmly believe that the many benefits of
the thick client speech processing
architecture make that effort well worthwhile.
Operating Logic
Consider an order selection system for a
warehouse. At the thin client extreme, the
wearable device must communicate with the
server (the warehouse or picking
management system) at each step in every
pick operation, such as direction to slot,
verification of location, determination of pick
quantity, and verification. At the thick client
extreme the warehouse management
system (WMS) could transmit a complete
pick list to the wearable device, and
(perhaps an hour later) the wearable device
would report back that all items had been
picked. The design trade-off is between
real-time information and control (thin client
benefits), and guaranteeing rapid response
to the user regardless of RF network
performance and server load (thick client
benefit).
The considerations for rapid response to the
user are similar to those discussed above
for server-based and client-based speech
recognition. A thin client system requires
perfect RF network coverage, a network that
is not too heavily loaded, and an
always-responsive server.
These conditions are not easy to guarantee.
A thick client system may require more one-time
software work to link the client devices
to the WMS, but once the interface is
implemented the system can function
perfectly from the users’ point of view even if
the network and server are far from perfect
in their coverage and response times
respectively.
With respect to real-time information and
control, the primary operating requirement
for an order selection system is generally to
know immediately that product has been
removed from a location (or that a location
has become empty). A thin client system
accomplishes this automatically. A thick
client system can readily do so with a minor
modification. When an operator wants to
start work, the WMS transmits a complete
pick list to the operator’s wearable device.
Each time the operator completes an
operation, the wearable device transmits the
pick data back to the WMS. In a
well-designed system this data transmission
occurs in the background, while the operator
continues to work. If there happens to be a
dead spot in the RF network coverage, the
wearable device simply batches up the pick
data records until it can transmit them.
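The batching behavior described above can be
sketched as a simple first-in, first-out
queue on the wearable device. This is a
minimal illustration, not Vocollect's
implementation: PickRecord, PickReporter and
FakeLink are hypothetical names, the send
callable stands in for the RF link, and the
flush runs synchronously here rather than in
a background task.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PickRecord:
    """One completed pick (hypothetical fields)."""
    slot: str
    quantity: int


class PickReporter:
    """Queues completed picks and forwards them in order when possible."""

    def __init__(self, send):
        # send(record) -> bool; True means the WMS received the record.
        self._send = send
        self._pending = deque()

    def report(self, record):
        """Record a completed pick and try to transmit the backlog."""
        self._pending.append(record)
        self.flush()

    def flush(self):
        """Send queued records oldest first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # dead spot: keep the record and retry later
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self):
        return len(self._pending)


class FakeLink:
    """Stand-in for the RF LAN; set up=False to simulate a dead spot."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, record):
        if not self.up:
            return False
        self.received.append(record)
        return True
```

A record leaves the queue only after a
successful send, so picks completed in a
dead spot reach the WMS in their original
order once coverage returns, and the
operator never waits on the network.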
A secondary reason for wanting real-time
information and control in an order selection
system is to allow the WMS to modify a pick
trip while it is in progress. In theory this
would allow the WMS to have the operator
bypass a slot if that slot were known to be
empty. In practice, Vocollect does not know
of a WMS that modifies pick trips on the fly.
It is possible to compromise between the
thin client and thick client modes of
operation. The WMS might send two or
three pick records to the wearable device at
the beginning of a pick trip, and then send
one more each time the operator reports a
pick completion. This design guarantees
good response time for the operator
(because information for the next pick
operation is always on hand in the wearable
device), but it does not entirely overcome
the RF dead spot issue.
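The compromise can be sketched as a small
look-ahead buffer on the wearable device.
The class names, the window size of three
and the link_up flag below are assumptions
made for illustration; they do not describe
any particular WMS or product interface.

```python
from collections import deque


class WindowedWms:
    """Hypothetical WMS side: releases pick records a few at a time."""

    def __init__(self, pick_list, window=3):
        self._remaining = deque(pick_list)
        self._window = window

    def start_trip(self):
        """Send the first few records at the beginning of a pick trip."""
        count = min(self._window, len(self._remaining))
        return [self._remaining.popleft() for _ in range(count)]

    def on_pick_complete(self, record):
        """Answer a completion report with one more record, if any."""
        return self._remaining.popleft() if self._remaining else None


class Wearable:
    """Hypothetical wearable side: works from a local look-ahead buffer."""

    def __init__(self, wms):
        self._wms = wms
        self._buffer = deque(wms.start_trip())
        self._unreported = deque()
        self.link_up = True

    def next_pick(self):
        """Next record on hand, or None if the operator must wait."""
        return self._buffer[0] if self._buffer else None

    def complete_pick(self):
        """Finish the current pick and report it if the link allows."""
        if not self._buffer:
            return None
        done = self._buffer.popleft()
        self._unreported.append(done)
        self.sync()
        return done

    def sync(self):
        """Report finished picks; each report may return one new record."""
        if not self.link_up:
            return
        while self._unreported:
            record = self._unreported.popleft()
            following = self._wms.on_pick_complete(record)
            if following is not None:
                self._buffer.append(following)
```

With a window of three, the operator can
complete at most three picks inside a dead
spot before next_pick returns None, which is
why this design improves response time but
does not remove the dependence on coverage.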
Vocollect’s Design Decisions
Vocollect has opted for a thick client design
in all our products. However, the wearable
device software can readily be configured to
function as a thin client, or even as a dumb
terminal. For order selection systems, we
recommend to our customers that,
whenever possible, they operate our
equipment in the modified thick client mode
described above: the WMS sends a
complete pick trip to the wearable device,
and the device reports back (invisibly to
the operator) as pick operations are
completed. We believe this mode of
operation offers the best trade-off in overall
system design, guaranteeing very rapid
response to the wearable device operator
while providing real-time information to the
people and software managing the
warehouse.
Summary
Any creator of voice-powered systems for
industrial applications must consider a
variety of complex technology and
architecture decisions. Vocollect’s
rationale in making these decisions has
been to promote those options that our
experience tells us offer maximum benefit in
real-world environments, while also offering
maximum flexibility to meet specific
customer needs. We believe our choices
have been vindicated by our market
leadership position, and by the 100%
installation success rate we have achieved
in a broad range of applications and
operating environments.
Copyright © 2002 Vocollect, Inc. All rights reserved.
Voice System Technologies and
Architecture, version 1.2
Published by Vocollect
703 Rodi Road
Pittsburgh, PA 15235
t) 412.829.8145
f) 412.829.0972
Printed in the United States of America January 2002
Talkman® is a registered trademark of
Vocollect, Inc.
Microsoft and Windows are either registered
trademarks or trademarks of Microsoft
Corporation in the United States and/or
other countries. The Bluetooth name and
the Bluetooth trademarks are owned by
Bluetooth SIG, Inc. All other trademarks are
the property of their respective owners.
The information in this paper has been
carefully checked and is believed to be
accurate. However, Vocollect assumes no
responsibility for any inaccuracies that may
be contained in this paper. In no event will
Vocollect be liable for direct, indirect,
special, exemplary, incidental, or
consequential damages resulting from any
defect or omission in this paper, even if
advised of the possibility of such damages.
In the interest of product development,
Vocollect reserves the right to make
improvements in this paper and the products
it describes at any time, without notice or
obligation.