Understanding Voice Technology

Vitech's voice solutions use hardware and software from Vocollect, the leader in voice-directed work. Vitech chose to partner with Vocollect following extensive research into the voice offerings available and the technology decisions behind them. Every voice provider makes a handful of key decisions on the way to developing a voice solution. When you are using voice in a rugged environment such as a distribution center, those decisions will make or break the success of your project. Below we provide a brief explanation of those key areas, the competing options, and the direction chosen by Vocollect. For a more detailed explanation and further information, please visit our Voice Library for a variety of white papers on the topic.

Text to Speech vs. Pre-Recorded Speech

Text to Speech (TTS) technology allows a device to take string-based input and read it aloud to a user with a computer-generated voice. Pre-Recorded Speech involves a human recording the prompts to be spoken to the user and then playing those back at the appropriate times on the device. Pre-Recorded Speech has the benefit of sounding more human, but the benefits stop there. TTS offers serious benefits for voice applications that cannot be achieved with Pre-Recorded Speech. First, a good TTS engine allows a user to speed up or slow down the voice, so the pace can grow with users as they become more familiar with a process like voice picking. With Pre-Recorded Speech, separate prompts with fewer or shortened words need to be maintained, adding significant programming and storage overhead. A good TTS engine also allows users to choose different voices and change the pitch of a voice so that they can hear and understand it better. This is significant when working with users who may have a hearing impairment. Finally, by using TTS, the user will hear a fluid, consistent speech pattern throughout the application. Pre-Recorded Speech starts to sound "choppy" when reading long strings of digits, and it has to resort to TTS for item descriptions and other unplanned strings, resulting in an inferior user experience. Vocollect Voice utilizes Text to Speech.

Speaker Dependent vs. Independent

The voice solutions delivered by Vitech and used in the field today are not built for a casual voice user. A casual voice user is someone who periodically needs to use a voice recognition system, but not on a daily basis. You are a casual voice user when you call in to check a credit card balance and use the voice recognition feature. A voice picker in a warehouse is a full-time voice user. With that in mind, one of the most important success factors of any voice project is the accuracy of the voice recognition. If users have to repeat themselves numerous times throughout a task because the device does not hear them, or hears them incorrectly, you will quickly lose any productivity gains and be left with very frustrated users.

Speaker Independent technology aims to recognize everyone without the need for any up-front training. Although very useful for casual systems like a credit card call center, it does not work well in loud environments like a distribution center, nor does it offer the same accuracy when dealing with a variety of accents and languages.

Speaker Dependent technology is aimed at users of a full-time voice system. It does require up-front voice template training (usually as little as 15 minutes). This is a one-time investment for a user during their initial use of the system. That investment pays for itself thousands of times over because of the increase in recognition accuracy. Different accents and languages are no longer a problem because each user trains the system on how they talk. Vocollect's recognition engine is Speaker Dependent.

Finite Vocabulary vs. Infinite Vocabulary

A “finite” vocabulary voice system has a pre-determined set of commands that a user is allowed to speak and only listens for those commands. An “infinite” vocabulary system attempts to listen for any possible word in the language the user is speaking.

One of the main reasons companies look to voice technology to drive their supply chain workforce is to increase productivity. Voice technology aims to shave time off every task your workers perform. To accomplish that, the voice system needs a user interface that is as simple and efficient as possible. That means limiting the instruction set spoken by the user to a few commands that are very easy to remember. Typically, this should include the digits 0 through 9, possibly the letters A through Z (spoken phonetically), and a small list of command words such as "say again", "sign off", etc. The entire set should contain fewer than 100 words if possible, with a given user speaking 20 or fewer in a typical day. Given that goal, finite vocabulary is the clear choice for voice-directed work in a warehouse or industrial environment. It helps with three goals. First, it aids voice recognition because the device listens for a small set of commands. Second, training time is drastically reduced because the operator only needs to learn a small set of commands. Third, it increases productivity because the operator only needs to interact through a few very simple spoken commands.

Another point to check when evaluating a finite vocabulary system is that the number of commands is kept as small as possible, and that commands not used in your operation can be turned off to further increase productivity and reduce training time. Vocollect Voice utilizes a finite, small vocabulary system with the ability to enable and disable command words.
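To make the idea concrete, here is a minimal sketch in Python of a finite vocabulary whose command words can be switched on and off. It is an illustration only and does not represent Vocollect's software or API: the class name FiniteVocabulary, its methods, and the "ready" command are all hypothetical, and a real recognition engine matches audio rather than text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FiniteVocabulary:
    """Hypothetical finite vocabulary with per-command enable/disable."""

    words: set[str]
    disabled: set[str] = field(default_factory=set)

    def disable(self, word: str) -> None:
        # Only words that belong to the vocabulary can be turned off.
        if word not in self.words:
            raise KeyError(word)
        self.disabled.add(word)

    def enable(self, word: str) -> None:
        self.disabled.discard(word)

    def active(self) -> set[str]:
        # The recognizer listens only for words that are currently enabled.
        return self.words - self.disabled

    def accept(self, heard: str) -> str | None:
        """Return the normalized command if it is active, otherwise None."""
        candidate = " ".join(heard.lower().split())
        return candidate if candidate in self.active() else None


# Digits 0 through 9 plus a few command words ("ready" is an assumed example).
DIGITS = {str(d) for d in range(10)}
COMMANDS = {"say again", "sign off", "ready"}

vocab = FiniteVocabulary(DIGITS | COMMANDS)
vocab.disable("sign off")  # this operation does not use "sign off"
```

With "sign off" disabled, the sketch accepts "say again" and any digit, and rejects both the disabled command and any word outside the vocabulary. A smaller active set means fewer candidates for the engine to tell apart.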

Start and Stop Words

Start and stop words (or anchor words) tell a recognition engine when to start and stop listening. For example, in a voice system that uses start and stop words, the conversation between the device and the user at the point of a pick would be as follows: Device: "Pick 6" | Operator: "Pick 6 Confirm". The operator speaking the word "Pick" tells the recognition engine to start listening, and the word "Confirm" tells it to stop listening. This was very common when voice systems were first introduced into the warehouse and is still quite prevalent in systems that have a less than perfect recognition engine. The downside of start and stop words is that the operator has to say up to 3 times as many words at any given prompt. For a typical operator, that adds up to thousands of additional words spoken on a single shift. When looking at a voice system, be sure that it does not require start/stop words for its recognition engine to be effective. Vocollect Voice does not require or recommend the use of start/stop words.
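The overhead is easy to estimate. The short Python sketch below counts the extra words spoken per shift when every response is wrapped in a start and a stop word. The figure of 1,500 responses per shift is an assumption chosen for illustration, not a measured value, and the function name is hypothetical.

```python
def extra_words_per_shift(responses: int, words_with_anchors: int, words_without: int) -> int:
    """Extra words spoken per shift when each response carries anchor words."""
    return responses * (words_with_anchors - words_without)


# "Pick 6 Confirm" is three words; the bare answer "6" is one.
with_anchors = len("Pick 6 Confirm".split())
without_anchors = len("6".split())

# Assumed workload: 1,500 spoken responses in one shift (illustrative only).
extra = extra_words_per_shift(1500, with_anchors, without_anchors)
print(f"{with_anchors}x the words per response, {extra} extra words per shift")
```

Under that assumption the operator speaks 3,000 additional words per shift, which is in line with the "thousands of additional words" described above.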

In Area vs. Workstation Based Training

Part of the training that goes into a voice system is the initial "template training", where operators train the system to recognize their voice the first time they use it. To be as effective as possible, a voice system should allow operators to do their training in the areas where they will be working, with the ability to be mobile while doing so. An operator will tend to speak at a different volume and pace when moving around the work area than when sitting in a conference room. For the recognition engine to be as accurate as possible, operators should speak during template training as they will during their actual work day. In-area training also allows the voice system to react better to the ambient (background) noise present in the environment. Furthermore, it should be possible to do the training with the same device and headset the operator will use on a daily basis. That way, if operators have any recognition issues during a task, they can quickly and easily retrain a word without the need to seek out a supervisor or return to the training room. Vocollect Voice allows users to train on the same mobile device they use every day without the need for special equipment, and also allows users to retrain any problem words without exiting their current task.

Fully Integrated and Alternative Hardware Offering

Prospective customers often tell us that they want any software offering they purchase to run on hardware from multiple vendors, so that they are not held hostage by a single vendor. We can definitely appreciate that viewpoint and are happy to inform them that Vocollect Voice is certified on a number of devices provided by the leading hardware manufacturers (Motorola, LXE, Psion Teklogix, Intermec). However, when choosing a voice provider it is important to investigate how they go about "certifying" a given device. You do not want to choose a vendor that claims to run on any hardware. For a voice solution to be effective, the device needs a capable sound card, processor, and connection method to deliver a superior user experience. A good voice vendor should have a formal certification process that involves both lab and field trials. Be sure to validate that your voice vendor also has proven experience running on that device in the field. We would also suggest looking for a vendor that offers its own "built for voice" device, so that you have the option to purchase a fully integrated solution from a single vendor. Although a hybrid device is sometimes a good idea, a purpose-built device will typically perform better and last longer doing the function it was intended to do. Vocollect utilizes a strict certification process for its certified hardware platforms and also offers its own award-winning Talkman series device.

Vocollect Talkman T5 and SR20 Headset

