Speech Recognition

Automatic Speech Recognition -- Fall 2002

Automatic speech recognition (ASR) is now being applied to the transcription of speech for communication access by deaf and hard of hearing people.

The most successful and accurate of these applications currently (2002) use a technique called “shadowing,” sometimes also called “voicewriting.” Rather than having the speaker’s speech transcribed directly by the system, a hearing person whose voice the ASR system has been well trained on repeats the words being spoken.

This technique can greatly improve accuracy compared to direct ASR transcription of a speaker who is engaged in conversation or lecturing to a group of people. The professional voicewriter’s attention is dedicated to the task of transcribing. The voicewriter is able to watch the output of the speech recognizer, concentrate on the task, and keep the voice modulated for optimum accuracy. Typically the voicewriter works in an environment where noise is controlled, or else uses a mask to eliminate ambient noise and speech. When someone is conversing or lecturing directly into an ASR system, none of these conditions is met and accuracy is in general significantly worse.

Services

Ultratec, Inc. has developed a new telecommunications relay service based on this technique and on other innovations in the user interface for telephone calls. CapTel™, for Captioned Telephone, permits a relay-service user to both listen to the other party’s speech and read the ASR-based transcription on a small screen. The voicewriter’s presence is so unobtrusive that some new users believe the direct method of ASR transcription is being used. CapTel has other helpful features as well, such as a “dial-through” feature that eliminates the need for the caller to first dial the relay service and communicate with a communications assistant before beginning the call. As of fall 2002, CapTel is still experimental and is being tested and considered for implementation by relay services.

Viable Technologies, Inc. offers a live transcription service for meetings, lectures, and conference calls using the shadowing technique. Users access the service through a website. The transcriber typically gets access to the speech via a phone line at the location of the meeting or lecture. The user can control the appearance of the streaming text, review earlier text, and communicate via text with the real-time transcriber. The company cleans up the transcript after the class or event and emails it to the client within 24 hours.

How accurate is transcription using shadowing? As with stenography-based services, a lot depends on the skill of the voicewriter, the number of new proper names and jargon terms in the speech (which can later be added to the system’s dictionary), and the clarity and audibility of the speech being transcribed.

There is some debate about how accurate a transcription service needs to be in order to be effective. Although everyone agrees that 100% is the goal, the lower limits are sometimes contested. Certified court reporters’ required accuracy for certification is 96% at 180 words per minute. At this accuracy rate, when the speech is at a rate of 150 words per minute, the number of errors per minute should be about 6 on average (4% of 150 words). But if the accuracy falls to 90%, the number of incorrect words would be 15 per minute on average, or 750 in the course of a 50-minute class. It is easy to see why every percentage point counts, particularly in situations where the person is trying to extract new information in real time and learn, as in classrooms where deaf students depend on the captions to receive lectures and discussion.
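
The arithmetic in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes that errors are spread evenly over the speech and that accuracy is measured per word.

```python
def errors_per_minute(accuracy_percent, words_per_minute):
    """Average number of incorrect words per minute.

    accuracy_percent is the share of words transcribed correctly (0-100).
    Assumes errors are distributed evenly across the speech.
    """
    return words_per_minute * (100 - accuracy_percent) / 100


def errors_per_session(accuracy_percent, words_per_minute, minutes):
    """Average number of incorrect words over a whole class or meeting."""
    return errors_per_minute(accuracy_percent, words_per_minute) * minutes


if __name__ == "__main__":
    # Figures from the text: speech at 150 words per minute, 50-minute class.
    for accuracy in (96, 90):
        per_minute = errors_per_minute(accuracy, 150)
        per_class = errors_per_session(accuracy, 150, 50)
        print(f"{accuracy}% accuracy: {per_minute:g} errors/min, "
              f"{per_class:g} errors per 50-minute class")
```

Changing the accuracy argument by a single point shows how quickly the error count grows over a full class period.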
