Videophones

NOTE

This Web page was written in 1995, when deployment of ISDN, not to mention broadband Internet service, was still in the future. As of 2005, broadband Internet, provided via cable TV and ADSL, is available in many regions worldwide. In those regions, many deaf people use PCs equipped with webcams and software such as MSN Messenger for video communication.

MOVING PICTURE TRANSMISSION AT LOW BITRATES FOR SIGN LANGUAGE COMMUNICATION

Written by:
M. W. Whybray
BT Laboratories
Martlesham Heath
Ipswich, IP5 7RE
England

Introduction

There are about 50,000 people in the UK for whom sign language is the preferred means of personal communication, usually because of profound deafness. However, the existing telephone system only allows these people to use text-based terminals (textphones), which are very impersonal to use and about six times slower than verbal communication.

Sign language is as fast as speech, and so the ability to transmit moving pictures of sufficient quality for sign language over telephone and other networks would enable deaf people to use their most efficient means of communication, as well as allowing far more interactive and personally satisfying conversations. This paper describes the development at BT Laboratories of an experimental videophone for sign language communication that works over the normal Public Switched Telephone Network (PSTN) using a 14.4 kbit/s modem.

Video Codec

The key module in the videophone is a video codec (coder/decoder) that implements an image compression algorithm allowing transmission of moving images at a data rate of only 14.4 kbit/s. The coding algorithm used for early tests produced binary pictures (ie only black or white pixels) by means of a 'valledge' operator [1]. This operator was sensitive to the local valley-shaped image intensity features that had been identified as perceptually important when representing human faces and hands. The cartoon-like output image was quite intelligible when the person signing wore plain dark clothing, sat in front of a plain background, and was illuminated from the front. This was fine in the laboratory, but on field trial in deaf people's homes it was soon clear that it was not possible to control any of these factors sufficiently without causing unacceptable inconvenience to the users.
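
The valledge operator itself is defined in [1] and is not reproduced here; purely as an illustration of the idea, the sketch below (Python with numpy) marks pixels where the intensity profile is strongly concave-up along either image axis, yielding a binary, cartoon-like output. The curvature threshold is an assumed value.

    # Illustrative sketch only: a generic valley detector in the spirit of
    # the 'valledge' operator; the actual operator is defined in [1].
    import numpy as np

    def valley_map(img, curvature_threshold=8.0):
        """img: 2-D grey-level array. Returns a binary (0/1) valley image."""
        f = img.astype(float)
        d2x = np.zeros_like(f)
        d2y = np.zeros_like(f)
        # Discrete second derivatives along rows and columns.
        d2x[:, 1:-1] = f[:, :-2] - 2.0 * f[:, 1:-1] + f[:, 2:]
        d2y[1:-1, :] = f[:-2, :] - 2.0 * f[1:-1, :] + f[2:, :]
        # A valley pixel sits in a concave-up intensity profile (strongly
        # positive second derivative) in at least one direction.
        valley = (d2x > curvature_threshold) | (d2y > curvature_threshold)
        return valley.astype(np.uint8)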

A different coding algorithm was therefore developed which provided a full grey-scale. It is based on the CCITT Recommendation H.261 algorithm [2], and uses a combination of motion-compensated prediction and transform coding. Briefly, the new image frame to be encoded is divided up into square blocks of pixels, and compared to the last displayed frame. For each block in the new frame, the old frame is searched by looking in the local region for a block which matches the image data in the new block as closely as possible. The idea is that moving objects will be tracked by finding the corresponding match between the image data in the two frames. If a good match is found, then by transmitting the required displacement (or 'motion vector') for that and subsequent blocks a first approximation or 'prediction' of the new frame can be constructed.
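
A rough sketch of this block-matching search, in Python with numpy, is given below. The 8 by 8 block size, the +/-7 pixel search window, and the sum-of-absolute-differences (SAD) matching criterion are assumptions for illustration; the paper does not state the values used in the BT codec.

    # Find the motion vector for one block by exhaustive local search,
    # minimising the sum of absolute differences (SAD).
    import numpy as np

    def best_motion_vector(prev, curr, by, bx, block=8, search=7):
        """Return the (dy, dx) displacement into `prev` that best predicts
        the block of `curr` whose top-left corner is (by, bx)."""
        target = curr[by:by + block, bx:bx + block].astype(int)
        best, best_sad = (0, 0), None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                    continue  # candidate block falls outside the old frame
                cand = prev[y:y + block, x:x + block].astype(int)
                sad = int(np.abs(target - cand).sum())
                if best_sad is None or sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best  # applying this displacement to `prev` predicts the block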

This prediction is subtracted from the new frame and the resultant prediction error is then data compressed by means of a block based Discrete Cosine Transform (DCT). This acts in a similar way to a Fourier transform by mapping spatial domain data into a frequency-like domain. In this transform domain, the useful information tends to be ordered in such a way as to make data reduction by means of quantisation and variable length coding more efficient, and also has the advantage that the coding errors introduced are of a relatively benign nature since they manifest themselves largely as blurring and random noise in the final image. The codewords generated are assembled into packets, together with the block displacement information and synchronising codewords, for transmission. Frames are reconstructed at the receiving end by translating codewords back into transform coefficient values, performing an inverse DCT, and adding the resultant blocks to the prediction frame generated by applying the received motion vectors to the old displayed frame.
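
To make the transform path concrete, the following Python sketch runs one 8 by 8 block of prediction error through a forward DCT, a uniform quantiser, and the inverse path used at the receiver. The uniform step size and the omission of the variable length coding stage are simplifications; H.261 specifies its own quantisers, scan order, and codewords.

    # DCT / quantise / reconstruct round trip for one 8x8 prediction-error block.
    import numpy as np

    N = 8
    k = np.arange(N)
    # Orthonormal DCT-II basis matrix: C @ block @ C.T is the 2-D forward DCT.
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)

    def encode_block(err, step=16):
        coeffs = C @ err @ C.T                      # spatial -> transform domain
        return np.round(coeffs / step).astype(int)  # quantise (the lossy step)

    def decode_block(levels, step=16):
        coeffs = levels * step                      # inverse quantise
        return C.T @ coeffs @ C                     # inverse DCT -> spatial domain

    # At the receiver the decoded error is added back to the motion-compensated
    # prediction:  reconstructed = prediction + decode_block(received_levels)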

This coding scheme produces an output image which approximates to the input image, but is degraded by some loss of fine detail, and by coding noise. The degree of these effects is controlled by the step size of the quantiser. A smaller step size gives better pictures, but also lower data compression. The overall data rate generated depends on the image spatial resolution, the degree of compression, and the frame rate. Since the channel capacity is fixed at 14.4 kbit/s by the availability of suitable modems, there is the problem of finding the optimum values for the image resolution, distortion (ie quantiser step size), and frame rate for most effective sign language communication. Unfortunately the only way of measuring this effectiveness is by running extensive tests using people proficient in sign language. This procedure is complicated by the fact that sign language incorporates a range of complexity from large gestures to precise finger formations, and also because the proficiency of signers varies considerably. With several coding parameters to optimise, an exhaustive search of the parameter space is therefore impractical.
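
The quantiser control implied above is a feedback loop: coarsen the step when the coded output is over the bit budget, refine it when there is headroom, so that the average rate tracks the 14.4 kbit/s channel. The control law below is an illustrative assumption, not the scheme actually used in the BT codec.

    # Toy per-frame quantiser step adjustment driven by buffer occupancy.
    def update_step(step, buffer_bits, target_bits, min_step=2, max_step=62):
        """Return the quantiser step to use for the next frame."""
        if buffer_bits > target_bits:          # over budget: coarser quantiser
            return min(max_step, step + 2)
        if buffer_bits < target_bits // 2:     # well under budget: finer quantiser
            return max(min_step, step - 2)
        return step                            # within budget: leave unchanged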

Fortunately, previous work has narrowed down the appropriate range for some parameters. Reduced frame rate appears to have little effect until it drops below about 10-15 frames/s, after which a decline in performance sets in [3]. Reducing the spatial resolution gives a progressive loss of intelligibility, but resolutions as low as 24 by 16 (vertical/horizontal) pixels are still usable for grey-scale images [3]. To maintain a frame rate of above 10 frames/s on our system, whilst maintaining a reasonably clear picture by suitable quantiser control, we found that a picture resolution of around 48 by 48 pixels was appropriate. This was compared to an alternative format of 48 by 32 pixels by means of tests using signers who were timed when performing interactive tasks involving both the signing of sentences and finger spelling. The result was a small preference for the 48 by 48 format, despite the fact that the higher spatial resolution was paid for by a slightly reduced frame rate and increased coding distortion.

System design

The video codec is implemented using two TMS320C30 programmable Digital Signal Processors (DSPs), attached to a proprietary chip-set which allows the capture and display of video information, whilst giving full access by the DSPs to the digitised frames stored in memory. The fact that the coding algorithm is implemented entirely in software allows it to be fairly easily modified to alter parameters such as the image resolution or the quantiser control process. The DSPs also perform other functions such as drawing text on the screen.

Although the video codec is central to the videophone, there are a great many other design features that have been incorporated as a result of research and discussions with deaf and hearing impaired people. These are summarised below:

  • The screen and camera are mounted in a compact table-top terminal unit, which can be easily tilted to allow for different table and user heights.
  • Text can be sent at the same time as pictures by means of an attached keyboard. This allows signers to revert to the use of text in situations where it is sometimes more useful than signing, for example when communicating precise details such as names, addresses, and numbers, or when signing is difficult due to poor lighting etc.
  • The system can interwork with existing textphones, of which many thousands are already used by deaf people and also by some emergency services and help agencies.
  • The screen layout as per figure 1 allows the simultaneous display of the incoming picture, a smaller self-view picture, separate incoming and outgoing text lines, and some system status information. The self-view picture enables users to check that they are signing only within the view of the camera. A full-size self-view picture is also available by pressing a key.

    Figure 1 - not reproduced in ASCII!

  • Incoming calls are signalled by on-screen messages, a flashing xenon strobe light, and connections to any other alerting device such as tactile buzzers, or house-light flashing systems, as often used by deaf people.
  • A telephone style keypad and control keys provide a simple user interface, and outgoing call progress is indicated by on-screen messages such as 'Engaged'.
  • In addition to the above standard facilities, we have allowed for the optional inclusion of a low bitrate audio codec to be able to transmit sound as well as pictures. This is mainly for lip-reading purposes, as most people who lip-read actually have residual hearing, and require both sight of the lips and sound to lip-read effectively. However, we have not yet conducted any formal tests on this aspect.

User reactions

The first two systems were put on field trial in January 1992, with several more being deployed in the summer with a target of 15 on trial by the end of 1992. The triallists are mainly deaf people living in the Ipswich area, near to BT Laboratories. Initial reactions are that despite the very limited picture quality possible, the system is significantly quicker to use than a textphone, but more importantly gives a much greater depth and humanity to the communication. This is because there is instant feedback of the other person's reactions and emotional state, conveyed by body language and facial expression, as well as by formal sign language. For deaf people these visual aspects of communication are very important. At the time of writing the full trial is in its early stages, so it is not possible to comment on the long term usability of the system, which is ultimately more important than enthusiastic early reactions to a new 'toy'.

Future prospects for videotelephony

The future of the system described depends on the results of the trial and, if these are positive, on the options for cost-effective manufacture. The present design is an experimental test-bed, and as such costs well over £2000; this would have to be reduced to a few hundred pounds by extensive re-design to be viable. The recent announcement of at least two general purpose PSTN videophones using essentially similar technology indicates that this price goal is achievable in a mass-market item. However, these systems are not ideally adapted to deaf people's needs, since the image coding algorithms were not optimised for sign language, they have no text facilities and only limited visual indications of call status, and they use up valuable channel capacity to provide an audio channel which is usually of no use to deaf people. It may be that some adaptation of these systems is possible, however. Informal tests indicate that the picture quality is at least usable for sign language.

In the longer term future, the penetration of digital telephony, particularly the Integrated Services Digital Network (ISDN), will allow higher data rates (64 kbit/s and above) to be used for videophones, giving higher quality pictures than is possible on the PSTN. The ISDN is widely available now in the UK and abroad, as are ISDN videophones, and there have been several trials showing their value for sign language [4].

Other future developments include mobile videophones. Initial tests have been performed using a modified H.261 type codec, with data transmitted over a Digital European Cordless Telephone (DECT) radio link, and demonstrated the basic feasibility of the idea [5]. This will give a means of providing mobile video services, at least over a domestic area, which may have applications for people with mobility or other impairments.

[1] Whybray M W, Hanna E: 'A DSP based videophone for the hearing impaired using valledge processed pictures', Proc ICASSP89, Vol 3, M9.16, pp 1866-1869.

[2] CCITT Recommendation H.261: 'Video codec for audiovisual services at p x 64 kbit/s'.

[3] Sperling G, Landy M, Cohen Y, Pavel M: 'Intelligible encoding of ASL image sequences at extremely low information rates', Computer Vision, Graphics, and Image Processing 31, pp 335-391 (1985).

[4] COST 219 book: 'Issues in Telecommunications and Disability', edited by S von Tetzchner, published by the Commission of the European Communities, 1991, ISBN 92-826-3128-1. Catalogue number CD-NA-13845-EB-C.

[5] Heron A, MacDonald N: 'Video transmission over a radio link using H.261 and DECT', 4th International Conference on Image Processing and Applications, IEE, 7-9 April 1992.


Sign Language Phone

Contributed by Carlton D. Fuerst on 15 Sep 1994.

There was an article in the New York Times (Sunday, August 21, 1994) entitled "A Work in Progress: Sign Language Telephones" by Evan I. Schwartz. This article covers much of the information recently mentioned in the "Sign Language Phone" thread.

Apparently there is a sign language phone being developed at the A. I. duPont Institute in Wilmington, DE. This is not a video phone; phone lines do not allow true video images to be transmitted at full frame rates. As a result, video phone images are jerky at best and therefore not suitable for the "fluid motion" required by sign language. Richard A. Foulds, director of the University of Delaware's Applied Science and Engineering Laboratories (and head of the project), is quoted as saying "We look not for the quality of the picture, but for the quality of the movement." Apparently the image is reduced to a black and white outlined drawing of the person using "edge detection" technology. In a photograph, a signer is wearing a glove which senses hand position, but it is not clear whether this has anything to do with the sign language phone. The article indicates that the image from the video camera is fed to a circuit board in the back of an IBM-type PC. This development is sponsored by a $375,000 grant from the US Dept. of Education.
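
The article does not say which edge detector the duPont system uses; purely as an illustration, the Python sketch below thresholds the Sobel gradient magnitude to reduce a grey-level frame to the kind of black-and-white outline drawing described.

    # Illustrative outline extraction via the Sobel operator (an assumption;
    # not necessarily the detector used in the duPont system).
    import numpy as np

    KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # horizontal gradient
    KY = KX.T                                                   # vertical gradient

    def outline(img, threshold=100.0):
        """img: 2-D grey-level array. Returns a 0/1 outline image."""
        f = img.astype(float)
        gx, gy = np.zeros_like(f), np.zeros_like(f)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                shifted = f[1 + dy:f.shape[0] - 1 + dy, 1 + dx:f.shape[1] - 1 + dx]
                gx[1:-1, 1:-1] += KX[dy + 1, dx + 1] * shifted
                gy[1:-1, 1:-1] += KY[dy + 1, dx + 1] * shifted
        return (np.hypot(gx, gy) > threshold).astype(np.uint8)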

A separate system (sign language phone) is being developed by engineers at British Telecom. AT&T Bell Labs (Dr. Kicha Ganapathy) is working on edge detection technology and on the possibility that outline images may be useful to "add information to the conversation" of hearing people.

The sign language phone was tested at Gallaudet (presumably the U. of Del.'s phone) with 13 pairs of subjects. Of those, 11 pairs "were able to tease, laugh, argue, discuss, etc without becoming frustrated." 55% preferred the system to T.D.D., but there were complaints about blind spots and ghost images. There was a tendency for younger people to take to the sign language phone more readily.

There is also some information in the article about the importance of sign language to the deaf, and a joke about throwing a conductor from a train. It is explained that "a big part of deaf humor involved making fun of hearing people." It was not a great article, but as a scientist I found that it brought many technical questions to mind that were interesting to consider.


Video conferencing demos

Contributed by Dr. Cynthia King on 23 Oct 1994.

I have been to two video conferencing demos. ISDN at 64 kbps (not meg!) is not acceptable for sign language...especially over long periods of time. Some people find 384 kbps acceptable, but most of the deaf people in the group wanted 768 kbps (or half T1). Full T1 (which is the equivalent of 24 ISDN lines at 64 kbps each, plus framing) is 1.544 megabits/sec, not megabytes. I don't remember exact costs, but the figures $400 a month and $72/hour come to mind...i.e., a T1 line costs about $400 a month and then you pay $72/hour for connect time. You'd pay less if you use half T1, etc. Remember this is for video conferencing.

At the two demos I attended, one system was clearly superior. It handled motion very well and sign language was clear. On the other system, motion caused some problems in clarity...if you signed fast, your hands smeared across the screen. (If you've ever seen Mary Beth's sign language book for kids where she has a rainbow painted following her fingers across the cover... this video conferencing looked like that...except for the pretty colors). It was AWFUL to read!

I don't have any of the technical information handy about these systems... it's not something I'm interested in using in the near future (I'd really suggest waiting a couple years to let some of the bugs get worked out.)

Most business ISDN is done at 128 kbps...this means 10-15 frames per second and is acceptable for some purposes. Sign language digital video on CD-ROMs on a computer runs at about 15 frames per second and most people find it readable (this is not research, just experience). Most of us won't be happy, tho, until we have full-screen, 30 frames per second (fps) video for sign language. That is coming in a variety of ways...MPEG, hardware-assisted QuickTime and Video for Windows...for digital video. (We already have 30 fps with laserdisc video and there are about 3-4 sign language laserdiscs available). Both video conferencing and digital video, tho, are rapidly changing fields. If you don't need this stuff today, I'd wait about 18 months, when the prices will drop dramatically and bugs will be worked out.

Last update date: 
2005 Nov 17