Speech Technologies White Paper
Application
The scene: a local museum or gallery.
An art patron strolls in. She is informed that the building has a wireless network that will enhance her visit,
so she pulls out her PDA and browses to the given web site. As she works her way from gallery to gallery, and indeed from work to work, her
PDA immediately senses the closest work, retrieves any relevant information, and speaks it to her through headphones. She is prompted to ask
questions about that work, so she speaks a plain-English question to the PDA, which immediately answers aloud.
The system keeps track of where she paused the longest during her visit and each kind of interaction she had, and it tunes her experience accordingly.
|
A proof-of-concept application has been built that demonstrates a stable architecture capable of supporting the visit described above.
|
|
Using a wireless (802.11b) network, a PDA accesses a speech-enabled web page from a remote server. The
RFID-enabled handheld automatically (and wirelessly)
detects the artwork nearest to the visitor (marked with an active RFID tag), and retrieves a relevant set of information, images, and links for display on the PDA.
When the Speak button is tapped,
the PDA displays a small "audio meter" while the visitor speaks a question.
When the PDA detects that speech has begun, recording commences and continues until silence is detected. The audio meter gives real-time feedback about
loudness, background noise and word gaps that can help the person speaking have a better experience with the device.
A typical request might be "Please show all painters".
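The start-on-speech, stop-on-silence recording behaviour described above can be illustrated with a simple energy-based endpoint detector. This is a minimal sketch under stated assumptions, not the detector used in the proof of concept: the function names, the RMS threshold, and the number of silent frames that ends a recording are all hypothetical, and the audio is assumed to arrive as fixed-size frames of numeric samples.

```python
import math
from typing import List, Optional, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of audio samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_utterance(
    frames: List[Sequence[float]],
    threshold: float = 500.0,      # assumed loudness level that counts as speech
    max_silent_frames: int = 3,    # assumed run of quiet frames that ends recording
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first utterance, end exclusive.

    Recording starts at the first frame at or above the threshold and stops
    once max_silent_frames consecutive quiet frames have been seen. Shorter
    quiet runs (word gaps) are kept inside the utterance.
    """
    start: Optional[int] = None
    silent_run = 0
    for i, frame in enumerate(frames):
        loud = rms(frame) >= threshold
        if start is None:
            if loud:
                start = i
            continue
        if loud:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= max_silent_frames:
                # Trim the trailing silence from the reported span.
                return (start, i - silent_run + 1)
    if start is None:
        return None  # no speech detected at all
    # Stream ended while still recording; drop any trailing quiet frames.
    return (start, len(frames) - silent_run)
```

The per-frame RMS value computed here is also the kind of quantity an on-screen audio meter could display, which is one reason a level-based approach is a natural fit for this interface.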
|
|
The PDA sends the recording to the server, which converts the speech to text and returns it to the PDA. If the conversion succeeds,
the PDA immediately forwards the text request to another server for natural language processing.
A semantic engine "interprets" the sentence and converts it into a formal database query.
The query is then run against the database and the result set is sent to the web server, which forms a web page containing the results
and sends it back to the PDA. The web server also creates a voice response, which the PDA converts to speech output while it displays the results page.
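The request flow above (speech to text, semantic interpretation, database lookup, then a results page plus a voice response) can be sketched end to end as follows. This is an illustrative outline only: every function name, the single-rule "semantic engine", the table schema, and the sample rows are hypothetical stand-ins. The separate servers described above are collapsed into plain function calls, and the speech recognizer is replaced by a stub that treats the recording as UTF-8 text.

```python
import html
import re
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    page: str        # HTML results page shown on the PDA
    voice_text: str  # text the PDA would convert to speech


def speech_to_text(recording: bytes) -> Optional[str]:
    """Stub for the speech server: the 'recording' is assumed to be UTF-8 text."""
    text = recording.decode("utf-8", errors="ignore").strip()
    return text or None


def interpret(sentence: str) -> Optional[str]:
    """Toy semantic engine: maps one known request shape to a SQL query."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if "painters" in words:
        return "SELECT DISTINCT painter FROM works ORDER BY painter"
    return None  # request not understood


def handle_request(recording: bytes, db: sqlite3.Connection) -> Response:
    """Run one spoken request through the whole pipeline."""
    sorry = Response("<p>Sorry, I did not catch that.</p>",
                     "Sorry, I did not catch that.")
    text = speech_to_text(recording)
    if text is None:
        return sorry  # conversion failed, so stop before language processing
    query = interpret(text)
    if query is None:
        return sorry
    rows = [row[0] for row in db.execute(query)]
    items = "".join(f"<li>{html.escape(name)}</li>" for name in rows)
    return Response(f"<ul>{items}</ul>", f"I found {len(rows)} painters.")


def demo_db() -> sqlite3.Connection:
    """In-memory database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE works (title TEXT, painter TEXT)")
    db.executemany(
        "INSERT INTO works VALUES (?, ?)",
        [("Work 1", "Painter A"), ("Work 2", "Painter B"), ("Work 3", "Painter A")],
    )
    return db
```

Calling `handle_request(b"Please show all painters", demo_db())` walks the example request from the text through each stage. Keeping the stages as separate functions mirrors the separation between servers, so any one stage could be swapped for a remote call without changing the others.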
|
|
Because the application is a multimodal design, the visitor can interact by either speaking or tapping on the screen. For example, when a link in the result
set is tapped, more detail, including images, is returned to the PDA.
|
This was accomplished through complex interactions between off-the-shelf and custom components (see Technology), at low cost and with high performance. Much work remains to "harden" and tune the
application for everyday real-world usage (see Next Steps), but few unknowns remain.
|
|