Will Improvements in Voice Technology Eliminate the Need for Voice UX Design?
With our book Voice UX Design, we are aiming to avoid the standard, party-line approach to usability in Voice, which, frankly speaking, doesn’t work. Software programmers are continually encouraged to write software that “allows human beings to speak naturally” to a computer. The reality is that these systems do not yet have the requisite transcription accuracy, contextual parsing, or defined logic responses to deliver on that promise. Natural conversation is well beyond spec, at least for now, and so we need to develop the Voice UX toolkit to meet user needs.
TL;DR: Over the long term, linguistic models will improve along three dependent trajectories: transcription accuracy, context parsing accuracy, and digital response accuracy. Yet these improvements will not address the need for discoverability and usability. We will still need to employ tools like Method of Loci (MoL), the Beep, etc. to address these user needs in the short and long term.
The first trajectory, accurate transcription, is about better speech recognition. Google and Amazon are continually improving transcription accuracy, but a 1:20 or 1:10 error rate (due to a dirty soundscape, colloquialisms, accents, syntax, etc.) means we as programmers need to include a confirmation step for irrevocable or high-risk interactions. We are not going to conduct a financial transaction, for example, without a confirmation, because we know the error rate is too high. We have also found that errors tend to repeat: the system will misinterpret the same word or phrase again and again. We use MoL as a way of enforcing a context, and we can weave the confirmation into the structure (see the “Take a step forward to confirm” dialog below).
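To make that confirmation step concrete, here is a minimal Python sketch. The intent names, the handle_intent and execute functions, and the 0.95 confidence threshold are illustrative assumptions, not part of any particular voice platform’s API.

```python
# Minimal sketch of a confirmation gate for high-risk or low-confidence
# intents. All names here are hypothetical, not a real platform API.

HIGH_RISK_INTENTS = {"transfer_funds", "delete_account", "place_order"}
CONFIDENCE_THRESHOLD = 0.95  # assumed cutoff; tune per application

def handle_intent(intent_name, slots, transcription_confidence):
    """Route a parsed intent, inserting a confirmation step when the
    action is irrevocable or the transcription confidence is low."""
    if intent_name in HIGH_RISK_INTENTS or transcription_confidence < CONFIDENCE_THRESHOLD:
        # Echo our understanding back to the user and wait for an explicit
        # confirmation before doing anything irreversible.
        return {
            "speech": f"I heard: {intent_name.replace('_', ' ')} {slots}. "
                      "Take a step forward to confirm, or say cancel.",
            "await_confirmation": True,
        }
    return execute(intent_name, slots)

def execute(intent_name, slots):
    # Placeholder for the actual fulfillment logic.
    return {"speech": "Done.", "await_confirmation": False}

# A financial transaction always gets the confirmation step, even when
# the recognizer reports high confidence.
print(handle_intent("transfer_funds", {"amount": "200 dollars"}, 0.97))
```

The point is not the particular threshold but that the confirmation is designed into the structure of the interaction rather than bolted on afterward.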
The second trajectory is context parsing modeling, i.e. the capacity to parse accurate transcriptions perfectly into appropriate data structures. We usually point to IBM Watson as an example of effective semantic analysis of human speech. The issue is that the system requires a fleet of engineers and linguistic specialists to optimize the digital output. Before software programmers (generalists) can use that capacity, they need the specialists to distill that jockeying into rules and algorithms. What we don’t have right now is a way to make this parsing usable for non-specialists.
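To show what “appropriate data structures” might look like in practice, here is a small Python sketch of a parsed-utterance record. The schema (ParsedUtterance, intent, slots, confidence) is a hypothetical illustration, not Watson’s output format or any platform standard.

```python
from dataclasses import dataclass, field

@dataclass
class ParsedUtterance:
    """Hypothetical target of context parsing: transcription in, intent and slots out."""
    raw_text: str                               # the transcription (trajectory one)
    intent: str                                 # what the user is trying to do
    slots: dict = field(default_factory=dict)   # the details the response logic will need
    confidence: float = 0.0                     # the parser's confidence in this reading

# "book a table for four at seven tonight" might parse to:
parsed = ParsedUtterance(
    raw_text="book a table for four at seven tonight",
    intent="book_table",
    slots={"party_size": 4, "time": "19:00"},
    confidence=0.82,
)
print(parsed)
```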
The third trajectory is about encoding the correct response to accurately transcribed and accurately parsed user intent. Even if you are certain of a user’s intent, your digital system still needs to know how to respond. Human language is incredibly complex: accurate transcription is orders of magnitude less difficult than accurate parsing, which is orders of magnitude less difficult than building algorithms that can respond appropriately.
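Here is a rough sketch of that third trajectory, reusing the hypothetical intent and slot names from the sketch above. Even with a perfectly parsed intent, someone still has to encode a response for it, and that hand-built logic only covers the domains we anticipated.

```python
def respond(intent, slots):
    """Map an already-parsed intent to a spoken response (trajectory three)."""
    handlers = {
        "book_table": lambda s: f"Booking a table for {s['party_size']} at {s['time']}.",
        "get_weather": lambda s: f"Checking the weather for {s.get('city', 'your location')}.",
    }
    handler = handlers.get(intent)
    if handler is None:
        # Outside the narrow domains we anticipated, guide the user rather
        # than guess; this is where the Voice UX toolkit has to carry them.
        return "I'm not sure how to do that yet. Say 'help' to hear what I can do."
    return handler(slots)

print(respond("book_table", {"party_size": 4, "time": "19:00"}))
print(respond("compose_symphony", {}))
```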
Google has indexed the world, run analytics on nearly 20 years of textual queries, and has Google Assistant running on a half billion devices, yet the collection of algorithms needed to provide appropriate responses remains elusive. Imagine any dialog between Iron Man and his digital assistant, Jarvis. Assume perfect transcription and perfect parsing. What sort of API or logic could you implement to derive Jarvis’s response from that input? You’d need a team of linguistic specialists to make some headway in narrow information domains, and that capability isn’t accessible to most programmers. They need innovative solutions to get all three trajectories right. And even with those innovations in place, they may still need to serve users who need help understanding where to find content and how to access it. In the interim, programmers will need a fully developed Voice UX Toolkit to help users work with Voice systems effectively.