What We Wish We’d Known About Audio Games Design (Part II)
There’s a wildly popular website that collects postcards describing people’s deepest, darkest secrets. It’s called PostSecret, and it anonymously displays the front and back of ten postcards chosen from the thousands received weekly. The cards go up on Saturday night, stay up for one week, and are only very occasionally recycled. PostSecret has had nearly three-quarters of a billion (with a “b”) unique views.
We’re mentioning PostSecret because platforms – Amazon Alexa, Google Assistant, etc. – do a good job protecting users’ anonymity. Could Voice be a venue for confidential transmissions? Even anonymous ones? Maybe Voice could be a conduit for all kinds of private communications from a variety of sources, from confidants to therapists?
It’s an interesting idea, one that clients routinely bring up for exploration. What prohibits this approach – currently – is the technical capability of natural language processing (NLP) and natural language understanding (NLU). Voice assistants are not dictation machines. They can be pretty good at understanding speech. The industry standard claim is 90% accuracy, or maybe 95% on a good day. But consider that 90% means that one word in ten is wrong. That’s roughly one wrong word in every ten-word sentence.
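To make that arithmetic concrete, here is a minimal sketch. It assumes that errors on each word are independent and that a typical sentence is ten words long; both are simplifying assumptions of ours, not measurements from any platform, and the function names are purely illustrative.

```python
def expected_wrong_words(word_accuracy: float, num_words: int) -> float:
    """Average number of misrecognized words in a sentence."""
    return (1.0 - word_accuracy) * num_words


def chance_sentence_is_clean(word_accuracy: float, num_words: int) -> float:
    """Probability that every word is right, assuming independent errors."""
    return word_accuracy ** num_words


if __name__ == "__main__":
    SENTENCE_LENGTH = 10  # assumed typical sentence length
    for accuracy in (0.90, 0.95):
        wrong = expected_wrong_words(accuracy, SENTENCE_LENGTH)
        clean = chance_sentence_is_clean(accuracy, SENTENCE_LENGTH)
        print(f"{accuracy:.0%} per word: {wrong:.1f} wrong words expected, "
              f"{clean:.0%} chance the whole sentence is correct")
```

Under these assumptions, 90% per-word accuracy leaves only about a 35% chance that a ten-word sentence comes through untouched, and even 95% gets you to only about 60%.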
And that isn’t the only technical challenge. There is currently no way to edit a phrase to correct an error. For example, you could say a sentence, and your assistant will repeat it back to you for confirmation. You hear the mistake, but you lack a cursor or highlight to fix or flag the error. All you can do is reject the entire transcribed sentence and try again. And you can expect the same result, because the issue isn’t necessarily your pronunciation but rather the NLP/NLU itself. The Voice assistant misunderstands you again, and so it makes exactly the same mistake.
The take-home lesson is that Voice currently makes for a very bad dictation interface.
If you have a great Voice app idea that involves open recognition of anything more than a short phrase – especially if someone other than the speaker is expected to understand it – you would do well to stop and reconsider the project. No one needs to waste time learning this truth independently. Do another project while you wait for NLP/NLU to evolve.
StarLanes, the first massively multiplayer online Voice game
Initially, Voice came to prominence as the expected control interface for the Internet of Things (IoT). Voice would allow users to just speak naturally and ask for what they wanted, and the system would understand the request and carry it out. The reality turned out to be a bit more complicated.
Audio interfaces, we soon learned, can quickly get cluttered.
Let’s use the example of StarLanes, the first massively multiplayer online game for Voice. To play, gamers join one of two teams and move pieces around a common board to capture points. Teaming proved to be an effective strategy. Players could focus on tactical and strategic combat. In the next versions, we planned to keep the emphasis on tactical and strategic combat while incorporating trading and exploration.
But as we added more and more features to the game, the game started to destabilize.
We documented false positives, in which the user would say one thing and the game would hear it as something else. We discovered that the Voice interface – as it stood – had a practical limit of about 40 general intents. We could partly get around the limit by designing very specific intents, which helped maintain NLP/NLU accuracy. But even this work-around had its limits.
Our current proposed fix is to break the gameplay into four separate audio apps. All would share the same universe, and their gameplay would affect each other. But by segregating the non-overlapping functionality into different apps, we would avoid audio clutter.
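The idea of an intent budget can be sketched in a few lines. The cap of 40 is the practical limit we observed, not a constant published by any platform, and the app names and intent names below are hypothetical placeholders rather than the real StarLanes model.

```python
from collections import Counter
from typing import Dict, List, Set

MAX_INTENTS = 40  # practical limit we observed; an assumption, not a platform rule


def over_budget(apps: Dict[str, List[str]], cap: int = MAX_INTENTS) -> Dict[str, int]:
    """Map each app that exceeds the cap to the number of intents it is over by."""
    return {
        name: len(set(intents)) - cap
        for name, intents in apps.items()
        if len(set(intents)) > cap
    }


def shared_intents(apps: Dict[str, List[str]]) -> Set[str]:
    """Intents that appear in more than one app (the reusable vocabulary)."""
    counts = Counter(i for intents in apps.values() for i in set(intents))
    return {intent for intent, n in counts.items() if n > 1}


# Hypothetical single app carrying every feature at once.
monolith = {"starlanes": [f"intent_{n}" for n in range(55)]}

# Hypothetical four-way split; every app reuses "yes" and "no".
split = {
    "tactical": ["yes", "no"] + [f"tactical_{n}" for n in range(18)],
    "strategic": ["yes", "no"] + [f"strategic_{n}" for n in range(15)],
    "trading": ["yes", "no"] + [f"trade_{n}" for n in range(12)],
    "exploration": ["yes", "no"] + [f"explore_{n}" for n in range(10)],
}

if __name__ == "__main__":
    print("monolith over budget:", over_budget(monolith))
    print("split over budget:", over_budget(split))
    print("shared vocabulary:", sorted(shared_intents(split)))
```

A check like this could run whenever the linguistic model changes, so that a new feature that pushes an app past the cap is caught before recognition quality suffers.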
The take-home lesson is to put a cap on the complexity of your linguistic model. If you want good recognition, aim to keep the feature set small. Or find a way to create a vocabulary that can be re-used in multiple contexts. In almost every application we write, we add “yes” and “no” to the audio model. (We even created an entire quest engine in StarLanes driven entirely by “yes” and “no”. It was a little constrained, but it was easy to teach the users, easy to implement, and easy to design to – with some imagination.)
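To show how far two words can go, here is a minimal sketch of a quest driven only by “yes” and “no”. It is an illustration of the general idea, not the actual StarLanes quest engine; the class, the function, and the quest text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class QuestNode:
    """One spoken prompt, with a follow-up node for each of the two answers."""
    prompt: str
    yes: Optional["QuestNode"] = None
    no: Optional["QuestNode"] = None

    def __post_init__(self) -> None:
        # A node is either an ending (no branches) or a question (both branches).
        if (self.yes is None) != (self.no is None):
            raise ValueError("a question node needs both a yes and a no branch")

    @property
    def is_ending(self) -> bool:
        return self.yes is None and self.no is None


REPROMPT = "Please answer yes or no."


def run_quest(start: QuestNode, answers: Iterable[str]) -> List[str]:
    """Walk the quest tree and return everything the app would say."""
    node = start
    spoken = [node.prompt]
    for answer in answers:
        if node.is_ending:
            break
        word = answer.strip().lower()
        if word not in ("yes", "no"):
            spoken.append(REPROMPT)  # anything else is outside the vocabulary
            continue
        node = node.yes if word == "yes" else node.no
        spoken.append(node.prompt)
    return spoken


# Hypothetical quest content.
escaped = QuestNode("You outrun the pirates and dock safely. Quest complete.")
robbed = QuestNode("The pirates take your cargo. Quest over.")
declined = QuestNode("You stay in port.")
pirates = QuestNode("Pirates approach. Do you run?", yes=escaped, no=robbed)
start = QuestNode("A trader offers you a delivery job. Do you accept?",
                  yes=pirates, no=declined)

if __name__ == "__main__":
    for line in run_quest(start, ["yes", "maybe", "Yes"]):
        print(line)
```

Because every node accepts the same two words, adding a new quest means writing new prompts rather than adding new intents to the linguistic model.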