Testing automated transcription

Outside of my role as Technical Project Manager at Gaia Resources, I have a few hobbies and passions: spending time with my young family, exercise, the local mangrove forests, sleep. But perhaps more relevant, and with increasing cross-over to my day job, is managing and growing the Brisbane Meetup group for Website Accessibility and Inclusive Design.

I took over the group back in September, and since then we’ve run five meetup events and held one additional meeting in partnership with UX Brisbane. As of March, we now have a permanent home at the ABC in Southbank. Ash Kyd, one of the Web Developers at the ABC and a strong accessibility advocate, has organised for the ABC to host us on the third Tuesday of each month. This support means we can keep hosting events for free, at a regular and recurring time, and keep advancing accessibility causes.

Last night (Tuesday 17 April), I spoke about my experience road testing the new AWS Transcribe service, as well as leveraging Google’s Web Speech API.

The purpose of the talk was to see whether we can reduce the cost of creating accessible resources. The cost of transcription is very high: if it can be reduced, then more resources can be made accessible, and if many of the manual tasks around transcription can be automated, then time can be focused on quality and correction.

If you’re interested in the notes from my presentation, you can get them here: Transcription on the cloud [PDF 2.9MB].

In short, I found the AWS Transcribe service extremely useful, even though it made many errors. Effectively it gave you a roughly 80% accurate transcription that could be corrected in a fraction of the time it would take to create a transcription manually. It also gave you confidence rankings on a word-by-word basis, which is very cool:

Screenshot showing how AWS Transcribe rates all words with a confidence score
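
If you want to road test it yourself, here’s a minimal sketch of kicking off a job with the AWS SDK for JavaScript (TypeScript below). The bucket, key, job name and region are placeholders rather than anything from my actual testing; the job runs asynchronously and you poll for the transcript URI.

```typescript
import {
  TranscribeClient,
  StartTranscriptionJobCommand,
  GetTranscriptionJobCommand,
} from "@aws-sdk/client-transcribe";

const client = new TranscribeClient({ region: "ap-southeast-2" });

// Kick off an asynchronous transcription job against an audio file in S3.
// The bucket, key and job name are placeholders.
async function startJob() {
  await client.send(
    new StartTranscriptionJobCommand({
      TranscriptionJobName: "meetup-talk-demo",
      LanguageCode: "en-AU",
      MediaFormat: "mp3",
      Media: { MediaFileUri: "s3://my-bucket/talks/meetup-talk.mp3" },
    })
  );
}

// Poll until the job finishes; the result is a JSON transcript with
// per-word timestamps and confidence scores at the returned URI.
async function waitForJob(name: string): Promise<string> {
  for (;;) {
    const { TranscriptionJob } = await client.send(
      new GetTranscriptionJobCommand({ TranscriptionJobName: name })
    );
    const status = TranscriptionJob?.TranscriptionJobStatus;
    if (status === "COMPLETED") {
      return TranscriptionJob!.Transcript!.TranscriptFileUri!;
    }
    if (status === "FAILED") {
      throw new Error(TranscriptionJob?.FailureReason ?? "job failed");
    }
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}
```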

But perhaps its most powerful feature was the ability to add an external vocabulary. Some of the words you use are technical, rare, or spoken with a heavy accent. Loading these in as a CSV vocabulary means the machine learning model is primed for them, so you don’t get obscure and incorrect matches on those words.
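
To give a sense of how that hooks in via the SDK, here’s another sketch. The vocabulary name and phrases are made up for illustration, and where I loaded a CSV through the console, the SDK call below takes a plain list of phrases.

```typescript
import {
  TranscribeClient,
  CreateVocabularyCommand,
  StartTranscriptionJobCommand,
} from "@aws-sdk/client-transcribe";

const client = new TranscribeClient({ region: "ap-southeast-2" });

// Register the unusual words once; Transcribe will then favour these spellings.
// The vocabulary name and phrases are invented for illustration.
async function createVocabulary() {
  await client.send(
    new CreateVocabularyCommand({
      VocabularyName: "accessibility-meetup-terms",
      LanguageCode: "en-AU",
      Phrases: ["W.C.A.G.", "Gaia-Resources", "Speechlogger"],
    })
  );
}

// Reference the vocabulary when starting a transcription job.
async function startJobWithVocabulary() {
  await client.send(
    new StartTranscriptionJobCommand({
      TranscriptionJobName: "meetup-talk-with-vocab",
      LanguageCode: "en-AU",
      MediaFormat: "mp3",
      Media: { MediaFileUri: "s3://my-bucket/talks/meetup-talk.mp3" },
      Settings: { VocabularyName: "accessibility-meetup-terms" },
    })
  );
}
```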

One disadvantage was that every word is wrapped in its own timestamp range; words aren’t grouped into sentences or clauses. That granularity is what enables the per-word confidence scores and the external vocabulary, but it also means producing a usable *.srt export for captioning would still be a fairly manual process.
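
To illustrate the kind of grunt work that’s left, here’s a rough sketch of how the per-word items in the job’s output JSON could be grouped into timed caption cues and rendered as SRT. The chunking rule (flush every ten words or at sentence-ending punctuation) is a naive one I’ve invented; a real captioning workflow would still need human judgement on where lines break.

```typescript
interface TranscribeItem {
  start_time?: string; // absent on punctuation items
  end_time?: string;
  type: "pronunciation" | "punctuation";
  alternatives: { content: string; confidence: string }[];
}

interface Cue { start: number; end: number; text: string }

// Naively group word-level items into caption cues of at most `maxWords`
// words, flushing early whenever we hit sentence-ending punctuation.
function groupIntoCues(items: TranscribeItem[], maxWords = 10): Cue[] {
  const cues: Cue[] = [];
  let words: string[] = [];
  let start = 0;
  let end = 0;

  const flush = () => {
    if (words.length) {
      cues.push({ start, end, text: words.join(" ") });
      words = [];
    }
  };

  for (const item of items) {
    if (item.type === "punctuation") {
      // Attach punctuation to the previous word, with no space before it.
      if (words.length) words[words.length - 1] += item.alternatives[0].content;
      if (/[.?!]/.test(item.alternatives[0].content)) flush();
      continue;
    }
    if (words.length === 0) start = parseFloat(item.start_time!);
    end = parseFloat(item.end_time!);
    words.push(item.alternatives[0].content);
    if (words.length >= maxWords) flush();
  }
  flush();
  return cues;
}

// Render cues as the body of an SRT file.
function toSrt(cues: Cue[]): string {
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const ts = (s: number) => {
    const h = Math.floor(s / 3600);
    const m = Math.floor((s % 3600) / 60);
    const sec = Math.floor(s % 60);
    const ms = Math.floor((s % 1) * 1000);
    return `${pad(h)}:${pad(m)}:${pad(sec)},${pad(ms, 3)}`;
  };
  return cues
    .map((c, i) => `${i + 1}\n${ts(c.start)} --> ${ts(c.end)}\n${c.text}\n`)
    .join("\n");
}
```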

On the other hand, Google’s Web Speech API had amazing real-time processing of speech. Like, really, really good. It also had a much better handle on the Australian accent. There are some tools out there like Speechlogger (which leverages that API) that let you manually override the transcription on the fly and build up sentences and paragraphs. That lets you create good, reliable *.srt exports as it transcribes speech. Live. In real time. I didn’t see the ability to load in external vocabularies like you can with AWS Transcribe, but I didn’t road test this service as heavily.
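
For anyone curious what the browser side looks like, here’s a bare-bones sketch of the kind of Web Speech API usage that tools like Speechlogger build on. The element id is a placeholder of mine, and the webkit-prefixed constructor is what Chrome exposes.

```typescript
// Chrome exposes the recogniser behind a webkit prefix.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionCtor();
recognition.lang = "en-AU";        // noticeably better with Australian accents
recognition.continuous = true;     // keep listening rather than stop after one phrase
recognition.interimResults = true; // stream partial results as they firm up

const captionEl = document.getElementById("live-caption")!; // placeholder element id

recognition.onresult = (event: any) => {
  let interim = "";
  let final = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      final += result[0].transcript; // result[0].confidence is also available
    } else {
      interim += result[0].transcript;
    }
  }
  captionEl.textContent = final + interim;
};

recognition.start();
```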

I didn’t test either in a noisy environment, so I can’t comment on which would be better in that very common scenario. My tests involved me talking directly to each service, with one speaker in a fairly quiet room.

In short, both services offer real accessibility benefits, and the potential to make things like conference presentation transcripts (both for recorded video and live captioning) much more accessible and cheaper to produce. It’s an exciting area of development.

If you live in the Brisbane area and would like to get involved in the accessibility Meetup, we’d love to see you: you can join via our Meetup page, or drop me a line at morgan.strong@gaiaresources.com.au

Morgan
