Accuracy vs. Convenience: What we learned pitting MATE against Google Gemini

Life moves pretty fast. So does technology. If you’ve been in tech over the last five years, you know the pace of AI advancement is relentless. At Gaia Resources, we’ve shared our AI journey from fish image detection (2021) and handwriting analysis (2022) to more recently building the MATE framework for AI transcription and search of archival records. But for a team of passionate professionals, 'building' is never truly finished. We review, we audit, and - most importantly - we continue to tinker.

That urge to tinker recently led us to ‘eat our own dog food’ by using the MATE tool internally to transcribe our company-wide strategy meeting in Perth. We took the opportunity to compare MATE (which is currently built on the latest version of WhisperX) against Google Gemini’s inbuilt video meeting transcription service. The big learning after our Data Scientist Gail ran the ruler over both options: things have improved dramatically over the last few years, but nobody is perfect yet.

Both systems struggled with the quality of the recording in parts. Lesson learned: laptops are for notes, not for high-fidelity room recording. Neither result is unexpected, and when we compare them with what a human listener who was in the room could make out, we can forgive them for not being perfect. Side note: WhisperX and Whisper report confidence measures for individual words and for audio segments respectively - something we’ll take advantage of in future systems we build around the MATE framework.
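For the technically curious, here is a minimal sketch of how word-level scores could be used to flag passages for human review. It assumes a result shaped like WhisperX's aligned output (segments containing a "words" list, each word with a "score"); the sample data, the flag_low_confidence helper and the 0.5 threshold are all invented for illustration and are not part of MATE.

```python
# Hypothetical result shaped like WhisperX's aligned output: a list of
# segments, each carrying a "words" list with a per-word "score".
# The text, timings and scores below are invented for illustration.
aligned_result = {
    "segments": [
        {
            "text": "we reviewed the strategy",
            "words": [
                {"word": "we", "start": 0.10, "end": 0.22, "score": 0.94},
                {"word": "reviewed", "start": 0.25, "end": 0.70, "score": 0.41},
                {"word": "the", "start": 0.72, "end": 0.80, "score": 0.88},
                {"word": "strategy", "start": 0.82, "end": 1.40, "score": 0.35},
            ],
        }
    ]
}


def flag_low_confidence(result, threshold=0.5):
    """Return (word, score) pairs whose score falls below the threshold."""
    flagged = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            score = word.get("score")
            # A word may carry no score at all; treat it as unknown and skip it.
            if score is not None and score < threshold:
                flagged.append((word["word"], score))
    return flagged


for word, score in flag_low_confidence(aligned_result):
    print(f"check: {word!r} (score {score:.2f})")
```

A threshold like this would need tuning against real recordings; the point is only that low-scoring words can be surfaced to a reviewer rather than silently accepted.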

Despite the poor recordings, MATE clearly outperformed the Google Gemini transcription.

We calculated this by breaking each of the 286 rows of transcribed text into side-by-side columns of human transcription, Google transcription and MATE transcription. Whilst many of the rows were identical and consistent with human-level transcription, 64 rows showed variation or errors to compare. Across those 64 rows, MATE performed better in 75% of them, getting closest to the human transcription.
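For scale, 75% of 64 rows works out to 48 rows. A comparison like this could also be automated. The sketch below is one possible approach, not a description of how the evaluation above was actually scored: it measures each system's word-level similarity to the human transcript with Python's difflib and tallies which one lands closer. The sample rows and the similarity and tally helpers are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(reference, candidate):
    """Word-level similarity between two transcripts, from 0.0 to 1.0."""
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()
    return SequenceMatcher(None, ref_words, cand_words).ratio()


def tally(rows):
    """Count which system is closer to the human transcript on each row.

    Each row is a (human, google, mate) tuple of strings.
    """
    counts = {"identical": 0, "mate": 0, "google": 0, "tie": 0}
    for human, google, mate in rows:
        if human == google == mate:
            counts["identical"] += 1
            continue
        google_score = similarity(human, google)
        mate_score = similarity(human, mate)
        if mate_score > google_score:
            counts["mate"] += 1
        elif google_score > mate_score:
            counts["google"] += 1
        else:
            counts["tie"] += 1
    return counts


# Invented sample rows: (human, google, mate)
rows = [
    ("We met in Perth.", "We met in Perth.", "We met in Perth."),
    ("The fish image model worked well.", "The fish model worked well.",
     "The fish image model worked well."),
    ("Archives need accurate records.", "Archives need accurate records.",
     "Archives need accurate record."),
    ("I think, I think we should test it.", "I think we should test it.",
     "I think, I think we should test it."),
]

counts = tally(rows)
varied = len(rows) - counts["identical"]
print(counts, f"MATE closer on {counts['mate']} of {varied} varied rows")
```

Because punctuation stays attached to each word in this sketch, differences in sentence breaks and commas also count against a transcript, which suits an evaluation that cares about punctuation as well as words.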

Our evaluation also extended to sentence identification and punctuation. Despite a lot of similarity in transcription quality, punctuation and sentence structure also need to be factored in when comparing results.

Where the transcribed words were correct but the sentence structure and punctuation differed, MATE was correct in 6 of the 7 instances we saw. So on sentence identification and punctuation alone, the preference for MATE increased to roughly 85%.

Both transcription options removed ‘umms’ and ‘uhhs’ from general speech for a cleaner transcript. Google also tended to 'smooth over' uncertainty by removing duplicate words or phrases. While this makes for a cleaner read, it often strips the original context.

For some of our clients in the world of archives and collections, these 'small' stumbles aren't just typos - they are alterations of the historical record. When fed into additional search functionality, these omissions could alter the meaning and purpose of the original source media, which is less than desirable!

We built the MATE framework on a robust, open-source foundation whose components we can review and swap out as thorough testing like this reveals how they perform - we aren't locked into one provider’s bias. In this case, Whisper is proving to be the right choice.

Most proprietary models are trained predominantly on American English; if your team members are named Emily or David or Taylor, you’re probably fine. But for more diverse teams, or even indigenous languages, those commercial models can sometimes struggle. Both the Whisper and Gemini models do cover a lot of languages and, because Whisper is open source, we’re seeing the community add to it regularly. MATE allows us to swap in specialized models (such as ones that include other languages, like those being created by our friends at PARADISEC) to ensure everyone is heard correctly and history is recorded for all, long into the future.
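To illustrate the general idea of swappable models, here is a minimal sketch of a registry pattern in Python. It is purely hypothetical and does not describe MATE's actual code: the register and transcribe helpers, the model names and the stub backends (which return placeholder text rather than calling a real speech model) are all assumptions made for the example.

```python
from typing import Callable, Dict

# A transcriber is any callable that takes an audio path and returns text.
Transcriber = Callable[[str], str]

# Hypothetical registry mapping a model name to its transcriber.
REGISTRY: Dict[str, Transcriber] = {}


def register(name: str):
    """Decorator that adds a transcriber to the registry under a name."""
    def decorator(func: Transcriber) -> Transcriber:
        REGISTRY[name] = func
        return func
    return decorator


@register("general")
def general_model(audio_path: str) -> str:
    # Stub: a real backend would run a general-purpose speech model here.
    return f"[general transcript of {audio_path}]"


@register("language-specific")
def language_specific_model(audio_path: str) -> str:
    # Stub: a real backend would load a specialised language model here.
    return f"[language-specific transcript of {audio_path}]"


def transcribe(audio_path: str, model: str = "general") -> str:
    """Transcribe with whichever registered model the caller selects."""
    if model not in REGISTRY:
        raise ValueError(f"unknown model: {model!r}")
    return REGISTRY[model](audio_path)


print(transcribe("meeting.wav"))
print(transcribe("meeting.wav", model="language-specific"))
```

The calling code never changes when a model is added or replaced; only the registry does, which is what makes testing one option against another cheap.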

If you’re interested in learning more about the flexibility and functionality of MATE - or AI in general - across collections/archives or environment, please reach out to me directly or via our socials (LinkedIn, Facebook or Instagram) to chat more about the impact AI can deliver in your business.

Jarrad

*Please note, cover image AI generated