Transcribing audio recordings has historically been one of the most time-consuming aspects of interview or focus group research. Fortunately, recent advancements in AI have made this task significantly more manageable.
In addition to well-known AI models like ChatGPT and DALL-E, OpenAI has developed Whisper, an automatic speech recognition (ASR) system. Whisper automatically transcribes spoken language from audio recordings into text. Its proficiency is a result of being trained on a vast corpus of 680,000 hours of multilingual data collected from the web. This extensive training means Whisper can decode speech in a wide range of languages.
Whisper is generally quite accurate at decoding English speech. However, I was curious: how would it fare with English spoken with a Singaporean accent? I decided to put it to the test by asking it to transcribe spoken English with a distinct Singaporean accent, using freely available audio from Wikimedia Commons.
This is the audio recording:
And this is the transcribed result! I am using the openai package, which means you need an API key to run it. For the sake of brevity, I will include only the first few sentences of the transcript:
“An Inconvenient Truth from Wikipedia, the free encyclopedia at en.wikipedia.org. An Inconvenient Truth is a documentary film about climate change, especially global warming, directed by Davis Guggenheim and starring former United States Vice President Al Gore. The documentary is based largely on a multimedia presentation that Gore developed over many years as part of an educational campaign on global warming. An Inconvenient Truth is also the title of a campaign and book by Gore, which reached number one on the New York Times bestseller list of 2 July and 13 August 2006, and again during several months on the list. The film premiered at the 2006 Sundance Film Festival and opened in New York and Los Angeles on 24 May 2006. See 2006 in film. […] "
As you can see, it performs remarkably well! Although the recording doesn't incorporate colloquial Singlish expressions like 'lah,' 'leh,' 'lor,' the nearly flawless transcription would be a huge help in the interview/focus group research process, certainly much better than transcribing the interview recording manually.
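For readers who want to try this step themselves, a transcription call with the openai package might look roughly like the sketch below. It assumes version 1.x of the package, the hosted model name whisper-1, and an API key stored in the OPENAI_API_KEY environment variable. The helper names make_client and transcribe_file are hypothetical and are not taken from the Colab notebook, which remains the reference for the code actually used in this experiment.

```python
from pathlib import Path


def make_client():
    """Create an API client; needs `pip install openai` and OPENAI_API_KEY set."""
    from openai import OpenAI  # imported lazily so the helper below has no hard dependency

    return OpenAI()


def transcribe_file(client, audio_path, model="whisper-1", language=None):
    """Upload one audio file and return the transcript text.

    `language` is an optional ISO 639-1 hint such as "en", "zh", "id" or "ta".
    """
    options = {"model": model}
    if language is not None:
        options["language"] = language
    with Path(audio_path).open("rb") as audio_file:
        response = client.audio.transcriptions.create(file=audio_file, **options)
    return response.text
```

Calling transcribe_file(make_client(), "interview.mp3", language="en") would then return the transcript as a single string; interview.mp3 is only a placeholder file name.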
Whisper supports multilingual transcription, so I tested whether it can accurately transcribe speech in Tamil, Chinese, and Malay. I could not find a long enough recording of Malay speech on Wikimedia Commons, so I used an Indonesian audio recording as a substitute for this experiment.
Here is the result of the Chinese audio transcription:
他们是谁? 他是爱美,他是中国人, 他是东尼,他是美国人。 你也是美国人吗? 不是,我是英国人。 你呢?你是哪国人? 我是法国人。
I checked with a fluent Chinese speaker, who found that this transcription is not entirely accurate. In their words, “the audio transcription suffers due to lack of contextual understanding [of] the gendered transliterated names '爱美' and '东尼' and seems to have defaulted to 'he/他'. While 'he/他' and 'she/她' in Mandarin Chinese are homophones they are written differently.”
Peninggalan Johannes Latuharhari Sepeninggal Johannes, ia dihargai pemerintah Indonesia dengan bintang maha putra utama. Sosoknya diabadikan sebagai nama suatu jembatan di Jakarta beserta jalan di Ambon dan Jakarta. KM Johannes Latuharhari, kapal kargo yang dibangun di Polandia, juga dinamakan atasnya. Ada pula yayasan Mr. J. Latuharhari Foundation yang merupakan penerbit surat kabar sinar harapan di Ambon. Menurut sejarawan Australia Richard Southwell, Johannes merupakan tokoh pemimpin Ambon pertama yang mendorong agar Ambon dan Maluku termasuk dalam NKRI dan menganggap orang Ambon sebagai orang Indonesia […]
Since I can understand Indonesian/Malay, I was able to review the transcription myself. There is some missing or oddly placed punctuation here and there, but overall, it works very well with this particular audio. It’s worth noting that the speaker in this audio speaks relatively slowly and loudly (certainly slower than in normal conversation), which likely contributed to the accuracy of the transcription.
அதன் இசை மூவர் அல்லது தமிழ் இசை மும்மூர்த்திகள் அல்லது ஆத்மும்மூர்த்திகள் என்போர் தமிழுலேயே பாட்டி தமிழுலேயே பாடி தமிழ் இசையை வளர்த்த அர்ணாச்சலவெர் கதிராயர், முத்துத் தாண்டவர், மாறி முத்தாப்பிள்ணை ஏன்னும் மூன்று பிருமக் கள் அவர் பொதுவாக கரணாடகஇசையின் மும்மூர்த்திகள் என்று அறியப்படும் தியாகராஜர், சிஞாமா சாஸ்த்ரிகள், முத்துஷாமி தீட்சிதர் ஆகியோரைவிட இவர்கள் பலந்தால் முருப்பட்டவர்கள். தமிழ்சை மூம்மூரை கிருதி என்று அழைக்கப்படும் கீர்த்தனைகளுக்கு வடிவம் கொடுத்தோம். இன்று uğள்ள பள்ளவி, அணுபள்ளவி, சரணம் அல்லது எடுப்பு, தொடுப்பு, முடிப்பு என்னும் அமைப்ப wears இவர்களின் பாடல்களிலேயே காணக் கிடைக்கிறது. தமிழ்செய் மூம்மருக்கு தமிழ்லகாரசு சார்பில் சீர்காளியில் நினைவு மண்டவும் அமைக்கப்பட்டு வருகிறது.
When I checked with a Tamil speaker, they immediately noticed many misspellings and mismatches with the audio. Some parts were also not transcribed at all (marked in yellow above). The quality of the audio itself could be a contributing factor, as there is a bit of background noise in this one.
Transcribing multi-accent and multilingual audio
Everything seems promising thus far! However, local Singapore residents being interviewed may use colloquial Singlish, along with occasional mix-ins of Malay, Hokkien, or other languages. With this in mind, I decided to try an audio snippet from this Wah!Banana YouTube video, which has a Creative Commons license.
Here is the audio snippet:
Okay, let's continue to study Singlish. Okay. Can y'all try and use a sentence with the word la? I ate my dinner already la. Very good. I am Sergei la. I don't think that's how you use it Sergei. Oh, does it have the undertones that maybe someone forgot my name? Krishnan, help me out. What is your good name sir? My name is Sergei la. Very good la. Very very good sir. Well, I guess you're right. Want to go out with me la? No thanks la. Mr Tan, I don't even know why I'm in this stupid class. I'm from Malaysia. We're exactly the same. I thought you were three times cheaper in Malaysia. You're so stupid! You're saying we're cheap? Your Chinese is really bad. Anyway, y'all know what? In Singapore, most people can speak English. So I don't think y'all have any problem communicating.
As you can see, Whisper seems to have a decent grasp of our local Singlish, at least in the context of this audio. Furthermore, Whisper appears to automatically translate spoken Chinese and other dialects into English, and the translations seem accurate enough (according to native speakers). It’s worth noting that unlike the previous audio, where there was only one speaker, this snippet features multiple speakers, and as we can see, Whisper did not differentiate between them.
I'd like to point out something curious in this audio snippet. In the original video, some swearing in Chinese was censored with a beeping sound. Whisper appears to "hallucinate" here and renders this portion of the audio as "You're so stupid!". The Chinese speakers that I consulted confirmed that this does not accurately represent what the speaker is saying. Whisper seems to recognise this segment as cursing and attempts an automatic translation, albeit inaccurately. It does get the punctuation right, though!
Translating the audio samples to English
Next, I ran through the samples and translated them into English. However, something peculiar occurred in my experiment. Whisper worked smoothly for the Indonesian and Tamil translations, but it was inconsistent with the Chinese translation. At times, it seemed to encounter difficulties and did not return an English translation at all. I was finally able to get Whisper to translate the audio using another approach that did not require an API key (see the Colab notebook for more info).
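As a rough illustration of translating without an API key, the sketch below assumes the open-source whisper package installed from the project's GitHub repo (or via pip install openai-whisper), with ffmpeg available on the machine. The function names load_local_model and translate_to_english are hypothetical, and this is not necessarily the exact code in the Colab notebook.

```python
def load_local_model(size="base"):
    """Load the open-source model; needs `pip install openai-whisper` and ffmpeg."""
    import whisper  # imported lazily; no API key is involved

    return whisper.load_model(size)


def translate_to_english(model, audio_path, source_language=None):
    """Run Whisper's built-in translate task and return the English text."""
    options = {"task": "translate"}
    if source_language is not None:
        options["language"] = source_language  # e.g. "zh", "id", "ta"
    result = model.transcribe(str(audio_path), **options)
    return result["text"].strip()
```

For example, translate_to_english(load_local_model(), "chinese_sample.mp3", source_language="zh") would return the English text for a placeholder file named chinese_sample.mp3. Larger model sizes tend to be more accurate but are slower to download and run.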
Here is the translation from the Chinese audio sample:
Who are they? They are lovers, they are Chinese, they are Southeast Asia, they are Americans. Are you also Americans? No, I am an Englishman. Where are you? I am an Frenchman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman. I am an Englishman.
Here is the translation from the Indonesian audio sample:
The death of Johannes Latuharhari As Johannes passed away, he was respected by the Indonesian government as the main star. His figure was immortalized as the name of a bridge in Jakarta along with a road in Ambon and Jakarta. KM Johannes Latuharhari, a cargo ship built in Poland, was also named after him. There was also a statue of Mr. J. Latuharhari Foundation, who was the editor of Sinar Harapan newspaper in Ambon. According to Australian historian Richard Saufel, Johannes was the first leader of Ambon who encouraged Ambon and Maluku to be included in the NKRI and considered the people of Ambon as the people of Indonesia. […]
And here is the translation from Tamil to English:
Arunachalakavirayar, Muthuthandavar and Marimuthappillai are three great people who wrote and sang tamil songs and developed tamil music. In general, they are known as the forefathers of karnataka music. They are more knowledgeable than Thyagarajar, Siamasastrigal, Muthuswamy Deepchithar and Ahiyore. Tamizhisayimuvar gave inspiration to the songs called Krithi. Today's Pallavi, Anupallavi, Charanam or Eduppu, Thoduppu, Mudippu are found in their songs. Tamizhisayimuvar has a memorial hall in Tamil Nadu.
For the Indonesian translation, it is apparent that some of the nuances are lost, resulting in a translation that feels awkward and less fluid. The same can be said of the Tamil translation.
The Chinese translation seems to be particularly wobbly. Whisper appears to approximate the names to the closest proper nouns instead of recognising them as English names that have been converted into Mandarin by phonetic similarity. In describing "who are they", the final translation would lead you to think there are four people with four different nationalities, when there are only two. Additionally, the translation ends with the same sentence repeated many times, which looks rather eerie.
Check out the Colab notebook for these little experiments here:
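Repeated sentences like these can be flagged or trimmed with a small post-processing step before manual review. The sketch below was not part of the original experiment; collapse_repeats is a hypothetical helper that keeps only the first copy of any sentence repeated back to back, using a deliberately simple split on sentence-ending punctuation.

```python
import re


def collapse_repeats(text, max_repeats=1):
    """Keep at most `max_repeats` consecutive copies of the same sentence."""
    # Split after ., ! or ? followed by whitespace; good enough for short transcripts.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    kept = []
    run_length = 0
    for sentence in sentences:
        if kept and sentence.casefold() == kept[-1].casefold():
            run_length += 1  # same sentence as the previous one
        else:
            run_length = 1  # a new sentence starts a new run
        if run_length <= max_repeats:
            kept.append(sentence)
    return " ".join(kept)
```

A speaker may of course repeat a sentence on purpose, so anything trimmed this way should still be checked against the audio.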
If you’d like to try these with your own audio recording, note that you will need to upload them to Colab and amend the audio file name in the code.
- Whisper is an open-source model, but if you use Whisper through the OpenAI Playground or with an API key, do note that there is a fee of USD 0.006 per minute (rounded to the nearest second). If you don’t have access to an API key, an alternative is to install Whisper directly from its GitHub repo. The steps for this are included in the Google Colab notebook above.
- Through the API, Whisper can only accept files of up to about 25 MB, so if your file is larger than that, you may need to slice it into chunks first. The PyDub package is excellent for this task.
- Be sure to check if your audio is in one of the supported languages.
- At this point, Whisper is unable to differentiate between different speakers in an audio recording, so that’s something to keep in mind if you are transcribing a focus group recording.
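The slicing mentioned in the notes above could be sketched with PyDub as follows. This is a minimal example with hypothetical function names, a hypothetical output prefix, and an assumed chunk length of 10 minutes; whether a chunk of that length stays under the size limit depends on the audio format and bitrate. PyDub also needs ffmpeg installed to read and write most formats.

```python
def chunk_bounds(total_ms, chunk_ms):
    """Return (start, end) millisecond pairs that cover the whole recording."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


def split_audio(path, chunk_minutes=10, out_prefix="chunk", fmt="mp3"):
    """Write consecutive chunks of `path` to disk and return their file names."""
    from pydub import AudioSegment  # lazy import; pydub relies on ffmpeg

    audio = AudioSegment.from_file(path)
    chunk_ms = int(chunk_minutes * 60 * 1000)
    names = []
    for index, (start, end) in enumerate(chunk_bounds(len(audio), chunk_ms)):
        name = f"{out_prefix}_{index:03d}.{fmt}"
        audio[start:end].export(name, format=fmt)  # pydub slices by milliseconds
        names.append(name)
    return names
```

Cutting at fixed times can split a word in half, so it may be worth checking the text around each chunk boundary when stitching the transcripts back together.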
In summary, the overall transcription performance is great for English and Indonesian, but still a bit shaky for Mandarin Chinese and Tamil. For translation, human intervention is still needed to fine-tune the results for all three languages tested here.
Once you have cleaned up all the transcripts, you can load them into NVivo or Atlas.ti, available in SMU Libraries’ Investment and Data Studio, for analysis! Check out this ResearchRadar article for more information on Atlas.ti.
As always, if you have any comments or feedback that you would like to share, please do not hesitate to send them to firstname.lastname@example.org.
Many thanks to my colleagues Samantha, Aaron, and Sumita for the help on checking Mandarin and Tamil results!