Professional Machine Learning Engineer Exam - Question 282

Question

You work at an organization that maintains a cloud-based communication platform that integrates conventional chat, voice, and video conferencing into one platform. The audio recordings are stored in Cloud Storage. All recordings have an 8 kHz sample rate and are more than one minute long. You need to implement a new feature in the platform that will automatically transcribe voice call recordings into text for future applications, such as call summarization and sentiment analysis. How should you implement the voice call transcription feature following Google-recommended best practices?

Examice · Accepted Answer

To implement the voice call transcription feature, keep the original 8 kHz sampling rate and transcribe the audio using the Speech-to-Text API with asynchronous recognition. Google-recommended best practices are to send audio at its native sample rate rather than resampling it, because upsampling cannot recover frequency information that was never captured. Asynchronous recognition is the right fit because synchronous recognition is intended for short audio (under about 60 seconds), and these recordings are longer than one minute.
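
As a rough illustration of this answer, the sketch below builds the JSON body for the Speech-to-Text v1 REST method `speech:longrunningrecognize`, keeping `sampleRateHertz` at the native 8000 Hz. The bucket path, the `LINEAR16` encoding, and the `en-US` language code are assumptions, since the question does not specify them, and the helper name is hypothetical. It only constructs the request; it does not call the API.

```python
import json

NATIVE_SAMPLE_RATE_HZ = 8000  # from the question: all recordings are 8 kHz


def build_long_running_request(gcs_uri, language_code="en-US", encoding="LINEAR16"):
    """Build a request body for asynchronous (long-running) recognition.

    The encoding and language_code defaults are assumed values; adjust them
    to match the real recordings.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a Cloud Storage URI such as gs://bucket/object")
    return {
        "config": {
            "encoding": encoding,
            # Send the native rate unchanged instead of upsampling to 16 kHz.
            "sampleRateHertz": NATIVE_SAMPLE_RATE_HZ,
            "languageCode": language_code,
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and object name, for illustration only.
body = build_long_running_request("gs://example-bucket/calls/call-0001.wav")
print(json.dumps(body, indent=2))
```

In practice the body would be sent with an authenticated POST, and the returned operation polled until the transcript is ready.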

CHARLIE2108 · Answer

I went with D.
"following Google-recommended best practices"
https://cloud.google.com/speech-to-text/docs/optimizing-audio-files-for-speech-to-text#:~:text=We%20recommend%20a%20sample%20rate%20of%20at%20least%2016%20kHz%20in%20the%20audio%20files%20that%20you%20use%20for%20transcription%20with%20Speech%2Dto%2DText

tavva_prudhvi · Answer

Upsampling to 16 kHz:
The Speech-to-Text API recommends an audio sample rate of 16 kHz for optimal transcription accuracy. Upsampling the 8 kHz recordings to 16 kHz will improve the quality of the transcription.

Asynchronous Recognition:
Asynchronous recognition is suitable for longer audio recordings (more than one minute). It allows you to submit the audio file and receive the transcription results later, which is more efficient for batch processing.

https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data

Yan_X · Answer

B

https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#:~:text=Synchronous%20recognition%20requests%20are%20limited,periodically%20poll%20for%20recognition%20results.

guilhermebutzke · Answer

My Answer: B

- Upsampling is not necessary (this excludes C and D).
- Asynchronous recognition submits the request and returns the results later, rather than blocking until transcription finishes. It is therefore preferred over synchronous recognition for longer audio recordings, as it allows for more efficient processing, especially when dealing with larger volumes of data.

ludovikush · Answer

Following best practices, the easiest choice is B

SahandJ · Answer

According to the documentation, it's best to have a 16 kHz sample rate; however, one should avoid upsampling and instead use the native sample rate.

livewalk · Answer

According to Google's recommendation on sampling rate: "If possible, set the sampling rate of the audio source to 16000 Hz. Otherwise, set the sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling)."
So we should match the native sample rate (8 kHz) in the question.

PhilipKoku · Answer

B) Use original sampling rate and use asynchronous recognition...
"If possible, set the sampling rate of the audio source to 16000 Hz. Otherwise, set the sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling)."
https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data#sampling_rate

omermahgoub · Answer

Upsample to 16 kHz and Use Asynchronous Speech-to-Text Recognition

pinimichele01 · Answer

https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data: Capture audio with a sampling rate of 16,000 Hz or higher. Lower sampling rates may reduce accuracy. However, avoid re-sampling. For example, in telephony the native rate is commonly 8000 Hz, which is the rate that should be sent to the service.

https://cloud.google.com/speech-to-text/docs/optimizing-audio-files-for-speech-to-text#sample_rate_frequency_range: It's possible to convert from one sample rate to another. However, there's no benefit to up-sampling the audio, because the frequency range information is limited by the lower sample rate and can't be recovered by converting to a higher sample rate.

-----> B, not D
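
The frequency-range point follows from the Nyquist limit: audio sampled at a given rate can only represent frequencies up to half that rate. A minimal sketch of the arithmetic (the helper name is hypothetical):

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


native = nyquist_limit_hz(8000)      # 4000.0 Hz of real content in the recordings
upsampled = nyquist_limit_hz(16000)  # 8000.0 Hz of capacity after upsampling

# Upsampling raises the capacity of the file, but the recording still holds
# only the content that was captured below the native limit.
recoverable = min(native, upsampled)
print(recoverable)  # 4000.0
```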

asmgi · Answer

We have 8 kHz recordings that are longer than one minute.

https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data
"avoid re-sampling. For example, in telephony the native rate is commonly 8000 Hz, which is the rate that should be sent to the service."
-> 8 kHz
https://cloud.google.com/speech-to-text/docs/sync-recognize
"Synchronous speech recognition returns the recognized text for short audio (less than 60 seconds). To process a speech recognition request for audio longer than 60 seconds, use Asynchronous Speech Recognition."
-> asynchronous

So, the correct answer is B.
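
The two checks above can be sketched with Python's standard `wave` module: read the sample rate and duration from a WAV header, send the native rate as-is, and pick the recognition method from the duration. This assumes the recordings are WAV files, which the question does not state; the function names and the treatment of exactly 60 seconds as synchronous are also assumptions.

```python
import io
import wave

SYNC_LIMIT_SECONDS = 60  # per the sync-recognize documentation quoted above


def plan_transcription(wav_file):
    """Read a WAV header and return the sample rate to send and the method to use."""
    with wave.open(wav_file, "rb") as wav:
        sample_rate_hz = wav.getframerate()
        duration_s = wav.getnframes() / sample_rate_hz
    method = "asynchronous" if duration_s > SYNC_LIMIT_SECONDS else "synchronous"
    # The native rate is reported unchanged; no resampling is performed.
    return {
        "sample_rate_hertz": sample_rate_hz,
        "duration_s": duration_s,
        "method": method,
    }


def make_silent_wav(seconds, sample_rate_hz=8000):
    """Create an in-memory 16-bit mono WAV of silence, used only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * (sample_rate_hz * seconds))
    buffer.seek(0)
    return buffer


# A 90-second, 8 kHz stand-in for one of the call recordings.
print(plan_transcription(make_silent_wav(90)))
```

For the 90-second stand-in, the plan keeps 8000 Hz and selects asynchronous recognition, which matches answer B.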
