Exam: Professional Machine Learning Engineer
Question 282

You work at an organization that maintains a cloud-based communication platform that integrates conventional chat, voice, and video conferencing into one platform. The audio recordings are stored in Cloud Storage. All recordings have an 8 kHz sample rate and are more than one minute long. You need to implement a new feature in the platform that will automatically transcribe voice call recordings into text for future applications, such as call summarization and sentiment analysis. How should you implement the voice call transcription feature following Google-recommended best practices?

    Correct Answer: B

    Keep the recordings at their native 8 kHz sample rate and transcribe them with the Speech-to-Text API using asynchronous recognition. Google-recommended best practices call for sending audio at its native sample rate instead of resampling it, because upsampling cannot restore frequency content that was never captured. Synchronous recognition is limited to short audio (roughly one minute), so recordings longer than one minute need asynchronous recognition, which processes the audio in the background and lets the caller poll for the results.
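As a rough illustration of the reasoning above, the sketch below builds the JSON body for a Speech-to-Text v1 REST request and picks the synchronous or asynchronous method from the recording length. It only constructs the request and does not call the API. The bucket path, the `LINEAR16` encoding, the `en-US` language code, and the helper name `build_transcription_request` are assumptions made for the example; the question does not state them.

```python
import json

# Synchronous recognition is meant for short audio (under about 60 seconds).
SYNC_LIMIT_SECONDS = 60


def build_transcription_request(gcs_uri, duration_seconds, native_rate_hz=8000,
                                language_code="en-US", encoding="LINEAR16"):
    """Return (REST method, JSON body) for a Speech-to-Text v1 request.

    Hypothetical helper: encoding and language_code are assumed defaults.
    """
    # Longer recordings go to the asynchronous (long-running) method.
    if duration_seconds < SYNC_LIMIT_SECONDS:
        method = "speech:recognize"
    else:
        method = "speech:longrunningrecognize"

    body = {
        "config": {
            "encoding": encoding,
            # Declare the native rate of the recording; no resampling is done.
            "sampleRateHertz": native_rate_hz,
            "languageCode": language_code,
        },
        # The audio stays in Cloud Storage and is referenced by URI.
        "audio": {"uri": gcs_uri},
    }
    return method, body


# Hypothetical 95-second call recording.
method, body = build_transcription_request(
    "gs://example-bucket/call-0001.wav", duration_seconds=95)
print(method)
print(json.dumps(body, indent=2))
```

The body would be sent as a POST to the chosen method under `https://speech.googleapis.com/v1/`; the asynchronous method returns an operation that is polled until the transcript is ready.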

Discussion
CHARLIE2108 - Option: D

I went with D. "following Google-recommended best practices" https://cloud.google.com/speech-to-text/docs/optimizing-audio-files-for-speech-to-text#:~:text=We%20recommend%20a%20sample%20rate%20of%20at%20least%2016%20kHz%20in%20the%20audio%20files%20that%20you%20use%20for%20transcription%20with%20Speech%2Dto%2DText

tavva_prudhvi - Option: D

Upsampling to 16 kHz: The Speech-to-Text API recommends an audio sample rate of 16 kHz for optimal transcription accuracy. Upsampling the 8 kHz recordings to 16 kHz will improve the quality of the transcription.
Asynchronous recognition: Asynchronous recognition is suitable for longer audio recordings (more than one minute). It allows you to submit the audio file and receive the transcription results later, which is more efficient for batch processing.
https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data

Yan_X - Option: B

B https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#:~:text=Synchronous%20recognition%20requests%20are%20limited,periodically%20poll%20for%20recognition%20results.

PhilipKoku - Option: B

B) Use original sampling rate and use asynchronous recognition... "If possible, set the sampling rate of the audio source to 16000 Hz. Otherwise, set the sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling)." https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data#sampling_rate

livewalk - Option: B

According to Google's recommendation on sampling rate: "If possible, set the sampling rate of the audio source to 16000 Hz. Otherwise, set the sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling)." So we should match the native sample rate (8 kHz) given in the question.

SahandJ - Option: B

According to the documentation, a 16 kHz sample rate is best. However, one should avoid up-sampling and use the native sample rate instead.

ludovikush - Option: B

Following best practices, the easiest choice is B

guilhermebutzke - Option: B

My answer: B
- Upsampling is not necessary (this excludes C and D).
- Asynchronous recognition processes the request in the background instead of blocking until a result is ready. It is therefore preferred over synchronous recognition for longer audio recordings, as it allows more efficient processing, especially with larger volumes of data.

asmgi - Option: B

The recordings are longer than one minute and sampled at 8 kHz.
https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data "avoid re-sampling. For example, in telephony the native rate is commonly 8000 Hz, which is the rate that should be sent to the service." -> 8 kHz
https://cloud.google.com/speech-to-text/docs/sync-recognize "Synchronous speech recognition returns the recognized text for short audio (less than 60 seconds). To process a speech recognition request for audio longer than 60 seconds, use Asynchronous Speech Recognition." -> asynchronous
So the correct answer is B.

pinimichele01 - Option: B

https://cloud.google.com/speech-to-text/docs/best-practices-provide-speech-data: "Capture audio with a sampling rate of 16,000 Hz or higher. Lower sampling rates may reduce accuracy. However, avoid re-sampling. For example, in telephony the native rate is commonly 8000 Hz, which is the rate that should be sent to the service."
https://cloud.google.com/speech-to-text/docs/optimizing-audio-files-for-speech-to-text#sample_rate_frequency_range: "It's possible to convert from one sample rate to another. However, there's no benefit to up-sampling the audio, because the frequency range information is limited by the lower sample rate and can't be recovered by converting to a higher sample rate."
-> B, not D
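The point about frequency range can be checked with a small numeric sketch. It is not taken from the Google documentation; it is a standard sampling illustration, and the tone frequencies and the helper name `sample_tone` are chosen only for the example. At an 8 kHz sample rate, a 5 kHz tone produces the same samples as a 3 kHz tone, so anything above the 4 kHz Nyquist limit is already gone at capture time and no later upsampling to 16 kHz can bring it back.

```python
import math

SAMPLE_RATE_HZ = 8000
NYQUIST_HZ = SAMPLE_RATE_HZ / 2  # highest frequency an 8 kHz recording can hold


def sample_tone(freq_hz, n_samples=64, rate_hz=SAMPLE_RATE_HZ):
    """Sample a cosine tone of freq_hz at rate_hz (hypothetical helper)."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz)
            for n in range(n_samples)]


# A 3 kHz tone (below Nyquist) and a 5 kHz tone (above Nyquist).
tone_3k = sample_tone(3000)
tone_5k = sample_tone(5000)

# Compare the two sample sequences point by point.
max_diff = max(abs(a - b) for a, b in zip(tone_3k, tone_5k))

print(f"Nyquist limit: {NYQUIST_HZ:.0f} Hz")
print(f"largest sample difference: {max_diff:.2e}")
```

The difference is zero up to floating-point rounding, which means the two tones are indistinguishable once recorded at 8 kHz.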

omermahgoub - Option: D

Upsample to 16 kHz and Use Asynchronous Speech-to-Text Recognition
