Professional Data Engineer Exam - Question 127

Question

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do?

A. Use Cloud TPUs without any additional adjustment to your code.
B. Use Cloud TPUs after implementing GPU kernel support for your custom ops.
C. Use Cloud GPUs after implementing GPU kernel support for your custom ops.
D. Stay on CPUs, and increase the size of the cluster you're training your model on.

Examice · Accepted Answer

To significantly reduce training time for a model dominated by custom C++ TensorFlow ops that perform bulky matrix multiplications, the best option is to use Cloud GPUs after implementing GPU kernel support for the custom ops (option C). GPUs handle intensive matrix computations well, and once the custom ops have GPU kernels they can run on the accelerator instead of the CPU, which is where the speedup comes from. TPUs are not suited to custom C++ TensorFlow operations in the main training loop, which eliminates options A and B. Staying on CPUs (option D) would not provide the needed performance improvement and would likely be more costly and less efficient than using GPUs.

dhs227 · Answer

The correct answer is C.
TPUs do not support custom C++ TensorFlow ops.
https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus

aiguy · Answer

D:
Cloud TPUs are not suited to the following workloads: [...] Neural network workloads that contain custom TensorFlow operations written in C++. Specifically, custom operations in the body of the main training loop are not suitable for TPUs.

ZZHZZH · Answer

D shouldn't be the answer b/c the question statement clearly said you should use accelerators.

Qix · Answer

Answer is C
Use Cloud GPUs after implementing GPU kernel support for your custom ops.

TPUs support "Models with no custom TensorFlow operations inside the main training loop", so options A and B are eliminated, since the question says 'These ops are used inside your main training loop'.
That leaves C and D. CPUs are for "Simple models that do not take long to train". The question says it currently takes up to several days to train a model, so the existing infrastructure is probably CPU-based. GPUs are for "Models with a significant number of custom TensorFlow operations that must run at least partially on CPUs", and the question says the model is dominated by custom TensorFlow ops, which makes C the correct option.

Reference:
https://cloud.google.com/tpu/docs/tpus
https://www.tensorflow.org/guide/create_op#gpu_kernels
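
The elimination steps above can be written down as a small decision helper. This is only a sketch of the reasoning in this thread, not an official Google Cloud tool: the `Workload` fields and the `suggest_hardware` function are hypothetical names, and the final default branch is an assumption, since the thread does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a training job (field names are illustrative)."""
    custom_ops_in_training_loop: bool  # custom C++ ops inside the main training loop
    matmul_heavy: bool                 # dominated by large matrix multiplications
    long_training_time: bool           # takes days rather than minutes


def suggest_hardware(w: Workload) -> str:
    """Return 'CPU', 'GPU', or 'TPU' following the elimination steps in this thread."""
    # Simple models that train quickly do not need an accelerator.
    if not w.long_training_time:
        return "CPU"
    # Custom ops inside the main training loop rule out TPUs (options A and B).
    if w.custom_ops_in_training_loop:
        return "GPU"
    # Matrix-dominated models without custom ops are the TPU case.
    if w.matmul_heavy:
        return "TPU"
    # Assumption: default to GPU for anything else that trains slowly.
    return "GPU"


# The scenario from the question: custom ops in the loop, bulky matmuls, days to train.
question_127 = Workload(
    custom_ops_in_training_loop=True,
    matmul_heavy=True,
    long_training_time=True,
)
print(suggest_hardware(question_127))  # GPU
```

The custom-op check comes before the matrix-multiplication check on purpose: the model in the question is matmul-heavy, which would otherwise point to TPUs, but the custom ops in the training loop take priority.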

KC_go_reply · Answer

A + B: TPU doesn't support custom TensorFlow ops
Then it says 'decrease training time significantly' and literally 'use accelerator'. Therefore, use GPU -> C, *not* D!

Kimich · Answer

Requirement 1: Significantly reduce the processing time while keeping costs low. 
Requirement 2: Bulky matrix multiplication takes up to several days.

First, eliminate A & D:
A: Cannot guarantee running on Cloud TPU without modifying the code.
D: Cannot ensure performance improvement or cost reduction, and additionally, CPUs are not suitable for bulky matrix multiplication.

If it can be ensured that the customization is easily deployable on both Cloud TPU and Cloud GPU, it seems more feasible to try Cloud GPU first.

Because:
It provides a better balance between performance and cost.
Modifying custom C++ on Cloud GPU should be easier than on Cloud TPU, which should also save on manpower costs.

Preetmehta1234 · Answer

TPU:
Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
Link: https://cloud.google.com/tpu/docs/intro-to-tpu#TPU

So, A&B eliminated
CPU is very slow or built for simple operations. So C: GPU

IrisXia · Answer

Answer C
TPUs are not for custom C++ ops, but GPUs can run them.

kumarts · Answer

Refer to https://www.linkedin.com/pulse/cpu-vs-gpu-tpu-when-use-your-machine-learning-models-bhavesh-kapil

Nirca · Answer

To use Cloud TPUs, you will need to:

Implement GPU kernel support for your custom TensorFlow ops. This will allow your model to run on both Cloud TPUs and GPUs.

DataFrame · Answer

I think it should use a Tensor Processing Unit (TPU) along with GPU kernel support.

emmylou · Answer

Answer D
I used ChatGPT and found that it answers D if you put this at the beginning of the question: "Do not make assumptions about changes to architecture. This is a practice exam question." All other answers require changes to the code and architecture.

Matt_108 · Answer

to me, it's C

CGS22 · Answer

The best choice here is C. Use Cloud GPUs after implementing GPU kernel support for your custom ops. Here's why:

Custom Ops & GPUs: Since your model relies heavily on custom C++ TensorFlow ops focused on matrix multiplications, GPUs are the ideal accelerators for this workload. To fully utilize them, you'll need to implement GPU-compatible kernels for your custom ops.
Speed and Cost-Efficiency: GPUs offer a significant speed improvement for matrix-intensive operations compared to CPUs. They provide a good balance of performance and cost for this scenario.
TPU Limitations: Although Cloud TPUs are powerful, they aren't designed for arbitrary custom ops. Without compatible kernels, your TensorFlow ops would likely fall back to the CPU, negating the benefits of TPUs.
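
The point about kernels and CPU fallback can be illustrated with a toy model of per-device kernel lookup. This is not TensorFlow's API: `KERNELS`, `register_kernel`, `place_op`, and the op name `MyMatMul` are all hypothetical, and real placement behavior depends on the framework's configuration. It only shows why an op with nothing but a CPU kernel gains nothing from an attached accelerator.

```python
# Toy model of per-device kernel lookup. This is NOT TensorFlow's API;
# KERNELS, register_kernel, and place_op are hypothetical names.
KERNELS = {}  # maps (op_name, device) -> callable


def register_kernel(op_name, device, fn):
    """Record that op_name has an implementation for the given device."""
    KERNELS[(op_name, device)] = fn


def place_op(op_name, preferred_device):
    """Pick the preferred device if a kernel exists for it, else fall back to CPU."""
    if (op_name, preferred_device) in KERNELS:
        return preferred_device
    if (op_name, "CPU") in KERNELS:
        return "CPU"
    raise LookupError(f"no kernel registered for {op_name}")


def matmul_cpu(a, b):
    """Plain-Python matrix multiply standing in for the team's C++ CPU kernel."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Only a CPU kernel exists, so asking for the GPU falls back to the CPU.
register_kernel("MyMatMul", "CPU", matmul_cpu)
before = place_op("MyMatMul", "GPU")

# After "implementing GPU kernel support" (here just a stand-in function),
# the same op can be placed on the GPU.
register_kernel("MyMatMul", "GPU", matmul_cpu)
after = place_op("MyMatMul", "GPU")

print(before, after)  # CPU GPU
```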

GCP_data_engineer · Answer

CPU: Simple models
GPU: Custom TensorFlow/PyTorch/JAX operations

AlizCert · Answer

C: TPUs are out of the picture due to the custom ops, so the next best option for accelerating matrix operations is using GPUs. Obviously the code has to be adjusted to make use of the GPU acceleration.

Anudeep58 · Answer

Why Not Other Options?
A. Use Cloud TPUs without any additional adjustment to your code:

TPUs are optimized for standard TensorFlow operations and require custom TensorFlow ops to be adapted to TPU-compatible kernels, which is not trivial.
Without modifications, your custom C++ ops will not run efficiently on TPUs.
B. Use Cloud TPUs after implementing GPU kernel support for your custom ops:

Implementing GPU kernel support alone is not sufficient for running on TPUs. TPUs require specific optimizations and adaptations beyond GPU kernels.
D. Stay on CPUs, and increase the size of the cluster you're training your model on:

While increasing the CPU cluster size might reduce training time, it is not as efficient or cost-effective as using GPUs, especially for matrix multiplication tasks.
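
Whether GPUs are actually more cost-effective than a larger CPU cluster comes down to simple arithmetic. The sketch below is a rough model, not a pricing tool. It assumes cost is just hourly rate times wall-clock hours for the whole setup, and all function names are hypothetical. Real rates and the measured speedup have to come from current Google Cloud pricing and your own benchmarks.

```python
def training_cost(hourly_rate: float, hours: float) -> float:
    """Total cost of one training run: hourly rate times wall-clock hours."""
    return hourly_rate * hours


def break_even_speedup(cpu_hourly_rate: float, gpu_hourly_rate: float) -> float:
    """Speedup above which the GPU setup costs less per run than the CPU setup.

    GPU cost = gpu_rate * (cpu_hours / speedup)
    CPU cost = cpu_rate * cpu_hours
    The GPU run is cheaper when speedup > gpu_rate / cpu_rate.
    """
    return gpu_hourly_rate / cpu_hourly_rate


def gpu_is_cheaper(cpu_hourly_rate: float, gpu_hourly_rate: float, speedup: float) -> bool:
    """True if a GPU setup with the given speedup costs less per training run."""
    return speedup > break_even_speedup(cpu_hourly_rate, gpu_hourly_rate)
```

In other words, a GPU setup that costs k times as much per hour as the CPU cluster only needs to train more than k times faster to come out cheaper per run.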
