
Professional Machine Learning Engineer Exam - Question 35


You are developing a Kubeflow pipeline on Google Kubernetes Engine. The first step in the pipeline is to issue a query against BigQuery. You plan to use the results of that query as the input to the next step in your pipeline. You want to achieve this in the easiest way possible. What should you do?

A. Use the BigQuery console to execute your query, and then save the query results into a new BigQuery table.
B. Write a Python script that uses the BigQuery API to execute queries against BigQuery. Execute this script as the first step in your Kubeflow pipeline.
C. Use the Kubeflow Pipelines domain-specific language to create a custom component that uses the Python BigQuery client library to execute queries.
D. Locate the Kubeflow Pipelines repository on GitHub. Find the BigQuery Query Component, copy that component's URL, and use it to load the component into your pipeline. Use the component to execute queries against BigQuery.

Correct Answer: D

To automate a task in a Kubeflow pipeline, the easiest way is to use existing components whenever possible. By using the BigQuery Query Component from the Kubeflow Pipelines repository, you can seamlessly integrate BigQuery queries into your pipeline without the need to write additional custom code. This approach saves development time and ensures that you leverage well-tested and reusable components while enabling the results to be automatically passed to the next step in the pipeline.
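For illustration, a minimal sketch of that wiring with the kfp v1 SDK (the component URL matches the Kubeflow Pipelines repo linked in the discussion below; the parameter and output names follow that component.yaml and should be treated as assumptions):

import kfp.dsl as dsl
from kfp.components import load_component_from_url

# Load the prebuilt BigQuery Query component straight from the
# Kubeflow Pipelines repository on GitHub.
bigquery_query_op = load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/bigquery/query/component.yaml')

@dsl.pipeline(name='bq-query-pipeline')
def pipeline(project_id: str = 'my-project'):
    # Step 1: run the query; the component writes the result to GCS.
    query_task = bigquery_query_op(
        query='SELECT * FROM `my_dataset.my_table`',
        project_id=project_id,
        output_gcs_path='gs://my-bucket/query_results.csv')
    # Step 2 consumes query_task.outputs['output_gcs_path'] as its input.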

Discussion

15 comments
maartenalexander · Option: D
Jun 22, 2021

D. Kubeflow Pipelines has different types of components, ranging from low- to high-level. It has a ComponentStore that allows you to access prebuilt functionality from GitHub.
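As a minimal sketch of that ComponentStore route (kfp v1 SDK; the URL prefix pointing at the repo's components/gcp folder is an assumption about how the search path is set up):

from kfp.components import ComponentStore

# Search the Kubeflow Pipelines repo (via raw GitHub URLs) for
# reusable components under components/gcp/.
store = ComponentStore(url_search_prefixes=[
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/gcp/'])

# Resolves to components/gcp/bigquery/query/component.yaml.
bigquery_query_op = store.load_component('bigquery/query')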

gcp2021go
Aug 1, 2021

Agree. Links: https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb and https://v0-5.kubeflow.org/docs/pipelines/reusable-components/

NamitSehgal · Option: D
Jan 1, 2022

Not sure what the reason is behind putting A, as it is manual, and manual steps cannot be part of automation. I would say the answer is D, as it just requires loading the component from GitHub. Using Python and importing the BigQuery client library may sound good too, but the question asked what is easiest. How the word "easy" is taken depends on the individual, but it is definitely not A.

chohan · Option: B
Jun 18, 2021

Should be B

fragkris · Option: B
Dec 5, 2023

I'm going "against the flow" and choosing B. It just sounds like a much easier option than D.

PhilipKoku · Option: B
Jun 6, 2024

B) Python API

kaike_reis · Option: D
Nov 13, 2021

D. The easiest way possible in a developer's world: copy code from Stack Overflow or GitHub, hahaha. Jokes aside, I think D is correct. (A) is manual, so you would have to do it every time. (B) could work, but it is not the easiest because you need to write a script for it. (C) uses Kubeflow's built-in mechanism, but you have to do the work of creating a custom component. (D) is the (C) solution, but easier, using a previously created component to do the job.

aepos · Option: B
Nov 29, 2021

The result of D is just the path to the Cloud Storage location where the result is stored, not the data itself. So the input to the next step is this path, from which you still have to load the data? So I would guess B. Can anyone explain if I am wrong?

xiaoF · Option: D
Feb 1, 2022

D is good.

David_ml · Option: D
May 10, 2022

Answer is D.

friedi · Option: B
Jun 20, 2023

Very confused as to why D is the correct answer. To me it seems (a) much simpler to just write a couple of lines of Python (https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python), and (b) the documentation for the BigQuery reusable component (https://v0-5.kubeflow.org/docs/pipelines/reusable-components/) states that the data is written to Google Cloud Storage, which means we have to write the fetching logic in the next pipeline step, going against the "as simple as possible" requirement. Would be interested to hear why I am wrong.

friedi
Jun 22, 2023

Actually, the problem statement even says that the query result has to be used as input to the next step, meaning with answer D) we would have to download the results before passing them to the next step. Additionally, we would have to handle potentially existing files in Google Cloud Storage if the pipeline is either executed multiple times or even in parallel. (I will die on this hill 😆 ).
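(To make that concrete, the downstream fetch might look like the sketch below, assuming the kfp v1 SDK and a CSV result; pandas can read gs:// paths when gcsfs is installed. Names and the bucket layout are hypothetical.)

from kfp.components import create_component_from_func

def consume_query_results(results_gcs_path: str) -> None:
    # Second pipeline step: load the CSV that the BigQuery component
    # wrote to Cloud Storage before any actual processing can happen.
    import pandas as pd
    df = pd.read_csv(results_gcs_path)  # needs gcsfs for gs:// URLs
    print(f'Loaded {len(df)} rows from {results_gcs_path}')

consume_op = create_component_from_func(
    consume_query_results,
    packages_to_install=['pandas', 'gcsfs'])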

tavva_prudhvi
Nov 5, 2023

Yup, you raised valid points. Depending on your specific requirements and familiarity with Python, writing a custom script using the BigQuery API (Option B) can be a simpler and more flexible approach. With Option B, you can write a Python script that uses the BigQuery API to execute queries against BigQuery and fetch the data directly into your pipeline. This way, you can process the data as needed and pass it to the next step in the pipeline without the need to fetch it from Google Cloud Storage. While using the reusable BigQuery Query Component (Option D) provides a pre-built solution, it does require additional steps to fetch the data from Google Cloud Storage for the next step in the pipeline, which might not be the simplest approach.
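A minimal sketch of Option B under those assumptions: a Python-function component that queries BigQuery directly with the client library and hands the rows to the next step as an output artifact, with no intermediate Cloud Storage bookkeeping (project and query values are placeholders):

from kfp.components import create_component_from_func

def run_bq_query(project_id: str, query: str) -> str:
    # First pipeline step: execute the query and emit the rows as a
    # CSV string output for the next step to consume.
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    df = client.query(query).result().to_dataframe()
    return df.to_csv(index=False)

run_bq_query_op = create_component_from_func(
    run_bq_query,
    packages_to_install=['google-cloud-bigquery', 'pandas', 'db-dtypes'])

Note that passing rows through the artifact store like this only makes sense for small result sets; for large ones, the Cloud Storage hand-off that D uses is the more realistic pattern.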

Amabo · Option: D
May 5, 2024

from kfp.components import load_component_from_url

bigquery_query_op = load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/gcp/bigquery/query/component.yaml')

def my_pipeline():
    query_result = bigquery_query_op(
        project_id='my-project',
        query='SELECT * FROM my_dataset.my_table'
    )
    # Use the query_result as input to the next step in the pipeline

celia20200410 · Option: C
Jul 20, 2021

Ans: C. See https://medium.com/google-cloud/using-bigquery-and-bigquery-ml-from-kubeflow-pipelines-991a2fa4bea8 and https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build#kubeflow-piplines-components: "In Kubeflow Pipelines, a containerized task can invoke other services such as BigQuery jobs, AI Platform (distributed) training jobs, and Dataflow jobs."

raviperi
Sep 5, 2021

Why create a custom component when BigQuery's reusable component is already present? The answer is D.

donchoripan · Option: A
Mar 30, 2022

A. It says the easiest way possible, so it sounds like just running the query in the console should be enough. It doesn't say that the data will need to be uploaded again anytime soon, so we can assume that it's just a one-time query to be run.

David_ml
May 10, 2022

A is wrong. The answer is D. It's a pipeline, which means you will run it multiple times. Do you really want to run the query manually each time you run your pipeline?

Mohamed_Mossad · Option: D
Jul 9, 2022

https://linuxtut.com/en/f4771efee37658c083cc/

Mohamed_Mossad
Jul 9, 2022

The answer is between C and D, but the link above is an article that uses a ready-made .yaml file for the BigQuery component from the official Kubeflow Pipelines repo.

M25 · Option: D
May 9, 2023

Went with D