Databricks Certified Professional Data Engineer Exam - Databricks-Certified-Professional-Data-Engineer Exam Practice Test

Question 1

A data engineer is building a Lakeflow Declarative Pipelines pipeline to process healthcare claims data. A metadata JSON file defines data quality rules for multiple tables, including:
{
  "claims": [
    { "name": "valid_patient_id", "constraint": "patient_id IS NOT NULL" },
    { "name": "non_negative_amount", "constraint": "claim_amount >= 0" }
  ]
}
The pipeline must dynamically apply these rules to the claims table without hardcoding the rules.
How should the data engineer achieve this?

A. Reference each expectation with @dlt.expect decorators in the table declaration.

B. Load the JSON metadata, loop through its entries, and apply expectations using dlt.expect_all.

C. Invoke an external API to validate records against the metadata rules.

D. Use a SQL CONSTRAINT block referencing the JSON file path.

Correct Answer: B
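A minimal sketch of option B, assuming the metadata has the shape shown in the question. The helper name `load_expectations`, the inline `RULES_JSON` string, and the `raw_claims` source table are illustrative and not part of the question. In a real pipeline the JSON would be read from a file, and the resulting dictionary would be passed to the `@dlt.expect_all` decorator, which takes a dict mapping expectation names to constraint strings.

```python
import json

# Inline copy of the metadata from the question (normally read from a file).
RULES_JSON = """
{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}
"""


def load_expectations(metadata: dict, table: str) -> dict:
    """Return {expectation_name: constraint} for one table (hypothetical helper)."""
    return {rule["name"]: rule["constraint"] for rule in metadata.get(table, [])}


claims_rules = load_expectations(json.loads(RULES_JSON), "claims")

# Inside a pipeline source file, the dict would be applied roughly like this:
#
#   import dlt
#
#   @dlt.table
#   @dlt.expect_all(claims_rules)
#   def claims():
#       return spark.readStream.table("raw_claims")  # placeholder source
```

Because the rules live in the dictionary, adding or changing a rule only requires editing the metadata file, not the table definition.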


Question 2

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.
Which command should the data engineer use?

A. databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

B. databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

C. databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

D. databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

Correct Answer: B


Question 3

A company has a task management system that tracks the most recent status of tasks. The system takes task events as input and processes events in near real-time using Lakeflow Declarative Pipelines. A new task event is ingested into the system when a task is created or the task status is changed. Lakeflow Declarative Pipelines provides a streaming table (tasks_status) for BI users to query.
The table represents the latest status of all tasks and includes 5 columns:
* task_id (unique for each task)
* task_name
* task_owner
* task_status
* task_event_time
The table enables three properties: deletion vectors, row tracking, and change data feed (CDF).
A data engineer is asked to create a new Lakeflow Declarative Pipeline to enrich the tasks_status table in near real-time by adding one additional column representing task_owner's department, which can be looked up from a static dimension table (employee).
How should this enrichment be implemented?

A. Create a new Lakeflow Declarative Pipeline: use the read() function to read tasks_status table; enrich with employee table; store the result in a materialized view.

B. Create a new Lakeflow Declarative Pipeline: use the readStream() function to read tasks_status table; enrich with the employee table; store the result in a new streaming table.

C. Create a new Lakeflow Declarative Pipeline: use the readStream() function with the option skipChangeCommits to read the tasks_status table; enrich with the employee table; store the result in a new streaming table.

D. Create a new Lakeflow Declarative Pipeline: use readStream() function with option readChangeFeed to read tasks_status table CDF; enrich with the employee table; create a new streaming table as the result table and use apply_changes() function to process the changes from the enriched CDF.

Correct Answer: D


Question 4

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.
After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

A. The connection to the external table will succeed; the string "redacted" will be printed.

B. An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

C. An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

D. The connection to the external table will succeed; the string value of password will be printed in plain text.

E. The connection to the external table will fail; the string "redacted" will be printed.

Correct Answer: A


Question 5

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

A. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.

B. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.

C. Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.

D. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.

Correct Answer: B


Question 6

A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The join hinges on finding which IPv4 range each event's address falls into.
Table 1: network_events (≈ 5 billion rows)
event_id  ip_int
42        3232235777
Table 2: ip_ranges (≈ 2 million rows)
start_ip_int  end_ip_int  country
3232235520    3232236031  US
The query is currently very slow:
SELECT n.event_id, n.ip_int, r.country
FROM network_events n
JOIN ip_ranges r
ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;
Question:
Which change will most dramatically accelerate the query while preserving its logic?

A. Add a range-join hint /*+ RANGE_JOIN(r, 65536) */.

B. Force a sort-merge join with /*+ MERGE(r) */.

C. Increase spark.sql.shuffle.partitions from 200 to 10000.

D. Add a broadcast hint: /*+ BROADCAST(r) */ for ip_ranges.

Correct Answer: A
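The idea behind a range-join bin size can be illustrated in plain Python. This is a conceptual sketch of binning, not the Databricks implementation: each range is indexed under every fixed-width bin it overlaps, so a point only has to be compared with the ranges in its own bin instead of with every range. The function names `build_bins` and `lookup` are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 65536  # same value as the hint in option A


def build_bins(ranges, bin_size=BIN_SIZE):
    """Index each (start, end, country) range under every bin it overlaps."""
    bins = defaultdict(list)
    for start, end, country in ranges:
        for b in range(start // bin_size, end // bin_size + 1):
            bins[b].append((start, end, country))
    return bins


def lookup(ip_int, bins, bin_size=BIN_SIZE):
    """Compare the point only with ranges that share its bin."""
    for start, end, country in bins.get(ip_int // bin_size, []):
        if start <= ip_int <= end:
            return country
    return None


# Sample rows taken from the two tables in the question.
ip_ranges = [(3232235520, 3232236031, "US")]
bins = build_bins(ip_ranges)
result = lookup(3232235777, bins)  # the sample event's ip_int
```

A bin size close to the typical range length keeps both the number of bins per range and the number of candidate ranges per point small.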


Question 7

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?

A.

B. preds.write.mode("append").saveAsTable("churn_preds")

C. preds.write.format("delta").save("/preds/churn_preds")

D.

E.

Correct Answer: B

Question 8

A data engineer manages a production Lakeflow Declarative Pipeline that processes customer transaction data. The pipeline includes several data quality expectations such as transaction_amount > 0 and customer_id IS NOT NULL. These expectations are defined using the EXPECT clause in SQL.
The engineer aims to monitor the pipeline's data quality by analyzing the number of records that passed or failed each expectation during the latest pipeline update. The Lakeflow Declarative Pipelines event logs are stored in a Delta table named event_log_table.
For the most recent pipeline update, determine a programmatically appropriate approach to extract information like the name of each expectation, associated dataset, count of records that passed the expectation, and count of records that failed the expectation.
Which method retrieves the desired data quality metrics from the Lakeflow Declarative Pipelines event log?

A. Use the Lakeflow Declarative Pipelines UI to navigate to the specific pipeline, select the dataset, and view the Data Quality tab to manually retrieve the expectation metrics.

B. Query the event_log_table for events with event_type = 'data_quality' and directly select the passed_records and failed_records fields.

C. Access the event_log_table, filter for events where event_type = 'flow_progress', and parse the details.flow_progress.data_quality.expectations field to extract the required metrics.

D. Access the event_log_table, filter for events where event_type = 'expectation_result', and extract the expectation metrics from the details field.

Correct Answer: C
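For reference, the nested path named in option C (details, then flow_progress, data_quality, expectations) can be walked as in the sketch below. The sample payload is made up for illustration, and `extract_expectation_metrics` is a hypothetical helper. In a pipeline the same parsing would normally be done in SQL or PySpark against the `details` column of event_log_table rather than in plain Python.

```python
import json

# Hypothetical 'details' payload from a single flow_progress event.
# The counts and names are invented; only the nesting follows option C.
sample_details = json.dumps({
    "flow_progress": {
        "data_quality": {
            "expectations": [
                {"name": "positive_amount", "dataset": "transactions",
                 "passed_records": 980, "failed_records": 20},
                {"name": "customer_id_not_null", "dataset": "transactions",
                 "passed_records": 1000, "failed_records": 0},
            ]
        }
    }
})


def extract_expectation_metrics(details_json: str) -> list:
    """Return (dataset, expectation, passed, failed) tuples from one event."""
    details = json.loads(details_json)
    expectations = (
        details.get("flow_progress", {})
        .get("data_quality", {})
        .get("expectations", [])
    )
    return [
        (e["dataset"], e["name"], e["passed_records"], e["failed_records"])
        for e in expectations
    ]


metrics = extract_expectation_metrics(sample_details)
```

Events without a data_quality section yield an empty list, so the helper can be applied to every flow_progress row without extra filtering.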


Question 9

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

B. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

C. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

D. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

E. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

Correct Answer: D

Question 10

Which statement describes Delta Lake Auto Compaction?

A. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

B. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

C. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

D. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E. Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

Correct Answer: A


Question 11

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards runs in 10 minutes in total.
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

A. Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B. Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C. Configure a job that executes every time new data lands in a given directory.

D. Schedule a job to execute the pipeline once an hour on a new job cluster.

Correct Answer: D


Question 12

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

A. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query

B. The source query failed to update properly for three consecutive minutes and then restarted

C. The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query

D. The total average temperature across all sensors exceeded 120 on three consecutive executions of the query

E. The recent_sensor_recordings table was unresponsive for three consecutive runs of the query

Correct Answer: A


Question 13

What statement is true regarding the retention of job run history?

A. It is retained for 60 days, after which logs are archived

B. It is retained for 90 days or until the run-id is re-used through custom run configuration

C. It is retained for 60 days, during which you can export notebook run results to HTML

D. It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

E. It is retained until you export or delete job run logs

Correct Answer: C
