Google Cloud Certified Professional Data Engineer Exam

Professional Data Engineer
A Professional Data Engineer enables data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

The Professional Data Engineer exam assesses your ability to:
Design data processing systems
Build and operationalize data processing systems
Operationalize machine learning models
Ensure solution quality

About this certification exam
Length: 2 hours
Registration fee: $200 (plus tax where applicable)
Languages: English, Japanese.
Exam format: Multiple choice and multiple select, taken in person at a test center. Locate a test center near you.
Prerequisites: None
Recommended experience: 3+ years of industry experience including 1+ years designing and managing solutions using GCP.

Hands-on practice
This exam is designed to test technical skills related to the job role. Hands-on experience is the best preparation for the exam. If you feel you may need more experience or practice, use the hands-on labs available on Qwiklabs as well as the GCP free tier to level up your knowledge and skills.

GCP free tier
GCP always free products
GCP essentials quest
Data engineering quest

Practice exam
Check your readiness to take the exam.
Not feeling quite ready? Check out the additional resources listed below and get more hands-on practice with Qwiklabs.

Additional resources
In-depth discussions on the concepts and critical components of GCP:
Google Cloud documentation
Google Cloud solutions

Schedule your exam
Register and find a location near you.

1. Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
Mapping storage systems to business requirements
Data modeling
Tradeoffs involving latency, throughput, and transactions
Distributed systems
Schema design

1.2 Designing data pipelines. Considerations include:
Data publishing and visualization (e.g., BigQuery)
Batch and streaming data (e.g., Cloud Dataflow, Cloud Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Cloud Pub/Sub, Apache Kafka)
Online (interactive) vs. batch predictions
Job automation and orchestration (e.g., Cloud Composer)

1.3 Designing a data processing solution. Considerations include:
Choice of infrastructure
System availability and fault tolerance
Use of distributed systems
Capacity planning
Hybrid cloud and edge computing
Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
At-least-once, in-order, and exactly-once event processing

1.4 Migrating data warehousing and data processing. Considerations include:
Awareness of current state and how to migrate a design to a future state
Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
Validating a migration

2. Building and operationalizing data processing systems

2.1 Building and operationalizing storage systems. Considerations include:
Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Cloud Datastore, Cloud Memorystore)
Storage costs and performance
Lifecycle management of data

2.2 Building and operationalizing pipelines. Considerations include:
Data cleansing
Batch and streaming
Transformation
Data acquisition and import
Integrating with new data sources

2.3 Building and operationalizing processing infrastructure. Considerations include:
Provisioning resources
Monitoring pipelines
Adjusting pipelines
Testing and quality control

3. Operationalizing machine learning models

3.1 Leveraging pre-built ML models as a service. Considerations include:
ML APIs (e.g., Vision API, Speech API)
Customizing ML APIs (e.g., AutoML Vision, AutoML text)
Conversational experiences (e.g., Dialogflow)

3.2 Deploying an ML pipeline. Considerations include:
Ingesting appropriate data
Retraining of machine learning models (Cloud Machine Learning Engine, BigQuery ML, Kubeflow, Spark ML)
Continuous evaluation

3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
Distributed vs. single machine
Use of edge compute
Hardware accelerators (e.g., GPU, TPU)

3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
Impact of dependencies of machine learning models
Common sources of error (e.g., assumptions about data)

4. Ensuring solution quality

4.1 Designing for security and compliance. Considerations include:
Identity and access management (e.g., Cloud IAM)
Data security (encryption, key management)
Ensuring privacy (e.g., Data Loss Prevention API)
Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))

4.2 Ensuring scalability and efficiency. Considerations include:
Building and running test suites
Pipeline monitoring (e.g., Stackdriver)
Assessing, troubleshooting, and improving data representations and data processing infrastructure
Resizing and autoscaling resources

4.3 Ensuring reliability and fidelity. Considerations include:
Performing data preparation and quality control (e.g., Cloud Dataprep)
Verification and monitoring
Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Choosing between ACID, idempotent, and eventually consistent requirements

4.4 Ensuring flexibility and portability. Considerations include:
Mapping to current and future business requirements
Designing for data and application portability (e.g., multi-cloud, data residency requirements)
Data staging, cataloging, and discovery

QUESTION 1
Your company built a TensorFlow neural-network model with a large number of neurons and layers.
The model fits well for the training data. However, when tested against new data, it performs poorly.
What method can you employ to address this?

A. Threading
B. Serialization
C. Dropout Methods
D. Dimensionality Reduction

Correct Answer: C
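
A model that fits the training data well but performs poorly on new data is overfitting. Dropout reduces overfitting by randomly zeroing a fraction of activations on each training step, so the network cannot rely on any single neuron. The sketch below illustrates the "inverted dropout" variant in plain Python. It is not TensorFlow's implementation, and the function name and parameters are hypothetical.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout over a list of activations (illustrative only).

    Each activation is zeroed with probability `rate`; survivors are
    scaled by 1 / (1 - rate) so the expected sum stays the same.
    At inference time (training=False) the input passes through unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


if __name__ == "__main__":
    layer_output = [0.2, 1.5, 0.7, 0.9]
    print(dropout(layer_output, rate=0.5, rng=random.Random(0)))
    print(dropout(layer_output, rate=0.5, training=False))
```

In a real TensorFlow model the same idea is normally expressed by adding dropout layers between the dense layers rather than by hand-written code.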

QUESTION 2
You are building a model to make clothing recommendations. You know a user’s fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available.
How should you use this data to train the model?

A. Continuously retrain the model on just the new data.
B. Continuously retrain the model on a combination of existing data and the new data.
C. Train on the existing data while using the new data as your test set.
D. Train on the new data while using the existing data as your test set.

Correct Answer: B
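
Retraining on new data alone would discard what the model learned from earlier examples, so option B merges both sets. A minimal sketch of that merge step follows, assuming examples are held in in-memory lists. The function name and the `seed` parameter are hypothetical, and a production pipeline would read from storage instead.

```python
import random


def build_training_set(existing, new_batch, seed=None):
    """Combine existing and newly streamed examples, then shuffle.

    Shuffling mixes old and new examples so that each training
    mini-batch sees both, instead of all new data arriving last.
    """
    combined = list(existing) + list(new_batch)
    random.Random(seed).shuffle(combined)
    return combined


if __name__ == "__main__":
    existing = [("user1", "jeans"), ("user2", "jacket")]
    new_batch = [("user1", "sneakers")]
    print(build_training_set(existing, new_batch, seed=42))
```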

QUESTION 3
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources.
How should you adjust the database design?

A. Add capacity (memory and disk space) to the database server by the order of 200.
B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Correct Answer: C
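
To make option C concrete, the sketch below splits the single table into separate patients, clinics, and visits tables and builds a report with an ordinary join instead of a self-join. It uses Python's built-in sqlite3 module purely for illustration. The table names, column names, and sample rows are assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE clinics (
    clinic_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE visits (
    visit_id   INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
    clinic_id  INTEGER NOT NULL REFERENCES clinics(clinic_id),
    visit_date TEXT NOT NULL
);
-- Index the foreign key used by the report join.
CREATE INDEX idx_visits_patient ON visits(patient_id);
""")

conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO clinics VALUES (?, ?)",
                 [(1, "North"), (2, "South")])
conn.executemany(
    "INSERT INTO visits (patient_id, clinic_id, visit_date) VALUES (?, ?, ?)",
    [(1, 1, "2019-01-05"), (1, 2, "2019-02-10"), (2, 1, "2019-01-20")],
)


def visits_per_patient(connection):
    """Report the visit count per patient with a plain join, no self-join."""
    cursor = connection.execute("""
        SELECT p.name, COUNT(v.visit_id)
        FROM patients AS p
        LEFT JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.patient_id, p.name
        ORDER BY p.name
    """)
    return cursor.fetchall()


if __name__ == "__main__":
    print(visits_per_patient(conn))
```

Each patient is stored once, and each visit row carries only keys and visit data, so reports scan far less duplicated data than the single-table design.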

QUESTION 4
You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old.
What should you do?

A. Disable caching by editing the report settings.
B. Disable caching in BigQuery by editing table details.
C. Refresh your browser tab showing the visualizations.
D. Clear your browser history for the past hour then reload the tab showing the visualizations.

Correct Answer: A

QUESTION 5
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

A. Use federated data sources, and check data in the SQL query.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Correct Answer: D
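
The dead-letter pattern in option D routes rows that fail parsing to a separate destination instead of failing the whole load. The sketch below shows only that routing logic in plain Python with the standard csv module. It is not a Cloud Dataflow pipeline, and the three-column schema (id, name, amount) and all names are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = 3  # hypothetical schema: id, name, amount


def split_rows(csv_text):
    """Parse CSV text into (good_rows, dead_letter_rows).

    Rows with the wrong column count or unparseable values are kept,
    with the error message, for later analysis rather than dropped.
    """
    good, dead_letter = [], []
    reader = csv.reader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=1):
        try:
            if len(row) != EXPECTED_COLUMNS:
                raise ValueError(
                    f"expected {EXPECTED_COLUMNS} columns, got {len(row)}")
            good.append({"id": int(row[0]),
                         "name": row[1],
                         "amount": float(row[2])})
        except ValueError as err:
            dead_letter.append({"line": line_no,
                                "raw": row,
                                "error": str(err)})
    return good, dead_letter


if __name__ == "__main__":
    sample = "1,alice,9.5\nx,bob,2\n3,carol\n"
    good_rows, bad_rows = split_rows(sample)
    print(good_rows)
    print(bad_rows)
```

In a Dataflow pipeline, the two lists would correspond to two outputs of the parsing step, one written to the main BigQuery table and one to the dead-letter table.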

 
