About This Flashcard Deck
This deck contains 10 flashcards covering key concepts for the Google Cloud Professional Data Engineer (PDE) exam. Test your GCP data engineering knowledge using active recall: attempt to answer each question before revealing the answer.
All Data Engineer Flashcards
Q: What is Dataflow?
A: Managed service for Apache Beam pipelines. Supports both batch and streaming. Serverless, autoscaling.
Q: When to use Dataproc vs Dataflow?
A: Dataproc: existing Spark/Hadoop workloads, or jobs that need custom libraries from that ecosystem. Dataflow: new pipelines, unified batch/stream model, serverless.
Q: What is BigQuery ML?
A: Train and run ML models directly in BigQuery using SQL. Supports linear regression, classification, clustering, time-series, etc.
Q: What is Cloud Composer?
A: Managed Apache Airflow for orchestrating data pipelines. DAGs define workflow dependencies.
Q: What is Bigtable best for?
A: High-throughput, low-latency workloads: time-series data, IoT telemetry, operational analytics, and single-key lookups at petabyte scale. Not suited to ad hoc SQL analytics; use BigQuery for that.
Q: What is Cloud Data Fusion?
A: Fully managed visual ETL/ELT tool built on CDAP. Code-free data integration pipelines.
Q: What is a BigQuery partition?
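A: Divides a table into segments by a time-unit column (e.g., date), an integer range, or ingestion time. Queries that filter on the partitioning column scan only the relevant partitions, which makes them faster and cheaper.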
Q: What is Pub/Sub exactly-once delivery?
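A: Pub/Sub provides at-least-once delivery by default. Pull subscriptions can enable exactly-once delivery, which is guaranteed within a cloud region. For exactly-once processing in a pipeline, use Dataflow with deduplication.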
Q: What is Data Catalog?
A: Metadata management and discovery service. Automatically catalogs BigQuery and other assets. Supports tags and search.
Q: What is a BigQuery materialized view?
A: Precomputed query results that BigQuery refreshes automatically. Speeds up repetitive queries; BigQuery can transparently rewrite queries against the base table to use the view.
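The Cloud Composer card says that DAGs define workflow dependencies. As a rough illustration of what that means, the sketch below uses Python's standard-library graphlib (Python 3.9+) to derive a valid run order from a dependency map. This is not the Airflow API; the task names and the execution_order helper are hypothetical and exist only for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because "validate" and "transform" do not depend on each other, more than one order is valid. An orchestrator is free to run such independent tasks in parallel.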
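The BigQuery partition card says that only relevant partitions are scanned. The toy model below is a sketch of that idea in plain Python, not of how BigQuery stores data: rows are bucketed by day, and a query restricted to a date range skips every other bucket. The PartitionedTable class and its methods are hypothetical names chosen for this example.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy in-memory table that keeps rows in one bucket per day."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def total_rows(self):
        return sum(len(rows) for rows in self.partitions.values())

    def query(self, start, end, predicate):
        """Return (matching_rows, rows_scanned) for days in [start, end]."""
        matches, scanned = [], 0
        for day, rows in self.partitions.items():
            if not (start <= day <= end):
                continue  # pruned: this partition is never read
            for row in rows:
                scanned += 1
                if predicate(row):
                    matches.append(row)
        return matches, scanned


table = PartitionedTable()
for d in (1, 2, 3):
    table.insert(date(2024, 1, d), {"user": "a", "amount": 10 * d})
    table.insert(date(2024, 1, d), {"user": "b", "amount": 5 * d})

matches, scanned = table.query(
    date(2024, 1, 2), date(2024, 1, 2), lambda row: row["user"] == "a"
)
print(f"scanned {scanned} of {table.total_rows()} rows, matched {len(matches)}")
```

Here the single-day query reads 2 of the 6 stored rows. Without the date filter, all 6 would be read, which is the cost difference the card describes.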
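The Pub/Sub card mentions deduplication as the route from at-least-once delivery to exactly-once processing. A minimal sketch of the idea, assuming every message carries a stable ID: remember the IDs already handled and drop redeliveries. The deduplicate function and the sample messages are hypothetical. This is not how Dataflow implements it, and a real system would bound the set of remembered IDs (for example, by time window) rather than let it grow without limit.

```python
def deduplicate(messages):
    """Yield each (msg_id, payload) once; later redeliveries of an ID are dropped."""
    seen = set()
    for msg_id, payload in messages:
        if msg_id in seen:
            continue  # redelivery of a message that was already processed
        seen.add(msg_id)
        yield msg_id, payload


# At-least-once delivery: "m1" and "m2" each arrive twice.
deliveries = [("m1", "a"), ("m2", "b"), ("m1", "a"), ("m3", "c"), ("m2", "b")]
unique = list(deduplicate(deliveries))
print(unique)
```

The downstream consumer sees each ID once, in first-arrival order, even though the source delivered some messages more than once.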
GCP Flashcard Study Approach
Google Cloud exams emphasise service selection and architecture decisions. Use these flashcards to build instant recall of GCP service capabilities, then apply that knowledge to scenario-based practice questions. Pay special attention to cards about managed vs. unmanaged services and serverless options: the exam scenarios strongly favour managed and serverless architectures.