Practice Data Processing Questions Now
Start a timed practice session focusing on Designing Data Processing Systems topics from the PDE question bank.
Start PDE Practice Quiz →
PDE Data Processing Question Bank (22 Questions)
Browse all 22 practice questions covering Designing Data Processing Systems for the PDE certification exam. Each question includes the full answer and a detailed explanation to help you understand the concepts.
- Question 1: Ingesting and Processing Data
How do you handle late-arriving data in a Pub/Sub-to-BigQuery streaming pipeline?
Correct Answer: B. Explanation: Late data handling in Dataflow: 1) Windowing (fixed, sliding, session). 2) Watermarks: estimate event-time completeness. 3) Allowed lateness: accept data after the watermark passes (triggers pane refinement). 4) Dead-letter: data arriving after allowed lateness → separate topic/table for investigation. 5) Accumulation mode: ACCUMULATING (include all panes) or DISCARDING (only new data).
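For illustration, a minimal Apache Beam (Python) sketch of these settings; the project, topic, table, and the specific window/lateness/trigger values are assumptions, not part of the question, and event timestamps here default to Pub/Sub publish time.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Window" >> beam.WindowInto(
            window.FixedWindows(5 * 60),               # 5-minute fixed windows
            allowed_lateness=10 * 60,                  # accept elements up to 10 min late
            trigger=trigger.AfterWatermark(
                late=trigger.AfterProcessingTime(60)), # re-fire panes when late data arrives
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "Count" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
        | "ToRow" >> beam.Map(lambda n: {"event_count": n})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.event_counts",
            schema="event_count:INTEGER")
    )
```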
- Question 2: Designing Data Processing Systems
When should you choose batch processing over stream processing for a data pipeline?
Correct Answer: B. Explanation: Batch: bounded datasets, complete data available, global aggregations (all records needed), hourly/daily freshness OK, simpler to reason about. Streaming: unbounded data, real-time insights needed, event-driven actions. Dataflow unifies both: the same Apache Beam pipeline code handles batch (bounded PCollection) and streaming (unbounded PCollection). Choose based on latency requirements.
- Question 3: Designing Data Processing Systems
You need to process streaming data from IoT devices with exactly-once semantics and load it into BigQuery. What pipeline architecture should you use?
Correct Answer: B. Explanation: Streaming pipeline: Pub/Sub (durable ingestion) → Dataflow (streaming Apache Beam with windowing, watermarks, and exactly-once via checkpointing) → BigQuery (Storage Write API for exactly-once). Dataflow auto-scales workers, handles late data with allowed lateness, and provides built-in monitoring. Google-provided templates are available for this pattern.
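A hedged sketch of that architecture in the Beam Python SDK; the project, subscription, table, and schema are placeholders, and the Storage Write API method assumes a recent Beam release.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True, project="my-project", region="us-central1",
    runner="DataflowRunner", temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadIoT" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/iot-readings")
        | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:iot.readings",
            schema="device_id:STRING,temperature:FLOAT,event_time:TIMESTAMP",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,  # exactly-once sink
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```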
- Question 4: Designing Data Processing Systems
Your team needs near-real-time analytics with less than 1-minute latency from Pub/Sub to BigQuery. What approach is most cost-effective?
Correct Answer: B. Explanation: BigQuery subscription: Pub/Sub writes directly to a BigQuery table; no Dataflow, no code, no infrastructure management. Latency: seconds. Cost: Pub/Sub throughput pricing only (no Dataflow workers). Limitations: limited transformations (only field mapping). Use Dataflow when: complex transforms, windowed aggregations, enrichment needed. BigQuery subscription for: simple ingest with minimal transformation.
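A sketch of creating such a BigQuery subscription with the Pub/Sub Python client; the project, topic, subscription, and table names are placeholders, and the target table with a matching schema is assumed to exist already.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

# Messages published to the topic land directly in the BigQuery table,
# with no Dataflow job in the path.
subscription = subscriber.create_subscription(
    request={
        "name": "projects/my-project/subscriptions/orders-to-bq",
        "topic": "projects/my-project/topics/orders",
        "bigquery_config": {
            "table": "my-project.analytics.orders_raw",  # existing BigQuery table
            "use_topic_schema": True,   # map topic schema fields to table columns
            "write_metadata": True,     # also write publish_time, message_id, attributes
        },
    }
)
print(f"Created: {subscription.name}")
```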
- Question 5: Maintaining and Automating Data Workloads
How do you monitor a streaming Dataflow pipeline for performance issues?
Correct Answer: B. Explanation: Dataflow monitoring: System lag (how long the oldest unprocessed element has been waiting), Data freshness (time since the most recent element was processed), Watermark age (event-time watermark vs. processing time), Worker utilization (CPU, memory, threads), Throughput (elements/sec, bytes/sec). Alert when system lag exceeds ~5 minutes or the watermark goes stale. Autoscaling adjusts workers based on backlog.
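As a rough sketch, the system-lag metric can be read from Cloud Monitoring and alerted on; the project and job name are placeholders, and the metric and label names should be verified against the current Dataflow monitoring documentation.

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now)},
     "start_time": {"seconds": int(now - 600)}})  # last 10 minutes

# Assumed metric/label names; confirm against Dataflow's published metrics.
results = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": (
            'metric.type="dataflow.googleapis.com/job/system_lag" '
            'AND resource.labels.job_name="my-streaming-job"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value)  # lag reported in seconds
```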
- Question 6: Designing Data Processing Systems
How does Dataflow achieve exactly-once processing in streaming pipelines?
Correct Answer: B. Explanation: Dataflow exactly-once: 1) Checkpointing: persists pipeline state for recovery. 2) Record ID deduplication: assigns unique IDs and detects duplicates on retry. 3) Bundle-level transactions: commits output atomically per bundle. 4) Sink-specific: BigQuery Storage Write API (exactly-once mode), Cloud Storage (atomic file writes). The combination provides end-to-end exactly-once semantics.
- Question 7: Maintaining and Automating Data Workloads
How do you backfill historical data through a Dataflow streaming pipeline?
Correct Answer: B. Explanation: Beam's unified model: the same pipeline code handles batch (PCollection from GCS) and streaming (PCollection from Pub/Sub). Backfill: 1) Export historical data to GCS. 2) Run the pipeline in batch mode with GCS input. 3) Output to the same BigQuery tables. Alternatively, use Pub/Sub seek (replay messages from a timestamp) for recent data. Cloud Composer can orchestrate backfill DAGs with date parameters.
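A sketch of the unified-pipeline backfill approach: the same transform function is fed either a bounded GCS read (backfill) or an unbounded Pub/Sub read (live). Bucket, subscription, and table names are placeholders.

```python
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def transform(events):
    """Shared business logic used by both the backfill and the streaming run."""
    return (
        events
        | "Parse" >> beam.Map(json.loads)
        | "KeepValid" >> beam.Filter(lambda e: "device_id" in e)
        | "ToRow" >> beam.Map(lambda e: {"device_id": e["device_id"],
                                         "payload": json.dumps(e)})
    )


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["backfill", "streaming"], default="backfill")
    args, beam_args = parser.parse_known_args(argv)

    opts = PipelineOptions(beam_args, streaming=(args.mode == "streaming"))
    with beam.Pipeline(options=opts) as p:
        if args.mode == "backfill":
            # Bounded input: historical exports previously staged in GCS.
            raw = p | "ReadGCS" >> beam.io.ReadFromText("gs://my-bucket/history/*.json")
        else:
            # Unbounded input: live messages from Pub/Sub.
            raw = (p | "ReadPubSub" >> beam.io.ReadFromPubSub(
                       subscription="projects/my-project/subscriptions/events")
                     | "Decode" >> beam.Map(lambda b: b.decode("utf-8")))
        (transform(raw)
         | "Write" >> beam.io.WriteToBigQuery(
             "my-project:analytics.events",
             schema="device_id:STRING,payload:STRING",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == "__main__":
    run()
```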
- Question 8: Maintaining and Automating Data Workloads
How should you test a Dataflow pipeline before deploying to production?
Correct Answer: B. Explanation: Dataflow testing: 1) Unit test: DirectRunner (local, no GCP resources); test individual transforms with test data. 2) Integration test: run the full pipeline on a small test dataset in a staging project. 3) Output validation: assert schema, row count, data quality. 4) Performance test: run with production-scale data and check resource usage. 5) Canary: run the new pipeline version on a subset of production data before full rollout.
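A minimal DirectRunner unit test using Beam's testing utilities; the transform under test is a hypothetical example, not from the question.

```python
import unittest

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


def to_celsius(reading):
    """Hypothetical transform under test: add a Celsius field."""
    return {**reading, "temp_c": round((reading["temp_f"] - 32) * 5 / 9, 1)}


class ToCelsiusTest(unittest.TestCase):
    def test_to_celsius(self):
        with TestPipeline() as p:  # runs locally on the DirectRunner
            output = (
                p
                | beam.Create([{"device": "a", "temp_f": 212.0}])
                | beam.Map(to_celsius)
            )
            assert_that(
                output,
                equal_to([{"device": "a", "temp_f": 212.0, "temp_c": 100.0}]))


if __name__ == "__main__":
    unittest.main()
```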
- Question 9: Ingesting and Processing Data
How do you reuse Dataflow pipelines across teams without sharing source code?
Correct Answer: B. Explanation: Dataflow templates: Classic (limited parameters, older) or Flex (Docker-based, any parameters, recommended). Create: package the pipeline as a template and store it in Cloud Storage/Artifact Registry. Execute: gcloud dataflow flex-template run --template-file-gcs-location=gs://... --parameters=input=...,output=... Google provides 30+ templates for common patterns (GCS→BigQuery, Pub/Sub→BigQuery, JDBC→BigQuery).
- Question 10: Ingesting and Processing Data
How do you design Pub/Sub topics and subscriptions for a multi-consumer data pipeline?
Correct Answer: B. Explanation: Pub/Sub design: Topic = message category (orders, events, logs). Multiple subscriptions per topic: each gets a copy of every message (fan-out). Subscription types: Pull (consumer polls), Push (Pub/Sub sends to an HTTPS endpoint/Cloud Run). Ordering: ordering key (per-key FIFO). Dead-letter: forward unprocessable messages. Filtering: subscription-level message filtering (reduces consumer load).
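A sketch of that fan-out design with the Pub/Sub Python client; the resource names, filter expression, and retry limit are illustrative assumptions, and the dead-letter topic is assumed to exist with the required permissions.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
project = "projects/my-project"

# One topic per message category.
topic = publisher.create_topic(request={"name": f"{project}/topics/orders"})

# Consumer 1: analytics receives every message (fan-out).
subscriber.create_subscription(request={
    "name": f"{project}/subscriptions/orders-analytics",
    "topic": topic.name,
})

# Consumer 2: fulfillment only sees high-priority messages and routes
# repeatedly failing ones to a dead-letter topic for investigation.
subscriber.create_subscription(request={
    "name": f"{project}/subscriptions/orders-fulfillment",
    "topic": topic.name,
    "filter": 'attributes.priority = "high"',
    "dead_letter_policy": {
        "dead_letter_topic": f"{project}/topics/orders-dead-letter",
        "max_delivery_attempts": 5,
    },
})
```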
- Question 11: Ingesting and Processing Data
When should you use Dataflow SQL instead of writing Java/Python Beam pipelines?
Correct Answer: B. Explanation: Dataflow SQL: write SQL to define streaming/batch pipelines. Sources: Pub/Sub topics, BigQuery tables. SQL: SELECT, WHERE, GROUP BY, JOIN, windowing functions (TUMBLE, HOP, SESSION). Output: BigQuery tables. Use when: SQL-familiar users, simple transformations. Beam SDK when: complex transforms, custom logic, multiple I/Os, side inputs. The Cloud Console UI supports visual SQL pipeline building.
- Question 12: Ingesting and Processing Data
What are the differences between BigQuery's streaming insert (legacy) and Storage Write API?
Correct Answer: B. Explanation: Storage Write API: committed mode (exactly-once, transaction support), default mode (at-least-once, simpler), pending mode (batch commit). Throughput: higher than legacy. Cost: lower per byte. Legacy streaming: insertAll API, at-least-once, per-row pricing, 1MB row limit. Migration: use the Storage Write API for new pipelines. Dataflow: automatically uses the Storage Write API with BigQueryIO.
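For contrast, a minimal legacy streaming insert via the BigQuery Python client; the table and row are placeholders. New pipelines would normally go through the Storage Write API instead, for example via Dataflow's BigQueryIO as in the earlier sketch.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Legacy insertAll path: at-least-once delivery, per-row pricing.
errors = client.insert_rows_json(
    "my-project.analytics.events",
    [{"device_id": "sensor-1", "temperature": 21.5}],
)
if errors:
    print("Rows failed:", errors)  # per-row errors; the caller must handle retries
```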
- Question 13: Ingesting and Processing Data
What is the most efficient way to load large volumes of data into BigQuery?
Correct Answer: B. Explanation: BigQuery loading: 1) Batch load jobs (bq load): free, from GCS (Parquet, Avro, CSV, JSON, ORC), up to 15TB per load job. 2) Storage Write API: streaming with exactly-once and higher throughput than legacy streaming inserts. 3) DML (INSERT/MERGE): for small updates. 4) BigQuery Data Transfer Service: scheduled loads from GCS, S3, and SaaS apps. For bulk data, always batch load from GCS for cost.
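A sketch of a batch load job from GCS with the BigQuery Python client; bucket, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Batch load from GCS; the load job itself is free (compute comes from a shared pool).
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/orders/*.parquet",
    "my-project.analytics.orders",
    job_config=job_config,
)
load_job.result()  # wait for completion

table = client.get_table("my-project.analytics.orders")
print(f"Table now has {table.num_rows} rows")
```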
- Question 14: Designing Data Processing Systems
You need to calculate the average temperature per sensor every 5 minutes from streaming data. What Dataflow windowing strategy should you use?
Correct Answer: B. Explanation: Windowing types: Fixed (tumbling): non-overlapping intervals (every 5 min; good for periodic aggregation). Sliding: overlapping (a 5-min window every 1 min; good for moving averages). Session: gap-based (closes after inactivity; good for user sessions). Fixed 5-min windows: each sensor's readings are grouped into 5-minute buckets and the average is computed per window. Simplest for periodic reporting.
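A sketch of the fixed-window per-sensor average in the Beam Python SDK; the topic, table, and message format are assumptions.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-temps")
        | "Parse" >> beam.Map(json.loads)
        | "KV" >> beam.Map(lambda r: (r["sensor_id"], float(r["temp_c"])))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(5 * 60))  # tumbling 5 min
        | "MeanPerSensor" >> beam.combiners.Mean.PerKey()
        | "ToRow" >> beam.Map(lambda kv: {"sensor_id": kv[0], "avg_temp_c": kv[1]})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:iot.temp_5min_avg",
            schema="sensor_id:STRING,avg_temp_c:FLOAT")
    )
```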
- Question 15: Ingesting and Processing Data
When should you use Cloud Data Fusion instead of writing custom Dataflow pipelines?
Correct Answer: B. Explanation: Cloud Data Fusion: visual ETL tool (based on CDAP). Use when: non-developer users, standard transformations (join, filter, aggregate), 200+ pre-built connectors (SAP, Salesforce, databases). vs Dataflow: custom logic, streaming, code-based. Under the hood, Data Fusion executes pipelines on ephemeral Dataproc clusters (Spark). Editions: Basic (batch), Enterprise (streaming, replication, lineage).
- Question 16: Maintaining and Automating Data Workloads
How should you handle errors and failed records in a Dataflow pipeline?
Correct Answer: B. Explanation: Error handling: dead-letter pattern. In Dataflow: try/catch in a DoFn, outputting failed records to a side output. Route them to a dead-letter Pub/Sub topic (for reprocessing) or a BigQuery error table (for investigation). Log: error details, original record, timestamp. Monitor: alert when the dead-letter count exceeds a threshold. Retry transient errors with exponential backoff; send permanent errors to the dead-letter destination.
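A sketch of the dead-letter pattern with a Beam side output; the parsing logic and destination tables are illustrative placeholders.

```python
import json

import apache_beam as beam
from apache_beam import pvalue


class ParseOrDeadLetter(beam.DoFn):
    """Route records that fail parsing to a 'dead_letter' side output."""

    def process(self, raw):
        try:
            yield json.loads(raw)
        except Exception as err:
            yield pvalue.TaggedOutput(
                "dead_letter",
                {"raw": raw.decode("utf-8", errors="replace"), "error": str(err)})


# Inside a pipeline (topic/table names are placeholders):
#   results = messages | beam.ParDo(ParseOrDeadLetter()).with_outputs(
#       "dead_letter", main="parsed")
#   results.parsed      | beam.io.WriteToBigQuery("my-project:ds.events", ...)
#   results.dead_letter | beam.io.WriteToBigQuery(
#       "my-project:ds.events_errors", schema="raw:STRING,error:STRING")
```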
- Question 17: Ingesting and Processing Data
How do you handle schema evolution in a streaming pipeline when the source schema changes?
Correct Answer: B. Explanation: Schema evolution: Avro/Protobuf: backward/forward compatible schemas (add fields with defaults, don't remove required fields). BigQuery: schema auto-update (ALLOW_FIELD_ADDITION in the load config). Dataflow: dynamic destinations (route records to different tables by schema version). Dead-letter: records that don't match the expected schema. Registry: a schema registry for Kafka topics (enforce compatibility).
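A sketch of allowing field additions on a BigQuery load job with the Python client; file locations and table names are placeholders, and Avro is assumed so the schema travels with the data.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        # Let the load job add newly appearing (nullable) columns to the table.
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    ],
)

client.load_table_from_uri(
    "gs://my-bucket/events/v2/*.avro",
    "my-project.analytics.events",
    job_config=job_config,
).result()
```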
- Question 18: Designing Data Processing Systems
How do you estimate costs for a BigQuery + Dataflow data platform?
Correct Answer: B. Explanation: Cost estimation: BigQuery: storage ($0.02/GB/month active), on-demand queries ($6.25/TB scanned), editions (slot-hours). Dataflow: worker hours (vCPU, memory, disk), Streaming Engine (data processed). Pub/Sub: message throughput ($40/TiB). GCS: storage class + operations. Use the Pricing Calculator before deployment and billing export + INFORMATION_SCHEMA afterward. Optimize with reserved slots, compression, and partitioning.
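A back-of-the-envelope worked example using the list prices quoted above; the volumes and the Dataflow vCPU rate are illustrative assumptions, not current pricing.

```python
# Assumed monthly volumes (placeholders for a hypothetical platform).
bq_storage_gb   = 5_000          # active logical storage
bq_scanned_tb   = 40             # on-demand bytes scanned
pubsub_tib      = 2              # Pub/Sub throughput
dataflow_vcpu_h = 3 * 24 * 30    # 3 streaming workers running all month

monthly_cost = (
    bq_storage_gb * 0.02          # $0.02 per GB-month active storage
    + bq_scanned_tb * 6.25        # $6.25 per TB scanned (on-demand)
    + pubsub_tib * 40             # $40 per TiB of Pub/Sub throughput
    + dataflow_vcpu_h * 0.07      # assumed ~$0.07 per streaming vCPU-hour
)
print(f"~${monthly_cost:,.0f}/month before memory/disk, GCS, and discounts")
```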
- Question 19: Maintaining and Automating Data Workloads
A Dataflow streaming job's system lag is increasing over time. How do you troubleshoot?
Correct Answer: B. Explanation: Dataflow lag troubleshooting: 1) Workers: check CPU utilization (>80% = add workers, increase max-num-workers). 2) Hot keys: one key getting disproportionate data (use Combine.PerKey with hot-key fanout, or add a salt to keys). 3) External calls: slow DB/API in a DoFn (add caching, batch calls, go async). 4) Data skew: GroupByKey with skewed keys (use Combine when possible). 5) Memory: OOM causes restarts (increase worker memory). The Dataflow diagnostics tab shows step-level latency.
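A minimal sketch of hot-key fanout in the Beam Python SDK; the input data and the fanout value of 16 are illustrative.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # Skewed (key, value) pairs: one key receives most of the data.
        | beam.Create([("hot", 1)] * 1000 + [("cold", 1)] * 10)
        # Fan out the hot key across intermediate combiners before the final merge.
        | "SumWithFanout" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
        | beam.Map(print)
    )
```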
- Question 20: Designing Data Processing Systems
When should you use Dataproc instead of Dataflow?
Correct Answer: B. Explanation: Dataproc is ideal for migrating existing Hadoop/Spark jobs, leveraging Spark ML/GraphX libraries, or when teams have strong Spark expertise. Dataflow is better for new pipelines that need a unified batch/stream model.
- Question 21: Maintaining and Automating Data Workloads
How should you handle late-arriving data in streaming pipelines?
Correct Answer: B. Explanation: Dataflow uses watermarks to estimate completeness, allowed lateness to accept late data, and triggers with an accumulation mode to update results when late elements arrive within the lateness window.
- Question 22: Designing Data Processing Systems
A data lake needs to support both batch and streaming data processing with unified code. Which Google Cloud service provides this?
Correct Answer: B. Explanation: Dataflow, based on Apache Beam, provides a unified programming model for both batch and streaming processing, allowing the same pipeline code to run in either mode.
Key Data Processing Concepts for PDE
PDE Data Processing Exam Tips
Designing Data Processing Systems questions in PDE are typically scenario-based. Focus on service-level decision making aligned to the official exam objectives. Priority concepts: BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, and pipeline design.
What PDE Expects
- Anchor your answer in selecting the most practical, secure, and scalable option for the stated scenario.
- Data Processing scenarios for PDE are frequently mapped to Domain 1 (~23%), so read the objective carefully before picking controls or architecture.
- Expect multi-service scenarios where Data Processing interacts with IAM, networking, storage, or observability patterns rather than appearing as an isolated service question.
- When two options are both technically valid, prefer the choice that best aligns with the exam's operational scope (Professional) and managed-service best practices.
High-Value Data Processing Concepts
- Know the core Data Processing building blocks cold: BigQuery, Dataflow, Dataproc, Pub/Sub.
- Review the edge-case features and limits of Cloud Composer and pipeline orchestration; these details are commonly used to differentiate answer choices.
- Practice service-integration reasoning: how data processing services pair with the Ingesting & Processing and Storing & Managing domains in real deployment patterns.
- For PDE, explain why the chosen Data Processing design meets reliability, security, and cost expectations better than the alternatives.
Common PDE Traps
- Watch for answers that partially solve the requirement but miss operational constraints.
- Questions in Designing Data Processing Systems often include distractors that look correct for Data Processing but violate least-privilege, durability, or availability requirements.
- Avoid picking options purely by feature name; validate data path, failure handling, and governance impact before answering.
- If the prompt hints at automation or repeatability, eliminate manual-only operational answers first.
Fast Review Checklist
- Can you compare at least two Data Processing implementation paths and justify which one best fits the scenario?
- Can you map the chosen answer back to Designing Data Processing Systems (~23%) outcomes for PDE?
- Can you explain security and access boundaries for Data Processing without relying on default-open assumptions?
- Can you describe how Data Processing integrates with Ingesting & Processing and Storing & Managing during failure, scaling, and monitoring events?