🔄 Designing Data Processing Systems - PDE Practice Questions

Design data pipelines using BigQuery, Dataflow, Dataproc, Pub/Sub, and Cloud Composer for batch and streaming.

22 Questions Available
1 Exam Domain

Practice Data Processing Questions Now

Start a timed practice session focusing on Designing Data Processing Systems topics from the PDE question bank.

Start PDE Practice Quiz →

PDE Data Processing Question Bank (22 Questions)

Browse all 22 practice questions covering Designing Data Processing Systems for the PDE certification exam. Each question includes the full answer and a detailed explanation to help you understand the concepts.

  1. Question 1: Ingesting and Processing Data

    How do you handle late-arriving data in a Pub/Sub-to-BigQuery streaming pipeline?

    A. Discard late data
    B. Configure Dataflow windowing with allowed lateness — late data triggers recomputation of window results, and a dead-letter topic captures data that arrives after the allowed lateness period
    C. Buffer all data indefinitely
    D. Process everything in batch
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Late data handling in Dataflow: 1) Windowing (fixed, sliding, session). 2) Watermarks: estimate event-time completeness. 3) Allowed lateness: accept data after watermark passes (triggers pane refinement). 4) Dead-letter: data arriving after allowed lateness → separate topic/table for investigation. 5) Accumulation mode: ACCUMULATING (include all panes) or DISCARDING (only new data).
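
    A minimal Apache Beam (Python) sketch of this pattern, assuming an upstream PCollection named events; the window size, lateness, and trigger values are illustrative only:

        import apache_beam as beam
        from apache_beam import window
        from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

        def window_with_allowed_lateness(events):
            # Fixed 1-minute windows; accept elements up to 10 minutes late and
            # re-fire (refine) the window result for each late element that arrives.
            return events | "Window" >> beam.WindowInto(
                window.FixedWindows(60),
                trigger=AfterWatermark(late=AfterCount(1)),
                allowed_lateness=600,
                accumulation_mode=AccumulationMode.ACCUMULATING)

    Data arriving after the allowed lateness is dropped from the window; to preserve it, route such records to a dead-letter topic or table (see the side-output pattern under Question 16).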

  2. Question 2: Designing Data Processing Systems

    When should you choose batch processing over stream processing for a data pipeline?

    A. Always use streaming
    B. When data freshness requirements are hours/daily (not real-time), datasets are bounded and complete, and processing logic requires global aggregations across the full dataset
    C. Never use batch
    D. Batch is always cheaper
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Batch: bounded datasets, complete data available, global aggregations (all records needed), hourly/daily freshness OK, simpler to reason about. Streaming: unbounded data, real-time insights needed, event-driven actions. Dataflow unifies both: same Apache Beam pipeline code handles batch (bounded PCollection) and streaming (unbounded PCollection). Choose based on latency requirements.

  3. Question 3: Designing Data Processing Systems

    You need to process streaming data from IoT devices with exactly-once semantics and load it into BigQuery. What pipeline architecture should you use?

    A. Cloud Functions reading from Pub/Sub
    B. Pub/Sub → Dataflow (Apache Beam streaming pipeline) → BigQuery — Dataflow provides exactly-once processing with windowing and watermarks
    C. Pub/Sub → Cloud SQL
    D. Direct insert into BigQuery from devices
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Streaming pipeline: Pub/Sub (durable ingestion) → Dataflow (streaming Apache Beam — windowing, watermarks, exactly-once via checkpointing) → BigQuery (Storage Write API for exactly-once). Dataflow auto-scales workers, handles late data with allowed lateness, and provides built-in monitoring. Google-provided templates available for this pattern.
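
    A minimal sketch of this architecture in the Apache Beam Python SDK; the project, topic, and table names are placeholders, and the parse step assumes JSON payloads:

        import json
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        options = PipelineOptions(streaming=True)  # run as a streaming job
        with beam.Pipeline(options=options) as p:
            (p
             | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                   topic="projects/my-project/topics/iot-events")
             | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
             | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                   "my-project:iot.readings",
                   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))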

  4. Question 4: Designing Data Processing Systems

    Your team needs near-real-time analytics with less than 1-minute latency from Pub/Sub to BigQuery. What approach is most cost-effective?

    A. Dataflow streaming with 1-second windows
    B. Pub/Sub subscription writing directly to BigQuery (BigQuery Subscription) — a zero-code, auto-managed pipeline with sub-minute latency at lower cost than running Dataflow
    C. Batch load every minute
    D. Cloud Functions processing each message
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    BigQuery Subscription: Pub/Sub writes directly to BigQuery table — no Dataflow, no code, no infrastructure management. Latency: seconds. Cost: Pub/Sub throughput pricing only (no Dataflow workers). Limitations: limited transformations (only field mapping). Use Dataflow when: complex transforms, windowed aggregations, enrichment needed. BigQuery Subscription for: simple ingest with minimal transformation.
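
    A sketch of creating a BigQuery subscription with the Pub/Sub Python client, assuming a google-cloud-pubsub version that exposes the bigquery_config field; all resource names are placeholders:

        from google.cloud import pubsub_v1

        subscriber = pubsub_v1.SubscriberClient()
        subscriber.create_subscription(
            request={
                "name": "projects/my-project/subscriptions/orders-to-bq",
                "topic": "projects/my-project/topics/orders",
                "bigquery_config": {
                    "table": "my-project.analytics.orders_raw",  # destination table
                    "use_topic_schema": True,  # map topic schema fields to columns
                },
            })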

  5. Question 5: Maintaining and Automating Data Workloads

    How do you monitor a streaming Dataflow pipeline for performance issues?

    A. Check logs once a day
    B. Monitor system lag, data freshness, watermark age, worker CPU/memory, and throughput in Cloud Monitoring — set alerts on watermark age exceeding thresholds
    C. Only check if the job is running
    D. Use BigQuery to check output
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow monitoring: System lag (how long the oldest unprocessed element has been waiting), Data freshness (how far the output watermark lags behind real time), Watermark age (event-time watermark vs processing time), Worker utilization (CPU, memory, threads), Throughput (elements/sec, bytes/sec). Alert: system lag > 5 min, watermark stale. Autoscaling: adjust workers based on backlog.
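
    A hedged sketch of alerting on system lag with the Cloud Monitoring Python client (google-cloud-monitoring); the project ID and thresholds are placeholders, and the filter should be validated against the dataflow.googleapis.com/job/system_lag metric before use:

        from google.cloud import monitoring_v3

        client = monitoring_v3.AlertPolicyServiceClient()
        policy = monitoring_v3.AlertPolicy(
            display_name="Dataflow system lag > 5 min",
            combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
            conditions=[monitoring_v3.AlertPolicy.Condition(
                display_name="system_lag above 300s",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter='metric.type = "dataflow.googleapis.com/job/system_lag" '
                           'AND resource.type = "dataflow_job"',
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=300,        # 5 minutes of lag
                    duration={"seconds": 300},  # sustained for 5 minutes
                ))])
        client.create_alert_policy(
            name="projects/my-project", alert_policy=policy)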

  6. Question 6: Designing Data Processing Systems

    How does Dataflow achieve exactly-once processing in streaming pipelines?

    A. It processes each record only once naturally
    B. Through checkpointing, record deduplication based on message IDs, and transactional commits to sinks — ensuring each record is processed and written exactly once even on retries
    C. It uses Pub/Sub's exactly-once guarantee
    D. Through idempotent writes only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow exactly-once: 1) Checkpointing: persists pipeline state for recovery. 2) Record ID deduplication: assigns unique IDs, detects duplicates on retry. 3) Bundle-level transactions: commits output atomically per bundle. 4) Sink-specific: BigQuery Storage Write API (exactly-once mode), Cloud Storage (atomic file writes). Combination provides end-to-end exactly-once semantics.

  7. Question 7: Maintaining and Automating Data Workloads

    How do you backfill historical data through a Dataflow streaming pipeline?

    A. Replay all Pub/Sub messages
    B. Run the same Beam pipeline in batch mode reading from Cloud Storage (historical data) — Apache Beam's unified model allows the same transforms to work on both bounded and unbounded data
    C. Write a separate batch pipeline
    D. Backfilling is impossible
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Beam unified model: same pipeline code handles batch (PCollection from GCS) and streaming (PCollection from Pub/Sub). Backfill: 1) Export historical data to GCS. 2) Run pipeline in batch mode with GCS input. 3) Output to same BigQuery tables. Alternatively: Pub/Sub seek (replay messages from timestamp) for recent data. Cloud Composer: orchestrate backfill DAGs with date parameters.
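
    A sketch of the batch backfill, reusing the same parse/write transforms over a bounded Cloud Storage input; the bucket, file pattern, and table names are placeholders:

        import json
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        def run_backfill(input_pattern="gs://my-bucket/export/2024-01-*.json",
                         table="my-project:iot.readings"):
            # Same transforms as the streaming pipeline, but a bounded source
            # makes this a batch job.
            with beam.Pipeline(options=PipelineOptions()) as p:
                (p
                 | "ReadHistorical" >> beam.io.ReadFromText(input_pattern)
                 | "ParseJson" >> beam.Map(json.loads)
                 | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                       table,
                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))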

  8. Question 8: Maintaining and Automating Data Workloads

    How should you test a Dataflow pipeline before deploying to production?

    A. Test only in production
    B. Unit test transforms with the DirectRunner (local), integration test with sample data using TestPipeline, validate output schema/content, and run the pipeline on a test dataset in a staging project
    C. Skip testing for data pipelines
    D. Manual data inspection
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow testing: 1) Unit test: DirectRunner (local, no GCP resources) — test individual transforms with test data. 2) Integration test: run full pipeline on small test dataset in staging project. 3) Output validation: assert schema, row count, data quality. 4) Performance test: run with production-scale data, check resource usage. 5) Canary: run new pipeline version on a subset of production data before full rollout.
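
    A minimal DirectRunner unit test using Beam's testing utilities; the transform and expected values are invented for illustration:

        import apache_beam as beam
        from apache_beam.testing.test_pipeline import TestPipeline
        from apache_beam.testing.util import assert_that, equal_to

        def to_celsius(record):
            return {**record, "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}

        def test_to_celsius():
            with TestPipeline() as p:  # executes locally on the DirectRunner
                output = (p
                          | beam.Create([{"sensor": "a", "temp_f": 212.0}])
                          | beam.Map(to_celsius))
                assert_that(output, equal_to(
                    [{"sensor": "a", "temp_f": 212.0, "temp_c": 100.0}]))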

  9. Question 9: Ingesting and Processing Data

    How do you reuse Dataflow pipelines across teams without sharing source code?

    A. Copy the code to each team
    B. Create Dataflow templates (Classic or Flex) — pre-packaged pipelines that can be executed with runtime parameters without access to source code
    C. Share a VM with the code
    D. Use Cloud Functions instead
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow templates: Classic (limited parameters, older) or Flex (Docker-based, any parameters, recommended). Create: package pipeline as template → store in Cloud Storage/Artifact Registry. Execute: gcloud dataflow flex-template run JOB_NAME --template-file-gcs-location=gs://... --parameters=input=...,output=... Google-provided templates: 30+ for common patterns (GCS→BigQuery, Pub/Sub→BigQuery, JDBC→BigQuery).

  10. Question 10: Ingesting and Processing Data

    How do you design Pub/Sub topics and subscriptions for a multi-consumer data pipeline?

    A. One topic with one subscription
    B. One topic per data type, multiple subscriptions per topic — each subscription delivers messages to a different consumer independently (fan-out pattern)
    C. One topic for all data
    D. Separate topics per consumer
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Pub/Sub design: Topic = message category (orders, events, logs). Multiple subscriptions per topic: each gets a copy of every message (fan-out). Subscription types: Pull (consumer polls), Push (Pub/Sub sends to HTTPS endpoint/Cloud Run). Ordering: ordering key (per-key FIFO). Dead-letter: forward unprocessable messages. Filtering: subscription-level message filtering (reduce consumer load).
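
    A sketch of the fan-out pattern with the Pub/Sub Python client: one topic, two independent subscriptions, one of them filtered. Resource names and the attribute filter are placeholders:

        from google.cloud import pubsub_v1

        subscriber = pubsub_v1.SubscriberClient()
        topic = "projects/my-project/topics/orders"

        # Analytics consumer receives every message published to the topic.
        subscriber.create_subscription(
            request={"name": "projects/my-project/subscriptions/orders-analytics",
                     "topic": topic})

        # Fulfilment consumer only receives EU orders (subscription-level filter).
        subscriber.create_subscription(
            request={"name": "projects/my-project/subscriptions/orders-fulfilment-eu",
                     "topic": topic,
                     "filter": 'attributes.region = "eu"'})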

  11. Question 11: Ingesting and Processing Data

    When should you use Dataflow SQL instead of writing Java/Python Beam pipelines?

    A. Always use SQL
    B. Dataflow SQL for SQL-familiar analysts to build streaming pipelines using ZetaSQL — useful for simple filtering, aggregation, and joins on Pub/Sub and BigQuery data without coding
    C. Dataflow SQL is deprecated
    D. Only for batch processing
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow SQL: write SQL to define streaming/batch pipelines. Sources: Pub/Sub topics, BigQuery tables. SQL: SELECT, WHERE, GROUP BY, JOIN, windowing functions (TUMBLE, HOP, SESSION). Output: BigQuery tables. Use when: SQL-familiar users, simple transformations. Beam SDK when: complex transforms, custom logic, multiple I/Os, side inputs. Cloud Console UI for visual SQL pipeline building.

  12. Question 12: Ingesting and Processing Data

    What are the differences between BigQuery's streaming insert (legacy) and Storage Write API?

    A. They are the same
    B. Storage Write API: exactly-once, higher throughput, lower cost, supports transactions. Legacy streaming: at-least-once, simpler API, per-row pricing. Prefer Storage Write API for new development
    C. Legacy is better
    D. Storage Write API is for batch only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Storage Write API: committed mode (exactly-once, transaction support), default mode (at-least-once, simpler), pending mode (batch commit). Throughput: higher than legacy. Cost: lower per-byte. Legacy streaming: insertAll API, at-least-once, per-row pricing, 1MB row limit. Migration: use Storage Write API for new pipelines. Dataflow: automatically uses Storage Write API with BigQueryIO.
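
    In a Beam pipeline the switch is a single sink option; a sketch assuming a Beam SDK version whose BigQuery sink supports the Storage Write API method (the table name is a placeholder):

        import apache_beam as beam

        write_to_bq = beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,  # exactly-once sink
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)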

  13. Question 13: Ingesting and Processing Data

    What is the most efficient way to load large volumes of data into BigQuery?

    A. INSERT statements
    B. BigQuery load jobs (batch) for bulk data from Cloud Storage — free, supports Parquet/Avro/CSV/JSON, and handles schema auto-detection. Use Storage Write API for high-throughput streaming
    C. Streaming inserts for everything
    D. bq query with INSERT INTO
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    BigQuery loading: 1) Batch load jobs (bq load): FREE, from GCS (Parquet, Avro, CSV, JSON, ORC). Up to 15TB per load job. 2) Storage Write API: streaming with exactly-once, higher throughput than legacy streaming inserts. 3) DML (INSERT/MERGE): for small updates. 4) BigQuery Data Transfer Service: scheduled loads from GCS, S3, SaaS apps. Bulk: always batch load from GCS for cost.
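
    A sketch of a free batch load job with the BigQuery Python client; the bucket, dataset, and table names are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,  # schema comes from the files
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
        load_job = client.load_table_from_uri(
            "gs://my-bucket/exports/orders-*.parquet",
            "my-project.sales.orders",
            job_config=job_config)
        load_job.result()  # block until the load job completes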

  14. Question 14: Designing Data Processing Systems

    You need to calculate the average temperature per sensor every 5 minutes from streaming data. What Dataflow windowing strategy should you use?

    A. Global window
    B. Fixed windows of 5 minutes — each window collects all events in a 5-minute interval, then computes the aggregate when the window closes
    C. Sliding windows
    D. Session windows
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Windowing types: Fixed (tumbling): non-overlapping intervals (every 5 min — good for periodic aggregation). Sliding: overlapping (5-min window every 1 min — good for moving averages). Session: gap-based (close after inactivity — good for user sessions). Fixed 5-min: each sensor's readings grouped into 5-min buckets, average computed per window. Simplest for periodic reporting.
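
    A sketch in Beam Python, assuming a PCollection of (sensor_id, temperature) pairs whose elements carry event timestamps:

        import apache_beam as beam
        from apache_beam import window

        def average_per_sensor(readings):
            # Group readings into non-overlapping 5-minute windows,
            # then compute the mean temperature per sensor in each window.
            return (readings
                    | "FixedWindows5Min" >> beam.WindowInto(window.FixedWindows(5 * 60))
                    | "MeanPerSensor" >> beam.combiners.Mean.PerKey())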

  15. Question 15: Ingesting and Processing Data

    When should you use Cloud Data Fusion instead of writing custom Dataflow pipelines?

    A. Always use custom Dataflow
    B. When citizen data engineers need a visual, code-free ETL/ELT tool — Data Fusion provides a drag-and-drop UI with pre-built connectors for common sources and transformations
    C. Never use Data Fusion
    D. Data Fusion replaces Dataflow entirely
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Cloud Data Fusion: visual ETL tool (based on CDAP). Use when: non-developer users, standard transformations (join, filter, aggregate), 200+ pre-built connectors (SAP, Salesforce, databases). vs Dataflow: custom logic, streaming, code-based. Data Fusion actually generates Dataflow jobs under the hood for execution. Editions: Basic (batch), Enterprise (streaming, replication, lineage).

  16. Question 16: Maintaining and Automating Data Workloads

    How should you handle errors and failed records in a Dataflow pipeline?

    A. Let the pipeline crash
    B. Implement dead-letter queues — catch and route failed records to a separate Pub/Sub topic or BigQuery table for later investigation, while the pipeline continues processing valid records
    C. Retry infinitely
    D. Skip all errors silently
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Error handling: dead-letter pattern. In Dataflow: try/catch in DoFn, output failed records to a side output. Route to: dead-letter Pub/Sub topic (for reprocessing) or BigQuery error table (for investigation). Log: error details, original record, timestamp. Monitor: alert on dead-letter count exceeding threshold. Retry: transient errors with exponential backoff; permanent errors to dead-letter.
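
    A sketch of the side-output (dead-letter) pattern in Beam Python, assuming an upstream PCollection named messages that contains raw JSON strings:

        import json
        import apache_beam as beam
        from apache_beam import pvalue

        class ParseOrDeadLetter(beam.DoFn):
            # Valid records go to the main output; failures go to the 'dead' tag.
            def process(self, raw):
                try:
                    yield json.loads(raw)
                except Exception as err:
                    yield pvalue.TaggedOutput("dead", {"raw": raw, "error": str(err)})

        results = messages | beam.ParDo(ParseOrDeadLetter()).with_outputs("dead", main="parsed")
        parsed, dead = results.parsed, results.dead
        # Write 'dead' to a dead-letter Pub/Sub topic or a BigQuery error table.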

  17. Question 17: Ingesting and Processing Data

    How do you handle schema evolution in a streaming pipeline when the source schema changes?

    A. Stop the pipeline and rebuild
    B. Use Avro or Protocol Buffers (support schema evolution), configure BigQuery to auto-detect schema changes, and implement forward-compatible schemas in Dataflow with dead-letter for incompatible records
    C. Reject all schema changes
    D. Use schema-less formats only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Schema evolution: Avro/Protobuf: backward/forward compatible schemas (add fields with defaults, don't remove required fields). BigQuery: schema auto-update (ALLOW_FIELD_ADDITION in load config). Dataflow: dynamic destinations (route records to different tables by schema version). Dead-letter: records that don't match expected schema. Registry: Schema Registry for Kafka topics (enforce compatibility).
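
    A sketch of allowing additive schema changes during a BigQuery load with the Python client; the bucket and table names are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.AVRO,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # Allow new columns present in the Avro files to be added automatically.
            schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION])
        client.load_table_from_uri(
            "gs://my-bucket/events/*.avro",
            "my-project.analytics.events",
            job_config=job_config).result()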

  18. Question 18: Designing Data Processing Systems

    How do you estimate costs for a BigQuery + Dataflow data platform?

    A. Guess based on similar projects
    B. Google Cloud Pricing Calculator — estimate BigQuery (storage + queries/slots), Dataflow (worker hours + Streaming Engine), Pub/Sub (throughput), and Cloud Storage. Use INFORMATION_SCHEMA for actual usage data
    C. Just pay as you go
    D. Costs are fixed
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Cost estimation: BigQuery: storage ($0.02/GB/month active), on-demand ($6.25/TB queried), editions (slot-hours). Dataflow: worker hours (vCPU, memory, disk), Streaming Engine (data processed). Pub/Sub: message throughput ($40/TiB). GCS: storage class + operations. Use: Pricing Calculator (before), billing export + INFORMATION_SCHEMA (after). Optimize: reserved slots, compression, partitioning.
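
    A sketch of checking actual on-demand query spend with INFORMATION_SCHEMA via the BigQuery Python client; the region qualifier and lookback window are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client()
        sql = """
            SELECT
              user_email,
              ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
            FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
            WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
              AND job_type = 'QUERY'
            GROUP BY user_email
            ORDER BY tib_billed DESC
        """
        for row in client.query(sql).result():
            print(row.user_email, row.tib_billed)  # TiB billed per user, last 30 days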

  19. Question 19: Maintaining and Automating Data Workloads

    A Dataflow streaming job's system lag is increasing over time. How do you troubleshoot?

    A. Restart the job
    B. Check: 1) Worker CPU/memory (under-provisioned?), 2) Hot keys (uneven data distribution), 3) Slow external calls (API/DB bottleneck), 4) Data skew in GroupByKey. Scale up workers or add key rebalancing
    C. Increase Pub/Sub throughput
    D. Wait for it to recover
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow lag troubleshooting: 1) Workers: check CPU utilization (>80% = add workers, increase max-num-workers). 2) Hot keys: one key getting disproportionate data (apply hot-key fanout to Combine.perKey: withHotKeyFanout in Java, with_hot_key_fanout in Python; or add a salt to keys). 3) External calls: slow DB/API in DoFn (add caching, batch calls, async). 4) Data skew: GroupByKey with skewed keys (use Combine when possible). 5) Memory: OOM causes restarts (increase worker memory). The Dataflow diagnostics tab shows step-level latency.
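
    A sketch of hot-key fanout in Beam Python, assuming a PCollection of (key, value) pairs where a few keys dominate the traffic; the fanout factor is illustrative:

        import apache_beam as beam

        per_key_totals = (
            events
            # Split each hot key across 16 intermediate sub-keys, combine partial
            # sums in parallel, then merge, which reduces skew in the final combine.
            | "SumWithFanout" >> beam.CombinePerKey(sum).with_hot_key_fanout(16))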

  20. Question 20: Designing Data Processing Systems

    When should you use Dataproc instead of Dataflow?

    A. Always
    B. For existing Hadoop/Spark workloads, when you need Spark-specific libraries, or when the team has Spark expertise
    C. For streaming only
    D. For SQL queries only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataproc is ideal for migrating existing Hadoop/Spark jobs, leveraging Spark ML/GraphX libraries, or when teams have strong Spark expertise. Dataflow is better for new pipelines with unified batch/stream.

  21. Question 21: Maintaining and Automating Data Workloads

    How should you handle late-arriving data in streaming pipelines?

    A. Discard late data
    B. Use Dataflow watermarks, allowed lateness, and triggers to handle late data by updating windows when late elements arrive
    C. Process everything immediately
    D. Buffer all data indefinitely
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow uses watermarks to estimate event-time completeness, allowed lateness to accept data after the watermark passes, and triggers with an accumulation mode to update (refine) window results when late elements arrive within the lateness window.
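
    A sketch combining early (speculative), on-time, and late firings in Beam Python; the window size, early-firing interval, and lateness are illustrative values, and events is an assumed upstream PCollection:

        import apache_beam as beam
        from apache_beam import window
        from apache_beam.transforms.trigger import (
            AfterWatermark, AfterProcessingTime, AfterCount, AccumulationMode)

        windowed = events | beam.WindowInto(
            window.FixedWindows(300),              # 5-minute windows
            trigger=AfterWatermark(
                early=AfterProcessingTime(30),     # speculative result roughly every 30s
                late=AfterCount(1)),               # refine once per late element
            allowed_lateness=900,                  # accept data up to 15 minutes late
            accumulation_mode=AccumulationMode.ACCUMULATING)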

  22. Question 22: Designing Data Processing Systems

    A data lake needs to support both batch and streaming data processing with unified code. Which Google Cloud service provides this?

    A. Dataproc
    B. Dataflow (Apache Beam)
    C. BigQuery
    D. Cloud Composer
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow based on Apache Beam provides a unified programming model for both batch and streaming processing, allowing the same pipeline code to run in either mode.

Key Data Processing Concepts for PDE

BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, pipelines, batch, streaming

PDE Data Processing Exam Tips

Designing Data Processing Systems questions in PDE are typically scenario-based. Focus on service-level decision making aligned to official exam objectives. Priority concepts: BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, and pipeline design.

What PDE Expects

  • Anchor your answer in selecting the most practical, secure, and scalable option for the stated scenario.
  • Data Processing scenarios for PDE are frequently mapped to Domain 1 (~23%), so read the objective carefully before picking controls or architecture.
  • Expect multi-service scenarios where Data Processing interacts with IAM, networking, storage, or observability patterns rather than appearing as an isolated service question.
  • When two options are both technically valid, prefer the choice that best aligns with the exam's operational scope (Professional) and managed-service best practices.

High-Value Data Processing Concepts

  • Know the core Data Processing building blocks cold: BigQuery, Dataflow, Dataproc, and Pub/Sub.
  • Review the edge-case features and limits for Cloud Composer and pipeline orchestration; these details are commonly used to differentiate answer choices.
  • Practice service-integration reasoning: how Data Processing pairs with Ingesting & Processing, Storing & Managing in real deployment patterns.
  • For PDE, explain why the chosen Data Processing design meets reliability, security, and cost expectations better than the alternatives.

Common PDE Traps

  • Watch for answers that partially solve the requirement but miss operational constraints.
  • Questions in Designing Data Processing Systems often include distractors that look correct for Data Processing but violate least-privilege, durability, or availability requirements.
  • Avoid picking options purely by feature name; validate data path, failure handling, and governance impact before answering.
  • If the prompt hints at automation or repeatability, eliminate manual-only operational answers first.

Fast Review Checklist

  • Can you compare at least two Data Processing implementation paths and justify which one best fits the scenario?
  • Can you map the chosen answer back to Designing Data Processing Systems (~23%) outcomes for PDE?
  • Can you explain security and access boundaries for Data Processing without relying on default-open assumptions?
  • Can you describe how Data Processing integrates with Ingesting & Processing and Storing & Managing during failure, scaling, and monitoring events?
