🔄 Designing Data Processing Systems - PDE Practice Questions

Design data pipelines using BigQuery, Dataflow, Dataproc, Pub/Sub, and Cloud Composer for batch and streaming.

22 Questions Available
1 Exam Domain

Practice Data Processing Questions Now

Start a timed practice session focusing on Designing Data Processing Systems topics from the PDE question bank.

Start PDE Practice Quiz →

PDE Data Processing Question Bank (22 Questions)

Browse all 22 practice questions covering Designing Data Processing Systems for the PDE certification exam. Each question includes the full answer and a detailed explanation to help you understand the concepts.

  1. Question 1: Ingesting and Processing Data

    How do you handle late-arriving data in a Pub/Sub-to-BigQuery streaming pipeline?

    A. Discard late data
    B. Configure Dataflow windowing with allowed lateness — late data triggers recomputation of window results, and a dead-letter topic captures data that arrives after the allowed lateness period
    C. Buffer all data indefinitely
    D. Process everything in batch
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Late data handling in Dataflow: 1) Windowing (fixed, sliding, session). 2) Watermarks: estimate event-time completeness. 3) Allowed lateness: accept data after watermark passes (triggers pane refinement). 4) Dead-letter: data arriving after allowed lateness → separate topic/table for investigation. 5) Accumulation mode: ACCUMULATING (include all panes) or DISCARDING (only new data).
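
    A minimal Apache Beam (Python) sketch of this pattern, assuming an upstream PCollection named events; the window size, lateness, and trigger values are illustrative only:

        import apache_beam as beam
        from apache_beam import window
        from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

        def window_with_allowed_lateness(events):
            # Fixed 1-minute windows; accept elements up to 10 minutes late and
            # re-fire (refine) the window result for each late element that arrives.
            return events | "Window" >> beam.WindowInto(
                window.FixedWindows(60),
                trigger=AfterWatermark(late=AfterCount(1)),
                allowed_lateness=600,
                accumulation_mode=AccumulationMode.ACCUMULATING)

    Data arriving after the allowed lateness is dropped from the window; to preserve it, route such records to a dead-letter topic or table (see the side-output pattern under Question 16).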

  2. Question 2: Designing Data Processing Systems

    When should you choose batch processing over stream processing for a data pipeline?

    A. Always use streaming
    B. When data freshness requirements are hours/daily (not real-time), datasets are bounded and complete, and processing logic requires global aggregations across the full dataset
    C. Never use batch
    D. Batch is always cheaper
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Batch: bounded datasets, complete data available, global aggregations (all records needed), hourly/daily freshness OK, simpler to reason about. Streaming: unbounded data, real-time insights needed, event-driven actions. Dataflow unifies both: same Apache Beam pipeline code handles batch (bounded PCollection) and streaming (unbounded PCollection). Choose based on latency requirements.

  3. Question 3: Designing Data Processing Systems

    You need to process streaming data from IoT devices with exactly-once semantics and load it into BigQuery. What pipeline architecture should you use?

    A. Cloud Functions reading from Pub/Sub
    B. Pub/Sub → Dataflow (Apache Beam streaming pipeline) → BigQuery — Dataflow provides exactly-once processing with windowing and watermarks
    C. Pub/Sub → Cloud SQL
    D. Direct insert into BigQuery from devices
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Streaming pipeline: Pub/Sub (durable ingestion) → Dataflow (streaming Apache Beam — windowing, watermarks, exactly-once via checkpointing) → BigQuery (Storage Write API for exactly-once). Dataflow auto-scales workers, handles late data with allowed lateness, and provides built-in monitoring. Google-provided templates available for this pattern.
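
    A minimal sketch of this architecture in the Apache Beam Python SDK; the project, topic, and table names are placeholders, and the parse step assumes JSON payloads:

        import json
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        options = PipelineOptions(streaming=True)  # run as a streaming job
        with beam.Pipeline(options=options) as p:
            (p
             | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                   topic="projects/my-project/topics/iot-events")
             | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
             | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                   "my-project:iot.readings",
                   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))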

  4. Question 4: Designing Data Processing Systems

    Your team needs near-real-time analytics with less than 1-minute latency from Pub/Sub to BigQuery. What approach is most cost-effective?

    A. Dataflow streaming with 1-second windows
    B. Pub/Sub subscription writing directly to BigQuery (BigQuery Subscription) — a zero-code, auto-managed pipeline with sub-minute latency at lower cost than running Dataflow
    C. Batch load every minute
    D. Cloud Functions processing each message
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    BigQuery Subscription: Pub/Sub writes directly to BigQuery table — no Dataflow, no code, no infrastructure management. Latency: seconds. Cost: Pub/Sub throughput pricing only (no Dataflow workers). Limitations: limited transformations (only field mapping). Use Dataflow when: complex transforms, windowed aggregations, enrichment needed. BigQuery Subscription for: simple ingest with minimal transformation.
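
    A sketch of creating a BigQuery subscription with the Pub/Sub Python client, assuming a google-cloud-pubsub version that exposes the bigquery_config field; all resource names are placeholders:

        from google.cloud import pubsub_v1

        subscriber = pubsub_v1.SubscriberClient()
        subscriber.create_subscription(
            request={
                "name": "projects/my-project/subscriptions/orders-to-bq",
                "topic": "projects/my-project/topics/orders",
                "bigquery_config": {
                    "table": "my-project.analytics.orders_raw",  # destination table
                    "use_topic_schema": True,  # map topic schema fields to columns
                },
            })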

  5. Question 5: Maintaining and Automating Data Workloads

    How do you monitor a streaming Dataflow pipeline for performance issues?

    A. Check logs once a day
    B. Monitor system lag, data freshness, watermark age, worker CPU/memory, and throughput in Cloud Monitoring — set alerts on watermark age exceeding thresholds
    C. Only check if the job is running
    D. Use BigQuery to check output
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow monitoring: System lag (how long the oldest unprocessed element has been waiting), Data freshness (how far the output watermark lags behind real time), Watermark age (event-time watermark vs processing time), Worker utilization (CPU, memory, threads), Throughput (elements/sec, bytes/sec). Alert: system lag > 5 min, watermark stale. Autoscaling: adjust workers based on backlog.
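
    A hedged sketch of alerting on system lag with the Cloud Monitoring Python client (google-cloud-monitoring); the project ID and thresholds are placeholders, and the filter should be validated against the dataflow.googleapis.com/job/system_lag metric before use:

        from google.cloud import monitoring_v3

        client = monitoring_v3.AlertPolicyServiceClient()
        policy = monitoring_v3.AlertPolicy(
            display_name="Dataflow system lag > 5 min",
            combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
            conditions=[monitoring_v3.AlertPolicy.Condition(
                display_name="system_lag above 300s",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter='metric.type = "dataflow.googleapis.com/job/system_lag" '
                           'AND resource.type = "dataflow_job"',
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=300,        # 5 minutes of lag
                    duration={"seconds": 300},  # sustained for 5 minutes
                ))])
        client.create_alert_policy(
            name="projects/my-project", alert_policy=policy)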

  6. Question 6: Designing Data Processing Systems

    How does Dataflow achieve exactly-once processing in streaming pipelines?

    A. It processes each record only once naturally
    B. Through checkpointing, record deduplication based on message IDs, and transactional commits to sinks — ensuring each record is processed and written exactly once even on retries
    C. It uses Pub/Sub's exactly-once guarantee
    D. Through idempotent writes only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow exactly-once: 1) Checkpointing: persists pipeline state for recovery. 2) Record ID deduplication: assigns unique IDs, detects duplicates on retry. 3) Bundle-level transactions: commits output atomically per bundle. 4) Sink-specific: BigQuery Storage Write API (exactly-once mode), Cloud Storage (atomic file writes). Combination provides end-to-end exactly-once semantics.

  7. Question 7: Maintaining and Automating Data Workloads

    How do you backfill historical data through a Dataflow streaming pipeline?

    A. Replay all Pub/Sub messages
    B. Run the same Beam pipeline in batch mode reading from Cloud Storage (historical data) — Apache Beam's unified model allows the same transforms to work on both bounded and unbounded data
    C. Write a separate batch pipeline
    D. Backfilling is impossible
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Beam unified model: same pipeline code handles batch (PCollection from GCS) and streaming (PCollection from Pub/Sub). Backfill: 1) Export historical data to GCS. 2) Run pipeline in batch mode with GCS input. 3) Output to same BigQuery tables. Alternatively: Pub/Sub seek (replay messages from timestamp) for recent data. Cloud Composer: orchestrate backfill DAGs with date parameters.
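
    A sketch of the batch backfill, reusing the same parse/write transforms over a bounded Cloud Storage input; the bucket, file pattern, and table names are placeholders:

        import json
        import apache_beam as beam
        from apache_beam.options.pipeline_options import PipelineOptions

        def run_backfill(input_pattern="gs://my-bucket/export/2024-01-*.json",
                         table="my-project:iot.readings"):
            # Same transforms as the streaming pipeline, but a bounded source
            # makes this a batch job.
            with beam.Pipeline(options=PipelineOptions()) as p:
                (p
                 | "ReadHistorical" >> beam.io.ReadFromText(input_pattern)
                 | "ParseJson" >> beam.Map(json.loads)
                 | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                       table,
                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))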

  8. Question 8: Maintaining and Automating Data Workloads

    How should you test a Dataflow pipeline before deploying to production?

    A. Test only in production
    B. Unit test transforms with the DirectRunner (local), integration test with sample data using TestPipeline, validate output schema/content, and run the pipeline on a test dataset in a staging project
    C. Skip testing for data pipelines
    D. Manual data inspection
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow testing: 1) Unit test: DirectRunner (local, no GCP resources) — test individual transforms with test data. 2) Integration test: run full pipeline on small test dataset in staging project. 3) Output validation: assert schema, row count, data quality. 4) Performance test: run with production-scale data, check resource usage. 5) Canary: run new pipeline version on a subset of production data before full rollout.
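
    A minimal DirectRunner unit test using Beam's testing utilities; the transform and expected values are invented for illustration:

        import apache_beam as beam
        from apache_beam.testing.test_pipeline import TestPipeline
        from apache_beam.testing.util import assert_that, equal_to

        def to_celsius(record):
            return {**record, "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}

        def test_to_celsius():
            with TestPipeline() as p:  # executes locally on the DirectRunner
                output = (p
                          | beam.Create([{"sensor": "a", "temp_f": 212.0}])
                          | beam.Map(to_celsius))
                assert_that(output, equal_to(
                    [{"sensor": "a", "temp_f": 212.0, "temp_c": 100.0}]))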

  9. Question 9: Ingesting and Processing Data

    How do you reuse Dataflow pipelines across teams without sharing source code?

    A. Copy the code to each team
    B. Create Dataflow templates (Classic or Flex) — pre-packaged pipelines that can be executed with runtime parameters without access to source code
    C. Share a VM with the code
    D. Use Cloud Functions instead
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow templates: Classic (limited parameters, older) or Flex (Docker-based, any parameters, recommended). Create: package pipeline as template → store in Cloud Storage/Artifact Registry. Execute: gcloud dataflow flex-template run JOB_NAME --template-file-gcs-location=gs://... --parameters=input=...,output=... Google-provided templates: 30+ for common patterns (GCS→BigQuery, Pub/Sub→BigQuery, JDBC→BigQuery).

  10. Question 10: Ingesting and Processing Data

    How do you design Pub/Sub topics and subscriptions for a multi-consumer data pipeline?

    A. One topic with one subscription
    B. One topic per data type, multiple subscriptions per topic — each subscription delivers messages to a different consumer independently (fan-out pattern)
    C. One topic for all data
    D. Separate topics per consumer
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Pub/Sub design: Topic = message category (orders, events, logs). Multiple subscriptions per topic: each gets a copy of every message (fan-out). Subscription types: Pull (consumer polls), Push (Pub/Sub sends to HTTPS endpoint/Cloud Run). Ordering: ordering key (per-key FIFO). Dead-letter: forward unprocessable messages. Filtering: subscription-level message filtering (reduce consumer load).
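
    A sketch of the fan-out pattern with the Pub/Sub Python client: one topic, two independent subscriptions, one of them filtered. Resource names and the attribute filter are placeholders:

        from google.cloud import pubsub_v1

        subscriber = pubsub_v1.SubscriberClient()
        topic = "projects/my-project/topics/orders"

        # Analytics consumer receives every message published to the topic.
        subscriber.create_subscription(
            request={"name": "projects/my-project/subscriptions/orders-analytics",
                     "topic": topic})

        # Fulfilment consumer only receives EU orders (subscription-level filter).
        subscriber.create_subscription(
            request={"name": "projects/my-project/subscriptions/orders-fulfilment-eu",
                     "topic": topic,
                     "filter": 'attributes.region = "eu"'})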

  11. Question 11: Ingesting and Processing Data

    When should you use Dataflow SQL instead of writing Java/Python Beam pipelines?

    A. Always use SQL
    B. Dataflow SQL for SQL-familiar analysts to build streaming pipelines using ZetaSQL — useful for simple filtering, aggregation, and joins on Pub/Sub and BigQuery data without coding
    C. Dataflow SQL is deprecated
    D. Only for batch processing
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow SQL: write SQL to define streaming/batch pipelines. Sources: Pub/Sub topics, BigQuery tables. SQL: SELECT, WHERE, GROUP BY, JOIN, windowing functions (TUMBLE, HOP, SESSION). Output: BigQuery tables. Use when: SQL-familiar users, simple transformations. Beam SDK when: complex transforms, custom logic, multiple I/Os, side inputs. Cloud Console UI for visual SQL pipeline building.

  12. Question 12: Ingesting and Processing Data

    What are the differences between BigQuery's streaming insert (legacy) and Storage Write API?

    A. They are the same
    B. Storage Write API: exactly-once, higher throughput, lower cost, supports transactions. Legacy streaming: at-least-once, simpler API, per-row pricing. Prefer Storage Write API for new development
    C. Legacy is better
    D. Storage Write API is for batch only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Storage Write API: committed mode (exactly-once, transaction support), default mode (at-least-once, simpler), pending mode (batch commit). Throughput: higher than legacy. Cost: lower per-byte. Legacy streaming: insertAll API, at-least-once, per-row pricing, 1MB row limit. Migration: use Storage Write API for new pipelines. Dataflow: automatically uses Storage Write API with BigQueryIO.
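
    In a Beam pipeline the switch is a single sink option; a sketch assuming a Beam SDK version whose BigQuery sink supports the Storage Write API method (the table name is a placeholder):

        import apache_beam as beam

        write_to_bq = beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,  # exactly-once sink
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)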

  13. Question 13: Ingesting and Processing Data

    What is the most efficient way to load large volumes of data into BigQuery?

    A. INSERT statements
    B. BigQuery load jobs (batch) for bulk data from Cloud Storage — free, supports Parquet/Avro/CSV/JSON, and handles schema auto-detection. Use Storage Write API for high-throughput streaming
    C. Streaming inserts for everything
    D. bq query with INSERT INTO
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    BigQuery loading: 1) Batch load jobs (bq load): FREE, from GCS (Parquet, Avro, CSV, JSON, ORC). Up to 15TB per load job. 2) Storage Write API: streaming with exactly-once, higher throughput than legacy streaming inserts. 3) DML (INSERT/MERGE): for small updates. 4) BigQuery Data Transfer Service: scheduled loads from GCS, S3, SaaS apps. Bulk: always batch load from GCS for cost.
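
    A sketch of a free batch load job with the BigQuery Python client; the bucket, dataset, and table names are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,  # schema comes from the files
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
        load_job = client.load_table_from_uri(
            "gs://my-bucket/exports/orders-*.parquet",
            "my-project.sales.orders",
            job_config=job_config)
        load_job.result()  # block until the load job completes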

  14. Question 14: Designing Data Processing Systems

    You need to calculate the average temperature per sensor every 5 minutes from streaming data. What Dataflow windowing strategy should you use?

    A. Global window
    B. Fixed windows of 5 minutes — each window collects all events in a 5-minute interval, then computes the aggregate when the window closes
    C. Sliding windows
    D. Session windows
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Windowing types: Fixed (tumbling): non-overlapping intervals (every 5 min — good for periodic aggregation). Sliding: overlapping (5-min window every 1 min — good for moving averages). Session: gap-based (close after inactivity — good for user sessions). Fixed 5-min: each sensor's readings grouped into 5-min buckets, average computed per window. Simplest for periodic reporting.
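
    A sketch in Beam Python, assuming a PCollection of (sensor_id, temperature) pairs whose elements carry event timestamps:

        import apache_beam as beam
        from apache_beam import window

        def average_per_sensor(readings):
            # Group readings into non-overlapping 5-minute windows,
            # then compute the mean temperature per sensor in each window.
            return (readings
                    | "FixedWindows5Min" >> beam.WindowInto(window.FixedWindows(5 * 60))
                    | "MeanPerSensor" >> beam.combiners.Mean.PerKey())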

  15. Question 15: Ingesting and Processing Data

    When should you use Cloud Data Fusion instead of writing custom Dataflow pipelines?

    A. Always use custom Dataflow
    B. When citizen data engineers need a visual, code-free ETL/ELT tool — Data Fusion provides a drag-and-drop UI with pre-built connectors for common sources and transformations
    C. Never use Data Fusion
    D. Data Fusion replaces Dataflow entirely
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Cloud Data Fusion: visual ETL tool (based on CDAP). Use when: non-developer users, standard transformations (join, filter, aggregate), 200+ pre-built connectors (SAP, Salesforce, databases). vs Dataflow: custom logic, streaming, code-based. Data Fusion actually generates Dataflow jobs under the hood for execution. Editions: Basic (batch), Enterprise (streaming, replication, lineage).

  16. Question 16: Maintaining and Automating Data Workloads

    How should you handle errors and failed records in a Dataflow pipeline?

    A. Let the pipeline crash
    B. Implement dead-letter queues — catch and route failed records to a separate Pub/Sub topic or BigQuery table for later investigation, while the pipeline continues processing valid records
    C. Retry infinitely
    D. Skip all errors silently
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Error handling: dead-letter pattern. In Dataflow: try/catch in DoFn, output failed records to a side output. Route to: dead-letter Pub/Sub topic (for reprocessing) or BigQuery error table (for investigation). Log: error details, original record, timestamp. Monitor: alert on dead-letter count exceeding threshold. Retry: transient errors with exponential backoff; permanent errors to dead-letter.
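
    A sketch of the side-output (dead-letter) pattern in Beam Python, assuming an upstream PCollection named messages that contains raw JSON strings:

        import json
        import apache_beam as beam
        from apache_beam import pvalue

        class ParseOrDeadLetter(beam.DoFn):
            # Valid records go to the main output; failures go to the 'dead' tag.
            def process(self, raw):
                try:
                    yield json.loads(raw)
                except Exception as err:
                    yield pvalue.TaggedOutput("dead", {"raw": raw, "error": str(err)})

        results = messages | beam.ParDo(ParseOrDeadLetter()).with_outputs("dead", main="parsed")
        parsed, dead = results.parsed, results.dead
        # Write 'dead' to a dead-letter Pub/Sub topic or a BigQuery error table.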

  17. Question 17: Ingesting and Processing Data

    How do you handle schema evolution in a streaming pipeline when the source schema changes?

    A. Stop the pipeline and rebuild
    B. Use Avro or Protocol Buffers (support schema evolution), configure BigQuery to auto-detect schema changes, and implement forward-compatible schemas in Dataflow with dead-letter for incompatible records
    C. Reject all schema changes
    D. Use schema-less formats only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Schema evolution: Avro/Protobuf: backward/forward compatible schemas (add fields with defaults, don't remove required fields). BigQuery: schema auto-update (ALLOW_FIELD_ADDITION in load config). Dataflow: dynamic destinations (route records to different tables by schema version). Dead-letter: records that don't match expected schema. Registry: Schema Registry for Kafka topics (enforce compatibility).
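
    A sketch of allowing additive schema changes during a BigQuery load with the Python client; the bucket and table names are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.AVRO,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # Allow new columns present in the Avro files to be added automatically.
            schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION])
        client.load_table_from_uri(
            "gs://my-bucket/events/*.avro",
            "my-project.analytics.events",
            job_config=job_config).result()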

  18. Question 18: Designing Data Processing Systems

    How do you estimate costs for a BigQuery + Dataflow data platform?

    A. Guess based on similar projects
    B. Google Cloud Pricing Calculator — estimate BigQuery (storage + queries/slots), Dataflow (worker hours + Streaming Engine), Pub/Sub (throughput), and Cloud Storage. Use INFORMATION_SCHEMA for actual usage data
    C. Just pay as you go
    D. Costs are fixed
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Cost estimation: BigQuery: storage ($0.02/GB/month active), on-demand ($6.25/TB queried), editions (slot-hours). Dataflow: worker hours (vCPU, memory, disk), Streaming Engine (data processed). Pub/Sub: message throughput ($40/TiB). GCS: storage class + operations. Use: Pricing Calculator (before), billing export + INFORMATION_SCHEMA (after). Optimize: reserved slots, compression, partitioning.
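
    A sketch of checking actual on-demand query spend with INFORMATION_SCHEMA via the BigQuery Python client; the region qualifier and lookback window are placeholders:

        from google.cloud import bigquery

        client = bigquery.Client()
        sql = """
            SELECT
              user_email,
              ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
            FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
            WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
              AND job_type = 'QUERY'
            GROUP BY user_email
            ORDER BY tib_billed DESC
        """
        for row in client.query(sql).result():
            print(row.user_email, row.tib_billed)  # TiB billed per user, last 30 days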

  19. Question 19: Maintaining and Automating Data Workloads

    A Dataflow streaming job's system lag is increasing over time. How do you troubleshoot?

    A. Restart the job
    B. Check: 1) Worker CPU/memory (under-provisioned?), 2) Hot keys (uneven data distribution), 3) Slow external calls (API/DB bottleneck), 4) Data skew in GroupByKey. Scale up workers or add key rebalancing
    C. Increase Pub/Sub throughput
    D. Wait for it to recover
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow lag troubleshooting: 1) Workers: check CPU utilization (>80% = add workers, increase max-num-workers). 2) Hot keys: one key getting disproportionate data (apply hot-key fanout to Combine.perKey: withHotKeyFanout in Java, with_hot_key_fanout in Python; or add a salt to keys). 3) External calls: slow DB/API in DoFn (add caching, batch calls, async). 4) Data skew: GroupByKey with skewed keys (use Combine when possible). 5) Memory: OOM causes restarts (increase worker memory). The Dataflow diagnostics tab shows step-level latency.
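
    A sketch of hot-key fanout in Beam Python, assuming a PCollection of (key, value) pairs where a few keys dominate the traffic; the fanout factor is illustrative:

        import apache_beam as beam

        per_key_totals = (
            events
            # Split each hot key across 16 intermediate sub-keys, combine partial
            # sums in parallel, then merge, which reduces skew in the final combine.
            | "SumWithFanout" >> beam.CombinePerKey(sum).with_hot_key_fanout(16))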

  20. Question 20: Designing Data Processing Systems

    When should you use Dataproc instead of Dataflow?

    A. Always
    B. For existing Hadoop/Spark workloads, when you need Spark-specific libraries, or when the team has Spark expertise
    C. For streaming only
    D. For SQL queries only
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataproc is ideal for migrating existing Hadoop/Spark jobs, leveraging Spark ML/GraphX libraries, or when teams have strong Spark expertise. Dataflow is better for new pipelines with unified batch/stream.

  21. Question 21: Maintaining and Automating Data Workloads

    How should you handle late-arriving data in streaming pipelines?

    A. Discard late data
    B. Use Dataflow watermarks, allowed lateness, and triggers to handle late data by updating windows when late elements arrive
    C. Process everything immediately
    D. Buffer all data indefinitely
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow uses watermarks to estimate event-time completeness, allowed lateness to accept data after the watermark passes, and triggers with an accumulation mode to update (refine) window results when late elements arrive within the lateness window.
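
    A sketch combining early (speculative), on-time, and late firings in Beam Python; the window size, early-firing interval, and lateness are illustrative values, and events is an assumed upstream PCollection:

        import apache_beam as beam
        from apache_beam import window
        from apache_beam.transforms.trigger import (
            AfterWatermark, AfterProcessingTime, AfterCount, AccumulationMode)

        windowed = events | beam.WindowInto(
            window.FixedWindows(300),              # 5-minute windows
            trigger=AfterWatermark(
                early=AfterProcessingTime(30),     # speculative result roughly every 30s
                late=AfterCount(1)),               # refine once per late element
            allowed_lateness=900,                  # accept data up to 15 minutes late
            accumulation_mode=AccumulationMode.ACCUMULATING)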

  22. Question 22: Designing Data Processing Systems

    A data lake needs to support both batch and streaming data processing with unified code. Which Google Cloud service provides this?

    A. Dataproc
    B. Dataflow (Apache Beam)
    C. BigQuery
    D. Cloud Composer
    Show Answer & Explanation
    Correct Answer: B
    Explanation:

    Dataflow based on Apache Beam provides a unified programming model for both batch and streaming processing, allowing the same pipeline code to run in either mode.

Key Data Processing Concepts for PDE

BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, pipelines, batch, streaming

PDE Data Processing Exam Tips

Designing Data Processing Systems questions in PDE are typically scenario-based. Focus on service-level decision making aligned to official exam objectives. Priority concepts: BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, and pipeline design.

What PDE Expects

  • Anchor your answer in selecting the most practical, secure, and scalable option for the stated scenario.
  • Data Processing scenarios for PDE are frequently mapped to Domain 1 (~23%), so read the objective carefully before picking controls or architecture.
  • Expect multi-service scenarios where Data Processing interacts with IAM, networking, storage, or observability patterns rather than appearing as an isolated service question.
  • When two options are both technically valid, prefer the choice that best aligns with the exam's operational scope (Professional) and managed-service best practices.

High-Value Data Processing Concepts

  • Know the core Data Processing building blocks cold: BigQuery, Dataflow, Dataproc, and Pub/Sub.
  • Review the edge-case features and limits for Cloud Composer and pipeline orchestration; these details are commonly used to differentiate answer choices.
  • Practice service-integration reasoning: how Data Processing pairs with Ingesting & Processing, Storing & Managing in real deployment patterns.
  • For PDE, explain why the chosen Data Processing design meets reliability, security, and cost expectations better than the alternatives.

Common PDE Traps

  • Watch for answers that partially solve the requirement but miss operational constraints.
  • Questions in Designing Data Processing Systems often include distractors that look correct for Data Processing but violate least-privilege, durability, or availability requirements.
  • Avoid picking options purely by feature name; validate data path, failure handling, and governance impact before answering.
  • If the prompt hints at automation or repeatability, eliminate manual-only operational answers first.

Fast Review Checklist

  • Can you compare at least two Data Processing implementation paths and justify which one best fits the scenario?
  • Can you map the chosen answer back to Designing Data Processing Systems (~23%) outcomes for PDE?
  • Can you explain security and access boundaries for Data Processing without relying on default-open assumptions?
  • Can you describe how Data Processing integrates with Ingesting & Processing and Storing & Managing during failure, scaling, and monitoring events?
