🔄 Designing Data Processing Systems - PDE Practice Questions

Design data pipelines using BigQuery, Dataflow, Dataproc, Pub/Sub, and Cloud Composer for batch and streaming.

22 Questions Available
1 Exam Domain

Practice Data Processing Questions Now

Start a timed practice session focusing on Designing Data Processing Systems topics from the PDE question bank.

Start PDE Practice Quiz →

PDE Data Processing Question Bank (22 Questions)

Browse all 22 practice questions covering Designing Data Processing Systems for the PDE certification exam. Answers are intentionally hidden on this page so you can self-test first before checking results in quiz mode.

  1. Question 1: Ingesting and Processing Data

    How do you handle late-arriving data in a Pub/Sub-to-BigQuery streaming pipeline?

    A. Discard late data
    B. Configure Dataflow windowing with allowed lateness: late data triggers recomputation of window results, and a dead-letter topic captures data that arrives after the allowed lateness period
    C. Buffer all data indefinitely
    D. Process everything in batch

  2. Question 2: Designing Data Processing Systems

    When should you choose batch processing over stream processing for a data pipeline?

    A. Always use streaming
    B. When data freshness requirements are hourly or daily (not real-time), datasets are bounded and complete, and processing logic requires global aggregations across the full dataset
    C. Never use batch
    D. Batch is always cheaper

  3. Question 3: Designing Data Processing Systems

    You need to process streaming data from IoT devices with exactly-once semantics and load it into BigQuery. What pipeline architecture should you use?

    A. Cloud Functions reading from Pub/Sub
    B. Pub/Sub → Dataflow (Apache Beam streaming pipeline) → BigQuery: Dataflow provides exactly-once processing with windowing and watermarks
    C. Pub/Sub → Cloud SQL
    D. Direct insert into BigQuery from devices

  4. Question 4: Designing Data Processing Systems

    Your team needs near-real-time analytics with less than 1-minute latency from Pub/Sub to BigQuery. What approach is most cost-effective?

    A. Dataflow streaming with 1-second windows
    B. Pub/Sub subscription writing directly to BigQuery (BigQuery Subscription): a zero-code, auto-managed pipeline with sub-minute latency at lower cost than running Dataflow
    C. Batch load every minute
    D. Cloud Functions processing each message

  5. Question 5: Maintaining and Automating Data Workloads

    How do you monitor a streaming Dataflow pipeline for performance issues?

    A. Check logs once a day
    B. Monitor system lag, data freshness, watermark age, worker CPU/memory, and throughput in Cloud Monitoring, and set alerts on watermark age exceeding thresholds
    C. Only check if the job is running
    D. Use BigQuery to check output

  6. Question 6: Designing Data Processing Systems

    How does Dataflow achieve exactly-once processing in streaming pipelines?

    A. It processes each record only once naturally
    B. Through checkpointing, record deduplication based on message IDs, and transactional commits to sinks, ensuring each record is processed and written exactly once even on retries
    C. It uses Pub/Sub's exactly-once guarantee
    D. Through idempotent writes only

  7. Question 7: Maintaining and Automating Data Workloads

    How do you backfill historical data through a Dataflow streaming pipeline?

    A. Replay all Pub/Sub messages
    B. Run the same Beam pipeline in batch mode reading historical data from Cloud Storage: Apache Beam's unified model allows the same transforms to work on both bounded and unbounded data
    C. Write a separate batch pipeline
    D. Backfilling is impossible

  8. Question 8: Maintaining and Automating Data Workloads

    How should you test a Dataflow pipeline before deploying to production?

    A. Test only in production
    B. Unit test transforms with DirectRunner (local), integration test with sample data on TestRunner, validate output schema/content, and run the pipeline on a test dataset in a staging project
    C. Skip testing for data pipelines
    D. Manual data inspection

  9. Question 9: Ingesting and Processing Data

    How do you reuse Dataflow pipelines across teams without sharing source code?

    A. Copy the code to each team
    B. Create Dataflow templates (Classic or Flex): pre-packaged pipelines that can be executed with runtime parameters without access to source code
    C. Share a VM with the code
    D. Use Cloud Functions instead

  10. Question 10: Ingesting and Processing Data

    How do you design Pub/Sub topics and subscriptions for a multi-consumer data pipeline?

    A. One topic with one subscription
    B. One topic per data type, multiple subscriptions per topic: each subscription delivers messages to a different consumer independently (fan-out pattern)
    C. One topic for all data
    D. Separate topics per consumer

  11. Question 11: Ingesting and Processing Data

    When should you use Dataflow SQL instead of writing Java/Python Beam pipelines?

    A. Always use SQL
    B. When SQL-familiar analysts need to build streaming pipelines using ZetaSQL; it is useful for simple filtering, aggregation, and joins on Pub/Sub and BigQuery data without coding
    C. Dataflow SQL is deprecated
    D. Only for batch processing

  12. Question 12: Ingesting and Processing Data

    What are the differences between BigQuery's streaming insert (legacy) and Storage Write API?

    A. They are the same
    B. Storage Write API: exactly-once, higher throughput, lower cost, supports transactions. Legacy streaming: at-least-once, simpler API, per-row pricing. Prefer Storage Write API for new development
    C. Legacy is better
    D. Storage Write API is for batch only

  13. Question 13: Ingesting and Processing Data

    What is the most efficient way to load large volumes of data into BigQuery?

    A. INSERT statements
    B. BigQuery load jobs (batch) for bulk data from Cloud Storage: free, supports Parquet/Avro/CSV/JSON, and handles schema auto-detection. Use Storage Write API for high-throughput streaming
    C. Streaming inserts for everything
    D. bq query with INSERT INTO

  14. Question 14: Designing Data Processing Systems

    You need to calculate the average temperature per sensor every 5 minutes from streaming data. What Dataflow windowing strategy should you use?

    A. Global window
    B. Fixed windows of 5 minutes: each window collects all events in a 5-minute interval, then computes the aggregate when the window closes
    C. Sliding windows
    D. Session windows

  15. Question 15: Ingesting and Processing Data

    When should you use Cloud Data Fusion instead of writing custom Dataflow pipelines?

    A. Always use custom Dataflow
    B. When citizen data engineers need a visual, code-free ETL/ELT tool: Data Fusion provides a drag-and-drop UI with pre-built connectors for common sources and transformations
    C. Never use Data Fusion
    D. Data Fusion replaces Dataflow entirely

  16. Question 16: Maintaining and Automating Data Workloads

    How should you handle errors and failed records in a Dataflow pipeline?

    A. Let the pipeline crash
    B. Implement dead-letter queues: catch and route failed records to a separate Pub/Sub topic or BigQuery table for later investigation, while the pipeline continues processing valid records
    C. Retry infinitely
    D. Skip all errors silently

  17. Question 17: Ingesting and Processing Data

    How do you handle schema evolution in a streaming pipeline when the source schema changes?

    A. Stop the pipeline and rebuild
    B. Use Avro or Protocol Buffers (both support schema evolution), configure BigQuery to auto-detect schema changes, and implement forward-compatible schemas in Dataflow with a dead-letter path for incompatible records
    C. Reject all schema changes
    D. Use schema-less formats only

  18. Question 18: Designing Data Processing Systems

    How do you estimate costs for a BigQuery + Dataflow data platform?

    A. Guess based on similar projects
    B. Use the Google Cloud Pricing Calculator to estimate BigQuery (storage + queries/slots), Dataflow (worker hours + Streaming Engine), Pub/Sub (throughput), and Cloud Storage costs. Use INFORMATION_SCHEMA for actual usage data
    C. Just pay as you go
    D. Costs are fixed

  19. Question 19: Maintaining and Automating Data Workloads

    A Dataflow streaming job's system lag is increasing over time. How do you troubleshoot?

    A. Restart the job
    B. Check: 1) Worker CPU/memory (under-provisioned?), 2) Hot keys (uneven data distribution), 3) Slow external calls (API/DB bottleneck), 4) Data skew in GroupByKey. Scale up workers or add key rebalancing
    C. Increase Pub/Sub throughput
    D. Wait for it to recover

  20. Question 20: Designing Data Processing Systems

    When should you use Dataproc instead of Dataflow?

    A. Always
    B. For existing Hadoop/Spark workloads, when you need Spark-specific libraries, or when the team has Spark expertise
    C. For streaming only
    D. For SQL queries only

  21. Question 21: Maintaining and Automating Data Workloads

    How should you handle late-arriving data in streaming pipelines?

    A. Discard late data
    B. Use Dataflow watermarks, allowed lateness, and triggers to handle late data by updating windows when late elements arrive
    C. Process everything immediately
    D. Buffer all data indefinitely

  22. Question 22: Designing Data Processing Systems

    A data lake needs to support both batch and streaming data processing with unified code. Which Google Cloud service provides this?

    A. Dataproc
    B. Dataflow (Apache Beam)
    C. BigQuery
    D. Cloud Composer


Key Data Processing Concepts for PDE

BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, pipeline, batch, streaming
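
Several questions in the bank turn on event-time windowing, allowed lateness, and dead-letter handling. The sketch below illustrates those ideas in plain Python only; it does not use the Apache Beam or Dataflow API. The 2-minute lateness value, the function names, and the "largest event time seen" watermark are assumptions made for illustration, and a real runner estimates its watermark differently.

```python
from collections import defaultdict

WINDOW_SECONDS = 300      # 5-minute fixed windows
ALLOWED_LATENESS = 120    # hypothetical value: accept events up to 2 minutes late


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def process(events):
    """Assign events to fixed windows; send too-late events to a dead-letter list.

    Each event is (sensor_id, event_time_seconds, temperature), in arrival
    order. The watermark is simply the largest event time seen so far, which
    is a simplification chosen for this sketch.
    """
    readings = defaultdict(list)   # (sensor_id, window_start) -> temperatures
    dead_letter = []
    watermark = 0
    for sensor_id, event_time, temperature in events:
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if watermark > window_end + ALLOWED_LATENESS:
            # The window expired before this event arrived.
            dead_letter.append((sensor_id, event_time, temperature))
        else:
            # On time, or late but within allowed lateness: the window is updated.
            readings[(sensor_id, start)].append(temperature)
        watermark = max(watermark, event_time)
    averages = {key: sum(vals) / len(vals) for key, vals in readings.items()}
    return averages, dead_letter


events = [
    ("s1", 10, 20.0),
    ("s1", 200, 22.0),
    ("s1", 310, 30.0),   # watermark moves into the second window
    ("s1", 290, 24.0),   # late, but within allowed lateness
    ("s1", 800, 31.0),
    ("s1", 100, 99.0),   # too late: routed to the dead-letter list
]
averages, dead_letter = process(events)
```

In this run the late reading at second 290 still contributes to the first window's average, while the reading at second 100 arrives after the window has expired and is set aside for inspection.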

PDE Data Processing Exam Tips

Designing Data Processing Systems questions in PDE are typically scenario-based. Focus on service-level decision making aligned to the official exam objectives. Priority concepts: BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, and pipeline design.

What PDE Expects

  • Select the most practical, secure, and scalable answer for the stated scenario.
  • Data Processing scenarios for PDE are frequently mapped to Domain 1 (~23%), so read the objective carefully before picking controls or architecture.
  • Expect multi-topic scenarios where Data Processing interacts with IAM, networking, data, or operations patterns rather than appearing as an isolated question.
  • When two options are both technically valid, prefer the choice that best aligns with the exam's operational scope (Professional) and vendor best practices.

High-Value Data Processing Concepts

  • Know the core Data Processing building blocks cold: BigQuery, Dataflow, Dataproc, and Pub/Sub.
  • Review the edge-case features and limits of Cloud Composer and pipeline design; these details are commonly used to differentiate answer choices.
  • Practice service-integration reasoning: how Data Processing pairs with the Ingesting & Processing and Storing & Managing domains in real deployment patterns.
  • For PDE, explain why the chosen Data Processing design meets reliability, security, and cost expectations better than the alternatives.
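
One building block worth internalizing is the Pub/Sub fan-out pattern from the question bank: one topic, several subscriptions, each consumer reading independently. The toy model below is a minimal in-memory sketch of that behavior, not the Pub/Sub client library; the `Topic` class and the topic and subscription names are hypothetical.

```python
from collections import deque


class Topic:
    """Toy in-memory topic: every subscription gets its own copy of each message."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = {}

    def subscribe(self, subscription_name):
        """Create a subscription and return its private message queue."""
        queue = deque()
        self.subscriptions[subscription_name] = queue
        return queue

    def publish(self, message):
        """Deliver the message to every subscription attached to this topic."""
        for queue in self.subscriptions.values():
            queue.append(message)


orders = Topic("orders")
to_warehouse = orders.subscribe("orders-to-bigquery")
to_alerts = orders.subscribe("orders-to-alerting")

orders.publish({"order_id": 1, "amount": 40})
orders.publish({"order_id": 2, "amount": 900})

# One consumer drains its subscription; the other subscription is unaffected.
loaded = [to_warehouse.popleft() for _ in range(len(to_warehouse))]
large_orders = [m for m in to_alerts if m["amount"] > 500]
```

Because each subscription keeps its own backlog, a slow or failed consumer does not block the others, which is the property the fan-out answer relies on.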

Common PDE Traps

  • Watch for answers that partially solve the requirement but miss operational constraints.
  • Questions in Designing Data Processing Systems often include distractors that look correct for Data Processing but violate least-privilege, reliability, or scalability requirements.
  • Avoid picking options purely by feature name; validate data path, failure handling, and governance impact before answering.
  • If the prompt hints at automation or repeatability, eliminate manual-only operational answers first.
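
Failure handling is one of the checks named above, and the dead-letter pattern from the question bank is a common way to satisfy it. The following is a minimal plain-Python sketch of the idea, assuming JSON input with hypothetical `sensor_id` and `temperature` fields; in a real pipeline the dead-letter list would be a Pub/Sub topic or BigQuery table.

```python
import json


def parse_record(raw):
    """Parse one JSON record and validate the fields this sketch expects."""
    record = json.loads(raw)
    return {
        "sensor_id": str(record["sensor_id"]),
        "temperature": float(record["temperature"]),
    }


def run(raw_records):
    """Process every record; route failures aside instead of crashing."""
    valid, dead_letter = [], []
    for raw in raw_records:
        try:
            valid.append(parse_record(raw))
        except (ValueError, KeyError, TypeError) as error:
            # Keep the original payload and the failure reason for later review.
            dead_letter.append({"raw": raw, "error": type(error).__name__})
    return valid, dead_letter


raw_records = [
    '{"sensor_id": "s1", "temperature": 21.5}',
    'not json',
    '{"sensor_id": "s2"}',
    '{"sensor_id": "s3", "temperature": "warm"}',
]
valid, dead_letter = run(raw_records)
```

The valid record flows through while the three malformed ones are kept with their error type, so the pipeline neither crashes nor silently drops data.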

Fast Review Checklist

  • Can you compare at least two Data Processing implementation paths and justify which one best fits the scenario?
  • Can you map the chosen answer back to Designing Data Processing Systems (~23%) outcomes for PDE?
  • Can you explain security and access boundaries for Data Processing without relying on default-open assumptions?
  • Can you describe how Data Processing integrates with Ingesting & Processing and Storing & Managing during failure, scaling, and monitoring events?
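
When comparing implementation paths, it helps to remember the unified batch and streaming model the question bank attributes to Apache Beam: the same transform can run over bounded and unbounded input. The sketch below illustrates that idea with a plain Python generator rather than Beam itself; the function names and sample readings are assumptions for illustration.

```python
def to_fahrenheit(readings):
    """Shared transform: works on any iterable, bounded or not."""
    for sensor_id, celsius in readings:
        yield sensor_id, celsius * 9 / 5 + 32


# Bounded source: historical data already loaded into a list (backfill).
historical = [("s1", 0.0), ("s2", 100.0)]
backfilled = list(to_fahrenheit(historical))


def live_source():
    """Generator standing in for a live subscription."""
    yield ("s1", 20.0)
    yield ("s2", 25.0)


# Unbounded-style source: results are pulled one element at a time.
live = to_fahrenheit(live_source())
first_live = next(live)
```

Only the source differs between the two paths; the transform is written once, which is why a backfill does not need a separate pipeline definition.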
