📋 AWS Glue Cheat Sheet

AWS Glue is the centerpiece of DEA-C01 ETL questions. Key topics: crawlers, the Data Catalog, ETL jobs, DataBrew, job bookmarks, and schema management.

Core Components

  • Crawlers automatically discover schema and populate the Glue Data Catalog.
  • The Data Catalog is a centralized metadata repository for databases, tables, and partitions.
  • Glue ETL jobs run Apache Spark (Python/Scala) or Python Shell scripts in a serverless environment.
  • DataBrew provides a visual, no-code interface for data preparation and profiling.
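
To make the crawler-to-catalog relationship concrete, here is a minimal sketch of the request a crawler definition might use with boto3's `create_crawler` call. The crawler name, role ARN, database name, and S3 path are hypothetical placeholders, and `build_crawler_request` is an illustrative helper, not part of any AWS library.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for boto3's glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        # Tables discovered by the crawler are written to this catalog database.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update table definitions in place; only log objects that disappear.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name=request["Name"])
```

Once the crawler has run, the tables it creates in the catalog database can be queried by services that read the Data Catalog, such as Athena.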

Job Management

  • Job bookmarks persist state between runs so that each run processes only data it has not seen before (incremental ETL).
  • Glue workflows orchestrate multiple crawlers and jobs into a single pipeline.
  • Glue Studio provides a visual drag-and-drop interface for building ETL pipelines.
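  • DynamicFrames are Glue's counterpart to Spark DataFrames: each record is self-describing, so no schema is required up front, and built-in transformations (such as ResolveChoice) handle inconsistent types.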
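
Job bookmarks are switched on through the `--job-bookmark-option` job argument. The sketch below builds the request for boto3's `create_job` call with bookmarks enabled. The job name, role ARN, script location, Glue version, and worker settings are assumptions chosen for illustration, and `build_job_request` is a hypothetical helper.

```python
def build_job_request(name, role_arn, script_location, bookmarks=True):
    """Return keyword arguments for boto3's glue.create_job call."""
    option = "job-bookmark-enable" if bookmarks else "job-bookmark-disable"
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" selects a Spark ETL job; "pythonshell" selects Python Shell.
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "DefaultArguments": {"--job-bookmark-option": option},
        # Version and worker sizing are assumed values, not recommendations.
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


# All values below are hypothetical placeholders.
job_request = build_job_request(
    name="sales-incremental-etl",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_location="s3://example-bucket/scripts/sales_etl.py",
)

# With AWS credentials configured:
# import boto3
# boto3.client("glue").create_job(**job_request)
```

Enabling the option is only half of the setup: the job script itself also has to call `job.init(...)` at the start and `job.commit()` at the end, and pass a `transformation_ctx` to its sources, for bookmark state to be saved.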

Exam Cues

  • Need serverless ETL with schema discovery: AWS Glue.
  • Need visual data preparation without code: Glue DataBrew.
  • Need incremental processing to avoid re-reading old data: job bookmarks.
  • Need centralized metadata for Athena, EMR, and Redshift Spectrum: Glue Data Catalog.
