Core Components
- Crawlers automatically scan data stores (e.g., S3, JDBC sources), infer schemas, and populate the Glue Data Catalog.
- The Data Catalog is a centralized metadata repository for databases, tables, and partitions.
- Glue ETL jobs run Apache Spark (Python/Scala) or Python Shell scripts in a serverless environment.
- DataBrew provides a visual, no-code interface for data preparation and profiling.
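The Data Catalog's role is easiest to see as a hierarchy of metadata: databases contain tables, and tables carry columns, a storage location, and partitions. The sketch below models that hierarchy locally in plain Python; the database/table names, columns, and S3 path are illustrative placeholders, not real Glue API objects.

```python
# Minimal local sketch of the Data Catalog hierarchy
# (database -> table -> columns/partitions/location).
# All names and values here are hypothetical examples.

catalog = {
    "sales_db": {
        "orders": {
            "columns": [("order_id", "string"), ("amount", "double")],
            "partitions": ["year=2023/month=01", "year=2023/month=02"],
            "location": "s3://example-bucket/orders/",
        }
    }
}

def get_table(database: str, table: str) -> dict:
    """Look up table metadata, loosely mirroring glue:GetTable semantics."""
    return catalog[database][table]

meta = get_table("sales_db", "orders")
# meta["location"] tells a query engine (Athena, Redshift Spectrum)
# where the data lives; meta["partitions"] lets it prune reads.
```

This is the same shape of metadata that Athena, EMR, and Redshift Spectrum read from the real Data Catalog at query time.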
Job Management
- Job bookmarks track previously processed data to enable incremental ETL.
- Glue workflows orchestrate multiple crawlers and jobs into a single pipeline.
- Glue Studio provides a visual drag-and-drop interface for building ETL pipelines.
- DynamicFrames are Glue's schema-flexible alternative to Spark DataFrames: each record carries its own schema, and built-in transforms (e.g., ResolveChoice) handle inconsistent or evolving source schemas.
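The job-bookmark idea can be sketched in a few lines: remember a high-water mark (such as the latest modification timestamp processed) and, on the next run, pick up only newer data. This is an illustration of the concept, not Glue's internal implementation; the file records and timestamps are made-up examples.

```python
# Hedged sketch of bookmark-style incremental processing:
# keep the max timestamp seen so far, process only newer files,
# then advance the bookmark.

def incremental_batch(files, bookmark):
    """Return (files newer than the bookmark, updated bookmark)."""
    new = [f for f in files if f["modified"] > bookmark]
    new_bookmark = max((f["modified"] for f in new), default=bookmark)
    return new, new_bookmark

# First run processed everything up to timestamp 100.
files = [
    {"key": "a.json", "modified": 100},  # already processed
    {"key": "b.json", "modified": 200},  # new since last run
]
batch, bm = incremental_batch(files, bookmark=100)
# batch now holds only b.json, and bm advances to 200,
# so the next run skips both files unless newer data arrives.
```

In a real Glue job you enable this behavior with the `--job-bookmark-option` job parameter rather than writing it yourself; the point is only that bookmarks avoid re-reading old data.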
Exam Cues
- Need serverless ETL with schema discovery: AWS Glue.
- Need visual data preparation without code: Glue DataBrew.
- Need incremental processing to avoid re-reading old data: job bookmarks.
- Need centralized metadata for Athena, EMR, and Redshift Spectrum: Glue Data Catalog.