Core Workflow
- Ingest source data into durable storage, commonly S3.
- Discover schema and metadata with Glue crawlers and the Data Catalog.
- Transform and clean data with Glue, DataBrew, SageMaker Processing, or notebooks.
- Validate schema, missing values, class balance, and outliers before training.
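The validation step above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the function name `validate_rows`, its parameters, and the z-score rule for outliers are all assumptions chosen for the example. In practice the same checks are often run with pandas, Glue DataBrew, or a SageMaker Processing job.

```python
from collections import Counter
from statistics import mean, stdev


def validate_rows(rows, expected_schema, label_column, numeric_column, z_threshold=3.0):
    """Return a report on schema, missing values, class balance, and outliers.

    rows is a list of dicts; expected_schema maps column name -> Python type.
    All names here are hypothetical and exist only for this sketch.
    """
    report = {}

    # Schema: each expected column must exist and, when not None, have the expected type.
    schema_errors = []
    for index, row in enumerate(rows):
        for column, expected_type in expected_schema.items():
            if column not in row:
                schema_errors.append((index, column, "absent"))
            elif row[column] is not None and not isinstance(row[column], expected_type):
                schema_errors.append((index, column, "wrong type"))
    report["schema_errors"] = schema_errors

    # Missing values: count None (or absent) entries per expected column.
    report["missing"] = {
        column: sum(1 for row in rows if row.get(column) is None)
        for column in expected_schema
    }

    # Class balance: share of each label among rows that have a label.
    labels = [row[label_column] for row in rows if row.get(label_column) is not None]
    counts = Counter(labels)
    report["class_share"] = {label: n / len(labels) for label, n in counts.items()}

    # Outliers: values whose z-score exceeds the threshold (an assumed, simple rule).
    values = [
        row[numeric_column]
        for row in rows
        if isinstance(row.get(numeric_column), (int, float))
    ]
    outliers = []
    if len(values) >= 2:
        mu, sigma = mean(values), stdev(values)
        if sigma > 0:
            outliers = [v for v in values if abs(v - mu) / sigma > z_threshold]
    report["outliers"] = outliers

    return report
```

Calling `validate_rows(rows, {"amount": float, "label": str}, "label", "amount")` on a small sample before training gives a quick view of whether a class is badly under-represented or a column is mostly empty. The z-score rule is only one option; it is unreliable on very small or heavily skewed samples.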
Exam Cues
- Need serverless SQL over S3: Athena.
- Need managed ETL and cataloging: AWS Glue.
- Need reusable low-latency features: SageMaker Feature Store online store.
- Need training datasets and historical features: SageMaker Feature Store offline store (backed by S3), or S3 directly.
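As an illustration of the first cue (serverless SQL over S3), the sketch below starts an Athena query with boto3's `start_query_execution`. The helper names `build_athena_request` and `run_query` are hypothetical, as are any database, table, or bucket names passed to them. Running the query requires boto3 and AWS credentials.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        # Database registered in the Glue Data Catalog (name is caller-supplied).
        "QueryExecutionContext": {"Database": database},
        # S3 location where Athena writes the query results.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    """Submit the query and return its execution ID (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3 installed

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

The query runs asynchronously: the returned ID is what you would pass to `get_query_execution` to poll for completion.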