A Local Data Stack That Speaks AWS

The AWS bill for a development environment makes no sense. You are paying for compute and storage to run queries that touch maybe 50MB of test data, just because the pipeline was built assuming it would always run in the cloud. There is a better way to set this up, and it costs nothing until you are ready to deploy for real.

The key word is API-compatible. MinIO speaks the same protocol as S3. LocalStack answers the same calls as Lambda, Secrets Manager, and SQS. Dagster abstracts its infrastructure connections behind “Resources.” This means your code does not know, and does not care, whether it is talking to localhost or eu-west-1. The switch is a configuration change, not a rewrite.

Storage: MinIO as S3. One Docker container, one environment variable that feeds a single setting: endpoint_url = "http://localhost:9000" locally, endpoint_url = None in production. Your boto3 calls do not change, and anything else that speaks S3 (dbt, a Terraform state backend) needs only the same endpoint override in its configuration. I have run this setup on projects for months before the client even had an AWS account, and the migration to real S3 took about twenty minutes.
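Here is a minimal sketch of that switch in Python. The variable name S3_ENDPOINT_URL and both helper functions are my own inventions for illustration; the only real API involved is the endpoint_url argument of boto3.client. It assumes MinIO's credentials arrive through the usual AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables.

```python
import os

# Hypothetical variable name: dev sets it, prod leaves it unset.
ENDPOINT_ENV_VAR = "S3_ENDPOINT_URL"


def s3_client_kwargs(env=None):
    """Return the keyword arguments to pass to boto3.client("s3", ...)."""
    env = os.environ if env is None else env
    # An unset or empty variable becomes None, which tells boto3 to use real AWS.
    endpoint_url = env.get(ENDPOINT_ENV_VAR) or None
    return {"endpoint_url": endpoint_url}


def make_s3_client(env=None):
    """Build the client. boto3 is imported lazily so the helper above has no dependencies."""
    import boto3  # assumes boto3 is installed

    return boto3.client("s3", **s3_client_kwargs(env))
```

Everything downstream of make_s3_client() is ordinary boto3, so it should behave the same against MinIO and S3 for the common object operations.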

Transformation: dbt Core + DuckDB. DuckDB reads Parquet directly from MinIO through its httpfs extension, much as Athena reads it from S3, and dbt-duckdb makes it a proper adapter. You write the SQL once, and when you move to production you swap in dbt-athena or dbt-redshift and the models mostly follow. I say “mostly” because there are always a few date functions that disagree between dialects, but that is a problem you fix in an afternoon, not a reason to avoid the approach.
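To see the DuckDB half of that outside dbt, here is a rough sketch using the duckdb Python package. The function names are hypothetical. The SET options are DuckDB's httpfs settings, and path-style URLs are what a default MinIO container expects. Interpolating credentials into SQL like this is fine for a laptop and nothing more.

```python
from urllib.parse import urlparse


def duckdb_s3_statements(endpoint_url, access_key, secret_key):
    """Build the SQL that points DuckDB's httpfs extension at an S3-compatible endpoint."""
    parsed = urlparse(endpoint_url)
    use_ssl = "true" if parsed.scheme == "https" else "false"
    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint = '{parsed.netloc}'",  # host:port, without the scheme
        "SET s3_url_style = 'path'",  # bucket in the path, not in the hostname
        f"SET s3_use_ssl = {use_ssl}",
        f"SET s3_access_key_id = '{access_key}'",
        f"SET s3_secret_access_key = '{secret_key}'",
    ]


def connect_to_minio(endpoint_url, access_key, secret_key):
    """Open an in-memory DuckDB connection configured for the given endpoint."""
    import duckdb  # assumes the duckdb package is installed

    con = duckdb.connect()
    for statement in duckdb_s3_statements(endpoint_url, access_key, secret_key):
        con.execute(statement)
    return con
```

With that connection, a query such as SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet') runs against MinIO, where my-bucket is a placeholder for whatever bucket you created locally.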

Orchestration: Dagster. I know Airflow is what most teams have traditionally run, but if you are starting a new project today, Dagster’s Resources model is built exactly for this kind of environment switching. You define a LocalStorageResource and a ProductionS3Resource, and you point at one or the other via your deployment config. The pipeline logic never changes, and you can run a full end-to-end test on your laptop, no cloud account required.
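The shape of that switch, stripped down to plain Python so it runs without Dagster installed, looks roughly like this. In a real Dagster project the two classes would subclass dagster.ConfigurableResource and the dictionary would be handed to Definitions(resources=...). The helper names, the "raw-events" bucket, and the use of a STAGE value as the selector are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalStorageResource:
    """Points at the MinIO container."""

    endpoint_url: Optional[str] = "http://localhost:9000"


@dataclass(frozen=True)
class ProductionS3Resource:
    """Points at real S3; None lets the AWS SDK resolve the endpoint."""

    endpoint_url: Optional[str] = None


def resources_for(stage):
    """Map a deployment stage to the resources handed to the pipeline."""
    storage = ProductionS3Resource() if stage == "prod" else LocalStorageResource()
    return {"storage": storage}


def raw_events_location(storage, bucket="raw-events"):
    """Stand-in for an asset body: it sees only `storage`, never the stage."""
    target = storage.endpoint_url or "AWS default endpoint"
    return f"s3://{bucket} via {target}"
```

Calling raw_events_location(resources_for(os.environ.get("STAGE", "dev"))["storage"]) is the whole trick: the stage is read once at the edge, and the asset code only ever sees a resource.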

The connective tissue: LocalStack. Not everything is storage and SQL. If your pipeline triggers a Lambda on file arrival, or reads a secret from Secrets Manager, LocalStack provides a local endpoint for those too. Install awscli-local and you get awslocal as a drop-in for aws, which is a small thing but genuinely useful when you are testing event-driven behaviour.
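As a small illustration, this is roughly how reading a secret from LocalStack looks with boto3. Port 4566 is LocalStack's default edge port and the "test" credentials are the usual dummy values. The function names, and the assumption that the secret is stored as a JSON string, are mine.

```python
import json

LOCALSTACK_ENDPOINT = "http://localhost:4566"  # LocalStack's default edge port


def localstack_client(service_name):
    """Create a boto3 client for any service LocalStack emulates."""
    import boto3  # assumes boto3 is installed

    return boto3.client(
        service_name,
        endpoint_url=LOCALSTACK_ENDPOINT,
        region_name="us-east-1",
        aws_access_key_id="test",  # dummy credentials for local use
        aws_secret_access_key="test",
    )


def read_json_secret(client, secret_id):
    """Fetch a secret and decode it, assuming it was stored as a JSON string."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

You could seed the secret with awslocal secretsmanager create-secret --name db/creds --secret-string '{"user":"dev"}' (db/creds being a made-up name) and then call read_json_secret(localstack_client("secretsmanager"), "db/creds"). In production the same read_json_secret works against a client built without the endpoint override.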

The structure that keeps this clean:

my-data-pipeline/
├── docker-compose.yml       # MinIO + LocalStack
├── dagster_project/
│   ├── resources.py         # Local vs Production resources
│   └── assets.py
├── dbt_project/
│   ├── profiles.yml         # dev (duckdb) and prod (redshift) targets
│   └── models/
└── terraform/               # mirrors local topology to cloud

When you are ready for a real environment, Terraform provisions the S3 buckets and the Redshift cluster, you flip STAGE=prod, and the code picks up IAM role credentials automatically, because the standard AWS SDKs resolve them through their default credential chain when no explicit keys are configured. The point is not to avoid AWS forever, but to not need it until you actually need it.
