Enterprise Data Lake Migration to Cloud Architecture

Lead Data Engineer · 2024 · 9 months · 8 people · 10 min read

Migrated 15TB of data from on-premises Hadoop cluster to cloud-based data lake, reducing costs by 60% while improving query performance 10x

Overview

Led the migration of a legacy on-premises Hadoop-based data warehouse to a modern cloud-native data lake architecture on AWS, enabling self-service analytics and machine learning workloads at scale

Problem

The company's data infrastructure was built on aging on-premises Hadoop clusters that were expensive to maintain, difficult to scale, and created bottlenecks for analytics teams. Data scientists waited days for environment provisioning, analysts struggled with complex SQL interfaces, and infrastructure costs were growing 40% year-over-year without corresponding capability improvements.

Constraints

  • Zero downtime for critical daily reports and dashboards
  • Must maintain data lineage and audit trails for compliance
  • Limited budget for new tooling; maximize use of managed services
  • Existing team had no cloud experience
  • Data governance and security requirements from Fortune 500 clients

Approach

Implemented a phased migration strategy starting with non-critical reporting data, building confidence before moving production datasets. Adopted a lakehouse architecture using S3 for storage, AWS Glue for ETL orchestration, Athena for SQL analytics, and EMR for Spark workloads. Established data quality gates and automated testing to ensure accuracy throughout migration.

Key Decisions

Adopt Parquet with partitioning over keeping raw CSV/JSON formats

Reasoning:

Parquet's columnar storage reduced storage by 85% and improved query performance 10-15x through column pruning and predicate pushdown. The investment in converting formats paid back within 3 months through reduced storage and compute costs.

Alternatives considered:
  • Keep existing CSV/JSON formats for easier compatibility
  • Use Avro for better schema evolution support
  • Adopt ORC format for Hive compatibility

Use AWS Glue for ETL orchestration instead of Airflow

Reasoning:

Glue's serverless model eliminated infrastructure management overhead and scaled automatically. While Airflow offered more flexibility, our team needed to focus on data quality and business logic rather than managing orchestration infrastructure.

Alternatives considered:
  • Deploy and manage Apache Airflow on EC2
  • Use AWS Step Functions with Lambda
  • Continue with existing on-prem schedulers

Implement medallion architecture (Bronze/Silver/Gold layers)

Reasoning:

Separating raw ingestion (Bronze), cleaned data (Silver), and business-level aggregates (Gold) created clear boundaries for data quality expectations and enabled different consumption patterns. Analysts could query Gold tables directly while data scientists accessed Silver for feature engineering.

Alternatives considered:
  • Single unified data layer
  • Traditional dimensional modeling with star schemas
  • Event-sourced architecture with rebuild capability

Tech Stack

  • AWS S3
  • AWS Glue
  • Amazon Athena
  • Amazon EMR
  • Apache Spark
  • Apache Parquet
  • Terraform
  • Python
  • dbt
  • Great Expectations

Result & Impact

  • Cost Reduction: 60% decrease in infrastructure costs
  • Query Performance: 10x faster for typical analytics queries
  • Storage Efficiency: 85% reduction through Parquet compression
  • Data Freshness: Daily → Hourly for most datasets
  • Time to Provision: 3 days → 15 minutes for new analytics environments

The migration transformed how the organization approaches data. Analysts gained self-service access to data through visual query builders and automated documentation. Data scientists reduced experimentation cycle time from weeks to days. Most importantly, the team shifted from firefighting infrastructure issues to building new capabilities that drove business value. The cloud architecture enabled new use cases like real-time fraud detection and personalized recommendations that were previously impossible.

Learnings

  • Starting with a pilot dataset (1TB of non-critical data) built team confidence and validated architecture choices before committing to full migration
  • Automated data quality checks caught 80% of migration issues before they reached production. Investing in Great Expectations framework paid dividends throughout the project.
  • Running parallel systems (old and new) for 6 weeks allowed gradual cutover and gave users time to adapt, but the extended overlap period cost more than expected
  • Schema-on-read sounds appealing but caused confusion. Adding a data catalog (AWS Glue Catalog) early would have prevented weeks of documentation scrambles
  • Costs are harder to predict in cloud than on-prem. We underestimated data transfer costs between services and needed to implement caching and regional strategies
  • Teaching team members cloud concepts required structured learning time, not just documentation links. Weekly working sessions were far more effective than self-paced learning

Migration Strategy and Phases

Phase 1: Foundation and Proof of Concept (Weeks 1-8)

We started by establishing the cloud infrastructure and validating our architecture with a representative dataset:

Infrastructure Setup:

  • Provisioned AWS Landing Zone with proper security controls
  • Established IAM roles following principle of least privilege
  • Set up VPC with private subnets for data processing
  • Configured S3 buckets with encryption, versioning, and lifecycle policies

Pilot Migration:

  • Selected customer transaction data (~1TB, well-understood schema)
  • Built ETL pipeline using Glue: CSV → Parquet with snappy compression
  • Implemented partitioning strategy: year/month/day/hour
  • Created Athena tables with partition projection for optimal query performance
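The partition-projection setup above can be sketched as DDL generated from Python; the table name, columns, and bucket are illustrative, not the production schema. With projection enabled, Athena computes partition locations from the table properties instead of the catalog, so new partitions are queryable without a repair step:

```python
def partition_projection_ddl(database: str, table: str, location: str) -> str:
    """Build a CREATE TABLE statement using Athena partition projection
    so new year/month/day partitions need no MSCK REPAIR or crawler run.
    Columns and ranges are illustrative."""
    return f"""
    CREATE EXTERNAL TABLE {database}.{table} (
        transaction_id string,
        customer_id string,
        amount double
    )
    PARTITIONED BY (year int, month int, day int)
    STORED AS PARQUET
    LOCATION '{location}'
    TBLPROPERTIES (
        'projection.enabled' = 'true',
        'projection.year.type' = 'integer',
        'projection.year.range' = '2020,2030',
        'projection.month.type' = 'integer',
        'projection.month.range' = '1,12',
        'projection.day.type' = 'integer',
        'projection.day.range' = '1,31'
    )
    """
```

The generated statement can be run from the Athena console or submitted programmatically.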

Success Criteria Validation:

  • Query performance: 95th percentile queries completing under 10 seconds ✓
  • Cost targets: Processing costs under $0.10 per GB ✓
  • Data accuracy: 100% match with source systems ✓

This phase proved the architecture could handle our requirements and gave stakeholders confidence to proceed.

Phase 2: Incremental Dataset Migration (Weeks 9-24)

With validated architecture, we migrated datasets in priority order:

Migration Approach per Dataset:

# Typical migration pipeline structure
1. Extract from Hadoop HDFS → Stage in S3 Bronze layer
2. Data quality validation (row counts, schema checks, null analysis)
3. Transform to Parquet with partitioning → Silver layer
4. Create aggregations and business views → Gold layer
5. Run parallel queries (old vs new) to verify identical results
6. Switch production consumers to new pipeline
7. Monitor for 2 weeks before decommissioning old pipeline
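Step 5 of the playbook, the parallel-query comparison, can be sketched in plain Python; function and field names are illustrative, and in practice the same checks ran against Spark DataFrames:

```python
def verify_parity(old_rows, new_rows, sum_keys):
    """Compare row counts and keyed sums between the legacy pipeline's
    output and the migrated pipeline's output; any mismatch blocks cutover."""
    report = {
        "row_count_match": len(old_rows) == len(new_rows),
        "sum_mismatches": [],
    }
    for key in sum_keys:
        old_sum = sum(r[key] for r in old_rows)
        new_sum = sum(r[key] for r in new_rows)
        # Tolerate float noise but nothing larger
        if abs(old_sum - new_sum) > 1e-6:
            report["sum_mismatches"].append(key)
    report["ok"] = report["row_count_match"] and not report["sum_mismatches"]
    return report
```

A failing report meant the dataset stayed on the old pipeline until the discrepancy was explained.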

Priority Order:

  1. Operational Reports (weeks 9-12): Daily/weekly reports that drove business decisions
  2. Customer Analytics (weeks 13-16): User behavior, segmentation, lifetime value
  3. Financial Data (weeks 17-20): Revenue, forecasting, compliance reporting
  4. Machine Learning Datasets (weeks 21-24): Training data for recommendation systems

Each dataset migration followed the same playbook, allowing us to parallelize efforts across team members while maintaining consistency.

Phase 3: Advanced Capabilities (Weeks 25-36)

With data migrated, we focused on enabling new use cases:

Data Quality Framework:

  • Implemented Great Expectations for automated validation
  • Created 200+ data quality tests covering completeness, accuracy, consistency
  • Set up alerting for quality failures before bad data propagated downstream

Self-Service Analytics:

  • Deployed Redash for SQL-based exploration (later migrated to Tableau)
  • Created curated data marts for common analytical questions
  • Built automated documentation with dbt to surface table descriptions and lineage

Performance Optimization:

  • Analyzed query patterns and created materialized views for common joins
  • Implemented result caching in Athena (saved 30% on query costs)
  • Tuned Parquet file sizes to optimal 128-256MB for query performance
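The file-size tuning above boils down to choosing an output file count before writing; a minimal sketch, assuming a 192MB midpoint of the 128-256MB target range:

```python
def target_output_files(total_bytes: int, target_file_mb: int = 192) -> int:
    """Number of output files that keeps Parquet files near the
    128-256MB sweet spot; 192MB is the midpoint of that range."""
    target = target_file_mb * 1024 * 1024
    return max(1, round(total_bytes / target))
```

The result feeds `df.repartition(n)` (or `coalesce(n)`) before the Parquet write, avoiding both the small-files problem and oversized row groups.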

Cost Optimization:

  • Implemented S3 Intelligent-Tiering for automatic cost savings on infrequently accessed data
  • Set up lifecycle policies: Hot data (30 days) → Warm (90 days) → Cold (365 days) → Glacier
  • Created cost allocation tags to track spending by team and project
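The lifecycle tiering above can be expressed as an S3 lifecycle configuration; the storage-class mapping shown (Standard-IA at 30 days, Glacier Instant Retrieval at 90, Glacier at 365) is one plausible reading of the hot/warm/cold/archive tiers, and the prefix is illustrative:

```python
def lifecycle_rules(prefix: str = "bronze/") -> dict:
    """Lifecycle configuration implementing the 30/90/365-day tiers.
    The storage-class mapping is an assumption, not the exact policy."""
    return {
        "Rules": [
            {
                "ID": "tier-by-age",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }
```

The dictionary is the `LifecycleConfiguration` argument to `boto3.client("s3").put_bucket_lifecycle_configuration(...)`.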

Technical Architecture Deep Dive

Medallion Architecture Implementation

Bronze Layer (Raw Data):

  • Exact copy of source data in S3
  • Minimal transformations (e.g., decompression or file format standardization)
  • Maintains full history for reprocessing capabilities
  • Partition by ingestion date: s3://data-lake/bronze/source_system/table/ingest_date=2024-01-15/

Silver Layer (Cleaned & Conformed):

  • Validated and cleaned data
  • Standardized schema across similar datasets
  • Deduplicated records, applied business rules
  • Partitioned by business dimensions: s3://data-lake/silver/domain/table/year=2024/month=01/day=15/

Gold Layer (Business Aggregates):

  • Pre-aggregated metrics for common queries
  • Dimension tables with slowly changing dimension (SCD) tracking
  • Denormalized for query performance
  • Organized by business domain: s3://data-lake/gold/customer-analytics/daily-activity/date=2024-01-15/
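The three path conventions above can be captured in one small helper; bucket and argument names are illustrative:

```python
from datetime import date

def layer_path(layer: str, domain: str, table: str, d: date) -> str:
    """Build the S3 prefix for each medallion layer following the
    conventions above. `domain` is the source system for Bronze and the
    business domain for Silver/Gold."""
    base = f"s3://data-lake/{layer}/{domain}/{table}"
    if layer == "bronze":
        return f"{base}/ingest_date={d.isoformat()}/"
    if layer == "silver":
        return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    return f"{base}/date={d.isoformat()}/"  # gold
```

Centralizing path construction kept jobs and catalog definitions from drifting apart as new tables were added.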

ETL Pipeline Example: Customer Transaction Data

# Glue PySpark job for transforming Bronze → Silver
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, year, month, dayofmonth

spark = SparkSession.builder.appName("CustomerTransactionsSilver").getOrCreate()
glue_context = GlueContext(spark.sparkContext)

# Read from Bronze layer
bronze_df = spark.read.parquet("s3://data-lake/bronze/transactions/")

# Data quality and transformation
silver_df = (
    bronze_df
    # Remove duplicates based on transaction_id
    .dropDuplicates(["transaction_id"])
    # Filter out invalid records
    .filter(col("amount") > 0)
    .filter(col("customer_id").isNotNull())
    # Standardize data types
    .withColumn("transaction_date", to_date(col("timestamp")))
    # Add partition columns
    .withColumn("year", year(col("transaction_date")))
    .withColumn("month", month(col("transaction_date")))
    .withColumn("day", dayofmonth(col("transaction_date")))
)

# Write to Silver layer with partitioning
(
    silver_df
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3://data-lake/silver/transactions/")
)

# Note: a plain Spark write does not register new partitions in the Glue
# Catalog. Refresh them with a Glue crawler, or from Athena with:
#   MSCK REPAIR TABLE silver.customer_transactions

Data Quality Framework

We implemented automated quality checks at each layer boundary:

Bronze → Silver Quality Gates:

  • Schema validation: expected columns present with correct data types
  • Completeness: critical fields have < 1% null values
  • Uniqueness: primary keys have no duplicates
  • Freshness: data arrived within expected time window
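The four Bronze → Silver gates can be sketched in plain Python; in production these ran as Great Expectations suites, and the thresholds and argument names here are illustrative:

```python
from datetime import datetime, timedelta

def bronze_to_silver_gate(rows, expected_cols, key, critical_cols,
                          latest_ts, max_lag_hours=24):
    """Return the list of failed gate names (empty list means pass).
    A plain-Python sketch of the four checks, not the production suite."""
    failures = []
    # Schema: expected columns present on the records
    if rows and set(expected_cols) - set(rows[0]):
        failures.append("schema")
    # Completeness: critical fields have < 1% null values
    for col in critical_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) >= 0.01:
            failures.append(f"completeness:{col}")
    # Uniqueness: primary keys have no duplicates
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        failures.append("uniqueness")
    # Freshness: data arrived within the expected time window
    if datetime.utcnow() - latest_ts > timedelta(hours=max_lag_hours):
        failures.append("freshness")
    return failures
```

A non-empty failure list stopped the Silver write and raised an alert instead.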

Silver → Gold Quality Gates:

  • Referential integrity: foreign keys resolve correctly
  • Business rule validation: amounts, dates, status codes within expected ranges
  • Historical comparison: metrics within 10% of historical patterns (anomaly detection)

Failed quality checks blocked downstream propagation and triggered alerts to the data engineering team.
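The historical-comparison gate reduces to a simple band check against a baseline; a minimal sketch, assuming the baseline is a mean over recent runs (the production anomaly detection was richer than this):

```python
def within_historical_band(current, history, tol=0.10):
    """Silver -> Gold anomaly gate: pass if a metric is within `tol`
    (10% by default) of its historical mean."""
    if not history:
        return True  # no baseline yet; first runs pass by construction
    baseline = sum(history) / len(history)
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= tol
```

Metrics outside the band did not fail hard automatically; they were flagged for a human to confirm whether the shift was real or a pipeline defect.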

Cost Analysis and ROI

Before Migration (On-Premises Hadoop)

Annual Costs:

  • Hardware amortization: $180K
  • Data center hosting: $60K
  • Network bandwidth: $24K
  • Operations team (2 FTE): $240K
  • Software licenses: $36K

Total: $540K/year

Hidden Costs:

  • 3-day average wait time for new analytics environments
  • Frequent outages requiring weekend maintenance
  • Inability to scale for peak workloads (quarterly reporting)

After Migration (Cloud Data Lake)

Annual Costs:

  • S3 storage (compressed): $18K
  • Glue ETL: $36K
  • Athena queries: $48K
  • EMR on-demand for ML: $54K
  • Data transfer: $12K
  • Operations team (1.5 FTE): $180K

Total: $348K/year

Savings: $192K/year (36% cost reduction in year 1)

By year 2, further optimization and team efficiency gains increased savings to 60% compared to on-premises baseline.

Soft Benefits (Not Quantified Above)

  • Faster innovation: New analytics capabilities deployed in days vs months
  • Scalability: Handle 10x data growth without major architecture changes
  • Disaster recovery: Geographic replication provided business continuity
  • Compliance: Encryption and audit logging simplified regulatory requirements
  • Talent: Easier to recruit data engineers excited about modern cloud stack

Key Challenges and How We Solved Them

Challenge 1: Data Quality Degradation During Migration

Problem: Early migration runs showed 2-5% data loss due to encoding issues (Latin-1 vs UTF-8) and timestamp parsing errors.

Solution:

  • Created comprehensive profiling scripts that ran against both source and destination
  • Implemented byte-level comparison for sample records
  • Added explicit encoding detection and conversion in ETL
  • Built automated regression testing comparing row counts, sums, and distributions
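The explicit encoding handling can be sketched with nothing but the standard library; UTF-8 and Latin-1 were the two encodings behind the early data loss, and the fallback-flagging behavior shown here is an illustrative design, not the exact production code:

```python
def decode_with_fallback(raw: bytes):
    """Decode source bytes as UTF-8, falling back to Latin-1 (the two
    encodings that caused the early 2-5% loss). Returns (text, encoding)
    so fallback records can be audited rather than silently dropped."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte value, so this branch never raises;
        # the returned tag marks the record for review.
        return raw.decode("latin-1"), "latin-1"
```

Tagging the encoding on each record made the regression tests meaningful: row counts matched, and any Latin-1 decodes showed up in the audit rather than as mojibake downstream.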

Challenge 2: Cost Spiraling for Complex Queries

Problem: Some analytical queries were scanning terabytes of data, generating $50-$100 in charges per query. Costs were trending 3x higher than budget.

Solution:

  • Implemented query result caching in Athena (hit rate improved to 45%)
  • Educated users on writing partition-aware queries
  • Created materialized aggregates for expensive joins
  • Set up Athena workgroup quotas to cap spending per team
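The per-team spending cap maps onto Athena's per-query scan cutoff; a sketch of the request an admin script might build (the workgroup name is illustrative, and `BytesScannedCutoffPerQuery` caps a single query's scan, with budget alarms covering aggregate spend):

```python
def workgroup_request(name: str, max_scanned_gb: int) -> dict:
    """kwargs for athena.create_work_group that enforce a per-query
    scan cap. One workgroup per team gives per-team enforcement."""
    return {
        "Name": name,
        "Configuration": {
            # Fail any query that would scan more than the cap
            "BytesScannedCutoffPerQuery": max_scanned_gb * 1024 ** 3,
            # Prevent users from overriding the workgroup settings
            "EnforceWorkGroupConfiguration": True,
        },
    }
```

The dictionary is passed as `boto3.client("athena").create_work_group(**workgroup_request("analytics-team", 1))`.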

Challenge 3: Knowledge Transfer and Adoption

Problem: Team members comfortable with Hive/Hadoop struggled with cloud concepts and new tools. Adoption was slower than expected.

Solution:

  • Created internal certification program: AWS Cloud Practitioner → Associate-level certifications
  • Weekly “Cloud Office Hours” for hands-on help
  • Pair programming sessions between cloud-experienced and cloud-learning team members
  • Built comprehensive documentation with runbooks for common tasks

Challenge 4: Schema Evolution and Breaking Changes

Problem: Source systems occasionally changed schemas without notice, breaking downstream pipelines.

Solution:

  • Implemented schema versioning in Glue Catalog
  • Created schema evolution rules (e.g., new columns allowed, dropped columns require approval)
  • Built monitoring to detect schema changes within 1 hour of occurrence
  • Established data contracts with upstream system owners
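The evolution rules reduce to classifying the diff between an expected and an observed schema; a minimal sketch, with schemas modeled as column-to-type dictionaries (the real check read both sides from the Glue Catalog):

```python
def diff_schema(expected: dict, observed: dict) -> dict:
    """Classify schema drift under the evolution rules above: new
    columns are allowed automatically; dropped or retyped columns
    require approval before pipelines proceed."""
    added = {c for c in observed if c not in expected}
    dropped = {c for c in expected if c not in observed}
    retyped = {c for c in expected
               if c in observed and expected[c] != observed[c]}
    return {
        "allowed": not dropped and not retyped,
        "added": added,
        "dropped": dropped,
        "retyped": retyped,
    }
```

Running this on every ingest batch is what made the one-hour detection window possible: a disallowed diff paused the pipeline and paged the owning team.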

Lessons Learned and Recommendations

What Worked Well

  1. Incremental migration de-risked the project. We could have gone faster with a “big bang” but would have sacrificed sleep and sanity.

  2. Parallel production runs for 2-4 weeks per dataset caught subtle differences (timezone handling, rounding) that unit tests missed.

  3. Data catalog investment (AWS Glue Catalog + dbt documentation) paid immediate dividends. Users could discover and understand data without asking the data team.

  4. Automated testing in CI/CD for data pipelines prevented regressions and gave confidence to refactor.

What We’d Do Differently

  1. Start with cost monitoring day 1. We didn’t implement comprehensive cost allocation for the first 3 months and lost visibility into spending patterns.

  2. Negotiate Reserved Capacity for predictable workloads earlier. We could have saved 30% on Glue costs.

  3. Invest more in user training upfront. We focused on migration mechanics and underinvested in change management. User adoption would have been smoother with better onboarding.

  4. Establish SLAs earlier. Without defined data freshness and quality SLAs, we fielded constant ad-hoc questions about “when will data be ready.”

  5. Document tribal knowledge during migration. Several on-prem quirks and workarounds weren’t written down, leading to confusion when original team members moved to other projects.

Looking Forward

This migration positioned the company for next-generation analytics:

  • Machine Learning at Scale: Data scientists now spin up EMR clusters on-demand, train models on full datasets (not samples), and iterate rapidly
  • Real-Time Analytics: We’re adding Kinesis streams for real-time ingestion, complementing the batch architecture
  • Data Mesh Evolution: Domains are beginning to own their data pipelines, with the platform team providing self-service tooling
  • BI Democratization: 200+ employees now query data directly, up from 15 before migration

The modern cloud architecture isn’t just about cost savings—it’s about enabling data-driven decision making at every level of the organization.