Enterprise Data Lake Migration to Cloud Architecture

Lead Data Engineer · 2024 · 9 months · 8 people · 10 min read

Migrated 15TB of data from on-premises Hadoop cluster to cloud-based data lake, reducing costs by 60% while improving query performance 10x

Overview

Led the migration of a legacy on-premises Hadoop-based data warehouse to a modern cloud-native data lake architecture on AWS, enabling self-service analytics and machine learning workloads at scale

Problem

The company's data infrastructure was built on aging on-premises Hadoop clusters that were expensive to maintain, difficult to scale, and created bottlenecks for analytics teams. Data scientists waited days for environment provisioning, analysts struggled with complex SQL interfaces, and infrastructure costs were growing 40% year-over-year without corresponding capability improvements.

Constraints

  • Zero downtime for critical daily reports and dashboards
  • Must maintain data lineage and audit trails for compliance
  • Limited budget for new tooling; maximize use of managed services
  • Existing team had no cloud experience
  • Data governance and security requirements from Fortune 500 clients

Approach

Implemented a phased migration strategy starting with non-critical reporting data, building confidence before moving production datasets. Adopted a lakehouse architecture using S3 for storage, AWS Glue for ETL orchestration, Athena for SQL analytics, and EMR for Spark workloads. Established data quality gates and automated testing to ensure accuracy throughout migration.

Key Decisions

Adopt Parquet with partitioning over keeping raw CSV/JSON formats

Reasoning:

Parquet's columnar storage reduced storage by 85% and improved query performance 10-15x through column pruning and predicate pushdown. The investment in converting formats paid back within 3 months through reduced storage and compute costs.

Alternatives considered:
  • Keep existing CSV/JSON formats for easier compatibility
  • Use Avro for better schema evolution support
  • Adopt ORC format for Hive compatibility

Use AWS Glue for ETL orchestration instead of Airflow

Reasoning:

Glue's serverless model eliminated infrastructure management overhead and scaled automatically. While Airflow offered more flexibility, our team needed to focus on data quality and business logic rather than managing orchestration infrastructure.

Alternatives considered:
  • Deploy and manage Apache Airflow on EC2
  • Use AWS Step Functions with Lambda
  • Continue with existing on-prem schedulers

Implement medallion architecture (Bronze/Silver/Gold layers)

Reasoning:

Separating raw ingestion (Bronze), cleaned data (Silver), and business-level aggregates (Gold) created clear boundaries for data quality expectations and enabled different consumption patterns. Analysts could query Gold tables directly while data scientists accessed Silver for feature engineering.

Alternatives considered:
  • Single unified data layer
  • Traditional dimensional modeling with star schemas
  • Event-sourced architecture with rebuild capability

Tech Stack

  • AWS S3
  • AWS Glue
  • Amazon Athena
  • Amazon EMR
  • Apache Spark
  • Apache Parquet
  • Terraform
  • Python
  • dbt
  • Great Expectations

Result & Impact

  • Cost Reduction: 60% decrease in infrastructure costs
  • Query Performance: 10x faster for typical analytics queries
  • Storage Efficiency: 85% reduction through Parquet compression
  • Data Freshness: Daily → Hourly for most datasets
  • Time to Provision: 3 days → 15 minutes for new analytics environments

The migration transformed how the organization approaches data. Analysts gained self-service access to data through visual query builders and automated documentation. Data scientists reduced experimentation cycle time from weeks to days. Most importantly, the team shifted from firefighting infrastructure issues to building new capabilities that drove business value. The cloud architecture enabled new use cases like real-time fraud detection and personalized recommendations that were previously impossible.

Learnings

  • Starting with a pilot dataset (1TB of non-critical data) built team confidence and validated architecture choices before committing to full migration
  • Automated data quality checks caught 80% of migration issues before they reached production. Investing in Great Expectations framework paid dividends throughout the project.
  • Running parallel systems (old and new) for 6 weeks allowed gradual cutover and gave users time to adapt, but the extended overlap period cost more than expected
  • Schema-on-read sounds appealing but caused confusion. Adding a data catalog (AWS Glue Catalog) early would have prevented weeks of documentation scrambles
  • Costs are harder to predict in cloud than on-prem. We underestimated data transfer costs between services and needed to implement caching and regional strategies
  • Teaching team members cloud concepts required structured learning time, not just documentation links. Weekly working sessions were far more effective than self-paced learning

Migration Strategy and Phases

Phase 1: Foundation and Proof of Concept (Weeks 1-8)

We started by establishing the cloud infrastructure and validating our architecture with a representative dataset:

Infrastructure Setup:

  • Provisioned AWS Landing Zone with proper security controls
  • Established IAM roles following principle of least privilege
  • Set up VPC with private subnets for data processing
  • Configured S3 buckets with encryption, versioning, and lifecycle policies

Pilot Migration:

  • Selected customer transaction data (~1TB, well-understood schema)
  • Built ETL pipeline using Glue: CSV → Parquet with snappy compression
  • Implemented partitioning strategy: year/month/day/hour
  • Created Athena tables with partition projection for optimal query performance
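The partition-projection setup above can be sketched as DDL generated from Python; the table name, columns, and bucket are illustrative, not the production schema. With projection enabled, Athena computes partition locations from the table properties instead of the catalog, so new partitions are queryable without a repair step:

```python
def partition_projection_ddl(database: str, table: str, location: str) -> str:
    """Build a CREATE TABLE statement using Athena partition projection
    so new year/month/day partitions need no MSCK REPAIR or crawler run.
    Columns and ranges are illustrative."""
    return f"""
    CREATE EXTERNAL TABLE {database}.{table} (
        transaction_id string,
        customer_id string,
        amount double
    )
    PARTITIONED BY (year int, month int, day int)
    STORED AS PARQUET
    LOCATION '{location}'
    TBLPROPERTIES (
        'projection.enabled' = 'true',
        'projection.year.type' = 'integer',
        'projection.year.range' = '2020,2030',
        'projection.month.type' = 'integer',
        'projection.month.range' = '1,12',
        'projection.day.type' = 'integer',
        'projection.day.range' = '1,31'
    )
    """
```

The generated statement can be run from the Athena console or submitted programmatically.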

Success Criteria Validation:

  • Query performance: 95th percentile queries completing under 10 seconds ✓
  • Cost targets: Processing costs under $0.10 per GB ✓
  • Data accuracy: 100% match with source systems ✓

This phase proved the architecture could handle our requirements and gave stakeholders confidence to proceed.

Phase 2: Incremental Dataset Migration (Weeks 9-24)

With validated architecture, we migrated datasets in priority order:

Migration Approach per Dataset:

# Typical migration pipeline structure
1. Extract from Hadoop HDFS → Stage in S3 Bronze layer
2. Data quality validation (row counts, schema checks, null analysis)
3. Transform to Parquet with partitioning → Silver layer
4. Create aggregations and business views → Gold layer
5. Run parallel queries (old vs new) to verify identical results
6. Switch production consumers to new pipeline
7. Monitor for 2 weeks before decommissioning old pipeline
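Step 5 of the playbook, the parallel-query comparison, can be sketched in plain Python; function and field names are illustrative, and in practice the same checks ran against Spark DataFrames:

```python
def verify_parity(old_rows, new_rows, sum_keys):
    """Compare row counts and keyed sums between the legacy pipeline's
    output and the migrated pipeline's output; any mismatch blocks cutover."""
    report = {
        "row_count_match": len(old_rows) == len(new_rows),
        "sum_mismatches": [],
    }
    for key in sum_keys:
        old_sum = sum(r[key] for r in old_rows)
        new_sum = sum(r[key] for r in new_rows)
        # Tolerate float noise but nothing larger
        if abs(old_sum - new_sum) > 1e-6:
            report["sum_mismatches"].append(key)
    report["ok"] = report["row_count_match"] and not report["sum_mismatches"]
    return report
```

A failing report meant the dataset stayed on the old pipeline until the discrepancy was explained.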

Priority Order:

  1. Operational Reports (weeks 9-12): Daily/weekly reports that drove business decisions
  2. Customer Analytics (weeks 13-16): User behavior, segmentation, lifetime value
  3. Financial Data (weeks 17-20): Revenue, forecasting, compliance reporting
  4. Machine Learning Datasets (weeks 21-24): Training data for recommendation systems

Each dataset migration followed the same playbook, allowing us to parallelize efforts across team members while maintaining consistency.

Phase 3: Advanced Capabilities (Weeks 25-36)

With data migrated, we focused on enabling new use cases:

Data Quality Framework:

  • Implemented Great Expectations for automated validation
  • Created 200+ data quality tests covering completeness, accuracy, consistency
  • Set up alerting for quality failures before bad data propagated downstream

Self-Service Analytics:

  • Deployed Redash for SQL-based exploration (later migrated to Tableau)
  • Created curated data marts for common analytical questions
  • Built automated documentation with dbt to surface table descriptions and lineage

Performance Optimization:

  • Analyzed query patterns and created materialized views for common joins
  • Implemented result caching in Athena (saved 30% on query costs)
  • Tuned Parquet file sizes to optimal 128-256MB for query performance
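The file-size tuning above boils down to choosing an output file count before writing; a minimal sketch, assuming a 192MB midpoint of the 128-256MB target range:

```python
def target_output_files(total_bytes: int, target_file_mb: int = 192) -> int:
    """Number of output files that keeps Parquet files near the
    128-256MB sweet spot; 192MB is the midpoint of that range."""
    target = target_file_mb * 1024 * 1024
    return max(1, round(total_bytes / target))
```

The result feeds `df.repartition(n)` (or `coalesce(n)`) before the Parquet write, avoiding both the small-files problem and oversized row groups.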

Cost Optimization:

  • Implemented S3 Intelligent-Tiering for automatic cost savings on infrequently accessed data
  • Set up lifecycle policies: Hot data (30 days) → Warm (90 days) → Cold (365 days) → Glacier
  • Created cost allocation tags to track spending by team and project
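The lifecycle tiering above can be expressed as an S3 lifecycle configuration; the storage-class mapping shown (Standard-IA at 30 days, Glacier Instant Retrieval at 90, Glacier at 365) is one plausible reading of the hot/warm/cold/archive tiers, and the prefix is illustrative:

```python
def lifecycle_rules(prefix: str = "bronze/") -> dict:
    """Lifecycle configuration implementing the 30/90/365-day tiers.
    The storage-class mapping is an assumption, not the exact policy."""
    return {
        "Rules": [
            {
                "ID": "tier-by-age",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }
```

The dictionary is the `LifecycleConfiguration` argument to `boto3.client("s3").put_bucket_lifecycle_configuration(...)`.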

Technical Architecture Deep Dive

Medallion Architecture Implementation

Bronze Layer (Raw Data):

  • Exact copy of source data in S3
  • Minimal transformations (e.g., decompression or file format standardization)
  • Maintains full history for reprocessing capabilities
  • Partition by ingestion date: s3://data-lake/bronze/source_system/table/ingest_date=2024-01-15/

Silver Layer (Cleaned & Conformed):

  • Validated and cleaned data
  • Standardized schema across similar datasets
  • Deduplicated records, applied business rules
  • Partitioned by business dimensions: s3://data-lake/silver/domain/table/year=2024/month=01/day=15/

Gold Layer (Business Aggregates):

  • Pre-aggregated metrics for common queries
  • Dimension tables with slowly changing dimension (SCD) tracking
  • Denormalized for query performance
  • Organized by business domain: s3://data-lake/gold/customer-analytics/daily-activity/date=2024-01-15/
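The three path conventions above can be captured in one small helper; bucket and argument names are illustrative:

```python
from datetime import date

def layer_path(layer: str, domain: str, table: str, d: date) -> str:
    """Build the S3 prefix for each medallion layer following the
    conventions above. `domain` is the source system for Bronze and the
    business domain for Silver/Gold."""
    base = f"s3://data-lake/{layer}/{domain}/{table}"
    if layer == "bronze":
        return f"{base}/ingest_date={d.isoformat()}/"
    if layer == "silver":
        return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    return f"{base}/date={d.isoformat()}/"  # gold
```

Centralizing path construction kept jobs and catalog definitions from drifting apart as new tables were added.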

ETL Pipeline Example: Customer Transaction Data

# Glue PySpark job for transforming Bronze → Silver
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, year, month, dayofmonth

spark = SparkSession.builder.appName("CustomerTransactionsSilver").getOrCreate()
glue_context = GlueContext(spark.sparkContext)

# Read from Bronze layer
bronze_df = spark.read.parquet("s3://data-lake/bronze/transactions/")

# Data quality and transformation
silver_df = (
    bronze_df
    # Remove duplicates based on transaction_id
    .dropDuplicates(["transaction_id"])
    # Filter out invalid records
    .filter(col("amount") > 0)
    .filter(col("customer_id").isNotNull())
    # Standardize data types
    .withColumn("transaction_date", to_date(col("timestamp")))
    # Add partition columns
    .withColumn("year", year(col("transaction_date")))
    .withColumn("month", month(col("transaction_date")))
    .withColumn("day", dayofmonth(col("transaction_date")))
)

# Write to Silver layer with partitioning
(
    silver_df
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3://data-lake/silver/transactions/")
)

# Note: a plain Spark write does not register new partitions in the Glue
# Catalog. Refresh them with a Glue crawler, or from Athena with:
#   MSCK REPAIR TABLE silver.customer_transactions

Data Quality Framework

We implemented automated quality checks at each layer boundary:

Bronze → Silver Quality Gates:

  • Schema validation: expected columns present with correct data types
  • Completeness: critical fields have < 1% null values
  • Uniqueness: primary keys have no duplicates
  • Freshness: data arrived within expected time window
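The four Bronze → Silver gates can be sketched in plain Python; in production these ran as Great Expectations suites, and the thresholds and argument names here are illustrative:

```python
from datetime import datetime, timedelta

def bronze_to_silver_gate(rows, expected_cols, key, critical_cols,
                          latest_ts, max_lag_hours=24):
    """Return the list of failed gate names (empty list means pass).
    A plain-Python sketch of the four checks, not the production suite."""
    failures = []
    # Schema: expected columns present on the records
    if rows and set(expected_cols) - set(rows[0]):
        failures.append("schema")
    # Completeness: critical fields have < 1% null values
    for col in critical_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) >= 0.01:
            failures.append(f"completeness:{col}")
    # Uniqueness: primary keys have no duplicates
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        failures.append("uniqueness")
    # Freshness: data arrived within the expected time window
    if datetime.utcnow() - latest_ts > timedelta(hours=max_lag_hours):
        failures.append("freshness")
    return failures
```

A non-empty failure list stopped the Silver write and raised an alert instead.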

Silver → Gold Quality Gates:

  • Referential integrity: foreign keys resolve correctly
  • Business rule validation: amounts, dates, status codes within expected ranges
  • Historical comparison: metrics within 10% of historical patterns (anomaly detection)

Failed quality checks blocked downstream propagation and triggered alerts to the data engineering team.
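The historical-comparison gate reduces to a simple band check against a baseline; a minimal sketch, assuming the baseline is a mean over recent runs (the production anomaly detection was richer than this):

```python
def within_historical_band(current, history, tol=0.10):
    """Silver -> Gold anomaly gate: pass if a metric is within `tol`
    (10% by default) of its historical mean."""
    if not history:
        return True  # no baseline yet; first runs pass by construction
    baseline = sum(history) / len(history)
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= tol
```

Metrics outside the band did not fail hard automatically; they were flagged for a human to confirm whether the shift was real or a pipeline defect.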

Cost Analysis and ROI

Before Migration (On-Premises Hadoop)

Annual Costs:

  • Hardware amortization: $180K
  • Data center hosting: $60K
  • Network bandwidth: $24K
  • Operations team (2 FTE): $240K
  • Software licenses: $36K

Total: $540K/year

Hidden Costs:

  • 3-day average wait time for new analytics environments
  • Frequent outages requiring weekend maintenance
  • Inability to scale for peak workloads (quarterly reporting)

After Migration (Cloud Data Lake)

Annual Costs:

  • S3 storage (compressed): $18K
  • Glue ETL: $36K
  • Athena queries: $48K
  • EMR on-demand for ML: $54K
  • Data transfer: $12K
  • Operations team (1.5 FTE): $180K

Total: $348K/year

Savings: $192K/year (36% cost reduction in year 1)

By year 2, further optimization and team efficiency gains increased savings to 60% compared to on-premises baseline.

Soft Benefits (Not Quantified Above)

  • Faster innovation: New analytics capabilities deployed in days vs months
  • Scalability: Handle 10x data growth without major architecture changes
  • Disaster recovery: Geographic replication provided business continuity
  • Compliance: Encryption and audit logging simplified regulatory requirements
  • Talent: Easier to recruit data engineers excited about modern cloud stack

Key Challenges and How We Solved Them

Challenge 1: Data Quality Degradation During Migration

Problem: Early migration runs showed 2-5% data loss due to encoding issues (Latin-1 vs UTF-8) and timestamp parsing errors.

Solution:

  • Created comprehensive profiling scripts that ran against both source and destination
  • Implemented byte-level comparison for sample records
  • Added explicit encoding detection and conversion in ETL
  • Built automated regression testing comparing row counts, sums, and distributions
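The explicit encoding handling can be sketched with nothing but the standard library; UTF-8 and Latin-1 were the two encodings behind the early data loss, and the fallback-flagging behavior shown here is an illustrative design, not the exact production code:

```python
def decode_with_fallback(raw: bytes):
    """Decode source bytes as UTF-8, falling back to Latin-1 (the two
    encodings that caused the early 2-5% loss). Returns (text, encoding)
    so fallback records can be audited rather than silently dropped."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte value, so this branch never raises;
        # the returned tag marks the record for review.
        return raw.decode("latin-1"), "latin-1"
```

Tagging the encoding on each record made the regression tests meaningful: row counts matched, and any Latin-1 decodes showed up in the audit rather than as mojibake downstream.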

Challenge 2: Cost Spiraling for Complex Queries

Problem: Some analytical queries were scanning terabytes of data, generating $50-$100 in charges per query. Costs were trending 3x higher than budget.

Solution:

  • Implemented query result caching in Athena (hit rate improved to 45%)
  • Educated users on writing partition-aware queries
  • Created materialized aggregates for expensive joins
  • Set up Athena workgroup quotas to cap spending per team
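The per-team spending cap maps onto Athena's per-query scan cutoff; a sketch of the request an admin script might build (the workgroup name is illustrative, and `BytesScannedCutoffPerQuery` caps a single query's scan, with budget alarms covering aggregate spend):

```python
def workgroup_request(name: str, max_scanned_gb: int) -> dict:
    """kwargs for athena.create_work_group that enforce a per-query
    scan cap. One workgroup per team gives per-team enforcement."""
    return {
        "Name": name,
        "Configuration": {
            # Fail any query that would scan more than the cap
            "BytesScannedCutoffPerQuery": max_scanned_gb * 1024 ** 3,
            # Prevent users from overriding the workgroup settings
            "EnforceWorkGroupConfiguration": True,
        },
    }
```

The dictionary is passed as `boto3.client("athena").create_work_group(**workgroup_request("analytics-team", 1))`.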

Challenge 3: Knowledge Transfer and Adoption

Problem: Team members comfortable with Hive/Hadoop struggled with cloud concepts and new tools. Adoption was slower than expected.

Solution:

  • Created internal certification program: AWS Cloud Practitioner → Associate-level certifications
  • Weekly “Cloud Office Hours” for hands-on help
  • Pair programming sessions between cloud-experienced and cloud-learning team members
  • Built comprehensive documentation with runbooks for common tasks

Challenge 4: Schema Evolution and Breaking Changes

Problem: Source systems occasionally changed schemas without notice, breaking downstream pipelines.

Solution:

  • Implemented schema versioning in Glue Catalog
  • Created schema evolution rules (e.g., new columns allowed, dropped columns require approval)
  • Built monitoring to detect schema changes within 1 hour of occurrence
  • Established data contracts with upstream system owners
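The evolution rules reduce to classifying the diff between an expected and an observed schema; a minimal sketch, with schemas modeled as column-to-type dictionaries (the real check read both sides from the Glue Catalog):

```python
def diff_schema(expected: dict, observed: dict) -> dict:
    """Classify schema drift under the evolution rules above: new
    columns are allowed automatically; dropped or retyped columns
    require approval before pipelines proceed."""
    added = {c for c in observed if c not in expected}
    dropped = {c for c in expected if c not in observed}
    retyped = {c for c in expected
               if c in observed and expected[c] != observed[c]}
    return {
        "allowed": not dropped and not retyped,
        "added": added,
        "dropped": dropped,
        "retyped": retyped,
    }
```

Running this on every ingest batch is what made the one-hour detection window possible: a disallowed diff paused the pipeline and paged the owning team.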

Lessons Learned and Recommendations

What Worked Well

  1. Incremental migration de-risked the project. We could have gone faster with a “big bang” but would have sacrificed sleep and sanity.

  2. Parallel production runs for 2-4 weeks per dataset caught subtle differences (timezone handling, rounding) that unit tests missed.

  3. Data catalog investment (AWS Glue Catalog + dbt documentation) paid immediate dividends. Users could discover and understand data without asking the data team.

  4. Automated testing in CI/CD for data pipelines prevented regressions and gave confidence to refactor.

What We’d Do Differently

  1. Start with cost monitoring day 1. We didn’t implement comprehensive cost allocation for the first 3 months and lost visibility into spending patterns.

  2. Negotiate Reserved Capacity for predictable workloads earlier. We could have saved 30% on Glue costs.

  3. Invest more in user training upfront. We focused on migration mechanics and underinvested in change management. User adoption would have been smoother with better onboarding.

  4. Establish SLAs earlier. Without defined data freshness and quality SLAs, we fielded constant ad-hoc questions about “when will data be ready.”

  5. Document tribal knowledge during migration. Several on-prem quirks and workarounds weren’t written down, leading to confusion when original team members moved to other projects.

Looking Forward

This migration positioned the company for next-generation analytics:

  • Machine Learning at Scale: Data scientists now spin up EMR clusters on-demand, train models on full datasets (not samples), and iterate rapidly
  • Real-Time Analytics: We’re adding Kinesis streams for real-time ingestion, complementing the batch architecture
  • Data Mesh Evolution: Domains are beginning to own their data pipelines, with the platform team providing self-service tooling
  • BI Democratization: 200+ employees now query data directly, up from 15 before migration

The modern cloud architecture isn’t just about cost savings—it’s about enabling data-driven decision making at every level of the organization.