Choosing Columnar Storage (Parquet) Over Row-Based Formats for Analytics Workloads

data-engineering · storage · analytics · performance · parquet

Our analytics platform was processing increasingly large datasets (growing from 500GB to 5TB annually) with queries focusing on aggregations across specific columns rather than retrieving full records. Traditional row-based storage (CSV, JSON) resulted in slow query performance, high storage costs, and inefficient data scanning. Query times for typical analytics operations ranged from 2-5 minutes, impacting business user productivity and limiting ad-hoc analysis capabilities.

Adopt Apache Parquet as the primary storage format for analytics data, with partitioning by date and key business dimensions.

Continue with CSV files stored in S3

Pros
  • Simplest format, no learning curve
  • Universal compatibility across all tools
  • Easy to inspect and debug
Cons
  • Poor compression (roughly 8-10x larger than equivalent Parquet)
  • Reads entire file even for selective queries
  • No schema enforcement or type safety
  • High data transfer and storage costs

Use Apache ORC (Optimized Row Columnar)

Pros
  • Strong compression comparable to Parquet
  • Good performance for Hive-based workflows
  • Built-in indexes and statistics
Cons
  • Less ecosystem support than Parquet
  • Primarily optimized for Java/Hadoop stack
  • Limited adoption outside big data ecosystem

Load everything into a data warehouse (Redshift/BigQuery)

Pros
  • Excellent query performance
  • Managed service, less operational overhead
  • Good for structured, frequently-queried data
Cons
  • High ongoing costs for large datasets
  • Less flexible for exploratory data science work
  • Vendor lock-in and data movement costs

Parquet offered the best balance of performance, cost efficiency, and ecosystem compatibility. Benchmarks showed 5-10x compression vs CSV, reducing storage from 5TB to 600GB. Query performance improved dramatically—column pruning and predicate pushdown reduced data scanned by 80-95%, bringing query times down from minutes to seconds. The format's wide adoption meant excellent support in Spark, Athena, Presto, Pandas, and modern analytics tools. Partitioning by date and key dimensions enabled query engines to skip irrelevant data entirely, further improving performance and reducing costs.

Background

As our analytics platform scaled, we faced a critical decision about data storage formats. Our data lake held structured and semi-structured datasets growing at 1TB+ per quarter, with business analysts, data scientists, and automated reports all querying the same underlying data.

The existing approach—storing data as compressed CSV and JSON files in S3—worked initially but created mounting problems:

  • Query Performance: Even simple aggregations scanned entire datasets, taking 2-5 minutes for operations that should complete in seconds
  • Storage Costs: Growing at $150/month just for storage, with data transfer costs adding another $200-300/month
  • Resource Waste: Queries pulled far more data than needed, consuming excessive CPU and memory
  • Schema Drift: No enforcement meant inconsistent data types and frequent parsing errors

We needed a format optimized for analytics: fast reads, efficient compression, and good ecosystem support.

The Parquet Advantage

Apache Parquet emerged as the clear choice for several reasons:

Columnar Storage Design

Unlike row-based formats that store entire records together, Parquet organizes data by column. This fundamentally changes performance characteristics:

Row-based (CSV):
[id, name, date, amount, category, ...]
[id, name, date, amount, category, ...]
[id, name, date, amount, category, ...]

Columnar (Parquet):
[id, id, id, ...]
[name, name, name, ...]
[date, date, date, ...]
[amount, amount, amount, ...]

Why this matters: Analytics queries typically aggregate or filter on a few columns across many rows. Columnar storage means reading only the columns you need, not entire records.
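The difference can be sketched in plain Python (this is an illustration of the access pattern, not Parquet itself): the same table laid out row-wise and column-wise, and why an aggregation over one column is cheaper in the columnar layout.

```python
# Row-based: every record is stored (and read) as a whole.
rows = [
    {"id": 1, "name": "a", "amount": 10.0, "category": "x"},
    {"id": 2, "name": "b", "amount": 20.0, "category": "y"},
    {"id": 3, "name": "c", "amount": 30.0, "category": "x"},
]

# Columnar: each column is a contiguous array.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
    "category": ["x", "y", "x"],
}

# SUM(amount): the row layout forces a pass through every field of every
# record; the columnar layout reads exactly one array out of four.
total_row = sum(r["amount"] for r in rows)  # touches all 4 fields per row
total_col = sum(columns["amount"])          # touches 1 of 4 columns

print(total_row, total_col)  # both 60.0
```

On disk the effect is the same: a query selecting 2 of 20 columns reads roughly a tenth of the bytes from a columnar file, while a row-based file must be scanned in full.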

Compression Benefits

Parquet achieves exceptional compression ratios because similar data types compress well together:

  • Before (CSV): 5TB compressed with gzip
  • After (Parquet): 600GB with Snappy compression (8.3x reduction)
  • Additional benefit: Snappy decompression is faster than gzip, improving query performance
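One reason same-typed column data compresses so well is dictionary encoding, which Parquet applies per column. A rough stdlib sketch (the column values are illustrative): a low-cardinality string column collapses into a small dictionary plus one-byte codes, before any general-purpose compressor even runs.

```python
# Toy dictionary encoding of a low-cardinality column, similar in spirit
# to what Parquet does per column chunk. Values are illustrative.
category = ["us-east", "us-west", "us-east", "eu-central", "us-east"] * 2000

# Store each distinct value once, then a small integer code per row.
dictionary = sorted(set(category))
code_of = {value: i for i, value in enumerate(dictionary)}
codes = bytes(code_of[v] for v in category)  # 1 byte per row here

raw = "\n".join(category).encode()

# The encoded form is several times smaller than the raw strings, and a
# compressor like Snappy or gzip then shrinks it further.
print(len(raw), len(codes))
```

Row-based CSV interleaves strings, numbers, and dates, so the compressor never sees these long runs of similar values; that is where much of the 8.3x gap comes from.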

Performance Improvements

Real-world query performance gains:

  • Aggregations: 2.5 minutes → 8 seconds (95% improvement)
  • Filtered queries: 3 minutes → 4 seconds (97% improvement)
  • Column selection: Full scan → only relevant columns read (80-95% less data)

Partitioning Strategy

We implemented a partitioning scheme that aligned with query patterns:

s3://data-lake/analytics/
  events/
    year=2026/
      month=02/
        day=19/
          region=us-east/
            part-00000.parquet
            part-00001.parquet

Impact: Most queries filter by date and region. This structure allows query engines to skip entire partitions, reading only relevant files.
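The pruning logic itself is simple enough to sketch: with Hive-style `key=value` path segments like those above, an engine can discard whole partitions by inspecting paths alone, before opening a single file. Paths and predicate values below are illustrative.

```python
def parse_partitions(path: str) -> dict:
    """Extract key=value partition segments from an S3-style path."""
    return dict(
        seg.split("=", 1)
        for seg in path.split("/")
        if "=" in seg
    )

paths = [
    "s3://data-lake/analytics/events/year=2026/month=02/day=19/region=us-east/part-00000.parquet",
    "s3://data-lake/analytics/events/year=2026/month=02/day=19/region=eu-west/part-00000.parquet",
    "s3://data-lake/analytics/events/year=2026/month=02/day=18/region=us-east/part-00000.parquet",
]

# WHERE day = '19' AND region = 'us-east' → only one file survives;
# the other partitions are never opened at all.
predicate = {"day": "19", "region": "us-east"}
survivors = [
    p for p in paths
    if all(parse_partitions(p).get(k) == v for k, v in predicate.items())
]
print(len(survivors))  # 1
```

The corollary is that partition keys must match real query filters; partitioning on a column nobody filters by adds file-count overhead with no pruning benefit.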

Implementation Approach

Data Pipeline Changes

  1. Ingestion Layer: Modified Spark jobs to write Parquet instead of CSV
  2. Schema Management: Established schema registry and validation
  3. Migration: Backfilled historical data (a one-week process for 5TB)
  4. Query Layer: Updated Athena tables to point to Parquet locations

Code Example

Simple Spark conversion:

# Old approach: gzipped CSV, scanned in full by every query
df.write.format("csv") \
    .option("compression", "gzip") \
    .save("s3://bucket/data")

# New approach: Snappy-compressed Parquet, partitioned to match query
# patterns (the partition columns must exist in the DataFrame)
df.write.format("parquet") \
    .option("compression", "snappy") \
    .partitionBy("year", "month", "day", "region") \
    .mode("append") \
    .save("s3://bucket/data")

Schema Evolution

Parquet handles schema changes gracefully:

  • Add columns: New columns appear as null in old files
  • Remove columns: Query engines simply don’t read them
  • Type changes: Require careful migration planning
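The "add columns" case can be illustrated with a toy reader (plain Python, not the actual Parquet machinery): files written under the old schema are read back with nulls for columns they never had, which is roughly how engines merge schemas across files.

```python
# Records as written by two generations of the pipeline (illustrative).
old_file = [{"id": 1, "amount": 10.0}]                        # before the change
new_file = [{"id": 2, "amount": 20.0, "region": "us-east"}]   # after

merged_schema = ["id", "amount", "region"]

def read_with_schema(records, schema):
    """Project records onto a schema; missing columns surface as None (null)."""
    return [{col: r.get(col) for col in schema} for r in records]

rows = read_with_schema(old_file, merged_schema) + \
       read_with_schema(new_file, merged_schema)
print(rows[0]["region"], rows[1]["region"])  # None us-east
```

Type changes get no such free ride: a column that flips from int to string across files cannot be reconciled automatically, which is why they need explicit migration planning.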

Results After 6 Months

Cost Reduction

  • Storage costs: $150/month → $20/month (87% reduction)
  • Data transfer: $250/month → $40/month (84% reduction)
  • Compute costs: 30% reduction due to faster query completion

Performance Improvements

  • Average query time: 2.5 minutes → 12 seconds
  • 95th percentile: 5 minutes → 30 seconds
  • Concurrent queries: Limited by resource contention → 5-10x more throughput

Business Impact

  • Ad-hoc analysis: Enabled real-time exploration instead of scheduled reports
  • Dashboard refresh: 15 minutes → 45 seconds
  • Data science workflows: Faster experimentation cycles
  • User satisfaction: Analysts could iterate on queries instead of waiting

Key Learnings

  1. Format matters more than expected: 10x improvements are achievable with the right storage format
  2. Partitioning is critical: Even with Parquet, poor partitioning negates benefits
  3. Schema enforcement pays off: Catching type errors at write-time vs query-time saves hours of debugging
  4. Ecosystem support matters: Parquet’s wide adoption meant seamless tool integration
  5. Migration is manageable: Parallel old/new systems during transition worked well

When Parquet Isn’t the Answer

Not every use case benefits from columnar storage:

  • Transactional workloads: Row-based formats better for record-level updates
  • Small datasets: Overhead not worth it for files under 100MB
  • Need human readability: CSV/JSON better for quick inspection
  • Real-time streaming: May want Avro or Protocol Buffers for low latency

For analytics on large structured datasets, Parquet is hard to beat.