Case Study | Mar 27, 2026 | 3 min read

Snap! Raise: 86% OpenSearch Cost Reduction

This case study shows how Snap! Raise cut OpenSearch costs by 86% while scaling safely, using orphan-index cleanup, shard discipline, and proactive alerting.

Introduction

$16,000 per month on OpenSearch. Eighteen data nodes. Nearly 16,000 shards. And only 14 GB of real data on disk. When Snap! Raise shared their AWS bill, the mismatch between provisioned capacity and actual need was dramatic. Over five months, we reduced monthly cost to $2,300 for the primary cluster while expanding total platform capacity from 6,000 to 30,000 stores.

The Challenge

Snap! Raise is one of the largest digital fundraising platforms for youth athletics in the US, supporting fundraising operations for more than 150,000 schools and teams. Their Magento-based e-commerce platform used AWS OpenSearch for product search across thousands of stores.

Monthly cost: $16,000
Cluster shape: 18 data nodes (r6g.xlarge, 32 GiB RAM each) + 5 master nodes
Total memory and storage: 318 GiB RAM, 1.75 TB allocated disk
Shard count: 15,786 across 7,884 indices
Actual disk usage: 14 GB

The cluster was using less than 1% of allocated storage, but billing reflected provisioned infrastructure, not true utilization.

Diagnosis

The first check was the output of _cat/indices, sorted by size. The pattern was immediate: Snap! Raise followed an index-per-store model, with a separate index for every store. Magento created a new versioned index on each catalog update and switched the alias to it, but the old versions were never removed.
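A minimal sketch of that first check, assuming the cluster is reachable over its REST endpoint. The endpoint URL and the sample rows are hypothetical; the `format`, `bytes`, `h`, and `s` query parameters are standard options of the `_cat` APIs. The summary function works on the parsed JSON rows, where every value arrives as a string.

```python
from urllib.parse import urlencode


def cat_indices_url(endpoint: str) -> str:
    """Build a _cat/indices request that returns JSON rows, largest index first."""
    params = {
        "format": "json",
        "bytes": "b",  # report sizes as plain byte counts
        "h": "index,docs.count,pri,rep,store.size",
        "s": "store.size:desc",
    }
    return f"{endpoint.rstrip('/')}/_cat/indices?{urlencode(params, safe=',:')}"


def summarize(rows: list[dict]) -> dict:
    """Total indices, shards (primaries plus replica copies), and bytes on disk."""
    shards = 0
    size_bytes = 0
    for row in rows:
        primaries = int(row.get("pri") or 0)
        replicas = int(row.get("rep") or 0)
        shards += primaries * (1 + replicas)
        size_bytes += int(row.get("store.size") or 0)
    return {"indices": len(rows), "shards": shards, "bytes": size_bytes}


# Hypothetical rows in the shape the JSON response uses.
sample_rows = [
    {"index": "store_a_v2", "docs.count": "10", "pri": "1", "rep": "1", "store.size": "2000"},
    {"index": "store_a_v1", "docs.count": "0", "pri": "1", "rep": "1", "store.size": "500"},
]
summary = summarize(sample_rows)
url = cat_indices_url("https://search.example.com/")
```

A large gap between the shard total and the bytes on disk is the signal to look for: many shards, very little data.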

That drove index sprawl. The cluster had 7,884 indices, while roughly 1,740 were actually needed. Each orphaned index carried two shards (primary + replica), inflating shard count over time.

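AWS sizing guidance for OpenSearch is to stay at or below roughly 25 shards per GiB of JVM heap; we used 25 shards per GB of node memory as the working ratio. By that measure, this cluster was running close to 50 shards/GB. More nodes had been added to absorb the overhead, and costs kept climbing.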
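The density figure can be reproduced from the numbers in The Challenge. This is only a worked calculation using the 15,786 shards, 318 GiB of memory, and 25 shards/GB ratio quoted in this case study; the function names are illustrative.

```python
def shards_per_gb(total_shards: int, memory_gb: float) -> float:
    """Observed shard density for the cluster."""
    return total_shards / memory_gb


def shard_budget(memory_gb: float, ratio: float = 25.0) -> int:
    """Shard count the cluster can hold at the given shards-per-GB ratio."""
    return int(memory_gb * ratio)


density = shards_per_gb(15_786, 318)  # about 49.6 shards/GB
budget = shard_budget(318)            # 7,950 shards at 25 shards/GB
overshoot = 15_786 / budget           # roughly twice the budget
print(f"{density:.1f} shards/GB, budget {budget}, {overshoot:.2f}x over")
```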

The Solution

The optimization was delivered in three phases over five months.

Phase 1 — Right-Sizing (August 2022)

With 1,740 active indices at 1 primary + 1 replica, the target was 3,480 shards. At 25 shards/GB, the minimum requirement was about 139 GB of RAM. Three r6g.xlarge nodes in each of two availability zones (six nodes at 32 GiB each) provided 192 GB with practical headroom.
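The sizing arithmetic can be written out as a short calculation. It is a sketch under the assumptions stated in this section: 32 GiB per r6g.xlarge node, 25 shards/GB, and an even split of nodes across two availability zones. The helper names are illustrative.

```python
import math

NODE_RAM_GB = 32.0    # r6g.xlarge memory, as stated in The Challenge
SHARDS_PER_GB = 25.0  # working ratio used in this case study


def target_shards(active_indices: int, primaries: int = 1, replicas: int = 1) -> int:
    """Every primary shard has `replicas` additional copies."""
    return active_indices * primaries * (1 + replicas)


def nodes_needed(shards: int, zones: int = 2) -> tuple[float, int, int]:
    """Return (memory in GB, minimum node count, node count balanced across zones)."""
    memory_gb = shards / SHARDS_PER_GB
    minimum = math.ceil(memory_gb / NODE_RAM_GB)
    balanced = math.ceil(minimum / zones) * zones
    return memory_gb, minimum, balanced


shards = target_shards(1_740)
memory_gb, minimum, balanced = nodes_needed(shards)
total_ram_gb = balanced * NODE_RAM_GB
print(shards, round(memory_gb, 1), minimum, balanced, total_ram_gb)
# prints: 3480 139.2 5 6 192.0
```

Five nodes would cover the 139 GB minimum; rounding up to an even split across two zones gives 192 GB, the figure quoted above.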

Monthly cost dropped from $16,000 to $1,700 (about 89%).

Phase 2 — Controlled Scaling (September–November 2022)

September: Max stores increased from 6,000 to 14,000, cost at $3,900/month.
November: Max stores reached 18,000, cost at $4,900/month.

At this point, an automated cleanup script was introduced. It removed indices with no active alias, zero-document indices, and duplicate versions for the same store.
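The selection logic for such a script might look like the sketch below. This is not the script used in the engagement: the `_v<number>` naming pattern, the sample index names, and the choice to keep the highest surviving version per store are all assumptions made for illustration. It only decides what to delete; the deletion itself, ideally behind a dry-run flag, is left out.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <base>_v<number>, for example "magento2_product_1_v4".
VERSIONED = re.compile(r"^(?P<base>.+)_v(?P<version>\d+)$")


def select_for_deletion(indices, aliased):
    """Map index name -> reason for deletion.

    indices: iterable of (name, doc_count) pairs
    aliased: set of index names that currently have an alias pointing at them
    """
    doomed = {}
    survivors = defaultdict(list)
    for name, docs in indices:
        if name.startswith("."):
            continue  # never touch system indices
        if name not in aliased:
            doomed[name] = "no active alias"
        elif docs == 0:
            doomed[name] = "zero documents"
        else:
            match = VERSIONED.match(name)
            if match:
                survivors[match["base"]].append((int(match["version"]), name))
    # Among the remaining indices, keep only the newest version per store.
    for versions in survivors.values():
        versions.sort()
        for _, name in versions[:-1]:
            doomed[name] = "superseded version"
    return doomed


# Hypothetical inventory for three stores plus one system index.
inventory = [
    ("magento2_product_1_v3", 120),
    ("magento2_product_1_v4", 125),
    ("magento2_product_2_v1", 0),
    ("magento2_product_2_v2", 40),
    ("magento2_product_3_v7", 10),
    ("magento2_product_3_v8", 11),
    (".kibana_1", 5),
]
with_alias = {
    "magento2_product_1_v4",
    "magento2_product_2_v1",
    "magento2_product_2_v2",
    "magento2_product_3_v7",
    "magento2_product_3_v8",
}
plan = select_for_deletion(inventory, with_alias)
```

Returning a reason alongside each name keeps the plan auditable before anything is removed.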

Phase 3 — Final Optimization (November–December 2022)

After lifecycle control was in place, we ran load tests with full reindex cycles, production-level query traffic, and CPU/memory/latency tracking.

Result: the workload remained stable with fewer data nodes. We also added shard-threshold monitors running every 10 minutes with Slack notifications for early warning.
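A shard-threshold check of this kind can be sketched as follows. The case study does not say how the monitors were implemented, so this is one possible shape rather than the production setup: it reads the `active_shards` and `number_of_data_nodes` fields that `_cluster/health` returns, compares them with a budget at 25 shards/GB, and formats a message for a Slack incoming webhook. The 80% warning level, the 32 GiB per node, and the webhook URL are assumptions.

```python
import json
import urllib.request


def shard_alert(health, node_ram_gb=32.0, shards_per_gb=25.0, warn_at=0.8):
    """Return a warning string once shard usage reaches warn_at of the budget, else None."""
    budget = health["number_of_data_nodes"] * node_ram_gb * shards_per_gb
    used = health["active_shards"]
    if used < warn_at * budget:
        return None
    return (
        f"Shard warning: {used} active shards is {used / budget:.0%} "
        f"of the {budget:.0f}-shard budget"
    )


def post_to_slack(webhook_url, text):
    """Send the message to a Slack incoming webhook (the URL is supplied by the caller)."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical health snapshots; in practice these come from GET _cluster/health.
warning = shard_alert({"active_shards": 4000, "number_of_data_nodes": 6})
quiet = shard_alert({"active_shards": 3000, "number_of_data_nodes": 6})
```

Run on a 10-minute schedule, a check like this flags drift long before the shard count forces another round of node additions.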

Results

Metric	Before	After
prod1 monthly cost	$16,000	$2,300
Data nodes (prod1)	18	8
Max store capacity (prod1)	6,000	16,000
Total platform capacity	6,000	30,000
Cost reduction	—	86%
Index management	Manual	Automated cleanup + alerting

Total monthly cost across three clusters reached approximately $4,150 while supporting 30,000 stores.
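The headline percentages follow directly from the figures above. The per-store cost at the end is a derived number, not one stated in the results, and it assumes the $16,000 bill served the full 6,000-store capacity.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


prod1_reduction = pct_reduction(16_000, 2_300)   # 85.625, reported as 86%
phase1_reduction = pct_reduction(16_000, 1_700)  # 89.375, reported as about 89%

# Derived: monthly cost per store of capacity, before and after.
cost_per_store_before = 16_000 / 6_000   # about $2.67
cost_per_store_after = 4_150 / 30_000    # about $0.14

print(round(prod1_reduction), round(phase1_reduction))
print(round(cost_per_store_before, 2), round(cost_per_store_after, 2))
```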

Key Takeaways

Index sprawl is a hidden cost driver. Orphaned indices can create major over-provisioning even when data volume is small.
Shard-per-GB is an operational guardrail. Staying near the guideline helps control both cost and stability.
Automatic index creation requires automatic cleanup. Lifecycle management is mandatory in generated-index architectures.
Validate node count with testing. Capacity decisions should come from measured load, not assumptions.
Proactive monitoring prevents expensive drift. Simple shard alerts can stop recurring over-provisioning early.

Need help optimizing your Elasticsearch or OpenSearch clusters? Visit searchali.com.