Making Scientific Raster Data Accessible — A Cloud-Native Approach
I recently completed a platform for sharing scientific raster data—the kind of multi-dimensional datasets that climate scientists, earth observation researchers, and environmental analysts work with daily. These NetCDF files can be massive, difficult to query, and challenging to serve over the web. This project tackles those problems, transforming unwieldy data files into accessible, queryable resources through a modern cloud-native architecture.
The Problem: NetCDF Files Are Hard to Share
Scientific raster data typically comes in NetCDF format—a self-describing, multi-dimensional file format that’s excellent for archival but challenging for web access. A single file might contain temperature readings across space and time, with dimensions for latitude, longitude, time, and multiple variables. These files can be gigabytes or terabytes in size.
The challenges are clear:
- Size: Files too large to download casually
- Access patterns: Users want specific slices (a location’s timeseries, a map at one timestamp), not entire files
- Discoverability: Finding relevant datasets across collections is difficult
- Performance: Reading from NetCDF over HTTP is slow without optimization
Traditional approaches—FTP servers, direct file downloads—don’t scale for modern web applications or interactive analysis. Users need APIs that return exactly what they need: a map tile, a timeseries chart, a spatial subset.
┌─────────────┐
│    Users    │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│ CloudFront CDN  │  (Tile caching, HTTPS)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│       ALB       │  (HTTPS listener, path routing)
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐  ┌──────────┐
│ Tiles  │  │Timeseries│  (ECS Fargate services)
│Service │  │ Service  │
└───┬────┘  └────┬─────┘
    │            │
    │            ▼
    │       ┌─────────┐
    │       │  Dask   │  (Scheduler + Workers)
    │       │ Cluster │
    │       └────┬────┘
    │            │
    └───────┬────┴─────┐
            │          │
            ▼          ▼
      ┌────────┐  ┌──────────┐
      │   S3   │  │ DynamoDB │
      │ Zarr/  │  │  (STAC)  │
      │  COG   │  └──────────┘
      └────────┘
          ▲
          │
  ┌───────┴────────┐
  │   Ingestion    │
  │    Pipeline    │
  │ (Lambda + Step │
  │   Functions)   │
  └───────▲────────┘
          │
     ┌────┴────┐
     │   S3    │
     │   Raw   │
     │ NetCDF  │
     └─────────┘
High-Level Architecture Flow
The Solution: A Multi-Format, API-First Platform
The platform addresses these challenges through a dual-format storage strategy and automated processing pipeline. When a NetCDF file is uploaded, it’s automatically converted into two cloud-optimized formats:
Cloud-Optimized GeoTIFF (COG) for spatial queries:
- Optimized for reading specific spatial regions
- Supports efficient map tile generation
- Internal tiling and overviews for fast access at multiple zoom levels
Zarr for temporal queries:
- Chunked, compressed array storage
- Optimized for reading timeseries at specific locations
- Enables parallel processing across time dimensions
This dual-format approach means the right tool for each job: COG for “show me a map” and Zarr for “show me how this location changed over time.”
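As a rough illustration of what that conversion step can look like in Python, here is a minimal sketch using xarray, rioxarray, and rio-cogeo. The bucket, variable, and dimension names are placeholders, and writing Zarr straight to S3 assumes s3fs is installed; the real pipeline would run this per variable and timestamp.

```python
import xarray as xr
import rioxarray  # registers the .rio accessor used below
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# Open the source NetCDF lazily.
ds = xr.open_dataset("input.nc")

# Zarr copy: keep the full time axis in each chunk so that
# "one location, every timestamp" reads touch as few objects as possible.
ds.chunk({"time": -1, "lat": 256, "lon": 256}).to_zarr(
    "s3://processed-bucket/dataset.zarr", mode="w", consolidated=True
)

# COG copy: write one timestamp of one variable as GeoTIFF,
# then let rio-cogeo add internal tiling and overviews.
layer = ds["temperature"].isel(time=0).rio.write_crs("EPSG:4326")
layer.rio.to_raster("/tmp/layer.tif")
cog_translate("/tmp/layer.tif", "/tmp/layer_cog.tif", cog_profiles.get("deflate"))
```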
What the Platform Enables
For End Users:
- Interactive map tiles: Visualize any variable at any timestamp through standard web mapping libraries
- Timeseries extraction: Query any point location to get its complete temporal profile
- STAC catalog: Search and discover datasets by location, time, and metadata
- RESTful API: Standard HTTP endpoints that work with any client
For Data Providers:
- Automated ingestion: Drop NetCDF files in S3, everything else happens automatically
- Format conversion: NetCDF → Zarr + COG without manual intervention
- Metadata extraction: STAC items created and indexed automatically
- Scalable processing: Handles files from megabytes to gigabytes
![Placeholder: Screenshot of map tiles showing raster data visualization]
Architecture: How It Works
The platform is built on AWS using a serverless and container-based architecture that balances cost, performance, and maintainability.
Component Interaction Flow
Tile Request Flow:
- User requests tile → CloudFront (cache check)
- Cache miss → ALB → Tiles ECS Service
- Service queries the STAC catalog (DynamoDB) for the COG location
- Service reads COG from S3, renders tile
- Response cached at CloudFront edge
Timeseries Request Flow:
- User requests timeseries → CloudFront (no cache) → ALB → Timeseries ECS Service
- Service queries the STAC catalog (DynamoDB) for overlapping datasets
- Service submits Dask tasks to read Zarr from S3
- Dask workers process in parallel, aggregate results
- Service caches result in Redis, returns to user
Ingestion Flow:
- NetCDF uploaded to S3 raw bucket
- S3 event triggers Lambda
- Lambda starts Step Functions workflow
- Workflow orchestrates: NetCDF→Zarr conversion, COG generation, STAC item creation
- STAC item indexed in DynamoDB
Ingestion Pipeline
When a NetCDF file lands in S3:
- S3 Event Trigger → Lambda function validates the file
- Step Functions orchestrates the conversion workflow
- Conversion Tasks (Lambda or ECS) process the file:
  - NetCDF → Zarr (rechunked for optimal access)
  - NetCDF → COG (with overviews for multiple zoom levels)
- STAC Creation extracts metadata (bounding box, temporal extent, variables)
- DynamoDB Indexing makes the dataset searchable
The entire pipeline is event-driven and scales automatically. Upload one file or one hundred—the system handles it.
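The trigger itself stays small. Below is a hedged sketch of the S3-to-Step-Functions hand-off; the STATE_MACHINE_ARN environment variable is a placeholder, and file validation is reduced to an extension check.

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Triggered by S3 ObjectCreated events on the raw NetCDF bucket."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Only hand NetCDF files to the conversion workflow.
        if not key.lower().endswith(".nc"):
            continue

        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # placeholder env var
            input=json.dumps({"bucket": bucket, "key": key}),
        )

    return {"statusCode": 200}
```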
API Layer
The API runs on ECS Fargate with two primary endpoints:
Tiles Service (/tiles/{collection}/{z}/{x}/{y}.png):
- Reads from COG files in S3
- Renders map tiles on-demand
- Cached by CloudFront CDN for global performance
- Supports multiple resampling methods and color ramps
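Tile rendering with rio-tiler is compact enough to sketch here. The COG path is hypothetical; the real service would first resolve the COG location from the catalog and apply color ramps and resampling options.

```python
from rio_tiler.io import Reader

def render_tile(cog_url: str, z: int, x: int, y: int) -> bytes:
    """Read one XYZ tile from a COG on S3 and return it as PNG bytes."""
    with Reader(cog_url) as cog:
        tile = cog.tile(x, y, z)  # reads only the bytes needed for this tile
        return tile.render(img_format="PNG")

# Hypothetical COG produced by the ingestion pipeline.
png = render_tile("s3://processed-bucket/dataset/temperature_2020-01-01.tif", 4, 8, 5)
```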
Timeseries Service (/api/timeseries):
- Queries Zarr stores for point-based temporal data
- Uses Dask for distributed processing of large queries
- Caches results in Redis for repeated queries
- Returns JSON with timestamps and values
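A minimal sketch of that path, assuming xarray with s3fs for the Zarr read and redis-py for caching; the Zarr key, variable name, dimension names, and Redis endpoint are placeholders.

```python
import json

import redis
import xarray as xr

cache = redis.Redis(host="redis.internal", port=6379)  # placeholder ElastiCache endpoint

def point_timeseries(zarr_url: str, variable: str, lat: float, lon: float) -> dict:
    """Return the full temporal profile of one variable at one location."""
    cache_key = f"ts:{zarr_url}:{variable}:{lat}:{lon}"
    if (hit := cache.get(cache_key)) is not None:
        return json.loads(hit)

    ds = xr.open_zarr(zarr_url, consolidated=True)
    series = ds[variable].sel(lat=lat, lon=lon, method="nearest").load()

    result = {
        "timestamps": [str(t) for t in series["time"].values],
        "values": series.values.tolist(),
    }
    cache.set(cache_key, json.dumps(result), ex=3600)  # 1 hour TTL (configurable)
    return result
```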
Both services authenticate via AWS Cognito, support CORS for web clients, and expose Prometheus metrics for monitoring.
![Placeholder: Sequence diagram showing API request flow]
Technical Highlights
STAC Catalog with DynamoDB
The platform implements a SpatioTemporal Asset Catalog (STAC) for dataset discovery. The catalog was initially designed around OpenSearch; migrating to DynamoDB achieved a 90% cost reduction while improving query performance for common access patterns.
DynamoDB provides:
- Single-digit millisecond latency for item lookups
- Automatic scaling with on-demand billing
- Global secondary indexes for collection and temporal queries
- Point-in-time recovery for data protection
The STAC implementation supports standard query patterns: search by collection, filter by datetime range, and spatial bounding box queries. This makes the catalog compatible with existing STAC clients and tools.
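For illustration, a collection-plus-datetime query might look like the following. The table name, index name, and attribute layout are hypothetical, not the platform's actual schema.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("stac-items")  # hypothetical table name

def items_in_range(collection: str, start: str, end: str):
    """Query STAC items for one collection within an ISO-8601 datetime range."""
    resp = table.query(
        IndexName="collection-datetime-index",  # hypothetical GSI: PK=collection, SK=datetime
        KeyConditionExpression=(
            Key("collection").eq(collection) & Key("datetime").between(start, end)
        ),
    )
    return resp["Items"]

items = items_in_range("era5-temperature", "2020-01-01T00:00:00Z", "2020-12-31T23:59:59Z")
```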
Distributed Computing with Dask
For computationally intensive timeseries queries—extracting data across many timestamps or large spatial regions—the platform uses Dask for distributed processing. A Dask scheduler runs as an ECS service with worker tasks that scale based on demand.
This architecture enables:
- Parallel reading from multiple Zarr chunks
- Aggregation across temporal dimensions
- Efficient memory management for large arrays
- Graceful degradation when Dask is unavailable
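A hedged sketch of how a service can attach to the cluster and still work when it is unreachable; the scheduler address, dataset key, and dimension names are placeholders.

```python
import xarray as xr
from dask.distributed import Client

def get_dask_client(address: str = "tcp://dask-scheduler.internal:8786"):
    """Connect to the shared scheduler; fall back to local computation if unreachable."""
    try:
        return Client(address, timeout="5s")
    except (OSError, TimeoutError):
        return None  # Dask unavailable: computations run on the default local scheduler

client = get_dask_client()

# With or without a distributed client, the same lazy Zarr read works;
# chunked reads are simply executed locally when no cluster is attached.
ds = xr.open_zarr("s3://processed-bucket/dataset.zarr", consolidated=True)
regional_mean = (
    ds["temperature"].sel(lat=slice(40, 50), lon=slice(-10, 5)).mean("time").compute()
)
```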
Security and Authentication
The platform implements defense-in-depth security:
- Network isolation: ECS tasks run in private subnets with no direct internet access
- VPC endpoints: S3 and DynamoDB accessed through VPC endpoints to avoid NAT gateway costs
- IAM least privilege: Task roles scoped to specific S3 prefixes and DynamoDB tables
- Cognito authentication: JWT validation on all API requests
- Encryption: Data encrypted at rest (S3, DynamoDB) and in transit (TLS)
- Security scanning: Checkov validates Terraform configurations against security best practices
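The Cognito check boils down to standard JWT validation against the user pool's published JWKS. Here is a sketch using python-jose, with placeholder region, pool, and client IDs.

```python
import requests
from jose import jwt

REGION = "eu-west-1"                 # placeholder
USER_POOL_ID = "eu-west-1_example"   # placeholder
APP_CLIENT_ID = "example-client-id"  # placeholder

ISSUER = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}"
JWKS = requests.get(f"{ISSUER}/.well-known/jwks.json").json()["keys"]

def verify_token(token: str) -> dict:
    """Validate signature, expiry, issuer, and audience of a Cognito-issued JWT."""
    kid = jwt.get_unverified_header(token)["kid"]
    key = next(k for k in JWKS if k["kid"] == kid)
    return jwt.decode(
        token,
        key,
        algorithms=["RS256"],
        issuer=ISSUER,
        audience=APP_CLIENT_ID,  # for ID tokens; access tokens carry client_id instead
    )
```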
Infrastructure as Code
The entire platform is defined in Terraform with modular components:
- Network module: VPC, subnets, security groups, VPC endpoints
- IAM module: Roles and policies for ECS tasks, Lambda functions, Step Functions
- Data module: S3 buckets, DynamoDB tables, ElastiCache Redis
- ECS module: Task definitions, services, autoscaling policies, ALB configuration
- Ingestion module: Lambda functions, Step Functions state machines, S3 event notifications
This modular approach enables:
- Reusable components across environments
- Clear separation of concerns
- Easier testing and validation
- Documented infrastructure through code
CI/CD Pipeline
GitHub Actions automates the deployment workflow:
- Test: Run unit tests and property-based tests
- Build: Create Docker image and push to ECR
- Security: Run Checkov security scans on Terraform
- Documentation: Auto-generate Terraform module documentation
- Deploy: Update ECS task definitions and trigger rolling deployment
- Smoke Test: Validate health endpoints and sample API requests
The pipeline includes cost estimation with Infracost, providing visibility into infrastructure costs before deployment.
![Placeholder: CI/CD pipeline diagram]
Performance and Scale
The platform is designed for production workloads:
Autoscaling:
- Tiles service: 2-10 tasks based on CPU utilization
- Timeseries service: 2-20 tasks based on CPU utilization
- Target tracking at 70% CPU with configurable cooldown periods
Caching:
- CloudFront caches tiles for 24 hours at edge locations
- Redis caches timeseries results with configurable TTL
- Reduces backend load and improves response times
Monitoring:
- CloudWatch logs with 7-day retention
- Prometheus metrics exposed at a /metrics endpoint (sketched below)
- CloudWatch alarms for high error rates, latency, and service health
- Structured logging for debugging and analysis
Cost Optimization:
- DynamoDB on-demand billing (pay per request)
- ECS Fargate Spot capacity for non-critical workloads
- S3 lifecycle policies for archival
- VPC endpoints to avoid NAT gateway costs
- Estimated monthly cost: $85-285 depending on usage
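As referenced in the monitoring list above, exposing the /metrics endpoint is a few lines of prometheus_client. This sketch assumes the services are FastAPI apps; the framework choice, metric names, and labels are illustrative rather than confirmed details of the platform.

```python
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # exposes Prometheus metrics at /metrics

TILE_REQUESTS = Counter("tile_requests_total", "Tile requests served", ["collection"])
TILE_LATENCY = Histogram("tile_render_seconds", "Time spent rendering a tile")

@app.get("/tiles/{collection}/{z}/{x}/{y}.png")
def tile(collection: str, z: int, x: int, y: int):
    TILE_REQUESTS.labels(collection=collection).inc()
    with TILE_LATENCY.time():
        ...  # tile rendering as sketched in the API section
```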
Real-World Applications
This architecture is applicable to various scientific domains:
Climate Science:
- Historical temperature and precipitation data
- Climate model outputs
- Reanalysis datasets
Earth Observation:
- Satellite imagery timeseries
- Land cover change detection
- Vegetation indices
Environmental Monitoring:
- Air quality measurements
- Ocean temperature and salinity
- Soil moisture data
Hydrology:
- Streamflow predictions
- Groundwater levels
- Precipitation forecasts
The key is transforming static files into queryable, accessible data through standardized APIs.
Lessons Learned
Right-size your database: OpenSearch was overkill for simple STAC catalog queries. DynamoDB provided better performance at 10% of the cost. Choose databases based on actual query patterns, not perceived needs.
Dual formats work: COG for spatial queries and Zarr for temporal queries proved effective. Each format is optimized for its use case, and the storage overhead is justified by query performance.
Event-driven ingestion scales: S3 events triggering Lambda functions provide a simple, scalable ingestion pattern. No polling, no scheduling—just upload and process.
Infrastructure as code is essential: Terraform modules made the infrastructure reproducible, testable, and documentable. The ability to tear down and recreate environments saved significant debugging time.
Security from the start: Implementing security best practices early—private subnets, least-privilege IAM, encryption—is easier than retrofitting later. Automated security scanning with Checkov caught issues before deployment.
Monitor everything: CloudWatch logs, metrics, and alarms provided visibility into system behavior. Prometheus metrics enabled detailed performance analysis. You can’t optimize what you don’t measure.
Future Enhancements
Several improvements are on the roadmap:
Kerchunk for virtual datasets: Instead of converting NetCDF to Zarr, create JSON reference files that enable Zarr-like access to original NetCDF files. This eliminates data duplication and speeds ingestion dramatically.
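The idea is compact enough to sketch; bucket and file names are placeholders, and the exact kerchunk and xarray options may differ by version.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

src = "s3://raw-bucket/dataset.nc"  # placeholder

# Build a JSON index of the chunks inside the original NetCDF/HDF5 file.
with fsspec.open(src, "rb", anon=False) as f:
    refs = SingleHdf5ToZarr(f, src).translate()

# Open the untouched NetCDF through the Zarr engine, using the references.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "s3"},
    },
)
```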
Serverless API with Lambda: For lower-traffic deployments, replace ECS services with API Gateway + Lambda to reduce costs and eliminate idle capacity.
Enhanced STAC features: Add support for STAC collections, asset management, and more complex spatial queries.
Batch processing optimization: Use AWS Batch for large file conversions to optimize cost and performance.
Multi-region deployment: Replicate data and services across regions for global access and disaster recovery.
Conclusion
Building a platform for scientific raster data requires balancing multiple concerns: performance, cost, security, and maintainability. This architecture demonstrates that it’s possible to make large, complex datasets accessible through modern cloud-native patterns.
The key insights:
- Format matters: Cloud-optimized formats (COG, Zarr) enable efficient access patterns
- Automation scales: Event-driven pipelines handle variable workloads without manual intervention
- Standards enable interoperability: STAC makes datasets discoverable and compatible with existing tools
- Right-sizing saves money: Choose services based on actual needs, not maximum capabilities
- Infrastructure as code: Terraform makes complex systems reproducible and maintainable
The result is a platform that transforms unwieldy NetCDF files into accessible, queryable resources—making scientific data more useful for research, analysis, and decision-making.
Resources
- STAC Specification - SpatioTemporal Asset Catalog standard
- Cloud-Optimized GeoTIFF - COG format specification
- Zarr - Chunked, compressed array storage
- Dask - Distributed computing in Python
- Terraform AWS Provider - Infrastructure as code
- rio-tiler - Raster tile generation library