Overview
The HLS STAC Geoparquet Archive is an unofficial copy of the HLS 2.0 granule STAC item metadata that is generated in the HLS pipeline. The data are stored in two hive-partitioned parquet datasets (one per collection, partitioned by year and month). The parquet files are updated every 5 days from CMR API Granule queries, covering both the previous month (straggler catch-up) and the current month (incremental build).
Warning: This archive is not guaranteed to contain all of the records available in CMR, particularly for the most recent months. If you need the most recent granules, do not use this archive!
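To make the update window concrete, here is a minimal sketch of which (year, month) partitions a refresh run on a given date would rebuild, going only by the description above (previous month plus current month). The function name `refreshed_partitions` is hypothetical and is not part of the archive's pipeline.

```python
from datetime import date


def refreshed_partitions(run_date: date) -> list[tuple[int, int]]:
    """Return the (year, month) partitions rebuilt by a run on `run_date`.

    Hypothetical helper: it only mirrors the documented behavior
    (previous month for stragglers, current month for new granules).
    """
    current = (run_date.year, run_date.month)
    # A run in January reaches back to December of the previous year.
    if run_date.month == 1:
        previous = (run_date.year - 1, 12)
    else:
        previous = (run_date.year, run_date.month - 1)
    return [previous, current]
```

For example, `refreshed_partitions(date(2025, 1, 10))` returns the December 2024 and January 2025 partitions. Read this way, a partition would stop receiving updates once it is more than one month old.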
The parquet files can be accessed from the nasa-maap-data-store bucket in AWS S3 (us-west-2):
s3://nasa-maap-data-store/file-staging/nasa-map/hls-stac-geoparquet-archive/v2/{collection}/year={year}/month={month}/{collection}-{year}-{month}.parquet
where collection is either HLSL30_2.0 (Landsat) or HLSS30_2.0 (Sentinel-2).
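As a small illustration of the path template, the sketch below fills it in for a single partition. The helper name `partition_href` is hypothetical. The year and month are passed through as strings exactly as given, because the template does not say whether the month is zero-padded; check a bucket listing before relying on a particular form.

```python
# Path template copied from the documentation above.
S3_TEMPLATE = (
    "s3://nasa-maap-data-store/file-staging/nasa-map/"
    "hls-stac-geoparquet-archive/v2/"
    "{collection}/year={year}/month={month}/"
    "{collection}-{year}-{month}.parquet"
)

# The two collections named in the documentation.
COLLECTIONS = ("HLSL30_2.0", "HLSS30_2.0")


def partition_href(collection: str, year: str, month: str) -> str:
    """Build the S3 URI of one monthly parquet file (hypothetical helper).

    `year` and `month` are inserted verbatim; whether the month is
    zero-padded in the bucket is an open assumption.
    """
    if collection not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection!r}")
    return S3_TEMPLATE.format(collection=collection, year=year, month=month)
```

Rejecting unknown collection names up front avoids a confusing "no files found" error later when the path is handed to a parquet reader.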
Usage
rustac
The rustac package can be used to query the archive via the DuckdbClient interface. To use this approach, your environment must be configured with AWS credentials that provide ListBucket access to the nasa-maap-data-store bucket in S3 (this will work in the MAAP hub).
Note: The HLSL30_2.0 and HLSS30_2.0 collections must be queried separately because the STAC items have slightly different parquet schemas.
from rustac import DuckdbClient

client = DuckdbClient(use_hive_partitioning=True)

# configure duckdb to find S3 credentials for listing/reading the files in S3

# on the MAAP HUB
# import boto3
#
# aws_session = boto3.Session()
# creds = aws_session.get_credentials().get_frozen_credentials()
# client.execute(
#     f"""
#     CREATE OR REPLACE SECRET secret (
#         TYPE S3,
#         REGION '{aws_session.region_name}',
#         KEY_ID '{creds.access_key}',
#         SECRET '{creds.secret_key}',
#         SESSION_TOKEN '{creds.token}'
#     );
#     """
# )

# on the MAAP ADE
client.execute(
    """
    CREATE OR REPLACE SECRET secret (
        TYPE S3,
        PROVIDER credential_chain
    );
    """
)

# glob over every year/month partition of one collection
parquet_href = "s3://nasa-maap-data-store/file-staging/nasa-map/hls-stac-geoparquet-archive/v2/{collection}/**/*.parquet"
datetime = "2025-05-01T00:00:00Z/2025-05-31T23:59:59Z"
bbox = (-90, 45, -85, 50)

hls_l30_items = client.search(
    href=parquet_href.format(collection="HLSL30_2.0"),
    datetime=datetime,
    bbox=bbox,
)
print(f"found {len(hls_l30_items)} HLSL30_2.0 items")

hls_s30_items = client.search(
    href=parquet_href.format(collection="HLSS30_2.0"),
    datetime=datetime,
    bbox=bbox,
)
print(f"found {len(hls_s30_items)} HLSS30_2.0 items")
found 292 HLSL30_2.0 items
found 394 HLSS30_2.0 items
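Because the two collections have to be queried separately, the two searches above can be wrapped in a loop. The following is a minimal sketch, not part of rustac: `search_each_collection` is a hypothetical helper that takes any search callable accepting an `href` keyword (such as `client.search` in the example above) and returns the results keyed by collection.

```python
from typing import Any, Callable

# The two collections named in the documentation.
COLLECTIONS = ("HLSL30_2.0", "HLSS30_2.0")


def search_each_collection(
    search: Callable[..., list[dict[str, Any]]],
    href_template: str,
    **search_kwargs: Any,
) -> dict[str, list[dict[str, Any]]]:
    """Run one search per collection and key the results by collection.

    Hypothetical helper: `href_template` must contain a `{collection}`
    placeholder; all other keyword arguments are forwarded unchanged.
    """
    results: dict[str, list[dict[str, Any]]] = {}
    for collection in COLLECTIONS:
        href = href_template.format(collection=collection)
        results[collection] = search(href=href, **search_kwargs)
    return results
```

With the variables defined earlier, a call would look like `search_each_collection(client.search, parquet_href, datetime=datetime, bbox=bbox)`. Keeping the results in separate lists avoids mixing items whose schemas differ.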
Example item
The items in the HLS STAC Geoparquet Archive were copied directly from the STAC item JSON files that are produced for every HLS granule (e.g. https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSS30.020/HLS.S30.T21JXN.2025341T134221.v2.0/HLS.S30.T21JXN.2025341T134221.v2.0_stac.json).
from pystac import Item
Item.from_dict(hls_s30_items[0])
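For a quick look at a result without rendering the full item, a plain-dictionary summary can be enough. The sketch below assumes only the standard STAC item layout (`id`, `properties.datetime`, `assets`); it makes no assumption about which asset keys HLS items carry, and `summarize_item` is a hypothetical helper name.

```python
from typing import Any


def summarize_item(item: dict[str, Any]) -> dict[str, Any]:
    """Return the id, datetime, and sorted asset keys of a STAC item dict.

    Hypothetical helper: missing fields come back as None or an empty
    list instead of raising.
    """
    properties = item.get("properties") or {}
    assets = item.get("assets") or {}
    return {
        "id": item.get("id"),
        "datetime": properties.get("datetime"),
        "assets": sorted(assets),
    }
```

With the search results from above, `summarize_item(hls_s30_items[0])` would give a compact overview of the first Sentinel-2 item.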
Comparison to CMR API granules
This archive is generated by running granule queries for the HLS collections against the CMR API, so it represents a snapshot of a dynamic catalog. The archive is updated every 5 days, covering both the previous month (to catch stragglers) and the current month (incremental updates as new granules are published). Because the source catalog keeps changing between snapshots, the archive will always be a slightly incomplete copy of the canonical source, but it should contain 99% of the full set of granules.