Download huge-scale (larger-than-memory) Sentinel-1 & Sentinel-2 data cubes on any machine, with integrated cloud detection, snow masking, harmonization, merging, and temporal composites.
- The model for cloud detection will be made available within the next couple of weeks.
- This package is in an early alpha stage. There will be bugs! If you encounter any error, warning, memory issue, etc., please open a GitHub issue with code to reproduce it.
- This package is meant for large-scale processing; because of the underlying processing scheme, areas smaller than 8 km in width and height will not run any faster.
This package is tested with Python 3.12.*. It may or may not work with other versions.
pip install sentle
or
git clone [email protected]:cmosig/sentle.git
cd sentle
pip install -e .
(1) Setup
There is only one important function: process. Here, you specify all parameters and the function returns a lazy dask array with the shape (#timesteps, #bands, #pixelsy, #pixelsx).
from sentle import sentle
from rasterio.crs import CRS
da = sentle.process(
    target_crs=CRS.from_string("EPSG:32633"),
    bound_left=176000,
    bound_bottom=5660000,
    bound_right=216000,
    bound_top=5700000,
    datetime="2022-06-17/2023-06-17",
    target_resolution=10,
    S2_mask_snow=True,
    S2_cloud_classification=True,
    S2_cloud_classification_device="cuda",
    S1_assets=["vv", "vh"],
    S2_apply_snow_mask=True,
    S2_apply_cloud_mask=True,
    time_composite_freq="7d",
    num_workers=7,
)
This code downloads one year of both Sentinel-1 and Sentinel-2 data for a 40 km by 40 km area. Clouds and snow are detected and replaced with NaNs, and the data is averaged every 7 days. A lazy dask array is returned.
Explanation:
- target_crs: Specifies the target CRS that all data will be reprojected to.
- target_resolution: Determines the spatial resolution that all data is reprojected to in the target_crs.
- bound_*: Spatial bounds in the target_crs of the area you want to download. Behavior is undefined if the difference between opposite bounds is not divisible by target_resolution.
- datetime: Time range that will be downloaded.
- S2_mask_snow: Whether to compute a snow mask for Sentinel-2 data.
- S2_cloud_classification: Whether to compute a cloud classification layer for Sentinel-2 data.
- S2_cloud_classification_device: Where to run the cloud classification. Pass cuda if you have an NVIDIA GPU, otherwise cpu (default).
- S2_apply_*: Whether to apply the respective mask, i.e., replace masked values with NaN.
- S1_assets: Which Sentinel-1 assets to download. Disable Sentinel-1 by setting this to None.
- time_composite_freq: Rounding interval across which data is averaged. Uses pandas.Timestamp.round(time_composite_freq). Cloud and snow masks are dropped after masking because they cannot be aggregated.
- num_workers: Number of cores to use. Plan about 4 GiB of memory usage per worker.
(2) Compute
You either run .compute() on the returned dask array or pass the object to
sentle.save_as_zarr(da, path="..."), which sets up the zarr storage and saves
each chunk to disk as soon as it is ready. The latter enables computing an
area and temporal range that is much larger than the RAM on your machine.
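For example, a minimal sketch of the second option (the path mycube.zarr is a placeholder and matches the one used in the visualization step below):

from sentle import sentle

# da is the lazy dask array returned by sentle.process above
sentle.save_as_zarr(da, path="mycube.zarr")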
(3) Visualize
Load the data with xarray and visualize it using, for example, the awesome lexcube package. Here, band B02 from the above example is visualized. You can spot the cloud gaps and the spotty coverage during winter.
import lexcube
import xarray as xr
da = xr.open_zarr("mycube.zarr").sentle
lexcube.Cube3DWidget(da.sel(band="B02"), vmin=0, vmax=4000)
Upon initialization, sentle prints a link to a dask dashboard. Check the bottom-right pane in the Status tab for a progress bar. A variety of other stats are also visible there. If you are working on a remote machine, you may need to use port forwarding to access the remote dashboard.
Increase the number of workers using the num_workers parameter of the process function. With the default spatial chunk size of 4000, specified by processing_spatial_chunk_size, you should plan for about 4 GiB of memory per worker. At the moment (this will change), each worker also initializes its own model on the GPU, so more workers also means more GPU VRAM is used.
Increase processing_spatial_chunk_size from 4000 to something higher in the process function. This will increase the spatial chunk sizes, but will also increase the memory requirements per worker; see the sketch below.
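A sketch with illustrative values (same area and time range as the setup example; 6000 and 4 are just example values, not recommendations):

from rasterio.crs import CRS
from sentle import sentle

da = sentle.process(
    target_crs=CRS.from_string("EPSG:32633"),
    bound_left=176000,
    bound_bottom=5660000,
    bound_right=216000,
    bound_top=5700000,
    datetime="2022-06-17/2023-06-17",
    target_resolution=10,
    S2_mask_snow=True,
    S2_cloud_classification=True,
    S2_cloud_classification_device="cpu",
    S1_assets=["vv", "vh"],
    S2_apply_snow_mask=True,
    S2_apply_cloud_mask=True,
    time_composite_freq="7d",
    processing_spatial_chunk_size=6000,  # default is 4000; bigger chunks need more RAM per worker
    num_workers=4,                       # plan roughly 4 GiB of RAM per worker
)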
Every time you start a Python kernel and run sentle.process, a new dask cluster is set up. When you run sentle.process again, the old cluster is reused. If you want to start a new cluster, you need to restart the kernel.
You need to wrap the sentle code inside an if __name__ == "__main__": guard for the dask code to work properly, as sketched below. This is a dask requirement.
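A minimal sketch of that structure (the process arguments are omitted here; use the ones from the setup example above):

from sentle import sentle

def main():
    # set up and compute the cube here, e.g. sentle.process(...) followed by
    # sentle.save_as_zarr(...), with the parameters from the setup example
    ...

if __name__ == "__main__":
    main()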
The number of files that can be opened at once is limited, and each dask worker opens a couple of files. You will have to increase the limit with ulimit -n 100000 or ask your administrator. This is a dask issue :)
Please submit issues or pull requests if you feel like something is missing or needs to be fixed.
This project is licensed under the MIT License - see the LICENSE.md file for details.
Thank you to David Montero for all the discussions and his awesome packages which inspired this.