Disk cache

Replay an expensive step without recomputing

`.cached_task(fn, cache_dir=...)` computes `fn(x)` and serializes the result to a `.pkl` file keyed by the SHA-256 of the input. On subsequent runs, if a file for that key exists, the result is read from disk and the function is never called.

```python
import tempfile
import time

from olympipe import Pipeline

def compute_stats(x: int) -> dict:
    # Minimal stand-in so the example runs on its own
    return {"square": x ** 2, "parity": x % 2}

def expensive_transform(x: int) -> dict:
    time.sleep(0.5)  # simulate a heavy computation
    return {"value": x ** 3, "meta": compute_stats(x)}

with tempfile.TemporaryDirectory() as cache_dir:
    # First run: computes and persists to disk
    results = (
        Pipeline(range(1000))
        .cached_task(expensive_transform, cache_dir=cache_dir)
        .uncache()
        .wait_for_result()
    )
    # ~500 s

    # Second run: disk reads only
    results_again = (
        Pipeline(range(1000))
        .cached_task(expensive_transform, cache_dir=cache_dir)
        .uncache()
        .wait_for_result()
    )
    # < 2 s
```

Performance

1,000 items, 0.5 s/item computation:

| Run | Time |
| --- | --- |
| First run | 8.3 min |
| Second run (cache) | 1.8 s |

🚀 277.8× faster
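The figures above follow directly from the workload, as a quick sanity check shows (the 1.8 s second-run time is the measured value from the table):

```python
# 1,000 items at 0.5 s each, computed sequentially
first_run = 1000 * 0.5          # 500 s, i.e. about 8.3 minutes
second_run = 1.8                # measured cache-read time in seconds
speedup = first_run / second_run
print(round(first_run / 60, 1))  # → 8.3
print(round(speedup, 1))         # → 277.8
```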

How it works

The key is the SHA-256 of the pickled input. Files are stored at `cache_dir/<step_hash[:16]>/<key>.pkl`. Works with multi-step pipelines: each step has its own cache subfolder.
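The key scheme above can be sketched in a few lines. `cache_path` is a hypothetical helper written for illustration, not part of olympipe's API; it only mirrors the layout described: SHA-256 of the pickled input as the key, under a per-step subfolder.

```python
import hashlib
import os
import pickle

def cache_path(cache_dir: str, step_hash: str, item) -> str:
    # Hypothetical sketch: the key is the SHA-256 of the pickled input,
    # stored under cache_dir/<step_hash[:16]>/<key>.pkl
    key = hashlib.sha256(pickle.dumps(item)).hexdigest()
    return os.path.join(cache_dir, step_hash[:16], f"{key}.pkl")

# Same input always maps to the same file, which is what makes replay possible
path = cache_path("/tmp/cache", "0" * 64, 42)
```

Because `pickle.dumps` is deterministic for simple values like ints and strings, identical inputs land on identical files; each pipeline step hashing into its own subfolder keeps keys from colliding across steps.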

Related examples