
update tutorial for clarity #16354

Merged · 4 commits into main · Dec 12, 2024
Conversation

@zzstoatzz (Collaborator) commented on Dec 12, 2024

Makes a couple of updates to the Build a data pipeline tutorial:

  • The existing use of .submit()/.result() is a bit more verbose, and I would rather recommend sequential .map calls in cases where the I/O actually takes a while, because in this example the second loop blocks at each result() call for each item instead of resolving the futures concurrently like .map(list[T]).result(). I also think it's easier to read and reason about.
For example:

slower: [image]

faster: [image]

import time
from typing import Any

from prefect import flow, task


@task
def slow_api_call(item: str) -> dict[str, Any]:
    """Simulate a slow API call that takes 1 second"""
    time.sleep(1)  # Simulate network delay
    return {"item": item, "data": f"data for {item}"}


@task
def slow_processing(data: dict[str, Any]) -> str:
    """Simulate slow processing that takes 0.5 seconds"""
    time.sleep(0.5)  # Simulate processing time
    return f"Processed {data['item']}: {data['data']}"


@flow(name="inefficient-flow")
def process_inefficiently(items: list[str]) -> None:
    """
    Inefficient approach using submit/result in sequence.
    For N items, this takes roughly N * (1 + 0.5) seconds because we block
    on each result() call.
    """
    print("\nStarting inefficient flow...")
    start_time = time.time()

    # First phase: submit API calls
    api_futures = []
    for item in items:
        api_futures.append({"item": item, "future": slow_api_call.submit(item)})

    # Second phase: wait for each API call and process
    # This blocks sequentially on each result() call!
    for future_info in api_futures:
        item = future_info["item"]
        api_result = future_info["future"].result()  # Blocks here!
        processed = slow_processing(api_result)  # Waits for processing
        print(f"Completed {item}: {processed}")

    duration = time.time() - start_time
    print(f"Inefficient flow took {duration:.2f} seconds")


@flow(name="efficient-flow")
def process_efficiently(items: list[str]) -> None:
    """
    Efficient approach using map.
    For N items, this takes roughly max(1, 0.5) * N seconds because
    all futures are resolved concurrently.
    """
    print("\nStarting efficient flow...")
    start_time = time.time()

    # Map over all items for API calls
    api_results = slow_api_call.map(items)

    # Map over all results for processing
    # This resolves all futures concurrently!
    processed = slow_processing.map(api_results).result()

    # Print results
    for item, result in zip(items, processed):
        print(f"Completed {item}: {result}")

    duration = time.time() - start_time
    print(f"Efficient flow took {duration:.2f} seconds")


if __name__ == "__main__":
    test_items = ["item1", "item2", "item3", "item4"]

    # Run both flows to compare
    process_inefficiently(test_items)
    process_efficiently(test_items)
  • Moves rate_limit into the task. Happy to reconsider this; I was just thinking that I'd want my task runtimes to reflect time spent waiting for the GitHub API to free up, but I could also see situations where you might want to avoid submission until some resource is free.

  • Keeping with the theme of the tutorial, updates the full examples to compound the use of features adopted throughout the page (i.e. continuing to use retries=3 even when adding caching or rate limiting).

  • Following normal convention, defines the tasks referenced in the show_stars flow before the flow itself.

  • Uses proper type hinting.
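The first bullet's timing argument generalizes beyond Prefect. Here is a dependency-free sketch of the same two patterns using only the standard library's concurrent.futures (the api_call/process names and the delay values are illustrative, not from the tutorial): blocking on each result() and processing inline serializes the second phase, while mapping both phases keeps everything overlapped.

```python
import time
from concurrent.futures import ThreadPoolExecutor

API_DELAY = 0.1    # pretend API latency per item
PROC_DELAY = 0.05  # pretend processing time per item
ITEMS = ["a", "b", "c", "d"]


def api_call(item: str) -> str:
    time.sleep(API_DELAY)
    return f"data for {item}"


def process(data: str) -> str:
    time.sleep(PROC_DELAY)
    return f"processed {data}"


def inefficient() -> tuple[list[str], float]:
    """Submit everything, then block on result() and process inline per item."""
    start = time.perf_counter()
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(api_call, item) for item in ITEMS]
        # process() runs in the calling thread, one item at a time,
        # so the second phase takes roughly len(ITEMS) * PROC_DELAY
        results = [process(f.result()) for f in futures]
    return results, time.perf_counter() - start


def efficient() -> tuple[list[str], float]:
    """Map both phases onto the pool so the second phase also overlaps."""
    start = time.perf_counter()
    with ThreadPoolExecutor() as pool:
        data = pool.map(api_call, ITEMS)
        # second-phase tasks are submitted as first-phase results arrive,
        # so total time is roughly API_DELAY + PROC_DELAY
        results = list(pool.map(process, data))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    res_slow, t_slow = inefficient()
    res_fast, t_fast = efficient()
    print(f"inefficient: {t_slow:.2f}s, efficient: {t_fast:.2f}s")
```

Both functions return identical results; only the wall-clock time differs, which mirrors the slower/faster comparison in the Prefect example above.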

@discdiver (Contributor) left a comment:


Minor suggested edits.

Eight review threads on docs/v3/tutorials/pipelines.mdx, all resolved (seven marked outdated).
@zzstoatzz force-pushed the improve-data-pipeline-tutorial branch from 451e71e to 73498a8 on December 12, 2024 at 18:14
@zzstoatzz (Collaborator, Author):

Good suggestions @discdiver! Updated in 73498a8.

@zzstoatzz zzstoatzz enabled auto-merge (squash) December 12, 2024 18:37
@zzstoatzz zzstoatzz merged commit b2faff3 into main Dec 12, 2024
5 checks passed
@zzstoatzz zzstoatzz deleted the improve-data-pipeline-tutorial branch December 12, 2024 18:45
EmilRex pushed a commit that referenced this pull request Dec 13, 2024