direct ingestion into the Online Store #1863

Closed
vas28r13 opened this issue Sep 15, 2021 · 3 comments
Labels
kind/feature New feature or request

Comments

@vas28r13

Problem
We'd like to ingest data into the Online Store from streaming sources. We think that supporting direct ingestion into the Online Store could be a lightweight way to do this in the newer Feast versions.

In Feast v0.12, materialization from the Offline store is the way to get data into the Online store.

We believe we can manage the streaming processes on our end but would love to have a way to ingest data into the Online Store directly.

Use Case
A model process/service receives a stream of data and needs to write the new/updated features into the Online feature store as they arrive.

Code example
feature_store_object.ingest_df("feature_view_name", dataframe_object)

Process 1

    feature_store = _get_feature_store()
    for feature_data in _kafka_consume():
        # clean the data
        features_df = _clean_features(feature_data)

        # ingest the data directly into the online store
        feature_store.ingest_df("stream_ingestion", features_df)

Process 2 (job)

from datetime import datetime, timezone

feature_store = _get_feature_store()

# materialization from Offline sources
feature_store.materialize_incremental(end_date=datetime.now(timezone.utc))

Potential Issue
Since there would now be two ways to ingest data into the Online Store (materialization and direct ingestion), online_write_batch in the online store implementation needs logic that compares event timestamps, since event time can be delayed. For example, if Process 2 runs after Process 1 but the event timestamps of the features ingested by Process 1 are more recent, then the features from Process 2 should not overwrite what Process 1 already wrote to the Online Store.
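
For illustration, here is a minimal sketch of the kind of timestamp guard online_write_batch could apply (this is not Feast's actual implementation; the in-memory store layout and helper name are hypothetical):

    from datetime import datetime
    from typing import Dict, Tuple

    # hypothetical in-memory online store: entity_key -> (event_timestamp, feature_values)
    _online_store: Dict[str, Tuple[datetime, dict]] = {}

    def write_if_newer(entity_key: str, event_timestamp: datetime, feature_values: dict) -> bool:
        """Write a row only if it is at least as new as the row already stored."""
        existing = _online_store.get(entity_key)
        if existing is not None and existing[0] > event_timestamp:
            # The stored row was produced later (e.g. by direct stream ingestion),
            # so a delayed materialization batch must not overwrite it.
            return False
        _online_store[entity_key] = (event_timestamp, feature_values)
        return True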

Let us know what you think. We implemented this locally for our use case but wonder if this is something that could be useful in the official Feast version as well.

@woop
Member

woop commented Sep 15, 2021

@vas28r13 Thanks for writing this up. Overall it makes sense to me.

We believe we can manage the streaming processes on our end but would love to have a way to ingest data into the Online Store directly.

How would you ensure that streaming processes/jobs stay in sync with feature views in the feature store? It seems like there are broadly two approaches:

  1. Provide a Push API and have an upstream system push events to the API. Write those events into the FS. Feast doesn't manage the upstream job
  2. Feast launches ingestion jobs which consume from a topic and write into an online store (this write may go through a Push API or could be direct to the storage layer).

(1) adds less responsibility to Feast, but also makes it harder to keep jobs in sync. In (2) Feast would be able to launch new consuming jobs, take down old ones, or update the schema of events that the job processes.
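
For concreteness, here is a hypothetical sketch of what approach (1) could look like from the upstream job's side; the push endpoint, path, and payload shape are assumptions, not an existing Feast API:

    import json
    import urllib.request

    def push_features(feature_view_name: str, rows: list) -> int:
        # The upstream streaming job (which Feast does not manage) pushes cleaned rows
        # to a hypothetical Feast-fronted push endpoint, which then performs the write
        # into the online store.
        payload = json.dumps({"feature_view": feature_view_name, "rows": rows}).encode("utf-8")
        req = urllib.request.Request(
            "http://feast-feature-server/push",  # hypothetical endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

With approach (2), this client-side code would largely disappear: Feast itself would own the consumer that reads the topic and performs these writes.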

Since there would now be two ways to ingest data into the Online Store (materialization and direct ingestion), online_write_batch in the online store implementation needs logic that compares event timestamps, since event time can be delayed. For example, if Process 2 runs after Process 1 but the event timestamps of the features ingested by Process 1 are more recent, then the features from Process 2 should not overwrite what Process 1 already wrote to the Online Store.

Correct, we should have logic that ensures we aren't overwriting newer data with older data.

@vas28r13
Author

@woop great point! Although I like the flexibility of approach (1), keeping the streaming processes/jobs on our end in sync with iterations on FeatureViews in the feature store would be an issue.
I was thinking that we'd have a system to version our feature views (maybe just through feature view namespaces), because at some point the model and/or streaming process still depends on a specific feature view schema. Essentially, the codebase of the model and/or streaming process would reference a particular FeatureView version (e.g. streaming_data_v1), as sketched below.
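
A small sketch of how that pinning could look in the streaming process, reusing the hypothetical helpers from Process 1 above and the proposed ingest_df API (the version-suffix convention is illustrative):

    # the streaming/model codebase pins the FeatureView version it was built against;
    # schema changes ship as a new versioned view (e.g. streaming_data_v2) instead of
    # mutating the view the running job depends on
    FEATURE_VIEW_VERSION = "streaming_data_v1"

    feature_store = _get_feature_store()
    for feature_data in _kafka_consume():
        features_df = _clean_features(feature_data)
        feature_store.ingest_df(FEATURE_VIEW_VERSION, features_df)
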
Let me know what you think.

@adchia
Collaborator

adchia commented Jan 5, 2022

Note that this has been implemented, but for online stores other than Redis there is still extra wiring work needed to check event timestamps. That work partly exists in adchia#1, but I paused it because it seemed like it would slow down materialization. One natural way around this would be to support multiple versions of the same data and pull the latest version at serving time. That would only make sense, though, if we have some TTL logic like #1988.
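
A rough sketch of that "keep multiple versions, read the latest at serving time" idea, with simple TTL filtering at read time (this is not the adchia#1 implementation; the data structures are illustrative):

    from datetime import datetime, timedelta
    from typing import Dict, List, Optional, Tuple

    # entity_key -> list of (event_timestamp, feature_values) versions
    _versions: Dict[str, List[Tuple[datetime, dict]]] = {}

    def write(entity_key: str, event_ts: datetime, values: dict) -> None:
        # writes never compare timestamps, so materialization is not slowed down
        _versions.setdefault(entity_key, []).append((event_ts, values))

    def read_latest(entity_key: str, ttl: timedelta, now: datetime) -> Optional[dict]:
        # drop versions older than the TTL, then serve the newest remaining one
        fresh = [(ts, v) for ts, v in _versions.get(entity_key, []) if now - ts <= ttl]
        if not fresh:
            return None
        return max(fresh, key=lambda r: r[0])[1]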

adchia added the kind/feature (New feature or request) label on Jan 7, 2022
adchia closed this as completed on Mar 31, 2022