direct ingestion into the Online Store #1863

Closed
vas28r13 opened this issue Sep 15, 2021 · 3 comments
Labels
kind/feature New feature or request

Comments

@vas28r13

Problem
We'd like to ingest data into the Online Store from streaming sources. We think that supporting direct ingestion into the Online Store could be a lightweight way to do this in the newer Feast versions.

In Feast v0.12, materialization from the Offline store is the way to get data into the Online store.

We believe we can manage the streaming processes on our end but would love to have a way to ingest data into the Online Store directly.

Use Case
A model process/service receives a stream of data and needs to write the new/updated features into the Online feature store as they arrive.

Code example
feature_store_object.ingest_df("feature_view_name", dataframe_object)

Process 1

    feature_store = _get_feature_store()
    for feature_data in _kafka_consume():
        # clean the data
        features_df = _clean_features(feature_data)

        # ingest the data directly into the online store
        feature_store.ingest_df("stream_ingestion", features_df)

Process 2 (job)

from datetime import datetime, timezone

feature_store = _get_feature_store()

# materialization from Offline sources
feature_store.materialize_incremental(end_date=datetime.now(timezone.utc))

Potential Issue
Since there would now be two ways to ingest data into the Online Store (materialization and direct ingestion), online_write_batch in the online store implementation needs logic that compares event timestamps, since event time can be delayed. For example, if Process 2 runs after Process 1 but the event timestamps of the features ingested by Process 1 are more recent, then the features from Process 2 should not overwrite what Process 1 already wrote to the Online Store.
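
For illustration, here is a minimal sketch of the kind of timestamp guard online_write_batch could apply (this is not Feast's actual implementation; the in-memory store layout and helper name are hypothetical):

    from datetime import datetime
    from typing import Dict, Tuple

    # hypothetical in-memory online store: entity_key -> (event_timestamp, feature_values)
    _online_store: Dict[str, Tuple[datetime, dict]] = {}

    def write_if_newer(entity_key: str, event_timestamp: datetime, feature_values: dict) -> bool:
        """Write a row only if it is at least as new as the row already stored."""
        existing = _online_store.get(entity_key)
        if existing is not None and existing[0] > event_timestamp:
            # The stored row was produced later (e.g. by direct stream ingestion),
            # so a delayed materialization batch must not overwrite it.
            return False
        _online_store[entity_key] = (event_timestamp, feature_values)
        return True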

Let us know what you think. We implemented this locally for our use case but wonder if this is something that could be useful in the official Feast version as well.

@woop
Member

woop commented Sep 15, 2021

@vas28r13 Thanks for writing this up. Overall it makes sense to me.

We believe we can manage the streaming processes on our end but would love to have a way to ingest data into the Online Store directly.

How would you ensure that streaming processes/jobs stay in sync with feature views in the feature store? It seems like there are broadly two approaches:

  1. Provide a Push API and have an upstream system push events to the API. Write those events into the FS. Feast doesn't manage the upstream job
  2. Feast launches ingestion jobs which consume from a topic and write into an online store (this write may go through a Push API or could be direct to the storage layer).

(1) adds less responsibility to Feast, but also makes it harder to keep jobs in sync. In (2) Feast would be able to launch new consuming jobs, take down old ones, or update the schema of events that the job processes.
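
For concreteness, here is a hypothetical sketch of what approach (1) could look like from the upstream job's side; the push endpoint, path, and payload shape are assumptions, not an existing Feast API:

    import json
    import urllib.request

    def push_features(feature_view_name: str, rows: list) -> int:
        # The upstream streaming job (which Feast does not manage) pushes cleaned rows
        # to a hypothetical Feast-fronted push endpoint, which then performs the write
        # into the online store.
        payload = json.dumps({"feature_view": feature_view_name, "rows": rows}).encode("utf-8")
        req = urllib.request.Request(
            "http://feast-feature-server/push",  # hypothetical endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

With approach (2), this client-side code would largely disappear: Feast itself would own the consumer that reads the topic and performs these writes.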

Since there would now be two ways to ingest data into the Online Store (materialization and direct ingestion), online_write_batch in the online store implementation needs logic that compares event timestamps, since event time can be delayed. For example, if Process 2 runs after Process 1 but the event timestamps of the features ingested by Process 1 are more recent, then the features from Process 2 should not overwrite what Process 1 already wrote to the Online Store.

Correct, we should have logic that ensures we aren't overwriting newer data with older data.

@vas28r13
Author

@woop great point! Although I like the flexibility of approach (1), keeping the streaming processes/jobs on our end in sync with iterations on FeatureViews in the feature store would be an issue.
I was thinking that we'd have a system to version our feature views (maybe just through feature view namespaces), because at some point the model and/or streaming process still depends on a specific feature view schema. Essentially, the codebase of the model and/or streaming process would reference a particular FeatureView version (e.g. streaming_data_v1), as sketched below.
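
A small sketch of how that pinning could look in the streaming process, reusing the hypothetical helpers from Process 1 above and the proposed ingest_df API (the version-suffix convention is illustrative):

    # the streaming/model codebase pins the FeatureView version it was built against;
    # schema changes ship as a new versioned view (e.g. streaming_data_v2) instead of
    # mutating the view the running job depends on
    FEATURE_VIEW_VERSION = "streaming_data_v1"

    feature_store = _get_feature_store()
    for feature_data in _kafka_consume():
        features_df = _clean_features(feature_data)
        feature_store.ingest_df(FEATURE_VIEW_VERSION, features_df)
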
Let me know what you think.

@adchia
Collaborator

adchia commented Jan 5, 2022

Note that this has been implemented, but for online stores other than Redis there is still extra wiring work needed to check event timestamps. That work partly exists in adchia#1, but I paused it because it seemed like it would slow down materialization. One natural way around this would be to support multiple versions of the same data and pull the latest version at serving time. That would only make sense, though, if we have some TTL logic like #1988.
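
A rough sketch of that "keep multiple versions, read the latest at serving time" idea, with simple TTL filtering at read time (this is not the adchia#1 implementation; the data structures are illustrative):

    from datetime import datetime, timedelta
    from typing import Dict, List, Optional, Tuple

    # entity_key -> list of (event_timestamp, feature_values) versions
    _versions: Dict[str, List[Tuple[datetime, dict]]] = {}

    def write(entity_key: str, event_ts: datetime, values: dict) -> None:
        # writes never compare timestamps, so materialization is not slowed down
        _versions.setdefault(entity_key, []).append((event_ts, values))

    def read_latest(entity_key: str, ttl: timedelta, now: datetime) -> Optional[dict]:
        # drop versions older than the TTL, then serve the newest remaining one
        fresh = [(ts, v) for ts, v in _versions.get(entity_key, []) if now - ts <= ttl]
        if not fresh:
            return None
        return max(fresh, key=lambda r: r[0])[1]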

adchia added the kind/feature (New feature or request) label on Jan 7, 2022
adchia closed this as completed on Mar 31, 2022