
Featury: An online ML feature store

Featury is an end-to-end framework, built to simplify typical scenarios of online-offline ML feature engineering:

- An online API serves the latest values of ML features.
- Feature value changes are tracked and persisted to offline storage (HDFS or an S3 bucket).
- Historical ML feature values can be joined with training data offline in Spark/Flink for model training, feature bootstrapping and offline evaluation.
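
The offline join described above amounts to a point-in-time lookup: for each training event, take the latest feature value recorded at or before the event timestamp. Below is a minimal plain-Python sketch of that idea. The in-memory `history` layout, the function name and the sample keys are assumptions made for illustration; they are not Featury's API or file format.

```python
from bisect import bisect_right


def point_in_time_join(events, history):
    """Attach to each event the latest feature value known at event time.

    events:  list of (key, event_ts) tuples, e.g. training examples
    history: dict mapping key -> list of (change_ts, value), sorted by change_ts
             (hypothetical layout, chosen only for this sketch)
    Returns a list of (key, event_ts, value_or_None).
    """
    joined = []
    for key, event_ts in events:
        changes = history.get(key, [])
        timestamps = [ts for ts, _ in changes]
        # Index of the first change strictly after event_ts.
        idx = bisect_right(timestamps, event_ts)
        value = changes[idx - 1][1] if idx > 0 else None
        joined.append((key, event_ts, value))
    return joined


# Hypothetical change log for one feature and a few training events.
history = {"product:1": [(100, 3), (200, 5), (300, 8)]}
events = [("product:1", 50), ("product:1", 200), ("product:1", 250), ("product:2", 250)]
rows = point_in_time_join(events, history)
```

Events that precede the first recorded change, or that refer to an unknown key, get `None`, which avoids leaking future values into training data.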

It differs from existing solutions like Feast/Hopsworks in the following ways:

- Featury not only handles get-set actions for feature values, but can also do stateful processing:
  - increments for counters and periodic counters
  - string frequency sampling
  - numerical stats estimation for median, average and percentiles
  - bounded lists and maps
- Platform-agnostic offline model training: feature value histories are plain CSV/JSON/Protobuf/Parquet files, so you can use any tool like Spark/Flink/Pandas to do offline training.
- DB-agnostic online feature reads: Redis, Postgres or Cassandra can be used for persistence.
- Stateless and cloud-native: a single jar file for local development, a single k8s Deployment for production.
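
To make two of the stateful feature types listed above more concrete, here is a rough Python sketch of a periodic counter and a bounded list. The class names, the fixed-bucket design and the method signatures are assumptions made for illustration and do not reflect how Featury implements these features.

```python
from collections import defaultdict, deque


class PeriodicCounter:
    """Counts increments in fixed-size time buckets (hypothetical design)."""

    def __init__(self, period_seconds):
        self.period = period_seconds
        self.buckets = defaultdict(int)  # bucket start timestamp -> count

    def increment(self, ts, by=1):
        bucket_start = ts - ts % self.period
        self.buckets[bucket_start] += by

    def total(self, now, periods):
        """Sum of the current bucket and the (periods - 1) buckets before it."""
        current = now - now % self.period
        return sum(
            self.buckets.get(current - i * self.period, 0) for i in range(periods)
        )


class BoundedList:
    """Keeps only the most recent max_size values."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def append(self, value):
        self.items.append(value)

    def values(self):
        return list(self.items)


# Hourly click counter: two writes land in the first hour, one in the second.
clicks = PeriodicCounter(period_seconds=3600)
clicks.increment(ts=10)
clicks.increment(ts=3500, by=2)
clicks.increment(ts=3700)
last_hour = clicks.total(now=3800, periods=1)
last_two_hours = clicks.total(now=3800, periods=2)

# Last three viewed items; older entries are evicted automatically.
recent = BoundedList(max_size=3)
for item in [1, 2, 3, 4, 5]:
    recent.append(item)
```

Bucketing by period keeps the state small and makes "count over the last N periods" a sum over at most N entries.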

Feature stores address problems that are common to most production ML deployments:

- How can you be sure that feature computation is exactly the same when running ML inference online and when training your model offline?
- Feature values drift over time, so offline training may need access to historical values of a feature.
- New features require bootstrapping, so you must be able to compute them not only from now on, but also for all historical actions back in time.
- Online inference may need to access hundreds of features for hundreds of items with low latency.

Featury tries to solve these problems by encapsulating the task of tracking feature values:

- It logs all feature value changes, so every feature has a full historical view of its changes.
- Write operations in Featury are extremely fast and happen in the background.
- Feature values are eventually recomputed (for example, a running median computed over a sampled reservoir) and exposed through the inference API.
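
The "running median over a sampled reservoir" idea can be sketched as follows: writes only update a fixed-size random sample, and the median is computed from that sample when the value is read. This sketch uses the classic reservoir sampling scheme (Algorithm R); the class name, the `capacity` parameter and the choice of Algorithm R are assumptions made for illustration, not a description of Featury's internals.

```python
import random
import statistics


class ReservoirMedian:
    """Approximate running median over a fixed-size uniform sample."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def write(self, value):
        """Algorithm R: every value seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:
            j = self.rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.capacity:
                self.sample[j] = value

    def median(self):
        """Recomputed on read; exact while seen <= capacity, approximate after."""
        return statistics.median(self.sample) if self.sample else None


# While fewer values than the capacity have been written, the median is exact.
prices = ReservoirMedian(capacity=100, seed=42)
for value in range(1, 10):
    prices.write(value)
exact_median = prices.median()
```

The write path costs constant time and memory stays bounded by `capacity`, which is what makes it reasonable to defer the more expensive median computation to read time.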

Featury is licensed under the Apache 2.0 license.