Feature Store for ML

Data scientists are duplicating work because they don’t have a centralized feature store. Everybody I talk to really wants to build or even buy a feature store... if an organization had a feature store, the ramp-up period [for Data Scientists can be much faster].


Harish Dodi, O'Reilly Podcast

Featured from Medium

Common problems taking ML from lab to production

A look into disappearing data and degraded performance preventing ML models from shipping...

by Charna Parkey, Ph.D.

MLOps with a Feature Store

If AI is to become embedded in the DNA of Enterprise computing systems, Enterprises must first re-align their machine learning..

by Fabio Buso

Hopsworks

The first fully open-source Feature Store, based around Dataframes (Spark/Pandas), Hive (offline), and MySQL Cluster (online) databases. Supports model training/management/serving and Provenance.

Zipline

Airbnb use Zipline for feature management as part of their Bighead platform for ML.

Comcast

Comcast have had 2 iterations of their Feature Store, and as of early 2020, appear to be using Redis as their online Feature Store. They have previously used Flink for online feature computation and its queryable state API.

Pinterest

Galaxy is Pinterest’s incremental dataflow-based Feature Store on AWS. It includes a DSL for Feature Engineering, Linchpin.

Pinterest – Big Data Machine Learning Platform at Pinterest (Alluxio Inc)

Wix

Wix’ Feature Store is based on storing feature data in protobufs, with batch processing using SparkSQL on parquet files stored in S3 and online serving based on HBase/Redis. It provides a Python API for accessing training data as Pandas Dataframes.

Bigabid

The Bigabid Feature Store contains thousands of features and is a centralized software library and documentation center that “creates a single feature from a standardized input (data)”. Read more here:
https://www.bigabid.com/blog/data-the-importance-of-having-a-feature-store

Apple

Overton is Apple’s platform for managing data and models for ML. There is a publication about it: Overton: A Data System for Monitoring and Improving Machine-Learned Products

StreamSQL

StreamSQL have built a Feature Store as a commercial product based on Apache Pulsar, Cassandra, and Redis.

Tecton

Tecton are developing a managed feature store for AWS that manages both features and feature computations.

ScribbleData

ScribbleData have developed a feature store for ML.

Intuit

Intuit have built a feature store as part of their data science platform. It was developed for AWS and uses S3 and Dynamo as its offline/online feature serving layers.

Michelangelo

The first Feature Store (by Uber) that provides a DSL and is heavily built around Spark/Scala with Hive (offline) and Cassandra (online) databases. It is now called Michelangelo Palette.

See this talk about Michelangelo Palette at InfoQ (not available on youtube yet):

https://www.infoq.com/presentations/michelangelo-palette-uber/

Feast

GoJek/Google released Feast in early 2019 and it is built around Google Cloud services: Big Query (offline) and Big Table (online) and Redis (low-latency), using Beam for feature engineering.

Netflix

Netflix uses shared feature encoder libraries in their Metaflow platform to ensure consistency between training and serving, with S3 for offline features and microservices for serving online features. The shared feature engineering libraries are written in Java. Runway, their model management platform, builds on Metaflow.

FBLearner

Not much is known about Facebook’s Feature Store; cursory information is given here.

Databricks

Early work on sharing feature computation jobs by Databricks.

Twitter

Twitter decided to build a library, not a store. It is a set of shared feature libraries and metadata, along with shared file systems, object stores, and databases.

Zomato

Zomato have used Flink to compute features in real-time and then integrate their real-time feature store with their applications. They note that the real-time feature store needs high-throughput reads and writes at low latency (>1m writes/min). They use managed ElastiCache/Redis on AWS for the online feature store.

Survey Monkey

A Feature Store for AWS that has both an offline and an online database.

http://snurran.sics.se/surveymonkey.pdf

Spotify

A Feature Store for KubeFlow on GCP.

Concepts & Articles

Feature Store Articles

Feature Store Concepts

Consistent Features – Online & Offline

If the feature engineering code differs between the training and inferencing systems, the features they compute may not match, and predictions may therefore be unreliable. One solution is to have feature engineering jobs write their feature data to both an online and an offline database. Both training and inferencing applications then read their features from the store; online applications may need low latency (real-time) access to that feature data. The other solution is to use shared feature engineering libraries, which works if your online application and training application are both able to use the same shared libraries (e.g., both are JVM-based).
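The first solution (one feature engineering job writing to both stores) can be sketched in a few lines of Python. This is a minimal illustration only: the `DualWriter` class, the `compute_features` function, and the feature names are hypothetical, and an in-memory list and dict stand in for the offline and online databases.

```python
from datetime import datetime, timezone


def compute_features(raw):
    """Single feature function, so training and serving see identical logic."""
    orders = raw["orders"]
    return {
        "order_count": len(orders),
        "avg_order_value": sum(orders) / len(orders) if orders else 0.0,
    }


class DualWriter:
    """Hypothetical writer: a list stands in for the offline store,
    a dict stands in for the online store."""

    def __init__(self):
        self.offline = []  # append-only history, read when building training data
        self.online = {}   # latest values per entity, read at prediction time

    def write(self, entity_id, raw, event_time):
        features = compute_features(raw)
        # Offline: keep every version, tagged with its event time.
        self.offline.append(
            {"entity_id": entity_id, "event_time": event_time, **features}
        )
        # Online: overwrite with the most recent values only.
        self.online[entity_id] = features
        return features


writer = DualWriter()
writer.write("customer-1", {"orders": [10.0, 30.0]},
             datetime(2020, 1, 1, tzinfo=timezone.utc))
writer.write("customer-1", {"orders": [10.0, 30.0, 50.0]},
             datetime(2020, 2, 1, tzinfo=timezone.utc))
```

Because both stores are filled from the same `compute_features` call, the values used for training and for online predictions cannot drift apart through separately maintained code.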

Time Travel

“Given these events in the past what were the feature values of the customer during the time of that event” Carlo Hyvönen

Time travel is not normally found in databases: you cannot typically query the value of some column at some point in time. You can work around this by ensuring all schemas defining feature data include a datetime/event-time column. However, recent data lakes have added support for time-travel queries by storing all updates, which enables queries on old values for features. Some data platforms support time-travel functionality natively; Hudi, which appears in the comparison below, is one example.
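The event-time workaround described above amounts to a point-in-time lookup: for a given event, return the latest feature value recorded at or before that event's time. Here is a minimal sketch using only the Python standard library; the `history` layout and the `value_as_of` helper are assumptions made for illustration, not the API of any particular feature store.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: per customer, (event_time, value) pairs
# sorted by event_time. Every update is kept rather than overwritten.
history = {
    "customer-1": [
        (datetime(2020, 1, 1), 2),
        (datetime(2020, 2, 1), 3),
        (datetime(2020, 3, 1), 5),
    ]
}


def value_as_of(entity_id, as_of):
    """Return the latest value with event_time <= as_of, or None."""
    rows = history.get(entity_id, [])
    times = [event_time for event_time, _ in rows]
    # bisect_right gives the count of rows with event_time <= as_of.
    idx = bisect_right(times, as_of)
    return rows[idx - 1][1] if idx > 0 else None


# Feature value the customer had at the time of a mid-February event.
mid_february = value_as_of("customer-1", datetime(2020, 2, 15))
```

Using the value as of the event time, rather than the latest value, avoids leaking information from the future into training data.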

Feature Engineering

Michelangelo added a domain-specific language (DSL) to support engineering features from raw data sources (databases, data lake). However, it is also popular to use general purpose frameworks like Apache Spark/PySpark, Pandas, Apache Flink, and Apache Beam.

Materialize Train/Test Data?

Training data for models can be either streamed directly from the feature store into models or it can be materialized to a storage system, such as S3, HDFS, or a local filesystem. When multiple frameworks are used for ML (TensorFlow, PyTorch, Scikit-Learn), materializing train/test data into the native file format for each framework (.tfrecords for TensorFlow, .npy for PyTorch) is recommended.

Common file formats for ML frameworks:

  • .tfrecords (TensorFlow/Keras)
  • .npy (PyTorch, Scikit-Learn)
  • .csv (Scikit-Learn, others)
  • .petastorm (TensorFlow/Keras, PyTorch)
  • .h5 (Keras)
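As a rough illustration of materializing train/test data, the sketch below writes a small set of feature rows to `train.csv` and `test.csv` on the local filesystem, using only the Python standard library. The rows, column names, the `materialize_csv` helper, and the 80/20 split are all assumptions made for the example; a real feature store would read the rows from its offline database and might write .tfrecords or .npy instead.

```python
import csv
import os
import tempfile

# Hypothetical feature rows, as they might be read from an offline store.
rows = [
    {"customer_id": i, "order_count": i % 5, "label": i % 2}
    for i in range(10)
]


def materialize_csv(rows, directory, test_fraction=0.2):
    """Write train.csv and test.csv into directory; return their paths."""
    split = len(rows) - int(len(rows) * test_fraction)
    paths = {}
    for name, part in (("train", rows[:split]), ("test", rows[split:])):
        path = os.path.join(directory, f"{name}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(part)
        paths[name] = path
    return paths


out_dir = tempfile.mkdtemp()
paths = materialize_csv(rows, out_dir)
```

The split here is positional for simplicity; in practice the split is often random or time-based, depending on the model.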

Online Feature Store

Models may have been trained with hundreds of features, but online applications may receive just a few of those features from a user interaction (userId, sessionId, productId, datetime, etc). The online feature store is used by online applications to look up the missing features and build a feature vector that is sent to an online model for predictions. Online models are typically served over the network, as this decouples the model’s lifecycle from the application’s lifecycle. The latency, throughput, security, and high availability of the online feature store are critical to its success in the enterprise. The throughput of some key-value and in-memory databases that are used in existing feature stores is shown below.
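To illustrate the lookup step described above, here is a minimal sketch of how an online application might build a full feature vector from the few keys it receives. The `online_store` dict stands in for a key-value database, and the feature names, `FEATURE_ORDER`, and `build_feature_vector` are hypothetical names chosen for the example.

```python
from datetime import datetime

# Hypothetical online store: (entity type, key) -> precomputed features.
online_store = {
    ("user", "u42"): {"user_order_count": 7, "user_avg_order_value": 31.5},
    ("product", "p9"): {"product_popularity": 0.82},
}

# Order in which the (assumed) model expects its inputs.
FEATURE_ORDER = [
    "user_order_count",
    "user_avg_order_value",
    "product_popularity",
    "hour_of_day",
]


def build_feature_vector(request, default=0.0):
    """Combine request-time features with features looked up by key."""
    features = {"hour_of_day": request["datetime"].hour}
    features.update(online_store.get(("user", request["userId"]), {}))
    features.update(online_store.get(("product", request["productId"]), {}))
    # Missing features fall back to a default so the vector stays complete.
    return [features.get(name, default) for name in FEATURE_ORDER]


request = {
    "userId": "u42",
    "sessionId": "s1",
    "productId": "p9",
    "datetime": datetime(2020, 5, 1, 14, 30),
}
vector = build_feature_vector(request)
```

The resulting `vector` is what would be sent to the online model; each prediction costs one lookup per entity, which is why the latency of the online store matters.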

Feature Store Comparison

Platform | Open-Source | Offline | Online | Metadata | Feature Engineering | Supported Platforms | TimeTravel / Point-in-Time Queries | Training Data
--- | --- | --- | --- | --- | --- | --- | --- | ---
Hopsworks | AGPL-V3 | Hudi/Hive | MySQL Cluster | DB Tables, Elasticsearch | (Py)Spark, Python | AWS, GCP, On-Prem | SQL Join or Hudi Queries | .tfrecords, .csv, .npy, .petastorm, .h5, etc
Michelangelo | N/A | Hive | Cassandra | Content | Spark, DSL | Proprietary | SQL Join | Streamed to models?
Feast | Apache V2 | BigQuery | BigTable/Redis | DB Tables | Beam, Python | GCP | SQL Join | Streamed to models
? | N/A | Kafka/Cassandra | Kafka/Cassandra | Protocol Buffers | Shared libraries | Proprietary | ? | Protobuf
? | N/A | Hive | KV Store | KV Entries | Flink, Spark, DSL | Proprietary | Schema | Streamed to models?
Comcast | N/A | HDFS, Cassandra | Kafka / Redis | Github | Flink, Spark | Proprietary | No? | Unknown
Netflix | N/A | Kafka & S3 | Kafka & Microservices | Protobufs | Spark, shared libraries | Proprietary | Custom | Protobuf
Twitter | N/A | HDFS | Strato / Manhattan | Scala shared feature libraries | Scala DSL, Scalding, shared libraries | Proprietary | No | Unknown
? | N/A | ? | Yes, no details | Yes, no details | ? | Proprietary | ? | Unknown
Pinterest | Content | S3/Hive | Yes, no details | Yes, no details | DSL (Linchpin), Spark | Proprietary | ? | Unknown