As a starting point, consider that in Machine Learning (ML) or MLOps, models that make predictions need features to feed on. A feature is an individual measurable property or characteristic of the phenomenon being observed. Features are simply the data the models use: for example, a row of an Excel sheet or the pixels in a picture.
Features are “any measurable input that can be used in a predictive model”.
As “fuel for AI systems”, features are used to train ML models and make predictions. The challenge is that good predictions require a lot of data: in general, the more relevant, high-quality features a model has, the better its predictions.
Features also need to be organized to be useful: the data for the features needs to be pulled from somewhere (a data source), and the features need to be stored after being computed (feature engineering is the step that transforms source data into features) so that the ML pipeline can use them.
Machine Learning, in general, requires ready-made datasets of features to train models correctly. When we say datasets, we mean that the features are typically accessed as files in a file system (you can also read features directly as DataFrames from the Feature Store).
The Feature Store is where features are stored and organized for the explicit purpose of either training models (by Data Scientists) or making predictions (by applications that have a trained model). It is a central location where you can create or update groups of features built from multiple data sources, and create or update datasets from those feature groups. Those datasets are used to train models, or by applications that do not want to compute the features themselves but just retrieve them when they need them to make predictions.
The first public feature store, Michelangelo Palette, was announced by Uber in 2017. The main purpose of a feature store is to facilitate the discovery, documentation, and reuse of features and to ensure their correctness, whether they are used by batch or online applications. The feature store provides a high-throughput batch API for creating point-in-time correct training data and retrieving features for batch predictions, and a low-latency serving API for retrieving features for online predictions. The feature store also helps ensure that features are computed consistently across the batch and serving APIs.
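To make "point-in-time correct" more concrete, here is a minimal sketch using pandas rather than any particular feature store API. The table and column names (`labels`, `features`, `purchases_30d`) are made up for illustration. The idea is that each training row may only see the feature value that was already known at the time of the labelled event, never a later one.

```python
import pandas as pd

# Hypothetical label events: one row per (user, time) we want a training example for.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-10", "2024-02-10", "2024-01-20"]),
    "label": [0, 1, 0],
})

# Hypothetical feature values, each stamped with the time it became available.
features = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_time": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-01-15", "2024-02-15"]
    ),
    "purchases_30d": [3, 7, 1, 4],
})


def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """For each label row, attach the latest feature value known at event_time."""
    # merge_asof requires both frames to be sorted on their time keys.
    left = labels.sort_values("event_time")
    right = features.sort_values("feature_time")
    return pd.merge_asof(
        left,
        right,
        left_on="event_time",
        right_on="feature_time",
        by="user_id",          # only match rows for the same user
        direction="backward",  # never look into the future
    )


training_df = point_in_time_join(labels, features)
print(training_df[["user_id", "event_time", "purchases_30d", "label"]])
```

A plain join on `user_id` would leak the 2024-02-15 value into the user 2 example from 2024-01-20. The backward as-of join is one way to avoid that kind of leakage.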
Here is where the feature store comes in. Let’s say you’re working with an e-commerce recommendation system where a search query for items would benefit from personalization. When a user issues the search query, it is typically executed on a stateless web application: modern microservice principles have made many such services stateless, with operational state stored in a database, key-value store, or message bus.
The search query is information-poor: it only contains some text, and the only other state available is the user ID and/or session ID. However, the user ID and session ID can be used to retrieve large numbers of precomputed features from the feature store.
The original information-poor signal (search query and user/session IDs) can now be transformed into an information-rich signal by enrichment with features representing the user’s history (items the user interacted with, orders) and features representing context (what items are popular right now). The feature store provides history and context to enable applications to become AI-enabled.
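As a rough illustration of that enrichment step, the sketch below uses plain Python dictionaries as a stand-in for the online feature store. Every name in it (`USER_FEATURES`, `CONTEXT_FEATURES`, `enrich`, the feature names) is hypothetical; a real service would call the feature store's serving API instead.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for an online feature store, keyed by entity ID.
USER_FEATURES = {
    "u42": {"orders_last_30d": 5, "last_viewed_category": "shoes"},
}
CONTEXT_FEATURES = {
    "global": {"trending_category": "jackets"},
}


@dataclass
class SearchRequest:
    query: str    # the information-poor signal
    user_id: str  # the key used to look up precomputed features


def enrich(request: SearchRequest) -> dict:
    """Combine the raw request with precomputed history and context features."""
    history = USER_FEATURES.get(request.user_id, {})  # empty for unknown users
    context = CONTEXT_FEATURES["global"]
    return {"query": request.query, **history, **context}


feature_vector = enrich(SearchRequest(query="running", user_id="u42"))
print(feature_vector)
```

The web application itself stays stateless: everything it knows about the user's history arrives through the lookup.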
So now in more technical terms, here is why having a feature store will make your life easier:
Feature stores come in different packages. There are feature stores that come with an online database, and virtual feature stores that don’t come with any database (FeatureForm). There are feature stores that are tightly integrated with a single Data Warehouse (Rasgo and Snowflake), and those that use a compute engine, such as Spark, to work with many different data stores (data warehouses, data lakes, event buses, etc.).
Each feature store provides a different solution for plugging into existing Enterprise data infrastructure and existing machine learning tooling. During the Feature Store Summit, many organizations discussed the motivations, benefits, and challenges of their individual approaches.
In general, the feature store fits into Enterprise infrastructure that usually consists of operational data and analytical data. Operational data is the data behind the apps that drive the business. Analytical data lives in the data warehouse or data lake, which stores large volumes of data used to analyze and optimize the business. Data pipelines (ELT or ETL) extract data from operational datastores, transform it, and write it to the data warehouse or data lake. Many Enterprises that aim to productionize AI start with the analytical data and build feature pipelines from there to produce training data for models and features for batch scoring. But what is a feature pipeline?
Feature pipelines are similar to data pipelines, but instead of the output being rows in a table, the output is data that has been aggregated, validated, and transformed into a format that is suitable for input to a model. Data needs to be validated because machine learning algorithms are very sensitive to bad data. This data often needs to be aggregated, as features are compressed signals over many data points, such as the number of credit card purchases in the last month. Finally, machine learning models expect well-formed numerical data as input, so the input data needs to be transformed into the format the model expects and scaled to help the model converge during training. For example, you will need to encode categorical variables and scale/normalize numerical variables.
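The validate, aggregate, and transform steps above can be sketched in a few lines of pandas. This is only an illustration under simple assumptions: the column names are invented, "valid" just means a positive amount, and min-max scaling stands in for whatever scaling your model actually needs.

```python
import pandas as pd

# Hypothetical raw transactions; one row is deliberately invalid (negative amount).
raw = pd.DataFrame({
    "card_id":   ["a", "a", "b", "b", "b", "b", "c"],
    "card_type": ["credit", "credit", "debit", "debit", "debit", "debit", "credit"],
    "amount":    [20.0, 35.0, 10.0, -5.0, 50.0, 15.0, 5.0],
})


def feature_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Validate: drop rows that break an assumed rule (amounts must be positive).
    valid = df[df["amount"] > 0]

    # 2. Aggregate: compress many transactions into one row of features per card.
    agg = (
        valid.groupby("card_id")
        .agg(
            purchase_count=("amount", "size"),
            total_amount=("amount", "sum"),
            card_type=("card_type", "first"),
        )
        .reset_index()
    )

    # 3. Transform: one-hot encode the categorical column ...
    encoded = pd.get_dummies(agg, columns=["card_type"], dtype=float)

    # ... and min-max scale the numerical columns to the range [0, 1].
    # (This assumes each column has at least two distinct values.)
    for col in ["purchase_count", "total_amount"]:
        lo, hi = encoded[col].min(), encoded[col].max()
        encoded[col] = (encoded[col] - lo) / (hi - lo)
    return encoded


features = feature_pipeline(raw)
print(features)
```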
One other challenge with feature pipelines is that the output is not always back to the analytical data stores. Sometimes features are needed in the operational data stores, for example, if you are building a user-facing application that needs to retrieve pre-computed historical or contextual features at runtime to enrich its feature vectors. One approach would be to design feature pipelines to store their output to a key-value store on the operational side and then also store their output to a data lake or warehouse.
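Here is a minimal sketch of that dual-write approach, with a dictionary standing in for the key-value store and a list standing in for an append-only table in the data lake or warehouse. The names (`online_store`, `offline_rows`, `write_features`, `clicks_7d`) are hypothetical and only meant to show the shape of the idea.

```python
import pandas as pd

online_store: dict = {}   # stand-in for a key-value store (latest value per user)
offline_rows: list = []   # stand-in for an append-only table (full history)


def write_features(batch: pd.DataFrame) -> None:
    """Write one batch of computed features to both stores."""
    # Analytical side: append every row so the full history is kept.
    offline_rows.extend(batch.to_dict("records"))

    # Operational side: keep only the most recent row per user.
    latest = batch.sort_values("event_time").groupby("user_id").tail(1)
    for row in latest.to_dict("records"):
        current = online_store.get(row["user_id"])
        if current is None or row["event_time"] >= current["event_time"]:
            online_store[row["user_id"]] = row


write_features(pd.DataFrame({
    "user_id": [1, 2],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-01"]),
    "clicks_7d": [3, 8],
}))
write_features(pd.DataFrame({
    "user_id": [1],
    "event_time": pd.to_datetime(["2024-01-02"]),
    "clicks_7d": [5],
}))

print(online_store[1]["clicks_7d"], len(offline_rows))
```

Even in this toy form, every pipeline has to carry its own copy of the write logic.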
The operational database stores the latest values for features, and the analytical system stores the historical values for features, which are used for training and batch predictions. This proposed solution might sound feasible, but it runs into many issues: the feature pipelines quickly become pipeline jungles, there is no support for access control and governance, no way for Data Scientists to discover what features are available, and no controlled way to reuse features. Here is where the feature store works its magic. It comes in as an abstraction that, for machine learning, unifies the operational data on one side and the analytical data on the other.
Another challenge is how data can be transferred from the analytical platform to the operational platform. This is not something that is traditionally done in many enterprises or organizations, and it has recently been termed “reverse ETL”. If you are more familiar with the problem domain, you might wonder: “If I transform the data before the feature store, how can the data scientists inspect and understand the data? How can they browse and understand the feature values/statistics if the categorical variable has been one-hot encoded?”
So you would be right to think that transformations are usually moved after the feature store, where the stored feature values stay readable for data scientists, and that they need to be consistent between serving and training.
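One way to picture "consistent between serving and training" is to fit the transformation parameters once and reuse the same object in both places. The sketch below is plain Python with made-up names (`Transform`, `fit_transform_params`, `stored_rows`); it is not how any specific feature store implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transform:
    """Parameters fitted once on training data and reused at serving time."""
    mean: float
    std: float
    categories: tuple

    def apply(self, amount: float, category: str) -> list:
        scaled = (amount - self.mean) / self.std
        one_hot = [1.0 if category == c else 0.0 for c in self.categories]
        return [scaled] + one_hot


def fit_transform_params(rows: list) -> Transform:
    """Compute scaling statistics and the category vocabulary from stored features."""
    amounts = [r["amount"] for r in rows]
    mean = sum(amounts) / len(amounts)
    std = (sum((a - mean) ** 2 for a in amounts) / len(amounts)) ** 0.5
    categories = tuple(sorted({r["category"] for r in rows}))
    return Transform(mean, std, categories)


# Untransformed feature values, as they would sit in the feature store.
stored_rows = [
    {"amount": 10.0, "category": "books"},
    {"amount": 20.0, "category": "shoes"},
    {"amount": 30.0, "category": "books"},
]

transform = fit_transform_params(stored_rows)

# Training path and serving path call the same apply() with the same parameters.
training_vectors = [transform.apply(r["amount"], r["category"]) for r in stored_rows]
serving_vector = transform.apply(20.0, "shoes")
print(serving_vector)
```

If the serving code recomputed its own mean or used a different category order, the model would receive inputs it was never trained on. That mismatch is often called training/serving skew.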
Another great reason for integrating a feature store into your infrastructure is that it lets many different personas communicate without speaking the same ‘language’. Data engineers typically take responsibility for the analytical data and have to get that data into the feature store, so they build these feature pipelines together with data scientists.
Data scientists can build additional feature pipelines, rewrite features, and use the features to train models, while ML engineers are usually responsible for putting these models in production. Most of the time these personas work in different programming languages, so the feature store serves as a unified platform that helps them collaborate.