feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI by jyejare · Pull Request #6202 · feast-dev/feast

jyejare · 2026-03-31T10:53:28Z

What this PR does / why we need it:

This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.

What it adds:

Core Monitoring Engine

Hybrid computation engine — SQL push-down on the native OfflineStore as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (same architecture as Feast materialization).
Fully native storage — Monitoring metrics are stored within the configured OfflineStore backend itself (no separate monitoring database). Six static methods on the OfflineStore base class (compute_monitoring_metrics, get_monitoring_max_timestamp, ensure_monitoring_tables, save_monitoring_metrics, query_monitoring_metrics, clear_monitoring_baseline) handle compute and storage.
PyArrow-based metrics computation (MetricsCalculator) — Backend-agnostic statistical computation as fallback, supporting:
- Numeric features: mean, stddev, min/max, percentiles (p50/p75/p90/p95/p99), null rate, histograms
- Categorical features: top-N value counts with other/unique counts
- Automatic feature type classification from Feast's PrimitiveFeastType and ValueType
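The metric families listed above can be illustrated with a small standard-library sketch. This is not the PR's `MetricsCalculator` (which is described as PyArrow/NumPy-based); the function names, the population standard deviation, the linear-interpolation percentile, and the equal-width histogram are all assumptions made for illustration.

```python
import math
import statistics
from collections import Counter


def numeric_metrics(values, bins=4):
    """Summarize a numeric column; None entries count as nulls (hypothetical helper)."""
    total = len(values)
    present = sorted(v for v in values if v is not None)
    null_rate = (total - len(present)) / total if total else 0.0
    if not present:
        return {"count": total, "null_rate": null_rate}

    def percentile(p):
        # Linear interpolation between the two nearest ranks (an assumed method).
        rank = (len(present) - 1) * p
        lo, hi = math.floor(rank), math.ceil(rank)
        return present[lo] + (present[hi] - present[lo]) * (rank - lo)

    low, high = present[0], present[-1]
    width = (high - low) / bins or 1.0  # avoid zero width for constant columns
    histogram = [0] * bins
    for v in present:
        # The maximum value falls into the last bucket.
        histogram[min(int((v - low) / width), bins - 1)] += 1

    return {
        "count": total,
        "null_rate": null_rate,
        "mean": statistics.fmean(present),
        "stddev": statistics.pstdev(present),  # population stddev, assumed
        "min": low,
        "max": high,
        "percentiles": {f"p{q}": percentile(q / 100) for q in (50, 75, 90, 95, 99)},
        "histogram": histogram,
    }


def categorical_metrics(values, top_n=2):
    """Top-N value counts plus 'other' and unique counts (hypothetical helper)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top = counts.most_common(top_n)
    return {
        "top_values": top,
        "other_count": len(present) - sum(c for _, c in top),
        "unique_count": len(counts),
        "null_rate": (len(values) - len(present)) / len(values) if values else 0.0,
    }
```

For example, `numeric_metrics([1.0, 2.0, 3.0, 4.0, None])` reports a null rate of 0.2 and a median of 2.5.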

Multi-Backend Support (8 Offline Stores)

All 6 native monitoring methods implemented for each backend with dialect-specific SQL:

| Backend | Compute | Storage | Dialect highlights |
|---|---|---|---|
| PostgreSQL | SQL push-down | `INSERT ON CONFLICT` | `PERCENTILE_CONT`, `WIDTH_BUCKET` |
| Snowflake | SQL push-down | `MERGE` with `VARIANT` JSON | `APPROX_PERCENTILE`, `WIDTH_BUCKET` |
| BigQuery | SQL push-down | `MERGE` into BQ tables | `APPROX_QUANTILES`, parameterized queries |
| Redshift | SQL push-down | `MERGE` via Data API | `APPROXIMATE PERCENTILE_DISC` |
| Spark | SparkSQL push-down | Parquet tables | `PERCENTILE_APPROX`, `spark.sql()` |
| Oracle | SQL via Ibis | `MERGE FROM DUAL` | `PERCENTILE_CONT WITHIN GROUP` |
| DuckDB | In-memory SQL | Parquet files | `QUANTILE_CONT`, `HISTOGRAM` |
| Dask | PyArrow compute | Parquet files | `pyarrow.compute` + `numpy` |

Multi-Granularity Time-Series Metrics

5 granularities: daily, weekly, biweekly, monthly, quarterly
Auto-compute mode: Detects latest event timestamp and computes all granularities in one job
Pre-computed metrics stored per date + granularity for fast retrieval
On-demand transient compute: Fresh statistics for arbitrary date ranges (not stored)
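As a rough illustration of the auto-compute mode, the sketch below derives one date window per granularity from a latest event date. The window lengths in days and the function name are assumptions; the PR description does not state how each granularity is bounded.

```python
from datetime import date, timedelta

# Window lengths are illustrative assumptions, not values taken from the PR.
GRANULARITY_DAYS = {
    "daily": 1,
    "weekly": 7,
    "biweekly": 14,
    "monthly": 30,
    "quarterly": 90,
}


def windows_ending_at(latest_event_date):
    """Return {granularity: (start, end)} with inclusive bounds ending at the given date."""
    return {
        name: (latest_event_date - timedelta(days=days - 1), latest_event_date)
        for name, days in GRANULARITY_DAYS.items()
    }
```

Calling `windows_ending_at(date(2026, 3, 31))` yields all five windows in one pass, which mirrors the idea of computing every granularity in a single job.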

Batch + Log Data Source Support

Batch source: Reads from the feature view's batch_source via OfflineStore.pull_all_from_table_or_query()
Log source: Reads from feature serving logs via FeatureService.logging_config destination, using __log_timestamp as event timestamp
Feature name normalization: Prefixed log column names (driver_stats__conv_rate) are parsed back to their original feature_view_name + feature_name for storage compatibility and drift detection
data_source_type column (batch / log) differentiates metrics in storage
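The feature name normalization described above might look roughly like the following. The helper name and the choice to split on the first `__` are assumptions; the PR's actual parsing rules (for example, for view names that themselves contain `__`) are not shown in the description.

```python
def split_log_column(column, separator="__"):
    """Split a prefixed log column into (feature_view_name, feature_name).

    Hypothetical helper: splits on the first separator and rejects names that
    lack a view prefix, such as the __log_timestamp metadata column.
    """
    view, found, feature = column.partition(separator)
    if not found or not view or not feature:
        raise ValueError(f"not a prefixed feature column: {column!r}")
    return view, feature
```

With this sketch, `split_log_column("driver_stats__conv_rate")` returns `("driver_stats", "conv_rate")`, while `__log_timestamp` raises `ValueError` because it has no view prefix.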

Orchestration Service (`MonitoringService`)

Ties registry, offline store, calculator, and storage together
Computes and aggregates metrics at feature, feature view, and feature service levels
Cached OfflineStore instance for performance
Unified compute/timestamp methods handling both batch and log paths with SQL push-down + fallback

Shared Utilities (`monitoring_utils.py`)

Centralized table name constants, column lists, PK definitions
monitoring_table_meta(), opt_float(), empty_numeric_metric(), empty_categorical_metric(), normalize_monitoring_row(), build_view_aggregate()
Used by all 8 backends — eliminates duplication and ensures consistency

DQM Job Engine (`DQMJobManager`)

Asynchronous job abstraction for metric computation (compute, baseline, auto_compute)
Job status tracking in feast_monitoring_jobs table
Supports future integration with Ray/Spark job runners
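A minimal in-memory sketch of job status tracking is shown below. It is synchronous and purely illustrative: the class names and the `pending`, `running`, and `failed` states are assumptions (only `completed` appears in this thread), and the PR persists job state in the feast_monitoring_jobs table rather than in a dictionary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    job_type: str  # e.g. "compute", "baseline", "auto_compute"
    status: str = "pending"
    result: dict = field(default_factory=dict)
    error: str = ""


class InMemoryJobManager:
    """Hypothetical stand-in for a job manager; stores jobs in a dict."""

    def __init__(self):
        self.jobs = {}

    def submit(self, job_type):
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = Job(job_type)
        return job_id

    def run(self, job_id, fn):
        # Runs the callable inline; a real runner would execute asynchronously.
        job = self.jobs[job_id]
        job.status = "running"
        try:
            job.result = fn()
            job.status = "completed"
        except Exception as exc:
            job.status = "failed"
            job.error = str(exc)
        return job
```

Keeping status transitions in one place is what would let a different runner be swapped in behind the same interface.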

REST API (`/monitoring/`)

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/monitoring/compute` | Submit batch DQM job |
| `POST` | `/monitoring/auto_compute` | Auto-detect dates, all granularities |
| `POST` | `/monitoring/compute/transient` | On-demand compute (not stored) |
| `POST` | `/monitoring/compute/log` | Compute from serving logs |
| `POST` | `/monitoring/auto_compute/log` | Auto-detect log dates, all granularities |
| `GET` | `/monitoring/jobs/{job_id}` | DQM job status |
| `GET` | `/monitoring/metrics/features` | Per-feature metrics |
| `GET` | `/monitoring/metrics/feature_views` | Per-view aggregates |
| `GET` | `/monitoring/metrics/feature_services` | Per-service aggregates |
| `GET` | `/monitoring/metrics/baseline` | Baseline distribution retrieval |
| `GET` | `/monitoring/metrics/timeseries` | Time-series data for trend analysis |

All endpoints support cascading filters: project, feature_service_name, feature_view_name, feature_name, granularity, data_source_type, date range.

RBAC enforced using existing AuthzedAction.DESCRIBE (read) and AuthzedAction.UPDATE (compute).
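To show how the cascading filters map onto query strings, here is a small sketch that builds a request URL with the standard library. The helper name and the base URL are hypothetical; the endpoint path and the filter names come from the table and filter list above.

```python
from urllib.parse import urlencode


def monitoring_url(base_url, endpoint, **filters):
    """Build a /monitoring/ query URL, dropping filters that are None (hypothetical helper)."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{base_url.rstrip('/')}/monitoring/{endpoint}"
    return f"{url}?{query}" if query else url


# Example with a placeholder host; only the filters that are set are sent.
example = monitoring_url(
    "https://feast.example.com/api/v1",
    "metrics/features",
    project="my_project",
    feature_view_name="driver_hourly_stats",
    granularity="weekly",
    feature_name=None,
)
```

Omitting a filter widens the query, so leaving out `feature_name` here would request every feature in the view.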

CLI (`feast monitor run`)

Options:
  --feature-view TEXT     Feature view name (omit for all)
  --feature-name TEXT     Feature name(s), repeatable
  --start-date TEXT       Start date YYYY-MM-DD (omit for auto-detect)
  --end-date TEXT         End date YYYY-MM-DD (omit for auto-detect)
  --granularity TEXT      daily | weekly | biweekly | monthly | quarterly
  --set-baseline          Mark this computation as baseline
  --source-type TEXT      batch | log | all (default: batch)

Auto-Baseline on `feast apply`

Automatically queues baseline metric computation for new features on feast apply
Non-blocking (async DQM job), idempotent (skips existing baselines)
Configurable — can be disabled via feature_store.yaml:

feature_server:
  dqm:
    distribution:
      initial:
        enabled: false
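The nested key path above can be read defensively once the YAML has been parsed into a dictionary. The sketch below assumes auto-baseline is on unless explicitly disabled, which the description implies but does not state outright; the function name is hypothetical and the PR itself models this with config classes such as DqmInitialDistributionConfig.

```python
def auto_baseline_enabled(config):
    """Return whether auto-baseline is enabled in a parsed feature_store.yaml dict.

    Hypothetical helper; missing sections fall back to an assumed default of True.
    """
    node = config
    for key in ("feature_server", "dqm", "distribution", "initial"):
        node = node.get(key) or {}
    return bool(node.get("enabled", True))


disabled = {
    "feature_server": {"dqm": {"distribution": {"initial": {"enabled": False}}}}
}
```

Here `auto_baseline_enabled(disabled)` is `False`, while an empty config falls back to the assumed default.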

Feast Operator Support

New CRD types: DqmConfig, DqmDistributionConfig, DqmInitialDistributionConfig added to FeatureStoreSpec
Operator generates feature_server.dqm section in feature_store.yaml when DQM config is set
DeepCopy methods auto-generated via make generate
Disabling auto-baseline from operator CR:

apiVersion: feast.dev/v1
kind: FeatureStore
spec:
  feastProject: my_project
  dqm:
    distribution:
      initial:
        enabled: false

Documentation

How-to guide: docs/how-to-guides/feature-monitoring.md — Production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
Quickstart notebook: examples/monitoring/monitoring-quickstart.ipynb — 12-step hands-on walkthrough with visualization examples
docs/SUMMARY.md updated with links to both

Design decisions:

Native OfflineStore compute + storage — Each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
Hybrid fallback — Backends that don't implement native compute fall back to Python/PyArrow, ensuring all offline stores are supported.
Separate /monitoring/ route rather than extending existing /metrics/ — The existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.
DQM Job Engine for async computation — Supports future Ray/Spark integration for distributed metric computation.
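The hybrid fallback decision can be sketched as a try-native-then-fall-back dispatch. The method name `compute_monitoring_metrics` appears in the PR description, but its signature here, the store classes, and the use of `NotImplementedError` as the signal are assumptions for illustration.

```python
class NativeStore:
    """Hypothetical backend that implements native (SQL push-down) compute."""

    @staticmethod
    def compute_monitoring_metrics(rows, column):
        return {"engine": "native", "column": column}


class FallbackOnlyStore:
    """Hypothetical backend without native compute."""

    @staticmethod
    def compute_monitoring_metrics(rows, column):
        raise NotImplementedError


def python_fallback(rows, column):
    # Pure-Python stand-in for the PyArrow-based fallback path.
    values = [row[column] for row in rows if row.get(column) is not None]
    mean = sum(values) / len(values) if values else None
    return {"engine": "python", "column": column, "count": len(values), "mean": mean}


def compute_metrics(store, rows, column):
    """Prefer the store's native compute; fall back when it is not implemented."""
    try:
        return store.compute_monitoring_metrics(rows, column)
    except NotImplementedError:
        return python_fallback(rows, column)
```

The caller never needs to know which path ran, which is what would allow every offline store to be supported through a single entry point.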

Which issue(s) this PR fixes:

Partially Fixes #5919

Checks

I've made sure the tests are passing.
My commits are signed off (git commit -s)
My PR title follows conventional commits format

Testing Strategy

Unit tests
Integration tests
Operator unit tests (Ginkgo)

Test coverage (all passing):

| Test Suite | Count | Covers |
|---|---|---|
| `test_metrics_calculator.py` | 19 | Numeric/categorical computation, edge cases (empty, all-null, single value, high cardinality), type classification, PyArrow type classification |
| `test_monitoring_integration.py` | 16+ | End-to-end batch/log computation, baseline flow, view/service aggregation, native storage dispatch, log feature name normalization, REST API endpoints, CLI, RBAC enforcement |
| `repo_config_test.go` | 92 | Operator repo config generation including DQM config with initial distribution disabled, YAML serialization verification |

Snyk SAST scan: 0 vulnerabilities across all new files.

devin-ai-integration

Devin Review found 2 new potential issues.


ntkathole · 2026-05-06T14:06:20Z

    (feast_feature_freshness_seconds)."""


+class DqmInitialDistributionConfig(FeastConfigBaseModel):


I think these configs should also live at the top level instead of under feature_server. Monitoring uses the offline store, not the online server. This field is more similar to materialization, which is a top-level config.

# feature_store.yaml
feature_monitoring:
  auto_baseline: false

This matches the existing pattern: materialization spans the offline and online stores, openlineage spans apply and materialize, and feature_monitoring would span the offline store (compute/storage), the apply trigger, and the server API.

This may be because the existing metrics config resides under the server config. Metrics (even operational metrics) could also be computed for the offline store, so I think it would be good to move both operational metrics and DQM metrics under a parent monitoring config.

Srihari1192 · 2026-05-14T03:59:13Z

@jyejare Please find a few observations from testing:

The per-feature metrics query endpoint (GET /monitoring/metrics/features) rejects granularity=daily with HTTP 422, even though daily metrics are computed and stored by both POST /monitoring/auto_compute and feast monitor run -g daily. The data exists in storage but is unreachable via the query API because daily is absent from the endpoint's granularity enum validator, while the CLI explicitly lists it as a valid option (-g [daily|weekly|biweekly|monthly|quarterly]).


curl -k -s \
  '..../api/v1/monitoring/metrics/features?project=my_project&feature_view_name=driver_hourly_stats&granularity=daily'
{"status_code":422,"detail":"Out of range float values are not JSON compliant: nan","error_type":"ValueError"}
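The error text in the response above matches the `ValueError` that Python's `json` encoder raises when NaN serialization is disallowed, which suggests a NaN metric value reached the response serializer. The sketch below shows one possible mitigation, replacing non-finite floats with `None` before encoding. The helper name is hypothetical, and the thread does not say whether the eventual fix works this way.

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN and infinite floats with None (hypothetical helper)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


metric = {"mean": float("nan"), "histogram": [1.0, float("inf")]}

try:
    # Strict JSON encoding refuses NaN and infinity.
    json.dumps(metric, allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True

safe_payload = json.dumps(sanitize(metric), allow_nan=False)
```

After sanitizing, the payload encodes cleanly with `null` in place of the non-finite values, so a client can tell "no value" apart from a real number.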

The monitoring service has incomplete dtype support across both the compute and read paths. Map/Json/Struct features are silently skipped at compute time with WARNING: Unsupported dtype log lines and are never stored. String features are computed and stored correctly but cause HTTP 422 on GET /monitoring/metrics/features because of a response schema mismatch between the categorical metrics format (top-N histogram) and the numeric-only response model. In both cases the job reports Status: completed, with total_features reflecting only the successfully computed features rather than the total registered features, so nothing indicates that any features were skipped or are unreadable.

Error logs with the feast CLI:

Auto-computing batch metrics for all granularities...
WARNING:feast.monitoring.monitoring_service:Unsupported dtype 'Map' for feature 'driver_metadata', skipping
WARNING:feast.monitoring.monitoring_service:Unsupported dtype 'Json' for feature 'driver_config', skipping
WARNING:feast.monitoring.monitoring_service:Unsupported dtype 'Struct({name: String, age: String})' for feature 'driver_profile', skipping
Status: completed
Feature views computed: 1
Features computed: 15
Granularities: biweekly, daily, monthly, quarterly, weekly
Duration: 293ms

Signed-off-by: Jitendra Yejare <[email protected]>

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <[email protected]>

jyejare · 2026-05-20T15:21:45Z

@Srihari1192 Both issues are fixed. Regarding the other observation about Map/Json/Struct features being silently skipped: that is working as designed. These complex types don't have meaningful statistical metrics (mean, stddev, etc.). The WARNING log is the correct behavior, and computed_features correctly reflects how many features were actually computed. No change needed there.

Signed-off-by: Jitendra Yejare <[email protected]>

jyejare requested a review from a team as a code owner March 31, 2026 10:53

jyejare marked this pull request as draft March 31, 2026 10:54

jyejare force-pushed the monitoring_plus branch from 4340dbb to 940a4af Compare March 31, 2026 10:54


jyejare force-pushed the monitoring_plus branch 4 times, most recently from d0b45bb to c06853e Compare April 21, 2026 14:00

jyejare marked this pull request as ready for review April 21, 2026 14:34

jyejare requested review from a team and sudohainguyen as code owners April 21, 2026 14:34

jyejare requested review from lokeshrangineni, robhowley and tokoko and removed request for a team April 21, 2026 14:34

devin-ai-integration Bot reviewed Apr 21, 2026


Comment thread sdk/python/feast/infra/offline_stores/bigquery.py Outdated

Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated

jyejare changed the title ~~feat: Add feature quality monitoring with statistical metrics, REST API, and CLI~~ feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config Apr 21, 2026

jyejare changed the title ~~feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config~~ feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI Apr 21, 2026

jyejare mentioned this pull request May 5, 2026

[Feature] Built-in feature drift detection with alerting #6341

Open

jyejare force-pushed the monitoring_plus branch from 3da4dde to 0344087 Compare May 5, 2026 08:52

ntkathole force-pushed the monitoring_plus branch from 0344087 to 3c73a70 Compare May 6, 2026 11:59