
feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI #6202

Open
jyejare wants to merge 8 commits into feast-dev:master from jyejare:monitoring_plus

Collaborator

@jyejare jyejare commented Mar 31, 2026

What this PR does / why we need it:

This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.

What it adds:

Core Monitoring Engine

  • Hybrid computation engine — SQL push-down on the native OfflineStore as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (same architecture as Feast materialization).
  • Fully native storage — Monitoring metrics are stored within the configured OfflineStore backend itself (no separate monitoring database). Six static methods on the OfflineStore base class (compute_monitoring_metrics, get_monitoring_max_timestamp, ensure_monitoring_tables, save_monitoring_metrics, query_monitoring_metrics, clear_monitoring_baseline) handle compute and storage.
  • PyArrow-based metrics computation (MetricsCalculator) — Backend-agnostic statistical computation as fallback, supporting:
    • Numeric features: mean, stddev, min/max, percentiles (p50/p75/p90/p95/p99), null rate, histograms
    • Categorical features: top-N value counts with other/unique counts
    • Automatic feature type classification from Feast's PrimitiveFeastType and ValueType
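The metric families listed above can be sketched in a few lines. This is a stdlib-only illustration, not the PR's `MetricsCalculator` (which is described as PyArrow/NumPy-based); the function names, the bin count, the choice of population standard deviation, and the linear-interpolation percentile are all assumptions made for the example.

```python
import math
import statistics
from collections import Counter


def percentile(sorted_vals, q):
    """Linear-interpolation percentile for q in [0, 1] over pre-sorted data."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def numeric_metrics(values, bins=10):
    """Summarize a numeric column; None and NaN both count as nulls (assumed)."""
    present = sorted(v for v in values if v is not None and not math.isnan(v))
    null_rate = (len(values) - len(present)) / len(values) if values else 0.0
    if not present:
        return {"count": 0, "null_rate": null_rate}
    lo, hi = present[0], present[-1]
    width = (hi - lo) / bins or 1.0  # constant columns would give zero-width bins
    histogram = [0] * bins
    for v in present:
        # Clamp so the maximum value lands in the last bin.
        histogram[min(int((v - lo) / width), bins - 1)] += 1
    return {
        "count": len(present),
        "null_rate": null_rate,
        "mean": statistics.fmean(present),
        "stddev": statistics.pstdev(present),
        "min": lo,
        "max": hi,
        "p50": percentile(present, 0.50),
        "p95": percentile(present, 0.95),
        "histogram": histogram,
    }


def categorical_metrics(values, top_n=3):
    """Top-N value counts plus the 'other' and unique counts."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top = counts.most_common(top_n)
    return {
        "null_rate": (len(values) - len(present)) / len(values) if values else 0.0,
        "unique_count": len(counts),
        "top_values": top,
        "other_count": len(present) - sum(c for _, c in top),
    }
```

Calling `numeric_metrics([1.0, 2.0, 3.0, 4.0, None], bins=3)` yields a null rate of 0.2 and a three-bin histogram; a real implementation would also need to decide how empty and all-null columns are represented in storage.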

Multi-Backend Support (8 Offline Stores)

All 6 native monitoring methods implemented for each backend with dialect-specific SQL:

| Backend | Compute | Storage | Dialect highlights |
| --- | --- | --- | --- |
| PostgreSQL | SQL push-down | INSERT ON CONFLICT | PERCENTILE_CONT, WIDTH_BUCKET |
| Snowflake | SQL push-down | MERGE with VARIANT JSON | APPROX_PERCENTILE, WIDTH_BUCKET |
| BigQuery | SQL push-down | MERGE into BQ tables | APPROX_QUANTILES, parameterized queries |
| Redshift | SQL push-down | MERGE via Data API | APPROXIMATE PERCENTILE_DISC |
| Spark | SparkSQL push-down | Parquet tables | PERCENTILE_APPROX, spark.sql() |
| Oracle | SQL via Ibis | MERGE FROM DUAL | PERCENTILE_CONT WITHIN GROUP |
| DuckDB | In-memory SQL | Parquet files | QUANTILE_CONT, HISTOGRAM |
| Dask | PyArrow compute | Parquet files | pyarrow.compute + numpy |
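To make "dialect-specific SQL" concrete, here is a minimal sketch of how one percentile expression differs across five of the backends. The SQL functions are the ones named in the table; the `PERCENTILE_SQL` mapping and the `percentile_expr` helper are hypothetical and are not taken from the PR's code.

```python
# Illustrative percentile templates per dialect; not the PR's actual SQL.
PERCENTILE_SQL = {
    "postgres": "PERCENTILE_CONT({q}) WITHIN GROUP (ORDER BY {col})",
    "snowflake": "APPROX_PERCENTILE({col}, {q})",
    "bigquery": "APPROX_QUANTILES({col}, 100)[OFFSET({pct})]",
    "spark": "PERCENTILE_APPROX({col}, {q})",
    "duckdb": "QUANTILE_CONT({col}, {q})",
}


def percentile_expr(dialect, col, q):
    """Render a percentile expression for a trusted column identifier."""
    template = PERCENTILE_SQL.get(dialect)
    if template is None:
        raise ValueError(f"no percentile template for dialect {dialect!r}")
    # BigQuery indexes into 100 quantile boundaries, so it needs an integer offset.
    return template.format(col=col, q=q, pct=round(q * 100))


print(percentile_expr("postgres", "conv_rate", 0.5))
print(percentile_expr("bigquery", "conv_rate", 0.95))
```

The column name is interpolated directly, so a sketch like this is only safe when identifiers come from the registry rather than from user input.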

Multi-Granularity Time-Series Metrics

  • 5 granularities: daily, weekly, biweekly, monthly, quarterly
  • Auto-compute mode: Detects latest event timestamp and computes all granularities in one job
  • Pre-computed metrics stored per date + granularity for fast retrieval
  • On-demand transient compute: Fresh statistics for arbitrary date ranges (not stored)
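One way to picture auto-compute mode is as deriving one date window per granularity from the latest event timestamp. The sketch below assumes trailing windows with fixed day counts (7, 14, 30, 90); the PR description does not say how windows are defined, so `WINDOW_DAYS` and `windows_ending_on` are purely illustrative.

```python
from datetime import date, timedelta

# Assumed trailing-window lengths in days; the PR does not state these.
WINDOW_DAYS = {
    "daily": 1,
    "weekly": 7,
    "biweekly": 14,
    "monthly": 30,
    "quarterly": 90,
}


def windows_ending_on(latest):
    """Return {granularity: (start, end)} with inclusive bounds ending at `latest`."""
    return {
        name: (latest - timedelta(days=days - 1), latest)
        for name, days in WINDOW_DAYS.items()
    }


for name, (start, end) in windows_ending_on(date(2026, 5, 11)).items():
    print(f"{name:10s} {start} .. {end}")
```

Calendar-aligned windows (for example, calendar months) would be an equally plausible design; the sketch only shows that a single detected timestamp is enough to fan out into all five granularities in one job.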

Batch + Log Data Source Support

  • Batch source: Reads from the feature view's batch_source via OfflineStore.pull_all_from_table_or_query()
  • Log source: Reads from feature serving logs via FeatureService.logging_config destination, using __log_timestamp as event timestamp
  • Feature name normalization: Prefixed log column names (driver_stats__conv_rate) are parsed back to their original feature_view_name + feature_name for storage compatibility and drift detection
  • data_source_type column (batch / log) differentiates metrics in storage
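The feature name normalization step can be illustrated with a short helper. This is a sketch that assumes the separator is a double underscore and that splitting on its first occurrence is sufficient; `split_log_column` is a hypothetical name, and the PR may instead resolve names against the registry.

```python
def split_log_column(column, sep="__"):
    """Split a prefixed log column into (feature_view_name, feature_name)."""
    view, found, feature = column.partition(sep)
    if not found or not view or not feature:
        # Covers unprefixed columns and reserved ones such as __log_timestamp.
        raise ValueError(f"not a prefixed feature column: {column!r}")
    return view, feature


print(split_log_column("driver_stats__conv_rate"))
```

Splitting on the first separator is ambiguous when a feature view name itself contains a double underscore, which is one reason a registry lookup could be the more robust choice.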

Orchestration Service (MonitoringService)

  • Ties registry, offline store, calculator, and storage together
  • Computes and aggregates metrics at feature, feature view, and feature service levels
  • Cached OfflineStore instance for performance
  • Unified compute/timestamp methods handling both batch and log paths with SQL push-down + fallback

Shared Utilities (monitoring_utils.py)

  • Centralized table name constants, column lists, PK definitions
  • monitoring_table_meta(), opt_float(), empty_numeric_metric(), empty_categorical_metric(), normalize_monitoring_row(), build_view_aggregate()
  • Used by all 8 backends — eliminates duplication and ensures consistency

DQM Job Engine (DQMJobManager)

  • Asynchronous job abstraction for metric computation (compute, baseline, auto_compute)
  • Job status tracking in feast_monitoring_jobs table
  • Supports future integration with Ray/Spark job runners

REST API (/monitoring/)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /monitoring/compute | Submit batch DQM job |
| POST | /monitoring/auto_compute | Auto-detect dates, all granularities |
| POST | /monitoring/compute/transient | On-demand compute (not stored) |
| POST | /monitoring/compute/log | Compute from serving logs |
| POST | /monitoring/auto_compute/log | Auto-detect log dates, all granularities |
| GET | /monitoring/jobs/{job_id} | DQM job status |
| GET | /monitoring/metrics/features | Per-feature metrics |
| GET | /monitoring/metrics/feature_views | Per-view aggregates |
| GET | /monitoring/metrics/feature_services | Per-service aggregates |
| GET | /monitoring/metrics/baseline | Baseline distribution retrieval |
| GET | /monitoring/metrics/timeseries | Time-series data for trend analysis |

All endpoints support cascading filters: project, feature_service_name, feature_view_name, feature_name, granularity, data_source_type, date range.
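As a rough illustration of the cascading filters, the helper below builds a query URL for the per-feature metrics endpoint. The host is a placeholder and `features_metrics_url` is a hypothetical helper; only the filter names listed above are used, and the date-range parameter names are left out because the description does not give them.

```python
from urllib.parse import urlencode


def features_metrics_url(base_url, **filters):
    """Build a GET URL for /monitoring/metrics/features, dropping unset filters."""
    params = {key: value for key, value in filters.items() if value is not None}
    return f"{base_url.rstrip('/')}/monitoring/metrics/features?{urlencode(params)}"


url = features_metrics_url(
    "https://feast.example.com/api/v1",  # placeholder host, not a real deployment
    project="my_project",
    feature_view_name="driver_hourly_stats",
    feature_name=None,  # omitted filters simply widen the result set
    granularity="weekly",
)
print(url)
```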

RBAC enforced using existing AuthzedAction.DESCRIBE (read) and AuthzedAction.UPDATE (compute).

CLI (feast monitor run)

Options:
  --feature-view TEXT     Feature view name (omit for all)
  --feature-name TEXT     Feature name(s), repeatable
  --start-date TEXT       Start date YYYY-MM-DD (omit for auto-detect)
  --end-date TEXT         End date YYYY-MM-DD (omit for auto-detect)
  --granularity TEXT      daily | weekly | biweekly | monthly | quarterly
  --set-baseline          Mark this computation as baseline
  --source-type TEXT      batch | log | all (default: batch)

Auto-Baseline on feast apply

  • Automatically queues baseline metric computation for new features on feast apply
  • Non-blocking (async DQM job), idempotent (skips existing baselines)
  • Configurable — can be disabled via feature_store.yaml:
feature_server:
  dqm:
    distribution:
      initial:
        enabled: false
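
A minimal sketch of how such a nested flag might be resolved once the YAML has been parsed into a dictionary. It assumes the key path shown above and assumes that auto-baseline is on when the section is absent (the description only says it "can be disabled"); `auto_baseline_enabled` is a hypothetical helper, not part of Feast.

```python
def auto_baseline_enabled(config):
    """Walk feature_server.dqm.distribution.initial.enabled; default to True."""
    node = config
    for key in ("feature_server", "dqm", "distribution", "initial"):
        node = node.get(key) or {}  # a missing or empty section falls through
    return bool(node.get("enabled", True))


disabled = {
    "feature_server": {"dqm": {"distribution": {"initial": {"enabled": False}}}}
}
print(auto_baseline_enabled(disabled))
print(auto_baseline_enabled({}))
```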

Feast Operator Support

  • New CRD types: DqmConfig, DqmDistributionConfig, DqmInitialDistributionConfig added to FeatureStoreSpec
  • Operator generates feature_server.dqm section in feature_store.yaml when DQM config is set
  • DeepCopy methods auto-generated via make generate
  • Disabling auto-baseline from operator CR:
apiVersion: feast.dev/v1
kind: FeatureStore
spec:
  feastProject: my_project
  dqm:
    distribution:
      initial:
        enabled: false

Documentation

  • How-to guide: docs/how-to-guides/feature-monitoring.md — Production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
  • Quickstart notebook: examples/monitoring/monitoring-quickstart.ipynb — 12-step hands-on walkthrough with visualization examples
  • docs/SUMMARY.md updated with links to both

Design decisions:

  • Native OfflineStore compute + storage — Each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
  • Hybrid fallback — Backends that don't implement native compute fall back to Python/PyArrow, ensuring all offline stores are supported.
  • Separate /monitoring/ route rather than extending existing /metrics/ — The existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.
  • DQM Job Engine for async computation — Supports future Ray/Spark integration for distributed metric computation.
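
The hybrid fallback decision can be sketched as a try-native-then-fallback dispatch. Only the method name `compute_monitoring_metrics` comes from the description; its real signature, the use of `NotImplementedError` as the signal, and the classes below are assumptions made for illustration.

```python
class OfflineStoreBase:
    """Stand-in for the OfflineStore base class (signature is assumed)."""

    @staticmethod
    def compute_monitoring_metrics(rows):
        raise NotImplementedError("backend has no native compute")


class SqlBackedStore(OfflineStoreBase):
    """Hypothetical backend that overrides the native compute path."""

    @staticmethod
    def compute_monitoring_metrics(rows):
        return {"engine": "native", "row_count": len(rows)}


class PlainStore(OfflineStoreBase):
    """Hypothetical backend that inherits the unimplemented default."""


def python_fallback(rows):
    # Placeholder for the Python/PyArrow path described above.
    return {"engine": "python", "row_count": len(rows)}


def compute(store, rows):
    """Prefer the backend's native compute; otherwise fall back to Python."""
    try:
        return store.compute_monitoring_metrics(rows)
    except NotImplementedError:
        return python_fallback(rows)


print(compute(SqlBackedStore, [1, 2, 3]))
print(compute(PlainStore, [1, 2, 3]))
```

Catching only `NotImplementedError` keeps genuine SQL failures visible instead of silently rerouting them to the slower path.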

Which issue(s) this PR fixes:

Partially Fixes #5919

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Operator unit tests (Ginkgo)

Test coverage (all passing):

| Test Suite | Count | Covers |
| --- | --- | --- |
| test_metrics_calculator.py | 19 | Numeric/categorical computation, edge cases (empty, all-null, single value, high cardinality), type classification, PyArrow type classification |
| test_monitoring_integration.py | 16+ | End-to-end batch/log computation, baseline flow, view/service aggregation, native storage dispatch, log feature name normalization, REST API endpoints, CLI, RBAC enforcement |
| repo_config_test.go | 92 | Operator repo config generation including DQM config with initial distribution disabled, YAML serialization verification |

Snyk SAST scan: 0 vulnerabilities across all new files.

@jyejare jyejare requested a review from a team as a code owner March 31, 2026 10:53
@jyejare jyejare marked this pull request as draft March 31, 2026 10:54
devin-ai-integration[bot]

This comment was marked as resolved.

@jyejare jyejare force-pushed the monitoring_plus branch 4 times, most recently from d0b45bb to c06853e Compare April 21, 2026 14:00
@jyejare jyejare marked this pull request as ready for review April 21, 2026 14:34
@jyejare jyejare requested review from a team and sudohainguyen as code owners April 21, 2026 14:34
@jyejare jyejare requested review from lokeshrangineni, robhowley and tokoko and removed request for a team April 21, 2026 14:34
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 new potential issues.

View 8 additional findings in Devin Review.

Comment thread sdk/python/feast/infra/offline_stores/bigquery.py Outdated
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
@jyejare jyejare changed the title from "feat: Add feature quality monitoring with statistical metrics, REST API, and CLI" to "feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" Apr 21, 2026
@jyejare jyejare changed the title from "feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" to "feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" Apr 21, 2026
@jyejare jyejare changed the title from "feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" to "feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI" Apr 21, 2026
@jyejare jyejare force-pushed the monitoring_plus branch from 3da4dde to 0344087 Compare May 5, 2026 08:52
Comment thread sdk/python/feast/api/registry/rest/monitoring.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_utils.py Outdated
Comment thread sdk/python/feast/monitoring/dqm_job_manager.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_service.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_store.py Outdated
Comment thread infra/feast-operator/api/v1/featurestore_types.go Outdated
(feast_feature_freshness_seconds)."""


class DqmInitialDistributionConfig(FeastConfigBaseModel):
Member

@ntkathole ntkathole May 6, 2026

I think these configs should also live at the top level instead of under feature_server. Monitoring uses the offline store, not the online server, so this field is closer to materialization, which is a top-level config.

#feature_store.yaml        

feature_monitoring:   
  auto_baseline: false 

This matches the pattern: materialization: spans offline+online stores, openlineage: spans apply+materialize - feature_monitoring: spans offline store (compute/storage) + apply trigger + server API.

Collaborator Author

This is probably because the existing metrics config lives under the server config. Metrics, including operational metrics, could also be computed for the offline store, so I think it would be better to move both the operational metrics and the DQM metrics under a parent monitoring config.

@jyejare jyejare force-pushed the monitoring_plus branch from 58795bc to a26ab3c Compare May 8, 2026 07:23
@jyejare jyejare force-pushed the monitoring_plus branch 7 times, most recently from c190315 to c45c914 Compare May 11, 2026 11:14
@Srihari1192
Contributor

@jyejare Please find a few observations from testing:

  1. The per-feature metrics query endpoint (GET /monitoring/metrics/features) rejects granularity=daily with HTTP 422, even though daily metrics are computed and stored by both POST /monitoring/auto_compute and feast monitor run -g daily. The data exists in storage but is unreachable via the query API because daily is absent from the endpoint's granularity enum validator, while the CLI explicitly lists it as a valid option (-g [daily|weekly|biweekly|monthly|quarterly]).

    curl -k -s \
      '..../api/v1/monitoring/metrics/features?project=my_project&feature_view_name=driver_hourly_stats&granularity=daily'
    {"status_code":422,"detail":"Out of range float values are not JSON compliant: nan","error_type":"ValueError"}

  2. The monitoring service has incomplete dtype support across both the compute and read paths. Map/Json/Struct features are silently skipped at compute time with WARNING: Unsupported dtype log lines and are never stored. String features are computed and stored correctly but cause HTTP 422 on GET /monitoring/metrics/features, due to a response schema mismatch between the categorical metrics format (top-N histogram) and the numeric-only response model. In both cases the job reports Status: completed, with total_features reflecting only the successfully computed features rather than the total registered features, so nothing indicates that any features were skipped or unreadable.

Error logs from the feast CLI:

    Auto-computing batch metrics for all granularities...
    WARNING:feast.monitoring.monitoring_service:Unsupported dtype 'Map' for feature 'driver_metadata', skipping
    WARNING:feast.monitoring.monitoring_service:Unsupported dtype 'Json' for feature 'driver_config', skipping
    WARNING:feast.monitoring.monitoring_service:Unsupported dtype 'Struct({name: String, age: String})' for feature 'driver_profile', skipping
    Status: completed
    Feature views computed: 1
    Features computed: 15
    Granularities: biweekly, daily, monthly, quarterly, weekly
    Duration: 293ms
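
For context on the 422 body quoted in this comment: its detail string is the message Python's `json` module raises when asked to serialize NaN under strict settings. Whether that is the root cause here, and how the PR addresses it, is not shown in this thread; the sketch below only illustrates one common mitigation, mapping non-finite floats to null before serializing. The `sanitize` helper is hypothetical.

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN and +/-inf with None so strict JSON encoding succeeds."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


metrics = {"mean": float("nan"), "p50": 1.5, "histogram": [float("inf"), 2]}

try:
    json.dumps(metrics, allow_nan=False)  # strict mode rejects NaN and inf
except ValueError as exc:
    print("strict encoding failed:", exc)

print(json.dumps(sanitize(metrics), allow_nan=False))
```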

jyejare and others added 7 commits May 20, 2026 20:32
Signed-off-by: Jitendra Yejare <[email protected]>
Signed-off-by: Jitendra Yejare <[email protected]>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <[email protected]>
@jyejare jyejare force-pushed the monitoring_plus branch 2 times, most recently from d18c1fa to a2ec95e Compare May 20, 2026 15:20
@jyejare
Collaborator Author

jyejare commented May 20, 2026

@Srihari1192 Both issues are fixed. As for the other observation, that Map/Json/Struct features are silently skipped: that is working as designed. These complex types don't have meaningful statistical metrics (mean, stddev, etc.). The WARNING log is the correct behavior, and computed_features correctly reflects how many features were actually computed. No change needed there.

@jyejare jyejare force-pushed the monitoring_plus branch from a2ec95e to 79a0f96 Compare May 20, 2026 16:40

Development

Successfully merging this pull request may close these issues.

Revamp Data Quality Monitoring

3 participants