fix: Harden informer cache with label selectors and memory optimizations #6242
    return map[string]string{
        services.NameLabelKey: authz.Handler.FeatureStore.Name,
        services.ServiceTypeLabelKey: string(services.AuthzFeastType),
        services.ManagedByLabelKey: services.ManagedByLabelValue,
🟡 removeOrphanedRoles silently skips pre-upgrade custom auth Roles due to stricter label selector
The authz.getLabels() function now includes ManagedByLabelKey (authz.go:334), and removeOrphanedRoles uses this label set as a list selector (authz.go:85). Pre-upgrade custom auth Roles only have {NameLabelKey, ServiceTypeLabelKey} without ManagedByLabelKey, so the API server's label selector will never match them. These orphaned Roles will never be cleaned up by removeOrphanedRoles.
The main feast Role and RoleBinding are still cleaned up correctly via DeleteOwnedFeastObj (which looks up by name, not labels). Only custom auth roles from KubernetesAuthz.Roles are affected. The practical impact is limited: orphaned Roles have empty rules (no security impact) and have owner references for eventual GC on FeatureStore CR deletion. The window is narrow — it requires changing the Roles list concurrently with or very shortly after the operator upgrade, before the first reconciliation adds the label to existing Roles.
Prompt for agents
In authz.go, the removeOrphanedRoles function at lines 81-101 lists Roles using authz.getLabels() as the label selector. Since getLabels() now includes ManagedByLabelKey, pre-upgrade Roles without this label are invisible to this cleanup function.
To fix: either (a) use a separate label set for removeOrphanedRoles that omits ManagedByLabelKey (matching by NameLabelKey and ServiceTypeLabelKey only), or (b) run a one-time migration during reconciliation that adds ManagedByLabelKey to all existing authz Roles before removeOrphanedRoles is called.
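The mismatch described above, and option (a), can be sketched with plain maps. This is only an illustration of equality-based label matching, not the operator's code: the `matches`, `fullLabels`, and `cleanupSelector` helpers are hypothetical, and the name and service-type label keys and values are assumed. Only the managed-by key and value come from the PR description.

```go
package main

import "fmt"

// Label keys and values. Only the managed-by pair is taken from the PR;
// the other two keys and the "authorization" value are assumptions.
const (
	nameLabelKey        = "feast.dev/name"         // assumed
	serviceTypeLabelKey = "feast.dev/service-type" // assumed
	managedByLabelKey   = "app.kubernetes.io/managed-by"
	managedByLabelValue = "feast-operator"
)

// matches reports whether every key/value pair in selector is present in
// labels, which is how an equality-based label selector behaves.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// fullLabels mirrors the post-upgrade label set, including managed-by.
func fullLabels(name string) map[string]string {
	return map[string]string{
		nameLabelKey:        name,
		serviceTypeLabelKey: "authorization",
		managedByLabelKey:   managedByLabelValue,
	}
}

// cleanupSelector sketches option (a): the same set without managed-by,
// so Roles labeled before the upgrade still match.
func cleanupSelector(name string) map[string]string {
	sel := fullLabels(name)
	delete(sel, managedByLabelKey)
	return sel
}

func main() {
	// A Role created before the upgrade carries only the two older labels.
	preUpgradeRole := map[string]string{
		nameLabelKey:        "sample",
		serviceTypeLabelKey: "authorization",
	}
	fmt.Println("strict selector matches:", matches(fullLabels("sample"), preUpgradeRole))
	fmt.Println("relaxed selector matches:", matches(cleanupSelector("sample"), preUpgradeRole))
}
```

One caveat on option (a): if the list call is served from an informer cache that is itself restricted to objects carrying the managed-by label, a relaxed selector alone would likely not surface the unlabeled Roles. Whether that applies depends on how the client used by removeOrphanedRoles is configured.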
Signed-off-by: Jitendra Yejare <[email protected]>
Signed-off-by: Jitendra Yejare <[email protected]>
…ons (feast-dev#6242)
* fix: Harden informer cache with label selectors and memory optimizations
Signed-off-by: Jitendra Yejare <[email protected]>
* Additional Fixes on caching with PVC and HPA
Signed-off-by: Jitendra Yejare <[email protected]>
---------
Signed-off-by: Jitendra Yejare <[email protected]>
Signed-off-by: Alex Korbonits <[email protected]>
# [0.63.0](v0.62.0...v0.63.0) (2026-05-04)

### Bug Fixes

* Add project filter to apply_data_source and delete_data_source (closes [#6206](#6206)) ([#6322](#6322)) ([96562c4](96562c4))
* Add project_id filter to SnowflakeRegistry UPDATE path ([#6243](#6243)) ([6658b71](6658b71)), closes [#6208](#6208) [#6208](#6208)
* Add subprocess timeouts to prevent test_e2e_local hanging on Dask atexit handler ([3de6556](3de6556))
* Ambiguous truth value of array during materialization ([#6259](#6259)) ([d0c8984](d0c8984))
* Auto-detect GCS/S3 registry store when registry is passed as string ([#6260](#6260)) ([7ebcf03](7ebcf03))
* **bigquery:** Prefer query over table in get_table_query_string ([#6360](#6360)) ([77ed779](77ed779)), closes [#6200](#6200)
* correct project_id scoping in get_user_metadata and delete_project ([0c469a7](0c469a7))
* disable Redis RDB persistence in test deployments ([44cd682](44cd682))
* Disable snowflake tests temporarily in CI ([#6356](#6356)) ([31d5a98](31d5a98))
* Filter empty SQL commands at execute_snowflake_statement call sites ([#6249](#6249)) ([92ffbb9](92ffbb9))
* Fix five bugs in milvus online store ([#6275](#6275)) ([212504b](212504b))
* Fix issue with apply feature view ([835cda8](835cda8))
* Fix streaming materialization for exotic sources with lazy UDF pipelines ([c07972d](c07972d))
* Handle missing features gracefully instead of panicking ([7d00b3a](7d00b3a))
* Harden informer cache with label selectors and memory optimizations ([#6242](#6242)) ([3f11356](3f11356))
* **helm:** Avoid nil pointer for metrics.enabled inside podAnnotations ([#6251](#6251)) ([c833f1a](c833f1a))
* Include git in feast server image ([fb03c46](fb03c46))
* Include StreamFeatureView in freshness metric ([#6269](#6269)) ([463f16c](463f16c))
* Pre-create S3A event log dir before SparkContext init ([#6317](#6317)) ([9feca77](9feca77))
* Remote Online Store Type Inference Error with All-NULL Columns ([#6063](#6063)) ([de67bdd](de67bdd))
* Remove selector with kustomize overlay using a JSON 6902 patch ([9107a43](9107a43))
* Resolve multiple bugs in SnowflakeRegistry and Snowflake connection handling ([#6315](#6315)) ([7e66a2e](7e66a2e))
* **spark:** BatchFeatureView with TransformationMode.PYTHON now reads all source columns ([a310eaf](a310eaf))
* **spark:** Use SELECT * when feature_name_columns is empty in pull_all_from_table_or_query ([e1b1d2d](e1b1d2d))
* Support pandas mode in feature builder and fix dask column extraction ([863315e](863315e))
* support SQL string as entity_df in RemoteOfflineStore.get_historical_features ([c559889](c559889))
* Wrap LocalOutputNode return value in ArrowTableValue for consist… ([#6286](#6286)) ([a16cd55](a16cd55))

### Features

* Add agent skills and Cursor/Claude rules for Feast development ([312eea3](312eea3))
* Add feature view versioning support to FAISS online store ([b36acb7](b36acb7))
* Add feature view versioning support to Redis and DynamoDB online stores ([#6257](#6257)) ([edf25af](edf25af)), closes [#6164](#6164) [#6163](#6163)
* Add optional 'org' in feature view ([#6288](#6288)) ([#6301](#6301)) ([608b105](608b105))
* Add RaySource, to_ray_dataset first-class method, docs, and tests ([1c98157](1c98157))
* Add TLS support for Go Feature Server ([#6229](#6229)) ([28a58d0](28a58d0))
* Add Vector Search support to MongoDBOnlineStore ([#6344](#6344)) ([c102738](c102738))
* Add versioning support to Milvus online store ([#6330](#6330)) ([3268ced](3268ced))
* Addresses performance issues in the Redis online store ([2e50da0](2e50da0))
* Allow to set gpu for ray ([5580ab4](5580ab4))
* Bump redis-py version cap from <5 to <8 ([#6339](#6339)) ([9538180](9538180))
* Expose feature_server, materialization, and openlineage configuration via FeatureStore CRD ([ec6ecfd](ec6ecfd))
* Make online_write_batch_size configurable in MaterializationConfig ([#6268](#6268)) ([d41becf](d41becf))
* Make udf optional if agg defined ([#5689](#5689)) ([#6328](#6328)) ([f630056](f630056))
* MongoDB offline store ([#6138](#6138)) ([8eebad7](8eebad7))
* Optional input_schema for ODFV ([#6308](#6308)) ([#6312](#6312)) ([f08b4e8](f08b4e8))
* Provision minimal TokenReview RBAC for OIDC auth and add SSL error logging in token parser ([#6240](#6240)) ([dca57e8](dca57e8))
* **spark:** Add compute-on-read support for BatchFeatureView in get_… ([#6357](#6357)) ([630d9f8](630d9f8))
Summary
The feast-operator's `Owns()` calls create cluster-wide informers for ConfigMaps, Deployments, Services, and other resource types. On clusters with a large number of these objects, the informer cache can grow beyond the operator's 256Mi memory limit, causing OOMKill and restarts.
Changes
`ByObject` label selectors for all owned resource types
Restrict informer caches to only objects with `app.kubernetes.io/managed-by: feast-operator`. Covers all 10 owned types: ConfigMap, Deployment, Service, ServiceAccount, PVC, RoleBinding, Role, CronJob, HPA, PDB. Extracted into `newCacheOptions()` for clarity.
`DefaultTransform: cache.TransformStripManagedFields()`
Strip `managedFields` from all cached objects, reducing per-object memory footprint by ~30-50%.
`GOMEMLIMIT=230MiB`
Set Go runtime soft memory limit (90% of 256Mi container limit). Triggers GC pressure before hard OOMKill as defense-in-depth.
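The 230MiB figure follows from taking 90% of the 256Mi container limit and rounding down to whole MiB. A minimal sketch of that arithmetic is below. The `softLimitBytes` helper is hypothetical, and the PR sets the limit through the `GOMEMLIMIT` environment variable, so the `debug.SetMemoryLimit` call is shown only as the programmatic counterpart from the standard library.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

const mib = 1 << 20 // bytes per MiB

// softLimitBytes returns the given fraction of a container memory limit,
// truncated to whole bytes. Hypothetical helper for illustration.
func softLimitBytes(containerLimitBytes int64, fraction float64) int64 {
	return int64(float64(containerLimitBytes) * fraction)
}

func main() {
	containerLimit := int64(256 * mib) // 256Mi container limit from the PR
	limit := softLimitBytes(containerLimit, 0.9)
	fmt.Printf("soft limit: %d MiB\n", limit/mib) // prints "soft limit: 230 MiB"

	// Programmatic counterpart of GOMEMLIMIT (Go 1.19+). The limit is soft:
	// the runtime collects more aggressively as the heap approaches it, but
	// the process can still exceed it.
	previous := debug.SetMemoryLimit(230 * mib)
	fmt.Println("previous limit (bytes):", previous)
}
```

Because the limit is soft, it complements the container limit rather than replacing it, which matches the defense-in-depth framing above.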
Additional changes
* Add `app.kubernetes.io/managed-by: feast-operator` label to `getLabels()` so all FeatureStore-managed resources carry it
* `getSelectorLabels()` for immutable selectors (Deployment `spec.selector`, Service `spec.selector`, TopologySpreadConstraints, PodAffinity) to avoid breaking existing resources on upgrade
* `app.kubernetes.io/managed-by` (`services.ManagedByLabelKey`/`Value`) throughout
Test Results
Verified on cluster with a large number of ConfigMaps pre-loaded:
Test plan
* `make test`: all pass
* `getSelectorLabels()` prevents immutable selector breakage on upgrade
Summary by CodeRabbit