feat: DuckDB historical retrieval without entity dataframe#6108
feat: DuckDB historical retrieval without entity dataframe#6108Vperiodt wants to merge 12 commits into
Conversation
| full_feature_names: bool = False, | ||
| **kwargs, | ||
| ) -> RetrievalJob: | ||
| start_date: Optional[datetime] = kwargs.get("start_date", None) |
There was a problem hiding this comment.
instead keep it as optional params in get_historical_features
| DEFAULT_ENTITY_DF_EVENT_TIMESTAMP_COL = "event_timestamp" | ||
|
|
||
|
|
||
| def _build_entity_df_from_sources( |
| ) | ||
|
|
||
|
|
||
| DEFAULT_ENTITY_DF_EVENT_TIMESTAMP_COL = "event_timestamp" |
There was a problem hiding this comment.
import this from from feast.infra.offline_stores.offline_utils import DEFAULT_ENTITY_DF_EVENT_TIMESTAMP_COL
|
@Vperiodt There is integration test tests/integration/offline_store/test_non_entity_mode.py, see if duckdb coverage can be included there |
| start_date = end_date - timedelta(seconds=max_ttl_seconds) | ||
| else: | ||
| start_date = end_date - timedelta(days=30) | ||
| start_date = make_tzaware(start_date) |
There was a problem hiding this comment.
Line 229 - Line 244 is common across the offline store to figure out the start_date & end_date. If its not too much, is it possible to create utility function which can be re-used?
There was a problem hiding this comment.
Yes, I’ll refactor this into a reusable utility function and apply it to similar date range calculations across the codebase in a follow-up PR
|
@Vperiodt The lint checks are failing please check and rebase the PR. |
jyejare
left a comment
There was a problem hiding this comment.
Review Summary
The approach is clean — making entity_df optional and building a synthetic entity DataFrame from the feature view sources is a good way to enable date-range retrieval for DuckDB. Test coverage is solid (553 lines of unit tests + integration test updates).
However, CI is currently failing (lint-python and unit-test-python 3.10), so this is not merge-ready yet.
Blockers
- CI failures —
lint-pythonandunit-test-python (3.10, ubuntu-latest)are red on the latest run (Apr 13). Please investigate and fix. - Unconditional ibis import in
repo_configuration.py— breaks all tests whenibisis not installed. See inline comment for the fix.
Minor Items
- Missing docstring on
_build_entity_df_from_sources(flagged by @ntkathole, still pending). - Duplicate date-range logic — the start_date/end_date defaulting is copy-pasted across offline stores. Please extract to a shared utility or reuse
compute_non_entity_date_range.
Will ACK once CI is green and the blockers are addressed.
| ) | ||
|
|
||
|
|
||
| def _build_entity_df_from_sources( |
There was a problem hiding this comment.
Missing docstring for this function. Given the non-trivial logic around join_key_map handling, field_mapping, and timestamp filtering, a brief docstring explaining the purpose ("Build a synthetic entity DataFrame from feature view sources for non-entity retrieval mode") and the return value would help future maintainers.
| make_tzaware(end_date) if end_date else datetime.now(timezone.utc) | ||
| ) | ||
|
|
||
| if start_date is None: | ||
| max_ttl_seconds = 0 | ||
| for fv in feature_views: | ||
| if fv.ttl and isinstance(fv.ttl, timedelta): | ||
| max_ttl_seconds = max( | ||
| max_ttl_seconds, int(fv.ttl.total_seconds()) | ||
| ) | ||
| if max_ttl_seconds > 0: | ||
| start_date = end_date - timedelta(seconds=max_ttl_seconds) | ||
| else: | ||
| start_date = end_date - timedelta(days=30) | ||
| start_date = make_tzaware(start_date) |
There was a problem hiding this comment.
This start_date/end_date derivation logic (defaulting end_date to now, computing start_date from max TTL, falling back to 30 days) is duplicated across the DuckDB, Postgres, and File offline stores. The Postgres PR (#6057) has the same pattern inlined.
Please extract this into a shared utility (e.g. in offline_utils.py or reuse the existing compute_non_entity_date_range from feast.utils). This would keep the stores consistent and avoid divergence over time.
| "duckdb": ( | ||
| "local", | ||
| importlib.import_module( | ||
| "tests.universal.feature_repos.duckdb_repo_configuration" | ||
| ).DuckDBDataSourceCreator, | ||
| ), |
There was a problem hiding this comment.
This importlib.import_module call runs at module load time, which transitively imports ibis (via duckdb_repo_configuration → feast.infra.offline_stores.duckdb). If ibis is not installed, this breaks all tests that import from repo_configuration.py — not just DuckDB-specific tests.
Other non-core offline stores (BigQuery, Redshift, Snowflake at lines 116-137) are added conditionally behind environment variable checks. Please wrap this in a try/except ImportError:
try:
from tests.universal.feature_repos.duckdb_repo_configuration import DuckDBDataSourceCreator
OFFLINE_STORE_TO_PROVIDER_CONFIG["duckdb"] = ("local", DuckDBDataSourceCreator)
except ImportError:
pass713e92e to
7eaff77
Compare
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
f35adb9 to
da63740
Compare
Signed-off-by: Vanshika Vanshika <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
What this PR does / why we need it:
Adds date-range historical retrieval for the DuckDB offline store when entity_df is omitted.
Which issue(s) this PR fixes:
fixes #5832 related to #1611
Misc