Materializes all rows as Python proto objects — high memory usage at scale #6160

@cutoutsy

Is your feature request related to a problem? Please describe.
When materializing features to an online store via LocalOutputNode, the current implementation converts the entire Arrow Table into a Python list of ValueProto objects before any writing occurs. At hundreds of thousands of rows, this causes severe memory pressure and can trigger out-of-memory (OOM) failures in practice.

Root Cause

The call chain in LocalOutputNode.execute():

rows_to_write = _convert_arrow_to_proto(
    input_table, self.feature_view, join_key_to_value_type
)
online_store.online_write_batch(..., data=rows_to_write, ...)

_convert_arrow_to_proto (utils.py:325) performs three full-data copies sequentially:

  1. Arrow → NumPy (to_numpy(zero_copy_only=False)): necessary to bridge Arrow nulls to the Python type system
  2. NumPy → List[ValueProto]: each scalar becomes an independent Python protobuf heap object (~200 bytes overhead per value vs 4–8 bytes raw)
  3. Column-wise → row-wise (list(zip(...))): the full result is materialized as a Python list
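
To make the per-value overhead in step 2 concrete, here is a rough, stdlib-only sketch. `FakeValue` is a hypothetical stand-in for a per-scalar wrapper object and is not Feast's ValueProto, so the absolute numbers will differ from real protobuf objects. It only illustrates why one heap object per scalar costs far more than a packed 8-byte value.

```python
import sys
from array import array
from typing import Tuple


class FakeValue:
    """Hypothetical per-scalar wrapper (a stand-in, not Feast's ValueProto)."""

    __slots__ = ("double_val",)

    def __init__(self, double_val: float) -> None:
        self.double_val = double_val


def per_value_bytes(n: int = 10_000) -> Tuple[int, int]:
    """Return (packed bytes per value, approximate wrapped bytes per value)."""
    # Contiguous buffer of C doubles, comparable to a raw numeric column.
    packed = array("d", (float(i) for i in range(n)))
    # One independent heap object per scalar, as in the NumPy -> list step.
    wrapped = [FakeValue(x) for x in packed]

    # Count the list's pointer storage, each wrapper, and the float it holds.
    wrapped_total = sys.getsizeof(wrapped) + sum(
        sys.getsizeof(v) + sys.getsizeof(v.double_val) for v in wrapped
    )
    return packed.itemsize, wrapped_total // n


packed_b, wrapped_b = per_value_bytes()
print(f"packed: {packed_b} B/value, wrapped: ~{wrapped_b} B/value")
```

The exact wrapped figure depends on the Python build, so no expected output is given here.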

Describe the solution you'd like

Chunk iteration in LocalOutputNode (minimal, low-risk):

BATCH_SIZE = 10_000
for batch in input_table.to_batches(max_chunksize=BATCH_SIZE):
    rows_to_write = _convert_arrow_to_proto(
        batch, self.feature_view, join_key_to_value_type
    )
    online_store.online_write_batch(
        config=context.repo_config,
        table=self.feature_view,
        data=rows_to_write,
        progress=lambda x: None,
    )
    # rows_to_write eligible for GC after each iteration
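
The effect of the chunked loop can be sketched without Feast or Arrow. In the snippet below, `iter_row_batches` and `write_in_batches` are hypothetical helpers, the dict of lists stands in for an Arrow table, and the `write_batch` callback stands in for `online_write_batch`. Only one batch of row tuples exists at a time, so peak row-wise materialization is bounded by the batch size rather than the table size.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Row = Tuple[object, ...]


def iter_row_batches(
    columns: Dict[str, Sequence[object]], batch_size: int
) -> Iterator[List[Row]]:
    """Yield row-wise batches from column-wise data, one slice at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    for start in range(0, num_rows, batch_size):
        stop = start + batch_size
        # Only this slice is transposed into rows; earlier batches can be freed.
        yield list(zip(*(columns[name][start:stop] for name in names)))


def write_in_batches(
    columns: Dict[str, Sequence[object]],
    batch_size: int,
    write_batch: Callable[[List[Row]], None],
) -> int:
    """Convert and write one batch at a time; return the number of rows written."""
    written = 0
    for rows in iter_row_batches(columns, batch_size):
        write_batch(rows)  # stand-in for the online store write
        written += len(rows)
    return written


# Toy data: 25 rows, written in batches of at most 10.
columns = {"driver_id": list(range(25)), "conv_rate": [i / 100 for i in range(25)]}
batch_sizes: List[int] = []
total = write_in_batches(columns, 10, lambda rows: batch_sizes.append(len(rows)))
print(total, batch_sizes)
```

With 25 rows and a batch size of 10, the writer is called three times with 10, 10, and 5 rows. The real change would still depend on `_convert_arrow_to_proto` accepting a record batch, as the proposal above assumes.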

