Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_batches can return wrong batch_size with filter #3220

Closed
fecet opened this issue Dec 9, 2024 · 5 comments
Closed

to_batches can return wrong batch_size with filter #3220

fecet opened this issue Dec 9, 2024 · 5 comments
Labels
good first issue Good for newcomers python rust Rust related tasks

Comments

@fecet
Copy link
Contributor

fecet commented Dec 9, 2024

import pyarrow as pa
import lance
table = pa.Table.from_pylist([{"s": "one"}, {"s": "two"}] * 100)
ds_test = lance.write_dataset(table, "memory://test")

next(ds_test.to_batches(batch_size=4, with_row_id=True, filter="_rowid > 0"))

Expecting with 4 rows, but get 7 rows, the following batch will have correct size.

I'm building lance from source with 0.21.beta1

@westonpace
Copy link
Contributor

Yes, this is by design at the moment. Recombining the batches can be expensive and so it should be an opt-in feature. It should probably have a few choices:

  • Best performance - current behavior
  • Avoid small batches - if batches are below a certain threshold then combine, otherwise leave as is
  • Force exact size batches - always combine if batch size does not equal target

Datafusion already has a CoalesceBatchesExec which does this so the work required is:

  • Add new option(s) to the scanner (rust & python)
  • Modify create_plan to add the CoalesceBatchesExec based on the new option(s).

@westonpace westonpace added good first issue Good for newcomers rust Rust related tasks python labels Dec 9, 2024
@fecet
Copy link
Contributor Author

fecet commented Dec 10, 2024

Yes, this is by design at the moment. Recombining the batches can be expensive and so it should be an opt-in feature. It should probably have a few choices:是的,目前这是出于设计考虑。重新组合批次可能会很昂贵,因此这应该是一个自愿选择的功能。它可能应该有几个选项:

  • Best performance - current behavior最佳性能 - 当前行为
  • Avoid small batches - if batches are below a certain threshold then combine, otherwise leave as is避免小批量 - 如果批量低于某个阈值,则合并,否则保持不变
  • Force exact size batches - always combine if batch size does not equal target强制精确大小批次 - 如果批次大小不等于目标,则始终合并

Datafusion already has a CoalesceBatchesExec which does this so the work required is:Datafusion 已经有一个 CoalesceBatchesExec 可以做到这一点,因此所需的工作是:

  • Add new option(s) to the scanner (rust & python)为扫描仪添加新选项(rust 和 python)
  • Modify create_plan to add the CoalesceBatchesExec based on the new option(s).修改 create_plan 以根据新选项添加 CoalesceBatchesExec

However the batch_size suggest it's

        batch_size: int, default None
            The max size of batches returned.

maybe this is a doc problem :)

@westonpace
Copy link
Contributor

Oh. I read your original comment backwards. I thought your concern was the batch size was too small :)

I agree this seems wrong. I will investigate.

@westonpace
Copy link
Contributor

So this is arising from our use of https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html

We use this when we are doing late materialization (which only happens when filtering) to ensure that take is not run too often. We do set the target_batch_size property of that relation correctly (4 in your example) but the relation treats that as a "minimum batch size" and output batches can have up to 2 * target_batch_size rows. We can fiddle with things but I don't see a great quick fix. I suspect there are other places that DF could be messing with our batch sizes as well.

Longer term we should probably just have an exact_batch_size property that we apply with a custom relation at the very end of the scan plan with a note that it may impact performance.

@westonpace westonpace removed their assignment Dec 14, 2024
@westonpace
Copy link
Contributor

I've proposed some updated wording in #3246

@fecet fecet closed this as completed Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers python rust Rust related tasks
Projects
None yet
Development

No branches or pull requests

2 participants