
performance problem: various table operations #575

@WinPlay02


Describe the bug

Some Table operations take multiple hours to complete on large tables, although they should be much faster. When the same operations are done with the pandas loc functionality, they complete in a few seconds.
These operations include:

  • filter_rows
  • group_rows_by
  • add_rows(Table)
  • sort_rows (probably, untested)
  • transform_column (probably, untested)

The common problem with these operations is that their first step boxes every row into a pandas Series and wraps it in a Row object, both of which add a huge overhead.
Most of the time, the individual Row objects are needed only so that they can be passed to the lambda function provided to these operations.
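
To illustrate the overhead, here is a minimal sketch in plain pandas. It does not use the library's actual implementation: the DataFrame, its column names, and the two helper functions are made up for this example. It contrasts a filter that boxes every row into a Series with one that evaluates the condition on the whole column through loc.

```python
import pandas as pd

# Hypothetical stand-in data; the column names are invented for this sketch.
df = pd.DataFrame({"id": range(1000), "value": [i % 7 for i in range(1000)]})


def filter_rowwise(frame, predicate):
    """Slow pattern: iterrows() boxes every row into a pandas Series."""
    keep = [bool(predicate(row)) for _, row in frame.iterrows()]
    return frame.loc[keep]


def filter_vectorized(frame):
    """Fast pattern: one boolean mask computed on the whole column."""
    return frame.loc[frame["value"] > 3]


slow = filter_rowwise(df, lambda row: row["value"] > 3)
fast = filter_vectorized(df)
```

Both calls select the same rows. The row-wise version creates one Series per row, so its cost grows with the number of rows times the per-object overhead, which is presumably what makes the listed operations so slow on large tables.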

Possible resolutions:

  • Don't guarantee that a Row is passed to the lambda functions (for filter_rows, group_rows_by, sort_rows, transform_column). Instead, pass only a general type that can be indexed with strings (column names), like Mapping[str, Any], to access the values. Maybe also pass the schema, since the other operations needed on a Row object can be handled with a schema.
  • add_rows could handle adding a whole table as a special case, so that the table is not split into Rows first. This avoids the problem without changing the interface.
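
A rough sketch of both ideas, under loud assumptions: the table is modelled here as a plain dict of column lists instead of the real Table class, and RowView, filter_rows and add_rows below are hypothetical stand-ins, not the library's API. The point is only that a Mapping[str, Any] view can serve the lambda without copying a row, and that a whole table can be appended column by column.

```python
from collections.abc import Iterator, Mapping
from typing import Any, Callable


class RowView(Mapping[str, Any]):
    """Hypothetical read-only view of one row; it copies no data."""

    def __init__(self, columns: dict[str, list[Any]], index: int) -> None:
        self._columns = columns
        self._index = index

    def __getitem__(self, column_name: str) -> Any:
        return self._columns[column_name][self._index]

    def __iter__(self) -> Iterator[str]:
        return iter(self._columns)

    def __len__(self) -> int:
        return len(self._columns)


def filter_rows(
    columns: dict[str, list[Any]],
    predicate: Callable[[Mapping[str, Any]], bool],
) -> dict[str, list[Any]]:
    """Call the predicate with a lightweight view instead of a boxed Row."""
    n_rows = len(next(iter(columns.values()), []))
    keep = [i for i in range(n_rows) if predicate(RowView(columns, i))]
    return {name: [values[i] for i in keep] for name, values in columns.items()}


def add_rows(
    columns: dict[str, list[Any]],
    other: dict[str, list[Any]],
) -> dict[str, list[Any]]:
    """Special case: append a whole table without splitting it into rows."""
    if columns.keys() != other.keys():
        raise ValueError("Both tables need the same column names")
    return {name: values + other[name] for name, values in columns.items()}


table = {"name": ["a", "b", "c"], "age": [31, 17, 45]}
adults = filter_rows(table, lambda row: row["age"] >= 18)
combined = add_rows(table, {"name": ["d"], "age": [22]})
```

Existing lambdas that only index the row by column name would keep working with such a view. Lambdas that call other Row methods would need the schema passed alongside, as suggested above.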

To Reproduce

  1. Load a huge dataset with Table.from_csv_file
  2. Run any of the previously mentioned operations
  3. Observe that there is no visible progress and that the pipeline is still running after a few hours

Expected behavior

Faster execution of the listed operations


Status

✔️ Done
