Closed
Labels: performance 🏃 Speed things up
Description
Describe the bug
Some Table operations take multiple hours to complete on large tables, although they should be much faster. When the same operations are performed directly with pandas `loc` functionality, they complete in a few seconds.
These operations include:
- `filter_rows`
- `group_rows_by`
- `add_rows(Table)`
- `sort_rows` (probably, untested)
- `transform_column` (probably, untested)
These operations share a common problem: their first step boxes each row into a pandas Series and wraps it in a `Row` object, both of which cause huge overhead. Most of the time, the individual `Row` objects are only needed so they can be passed to the lambda function provided to these operations.
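To illustrate the overhead, here is a minimal sketch (not the library's actual code; the DataFrame and column name `a` are made up) comparing the row-wise path, which boxes every row into a pandas Series, with vectorized boolean indexing via `.loc`:

```python
import numpy as np
import pandas as pd

# Hypothetical large DataFrame standing in for a Table's backing store.
df = pd.DataFrame({"a": np.arange(100_000)})

# Row-wise approach (roughly what the slow path does): `apply` with axis=1
# constructs a pandas Series for every single row before the predicate runs.
slow = df[df.apply(lambda row: row["a"] % 2 == 0, axis=1)]

# Vectorized approach: a boolean mask computed on the whole column at once,
# then boolean indexing with .loc. No per-row objects are created.
fast = df.loc[df["a"] % 2 == 0]

assert slow.equals(fast)
```

On large inputs the vectorized variant is typically orders of magnitude faster, which matches the observed behavior above.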
Possible resolutions:
- Don't guarantee that a `Row` is passed to the lambda functions (for `filter_rows`, `group_rows_by`, `sort_rows`, `transform_column`). Instead, only pass a general type that is indexable with strings (column names), like `Mapping[str, Any]`, to access the values. Maybe also pass the schema, since the other operations needed on a `Row` object can be handled with a schema.
- `add_rows` could handle adding a whole table as a special case, so the rows are not split into `Row`s first, avoiding the problem without changing the interface.
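The `Mapping[str, Any]` idea could be sketched like this (a hypothetical `filter_rows_fast` helper, not the library's API): each row is exposed to the lambda as a plain dict, which satisfies `Mapping[str, Any]` and avoids boxing into a pandas Series or `Row`:

```python
from typing import Any, Callable, Mapping

import pandas as pd

def filter_rows_fast(
    df: pd.DataFrame,
    predicate: Callable[[Mapping[str, Any]], bool],
) -> pd.DataFrame:
    # itertuples with name=None yields plain tuples, which are cheap to
    # produce; zipping with the column names gives a plain dict per row.
    mask = [
        predicate(dict(zip(df.columns, values)))
        for values in df.itertuples(index=False, name=None)
    ]
    return df.loc[mask]

df = pd.DataFrame({"a": [1, 2, 3, 4]})
result = filter_rows_fast(df, lambda row: row["a"] > 2)
```

The lambda still accesses values by column name, so existing user code that only indexes into the row would keep working unchanged.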
To Reproduce
- Load a huge dataset with `Table.from_csv_file`
- Run any of the previously mentioned operations
- Observe that nothing happens and after a few hours the pipeline still runs
Expected behavior
Faster execution of the listed operations