@codeflash-ai codeflash-ai bot commented Dec 23, 2025

📄 23,403% (234.03x) speedup for find_last_node in src/algorithms/graph.py

⏱️ Runtime: 81.2 milliseconds → 345 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code is significantly faster because it removes a nested loop whose worst-case cost is quadratic: the edges are preprocessed into a set once, and each node is then checked against that set.

Key optimization:

  • Original approach: For each node, the code calls `all(e["source"] != n["id"] for e in edges)`, which scans the edges until it finds one whose source is that node, and scans ALL of them when there is none. The worst case is O(n × m), where n = number of nodes and m = number of edges.
  • Optimized approach: First builds a set of all source IDs (`source_ids = {e["source"] for e in edges}`), then checks membership with `n["id"] not in source_ids`. Set membership checking is O(1) on average, reducing the overall complexity to O(n + m).
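
The two approaches can be sketched side by side. This is a minimal reconstruction, not the code from `src/algorithms/graph.py`: the function names `find_last_node_original` and `find_last_node_optimized` are hypothetical, and the `next(..., None)` wrapper is an assumption inferred from the tests, which expect the first matching node or `None`.

```python
from typing import Optional


def find_last_node_original(nodes: list, edges: list) -> Optional[dict]:
    # Assumed original shape: for each node, scan the edges until one names it
    # as a source. Worst case O(n * m).
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def find_last_node_optimized(nodes: list, edges: list) -> Optional[dict]:
    # Build the set of source IDs once (O(m)), then test each node with an
    # average O(1) membership check, for O(n + m) overall.
    source_ids = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in source_ids), None)


nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
print(find_last_node_original(nodes, edges))   # {'id': 3}
print(find_last_node_optimized(nodes, edges))  # {'id': 3}
```

Both versions stop at the first node without an outgoing edge, so they return the same node whenever several candidates exist.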

Why this matters:
The speedup is most dramatic on large graphs with many edges:

  • test_large_linear_chain: 18ms → 54μs (333x faster) - 1000 nodes in a chain
  • test_large_graph_no_terminal_nodes: 18ms → 54μs (332x faster) - 1000 nodes in a cycle
  • test_large_graph_some_disconnected: 4.5ms → 27.5μs (162x faster) - 500 node chain

Even small graphs show speedups of roughly 25-86%, with minimal overhead. The only case showing a slight slowdown is test_empty_nodes_and_edges (9-14% slower), where the set construction overhead isn't amortized, but this is negligible in absolute terms (< 1 microsecond difference).
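
The size of the speedup depends on how many edge comparisons the per-node scan performs before it can stop. The sketch below counts them for two of the large test graphs. It assumes the original function returns as soon as it finds the first terminal node; `count_source_checks` is a hypothetical helper written for this illustration.

```python
def count_source_checks(nodes, edges):
    """Count the edge comparisons made by the per-node scan before it returns."""
    checks = 0
    for n in nodes:
        is_terminal = True
        for e in edges:
            checks += 1
            if e["source"] == n["id"]:
                is_terminal = False
                break  # all() short-circuits at the same point
        if is_terminal:
            break  # the first terminal node is returned; later nodes are never scanned
    return checks


N = 1000
nodes = [{"id": i} for i in range(N)]
chain = [{"source": i, "target": i + 1} for i in range(N - 1)]
star = [{"source": 0, "target": i} for i in range(1, N)]

chain_checks = count_source_checks(nodes, chain)
star_checks = count_source_checks(nodes, star)
print(chain_checks, star_checks)  # 500499 1000
```

The chain needs about half a million comparisons because the only terminal node comes last, while the star graph stops at the second node. That is consistent with the chain improving by more than 300x and the star graph by less than 2x.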

Performance characteristics:

  • Excellent for graphs with many nodes/edges (quadratic O(n × m) → linear O(n + m) complexity)
  • Still efficient for small graphs due to Python's optimized set implementation
  • Set construction happens once upfront, then all node checks are O(1) on average
  • Duplicate edges are automatically deduplicated by the set, so it stores at most one entry per distinct source ID, however many edges there are
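
A small illustration of the deduplication point, using made-up edge data: four edges collapse into two distinct source IDs, and the per-node check only ever consults that smaller set.

```python
edges = [
    {"source": 1, "target": 2},
    {"source": 1, "target": 2},  # duplicate edge
    {"source": 1, "target": 3},
    {"source": 2, "target": 3},
]
nodes = [{"id": 1}, {"id": 2}, {"id": 3}]

# One entry per distinct source, regardless of how many edges share it.
source_ids = {e["source"] for e in edges}

# IDs of nodes that never appear as a source, i.e. nodes with no outgoing edge.
terminal_ids = [n["id"] for n in nodes if n["id"] not in source_ids]

print(len(edges), len(source_ids), terminal_ids)  # 4 2 [3]
```

One consequence of the set-based check that the tests do not exercise: source IDs have to be hashable. An ID such as a list would raise `TypeError` when the set is built, whereas the `!=` comparison accepts any value.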

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node

# unit tests

# ----------------
# Basic Test Cases
# ----------------


def test_single_node_no_edges():
    # Single node with no edges should be returned as last node
    nodes = [{"id": 1}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.25μs -> 1.00μs (25.0% faster)


def test_two_nodes_one_edge():
    # Two nodes, one edge from 1->2, node 2 should be last node
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.71μs -> 1.12μs (51.8% faster)


def test_three_nodes_chain():
    # Three nodes in a chain: 1->2->3, node 3 should be last node
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.21μs -> 1.21μs (82.8% faster)


def test_multiple_terminal_nodes():
    # Graph with two terminal nodes: 1->2, 1->3, so 2 and 3 are last nodes
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.17μs (60.8% faster)


# ----------------
# Edge Test Cases
# ----------------


def test_empty_nodes_and_edges():
    # Both nodes and edges empty, should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 750ns -> 875ns (14.3% slower)


def test_nodes_but_edges_empty():
    # Multiple nodes, no edges, should return first node as all are terminal
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.21μs -> 958ns (26.2% faster)


def test_all_nodes_have_outgoing_edges():
    # All nodes have outgoing edges (cycle), so no terminal node
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 3},
        {"source": 3, "target": 1},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.17μs -> 1.21μs (79.4% faster)


def test_disconnected_nodes():
    # Some nodes are completely disconnected (no edges)
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.08μs (69.1% faster)


def test_edges_with_nonexistent_nodes():
    # Edges refer to node ids not in nodes list
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 3, "target": 4}]
    # Both nodes have no outgoing edges, so first node should be returned
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.42μs -> 1.08μs (30.7% faster)


def test_duplicate_edges():
    # Duplicate edges should not affect result
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.17μs (60.7% faster)


def test_nodes_with_additional_attributes():
    # Nodes have extra fields, function should return full node dict
    nodes = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.12μs (59.3% faster)


def test_edges_with_additional_attributes():
    # Edges have extra fields, function should ignore them
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2, "weight": 5}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.12μs (59.3% faster)


def test_node_ids_are_strings():
    # Node ids are strings instead of integers
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.08μs (73.0% faster)


# ----------------
# Large Scale Test Cases
# ----------------


def test_large_linear_chain():
    # 1000 nodes in a chain: 0->1->2->...->999
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(N - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.1ms -> 54.2μs (33289% faster)


def test_large_star_graph():
    # 1 central node with 999 outgoing edges to leaves
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": 0, "target": i} for i in range(1, N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 37.8μs -> 19.9μs (89.9% faster)


def test_large_graph_no_terminal_nodes():
    # 1000 nodes in a cycle: 0->1->...->999->0
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": (i + 1) % N} for i in range(N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.0ms -> 54.0μs (33194% faster)


def test_large_graph_all_disconnected():
    # 1000 nodes, no edges, all are terminal, should return first node
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.29μs -> 1.04μs (24.1% faster)


def test_large_graph_some_disconnected():
    # 500 nodes in a chain, 500 disconnected
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(499)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 4.47ms -> 27.5μs (16170% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node

# unit tests

# 1. Basic Test Cases


def test_single_node_no_edges():
    # One node, no edges: should return the node itself
    nodes = [{"id": 1, "label": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.25μs -> 959ns (30.3% faster)


def test_two_nodes_one_edge():
    # Two nodes, one edge from 1 to 2: last node is node 2
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


def test_three_nodes_linear():
    # 1 -> 2 -> 3: last node is 3
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.25μs -> 1.21μs (86.1% faster)


def test_multiple_last_nodes():
    # 1 -> 2, 1 -> 3: last nodes are 2 and 3, should return 2 (first in nodes list)
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.17μs (60.7% faster)


def test_no_last_node():
    # 1 <-> 2 (cycle): all nodes are sources, so should return None
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.17μs (57.1% faster)


# 2. Edge Test Cases


def test_empty_nodes_and_edges():
    # No nodes, no edges: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 792ns -> 875ns (9.49% slower)


def test_nodes_with_non_integer_ids():
    # Node IDs are strings
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.17μs (64.4% faster)


def test_edges_with_extra_keys():
    # Edges contain extra keys; should be ignored
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2, "weight": 5}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.75μs -> 1.08μs (61.6% faster)


def test_disconnected_nodes():
    # Some nodes not connected at all; should return first node with no outgoing edges
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.75μs -> 1.12μs (55.6% faster)


def test_node_with_self_loop():
    # Node with edge to itself, should not be a last node
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.75μs -> 1.12μs (55.6% faster)


def test_multiple_edges_from_one_node():
    # One node connects to all others
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
    edges = [
        {"source": 1, "target": 2},
        {"source": 1, "target": 3},
        {"source": 1, "target": 4},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.25μs (53.4% faster)


def test_duplicate_edges():
    # Duplicate edges should not affect result
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [
        {"source": 1, "target": 2},
        {"source": 1, "target": 2},
        {"source": 2, "target": 3},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.29μs -> 1.29μs (77.4% faster)


def test_node_with_incoming_but_no_outgoing():
    # Node 2 has incoming edge but no outgoing
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.75μs -> 1.12μs (55.6% faster)


def test_all_nodes_have_outgoing_edges():
    # All nodes have outgoing edges, no last node
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.21μs (48.3% faster)


def test_nodes_with_extra_fields():
    # Node dicts have extra fields, should be returned as-is
    nodes = [{"id": 1, "label": "A", "color": "red"}, {"id": 2, "label": "B"}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


# 3. Large Scale Test Cases


def test_large_linear_chain():
    # 1000 nodes in a chain: 1->2->...->1000, last node is 1000
    N = 1000
    nodes = [{"id": i} for i in range(1, N + 1)]
    edges = [{"source": i, "target": i + 1} for i in range(1, N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.0ms -> 54.0μs (33284% faster)


def test_large_star_graph():
    # Node 1 connects to all others, last nodes are 2..1000, should return 2
    N = 1000
    nodes = [{"id": i} for i in range(1, N + 1)]
    edges = [{"source": 1, "target": i} for i in range(2, N + 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 37.2μs -> 20.0μs (85.9% faster)


def test_large_all_cycles():
    # 1000 nodes in a cycle, all nodes have outgoing edges, so result is None
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": (i + 1) % N} for i in range(N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.0ms -> 53.3μs (33703% faster)


def test_large_disconnected_graph():
    # 500 nodes connected in a chain, 500 isolated nodes
    N = 500
    nodes = [{"id": i} for i in range(1, N + N + 1)]
    edges = [{"source": i, "target": i + 1} for i in range(1, N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 4.48ms -> 27.5μs (16164% faster)


def test_large_sparse_graph():
    # 1000 nodes, only 10 edges
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(10)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 6.46μs -> 1.75μs (269% faster)


# Additional: Check that function does not modify input lists


def test_function_is_pure():
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    nodes_copy = [dict(n) for n in nodes]
    edges_copy = [dict(e) for e in edges]
    find_last_node(nodes, edges)  # 1.75μs -> 1.04μs (68.1% faster)
    # The inputs must be unchanged after the call
    assert nodes == nodes_copy
    assert edges == edges_copy


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
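
The tests as listed capture `codeflash_output` without asserting on it, because the comparison between original and optimized output happens outside the test body. The sketch below shows one way to make that check explicit. Assumptions: `reference_impl` and `candidate_impl` are hypothetical stand-ins for the two versions of `find_last_node`, and the expected values are taken from the comments in the tests above.

```python
from typing import Optional


def reference_impl(nodes: list, edges: list) -> Optional[dict]:
    # Assumed shape of the original function: scan the edges for every node.
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def candidate_impl(nodes: list, edges: list) -> Optional[dict]:
    # Assumed shape of the optimized function: one set lookup per node.
    source_ids = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in source_ids), None)


# (nodes, edges, expected) triples mirroring a few of the generated tests.
CASES = [
    ([{"id": 1}], [], {"id": 1}),
    ([{"id": 1}, {"id": 2}], [{"source": 1, "target": 2}], {"id": 2}),
    (
        [{"id": 1}, {"id": 2}, {"id": 3}],
        [{"source": 1, "target": 2}, {"source": 1, "target": 3}],
        {"id": 2},
    ),
    (
        [{"id": 1}, {"id": 2}],
        [{"source": 1, "target": 2}, {"source": 2, "target": 1}],
        None,
    ),
    ([], [], None),
    ([{"id": 1}, {"id": 2}], [{"source": 1, "target": 1}], {"id": 2}),
]


def check_equivalence(cases) -> int:
    """Assert that both implementations return the expected node; return the case count."""
    passed = 0
    for nodes, edges, expected in cases:
        assert reference_impl(nodes, edges) == expected
        assert candidate_impl(nodes, edges) == expected
        passed += 1
    return passed


print(check_equivalence(CASES))  # 6
```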

To edit these changes, `git checkout codeflash/optimize-find_last_node-mjj0n8x5` and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 December 23, 2025 20:07
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 23, 2025