Code Execution Verifier for Programming Benchmarks

#324 · LLM · Medium

Problem

Implement a code execution verifier for programming benchmarks. Given a candidate code solution and a list of test cases (input/expected output pairs), execute the code safely and verify correctness.

Solution

from typing import Dict, List, Tuple
import sys
import io

def code_execution_verifier(
    code: str,
    test_cases: List[Tuple[str, str]],
    function_name: str = "solution",
    timeout: float = 5.0
) -> Dict:
    results = []
    passed = 0

    try:
        namespace = {}
        exec(code, namespace)
    except Exception as e:
        return {
            "compile_error": str(e),
            "passed": 0,
            "total": len(test_cases),
            "results": []
        }

    func = namespace.get(function_name)
    if func is None:
        return {
            "compile_error": f"Function '{function_name}' not found",
            "passed": 0,
            "total": len(test_cases),
            "results": []
        }

    for inp, expected in test_cases:
        old_stdout = sys.stdout
        sys.stdout = io.StringIO()
        try:
            args = eval(inp)
            if not isinstance(args, tuple):
                args = (args,)
            result = func(*args)
            actual = repr(result)
            stdout_output = sys.stdout.getvalue()
        except Exception as e:
            actual = None
            stdout_output = ""
            results.append({
                "input": inp,
                "expected": expected,
                "actual": f"ERROR: {e}",
                "passed": False
            })
            continue
        finally:
            sys.stdout = old_stdout

        test_passed = str(result) == expected or actual == expected
        if test_passed:
            passed += 1
        results.append({
            "input": inp,
            "expected": expected,
            "actual": actual,
            "passed": test_passed
        })

    return {
        "passed": passed,
        "total": len(test_cases),
        "pass_rate": round(passed / len(test_cases), 4) if test_cases else 0,
        "results": results
    }

Explanation

First compile and execute the code string to define the solution function in a namespace.
If compilation fails or the function is not found, return an error immediately.
For each test case, parse the input, call the function, capture the output.
Compare the actual result against the expected output (string comparison).
Return a summary with pass count, total, pass rate, and per-test results.

Complexity

Time: O(T * F) where T is the number of test cases and F is the runtime of each function call
Space: O(T) for storing results

← #323 #325 →