ASR Real-Time Factor for Parallel Chunk Transcription

#444 · Machine Learning · Medium

Problem

Compute the real-time factor (RTF) for an ASR (Automatic Speech Recognition) system that processes audio by splitting it into overlapping chunks and transcribing them in parallel. Given the total audio duration, chunk size, overlap, number of parallel workers, and per-chunk processing time, calculate the RTF and effective wall-clock time.

Solution

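def asr_realtime_factor(
    audio_duration_sec: float,
    chunk_size_sec: float,
    overlap_sec: float,
    num_workers: int,
    per_chunk_process_sec: float
) -> dict:
    """Return chunk count, batch count, wall-clock time, and RTF for chunked parallel ASR."""
    # Distance between the starts of two consecutive chunks.
    step = chunk_size_sec - overlap_sec
    if step <= 0:
        return {"error": "overlap must be less than chunk_size"}
    # Guard against division by zero in the batch count below.
    if num_workers < 1:
        return {"error": "num_workers must be at least 1"}

    # The first chunk covers chunk_size_sec; each further chunk adds step seconds.
    num_chunks = 1
    covered = chunk_size_sec
    while covered < audio_duration_sec:
        covered += step
        num_chunks += 1

    # Ceiling division: each batch holds at most num_workers chunks.
    num_batches = (num_chunks + num_workers - 1) // num_workers

    # Batches run one after another; chunks within a batch run in parallel.
    wall_clock_sec = num_batches * per_chunk_process_sec

    # RTF = processing time / audio duration (0.0 for empty audio).
    rtf = wall_clock_sec / audio_duration_sec if audio_duration_sec > 0 else 0.0

    return {
        "num_chunks": num_chunks,
        "num_batches": num_batches,
        "wall_clock_sec": round(wall_clock_sec, 4),
        "rtf": round(rtf, 4)
    }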

Explanation

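  1. Compute the step between consecutive chunk starts: step = chunk_size - overlap. The step must be positive, so the overlap has to be smaller than the chunk size; otherwise the function returns an error.
  2. Count the chunks needed to cover the audio. The first chunk covers chunk_size seconds and each additional chunk extends coverage by step seconds, so chunks are added until coverage reaches the audio duration.
  3. With num_workers parallel workers, the chunks are processed in ceil(num_chunks / num_workers) batches.
  4. Wall-clock time is the number of batches times the per-chunk processing time. This assumes every chunk takes the same time to process and that batches run one after another.
  5. RTF (real-time factor) = wall-clock time / audio duration. An RTF below 1 means the system transcribes faster than real time.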
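
The five steps above can be traced with concrete numbers. The sketch below uses hypothetical inputs (60 s of audio, 10 s chunks, 2 s overlap, 4 workers, 1.5 s per chunk) chosen only for illustration; they are not taken from the problem's test cases.

```python
import math

# Hypothetical inputs, chosen only for illustration.
audio_duration_sec = 60.0
chunk_size_sec = 10.0
overlap_sec = 2.0
num_workers = 4
per_chunk_process_sec = 1.5

# Step 1: distance between consecutive chunk starts.
step = chunk_size_sec - overlap_sec  # 8.0

# Step 2: chunks start at 0, 8, 16, ..., 56; the last one covers [56, 66].
num_chunks = 1
covered = chunk_size_sec
while covered < audio_duration_sec:
    covered += step
    num_chunks += 1

# Step 3: 8 chunks on 4 workers need 2 batches.
num_batches = math.ceil(num_chunks / num_workers)

# Step 4: 2 batches at 1.5 s each.
wall_clock_sec = num_batches * per_chunk_process_sec

# Step 5: 3.0 s of processing for 60 s of audio.
rtf = wall_clock_sec / audio_duration_sec

print(num_chunks, num_batches, wall_clock_sec, rtf)
```

With these inputs the system needs 8 chunks in 2 batches, so 60 s of audio takes 3.0 s of wall-clock time, giving an RTF of 0.05.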

Complexity

  • Time: O(num_chunks) for the counting loop, which is roughly audio_duration / (chunk_size - overlap) iterations
  • Space: O(1)
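
The counting loop could in principle be replaced by a closed-form expression, which would bring the time cost down to O(1). The sketch below is one possible way to do that and is not part of the original solution; the helper name `count_chunks_closed_form` is hypothetical, and it raises an exception on invalid input instead of returning an error dict. Because it divides instead of accumulating sums, its result may differ from the loop at floating-point boundary cases.

```python
import math


def count_chunks_closed_form(
    audio_duration_sec: float,
    chunk_size_sec: float,
    overlap_sec: float,
) -> int:
    """Hypothetical O(1) alternative to the counting loop."""
    step = chunk_size_sec - overlap_sec
    if step <= 0:
        raise ValueError("overlap must be less than chunk_size")
    # Audio left uncovered after the first chunk (never negative).
    remaining = max(0.0, audio_duration_sec - chunk_size_sec)
    # Each extra chunk covers step more seconds; round up to cover the tail.
    return 1 + math.ceil(remaining / step)


print(count_chunks_closed_form(60.0, 10.0, 2.0))
```

The first chunk is always counted, so audio shorter than one chunk still yields a single chunk, matching the loop's behavior.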