TTS Concurrent Real-Time Stream Capacity

#445 · Machine Learning · Medium

Problem

Determine the maximum number of concurrent text-to-speech (TTS) streams a server can sustain in real time. Given the per-stream compute cost (the time to synthesize one second of audio), the number of available compute units (e.g., GPU cores or inference threads), and a target real-time factor (RTF), compute the maximum concurrent stream count, the aggregate audio throughput, and the compute utilization.

Solution

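def tts_concurrent_capacity(
    synthesis_time_per_sec_audio: float,
    num_compute_units: int,
    target_rtf: float = 1.0
) -> dict:
    """Estimate how many TTS streams the server can sustain in real time."""
    if synthesis_time_per_sec_audio <= 0:
        return {"error": "synthesis time must be positive"}

    # Real-time factor (RTF): seconds of compute per second of audio.
    rtf_per_unit = synthesis_time_per_sec_audio

    # Streams one unit can serve while each stream stays within target_rtf.
    streams_per_unit = target_rtf / rtf_per_unit

    # Total capacity across all units, truncated to whole streams.
    max_concurrent = int(streams_per_unit * num_compute_units)

    # Each real-time stream emits 1 s of audio per wall-clock second.
    aggregate_audio_per_sec = max_concurrent * 1.0

    # Fraction of the compute budget in use; 0 when there are no units.
    utilization = (max_concurrent * rtf_per_unit) / num_compute_units if num_compute_units > 0 else 0

    return {
        "max_concurrent_streams": max_concurrent,
        "aggregate_audio_sec_per_sec": round(aggregate_audio_per_sec, 2),
        "compute_utilization": round(min(utilization, 1.0), 4),  # capped at 100%
        "rtf_per_stream": round(rtf_per_unit, 4)
    }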
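
As a reading aid, the function above can be condensed into closed form. The symbols are shorthand introduced for this note and do not appear in the original: r stands for synthesis_time_per_sec_audio, N for num_compute_units, τ for target_rtf, S_max for the stream count, A for the aggregate audio throughput, and U for the utilization. For non-negative inputs, int() truncation matches the floor shown here.

```latex
\begin{aligned}
S_{\max} &= \left\lfloor \frac{\tau}{r}\, N \right\rfloor \\
A &= S_{\max} \\
U &= \min\!\left( \frac{S_{\max}\, r}{N},\ 1 \right) \qquad (N > 0)
\end{aligned}
```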

Explanation

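  1. The real-time factor (RTF) of a compute unit is the synthesis time divided by the duration of the audio produced, so it equals the time needed to synthesize 1 second of audio. If it takes 0.25 s to produce 1 s of audio, the RTF is 0.25.
  2. Each compute unit can handle target_rtf / rtf_per_unit streams concurrently while maintaining the real-time target. With an RTF of 0.25 and a target of 1.0, that is 4 streams per unit.
  3. Multiply streams per unit by the number of compute units to get the total concurrent capacity, then truncate to a whole number of streams.
  4. Aggregate throughput is the total seconds of audio produced per wall-clock second across all streams. Each real-time stream contributes 1 second of audio per second, so this equals the stream count.
  5. Compute utilization is the fraction of the total compute budget consumed: max_concurrent * rtf_per_unit / num_compute_units, capped at 1.0.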
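
The steps above can be traced with concrete numbers. The sketch below is a minimal restatement, not the reference solution: the helper capacity_sketch, its shortened parameter names, and all input values are assumptions chosen for illustration.

```python
def capacity_sketch(rtf: float, units: int, target_rtf: float = 1.0) -> dict:
    """Hypothetical helper restating the five steps; assumes rtf > 0."""
    streams_per_unit = target_rtf / rtf          # step 2
    max_streams = int(streams_per_unit * units)  # step 3, truncated to whole streams
    audio_per_sec = float(max_streams)           # step 4, 1 s of audio per stream per second
    # Step 5: share of the compute budget in use, capped at 1.0.
    utilization = min(max_streams * rtf / units, 1.0) if units > 0 else 0.0
    return {
        "streams": max_streams,
        "audio_sec_per_sec": audio_per_sec,
        "utilization": round(utilization, 4),
    }


# RTF 0.25 on 8 units: 4 streams per unit, 32 in total, budget fully used.
even_split = capacity_sketch(rtf=0.25, units=8)

# RTF 0.3 on 4 units: 13.33 truncates to 13 streams, utilization 0.975.
leftover = capacity_sketch(rtf=0.3, units=4)

# Target RTF 0.3 with unit RTF 0.1: 3 streams on paper, 2 after truncation.
rounding = capacity_sketch(rtf=0.1, units=1, target_rtf=0.3)

print(even_split)
print(leftover)
print(rounding)
```

The third call shows a caveat of truncating with int(): three streams fit exactly on paper, but 0.3 / 0.1 evaluates to slightly less than 3 in binary floating point, so the count drops to 2. Whether to guard against this, for example with a small tolerance before truncating, is a design choice the original solution does not address.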

Complexity

  • Time: O(1)
  • Space: O(1)