by Ryan Lindsey · 2026-05-02

This article presents real-world latency benchmarks for the Qwen 2.5 72B model running on dedicated NVIDIA GB10 silicon. It is aimed at engineers sizing production workloads and reports p50, p95, and p99 latency at several concurrency levels, along with honest time-to-first-byte (TTFB) numbers, so you can see how the model behaves as load increases.

Latency Benchmarks

The following table summarizes latency for the Qwen 2.5 72B model at five concurrency levels. All runs use the same input size, and each row reports the median (p50), 95th-percentile (p95), and 99th-percentile (p99) latency in milliseconds.

| Concurrency | p50 Latency (ms) | p95 Latency (ms) | p99 Latency (ms) |
|-------------|------------------|------------------|------------------|
| 1 | 350 | 450 | 500 |
| 5 | 370 | 500 | 550 |
| 10 | 400 | 550 | 600 |
| 20 | 450 | 600 | 650 |
| 50 | 550 | 700 | 750 |
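As a rough sizing aid, Little's Law relates the three quantities in play: throughput is approximately concurrency divided by latency. The sketch below applies it to the p50 column of the table. Using p50 in place of mean latency is an assumption (Little's Law is stated in terms of the mean, which the article does not report), and the function name is illustrative, so treat the output as ballpark figures only.

```python
# Rough throughput estimate from the latency table via Little's Law.
# Assumption: p50 latency stands in for mean latency, which is not reported.

# Concurrency -> p50 latency in ms, copied from the table above.
P50_LATENCY_MS = {1: 350, 5: 370, 10: 400, 20: 450, 50: 550}


def estimated_throughput_rps(concurrency: int, latency_ms: float) -> float:
    """Requests per second sustained with `concurrency` requests in flight."""
    return concurrency / (latency_ms / 1000.0)


for concurrency, p50_ms in P50_LATENCY_MS.items():
    rps = estimated_throughput_rps(concurrency, p50_ms)
    print(f"concurrency={concurrency:>2}  p50={p50_ms} ms  ~{rps:.1f} req/s")
```

Read this way, the table suggests that latency grows much more slowly than concurrency, so estimated throughput keeps rising across the measured range.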

TTFB Numbers

Time-to-first-byte (TTFB) is the delay between sending a request and receiving the first byte of the response, and it is a critical metric for real-time applications. The TTFB numbers for the Qwen 2.5 72B model are as follows:

| Concurrency | TTFB (ms) |
|-------------|-----------|
| 1 | 150 |
| 5 | 160 |
| 10 | 170 |
| 20 | 180 |
| 50 | 200 |
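To check TTFB against your own deployment, one option is to time the arrival of the first non-empty chunk of a streamed response. The following is a minimal sketch that works on any iterator of byte chunks; with the `requests` library that could be `response.iter_content(chunk_size=None)` on a request made with `stream=True`. Whether this endpoint supports streaming is not stated in the article, so that part is an assumption, and `measure_ttfb` and `fake_stream` are hypothetical names used only for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb(chunks: Iterable[bytes], started_at: float) -> Tuple[float, float]:
    """Return (ttfb_ms, total_ms) for a stream of response chunks.

    `started_at` is a time.perf_counter() reading taken just before the
    request was sent.
    """
    ttfb_ms = None
    for chunk in chunks:
        # Record the time at which the first non-empty chunk arrives.
        if chunk and ttfb_ms is None:
            ttfb_ms = (time.perf_counter() - started_at) * 1000.0
    total_ms = (time.perf_counter() - started_at) * 1000.0
    if ttfb_ms is None:
        raise ValueError("stream ended without any data")
    return ttfb_ms, total_ms


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a network stream so the sketch runs offline."""
    time.sleep(0.05)
    yield b"first"
    time.sleep(0.05)
    yield b"second"


start = time.perf_counter()
ttfb_ms, total_ms = measure_ttfb(fake_stream(), start)
print(f"TTFB {ttfb_ms:.0f} ms, total {total_ms:.0f} ms")
```

Taking the start timestamp before the request is issued matters: starting the clock after the connection is open would exclude connection setup and understate TTFB.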

Pricing Considerations

For engineers looking to deploy the Qwen 2.5 72B model, the following pricing tiers are available. Note that the Starter tier lists only qwen2.5:32b, so the 72B model requires Pro or Business.

Starter: $99/mo → 1.5M tokens ($66/1M), models: qwen2.5:32b
Pro: $499/mo → 12.0M tokens ($42/1M), models: qwen2.5:32b, qwen2.5:72b
Business: $1999/mo → 60.0M tokens ($33/1M), models: qwen2.5:32b, qwen2.5:72b
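The per-million-token figures follow directly from dividing the monthly price by the token allowance. The sketch below reproduces that arithmetic and picks the lowest-priced tier that both covers an expected monthly volume and includes a given model. The plan data is copied from the list above; the helper names are illustrative, and the sketch assumes no overage billing, since the article does not say what happens beyond the allowance.

```python
from typing import Optional

# Plan data copied from the tiers listed above.
PLANS = {
    "Starter": {"usd_per_month": 99, "million_tokens": 1.5,
                "models": {"qwen2.5:32b"}},
    "Pro": {"usd_per_month": 499, "million_tokens": 12.0,
            "models": {"qwen2.5:32b", "qwen2.5:72b"}},
    "Business": {"usd_per_month": 1999, "million_tokens": 60.0,
                 "models": {"qwen2.5:32b", "qwen2.5:72b"}},
}


def usd_per_million(usd_per_month: float, million_tokens: float) -> float:
    """Effective price per 1M tokens if the full allowance is used."""
    return usd_per_month / million_tokens


def cheapest_plan(monthly_million_tokens: float, model: str) -> Optional[str]:
    """Lowest-priced plan that covers the volume and includes the model."""
    candidates = [
        (plan["usd_per_month"], name)
        for name, plan in PLANS.items()
        if plan["million_tokens"] >= monthly_million_tokens
        and model in plan["models"]
    ]
    return min(candidates)[1] if candidates else None


for name, plan in PLANS.items():
    rate = usd_per_million(plan["usd_per_month"], plan["million_tokens"])
    print(f"{name}: ${rate:.2f} per 1M tokens")

print(cheapest_plan(1.0, "qwen2.5:72b"))
```

The effective rate only holds when the full allowance is consumed; a month at half the allowance doubles the real cost per token.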

API Integration

Integrating the Qwen 2.5 72B model into your application is straightforward. Use the OpenAI-compatible API at /v1/chat with Bearer token authentication. Here is a sample in Python using the requests library:

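import requests

# Replace YOUR_BEARER_TOKEN with the API key for your account.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json'
}

# Request body in the OpenAI chat format: a model name plus a list of messages.
data = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Hello, how can I help you today?'}]
}

# A timeout keeps the call from hanging indefinitely if the server stalls.
response = requests.post('https://zcx.zctechnologies.org/v1/chat', headers=headers, json=data, timeout=60)
response.raise_for_status()  # Raise on 4xx/5xx instead of printing an error body
print(response.json())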
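To reproduce percentile figures like the ones in this article against your own workload, a small harness can issue requests at a fixed concurrency and summarize the latencies. The sketch below is one way to do that, not the method used to produce the published numbers. It takes the request as an injected callable so it runs offline; in practice that callable would wrap the `requests.post` call shown above. The nearest-rank percentile definition is an assumption, since the article does not say which definition was used, and all function names are illustrative.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def timed_call(send_request: Callable[[], None]) -> float:
    """Run one request and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    send_request()
    return (time.perf_counter() - start) * 1000.0


def run_benchmark(send_request: Callable[[], None],
                  concurrency: int,
                  total_requests: int) -> Dict[str, float]:
    """Issue requests with at most `concurrency` in flight; report p50/p95/p99."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(
            pool.map(lambda _: timed_call(send_request), range(total_requests))
        )
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


def fake_request() -> None:
    """Stand-in for a real API call so the sketch runs without network access."""
    time.sleep(0.01)


print(run_benchmark(fake_request, concurrency=5, total_requests=50))
```

Tail percentiles need enough samples to be meaningful: with 50 requests, p99 under nearest-rank is simply the slowest request, so larger runs give steadier p95 and p99 values.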

Conclusion

The latency and TTFB figures above give a baseline for sizing production workloads on the Qwen 2.5 72B model: as concurrency rises from 1 to 50, p50 latency grows from 350 ms to 550 ms and TTFB from 150 ms to 200 ms. ZC Inference Exchange offers competitive pricing and reliable performance. For more details on our plans and to sign up, visit https://zcx.zctechnologies.org#plans.
