This article presents real-world latency benchmarks for the Qwen 2.5 72B model running on dedicated NVIDIA GB10 silicon. It gives engineers the data they need to size production workloads: p50, p95, and p99 latency at several concurrency levels, plus honest time-to-first-byte (TTFB) numbers that show how the model behaves as load increases.
The following table summarizes latency for the Qwen 2.5 72B model at each concurrency level. All runs use a consistent input size, and the table reports the median (p50), 95th percentile (p95), and 99th percentile (p99) latency in milliseconds.

| Concurrency | p50 Latency (ms) | p95 Latency (ms) | p99 Latency (ms) |
|-------------|------------------|------------------|------------------|
| 1           | 350              | 450              | 500              |
| 5           | 370              | 500              | 550              |
| 10          | 400              | 550              | 600              |
| 20          | 450              | 600              | 650              |
| 50          | 550              | 700              | 750              |

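One way to turn the latency table into a rough capacity estimate is Little's law (requests in flight = throughput × mean latency). The sketch below applies it to the p50 column. It treats p50 as a stand-in for mean latency and assumes every concurrency slot stays busy; both are simplifying assumptions, and the helper names are illustrative rather than part of any published tooling.

```python
from typing import Optional

# p50 latencies in milliseconds, copied from the table above, keyed by concurrency.
P50_MS = {1: 350, 5: 370, 10: 400, 20: 450, 50: 550}


def estimated_throughput(concurrency: int, latency_ms: float) -> float:
    """Little's law rearranged: throughput (req/s) = in-flight requests / latency (s)."""
    return concurrency / (latency_ms / 1000.0)


def min_concurrency_for(target_rps: float) -> Optional[int]:
    """Smallest benchmarked concurrency whose estimated throughput meets target_rps.

    Returns None when no benchmarked level reaches the target.
    """
    for concurrency in sorted(P50_MS):
        if estimated_throughput(concurrency, P50_MS[concurrency]) >= target_rps:
            return concurrency
    return None


if __name__ == "__main__":
    for concurrency, latency_ms in sorted(P50_MS.items()):
        rps = estimated_throughput(concurrency, latency_ms)
        print(f"concurrency={concurrency:>2}  p50={latency_ms} ms  ~{rps:.1f} req/s")
```

Because the estimate uses p50, it is optimistic for tail-sensitive workloads; substituting the p95 or p99 column gives a more conservative figure.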
Time-to-first-byte (TTFB) is a critical metric for real-time applications. The TTFB numbers for the Qwen 2.5 72B model are as follows:

| Concurrency | TTFB (ms) |
|-------------|-----------|
| 1           | 150       |
| 5           | 160       |
| 10          | 170       |
| 20          | 180       |
| 50          | 200       |

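To check TTFB against your own traffic, you can time the gap between sending a request and receiving the first non-empty chunk of a streamed response. The following is a minimal sketch, not the method used to produce the table above. `measure_ttfb` and its `open_stream` parameter are hypothetical names: `open_stream` is any callable that issues the request and returns an iterable of body chunks.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttfb(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttfb_ms, total_ms, total_bytes) for one streamed response.

    The timer starts before open_stream is called, so connection setup
    and server-side queueing count toward TTFB. ttfb_ms is None when the
    stream yields no non-empty chunk.
    """
    start = clock()
    ttfb_ms: Optional[float] = None
    total_bytes = 0
    for chunk in open_stream():
        if chunk and ttfb_ms is None:
            # First non-empty chunk: record time to first byte.
            ttfb_ms = (clock() - start) * 1000.0
        total_bytes += len(chunk)
    total_ms = (clock() - start) * 1000.0
    return ttfb_ms, total_ms, total_bytes


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a real streamed HTTP response body.
        time.sleep(0.05)
        yield b"first"
        time.sleep(0.02)
        yield b" second"

    ttfb, total, size = measure_ttfb(fake_stream)
    print(f"TTFB {ttfb:.0f} ms, total {total:.0f} ms, {size} bytes")
```

With the `requests` library, `open_stream` could be something like `lambda: requests.post(url, headers=headers, json=payload, stream=True).iter_content(chunk_size=None)`. Whether the endpoint supports streamed responses is not stated in this article and should be confirmed first.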
For engineers looking to deploy the Qwen 2.5 72B model, the available pricing tiers are listed at https://zcx.zctechnologies.org#plans.
Integrating the Qwen 2.5 72B model into your application is straightforward. Use the OpenAI-compatible API at /v1/chat with Bearer token authentication. Here is a sample code snippet:
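```python
import requests

# Replace YOUR_BEARER_TOKEN with the API token for your account.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json',
}

# Chat payload: the model name and a single user message.
data = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Hello, how can I help you today?'}],
}

# Send the request; json= serializes the payload as the JSON body.
response = requests.post('https://zcx.zctechnologies.org/v1/chat',
                         headers=headers, json=data, timeout=60)
response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body.
print(response.json())
```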
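To reproduce p50/p95/p99 figures for your own prompts, a small load harness can issue requests at a fixed concurrency and summarize the observed latencies. The sketch below is one possible approach and not necessarily how the numbers in this article were collected. `run_benchmark` and `send_request` are hypothetical names; `send_request` is any zero-argument callable that performs one complete request, for example a wrapper around the `requests.post` call above.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_benchmark(
    send_request: Callable[[], object],
    concurrency: int,
    total_requests: int,
) -> Dict[str, float]:
    """Run total_requests calls with at most `concurrency` in flight.

    Returns p50, p95, and p99 latency in milliseconds.
    """
    if total_requests < 2:
        raise ValueError("need at least 2 requests to compute percentiles")

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send_request()
        return (time.perf_counter() - start) * 1000.0

    # The pool size caps how many requests are in flight at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))

    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # Stand-in workload: replace with a function that calls the API.
    result = run_benchmark(lambda: time.sleep(0.01), concurrency=4, total_requests=40)
    print({name: round(ms, 1) for name, ms in result.items()})
```

Threads suit this job because each worker spends nearly all of its time waiting on the network. For stable tail percentiles, run far more requests than the percentile implies (p99 needs at least a few hundred samples to be meaningful).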
Understanding the latency and TTFB characteristics of the Qwen 2.5 72B model is essential for sizing production workloads. ZC Inference Exchange offers competitive pricing and reliable performance. For more details on our plans and to sign up, visit https://zcx.zctechnologies.org#plans.