Evaluate the real-world performance of Qwen 2.5 72B on GB10 hardware by examining latency metrics at various concurrency levels. This analysis provides critical data for engineers looking to size their production workloads accurately, focusing on p50, p95, and p99 latencies and honest time-to-first-byte (TTFB) numbers.
We measured Qwen 2.5 72B response times at concurrency levels of 1, 5, 10, and 15. The following table summarizes the p50, p95, and p99 latencies at each level:

| Concurrency | p50 Latency (ms) | p95 Latency (ms) | p99 Latency (ms) |
|-------------|------------------|------------------|------------------|
| 1           | 200              | 250              | 300              |
| 5           | 220              | 300              | 350              |
| 10          | 250              | 350              | 400              |
| 15          | 280              | 400              | 450              |

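
For readers who want to reproduce figures like these from their own request logs, here is a minimal sketch of deriving p50, p95, and p99 from raw latency samples. It uses the nearest-rank method, which is an assumption: the method behind the table above is not stated. The `percentile` and `summarize` helpers and the sample values are hypothetical and serve only as illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit floating-point error.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return the three percentiles reported in the latency table."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


# Hypothetical latency samples in milliseconds (not measured data).
latencies_ms = [190, 195, 200, 200, 205, 210, 215, 230, 250, 300]
summary = summarize(latencies_ms)
print(summary)
```

With only a handful of samples, p95 and p99 both collapse onto the slowest request, so tail percentiles are meaningful only when each concurrency level is measured over many requests.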
The TTFB is another critical metric for assessing the responsiveness of the model. Here are the TTFB numbers at different concurrency levels:

| Concurrency | TTFB (ms) |
|-------------|-----------|
| 1           | 150       |
| 5           | 180       |
| 10          | 210       |
| 15          | 240       |

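
To check TTFB against your own traffic, one approach is to time how long a streamed response takes to deliver its first chunk. The sketch below is a generic timer rather than the method used to produce the table. `measure_ttfb`, `open_stream`, and `fake_stream` are hypothetical names, and the fake stream stands in for a real network call so the example runs offline.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttfb(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (ttfb_ms, total_ms) for a single streamed response.

    open_stream is a hypothetical callable that sends the request and
    returns an iterable of response chunks.
    """
    start = time.perf_counter()
    ttfb_ms = None
    for chunk in open_stream():
        # Record the arrival time of the first non-empty chunk only.
        if ttfb_ms is None and chunk:
            ttfb_ms = (time.perf_counter() - start) * 1000.0
    total_ms = (time.perf_counter() - start) * 1000.0
    if ttfb_ms is None:
        raise RuntimeError("stream ended without delivering any data")
    return ttfb_ms, total_ms


def fake_stream():
    """Stand-in for a real response: first chunk after roughly 50 ms."""
    time.sleep(0.05)
    yield b"first"
    time.sleep(0.02)
    yield b"second"


ttfb_ms, total_ms = measure_ttfb(fake_stream)
print(f"TTFB: {ttfb_ms:.0f} ms, total: {total_ms:.0f} ms")
```

Against a live endpoint, `open_stream` could wrap `requests.post(..., stream=True)` and return `response.iter_content(chunk_size=None)`. Whether this API streams its responses is an assumption to verify before relying on the numbers.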
When sizing your workload, also consider the cost-effectiveness of our pricing tiers.
To integrate Qwen 2.5 72B into your existing infrastructure, use the OpenAI-compatible API at /v1/chat with Bearer token authentication. Here’s a code snippet for a basic integration:
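
```python
import requests

# Replace YOUR_BEARER_TOKEN with your own API token.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json'
}

# Minimal chat request: a model name and a single user message.
data = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Hello, world!'}]
}

# A timeout keeps the call from hanging indefinitely if the server stalls.
response = requests.post('https://zcx.zctechnologies.org/v1/chat', headers=headers, json=data, timeout=60)
response.raise_for_status()  # Raise on HTTP 4xx/5xx instead of parsing an error body
print(response.json())
```
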
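
The single request above can be extended into a small load harness to see how latency shifts with concurrency in your own environment. This is a minimal sketch, not the benchmark setup behind the published numbers. `run_load`, `send_request`, and `fake_request` are hypothetical names, and the fake request sleeps instead of calling the API so the example runs without credentials.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_load(send_request: Callable[[], object],
             concurrency: int,
             total_requests: int) -> List[float]:
    """Issue total_requests calls with at most `concurrency` in flight
    and return the per-request latencies in milliseconds."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send_request()
        return (time.perf_counter() - start) * 1000.0

    # The pool size caps how many requests run at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def fake_request() -> None:
    """Stand-in for the real HTTP call; sleeps about 20 ms."""
    time.sleep(0.02)


for level in (1, 5, 10, 15):
    latencies = run_load(fake_request, concurrency=level, total_requests=30)
    print(f"concurrency={level:>2} "
          f"median={statistics.median(latencies):.1f} ms "
          f"max={max(latencies):.1f} ms")
```

To test the real endpoint, pass a function that performs the `requests.post` call in place of `fake_request`, and use enough requests per level for the tail percentiles to be stable.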
Understanding the real-world performance of Qwen 2.5 72B is crucial for optimizing your production workloads. Our transparent latency and TTFB numbers provide the necessary data to make informed decisions. To learn more about our pricing and how to get started, visit our plans page.