When sizing a production workload for Qwen 2.5 72B, you need to know how latency changes with concurrency. This post reports real-world p50, p95, and p99 latency for Qwen 2.5 72B on dedicated NVIDIA GB10 silicon. The data comes from our ZC Inference Exchange, which serves Qwen 2.5 72B through OpenAI-compatible API endpoints.
We tested the model at several concurrency levels and recorded p50, p95, and p99 latency at each. The following table summarizes the results for a single request and for 100, 500, and 1000 concurrent requests:

| Concurrent requests | p50 latency (ms) | p95 latency (ms) | p99 latency (ms) |
|---------------------|------------------|------------------|------------------|
| 1                   | 120              | 150              | 170              |
| 100                 | 140              | 200              | 250              |
| 500                 | 200              | 300              | 350              |
| 1000                | 250              | 400              | 450              |

Each figure is a time to first byte (TTFB): the time from sending the request to receiving the first byte of the response. Latency rises with concurrency, as expected under resource contention. Between 1 and 1000 concurrent requests, p50 roughly doubles from 120 ms to 250 ms, while p99 grows from 170 ms to 450 ms.
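
To make the percentile columns concrete, here is a minimal sketch of how p50, p95, and p99 can be derived from raw latency samples. It uses the nearest-rank definition, which is an assumption on our part since the post does not say which percentile method was used. The helper names `percentile` and `summarize` and the sample values are hypothetical and are not the measured data.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not isinstance(pct, int) or not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


def summarize(samples):
    """Return the three percentiles reported in the latency table."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Hypothetical latency samples in milliseconds (illustrative only).
latencies_ms = list(range(101, 201))  # 100 values: 101, 102, ..., 200
print(summarize(latencies_ms))  # {'p50': 150, 'p95': 195, 'p99': 199}
```

With only a handful of samples, p95 and p99 collapse onto the slowest requests, so tail percentiles are only meaningful when each concurrency level is measured over many requests.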
TTFB captures how quickly the model starts to respond. For quick reference, the TTFB figures at each concurrency level are listed below; they match the p50 column of the latency table above:

| Concurrent requests | TTFB (ms) |
|---------------------|-----------|
| 1                   | 120       |
| 100                 | 140       |
| 500                 | 200       |
| 1000                | 250       |

TTFB matters for user experience because it determines how soon a response appears to start, which shapes the perceived performance of the model.
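
If you want to check TTFB against your own workload, the measurement can be sketched as below. This is an illustration under stated assumptions, not the harness used for the numbers above: `measure_ttfb_ms` and `fake_stream` are hypothetical names, and "first byte" is taken to mean the first non-empty chunk of the response body. The timer starts before the request is opened so that connection and queueing time are included.

```python
import time


def measure_ttfb_ms(open_stream, clock=time.perf_counter):
    """Return milliseconds from opening a stream to its first non-empty chunk.

    open_stream is a zero-argument callable that sends the request and
    returns an iterable of byte chunks (hypothetical interface).
    """
    start = clock()  # start timing before the request is sent
    for chunk in open_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (clock() - start) * 1000.0
    raise RuntimeError("stream ended before any bytes arrived")


def fake_stream():
    """Stand-in for a real streaming response (illustrative only)."""
    time.sleep(0.05)  # simulated server think time
    yield b"first chunk"
    yield b"second chunk"


print(f"TTFB: {measure_ttfb_ms(fake_stream):.1f} ms")
```

With the `requests` library, one way to supply `open_stream` is a lambda that calls `requests.post(..., stream=True)` and returns `response.iter_content(chunk_size=None)`. Passing the clock as a parameter keeps the function easy to test with a fake timer.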
ZC Inference Exchange offers competitive pricing for Qwen 2.5 72B. The Pro plan, which includes access to both the Qwen 2.5 32B and 72B models, costs $499 per month for 12.0M tokens, or about $41.58 per 1M tokens. We undercut competitors such as Anthropic and OpenAI by 60-80% per 1M tokens.
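
The per-token figure is simply the plan price divided by the token allowance. The short sketch below reproduces that arithmetic: $499 for 12.0M tokens works out to about $41.58 per 1M tokens, or roughly $42 when rounded to the nearest dollar. The helper name `price_per_million` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def price_per_million(monthly_price_usd, tokens_millions):
    """Return the USD price per 1M tokens, rounded to the nearest cent."""
    price = Decimal(str(monthly_price_usd))
    tokens = Decimal(str(tokens_millions))
    if tokens <= 0:
        raise ValueError("tokens_millions must be positive")
    return (price / tokens).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Pro plan figures quoted in the post: $499 per month for 12.0M tokens.
print(price_per_million("499", "12.0"))  # 41.58
```

Using `Decimal` rather than floats keeps the rounding to cents exact, which matters when the result is quoted as a price.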
Understanding the latency and TTFB of Qwen 2.5 72B is essential for engineers looking to size their production workloads. Our real-world numbers provide a solid basis for making informed decisions. For more detailed information and to sign up for a prepaid LLM credit line, visit ZC Inference Exchange.
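
```python
import requests

# Replace YOUR_BEARER_TOKEN with your own API token.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json',
}
payload = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Hello, Qwen!'}],
}
# The timeout keeps a stalled connection from hanging the script.
response = requests.post(
    'https://zcx.zctechnologies.org/v1/chat',
    headers=headers,
    json=payload,
    timeout=30,
)
response.raise_for_status()  # raise on HTTP errors instead of parsing an error body
print(response.json())
```
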
This snippet sends a single chat request to the Qwen 2.5 72B model through the OpenAI-compatible API and prints the JSON response.