ZC · INFERENCE

Unlock 90-100% Artist Cut on RennyJ’s Sound Pitch: The Only Marketplace with Dual Rights Attestation. Sign up now for the ZCX LLM credit line to get started with Qwen 2.5 72B. [Sign Up Here](https://zcx.zctechnologies.org#plans).

by Ryan Lindsey · 2026-06-26

When sizing a production workload for Qwen 2.5 72B, you need to know how latency changes with concurrency. This post reports p50, p95, and p99 latency for Qwen 2.5 72B on dedicated NVIDIA GB10 silicon. The data comes from our ZC Inference Exchange, which serves Qwen 2.5 72B through OpenAI-compatible API endpoints.

Latency Metrics at Different Concurrencies

We tested Qwen 2.5 72B at several concurrency levels and measured p50, p95, and p99 latency. The following table summarizes the results for a single request and for 100, 500, and 1000 concurrent requests:

| Concurrent requests | p50 Latency (ms) | p95 Latency (ms) | p99 Latency (ms) |
|---------------------|------------------|------------------|------------------|
| 1                   | 120              | 150              | 170              |
| 100                 | 140              | 200              | 250              |
| 500                 | 200              | 300              | 350              |
| 1000                | 250              | 400              | 450              |

These numbers represent the time from sending the request to receiving the first byte of the response (TTFB). Latency rises as the number of concurrent requests grows, which is expected given resource contention.
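To make the percentile columns concrete, here is a minimal sketch of how p50, p95, and p99 can be derived from raw timings collected at a fixed concurrency. It is not the harness used for the numbers above: the helper names (`percentile`, `measure_latencies`, `summarize`) and the sleeping stand-in request are assumptions for illustration, and it uses the nearest-rank percentile definition, which is only one of several common conventions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(send_request, concurrency, total_requests):
    """Call send_request total_requests times using `concurrency` worker threads.

    Returns the wall-clock duration of each call in milliseconds.
    """

    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def summarize(latencies_ms):
    """Collapse raw timings into the three percentiles reported in the table."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def fake_request():
    """Hypothetical stand-in for an API call: sleeps instead of using the network."""
    time.sleep(0.005)


if __name__ == "__main__":
    timings = measure_latencies(fake_request, concurrency=10, total_requests=50)
    print(summarize(timings))
```

As written, `measure_latencies` times the whole call. To reproduce a TTFB-style figure, `send_request` would need to return as soon as the first byte of the response arrives.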

Time to First Byte (TTFB)

TTFB is a critical metric for understanding the model's initial response time. Below are the median (p50) TTFB numbers for Qwen 2.5 72B at each concurrency level, matching the p50 column of the latency table:

| Concurrent requests | TTFB (ms) |
|---------------------|-----------|
| 1                   | 120       |
| 100                 | 140       |
| 500                 | 200       |
| 1000                | 250       |

The TTFB numbers indicate the time it takes for the first byte of the response to be received, which is important for user experience and can impact the perceived performance of the model.
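As a rough illustration of what a TTFB measurement involves, the sketch below times the gap between starting a request and receiving the first non-empty chunk of a streamed response. The function `time_to_first_chunk` and the `fake_stream` generator are hypothetical names introduced here, not part of any ZC tooling, and the simulated stream stands in for a real network response.

```python
import time


def time_to_first_chunk(open_stream, clock=time.perf_counter):
    """Return (ttfb_ms, first_chunk).

    open_stream is a callable that sends the request and returns an
    iterator over response chunks. The clock is injectable for testing.
    """
    start = clock()
    stream = open_stream()  # the request is issued here
    for chunk in stream:
        if chunk:  # ignore empty keep-alive chunks
            return (clock() - start) * 1000.0, chunk
    raise RuntimeError("stream ended before any data arrived")


def fake_stream():
    """Hypothetical stand-in for a streaming HTTP response body."""
    time.sleep(0.01)  # simulated wait before the first byte
    yield b"data: first token\n\n"
    yield b"data: second token\n\n"


if __name__ == "__main__":
    ttfb_ms, first = time_to_first_chunk(fake_stream)
    print(f"TTFB: {ttfb_ms:.1f} ms, first chunk: {first!r}")
```

Against a live endpoint, `open_stream` could wrap something like `requests.post(url, headers=headers, json=payload, stream=True).iter_content(chunk_size=None)`, assuming the server streams its response; whether this particular API does so is not stated in this post.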

Pricing Considerations

ZC Inference Exchange offers competitive pricing for Qwen 2.5 72B usage. The Pro plan, which includes access to both the Qwen 2.5 32B and 72B models, costs $499 per month for 12.0M tokens, or about $41.58 per 1M tokens (roughly $42). We price this 60-80% below competitors such as Anthropic and OpenAI per 1M tokens.
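The per-token figure is simple division, and it only holds if the full allowance is consumed. The sketch below works through the arithmetic for the Pro plan numbers quoted above. It assumes unused tokens do not carry over to the next month, which this post does not state, and the function name is made up for the example.

```python
PRO_PRICE_USD = 499.0       # monthly Pro plan price from this post
PRO_TOKENS_MILLIONS = 12.0  # monthly token allowance from this post


def effective_price_per_million(monthly_price_usd, tokens_used_millions):
    """Effective USD per 1M tokens for a flat monthly fee.

    Assumes unused tokens are forfeited (an assumption, not a stated policy).
    """
    if tokens_used_millions <= 0:
        raise ValueError("tokens_used_millions must be positive")
    return monthly_price_usd / tokens_used_millions


if __name__ == "__main__":
    for used in (PRO_TOKENS_MILLIONS, 6.0, 3.0):
        price = effective_price_per_million(PRO_PRICE_USD, used)
        print(f"{used:>5.1f}M tokens used -> ${price:.2f} per 1M tokens")
```

At full utilization this gives about $41.58 per 1M tokens. Under the same assumption, using only half the allowance doubles the effective rate to about $83.17.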

Conclusion

Understanding the latency and TTFB of Qwen 2.5 72B is essential for engineers looking to size their production workloads. Our real-world numbers provide a solid basis for making informed decisions. For more detailed information and to sign up for a prepaid LLM credit line, visit ZC Inference Exchange.

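import requests

# Replace YOUR_BEARER_TOKEN with your own API key.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json'
}

# Chat request using the OpenAI-style message format.
payload = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Hello, Qwen!'}]
}

# Send the request; the timeout keeps a stalled connection from hanging forever.
response = requests.post(
    'https://zcx.zctechnologies.org/v1/chat',
    headers=headers,
    json=payload,
    timeout=60,
)
response.raise_for_status()  # raise on HTTP 4xx/5xx responses
print(response.json())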

This code snippet demonstrates how to interact with the Qwen 2.5 72B model using the OpenAI-compatible API.
