by Ryan Lindsey · 2026-06-15

Sizing a production workload for Qwen 2.5 72B on GB10 starts with real-world latency numbers. This post reports p50, p95, and p99 latency at four concurrency levels (1, 5, 10, and 20), along with honest time-to-first-byte (TTFB) figures.

Latency Benchmarks

Our test environment was dedicated NVIDIA GB10 hardware running the Qwen 2.5 72B model. We measured response times at several concurrency levels to simulate different load conditions. The following table summarizes the p50, p95, and p99 latencies in milliseconds (ms).

| Concurrency | p50 Latency (ms) | p95 Latency (ms) | p99 Latency (ms) |
|-------------|------------------|------------------|------------------|
| 1           | 120              | 150              | 180              |
| 5           | 140              | 180              | 220              |
| 10          | 160              | 210              | 250              |
| 20          | 180              | 240              | 280              |
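If you want to produce the same kind of summary from your own request logs, here is a minimal Python sketch. The post does not say which percentile method was used for the table, so this assumes the nearest-rank definition; the function names and the sample latencies are made up for illustration and are not measurements from our tests.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the p50, p95, and p99 of a list of latencies in milliseconds."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Hypothetical sample data, NOT measurements from this post.
latencies_ms = [112, 118, 120, 121, 125, 131, 140, 152, 171, 190]
print(summarize(latencies_ms))
```

With only ten samples, p95 and p99 both collapse onto the slowest request, which is a good reminder that tail percentiles need a large sample count to be meaningful.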

Time to First Byte (TTFB)

TTFB measures how long a client waits before the first byte of the response arrives, which makes it the key metric for perceived responsiveness. For Qwen 2.5 72B, TTFB averages around 120 ms under low concurrency and rises to 180 ms at higher concurrency levels. Use these figures to set realistic expectations and to configure your application's timeouts.
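To check TTFB from your own client, one option is to time the arrival of the first chunk of a streamed response. The sketch below is an illustration, not a description of how the numbers above were collected: `measure_stream` works on any iterable of chunks, and `fake_stream` is a hypothetical stand-in for a real streaming HTTP response so the example runs without network access.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of response chunks and return (ttfb_ms, total_ms).

    ttfb_ms is None if the stream yields nothing.
    """
    start = time.perf_counter()
    ttfb_ms = None
    for _ in chunks:
        if ttfb_ms is None:
            # First chunk has arrived: record time to first byte.
            ttfb_ms = (time.perf_counter() - start) * 1000
    total_ms = (time.perf_counter() - start) * 1000
    return ttfb_ms, total_ms


def fake_stream(first_delay_s=0.05, chunk_delay_s=0.01, n_chunks=3):
    """Hypothetical stand-in for a streaming response body."""
    time.sleep(first_delay_s)  # simulated wait before the first chunk
    yield b"first"
    for _ in range(n_chunks - 1):
        time.sleep(chunk_delay_s)
        yield b"more"


ttfb_ms, total_ms = measure_stream(fake_stream())
print(f"TTFB {ttfb_ms:.0f} ms, total {total_ms:.0f} ms")
```

In a real client you would pass the chunk iterator of your HTTP library's streaming response instead of `fake_stream()`.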

Cost Considerations

For engineers looking to deploy Qwen 2.5 72B in production, it's important to consider the cost. ZC Inference Exchange offers competitive pricing, undercutting Anthropic and OpenAI by 60-80% per 1M tokens. The Pro tier, at $499/mo, includes 12.0M tokens, making it suitable for moderate workloads. The Business tier, at $1999/mo, offers 60.0M tokens, ideal for larger-scale operations.
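To compare the two tiers on equal terms, it helps to work out the effective price per 1M tokens. The sketch below uses only the prices and token allowances quoted above; the `PLANS` table and function name are illustrative, and it assumes you consume the full monthly allowance (overage pricing is not covered here).

```python
# Tier name -> (monthly price in USD, included tokens in millions),
# taken from the figures quoted in this post.
PLANS = {
    "Pro": (499, 12.0),
    "Business": (1999, 60.0),
}


def cost_per_million(price_usd, tokens_millions):
    """Effective USD cost per 1M tokens if the whole allowance is used."""
    return price_usd / tokens_millions


for name, (price_usd, tokens_millions) in PLANS.items():
    rate = cost_per_million(price_usd, tokens_millions)
    print(f"{name}: ${rate:.2f} per 1M tokens")
```

This works out to roughly $41.58 per 1M tokens on Pro and $33.32 on Business, so the larger tier is cheaper per token only if you actually use most of the 60M allowance.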

Testing Your Application

To test your application's performance with Qwen 2.5 72B, you can use the OpenAI-compatible API at /v1/chat with Bearer token authentication. Here is an example of how to make a request:

```shell
curl https://zcx.zctechnologies.org/v1/chat \
  -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:72b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
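If you would rather drive the test from Python, the same request can be built with the standard library. This is a sketch that mirrors the curl call above: the URL, model name, and headers come from that example, while the helper names, the 10-second timeout, and the assumption that the response body is JSON are ours.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://zcx.zctechnologies.org/v1/chat"


def build_chat_request(token, prompt, model="qwen2.5:72b"):
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def send(request, timeout_s=10.0):
    """Send the request and decode the body, assuming a JSON response.

    The timeout is an illustrative default; tune it against your own
    measured tail latency.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_chat_request("YOUR_BEARER_TOKEN", "Hello!")
# response = send(request)  # uncomment to call the live API with a real token
```

Keeping request construction separate from sending makes it easy to wrap `send` in a timer when you collect your own latency samples.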

Conclusion

Understanding the latency and TTFB of Qwen 2.5 72B on GB10 is essential for optimizing your production workloads. For more detailed information and to start your deployment, sign up for a prepaid LLM credit line at https://zcx.zctechnologies.org#plans.
