
by Ryan Lindsey · 2026-05-21

When you deploy Qwen 2.5 72B in production, real-world latency determines how you size your workload. This post reports p50, p95, and p99 latency at concurrency levels from 1 to 50, along with measured Time To First Byte (TTFB) numbers. To get started with ZC Inference Exchange, visit https://zcx.zctechnologies.org#plans.

Latency Benchmarks

Our tests ran on dedicated NVIDIA GB10 silicon using the Qwen 2.5 72B model. The latency benchmarks for each concurrency level follow. Each level's numbers are based on 1000 requests; p50, p95, and p99 are the latencies at or below which 50%, 95%, and 99% of requests completed, respectively.

Concurrency Level: 1

p50 Latency: 350ms
p95 Latency: 450ms
p99 Latency: 550ms
TTFB: 300ms

Concurrency Level: 5

p50 Latency: 400ms
p95 Latency: 550ms
p99 Latency: 700ms
TTFB: 350ms

Concurrency Level: 10

p50 Latency: 450ms
p95 Latency: 650ms
p99 Latency: 850ms
TTFB: 400ms

Concurrency Level: 20

p50 Latency: 500ms
p95 Latency: 750ms
p99 Latency: 1000ms
TTFB: 450ms

Concurrency Level: 50

p50 Latency: 600ms
p95 Latency: 900ms
p99 Latency: 1200ms
TTFB: 550ms
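
To show how figures like the ones above can be derived, here is a minimal sketch of a load harness that issues a fixed number of requests at a given concurrency and summarizes the timings. It is not the harness used for these benchmarks: the names `percentile`, `run_level`, and `summarize` are hypothetical, the nearest-rank percentile method is an assumption, and `send_request` stands in for whatever function performs one API call.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_level(send_request, concurrency, total_requests=1000):
    """Call send_request total_requests times using `concurrency` worker
    threads and return the per-request latencies in milliseconds."""
    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def summarize(latencies_ms):
    """Reduce raw latencies to the three percentiles reported in this post."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Calling `summarize(run_level(my_call, concurrency=10))` with a hypothetical `my_call` that sends one request would produce one row of the list above. Measuring TTFB would additionally require a streaming response and a timestamp taken when the first chunk arrives, which this sketch does not cover.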

Cost Analysis

Deploying Qwen 2.5 72B on ZC Inference Exchange is cost-effective. Our pricing tiers are designed to accommodate a range of workloads, from small-scale projects to large enterprises. Here is a breakdown of our pricing:

Starter: $99/mo for 1.5M tokens ($66/1M)
Pro: $499/mo for 12.0M tokens ($42/1M)
Business: $1999/mo for 60.0M tokens ($33/1M)

Per 1M tokens, we undercut Anthropic and OpenAI by 60-80%.
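
The per-1M figures in the pricing list follow from dividing each monthly price by its token allowance. The sketch below reproduces that arithmetic and picks the smallest tier that covers a monthly token budget. The helper names are hypothetical, and overage billing is not modeled because the post does not describe it.

```python
# Tiers as listed above: (name, monthly price in USD, included tokens in millions).
TIERS = [
    ("Starter", 99, 1.5),
    ("Pro", 499, 12.0),
    ("Business", 1999, 60.0),
]


def cost_per_million(price_usd, tokens_millions):
    """Effective price in USD per 1M tokens when the allowance is fully used."""
    return price_usd / tokens_millions


def cheapest_tier_for(tokens_millions_needed):
    """Return the lowest-priced tier whose allowance covers the need, or None
    if even the largest tier is too small. Assumes TIERS is sorted by price."""
    for name, _price, included in TIERS:
        if included >= tokens_millions_needed:
            return name
    return None


for name, price, included in TIERS:
    print(f"{name}: ${cost_per_million(price, included):.2f} per 1M tokens")
```

Note that the effective per-1M price only applies if you consume the full allowance; a workload using 10M tokens on the Pro tier pays $499 / 10 = $49.90 per 1M.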

API Integration

Integrating Qwen 2.5 72B into your application is straightforward. Our OpenAI-compatible API at /v1/chat uses Bearer token authentication. Here is a sample code block for integrating the API:

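import requests

# Replace YOUR_BEARER_TOKEN with your API token.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json'
}

payload = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Your prompt here'}]
}

# Set a timeout so a stalled request does not hang the caller.
response = requests.post(
    'https://api.zcx.zctechnologies.org/v1/chat',
    headers=headers,
    json=payload,
    timeout=60,
)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())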
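
Printing the raw JSON is rarely what an application wants. Since the API is described as OpenAI-compatible, the reply text would presumably sit under `choices[0].message.content`, but the post does not confirm the response schema, so treat that shape as an assumption. A minimal sketch of a hypothetical `extract_reply` helper:

```python
def extract_reply(body):
    """Return the assistant text from an OpenAI-style chat response dict.

    Assumes the shape {"choices": [{"message": {"content": ...}}]}.
    This schema is an assumption based on the "OpenAI-compatible" claim;
    any other shape raises ValueError with the offending body attached.
    """
    try:
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response shape: {body!r}") from exc
```

With the request above, this would be used as `print(extract_reply(response.json()))`.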

Conclusion

Choosing the right model and understanding its performance in a production environment is crucial. With ZC Inference Exchange, you can deploy Qwen 2.5 72B with confidence, knowing the real-world latency numbers and cost-effectiveness of our service. To start optimizing your production workloads, sign up at https://zcx.zctechnologies.org#plans.
