When you deploy Qwen 2.5 72B in production, real-world latency determines how you size your workload. This post reports p50, p95, and p99 latency at different concurrency levels, along with honest time-to-first-byte (TTFB) numbers. To get started with ZC Inference Exchange, visit https://zcx.zctechnologies.org#plans.
We ran our tests against Qwen 2.5 72B on dedicated NVIDIA GB10 silicon. Below are the latency benchmarks for various concurrency levels, each based on 1000 requests. Here p50, p95, and p99 are the latencies at or below which 50%, 95%, and 99% of requests completed, respectively.
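To make the percentile definitions concrete, here is a minimal Python sketch that turns a list of per-request latencies into p50, p95, and p99. It uses the nearest-rank method, which is an assumption on our part for illustration; the function names `percentile` and `summarize` are hypothetical, and other percentile conventions (for example linear interpolation) give slightly different values on small samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the sample that covers pct percent of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three percentiles used in this post for one concurrency level."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With 1000 requests per concurrency level, p99 under this method is the 990th fastest request, so only the 10 slowest requests sit above it.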
Deploying Qwen 2.5 72B on ZC Inference Exchange is cost-effective. Our pricing tiers are designed to accommodate a range of workloads, from small-scale projects to large enterprises. Here is a breakdown of our pricing:
Per 1M tokens, we undercut Anthropic and OpenAI by 60-80%, which keeps production workloads affordable on ZC Inference Exchange.
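To see what a 60-80% discount means for a budget, the arithmetic can be sketched as below. The helper names are hypothetical, and the baseline price is a parameter you supply from your current provider's price sheet; no real prices are assumed here.

```python
def discounted_price_range(baseline_per_1m, low_discount=0.60, high_discount=0.80):
    """Return (lowest, highest) price per 1M tokens implied by a discount range.

    baseline_per_1m is the price you pay today per 1M tokens (caller-supplied).
    """
    if baseline_per_1m < 0:
        raise ValueError("baseline_per_1m must be non-negative")
    if not 0 <= low_discount <= high_discount <= 1:
        raise ValueError("discounts must satisfy 0 <= low <= high <= 1")
    lowest = baseline_per_1m * (1 - high_discount)
    highest = baseline_per_1m * (1 - low_discount)
    return lowest, highest


def monthly_cost(tokens_per_month, price_per_1m):
    """Cost of a monthly token volume at a given price per 1M tokens."""
    return tokens_per_month / 1_000_000 * price_per_1m
```

For example, calling `discounted_price_range(p)` with your current per-1M price `p`, then passing each bound to `monthly_cost` with your monthly token volume, gives a best-case and worst-case bill to compare against what you pay today.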
Integrating Qwen 2.5 72B into your application is straightforward. Our OpenAI-compatible API is served at /v1/chat and uses Bearer token authentication. The following Python snippet sends a single chat request:
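import requests

# Replace YOUR_BEARER_TOKEN with the API token for your account.
headers = {
    'Authorization': 'Bearer YOUR_BEARER_TOKEN',
    'Content-Type': 'application/json',
}
# OpenAI-style chat payload: a model name plus a list of role/content messages.
payload = {
    'model': 'qwen2.5:72b',
    'messages': [{'role': 'user', 'content': 'Your prompt here'}],
}
# Set a timeout so a stalled connection fails instead of hanging forever.
response = requests.post('https://api.zcx.zctechnologies.org/v1/chat',
                         headers=headers, json=payload, timeout=60)
response.raise_for_status()  # raise on HTTP 4xx/5xx before parsing the body
print(response.json())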
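If you want to check latency at a given concurrency level against your own workload, a rough harness could look like the sketch below. This is not the tooling behind the numbers in this post; `run_load`, `timed_call`, and `time_to_first_chunk` are hypothetical helpers, and the request itself is passed in as a callable so the harness makes no assumptions about the API beyond what is shown above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request and return its wall-clock latency in milliseconds.

    Timing starts inside the worker thread, so time spent waiting in the
    pool's queue is not counted.
    """
    start = time.perf_counter()
    send_request()
    return (time.perf_counter() - start) * 1000.0


def run_load(send_request, total_requests, concurrency):
    """Issue total_requests calls with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, send_request)
                   for _ in range(total_requests)]
        return [f.result() for f in futures]


def time_to_first_chunk(open_stream):
    """Measure time until the first chunk of a streamed response arrives.

    open_stream is a zero-argument callable returning an iterator of chunks.
    Returns (ttfb_ms, first_chunk); first_chunk is None for an empty stream.
    """
    start = time.perf_counter()
    first = next(iter(open_stream()), None)
    return (time.perf_counter() - start) * 1000.0, first
```

To use it, wrap the `requests.post(...)` call from the sample above in a function and pass that function as `send_request`, for example `run_load(send_one, total_requests=1000, concurrency=8)`. For TTFB, one option is to call `requests.post(..., stream=True)` and return `response.iter_content(chunk_size=None)` from `open_stream`; whether the endpoint streams tokens incrementally (for instance via an OpenAI-style `"stream": true` field) is an assumption you should confirm against the service documentation.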
Choosing the right model is only half the job; you also need to know how it performs in production. With ZC Inference Exchange, you can deploy Qwen 2.5 72B knowing both its real-world latency and what it will cost. To start optimizing your production workloads, sign up at https://zcx.zctechnologies.org#plans.