Multi-tenant environments often introduce latency tails that can significantly impact the performance of your large language models (LLMs). At ZC Technologies, we address this issue by offering dedicated NVIDIA GB10 silicon for our Qwen 2.5 models, ensuring that your applications do not suffer from the noisy-neighbor effect. This post will explain the advantages of hardware isolation and why focusing on p99 latency is crucial for maintaining high performance.
In a multi-tenant environment, the performance of your application can be unpredictably affected by the usage patterns of other tenants. This is known as the noisy neighbor problem. When one tenant experiences a spike in usage, it can consume a disproportionate amount of shared resources, leading to increased latency for other tenants. This is particularly problematic for LLMs, where consistent performance is critical.
At ZC Technologies, we offer dedicated hardware for our Qwen 2.5 models, which includes both the 32B and 72B versions. By isolating your application on its own NVIDIA GB10 silicon, we eliminate the risk of noisy neighbors, ensuring that your application has consistent access to the full capacity of the hardware.
While average latency can provide a general sense of performance, it does not account for the outliers that can significantly impact user experience. The p99 latency, which represents the latency experienced by 99% of requests, is a better indicator of the worst-case performance. In a multi-tenant environment, p99 latency can be significantly higher than the average due to the noisy neighbor effect. By focusing on p99 latency, we ensure that your application performs well even in the worst-case scenarios.
Our pricing model is designed to provide cost-effective access to dedicated hardware. We offer three plans: Starter, Pro, and Business. The Starter plan is priced at $99 per month, providing 1.5M tokens at a rate of $66 per 1M tokens. The Pro plan costs $499 per month and includes 12.0M tokens at $42 per 1M tokens. Our Business plan is $1999 per month, offering 60.0M tokens at $33 per 1M tokens. These plans undercut Anthropic and OpenAI by 60-80% per 1M tokens.
Our service is accessible via an OpenAI-compatible API at /v1/chat, with authentication through a Bearer token. This makes it easy for you to integrate our service into your existing infrastructure.
By choosing ZC Technologies, you can ensure that your LLMs run on dedicated hardware, providing consistent performance and eliminating the noisy neighbor effect. For more details on our plans and to sign up, visit our website.