On December 17, 2024, Microsoft's Bing search team announced significant improvements to its search infrastructure through the implementation of TensorRT-LLM optimization techniques. The enhancement marks a strategic shift in how the search engine processes queries, combining Large Language Models (LLMs) and Small Language Models (SLMs) to balance performance and efficiency.
According to the technical announcement, Bing's engineering team has achieved a substantial reduction in model inference time. The original Transformer model exhibited a 95th percentile latency of 4.76 seconds per batch and processed 4.2 queries per second per instance. Following the integration of TensorRT-LLM, these metrics improved markedly: the 95th percentile latency decreased to 3.03 seconds per batch while throughput increased to 6.6 queries per second per instance.
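Those figures work out to roughly a one-third cut in tail latency and a 57% gain in throughput. A minimal sketch of that arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
# Back-of-the-envelope check of the reported gains; the figures come from the
# announcement, the calculation itself is only illustrative.
baseline_p95_s, optimized_p95_s = 4.76, 3.03   # seconds per batch (P95 latency)
baseline_qps, optimized_qps = 4.2, 6.6         # queries per second per instance

latency_reduction = (baseline_p95_s - optimized_p95_s) / baseline_p95_s
throughput_gain = (optimized_qps - baseline_qps) / baseline_qps

print(f"P95 latency reduced by {latency_reduction:.0%}")   # ~36%
print(f"Throughput increased by {throughput_gain:.0%}")    # ~57%
```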
The optimization's impact extends beyond speed improvements. According to the announcement, the implementation has resulted in a 57% reduction in the operational costs associated with running these large models. This efficiency gain stems from the integration of SmoothQuant, a specialized quantization technique detailed in a research paper, which enables INT8 inference for both activations and weights while maintaining model accuracy.
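The core idea of SmoothQuant, sketched below with illustrative NumPy shapes and function names (not the TensorRT-LLM implementation), is to migrate quantization difficulty from activation outliers into the weights via a per-channel scale, leaving the layer's output mathematically unchanged:

```python
import numpy as np

def smoothquant_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel smoothing scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    act_absmax:    per-input-channel max absolute activation (from calibration data)
    weight_absmax: per-input-channel max absolute weight
    """
    return act_absmax ** alpha / weight_absmax ** (1.0 - alpha)

def smooth_layer(x, w, scales):
    """Divide activations by the scales and fold the same scales into the weights,
    so (x / s) @ (s * w) == x @ w while activation outliers are flattened."""
    x_smoothed = x / scales              # x: (batch, in_features)
    w_smoothed = w * scales[:, None]     # w: (in_features, out_features)
    return x_smoothed, w_smoothed
```

Because the rescaling cancels out mathematically, accuracy is preserved while both the activations and the weights become easier to represent in INT8.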
Deep Search, one of the primary applications benefiting from this optimization, utilizes SLMs at runtime to enhance web result quality. The technical implementation involves multiple sequential processing steps, including query intent analysis and result relevance assessment. The optimization particularly focuses on reducing the time required for these steps without compromising the quality of search results.
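To make the latency concern concrete, here is a purely hypothetical sketch of a multi-step pipeline of the kind described above; the step names and the slm_intents, slm_relevance, and retrieve callables are assumptions for illustration, not Bing's actual implementation:

```python
# Hypothetical multi-step search pipeline; every helper here is an assumed
# callable, not part of Bing's or TensorRT-LLM's published APIs.
def deep_search(query, slm_intents, slm_relevance, retrieve, top_k=10):
    # Step 1: query intent analysis - the SLM expands the query into intents.
    intents = slm_intents(query)                    # assumed to return a list of strings

    # Step 2: gather candidate web results for each intent.
    candidates = [doc for intent in intents for doc in retrieve(intent)]

    # Step 3: result relevance assessment - the SLM scores each candidate.
    scored = [(slm_relevance(query, doc), doc) for doc in candidates]

    # Each step waits on SLM inference, so end-to-end latency is dominated by
    # per-call model latency - the quantity the TensorRT-LLM work reduces.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```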
The technical specifications indicate that each batch contains 20 queries processed simultaneously. The comparison between vLLM (V0.2.1) running FP16 and TensorRT-LLM (V0.9.0) demonstrates significant performance differences: the labeling process, which previously showed a P50/P95 latency of 2.96/4.76 seconds per batch, improved to 1.99/3.03 seconds after optimization.
The technical foundation of this improvement lies in the implementation of the SmoothQuant technique. The method requires specific preprocessing of the model weights, for which TensorRT-LLM provides dedicated scripts. This preprocessing step preserves network accuracy while enabling the performance benefits of INT8 quantization.
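Conceptually, that preprocessing amounts to folding the smoothing scales into the weights offline and then quantizing them per channel to INT8. The sketch below assumes NumPy arrays and reuses the scales from the earlier snippet; the real step is carried out by TensorRT-LLM's own checkpoint-conversion scripts rather than user code like this:

```python
import numpy as np

def preprocess_and_quantize(w, scales):
    """Fold SmoothQuant scales into the weights, then quantize them to INT8 with
    symmetric per-output-channel scales. Purely illustrative of the offline step."""
    w_smoothed = w * scales[:, None]                          # offline weight rescaling

    # Symmetric per-channel INT8 quantization of the smoothed weights.
    q_scale = np.maximum(np.abs(w_smoothed).max(axis=0), 1e-8) / 127.0
    w_int8 = np.clip(np.round(w_smoothed / q_scale), -127, 127).astype(np.int8)
    return w_int8, q_scale                                    # dequantize as w_int8 * q_scale
```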
Microsoft's decision to employ SLMs alongside LLMs addresses several technical challenges inherent in deploying large models. While LLMs offer sophisticated processing capabilities, they present challenges in terms of serving cost and response time. SLMs, by contrast, deliver approximately 100 times higher throughput than their larger counterparts.
The optimization process focuses on the NVIDIA A100 GPU platform, where TensorRT-LLM serves as the primary optimization tool. This hardware-software combination enables the significant performance improvements observed in the deployment.
The technical implementation details indicate that the optimization affects multiple aspects of the search infrastructure. The preprocessing requirements for weight modification, the integration with existing search architecture, and the maintenance of result quality all required careful consideration during the deployment process.
This development occurs against a backdrop of increasing demands on search engine performance. The growing complexity of search queries necessitates more sophisticated processing capabilities, while users continue to expect rapid response times. The balance between these competing demands shaped the technical approach taken in this optimization effort.
The metrics provided in the announcement suggest that the optimization's benefits extend beyond raw performance improvements. The reduction in operational costs, combined with maintained accuracy levels, indicates a successful technical implementation that addresses both computational and business requirements.
The significance of this development lies not only in the immediate performance improvements but also in demonstrating the viability of optimizing large-scale language models for production environments. The successful implementation of INT8 quantization while maintaining model accuracy represents a technical achievement with broader implications for the field of search technology.