Kari Briski
2024-08-21 12:00:55
blogs.nvidia.com
Developers of generative AI typically face a tradeoff between model size and accuracy. But a new language model released by NVIDIA delivers the best of both, providing state-of-the-art accuracy in a compact form factor.
Mistral-NeMo-Minitron 8B — a miniaturized version of the open Mistral NeMo 12B model released by Mistral AI and NVIDIA last month — is small enough to run on an NVIDIA RTX-powered workstation while still excelling across multiple benchmarks for AI-powered chatbots, virtual assistants, content generators and educational tools. Minitron models are distilled by NVIDIA using NVIDIA NeMo, an end-to-end platform for developing custom generative AI.
“We combined two different AI optimization methods — pruning to shrink Mistral NeMo’s 12 billion parameters into 8 billion, and distillation to improve accuracy,” said Bryan Catanzaro, vice president of applied deep learning research at NVIDIA. “By doing so, Mistral-NeMo-Minitron 8B delivers comparable accuracy to the original model at lower computational cost.”
Unlike their larger counterparts, small language models can run in real time on workstations and laptops. This makes it easier for organizations with limited resources to deploy generative AI capabilities across their infrastructure while optimizing for cost, operational efficiency and energy use. Running language models locally on edge devices also delivers security benefits, since data doesn’t need to be passed to a server from an edge device.
Developers can get started with Mistral-NeMo-Minitron 8B packaged as an NVIDIA NIM microservice with a standard application programming interface (API) — or they can download the model from Hugging Face. A downloadable NVIDIA NIM, which can be deployed on any GPU-accelerated system in minutes, will be available soon.
State-of-the-Art for 8 Billion Parameters
For a model of its size, Mistral-NeMo-Minitron 8B leads on nine popular benchmarks for language models. These benchmarks cover a variety of tasks including language understanding, common sense reasoning, mathematical reasoning, summarization, coding and ability to generate truthful answers.
Packaged as an NVIDIA NIM microservice, the model is optimized for low latency, which means faster responses for users, and high throughput, which corresponds to higher computational efficiency in production.
In some cases, developers may want an even smaller version of the model to run on a smartphone or an embedded device like a robot. To do so, they can download the 8-billion-parameter model and, using NVIDIA AI Foundry, prune and distill it into a smaller, optimized neural network customized for enterprise-specific applications.
The AI Foundry platform and service offers developers a full-stack solution for creating a customized foundation model packaged as a NIM microservice. It includes popular foundation models, the NVIDIA NeMo platform and dedicated capacity on NVIDIA DGX Cloud. Developers using NVIDIA AI Foundry can also access NVIDIA AI Enterprise, a software platform that provides security, stability and support for production deployments.
Since the original Mistral-NeMo-Minitron 8B model starts with a baseline of state-of-the-art accuracy, versions downsized using AI Foundry would still offer users high accuracy with a fraction of the training data and compute infrastructure.
Harnessing the Perks of Pruning and Distillation
To achieve high accuracy with a smaller model, the team used a process that combines pruning and distillation. Pruning downsizes a neural network by removing model weights that contribute the least to accuracy. During distillation, the team retrained this pruned model on a small dataset to significantly boost accuracy, which had decreased through the pruning process.
The end result is a smaller, more efficient model with the predictive accuracy of its larger counterpart.
This technique means that a fraction of the original dataset is required to train each additional model within a family of related models, saving up to 40x the compute cost when pruning and distilling a larger model compared to training a smaller model from scratch.
Read the NVIDIA Technical Blog and a technical report for details.
NVIDIA also announced this week Nemotron-Mini-4B-Instruct, another small language model optimized for low memory usage and faster response times on NVIDIA GeForce RTX AI PCs and laptops. The model is available as an NVIDIA NIM microservice for cloud and on-device deployment and is part of NVIDIA ACE, a suite of digital human technologies that provide speech, intelligence and animation powered by generative AI.
Experience both models as NIM microservices from a browser or an API at ai.nvidia.com.
See notice regarding software product information.
Support Techcratic
If you find value in our blend of original insights (Techcratic articles and Techs Got To Eat), up-to-date daily curated articles, and the extensive technical work required to keep everything running smoothly, consider supporting Techcratic with Bitcoin. Your support helps me, as a solo operator, continue delivering high-quality content while managing all the technical aspects, from server maintenance to future updates and improvements. I am committed to continually enhancing the site and staying at the forefront of trends to provide the best possible experience. Your generosity and commitment are deeply appreciated. Thank you!
Bitcoin Address:
bc1qlszw7elx2qahjwvaryh0tkgg8y68enw30gpvge
Please verify this address before sending any funds to ensure your donation is directed correctly.
Bitcoin QR Code
Your contribution is vital in supporting my efforts to deliver valuable content and manage the technical aspects of the site. To donate, simply scan the QR code below. Your generosity allows me to keep providing insightful articles and maintaining the server infrastructure that supports them.
Privacy and Security Disclaimer
- No Personal Information Collected: We do not collect any personal information or transaction details when you make a donation via Bitcoin. The Bitcoin address provided is used solely for receiving donations.
- Data Privacy: We do not store or process any personal data related to your Bitcoin transactions. All transactions are processed directly through the Bitcoin network, ensuring your privacy.
- Security Measures: We utilize industry-standard security practices to protect our Bitcoin address and ensure that your donations are received securely. However, we encourage you to exercise caution and verify the address before sending funds.
- Contact Us: If you have any concerns or questions about our donation process, please contact us via the Techcratic Contact form. We are here to assist you.
Disclaimer: As an Amazon Associate, Techcratic may earn from qualifying purchases.