Small (Micro) Language Models
Small-scale models are compressed versions of large language models developed to fit specific domains. These models are designed to run in resource-constrained environments such as embedded systems or low-power computing devices.
Examples of tasks they can perform include;
Small Language Models (SLMs) use patterns from the text they are trained on to predict the next word in a sequence. This approach is common to all language models that use the transformer architecture to understand and generate language.
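To make next-word prediction concrete, here is a minimal sketch that swaps the transformer for a much simpler technique: counting which word follows which (a bigram model). The corpus and the function names are made up for illustration. Real SLMs learn far richer patterns, but the goal of predicting the next word from training text is the same.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next_word(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_counts(corpus)

print(predict_next_word(model, "the"))    # cat ("cat" follows "the" twice)
print(predict_next_word(model, "on"))     # the
print(predict_next_word(model, "zebra"))  # None (word never seen in training)
```

The prediction is only as good as the training text, which is one reason a model trained on a focused domain can do well within that domain.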
Transformers are often referred to as the "brain" behind language models. They use self-attention to identify relationships between words in a sentence, allowing the model to understand context.
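As a rough illustration of self-attention, the sketch below computes scaled dot-product attention in plain Python. The two-dimensional word vectors are hypothetical, and the sketch assumes the same vectors serve as queries, keys, and values. A real transformer first multiplies the embeddings by learned weight matrices and runs many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of word vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this word against every word in the sentence.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors for a three-word sentence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: embeddings are reused as queries, keys, and values.
outputs, weights = self_attention(embeddings, embeddings, embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence. The rows sum to 1, and words with similar vectors receive larger weights.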
SLMs are designed to be small but high-performing. They use fewer parameters, ranging from millions to a few billion, whereas LLMs can have hundreds of billions. As a result, SLMs need less computational power and data to train, and they can process input and generate output more quickly.
Several techniques are used to shrink a language model so that it becomes faster and more efficient, including:
Distillation: This process transfers knowledge from a larger, pre-trained model (the teacher) to a smaller model (the student), compressing what the larger model has learnt into the smaller one while keeping as much of its performance as possible. Think of the larger model passing a compact version of its knowledge to the smaller model.
Distillation is classified into different methods, including:
Response-based: The student model learns to replicate the output of the teacher model. It is trained to produce outputs similar to the teacher's using soft labels (probability distributions over classes), which lets the student learn how the teacher weighs alternatives rather than only its final answer.
Feature-based: The student is trained to reproduce the teacher's intermediate features (the internal representations produced by its hidden layers), so that it extracts similar patterns from the data.
Relation-based: The student is trained to capture the relationships the teacher has learnt, for example between different layers or between different data samples, so that it can imitate the teacher's more complex reasoning.
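The response-based method can be sketched as a loss function that compares the teacher's and the student's soft labels. The sketch below assumes a temperature-softened softmax and a KL-divergence loss scaled by the square of the temperature, which is a common convention rather than the only option. The logits are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives softer labels."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    teacher_probs = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_probs, student_probs))
    # Assumption: scale by T^2, a common convention that keeps the loss
    # magnitude comparable across temperatures.
    return kl * temperature ** 2


# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
student_close = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]     # prefers a different class

print(distillation_loss(teacher, student_close))
print(distillation_loss(teacher, student_far))
```

The student that agrees with the teacher gets the smaller loss. During training, the student's weights would be adjusted to reduce this value, usually alongside a standard loss on the true labels.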
Pruning: This involves removing parts of a model that contribute little to its output. Done carefully, it shrinks the model's size with little or no loss of accuracy. Pruning must be done with caution because removing too much can degrade the model's performance.
Quantization: This involves using fewer bits to store the model's numerical values (weights), which reduces the model's size significantly and improves its speed. This makes the model more efficient for devices with limited computational power and memory.
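Pruning and quantization can be illustrated on a small list of weights. The sketch below assumes magnitude pruning (zeroing the weights with the smallest absolute values) and symmetric 8-bit quantization with one shared scale. Both are common variants, not the only ones, and the weight values are invented for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_to_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights from one layer of a model.
weights = [0.82, -0.03, 0.41, 0.002, -0.67, 0.05, -0.29, 0.01]

pruned = magnitude_prune(weights, sparsity=0.5)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)

print(pruned)
print(quantized)
print([round(w, 3) for w in restored])
```

The restored weights are close to the pruned ones but not identical. That small rounding error is the accuracy cost paid for storing each weight in 8 bits instead of 32.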
Small language models have several applications, including:
Personalised AI: SLMs can be customised for a specific business and its customers. For example, a chatbot can be tailored to assist customers when they have questions or other needs.
On-device AI: On-device AI refers to AI features that run directly on devices such as smartphones or smart appliances without needing to connect to a cloud-based server. These applications can function without internet access. For example, Google Translate's offline capabilities are powered by an SLM.
Internet of Things: Small Language Models run on smart home systems, smart home appliances, and other smart gadgets. This enables the devices to process tasks locally.
Examples of small language models include:
Phi-3-mini
Gemma 2
Mistral 7B
GPT-4o mini
Large (Macro) Language Models
Large Language Models (LLMs) are deep learning models trained on large datasets. They use transformers, a type of neural network architecture built from encoder and/or decoder blocks, to extract patterns and relationships within text.
LLMs can have hundreds of billions of parameters. They are trained with self-supervised learning (often described as unsupervised learning), which means they learn patterns directly from raw data without human labelling.
The infrastructure of LLMs consists of key components:
Hardware: High-performance computing (HPC) systems, Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), and other AI accelerators are used to run LLMs because of their computationally intensive, highly parallel workloads.
Software: Frameworks such as TensorFlow and PyTorch, as well as custom-built solutions, support model training and deployment, high-performance computing, scalable cloud services, data management, and networking.
Data Storage: LLMs require storage systems capable of handling very large datasets. Training data is typically stored in a distributed file system. After training, the model parameters (weights) are stored in files that can range from tens to hundreds of gigabytes.
Networking: Connecting different components in distributed computing environments requires high-bandwidth, low-latency networks to ensure performance.
Data Management: Tools and practices such as data processing, annotation, and versioning help maintain data quality and traceability throughout the model's lifecycle.
Security: Encryption, access controls, and secure data transfer protocols ensure data privacy and model integrity.
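To put the file sizes mentioned under Data Storage in perspective, the size of a model's weights can be estimated as the number of parameters multiplied by the bytes used per parameter. The sketch below uses hypothetical parameter counts and ignores file-format overhead, so treat the results as rough estimates.

```python
def weight_size_gb(num_parameters, bits_per_weight):
    """Approximate size of the stored weights in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# Hypothetical parameter counts chosen for illustration.
models = [
    ("small model with 3 billion parameters", 3e9),
    ("large model with 70 billion parameters", 70e9),
]

for name, params in models:
    for bits in (16, 8, 4):
        size = weight_size_gb(params, bits)
        print(f"{name} at {bits}-bit: {size:.1f} GB")
```

At 16 bits per weight, the 70-billion-parameter example needs about 140 GB for its weights alone, while the 3-billion-parameter example needs about 6 GB, and about 1.5 GB at 4 bits.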
LLMs are widely used across industries. For example, they power chatbots, automate customer support systems, assist in medical research and diagnostics in healthcare, enhance fraud detection and risk management in the finance sector, and improve automated grading systems in the education sector.
The infrastructure of LLMs offers key benefits, including:
Efficiency: Advanced hardware and software speed up training and inference, reducing development time and shortening time to market.
Reliability: A robust infrastructure ensures high availability and minimal downtime, which is essential for production applications.
Cost-Effectiveness: Efficient resource management helps reduce operational costs while maintaining the model's high performance.
Security and Compliance: Security features and adherence to industry regulations ensure that sensitive data is protected and that the system remains compliant with legal standards.
Examples of common large language models include:
Gemini
LLaMA 2/3
BLOOM
GPT-3
Grok
Large Language Model vs Small Language Model
Let’s explore the differences between Large Language Models and Small Language Models.
Solving Complex Tasks
Both SLMs and LLMs can handle complex tasks such as deep search and complex problem solving; however, they perform differently.
LLMs are great at handling general and complex tasks. They also provide better accuracy and performance, and they can maintain context over long messages and provide a logical response.
LLMs are more suitable for general-purpose chatbots that handle general and complex queries. They are great for tasks that require broad knowledge, deep language understanding, complex language tasks, and long-range context understanding.
Smaller models are better suited for simpler tasks. They are great for specialised applications and domain-specific tasks.
For example, a small model is ideal for a customer service bot that responds to queries about a specific product, as its training is more focused.
Resource Requirements
Both Large Language Models and Small Language Models require computational power for training and for generating responses, but in very different amounts.
LLMs require a significant amount of computational power and memory. They need specialized GPUs for inference, and the operational cost is high due to resource demands.
SLMs require far less computational power. They consume fewer resources, can run on standard hardware such as smartphones, and have shorter training times, making them faster to deploy.
Deployment Environment
Large Language Models (LLMs) and Small Language Models (SLMs) are deployed in different environments.
LLMs are best suited for cloud environments with high computational power; they are generally not suitable for on-device AI because they require far more computing resources than such devices provide.
SLMs can be used in cloud environments, but they are better suited for applications that require limited computational resources.
SLMs are efficient for handling smaller tasks and well-suited for on-device AI, especially in products that have offline functionality. SLMs are commonly used in applications such as voice recognition, real-time translation, and other applications that do not require an internet connection.