Kanwal Mehreen
2024-10-07 08:00:04
www.kdnuggets.com
We all know the popular Scikit-Learn package available in Python. This staple machine learning package is still widely used for building models and classifiers in industrial use cases. However, it lacks language understanding, relying on TF-IDF and other frequency-based methods for natural language tasks. With the rising popularity of LLMs, the Scikit-LLM library aims to bridge this gap: it integrates large language models to build classifiers for text inputs using the same functional API as traditional scikit-learn models.
In this article, we explore the Scikit-LLM library and implement a zero-shot text classifier on a demo dataset.
Setup and Installation
The Scikit-LLM package is available on PyPI, making it easy to install with pip. Run the command below to install the package.
pip install scikit-llm
Supported LLM Backends
Scikit-LLM currently supports both hosted-API integrations and locally run large language models. We can also integrate custom APIs hosted on-premise or on cloud platforms. We review how to set up each of these in the next sections.
 
OpenAI
The GPT models are among the most widely used language models worldwide, with multiple applications built on top of them. To set up an OpenAI model with the Scikit-LLM package, we need to configure the API credentials and set the name of the model we want to use.
from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key("<YOUR_OPENAI_API_KEY>")
SKLLMConfig.set_openai_org("<YOUR_OPENAI_ORG_ID>")
Once the API credentials are configured, we can use the zero-shot classifier from the Scikit-LLM package, which uses an OpenAI model by default.
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
clf = ZeroShotGPTClassifier(model="gpt-4")
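Conceptually, a zero-shot classifier like this works by prompting the model with the candidate labels and parsing a label back out of the response. The sketch below illustrates that idea only; the `fake_llm` function is a stand-in for a real chat-completion call, and none of this reflects Scikit-LLM's actual internals.

```python
def build_prompt(text, labels):
    """Ask the model to pick exactly one label from the candidate set."""
    return (
        f"Classify the following text into one of these categories: "
        f"{', '.join(labels)}.\nRespond with the category name only.\n\n"
        f"Text: {text}"
    )

def fake_llm(prompt):
    # Stand-in for an API call: naive keyword matching so the sketch runs offline.
    lowered = prompt.lower()
    if "fantastic" in lowered or "spectacular" in lowered:
        return "positive"
    if "mess" in lowered or "awful" in lowered:
        return "negative"
    return "neutral"

def zero_shot_classify(texts, labels):
    predictions = []
    for text in texts:
        answer = fake_llm(build_prompt(text, labels)).strip().lower()
        # Fall back to the first label if the model returns something unexpected.
        predictions.append(answer if answer in labels else labels[0])
    return predictions

print(zero_shot_classify(["A fantastic film.", "A complete mess."],
                         ["positive", "neutral", "negative"]))
# → ['positive', 'negative']
```

The real classifier does the same loop, except the stand-in is replaced by a call to the configured backend.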
 
LlamaCPP and GGUF models
Even though OpenAI models are popular, they can be expensive and impractical in some cases. Hence, the Scikit-LLM package provides built-in support for locally running quantized GGUF or GGML models. We need to install the supporting packages that let Scikit-LLM use llama-cpp to run the language models.
Run the below commands to install the required packages:
pip install 'scikit-llm[gguf]' --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu --no-cache-dir
pip install 'scikit-llm[llama-cpp]'
Now, we can use the same zero-shot classifier from Scikit-LLM to load GGUF models. Note that only a handful of models are currently supported; the list of supported models is available in the official Scikit-LLM documentation.
We use the GGUF-quantized version of Gemma-2B for our purpose. The general syntax follows gguf::<model_name>.
Use the below code to load the model:
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
clf = ZeroShotGPTClassifier(model="gguf::gemma2-2b-q6")
 
External Models
Lastly, we can use self-hosted models that follow the OpenAI API standard. The model can run locally or be hosted in the cloud; all we have to do is provide its API URL.
Load the model from a custom URL using the given code:
from skllm.config import SKLLMConfig
SKLLMConfig.set_gpt_url("https://localhost:8000/")
clf = ZeroShotGPTClassifier(model="custom_url::<model_name>")
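Any server exposing the OpenAI chat-completions schema works here. As a rough illustration, this is the shape of the request body such a server is expected to accept at POST <base_url>/v1/chat/completions; the model name is hypothetical, and the payload is only constructed, not sent.

```python
import json

# Request shape following the OpenAI chat-completions schema.
payload = {
    "model": "my-local-model",  # hypothetical: whatever the server exposes
    "messages": [
        {"role": "user",
         "content": "Classify 'Great movie!' as positive, neutral, or negative."}
    ],
    "temperature": 0.0,  # deterministic output suits classification
}
body = json.dumps(payload)
print(json.loads(body)["model"])  # → my-local-model
```

As long as the self-hosted endpoint accepts and answers requests of this shape, Scikit-LLM can talk to it via set_gpt_url.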
Training and Inference Using the Basic Scikit-Learn API
We can now train the model on a classification dataset using the Scikit-Learn API. We will walk through a basic implementation using a demo dataset of movie reviews for sentiment prediction.
 
Dataset
The dataset is provided by the scikit-llm package. It contains 100 samples of movie reviews and their associated sentiment labels: positive, neutral, or negative. We will load the dataset and split it into train and test sets for our demo.
We can use the traditional scikit-learn methods to load and split the dataset.
from sklearn.model_selection import train_test_split
from skllm.datasets import get_classification_dataset
X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
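Under the hood, train_test_split just shuffles indices and slices the data. For readers without scikit-learn at hand, here is a stdlib-only sketch of the same idea on a hand-made stand-in dataset (the reviews below are made up for illustration, not taken from the skllm demo data):

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.33, random_state=42):
    """Minimal stand-in for sklearn's train_test_split: shuffle, then slice."""
    indices = list(range(len(X)))
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)  # sklearn rounds the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical stand-in data: parallel lists of reviews and sentiment labels.
X = ["great movie", "okay movie", "bad movie", "loved it", "meh", "awful"]
y = ["positive", "neutral", "negative", "positive", "neutral", "negative"]
X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(len(X_train), len(X_test))  # → 4 2
```

In practice, sklearn's train_test_split is preferred since it also handles arrays, stratification, and edge cases.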
Fit and Predict
The training and prediction using the large language model follows the same scikit-learn API. First, we fit the model on our training dataset, and then we can use it to make predictions on unseen test data.
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
On the test set, the Gemma2-2B model reaches 100% accuracy, as this is a relatively simple dataset.
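The accuracy figure can be verified with sklearn's accuracy_score, or equivalently by comparing the two label lists directly, as in this minimal sketch (the gold labels and predictions below are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

gold = ["neutral", "positive", "negative"]
preds = ["neutral", "positive", "negative"]
print(accuracy(gold, preds))  # → 1.0
```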
For reference, here are a few test samples along with their predicted sentiments:
Sample Review: "Under the Same Sky was an okay movie. The plot was decent, and the performances were fine, but it lacked depth and originality. It is not a movie I would watch again."
Predicted Sentiment: ['neutral']
Sample Review: "The cinematography in Awakening was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film."
Predicted Sentiment: ['positive']
Sample Review: "I found Hollow Echoes to be a complete mess. The plot was non-existent, the performances were overdone, and the pacing was all over the place. Not worth the hype."
Predicted Sentiment: ['negative']
Wrapping Up
The scikit-llm package is gaining popularity because its familiar API makes it easy to integrate into existing pipelines. It improves on the basic frequency-based methods traditionally used for text: the integrated language models bring reasoning and understanding of the textual input, which can boost the performance of standard models.
Moreover, it provides options to train few-shot and chain-of-thought classifiers, alongside other text modeling tasks like summarization. Explore the package and the documentation available on the official site to see what suits your purpose.
Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.