Cornellius Yudha Wijaya
2024-12-02 10:00:00
www.kdnuggets.com
Let’s learn how to use mBERT from Hugging Face Transformers for cross-lingual transfer learning.
Preparation
You need to install the packages below for this tutorial, so use the provided command.
pip install transformers datasets
Then, you need the PyTorch package; install the build that matches your environment.
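For example, the default build can be installed with pip; check pytorch.org for the command that matches your platform and CUDA version.
pip install torch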
With the packages installed, we can get into the next part.
Cross-Lingual Transfer Learning with mBERT
You may already know the BERT model, one of the first language models built for understanding human language, which has been used in many language-related tasks. mBERT (multilingual BERT) is a variant of BERT pretrained on 104 different languages. This multilingual pretraining lets the model understand languages it was never fine-tuned on, so you can train it in one language and apply it to another.
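As a quick illustration (a small check, separate from the tutorial pipeline), one mBERT tokenizer covers many languages with a single shared vocabulary:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
print(tokenizer.tokenize("The weather is nice today."))   # English
print(tokenizer.tokenize("Il fait beau aujourd'hui."))    # French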
Let’s explore mBERT’s cross-lingual capabilities in this tutorial. We will fine-tune mBERT on an English dataset and then apply it to a classification task in French.
First, we download the English dataset and preprocess it.
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load the English portion of the XNLI dataset
dataset = load_dataset('xnli', 'en')
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

def tokenize_function(examples):
    # Some fields may arrive as lists of strings, so join them defensively
    premise = [ex if isinstance(ex, str) else " ".join(ex) for ex in examples['premise']]
    hypothesis = [ex if isinstance(ex, str) else " ".join(ex) for ex in examples['hypothesis']]
    # Encode premise/hypothesis pairs for sequence classification
    return tokenizer(premise, hypothesis, padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
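If you want to sanity-check the preprocessing, you can inspect one tokenized example (an optional check, not part of the original tutorial); with padding="max_length" and no explicit max_length, sequences are padded to the tokenizer's 512-token limit:
# Each example now carries input_ids, attention_mask, and label tensors
sample = tokenized_datasets['train'][0]
print(sample['input_ids'].shape, sample['label'])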
To keep the training process quick, I will use only a subset of the dataset.
import random

# Fix the seed and sample 1,000 training and 500 validation examples
random.seed(42)
train_indices = random.sample(range(len(tokenized_datasets['train'])), 1000)
val_indices = random.sample(range(len(tokenized_datasets['validation'])), 500)
train_dataset = tokenized_datasets['train'].select(train_indices)
val_dataset = tokenized_datasets['validation'].select(val_indices)
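Since we sampled randomly, a quick check that all three classes are still represented can be reassuring (a hypothetical check, not in the original tutorial):
from collections import Counter

# Count how many sampled examples fall into each of the three XNLI classes
print(Counter(train_dataset['label'].tolist()))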
Then, we download the mBERT model with a three-way classification head, matching the three XNLI labels.
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=3)
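The classification head is freshly initialized, so its weights are random until fine-tuning. Optionally, you can attach human-readable label names to the config so predictions are easier to read later; in the Hugging Face XNLI dataset, label 0 is entailment, 1 is neutral, and 2 is contradiction:
# Map XNLI label ids to names on the model config (optional)
model.config.id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
model.config.label2id = {v: k for k, v in model.config.id2label.items()}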
Once the model is ready, we fine-tune mBERT on the English dataset.
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,  # mixed precision needs a CUDA GPU; remove this line on CPU
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()
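If you want to reuse the fine-tuned checkpoint later, you can save the model and tokenizer; the directory name here is arbitrary:
trainer.save_model("./mbert-xnli-en")
tokenizer.save_pretrained("./mbert-xnli-en")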
With the model fine-tuned, we evaluate it on the French validation set instead of the English one.
# Tokenize the French data with the same multilingual tokenizer
french_dataset = load_dataset('xnli', 'fr')
tokenized_french_dataset = french_dataset.map(tokenize_function, batched=True)
tokenized_french_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
french_val_dataset = tokenized_french_dataset['validation']
results = trainer.evaluate(french_val_dataset)
print(results)
Output>>
{'eval_loss': 1.0408061742782593, 'eval_runtime': 9.4173, 'eval_samples_per_second': 264.406, 'eval_steps_per_second': 16.565, 'epoch': 3.0}
The result seems promising: an evaluation loss of about 1.04 on French, a language the model was never fine-tuned on, suggests that the English fine-tuning transfers to other languages.
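Note that only the loss is reported because we did not pass a metric function to the Trainer. A minimal sketch that also reports accuracy, assuming scikit-learn is installed, could look like this:
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # eval_pred bundles the model logits and the gold labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

Pass compute_metrics=compute_metrics when constructing the Trainer, and trainer.evaluate() will then report eval_accuracy alongside the loss.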
Master the mBERT model to handle tasks involving multiple languages.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and written media. Cornellius writes on a variety of AI and machine learning topics.