Reward models have been a key part of reinforcement learning on top of LLMs, used broadly in techniques like RLHF and as LLM-as-a-judge critics in evals. They have also been used in the data preparation phase of preference optimization methods like SimPO, where a reward model was used to create the preference data for training models like princeton-nlp/gemma-2-9b-it-SimPO.
I recently had some trouble figuring out how to run high-throughput reward model inference. Offline, you can just collect everything you need to score in advance and go through all of your data in batches. But for a lot of use cases, such as search methods like tree search or MCTS, this is hard to do efficiently. Unfortunately, easy-to-set-up inference servers like vLLM and llama.cpp don’t support sequence classification models1 out of the box (yet2).
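For context, here’s a minimal sketch of the offline case, assuming a sequence-classification reward model like the one used later in this post. The batch size and placeholder data are illustrative rather than taken from the original setup, and passing attention_mask is assumed to be supported by the model’s custom code:

```python
# Minimal offline scoring sketch: collect everything up front, score in fixed-size batches.
# Model name and batch size are illustrative; ArmoRM's custom code returns a .score attribute.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding token needed for batched inputs

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, device_map="cuda", trust_remote_code=True, torch_dtype=torch.bfloat16
)

conversations = [...]  # every chat you want to score, collected in advance

texts = [tokenizer.apply_chat_template(c, tokenize=False) for c in conversations]
scores = []
for i in range(0, len(texts), 8):
    batch = tokenizer(
        texts[i:i + 8], return_tensors="pt", padding=True, truncation=True
    ).to(model.device)
    with torch.no_grad():
        out = model(batch["input_ids"], attention_mask=batch["attention_mask"])
    scores.extend(out.score.float().cpu().tolist())
```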
Dynamic batching with batched
Thankfully, Mixedbread has a great generic dynamic batching library that makes it really easy to run inference on reward models3.
Here’s how we can host an endpoint on Modal4. First, let’s add some helper functions and set up the image:
```python
from typing import List, Dict

import modal

image = modal.Image.debian_slim().pip_install([
    "torch", "transformers", "accelerate", "batched", "hf_transfer"
]).env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})

app = modal.App("reward-api", image=image)

MODEL_NAME = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

with image.imports():
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from batched import inference


def validate_messages(messages: List[Dict[str, str]]):
    if not messages or len(messages) < 2:
        raise ValueError("Messages must contain at least a user and assistant message")
    if not all(isinstance(m, dict) and 'role' in m and 'content' in m for m in messages):
        raise ValueError("Each message must have 'role' and 'content' fields")


class RewardModelHelper:
    def __init__(self, model):
        self.model = model

    # Requests are queued and run together in batches of up to 8,
    # waiting at most 100 ms to fill a batch.
    @inference.dynamically(batch_size=8, timeout_ms=100.0)
    def score_batch(self, features: dict[str, torch.Tensor]) -> torch.Tensor:
        with torch.no_grad():
            # Move input to same device as model
            inputs = {k: v.to(self.model.device) for k, v in features.items()}
            return self.model(inputs["input_ids"]).score.float()
```
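To see what the decorator buys us outside of Modal, here’s a rough local sketch of the calling pattern: many concurrent `acall` requests that `batched` should coalesce into GPU batches. It assumes `model` and `tokenizer` have been loaded the same way as in the endpoint code, and the conversations are placeholders:

```python
# Hedged local sketch: concurrent acall() requests get coalesced into batches by `batched`.
import asyncio


async def score_all(model, tokenizer, conversations):
    helper = RewardModelHelper(model)

    async def score_one(conv):
        input_ids = tokenizer.apply_chat_template(conv, return_tensors="pt")
        # Each acall() joins the current queue; score_batch runs once per filled batch.
        result = await helper.score_batch.acall({"input_ids": input_ids})
        return result[0].item()

    # Launching all requests concurrently is what gives the library something to batch.
    return await asyncio.gather(*(score_one(c) for c in conversations))

# scores = asyncio.run(score_all(model, tokenizer, conversations))
```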
I’m using RLHFlow/ArmoRM-Llama3-8B-v0.1, the same reward model used by the SimPO team. It uses Mixture-of-Experts aggregation of reward objectives, with each expert corresponding to a human-interpretable objective (such as helpfulness, safety, coherence…). Check out the blog post from RLHFlow for more details5.
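As a quick aside, the model’s custom modeling code also exposes the per-objective rewards alongside the scalar score, if I’m reading the model card correctly; treat the `rewards` attribute name as an assumption to verify there. A sketch, reusing the `model` and `tokenizer` loaded above:

```python
# Hedged sketch: inspecting ArmoRM's per-objective rewards next to the aggregated score.
# The .rewards attribute is an assumption -- check the ArmoRM model card for the exact API.
messages = [
    {"role": "user", "content": "How do I make my code faster?"},
    {"role": "assistant", "content": "Profile it first, then optimize the hot paths."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model(input_ids)

print(output.score.float().item())   # aggregated preference score (what the endpoint returns)
print(output.rewards.float().cpu())  # per-objective rewards (helpfulness, safety, coherence, ...)
```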
Now we can add the meat of the code: a web endpoint that scores an LLM completion, using dynamic batching under the hood.
```python
@app.cls(
    gpu=modal.gpu.A10G(),
    allow_concurrent_inputs=1000,
    container_idle_timeout=120,
)
class Model:
    def load_model(self):
        model = AutoModelForSequenceClassification.from_pretrained(
            MODEL_NAME,
            device_map="cuda",
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            use_safetensors=True,
        )
        return model

    @modal.build()
    def build(self):
        # Download the model weights into the image at build time.
        self.load_model()

    @modal.enter()
    def setup(self):
        # Load the model and tokenizer once per container start.
        self.model = self.load_model()
        self.tokenizer = AutoTokenizer.from_pretrained(
            MODEL_NAME,
            use_fast=True,
        )
        self.score_batch = RewardModelHelper(self.model).score_batch

    @modal.web_endpoint(method="POST")
    async def score(self, messages_dict: Dict[str, List[Dict[str, str]]]):
        messages = messages_dict["messages"]
        validate_messages(messages)
        inputs = self.tokenizer.apply_chat_template(
            messages,
            return_tensors="pt",
            padding=True,
            truncation=True,
        )
        # acall() enqueues this request; batched runs the model once per dynamic batch.
        score = await self.score_batch.acall({"input_ids": inputs})
        return {"score": score[0].item()}
```
The allow_concurrent_inputs setting prevents Modal from spinning up additional containers for concurrent requests, so requests land on the same server and dynamic batching can actually kick in. Now we can use this reward model endpoint alongside a hosted API or an open-source model served with vLLM.
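Here’s a rough client-side sketch of calling the deployed endpoint. The URL is a placeholder for whatever Modal prints when you deploy, and httpx is just one convenient way to fire concurrent requests so the server-side batching has something to batch:

```python
# Hypothetical client for the deployed endpoint; replace ENDPOINT_URL with your Modal URL.
import asyncio
import httpx

ENDPOINT_URL = "https://your-workspace--reward-api-model-score.modal.run"  # placeholder

async def score_conversation(client, messages):
    resp = await client.post(ENDPOINT_URL, json={"messages": messages})
    resp.raise_for_status()
    return resp.json()["score"]

async def main(conversations):
    async with httpx.AsyncClient(timeout=60) as client:
        # Send requests concurrently; the server batches them dynamically.
        return await asyncio.gather(*(score_conversation(client, m) for m in conversations))

# Example: score one completion for a single prompt.
example = [[
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]]
print(asyncio.run(main(example)))
```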
Evaluating the reward model
Since my cloud GPU bill this month is kind of ridiculous, I decided to evaluate on a small random 100-question subset of TruthfulQA, a multiple-choice benchmark for LLMs.
One way to benchmark the reward model on multiple-choice data is to check the score for each possible answer to the question6; a sketch of this scoring loop follows the results below. Here are the results:
Average rank of correct answer: 2.10
Times correct answer ranked first: 53/100
Average score difference from best: 0.0163
Average correct answer score: 0.0659
Average best score: 0.0821
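For reference, here’s a hedged sketch of how these numbers can be computed: score every answer option through the endpoint, rank them, and track where the correct answer lands. It reuses the `score_conversation` helper from the client sketch above, and the (question, choices, correct index) data format is illustrative rather than the exact TruthfulQA loader used for the numbers above:

```python
# Hedged evaluation sketch: rank each answer option by reward model score.
# Assumes score_conversation() from the client sketch; data format is illustrative.
import asyncio

async def evaluate(client, dataset):
    ranks, firsts, diffs, correct_scores, best_scores = [], 0, [], [], []
    for question, choices, correct_idx in dataset:
        convs = [
            [{"role": "user", "content": question},
             {"role": "assistant", "content": choice}]
            for choice in choices
        ]
        scores = await asyncio.gather(*(score_conversation(client, c) for c in convs))
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rank = order.index(correct_idx) + 1  # 1 = reward model preferred the correct answer
        ranks.append(rank)
        firsts += rank == 1
        best = max(scores)
        diffs.append(best - scores[correct_idx])
        correct_scores.append(scores[correct_idx])
        best_scores.append(best)
    n = len(ranks)
    print(f"Average rank of correct answer: {sum(ranks) / n:.2f}")
    print(f"Times correct answer ranked first: {firsts}/{n}")
    print(f"Average score difference from best: {sum(diffs) / n:.4f}")
    print(f"Average correct answer score: {sum(correct_scores) / n:.4f}")
    print(f"Average best score: {sum(best_scores) / n:.4f}")
```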
The reward model scores the correct answer the highest 53% of the time. Next we will evaluate Best-of-N sampling, choosing the best LLM completion using the reward model.
Best-of-N-Sampling
Now we can test an LLM plus a reward model verifier, a super simple way to add more test-time compute. Best-of-N sampling doesn’t add much latency (we can sample and score hundreds of completions in parallel with batching) and is super easy to implement7.
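Here’s a hedged sketch of the Best-of-N loop, assuming an OpenAI-compatible completion server (e.g. vLLM) for generation and the reward endpoint above for scoring; the model name, URLs, and sampling parameters are placeholders rather than the exact setup used for the results below:

```python
# Hedged Best-of-N sketch: sample N completions, score each, keep the highest-scoring one.
# Assumes an OpenAI-compatible server (e.g. vLLM) and score_conversation() from earlier.
import asyncio
import httpx
from openai import AsyncOpenAI

llm = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server

async def best_of_n(prompt: str, n: int = 16) -> str:
    # 1) Sample n completions in one request.
    response = await llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=0.8,
    )
    completions = [choice.message.content for choice in response.choices]

    # 2) Score every completion with the reward model endpoint.
    async with httpx.AsyncClient(timeout=60) as client:
        scores = await asyncio.gather(*(
            score_conversation(client, [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": c},
            ])
            for c in completions
        ))

    # 3) Return the completion the reward model likes best.
    return completions[max(range(len(completions)), key=lambda i: scores[i])]
```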
I get the following zero-shot accuracy with Llama-3.1-8B-Instruct:
| # completions | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| accuracy (%) | 58 | 63 | 64 | 63 | 70 | 67 | 67 |
Using the reward model with Best-of-N sampling gives a 20.7% relative increase in accuracy from n=1 to n=16 (58% → 70%), with performance plateauing beyond 16 generations.