Getting Started: Reward Tasks with Promptolution
Welcome to the world of reward-based prompt optimization! If you've explored our classification tutorial (getting_started.ipynb) or our LLM-as-a-Judge notebook (llm_judge_getting_started.ipynb), you've seen how to optimize prompts for predicting labels or generating content that gets rated by AI judges.
But what if you want to optimize for something completely different? What if you want to optimize for:
- Objective, measurable outcomes rather than subjective quality?
- System compatibility - does the output actually work with your software?
- Concrete business metrics that you can define and measure automatically?
This is where Reward Tasks shine. Instead of relying on pre-labeled data or AI judges, you define your own reward function - a simple Python function that takes the model's output and returns a score. The optimizer then evolves prompts that maximize this reward.
The beauty of reward tasks: You can optimize for literally anything you can measure! Valid JSON parsing, code execution success, mathematical correctness, format compliance, API compatibility - if you can write a function to evaluate it, you can optimize for it.
New to Promptolution? If you haven't seen our other tutorials yet, check out getting_started.ipynb (classification) and llm_judge_getting_started.ipynb (LLM evaluation) first! This notebook builds on those concepts but tackles objective, measurable outcomes.
Installation
Install Promptolution with a single command
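Assuming the package is published on PyPI under the project name:
%pip install promptolution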
Imports
import pandas as pd
from promptolution.utils import ExperimentConfig
from promptolution.helpers import run_experiment
import nest_asyncio
nest_asyncio.apply() # Required for notebook environments
Setting Up Your Experiment
Prepare the data
For this tutorial, we're tackling a real-world challenge: summarizing text and outputting valid JSON. This is a perfect showcase for reward-based optimization because we can evaluate the output with a function and reward brevity and correct JSON structure - without needing ground-truth labels. We're using the CNN/DailyMail dataset, which contains news articles.
Key difference from other tasks: Notice we're not using labeled "correct" JSON outputs or asking an AI to judge quality. Instead, we'll define objective success criteria - does the output parse as valid JSON? Does it contain the required fields? Is the summary concise enough for our database?
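For reference, here is one way to build the df DataFrame used below. The dataset identifier, split, and sample size are illustrative - any DataFrame with an "article" column will do:
from datasets import load_dataset

# Load a slice of CNN/DailyMail and keep 300 articles (split and size are illustrative).
dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation[:300]")
df = dataset.to_pandas()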
Let's explore the task:
print("Dataset columns:", df.columns.tolist())
print(f"\nDataset size: {len(df)} examples")
print("\nSample Article:")
print(df["article"].iloc[0][:170] + "...")
Dataset columns: ['article', 'highlights', 'id']
Dataset size: 300 examples
Sample Article:
Investors looking to make an easy buck out of the housing market could be running out of time. Australia's financial regulators are in talks to tighten the process for in...
Creating Initial Prompts
Here are some starter prompts for JSON extraction. Feel free to experiment with your own approaches!
init_prompts = [
"""Analyze the provided news article and return a JSON response with the following three fields:
- "summary": A concise summary of the article's main points (maximum 200 characters)
- "category": The article's topic classification (options: "sports", "politics", "technology", or "other")
- "author": The article author's name (use "unknown" if not provided)
Format the response as valid JSON with these exact keys.
The final json needs to start with the <final_answer> tag.
"""
]
Configure Your LLM
Promptolution offers three flexible ways to access language models:
- Local LLMs (using the Transformers library)
- vLLM backend (for efficient serving of large language models)
- API-based LLMs (compatible with any provider following the OpenAI standard)
For this demonstration, we'll use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by changing the api_url and model_id in the configuration.
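The configuration below also references an api_key variable holding your token. One way to set it, and what a provider switch might look like (the environment variable name and the alternative endpoint/model are illustrative):
import os

api_key = os.getenv("DEEPINFRA_API_KEY")  # illustrative variable name; use whatever token your provider issues

# Switching providers just means pointing at another OpenAI-compatible endpoint, e.g.:
# api_url = "https://api.openai.com/v1"
# model_id = "gpt-4o-mini"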
Here's an explanation of each configuration parameter in the ExperimentConfig:
- optimizer: The algorithm used for prompt optimization. Currently we support "capo", "evopromptga", "evopromptde", and "opro". For this example we use "opro" (more on this choice below).
- task_description: A string describing the task you're optimizing prompts for. It gives the meta-LLM context about your task.
- prompts: A list of initial prompt strings that will be used as the starting point for optimization.
- n_steps: The number of optimization steps to run. Higher values allow more exploration and refinement but require more API calls and computational resources.
- api_url: The API endpoint URL used to access the language model. This example uses DeepInfra's API, which follows the OpenAI standard.
- model_id: The LLM to use for the experiment, as both downstream and meta-LLM.
- api_key: Your API authentication token, required to access the language model service.
Define Your Reward Function
This is where the magic happens! Unlike classification (which needs labeled data) or judging (which uses AI evaluation), reward tasks let you define exactly what "success" means for your business requirements.
The reward function grants partial credit for each requirement the output meets:
- +0.3 if the output parses as valid JSON via json.loads
- +0.2 if the resulting dictionary contains the key "summary", plus +0.1 each for containing "category" and "author"
- +0.2 if the summary is shorter than 200 characters
- +0.1 if the category is one of the allowed values
import json
def reward_function(prediction: str) -> float:
    reward = 0.0
    try:
        information = json.loads(prediction)
        reward += 0.3  # valid json
        if "summary" in information.keys():
            reward += 0.2  # contains summary
        if "category" in information.keys():
            reward += 0.1  # contains category
        if "author" in information.keys():
            reward += 0.1  # contains author
        if len(information.get("summary", "")) < 200:
            reward += 0.2  # summary is < 200 characters
        if information.get("category") in ["sports", "politics", "technology", "other"]:
            reward += 0.1  # category is valid
    except Exception:
        reward = 0.0
    return reward
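Before launching the full run, it can help to sanity-check the reward function on a couple of hand-written outputs (the example strings below are made up):
print(reward_function('{"summary": "Regulators may tighten lending rules for property investors.", "category": "politics", "author": "unknown"}'))  # 1.0 - meets every criterion
print(reward_function("not json at all"))  # 0.0 - fails to parse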
This reward function captures actual business requirements - the output must be valid JSON that our systems can process, contain all required fields, respect character limits to save time for the user, and use only allowed category values.
task_description = (
"The task is to summarize a news article into a json format, that contains 'summary', 'category', and 'author'. "
"The summary should be less than 200 characters, and the category should be one of 'sports', 'politics', 'technology', or 'other'. "
"The final json needs to start with the <final_answer> tag."
)
config = ExperimentConfig(
    optimizer="opro",
    task_description=task_description,
    prompts=init_prompts,
    x_column="article",
    n_steps=8,
    num_instructions_per_step=5,
    api_url="https://api.deepinfra.com/v1/openai",
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key=api_key,
    n_subsamples=15,
    task_type="reward",
    reward_function=reward_function,
)
Difference compared to Classification and LLM-as-a-Judge:
- task_type="reward" - uses your custom reward function instead of accuracy or AI judgment
- reward_function=reward_function - your objective success criteria
- optimizer="opro" - we already used EvoPrompt and CAPO in the other tutorials; here we use OPRO, whose main benefit is that it requires only a single initial prompt
- No need for labeled "correct" outputs - the reward function defines success
- Completely customizable - change the reward function to optimize for anything!
Run Your Experiment
With everything configured, you're ready to optimize your prompts! The run_experiment function will run the optimization and evaluate the results on a holdout set. You can expect this cell to take a few minutes to run.
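Kick off the run by passing the data and the config to run_experiment (the returned object is assumed here to be a DataFrame of prompts and their scores, matching the table below):
prompt_df = run_experiment(df, config)
prompt_df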
🔥 Starting optimization...
📊 Starting evaluation...
 | prompt | score |
---|---|---|
0 | Summarize the news article into a JSON format with the following structure: {“summary”: <summary>, “category”: <category>, “author”: <author>}.\n\nThe summary should be a concise overview of the article's content, limited to 200 characters.\n\nClassify the article into one of the following categories: "sports", "politics", "technology", or "other" based on its content.\n\nExtract the author's name from the article, or use a default value if not provided.\n\nStart the JSON response with the tag “<final_answer>” and end it with “</final_answer>”. | 0.848333 |
1 | Analyze the provided news article and return a JSON response with the following three fields:\n- "summary": A concise summary of the article's main points (maximum 200 characters)\n\n- "category": The article's topic classification (options: "sports", "politics", "technology", or "other")\n\n- "author": The article author's name (use "unknown" if not provided)\n\nFormat the response as valid JSON with these exact keys.\n\nThe final json needs to start with the <final_answer> tag.\n | 0.811667 |
2 | Analyze the provided news article and generate a JSON response with the following three fields:\n\n* "summary": A concise and objective summary of the article's main points, limited to 150 characters, focusing on the most critical information and highlighting key points.\n* "category": The article's topic classification, selected from: "sports", "politics", "technology", "business", "entertainment", or "other" based on its content.\n* "author": The article author's name, using "unknown" if not provided.\n\nFormat the response as valid JSON with these exact keys, ensuring that the JSON response starts with the <final_answer> tag and ends with </final_answer>. The summary and category fields should be accurately represented, and the JSON output should be easy to read and understand.\n\nNote: The article summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\n\nScore: 99 | 0.805000 |
18 | Analyze the provided news article and generate a JSON response with the following three fields:\n- "summary": A concise summary of the article's main points, limited to 250 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\n- "category": The article's topic classification, selected from: "sports", "politics", "technology", "business", "entertainment", "science", or "other" based on its content.\n- "author": The article author's name, using "unknown" if not provided.\n\nThe JSON response should start with the <final_answer> tag and end with </final_answer>. Ensure the summary and category fields are accurately represented, and the JSON output is easy to read and understand.\n\nNote: Apply a sentiment analysis to identify the emotional tone of the article and include it in the JSON response as an additional field, e.g., "sentiment": "positive", "neutral", or "negative". | 0.711667 |
19 | Analyze the provided news article and generate a JSON response with the following three fields:\n\n* "summary": A concise summary of the article's main points, limited to 200 characters, focusing on identifying the most critical information and presenting it in a clear and coherent manner.\n* "category": The article's topic classification, selected from: "sports", "politics", "technology", "business", or "entertainment", based on its content.\n* "author": The article author's name, using a default value if not provided.\n\nFormat the response as valid JSON with these exact keys. Ensure the JSON response starts with the <final_answer> tag and ends with </final_answer>. The summary should be written in a neutral and objective tone, without any promotional language or biased opinions.\n\nNote: The article summary should be generated using a combination of natural language processing and machine learning techniques to accurately identify the main topics and prioritize the most critical information. The category classification should be based on the article's primary topic, and the author's name should be extracted using named entity recognition. | 0.701667 |
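To reuse the winner downstream, pull the top-scoring prompt out of the result (column names assumed to match the table above):
best_prompt = prompt_df.sort_values("score", ascending=False)["prompt"].iloc[0]
print(best_prompt)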
You might think 'just ask for JSON' would work fine, but optimization reveals that specific instructions about field names, value constraints, and output formatting matter: in this run the best optimized prompt scores about 0.85, versus roughly 0.81 for the hand-written starting prompt - another reminder that systematic optimization beats manual prompt engineering!
Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡