Release v2.1.0
What's changed
Added features:
We added Reward and LLM-as-a-Judge to our task family:
- Reward allows you to write a custom function that scores the prediction, without requiring ground truth
- LLM-as-a-Judge allows you to delegate the task of scoring a prediction to a judge LLM, optionally accepting ground truth
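A Reward task is built around a user-supplied scoring function. A minimal sketch of such a function (the function name, signature, and scoring heuristic are illustrative assumptions, not the library's actual API):

```python
def length_penalty_reward(prediction: str) -> float:
    """Score a prediction without ground truth: reward non-empty,
    concise answers (1.0 best, 0.0 worst).

    Hypothetical example of a custom reward function; the signature
    the library expects may differ.
    """
    if not prediction.strip():
        return 0.0
    # Linearly penalize predictions longer than 100 characters.
    return max(0.0, 1.0 - max(0, len(prediction) - 100) / 400)
```

Any heuristic that maps a prediction to a score works here; the point of the Reward task is that no ground-truth labels are needed.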
Changes to CAPO, to make it applicable to the new tasks:
- CAPO now accepts the input parameter "check_fs_accuracy" (default True). For reward tasks, accuracy cannot be evaluated, so the prediction of the downstream_llm is used as the few-shot target.
- CAPO also accepts "create_fs_reasoning" (default True). If set to False, the input-output pairs from df_few_shots are used directly.
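One plausible reading of how the two flags interact when few-shot examples are built. Everything below (the helper name, the dict keys, the prompt wording) is an assumption for illustration, not CAPO's actual internals:

```python
def build_few_shots(df_few_shots, downstream_predict,
                    check_fs_accuracy=True, create_fs_reasoning=True):
    """Sketch of few-shot construction under the two new CAPO flags.

    downstream_predict stands in for a call to the downstream LLM.
    """
    shots = []
    for row in df_few_shots:
        if check_fs_accuracy:
            # Ground truth exists and can be checked for accuracy.
            target = row["target"]
        else:
            # Reward tasks have no ground truth, so the downstream
            # LLM's own prediction serves as the few-shot target.
            target = downstream_predict(row["input"])
        shot = {"input": row["input"], "output": target}
        if create_fs_reasoning:
            # Optionally attach a generated reasoning trace.
            shot["reasoning"] = downstream_predict(
                "Explain step by step: " + row["input"])
        shots.append(shot)
    return shots
```

With create_fs_reasoning=False this degenerates to plain input-output pairs, matching the behavior described above.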
- Introduces a tag-extraction function to centralize the repeated code for extracting content wrapped in tags (e.g. an answer such as "5" enclosed in XML-style tags)
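A minimal sketch of what such a centralized helper could look like; the function name and return convention are assumptions, not the library's actual API:

```python
import re

def extract_tag_content(text: str, tag: str):
    """Return the content of the first <tag>...</tag> pair in text,
    or None if the tag is absent.

    Illustrative tag-extraction helper; non-greedy match, DOTALL so
    the content may span multiple lines.
    """
    match = re.search(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", text, re.DOTALL)
    return match.group(1).strip() if match else None
```

Centralizing this avoids each task re-implementing its own regex for pulling answers out of tagged model output.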
Further changes:
- We now use mypy for automated type checking
- Core functionality of the classification task has been moved to the base task to prevent code duplication in other tasks
- Test coverage is now above 90%
Full Changelog: here