Update README.md

49e57fb verified 10 months ago

4.1 kB

	---
	license: apache-2.0
	datasets:
	- kolerk/TON-Math-SFT
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen2.5-VL-3B-Instruct
	pipeline_tag: image-text-to-text
	---


	# TON-Math
	TON is a series of large language models trained using our efficient algorithm, which automatically decides whether to think or not, based on Qwen2.5-VL.
	We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a preliminary step.
	## Introduction

	Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process—where people skip reasoning for easy questions but think carefully when needed—we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy:

	1. (i) A supervised fine-tuning (SFT) stage with a simple yet effective “thought dropout” operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning.
	2. (ii) A GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards.

	Experimental results show that TON can reduce the completion length by up to 90%* compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks—covering a range of reasoning difficulties under both 3B and 7B models—consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances*. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.

	## Quickstart

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer


	example={
	"image": "./Geo170K/images/test/0.png", ### your image path
	"problem": "As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, DE parallel BC, then the size of angle CED is ()",

	}

	def make_conversation_image(example):
	return {
	'image': example['image'], # Store path instead of loaded image
	'prompt': [{
	'role': 'user',
	'content': [
	{'type': 'image', 'text': None},
	{'type': 'text', 'text': example['problem']}
	]
	}]
	}

	model_name = "kolerk/TON-3B-AITZ"

	model = AutoModelForCausalLM.from_pretrained(
	model_name,
	torch_dtype="auto",
	device_map="auto"
	)
	tokenizer = AutoTokenizer.from_pretrained(model_name)


	text = tokenizer.apply_chat_template(
	make_conversation_image(example),
	tokenize=False,
	add_generation_prompt=True
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	generated_ids = model.generate(
	**model_inputs,
	max_new_tokens=4096,
	top_p=0.95,
	top_k=1,
	temperature=0.6
	)
	generated_ids = [
	output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]

	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(response)
	```

	## Evaluation

	Run our test Python file in the [code repository](https://github.com/kokolerk/TON/blob/main/src/eval/test_qwen25vl_geoqa.py) for more details.


	## Citation

	If you find our work helpful, feel free to give us a cite.

	```
	@misc{wang2025think,
	title={Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models},
	author={Jiaqi Wang and Kevin Qinghong Lin and James Cheng and Mike Zheng Shou},
	year={2025},
	eprint={2505.16854},
	archivePrefix={arXiv},
	primaryClass={cs.AI}
	}
	```