:<|im_end|>\n<|im_start|>assistant generated despite predict model used

Hello,

I have run into an interesting behavior. I use the following environment:

1- model: TxGemma-2b predict

2- a dataset in completion mode, i.e. {"prompt", "completion"}

3- an SFTConfig that does not change any default parameter; in particular, nothing changed about packing, the dataset template, the loss (I have not specified that I want to compute the loss on the completion only, as I assume the Trainer will do that by itself because of the dataset format), the EOS token, etc.

The training goes well. But when I use the model, the inference output contains the :<|im_end|>\n<|im_start|>assistant sequence of tokens, which is expected for the chat model but not for the predict model.

Gemini told me to:

  • check the version of the model used. Done, I can confirm that it is a predict model, not a chat one
  • check the dataset. Done, the dataset is a clean “completion” dataset
  • check the SFTConfig to remove any option that could induce a transparent reformatting of the dataset. Done

So, now, I do not understand why the fine-tuned model behaves like a chat model.
Any idea?

Thanks

Best regards

Jerome


Hmm… for now.


Your predict model behaves like a chat model because the training pipeline quietly turned your {prompt, completion} data into chat-formatted text with <|im_start|> / <|im_end|>, and you’re now decoding that raw structure at inference.

So the model is doing what it was trained to do. The surprising part is what TRL + the tokenizer did under the hood, not the weights themselves.

I’ll walk through:

  • What TxGemma “predict” actually is
  • What SFTTrainer does with {prompt, completion}
  • Why you see :<|im_end|>\n<|im_start|>assistant
  • How to fix decoding, or how to reconfigure training if you really want a pure completion model

1. TxGemma “predict” vs “chat”: what that split really means

From Google’s docs and model card, TxGemma is a family of models (predict and chat variants) built on Gemma 2, aimed at therapeutic tasks. (Google Developers Blog)

Important points:

  • “Predict” variants (like txgemma-2b-predict) are optimized for prediction tasks and typically exposed as “base-style” models.
  • “Chat” variants add extra instruction-tuning data to support multi-turn conversation and explanations, at a small cost in raw predictive performance. (Google Developers Blog)

However:

  • Predict vs chat is not enforced at the tokenizer/vocabulary level.
  • The tokenizer for Gemma-family models still usually defines a chat template built around dedicated turn-marker tokens (the official Gemma-2 IT checkpoints such as google/gemma-2-2b-it use <start_of_turn> / <end_of_turn>; ChatML-style variants rename those markers to <|im_start|> / <|im_end|>). (Hugging Face)

So even a “predict” checkpoint can:

  • Understand those special tokens, and
  • Learn to use them if your fine-tuning data includes them.

The key question is: who put those tokens into your data? That’s usually not you; it’s TRL’s SFTTrainer plus the tokenizer.


2. What TRL’s SFTTrainer actually does with {prompt, completion}

You’re using:

  • Model: TxGemma-2b predict
  • Data: {"prompt": ..., "completion": ...}
  • SFTConfig: mostly defaults

This matches what TRL calls a prompt–completion dataset type. The TRL docs say:

SFT supports both language modeling and prompt-completion datasets. The SFTTrainer is compatible with both standard and conversational dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset. (Hugging Face)

There is also an explicit statement in an earlier TRL issue about instruction-style {prompt, completion} data:

The SFTTrainer will then format the dataset for you using the defined format from the model’s tokenizer with the apply_chat_template method. (GitHub)

Putting this together:

  1. TRL recognizes your dataset as prompt–completion.

  2. It converts each pair into an internal “conversation” like:

    • user: prompt
    • assistant: completion
  3. Then, if the tokenizer has a chat_template, it runs:

    tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=<something>
    )
    
  4. For Gemma-style templates, that expands to a string with markers such as:

    <|im_start|>user
    <PROMPT><|im_end|>
    <|im_start|>assistant
    <COMPLETION><|im_end|>
    

So even though your original dataset was “clean completion data”, the preprocessed text the model actually sees in training is chat-style with <|im_start|> / <|im_end|>.
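
You can reproduce this preprocessing yourself. The sketch below is only an illustration (a made-up prompt/completion pair, calling tokenizer.apply_chat_template directly rather than going through TRL), but it shows the kind of text the trainer ends up feeding the model:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")

# a made-up row standing in for one {prompt, completion} example
example = {"prompt": "Is aspirin soluble in water?", "completion": "Yes."}

# roughly what TRL builds internally from a prompt-completion row
messages = [
    {"role": "user", "content": example["prompt"]},
    {"role": "assistant", "content": example["completion"]},
]

# if the tokenizer ships a chat_template, this is the formatted training text
if tok.chat_template is not None:
    print(tok.apply_chat_template(messages, tokenize=False))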

This behavior (auto ChatML / chat-template formatting) has been discussed multiple times:

  • “SFTTrainer: Why do we always switch to chatML?” where a user noticed that _prepare_dataset keeps trying to convert data into chat format via maybe_convert_to_chatml and maybe_apply_chat_template. (GitHub)
  • HF forum thread “SFT Trainer and chat templates”, asking explicitly whether SFTTrainer automatically applies the tokenizer’s chat_template for standard formats, and the answer is effectively “yes, if the template exists.” (Hugging Face Forums)

So even if you never mention <|im_start|> / <|im_end|> in your code, SFTTrainer + tokenizer may inject them in your training text.


3. How that training setup explains your inference artifacts

During training, given the above behavior, the model sees something like:

<|im_start|>user
<prompt><|im_end|>
<|im_start|>assistant
<completion><|im_end|>

plus whatever padding/EOS logic SFTTrainer adds. In particular:

  • The input context always contains user and assistant segments wrapped by these tags.
  • With prompt–completion type, SFTTrainer usually uses completion-only loss, so the loss is computed only on the assistant’s tokens, not the prompt’s, but the model still conditions on the full tagged context. (Hugging Face)

Two effects at inference time:

3.1 Prompt format mismatch

At inference you likely do something like:

inputs = tokenizer(user_text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=...)
decoded = tokenizer.decode(output[0], skip_special_tokens=False)

So the model receives bare text, not:

<|im_start|>user
user_text<|im_end|>
<|im_start|>assistant

But during SFT it always saw that structure. It learned:

  • After <|im_start|>user ... <|im_end|> comes <|im_start|>assistant.
  • After the assistant content ends, it often emits <|im_end|>.

When you now ask it to generate from a raw prompt, it will often:

  • First “repair” the prompt into the expected chat structure by generating something that includes <|im_end|> and <|im_start|>assistant.
  • Then start your actual answer.

This is why you see sequences like:

:<|im_end|>\n<|im_start|>assistant ...

Those markers are just the structural boundary it learned.

3.2 Raw decoding of special tokens

By default, tokenizer.decode(..., skip_special_tokens=False) will leave special tokens in the output string. For chat/instruction models, standard practice is to:

  • Either decode with skip_special_tokens=True, or
  • Manually split at a sentinel like <|im_end|> and drop everything afterward.

This pattern appears in multiple chat examples and in discussions about EOS handling, including Qwen2.5 issues where double <|im_end|> + newline appear because of template + EOS logic. (GitHub)

So the presence of these raw tags in your log is not proof the model is a “chat” checkpoint; it’s just proof you’re looking at the unfiltered chat-formatted text it was fine-tuned on.


4. Why the “predict” vs “chat” label does not prevent this

The TxGemma pages and model card treat “predict” and “chat” as different variants in the suite. (Google DeepMind)

But TRL doesn’t have any special casing like:

  • “If model is predict, do not apply chat_template.”

Instead, its main triggers are:

  • Does the tokenizer have a chat_template?
  • Is the dataset in a standard instruction format (messages, prompt/completion)?

If yes, it will try to build conversations and call apply_chat_template. (Hugging Face)
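
A quick way to see which branch applies to your setup is to check both triggers yourself before constructing the trainer. A minimal sketch, assuming your training data is a datasets.Dataset called ds:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")

# trigger 1: does the tokenizer ship a chat template?
print("has chat_template:", tok.chat_template is not None)

# trigger 2: does the dataset look like a standard instruction format?
print("columns:", ds.column_names)  # e.g. ['prompt', 'completion'] -> prompt-completion type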

So from TRL’s perspective:

  • TxGemma-2b-predict + {prompt, completion} + chat_template present
    ⇒ “Great, this is a chat training setup.”

Your expectation:

  • “Predict model + completion dataset + default SFTConfig ⇒ pure completion training”

does not match current design. That mismatch is what you’re experiencing.


5. What you can do now

5.1 If you are okay treating this as a chat model

This is the lowest-friction route.

  1. Use the chat template at inference.

    Instead of feeding bare text, construct messages and apply the template:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    # load the tokenizer (ideally the one saved alongside your fine-tuned model)
    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    model = AutoModelForCausalLM.from_pretrained("path/to/your/finetuned/model")
    
    # wrap the user prompt in the same chat structure the model saw during SFT
    messages = [{"role": "user", "content": user_prompt}]
    prompt_text = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    
    inputs = tok(prompt_text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    raw = tok.decode(out[0], skip_special_tokens=False)
    
  2. Strip at <|im_end|> or use skip_special_tokens=True.

    For example:

    answer_part = raw.split("<|im_end|>")[0]
    # optionally also split off the assistant prefix:
    answer = answer_part.split("<|im_start|>assistant")[-1].strip()
    

    or simply:

    answer = tok.decode(out[0], skip_special_tokens=True)
    

As long as you align inference format with training format, the weird :<|im_end|>\n<|im_start|>assistant fragments disappear from the user-facing text.

5.2 If you want a true “completion-only” predict-style model

Then you need to prevent the automatic chat formatting.

Two main approaches:

  1. Disable chat_template before building SFTTrainer.

    Load the tokenizer and wipe its template so maybe_apply_chat_template has nothing to use:

    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    tok.chat_template = None
    if "chat_template" in getattr(tok, "init_kwargs", {}):
        tok.init_kwargs.pop("chat_template")
    

    Then pass this tokenizer into SFTTrainer. With no chat template defined, SFTTrainer will not wrap your examples in <|im_start|> / <|im_end|> and will instead use a simpler prompt + completion + EOS scheme for prompt–completion. (GitHub)

  2. Skip SFTTrainer’s automatic formatting entirely.

    • Build a "text" field yourself:

      def build_text(example):
          # you may also want to append tokenizer.eos_token here so the model learns to stop
          return example["prompt"] + example["completion"]
      
      ds = ds.map(lambda ex: {"text": build_text(ex)})
      
    • Either:

      • use SFTTrainer with dataset_text_field="text" and no chat_template (a minimal sketch of this option follows after this list), or
      • fall back to plain Trainer plus a DataCollatorForCompletionOnlyLM if you want explicit control of loss masking. (Hugging Face)
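
Here is that sketch, assuming ds already has the "text" column built as above; note that dataset_text_field lives on SFTConfig in recent TRL releases (older releases took it as an SFTTrainer argument), so adapt it to your version:

from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
tok.chat_template = None  # make sure nothing re-applies a chat template

config = SFTConfig(
    output_dir="txgemma-2b-predict-sft",
    dataset_text_field="text",  # train on the raw concatenated text
)
trainer = SFTTrainer(
    model="google/txgemma-2b-predict",
    args=config,
    train_dataset=ds,
    processing_class=tok,  # called `tokenizer=` in older TRL versions
)
trainer.train()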

In either case, you will need to retrain from the base TxGemma-2b-predict checkpoint, because the current fine-tuned weights have already learned that chat structure.


6. Sanity checks you can run on your side

To be fully convinced:

  1. Print the tokenizer’s chat template.

    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    print(tok.chat_template)
    

    If you see a Jinja template producing <|im_start|>user, <|im_start|>assistant, <|im_end|>, that’s exactly where those tokens come from. Similar templates exist for Gemma-2 IT models. (Hugging Face)

  2. Inspect the processed training dataset from SFTTrainer.

    After constructing the trainer, look at trainer.train_dataset[0] or the tokenized batch. You should see that the "text" or token sequence already contains <|im_start|> / <|im_end|>. This confirms the trainer, not your raw dataset, injected them (a snippet for this check follows after this list). (GitHub)

  3. Compare decoding options.

    Generate once and examine:

    out = model.generate(**inputs, max_new_tokens=64)
    raw = tok.decode(out[0], skip_special_tokens=False)
    clean = tok.decode(out[0], skip_special_tokens=True)
    print("RAW:", raw)
    print("CLEAN:", clean)
    

    If RAW shows :<|im_end|>\n<|im_start|>assistant but CLEAN does not, then the tags are indeed just special tokens that the default decoding is exposing.
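
For check 2 above, here is a rough way to peek at the processed data once the trainer is constructed; the exact column names ("text", "input_ids", "labels") vary across TRL versions, so treat this as a sketch:

row = trainer.train_dataset[0]
print(row.keys())

# if the example is still plain text, the chat markers are directly visible
if "text" in row:
    print(row["text"])
# if it is already tokenized, decode it to expose the injected markers
elif "input_ids" in row:
    print(tok.decode(row["input_ids"], skip_special_tokens=False))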


Compressed summary

  • TRL’s SFTTrainer treats {prompt, completion} as instruction data and, when a tokenizer has a chat_template, auto-applies that template using apply_chat_template. (Hugging Face)

  • For Gemma/TxGemma, that template is ChatML-like and introduces <|im_start|> / <|im_end|> and role tags.

  • Your TxGemma-2b-predict checkpoint was therefore fine-tuned in a chat-style format, even though your raw dataset was plain completion.

  • At inference you are feeding bare prompts and decoding raw text, so the model generates the missing chat structure and you see :<|im_end|>\n<|im_start|>assistant.

  • To resolve it:

    • Either treat the model as chat-style: use apply_chat_template at inference and strip at <|im_end|> or use skip_special_tokens=True.
    • Or retrain from the base predict model with the chat template disabled or with a fully manual completion-only formatting pipeline.

Thanks John, super instructive as usual.

I am using decoded = tokenizer.decode(output[0], skip_special_tokens=True), so I was expecting the special tokens to be removed from the final output. That is why I was surprised at first to see some of them popping up in the string.

By the way, it would be interesting to understand whether skip_special_tokens=True takes into account the entire collection of special tokens, or restricts it to the kind of dataset template originally used for training. In the second case, it could be seen as a mistake. Thanks to your detailed explanations, it is clear that the “chat” template is the one always used by the trainer.

For now, I will not switch to the chat-style format, as my training workflow works fine and I control the loss; I will manage the extra special tokens myself. However, I am still interested in the reason why skip_special_tokens=True does not do the expected job. I may file a bug about it :slight_smile:

Thanks a lot

Best regards

Jerome


Seems like a fifty-fifty chance whether it’s a real bug or not.


You are seeing <|im_end|> and <|im_start|>assistant in the decoded string despite skip_special_tokens=True because those tokens are almost certainly not registered as “special tokens” in the tokenizer config for your TxGemma checkpoint.

skip_special_tokens=True only touches tokens that are marked special in the tokenizer. It does not know anything about chat templates, training mode, or SFT.

I will walk through:

  • what skip_special_tokens actually does
  • how special tokens are defined
  • why ChatML-style tags often slip through
  • how this answers your “does it depend on the dataset template?” question
  • what you can do and when this would count as a real bug

1. What skip_special_tokens=True really does

In Hugging Face tokenizers, decoding is conceptually:

  1. Convert ids → tokens.
  2. If skip_special_tokens=True, filter out tokens whose ids are in tokenizer.all_special_ids.
  3. Convert remaining tokens to string. (Hugging Face)

The important part:

  • The list of “special” ids is built from:

    • core special tokens: bos_token_id, eos_token_id, pad_token_id, cls_token_id, sep_token_id, unk_token_id, mask_token_id
    • plus any additional_special_tokens. (Hugging Face)
  • Anything not in that set is treated as a normal token and will never be removed by skip_special_tokens.

So:

  • skip_special_tokens=True means “drop tokens whose ids are marked special in the tokenizer configuration.”

  • It does not check:

    • which dataset template was used,
    • how SFTTrainer formatted your data,
    • or whether a token “looks like” a control token.

This is why some users see <pad> remain in decoded text when the tokenizer’s pad token is misconfigured: the pad token string exists, but the id is not actually treated as special, so skip_special_tokens simply ignores it. (GitHub)
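
Conceptually, the filtering amounts to the following sketch (not the real implementation, just the documented behavior):

def decode_skip_special(tok, ids):
    # keep only ids that the tokenizer does NOT mark as special,
    # then decode whatever is left as ordinary text
    special = set(tok.all_special_ids)
    return tok.decode([i for i in ids if i not in special])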


2. How ChatML-style tokens are typically implemented

For Gemma-family models, the ChatML-like tokens often come from reusing existing placeholder ids, not from adding new special tokens.

Example: Phil Schmid’s “ChatML tokenizer for Gemma” explicitly says:

  • He took google/gemma-7b’s tokenizer.
  • He replaced the string values of two existing tokens with ids 106 and 107 by <|im_start|> and <|im_end|>.
  • He did not add new tokens. (Hugging Face)

Implications:

  • Those ids were already normal vocab entries.

  • Changing their string representation does not automatically mark them as “special” in the tokenizer config.

  • Unless the tokenizer config explicitly sets them as:

    • bos_token or eos_token, or
    • entries in additional_special_tokens,

    they will not be in all_special_tokens or all_special_ids.

So from the tokenizer’s point of view:

  • <|im_start|> and <|im_end|> are just ordinary tokens that happen to be used by the chat template.
  • skip_special_tokens=True does not touch them, because they are not in the “special tokens” set.

Google’s own Gemma docs use control tokens like <start_of_turn> and <end_of_turn> to mark dialogue boundaries. (Google AI for Developers)
Whether those are marked as special in a given checkpoint depends on how the Hugging Face tokenizer for that checkpoint was configured. The chat template and the “special” flags are related but separate.

For TxGemma, it is very likely something similar:

  • ChatML-style strings exist in the vocab.
  • The tokenizer has a chat_template that uses them.
  • They are not all declared as special tokens.

3. Why you still see <|im_end|> with skip_special_tokens=True

Given that:

  1. SFTTrainer is using the tokenizer’s chat_template and injects <|im_start|> / <|im_end|> into the training text.
  2. Those strings are probably not registered as special tokens in the tokenizer config (they were reused placeholder ids). (Hugging Face)
  3. skip_special_tokens=True only removes tokens whose ids are in all_special_ids. (Hugging Face)

Then:

  • The decode call:

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    

    leaves <|im_end|> and <|im_start|>assistant untouched, because those tokens are not in the “special” set.

This is exactly the same pattern as the GitHub issue where <pad> was not removed:

  • The string <pad> existed.
  • But the tokenizer had not wired pad_token and additional_special_tokens correctly.
  • So skip_special_tokens=True did not remove it. (GitHub)

From the library’s point of view, this is expected behavior, not an error.


4. Answer to your conceptual question

You asked:

Does skip_special_tokens=True take into account the entire collection of special tokens, or does it reduce to the kind of dataset template originally used for training?

Clear answer:

  1. It uses the entire set of special tokens defined in the tokenizer:

    • core special tokens (bos, eos, pad, etc.), plus
    • all additional_special_tokens. (Hugging Face)
  2. It is completely agnostic to:

    • the dataset template used in training (chat vs completion),
    • SFTTrainer formatting,
    • or which tokens appear in a chat template string.

So there is no dependency on the “dataset template originally used for training.”
The only dependency is the tokenizer configuration at decode time.

If a token is conceptually part of your “template” but is not marked as special, skip_special_tokens will not remove it. That is the situation you are in.


5. How to check your TxGemma tokenizer

To see exactly what is going on in your environment, inspect the tokenizer:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")

print("special_tokens_map:", tok.special_tokens_map)
print("all_special_tokens:", tok.all_special_tokens)
print("all_special_ids:", tok.all_special_ids)

print("im_start id:", tok.convert_tokens_to_ids("<|im_start|>"))
print("im_end id:", tok.convert_tokens_to_ids("<|im_end|>"))

You are looking for two things:

  • Are <|im_start|> and <|im_end|> in all_special_tokens?
  • Are their ids in all_special_ids?

Expected in your case:

  • They exist as tokens, but are not in the special lists.
  • Therefore skip_special_tokens=True ignores them.

If, instead, you find that they are in all_special_tokens and still survive decoding with skip_special_tokens=True, then you have a real bug or an edge case similar to previous tokenizer issues. (GitHub)


6. How to manage the extra tokens without changing your workflow

You said you want to keep your current training setup and “manage the extra special tokens” yourself. Here are realistic approaches.

6.1. Pure post-processing

Treat these markers as plain delimiters and strip them manually, independent of tokenizer internals:

  • Remove at the first <|im_end|>:

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    decoded = decoded.split("<|im_end|>")[0]
    
  • Optionally strip ChatML headers:

    decoded = decoded.replace("<|im_start|>assistant", "")
    decoded = decoded.replace("<|im_start|>user", "")
    decoded = decoded.strip()
    

This is simple and robust. It does not depend on whether the tokenizer flags those tokens as special.

6.2. Register ChatML markers as special tokens

If you want skip_special_tokens=True to handle them, you can try declaring them as additional special tokens:

extra = {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
tok.add_special_tokens(extra)

According to the docs, additional_special_tokens are “ensured to be skipped when decoding with skip_special_tokens=True.” (Hugging Face)

You must test this carefully with a trained model:

  • If those ids are already in the vocab, add_special_tokens usually just flags them as special rather than adding new ids, but you want to confirm it does not alter embedding sizes or other properties.

If this works, future decodes with skip_special_tokens=True will filter out those tags without extra string handling.
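
A quick way to confirm that this only flags existing ids (and therefore requires no model.resize_token_embeddings()), assuming the ChatML strings are already in the vocab:

before = len(tok)
num_added = tok.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
print("newly added ids:", num_added, "| vocab size:", before, "->", len(tok))
# num_added == 0 means the ids were only flagged as special, not appended,
# so the embedding matrix of the trained model is untouched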


7. When is this worth reporting as a bug?

There are two very different situations.

7.1. Misconfiguration or design choice (not a bug)

If your inspection shows:

  • <|im_start|> and <|im_end|> are not in all_special_tokens,

then skip_special_tokens=True ignoring them is exactly what the docs say it should do. (Hugging Face)

In that scenario:

  • This is not a library bug.
  • It is a model-tokenizer design choice: ChatML markers are present but not wired as special tokens.

If you open an issue then, the resolution will probably be:

  • “Working as documented; if you want them dropped, add them as special tokens or post-process them.”

7.2. True bug / surprising behavior

If you find that:

  • <|im_end|> appears in all_special_tokens and its id is in all_special_ids,
  • yet decoding with skip_special_tokens=True still returns <|im_end|>,

then you have a genuine inconsistency, similar to:

  • the “pad token not removed” issue caused by a broken pad_token mapping, or
  • differences between slow and fast tokenizers where one respects skip_special_tokens and the other does not. (GitHub)

In that case a bug report is reasonable. You would want to include:

  • model id (google/txgemma-2b-predict),
  • special_tokens_map, all_special_tokens, all_special_ids,
  • a minimal code snippet that reproduces “token is special but not skipped” (a sketch follows below).
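
A sketch of such a minimal repro, assuming the inconsistency actually shows up on your checkpoint:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")

tid = tok.convert_tokens_to_ids("<|im_end|>")
assert tid in tok.all_special_ids, "token is not even registered as special"

decoded = tok.decode([tid], skip_special_tokens=True)
print(repr(decoded))  # a genuine bug would print '<|im_end|>' here instead of ''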

Condensed summary

  • skip_special_tokens=True only removes tokens whose ids are marked as special in the tokenizer (bos, eos, pad, unk, cls, sep, mask, plus additional_special_tokens). It does not care about the dataset template or SFTTrainer’s behavior. (Hugging Face)

  • In Gemma-style ChatML setups, tags like <|im_start|> and <|im_end|> are often created by reusing existing vocab ids, without marking them as special. That makes them invisible to skip_special_tokens, so you still see them after decoding. (Hugging Face)

  • In your TxGemma workflow, SFTTrainer always uses the chat template, so those markers are in the training data and in the model’s outputs. Because they are not configured as special tokens, skip_special_tokens=True does not remove them.

  • To handle them, either:

    • post-process strings (split on <|im_end|>, strip <|im_start|>assistant), or
    • explicitly register them as additional_special_tokens so the tokenizer can skip them automatically.