Questions about the model instruct format

#1
by llIlllI - opened

Sorry to bother you. I am not a native English speaker, and I am having trouble understanding this.
What exactly does the format of Mistral (Non-Tekken), i.e., Mistral v3 + [SYSTEM_PROMPT] refer to?

  1. [INST][SYSTEM_PROMPT] {system_prompt}[/SYSTEM_PROMPT][/INST] Understood.[INST] {prompt}[/INST] {Model Response}

  2. [SYSTEM_PROMPT] {system_prompt}[/SYSTEM_PROMPT] Understood.[INST] {prompt}[/INST] {Model Response}

  3. [INST][SYSTEM_PROMPT] {system_prompt}[/INST] Understood.[INST] {prompt}[/INST] {Model Response}

Which format is correct?
I'm confused because this model barely understands what I'm writing. However, all other Behemoth v1.x and merged models I've used have responded fine, so I believe it might be a formatting issue.
Thank you for taking the time to answer and help.

BeaverAI org

Tekken refers to the tokenizer Mistral uses nowadays. Previously, they used SentencePiece.
Read about it here: https://github.com/LostRuins/koboldcpp/pull/1659

In a nutshell, from pandora:

SentencePiece (older method) vs. Tekken (used by most of our recent models):

SentencePiece:
Used in most of our older models, usually the ones not labeled Tekken. At first we only had v1 and v2; then came v3, which also introduced the first v3-tekken variant.
SentencePiece adds a default whitespace on every encode("example"), so the text becomes "_example" instead.
This is the source of the trailing whitespaces, but it also means the model itself expects to generate a token that starts with the whitespace, so the format looks like this:

<s>[INST]_user message[/INST]_assistant message</s>[INST]_user new message[/INST]
Note that there is NO whitespace after the last [/INST], because the model will generate a token that starts with the whitespace. If you add the whitespace yourself, you will mess up the token distribution the model expects.
Again, this applies only to models using SentencePiece, not Tekken. If you go to one of our repos and see a tekken file, the model uses Tekken; if there is no tekken file, it uses SentencePiece.
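
The rule of thumb above can be sketched as a small check over a repo's file listing. This is only an illustration: the function name `guess_tokenizer_family` is hypothetical, the sample file lists are made up, and it assumes the Tekken file has "tekken" somewhere in its name (for example `tekken.json`).

```python
import os
from typing import Iterable


def guess_tokenizer_family(filenames: Iterable[str]) -> str:
    """Guess the tokenizer family from a list of repo file names.

    Assumption: a Tekken repo ships a file whose name contains "tekken".
    Anything else is treated as SentencePiece, per the rule of thumb above.
    """
    for name in filenames:
        if "tekken" in os.path.basename(name).lower():
            return "tekken"
    return "sentencepiece"


# Hypothetical file listings, for illustration only.
print(guess_tokenizer_family(["config.json", "tekken.json"]))
print(guess_tokenizer_family(["config.json", "tokenizer.model"]))
```

A file listing alone is a heuristic, so the model card is still worth checking when the result looks surprising.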

Tekken:
Tekken, however, does not add a default whitespace, which makes the format very simple:

<s>[INST]user message[/INST]assistant message</s>[INST]user new message[/INST]
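
To make the difference concrete, here is a minimal sketch that renders the same conversation in both styles. It assumes the "_" in the SentencePiece example stands for a literal space, and the helper name `build_prompt` is hypothetical; it is not taken from any Mistral or koboldcpp code.

```python
from typing import List, Optional, Tuple


def build_prompt(turns: List[Tuple[str, Optional[str]]], tekken: bool) -> str:
    """Render (user, assistant) turns as a Mistral-style prompt string.

    Use None as the assistant text for the final, still-open turn.
    Assumption: SentencePiece-style models get one space before each
    message, and Tekken-style models get none.
    """
    pad = "" if tekken else " "
    out = "<s>"
    for user, assistant in turns:
        out += f"[INST]{pad}{user}[/INST]"
        if assistant is not None:
            # Completed assistant turns are closed with the end-of-sequence token.
            out += f"{pad}{assistant}</s>"
    # No whitespace is appended after the last [/INST] in either style.
    return out


turns = [("user message", "assistant message"), ("user new message", None)]
sp_prompt = build_prompt(turns, tekken=False)
tk_prompt = build_prompt(turns, tekken=True)
print(sp_prompt)
print(tk_prompt)
```

If your backend already prepends the `<s>` token on its own, drop it from the string so it is not added twice.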

BeaverAI org

As for which one to use, it depends on the model. Mistral Small 25xx is Tekken-based, while 2409 is SentencePiece-based. If you use koboldcpp, the automatic format should handle it for you, assuming the chat template is set correctly in the model.
