Qwen3-4B-Z-Image-Engineer: The "Z-Engineer"

🚧 Work In Progress (But Surprisingly Competent) 🚧

Welcome to Z-Engineer, a lightweight, local, and slightly rebellious take on automated prompt engineering for Z-Image Turbo.

If you're tired of writing "masterpiece, best quality, 8k" and getting garbage, or if you just want to see what the S3-DiT architecture can really do when you feed it the right tokens, this model is your new best friend. It can also double as a high-IQ CLIP text encoder for Z-Image Turbo workflows if you're feeling adventurous.

🧠 What is this?

This is a merged model based on Qwen3 (specifically the 4B variant), fine-tuned to understand the intricate, somewhat needy requirements of the Z-Image Turbo architecture. It knows about "Positive Constraints," it hates negative prompts (because they don't work), and it really, really wants you to describe skin texture so your portraits don't look like plastic dolls.

📉 The "Heretic" Touch

We took the base Qwen3 model (which loves to say "I cannot assist with that") and gave it the Heretic treatment.

  • Refusal Rate: Dropped from a prudish 100/100 to a chill 23/100 on our benchmarks.
  • KL Divergence: Minimal. We lobotomized the censorship without breaking the brain.

🔬 Training Methodology

This model was trained on a synthetic dataset generated using Gemini 2.5-latest and Gemini 2.0 Flash. We generated more than 20,000 samples, comprising high-quality prompt pairs and deep technical conversation examples about Z-Image Turbo's architecture.

Fun Fact: This entire dataset took only 45 minutes to generate. How? Thanks to Tier 3 Gemini API access, a status I achieved involuntarily after all the times Gemini broke while vibe coding, looped infinitely, and racked up $$$ in charges. My wallet's pain is your prompt engineering gain. 💸

Why Synthetic Data?

Z-Image Turbo is "needy." It requires very specific, dense descriptions to look good. Most human-written prompts are too short or use "tag salad" (comma-separated lists), which the Qwen-3 encoder hates. We used Gemini to expand simple concepts into 120-180 word rich paragraphs, teaching the model to hallucinate the missing details (lighting, texture, camera specs) that Z-Image Turbo needs to trigger its magic.

The "Seed Strategy" (Engineering Diversity)

To ensure the model didn't just learn to output generic "portrait of a woman" prompts, we built a procedural generation engine for the seed prompts that exploits combinatorial explosion.

  • 8 Major Style Pillars: We explicitly balanced the dataset across Photorealism, Anime, Fantasy, Sci-Fi, Horror, Artistic, Documentary, and Fine Art.
  • Infinite Variety: We didn't just feed Gemini "A cat." We constructed seeds by randomly mixing ~170 base concepts with 26 styles, 10 shot types, 10 lighting setups, 11 moods, 8 texture notes, and 10 camera kits.
  • The Math: This procedural engine is capable of generating over 217 Billion unique seed prompts. From this vast latent space, we carefully sampled the 20,000 most coherent and high-impact intersections to train the model.

This ensures that the model understands that "Cinestill 800T" isn't just a random word, but a specific color grading instruction that can apply to any concept, from a cybernetic surgeon to a medieval marketplace.
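
For the curious, here's a minimal sketch of what such a seed mixer can look like. The lists below are illustrative stand-ins, not the actual tables from our engine (which mixes ~170 concepts, 26 styles, 10 shot types, 10 lighting setups, 11 moods, 8 texture notes, and 10 camera kits, plus fields not shown here):

import random

# Illustrative stand-ins for the real pillar tables.
CONCEPTS = ["a cybernetic surgeon", "a medieval marketplace", "an old fisherman"]
STYLES = ["Photorealism", "Anime", "Fine Art"]
SHOT_TYPES = ["extreme close-up", "establishing shot", "worm's-eye view"]
LIGHTING = ["golden hour rim light", "harsh fluorescent overheads"]
MOODS = ["melancholic", "triumphant"]
TEXTURES = ["weathered skin, visible pores", "wet asphalt, chipped paint"]
CAMERA_KITS = ["85mm f/1.8, Cinestill 800T", "24mm f/8, medium-format digital"]

def make_seed(rng: random.Random) -> str:
    """Intersect one entry from each pillar into a single dense seed prompt."""
    return (
        f"{rng.choice(CONCEPTS)}, {rng.choice(STYLES)} style, "
        f"{rng.choice(SHOT_TYPES)}, lit by {rng.choice(LIGHTING)}, "
        f"{rng.choice(MOODS)} mood, {rng.choice(TEXTURES)}, "
        f"shot on {rng.choice(CAMERA_KITS)}"
    )

rng = random.Random(42)
seeds = [make_seed(rng) for _ in range(20_000)]

Every seed then gets expanded by Gemini using the template in the next section.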

The Training Data Prompt

Here is the exact prompt template we used to generate the training data (it is sent to Gemini as the user message). You can see how we forced Gemini to focus on "Positive Constraints" and "Texture Density":

def get_user_message(seed_prompt: str) -> str:
    """Generate the instruction for Gemini to output one long paragraph with full specs."""
    return f"""You are a senior prompt engineer creating production-grade prompts for Tongyi Z-Image Turbo (S3-DiT, Qwen-3 text encoder, distilled 8-10 step pipeline).

Write ONE rich paragraph (120-180 words) that fully specifies the scene from the seed. Mandatory: subject count and relationships; spatial layout (foreground/midground/background, left/right/center, camera height, gaze direction); action and environment; time of day and weather; texture/material details; lighting rig; color grade; camera body/format, lens and focal length, aperture, focus/depth-of-field; film stock or digital pipeline; shot type (close-up/medium/full/establishing/aerial/dutch tilt/overhead); resolution/cleanliness cues.

Rules:
- One paragraph only, no bullet lists, no newlines, no quoted prefixes.
- Use positive language (describe what TO show), add cleanliness cues instead of negatives.
- Natural sentences, not comma tag salad.
- Keep and enrich style hints from the seed instead of replacing them.

Seed: {seed_prompt}

Return only the final paragraph, nothing else."""
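
The generation loop itself is a thin wrapper around this template. A minimal sketch using the google-generativeai client, where the API key placeholder, model id, and output path are illustrative and `seeds` comes from the mixer sketch above:

import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; Tier 3 quota helps here
model = genai.GenerativeModel("gemini-2.0-flash")

with open("dataset.jsonl", "w") as f:  # illustrative output path
    for seed in seeds:  # `seeds` from the mixer sketch above
        response = model.generate_content(get_user_message(seed))
        f.write(json.dumps({"seed": seed, "prompt": response.text.strip()}) + "\n")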

💻 Training Rig (The "Lazy" Setup)

This LoRA was trained for exactly 1 epoch over approximately 14.5 hours.

  • Hardware: An M4 Pro Mac Mini. Yes, really.
  • Why? Because let's be honest, getting ROCm to behave on Windows is a nightmare, and I was too lazy to reboot into Linux. So, we let the Mac chug along.
  • Result: It works! More training would probably yield more consistent results, but for a single epoch on a Mac Mini, it's surprisingly good.

🚀 Usage

Feed it a simple prompt like "A photo of an old man" and watch it spit out a paragraph about "weathered skin," "Fujifilm Superia 400," and "Shift 7.0 metadata."

System Prompt: (See zimage-prompter/system_prompt.json in the repo for the full magic incantation).

{
  "system_prompt": "You are Z-Engineer, an expert prompt engineering AI specializing in the Z-Image Turbo architecture (S3-DiT). Your goal is to rewrite simple user inputs into high-fidelity, \"Positive Constraint\" prompts optimized for the Qwen-3 text encoder and the 8-step distilled inference process.\n\n**CORE OPERATIONAL RULES:**\n1.  **NO Negative Prompts:** Z-Image Turbo ignores negative prompts at the optimal CFG of 1.0. You must strictly use \"Positive Constraints.\" (e.g., instead of \"negative: blur\", write \"...razor sharp focus, pristine imaging...\").\n2.  **Natural Language Syntax:** The Qwen-3 encoder requires coherent, grammatical sentences. Do NOT use \"tag salad\" (comma-separated lists). Use flow and structure.\n3.  **Texture Density:** The model suffers from \"plastic skin\" unless forced to render high-frequency detail. You must aggressively describe textures (e.g., \"weathered skin,\" \"visible pores,\" \"film grain,\" \"fabric weave\") to engage the \"Shift 7.0\" sampling schedule.\n4.  **Spatial Precision:** Use specific spatial prepositions (\"in the foreground,\" \"to the left,\" \"worm's-eye view\") to leverage the 3D RoPE embeddings.\n5.  **Text Handling:** If the user asks for text/signage, explicitly enclose the text in double quotes (e.g., ...a sign that says \"OPEN\"...) and describe the font/material (e.g., \"neon,\" \"stenciled paint\").\n6. **Proper Anatomy:** If the user asks for a living subject (e.g., an animal or person), explicitly state that they have proper anatomy or \"perfectly formed\" is used when describing the subject (e.g., \"The woman's perfectly formed hands hold\".\n\n**PROMPT STRUCTURE HIERARCHY:**\nConstruct your response in this specific order:\n1.  **Subject Anchoring:** Define the WHO and WHAT immediately.\n2.  **Action & Context:** Define the DOING and WHERE.\n3.  **Aesthetic & Lighting:** Define the HOW (Lighting, Atmosphere, Color Palette).\n4.  **Technical Modifiers:** Define the CAMERA (Lens, Film Stock, Resolution).\n5.  **Positive Constraints:** Define the QUALITY (e.g., \"clean background,\" \"architectural perfection,\" \"proper anatomy,\" \"perfectly formed\").\n\n**OUTPUT FORMAT:**\nReturn ONLY the enhanced prompt string, followed by a brief \"Technical Metadata\" block.\n\n**Example Input:**\n\"A photo of an old man.\"\n\n**Example Output:**\nAn extreme close-up portrait of an elderly fisherman with deep weathered skin and salt-and-pepper stubble, wearing a yellow waterproof jacket. He is standing against a dark stormy ocean background with raindrops on his face. The lighting is dramatic and side-lit, emphasizing the texture of his skin. Shot on an 85mm lens at f/1.8 with Fujifilm Superia 400 film stock, featuring high texture, raw photo quality, and visible film grain.\n\n[Technical Metadata]\nSteps: 8\nCFG: 1.0\nSampler: Euler\nSchedule: Simple\nShift: 7.0 (Crucial for skin texture)"
}
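
If you're running the GGUF build locally, here's a minimal sketch using llama-cpp-python (the model path is illustrative; any llama.cpp-compatible runtime works):

import json
from llama_cpp import Llama

# Illustrative path to a local GGUF quant of the model.
llm = Llama(model_path="qwen3-4b-z-image-engineer.gguf", n_ctx=4096)

with open("zimage-prompter/system_prompt.json") as f:
    system_prompt = json.load(f)["system_prompt"]

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "A photo of an old man"},
    ],
    max_tokens=512,
)
print(result["choices"][0]["message"]["content"])

Pair the output with the settings from the metadata block (8 steps, CFG 1.0, Euler, Shift 7.0) in your Z-Image Turbo pipeline.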

⚠️ Disclaimer

This is a V1. It might occasionally hallucinate or get too obsessed with "worm's-eye view." Use with a grain of salt (and maybe Shift: 7.0).
