Exceeded inference provider monthly usage

While taking the Agents Course, I exceeded the monthly usage for two inference providers: novita and together. This happened while running the multi-agent code. Not sure what to do here.

Thanks!

Here is the likely cause, plus workarounds and alternative setups.


You hit a quota wall because multi-agent code multiplies LLM calls fast. Each “agent step” is usually at least one chat-completion. Multi-agent setups add more agents plus coordination. Agentic RAG loops can also re-run retrieval and verification, which adds even more calls. (Hugging Face)

Below are the common causes, then the practical workarounds, then good alternative APIs and local options.


What “Exceeded inference provider monthly usage” usually means

There are two common interpretations:

  1. You exhausted Hugging Face monthly included credits for Inference Providers.
    HF allocates a small monthly credit bucket. Free accounts get a small amount, PRO gets more, and Team/Enterprise differs. HF bills provider usage against that bucket when you route through HF. (Hugging Face)

  2. You are using that provider directly (or via a BYO key) and exhausted the provider’s own monthly quota.
    If you use “custom provider keys,” billing and limits are handled by the provider, not HF credits. (Hugging Face)

Either way, the trigger is the same pattern: too many requests and tokens.


Causes in short

  • Multi-step agent loops. Default max_steps can be high (often 20). Each step can be a model call. (Hugging Face)
  • Multi-agent multiplication. A coordinator agent can call other agents, and each sub-agent has its own multi-step loop.
  • Agentic RAG retries. The agent may retrieve, judge results, then retrieve again with a refined query.
  • Long context. If you keep appending conversation history, token usage spikes. Tokens drive cost and can hit limits quickly.
  • Large max_tokens / max_new_tokens. More output tokens means more cost per call.
  • Errors cause retries. Code agents that fail a tool call often try again until max_steps is reached.

Solutions in short

  • Hard-cap steps. Lower max_steps aggressively (start with 3 to 6). (Hugging Face)
  • Hard-cap output tokens. Set max_tokens low (start 200 to 500). (Hugging Face)
  • Use smaller models for “internal thinking.” Save bigger models for the final response.
  • Pick cheaper providers or “cheapest” variants in the HF model catalog. The catalog surfaces cheapest and fastest options per model. (Hugging Face)
  • Bring your own key. If HF credits are the issue, use a provider API key directly or via HF “custom provider keys” so you are not blocked by HF’s included credits. (Hugging Face)
  • Switch to local inference for the course exercises (Ollama, vLLM, llama.cpp). No monthly API quota. (Ollama Documentation)

Practical “make it stop costing money” knobs for the Agents Course

1) Put a strict budget on the agent loop

In smolagents, the multi-step agent loop is controlled by max_steps. Lower it first. (Hugging Face)

Example pattern (conceptual):

  • Coordinator agent: max_steps=4
  • Each sub-agent: max_steps=2 to 4
  • Disable fancy planning unless you need it (if your code uses a “planning interval,” increase the interval or turn planning off)

Why this works: even if the model is confused, it cannot burn 20 calls per query.
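
A minimal sketch of that pattern, assuming a recent smolagents release where the HF model class is called InferenceClientModel and agents accept max_steps, name, and description (adjust names to your installed version):

```python
from smolagents import CodeAgent, ToolCallingAgent, InferenceClientModel, DuckDuckGoSearchTool

model = InferenceClientModel(model_id="Qwen/Qwen2.5-72B-Instruct")

# Sub-agent with a tight loop: even a confused model cannot burn many calls.
web_agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
    max_steps=3,
    name="web_search",
    description="Runs web searches and returns short summaries.",
)

# Coordinator agent, also capped; planning_interval left unset means no extra planning calls.
manager = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[web_agent],
    max_steps=4,
)

manager.run("Find the release year of the original Transformer paper.")
```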

2) Limit output tokens and rate-limit requests

InferenceClientModel supports request throttling and standard generation controls. (Hugging Face)

Key levers:

  • max_tokens: cap output length
  • requests_per_minute: slow down bursty loops so they do not blow a quota instantly (Hugging Face)
  • Use smaller/cheaper model IDs for routine steps

Also follow the general guidance: set budgets and token limits early, before the loop starts running away.
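
A hedged sketch of those levers, assuming the InferenceClientModel parameter names from the smolagents docs cited above (check your installed version's signature if they differ):

```python
from smolagents import InferenceClientModel

model = InferenceClientModel(
    model_id="Qwen/Qwen2.5-7B-Instruct",  # smaller, cheaper model for routine internal steps
    max_tokens=300,                        # cap output length per call
    requests_per_minute=10,                # throttle bursty agent loops
)
```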

3) Reduce multi-agent chatter

In multi-agent demos, the biggest hidden cost is agents talking to each other with long transcripts.

Tactics:

  • Keep inter-agent messages short. Summaries, not full logs.
  • Reset history between tasks unless you truly need memory.
  • Avoid “reflect” or “self-critique” loops for routine tasks. Those are extra calls.
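
As one concrete example, smolagents starts each run with fresh memory unless you opt out, so the expensive pattern is usually reset=False plus long transcripts. A small sketch, assuming the reset flag behaves as documented:

```python
from smolagents import CodeAgent, InferenceClientModel

agent = CodeAgent(
    tools=[],
    model=InferenceClientModel(model_id="Qwen/Qwen2.5-7B-Instruct"),
    max_steps=4,
)

# Each run starts with fresh memory by default, which keeps context (and tokens) small.
first = agent.run("Summarize the findings in 3 bullet points.")

# reset=False carries the whole previous transcript into the next call's context;
# that is exactly the hidden token growth to avoid for routine tasks.
follow_up = agent.run("Now turn that summary into a table.", reset=False)
```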

4) Choose cheaper serving options inside HF routing

HF’s inference model catalog shows multiple providers for the same model and labels like cheapest and fastest. Use that as a quick way to avoid accidentally picking a pricey combo. (Hugging Face)
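
For instance, after checking the provider list and pricing on the model's HF page, you can pin a specific provider instead of letting routing choose; the provider string below is illustrative, so use one actually listed for your model:

```python
from smolagents import InferenceClientModel

model = InferenceClientModel(
    model_id="Qwen/Qwen2.5-7B-Instruct",
    provider="hf-inference",  # example; substitute the cheapest provider listed for this model
)
```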


What to do right now if you are blocked today

If it is HF included credits

  • Wait for the monthly reset (if you want to stay on free credits).
  • Upgrade to PRO if you want more included credits and fewer interruptions. HF users commonly hit this during the course. (Hugging Face Forums)
  • Use a different backend (local or another API) for the course notebooks.

If it is provider monthly usage (Novita or Together account limits)

  • Check that provider dashboard for quota reset date.
  • Add payment method or upgrade plan on that provider.
  • Switch providers for the remainder of the course.

Good alternative APIs with free tiers or low-cost testing

“Free API” almost always means “small free tier with strict rate limits.” For learning and small demos, that is often enough.

1) OpenRouter

Pros:

  • One API for many models, including :free variants.
  • Clear published free plan limits.

Published limits:

  • Free plan: 50 requests/day and 20 RPM. (OpenRouter)

Best use:

  • Course exercises, small agents, quick tests.
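
A minimal call through OpenRouter's OpenAI-compatible endpoint; the :free model id below is only an example, so check the current free-model list before using it:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct:free",  # example :free model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```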

2) Cloudflare Workers AI

Pros:

  • Real daily free allocation.
  • Good for lightweight agents and utility calls.

Published free allocation:

  • A daily free allotment of Neurons (Cloudflare's Workers AI usage unit), currently 10,000 Neurons per day. (Cloudflare Docs)

Best use:

  • Small chat agents, classification, embeddings, simple pipelines.
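
A minimal Workers AI call over its REST endpoint; the model id is an example, and the account id and token are placeholders you get from your Cloudflare dashboard:

```python
import requests

ACCOUNT_ID = "your_account_id"
API_TOKEN = "your_workers_ai_api_token"

# Run a hosted chat model through the Workers AI REST endpoint.
resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"messages": [{"role": "user", "content": "Say hello in one sentence."}]},
    timeout=60,
)
print(resp.json())
```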

3) Google Gemini API

Pros:

  • Has an “unpaid quota” concept and documented rate limits.
  • Good general model quality.

Docs:

  • Gemini API has published quota and rate-limit docs, and their terms explicitly reference “unpaid quota.” (Google AI for Developers)

Best use:

  • General agent reasoning and tool-calling experiments.
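
A minimal Gemini call with the google-genai SDK; the model id is an example, so pick one from the current Gemini model list:

```python
from google import genai  # pip install google-genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

resp = client.models.generate_content(
    model="gemini-2.0-flash",  # example model id
    contents="Say hello in one sentence.",
)
print(resp.text)
```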

4) Groq API

Pros:

  • Very fast inference.
  • OpenAI-compatible base URL, which makes integration easier.

Docs:

  • Groq publishes rate-limit docs and model pages listing rate limits. (GroqCloud)

Best use:

  • Multi-step agents where latency matters.
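
A minimal Groq call via its OpenAI-compatible endpoint; the model id is an example from their catalog:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # example Groq model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```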

5) Mistral API

Pros:

  • Solid models for agentic tasks.
  • A free API tier exists, but it is restrictive.

Docs:

  • Mistral states there is a free API tier with restrictive rate limits. (Mistral AI)

Best use:

  • Evaluation, prototyping, smaller-scale projects.
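
A minimal call with the mistralai Python SDK (v1-style client); the model id is an example:

```python
from mistralai import Mistral  # pip install mistralai

client = Mistral(api_key="YOUR_MISTRAL_API_KEY")

resp = client.chat.complete(
    model="mistral-small-latest",  # example model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```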

Best local options for the Agents Course

Local is the most reliable workaround because there is no monthly API quota. You trade money limits for compute limits.

Option A: Ollama (easy local server)

Pros:

  • One-command model pulls.
  • OpenAI-compatible endpoints, so many agent frameworks can connect with minimal changes. (Ollama Documentation)

Best for:

  • Running Llama-family or Qwen-family instruct models locally on CPU or GPU.
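
For the course notebooks, one low-friction hookup is pointing smolagents at Ollama's OpenAI-compatible endpoint. A sketch, assuming your smolagents version ships OpenAIServerModel and you have already pulled a model (e.g. `ollama pull qwen2.5:7b`):

```python
from smolagents import CodeAgent, OpenAIServerModel

# Ollama serves an OpenAI-compatible API at localhost:11434/v1 by default.
model = OpenAIServerModel(
    model_id="qwen2.5:7b",
    api_base="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key, but the client requires one
)

agent = CodeAgent(tools=[], model=model, max_steps=4)
agent.run("Say hello in one sentence.")
```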

Option B: vLLM (best throughput if you have a GPU)

Pros:

  • High performance serving for transformer models.
  • Good for multi-agent workloads on a single machine.

Best for:

  • If you have an NVIDIA GPU and want speed.
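
A sketch of the usual vLLM workflow: launch its OpenAI-compatible server, then talk to it with the standard OpenAI client (the model id is an example you would substitute):

```python
# Launch the server in a shell first:
#   vllm serve Qwen/Qwen2.5-7B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```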

Option C: llama.cpp or GGUF-based stacks (CPU-friendly)

Pros:

  • Runs well on CPU, supports quantized models.
  • Great for laptops.

Best for:

  • Lightweight agent demos and learning.
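
A sketch using the llama-cpp-python bindings with a quantized GGUF file; the file path is a placeholder for a checkpoint you have downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a quantized GGUF checkpoint (placeholder path).
llm = Llama(model_path="./qwen2.5-7b-instruct-q4_k_m.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=100,
)
print(out["choices"][0]["message"]["content"])
```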

A simple decision rule

  • If you want the course notebooks to “just run” without billing surprises: use local (Ollama).
  • If you want hosted but low friction: use OpenRouter :free models, accept the daily cap. (OpenRouter)
  • If you want a clean commercial API with a free tier: Gemini or Mistral, but expect strict limits. (Mistral AI)
  • If you want very fast agent loops: Groq. (GroqCloud)

Curated references

  • HF Inference Providers pricing and credit model. (Hugging Face)
  • smolagents InferenceClientModel controls (requests_per_minute, etc.). (Hugging Face)
  • smolagents multi-step agent loop and max_steps. (Hugging Face)
  • OpenRouter free plan limits and :free model limits. (OpenRouter)
  • Cloudflare Workers AI free allocation. (Cloudflare Docs)
  • Ollama OpenAI compatibility for local serving. (Ollama Documentation)
  • Mistral free API tier note (restrictive). (Mistral AI)

Summary

  • This happened because multi-agent and agentic loops multiply LLM calls and tokens fast. (Hugging Face)
  • Fix fastest: lower max_steps, lower max_tokens, use smaller models, reduce inter-agent chatter. (Hugging Face)
  • If you are blocked by monthly credits: switch backend. Best “no surprises” choice is local Ollama. (Ollama Documentation)
  • Good hosted alternatives with real free tiers: OpenRouter and Cloudflare Workers AI. (OpenRouter)