Groq + Open-Source Models: Fast API Inference Without Hosting
Open-source models are free, but hosting them is not. Learn how to use Groq to run Llama and other open models via API, understand free-tier limits, and ship your first Groq-powered call in Python.
This is Part 6 of the AI Agents series. Parts 1–5 covered how LLMs work, practical usage, open-source vs paid models, making API calls, and controlling output with parameters. This post solves a specific problem: how do you use open-source models via API without hosting them yourself?
1. The infrastructure problem with open-source models
Open-source models are free to download. Running them is not.
A 70B parameter model needs roughly 140 GB of GPU memory for the weights alone at 16-bit precision (70 billion parameters × 2 bytes each), plus serious compute and continuous uptime if your app is live. If you spin up a cloud GPU instance for that, you pay for it around the clock, whether you have traffic or not. At low request volumes, that cost doesn’t make sense.
You need a middle path: open-source models, API access, someone else’s infrastructure.
2. What Groq is and why the speed matters
Groq is an inference company that hosts open-source models and exposes them via API. You send a request, they run it on their hardware, you get a response.
What makes Groq worth knowing about is token generation speed. Groq runs models on LPUs (Language Processing Units) — custom chips designed specifically for token generation. The result is noticeably faster output compared to GPU-based inference at the same model size.
For a user watching words appear on screen, the difference between 200 tokens/sec and 50 tokens/sec is the difference between “this feels fast” and “this feels broken.”
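To put numbers on that gap, here is a small arithmetic sketch. The two rates are the ones quoted above; the 500-token answer length is an assumed figure for illustration, and the calculation ignores network and queueing delay.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response, ignoring network and queueing delay."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Assumed answer length for illustration: 500 output tokens.
slow = generation_seconds(500, 50)    # 10.0 seconds
fast = generation_seconds(500, 200)   # 2.5 seconds
print(f"50 tok/s: {slow:.1f}s, 200 tok/s: {fast:.1f}s")
```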
Groq also offers a free tier, which means you can get started without a credit card.
3. Setup: API key and environment variable
Create an account at console.groq.com — sign in with Google or GitHub. Generate an API key from the dashboard.
Never put the key directly in your code. Store it as an environment variable:
export GROQ_API_KEY="your-key-here"
Then read it in Python:
import os
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
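If the variable is missing, os.environ["GROQ_API_KEY"] raises a bare KeyError. A minimal sketch of a friendlier check follows; load_groq_api_key is a hypothetical helper name, not part of the Groq SDK.

```python
import os


def load_groq_api_key(env=os.environ) -> str:
    """Return the Groq API key, or raise a readable error if it is unset."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            'GROQ_API_KEY is not set. Run: export GROQ_API_KEY="your-key-here"'
        )
    return key


# Usage, assuming the groq package is installed:
#   from groq import Groq
#   client = Groq(api_key=load_groq_api_key())
```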
4. Available models
Groq’s production lineup as of mid-2026:
| Model ID | Parameters | Use case |
|---|---|---|
| llama-3.1-8b-instant | 8B | Fast, lightweight |
| llama-3.3-70b-versatile | 70B | Strong general-purpose |
| openai/gpt-oss-20b | 20B | OpenAI open-weight, efficient |
| openai/gpt-oss-120b | 120B | OpenAI open-weight flagship |
| whisper-large-v3 | — | Speech-to-text |
Preview models (not production-stable, may be removed without notice): Llama 4 Scout, Llama Prompt Guard 2, Qwen3-32B.
Note on model versions: The models above are accurate as of May 2026. Groq’s lineup changes frequently — models get added, deprecated, or moved between preview and production tiers. Always verify the current list at console.groq.com/docs/models before hardcoding a model name in production code.
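One way to soften the impact of a deprecated model is to check the live model list at startup and fall back through a preference order. This is a sketch, not Groq's recommended pattern: pick_model and list_model_ids are hypothetical helpers, and list_model_ids assumes the SDK's client.models.list() returns an object whose .data entries each carry an .id, as in the OpenAI-style models endpoint.

```python
def pick_model(available_ids, preferred_ids):
    """Return the first preferred model ID that the API currently lists."""
    available = set(available_ids)
    for model_id in preferred_ids:
        if model_id in available:
            return model_id
    raise LookupError(f"None of {list(preferred_ids)} are currently available")


def list_model_ids(client):
    """Fetch model IDs from the API.

    Assumption: client.models.list() returns an object with a .data list
    whose entries have an .id attribute.
    """
    return [model.id for model in client.models.list().data]


# Usage with a real client:
#   model = pick_model(
#       list_model_ids(client),
#       ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"],
#   )
```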
5. Your first Groq API call
Install the SDK:
pip install groq
Then send a chat completion request:
import os
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ]
)
print(response.choices[0].message.content)
The response object follows the same shape as an OpenAI chat completion, so the text lives at the same .choices[0].message.content path.
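The same response also reports how many tokens the call consumed, which matters once rate limits come into play. A minimal sketch, assuming the response carries a .usage object with prompt_tokens, completion_tokens, and total_tokens as in the OpenAI response shape; summarize_usage is a hypothetical helper.

```python
def summarize_usage(response):
    """Return token counts from a chat completion response, or None.

    Assumption: response.usage exposes prompt_tokens, completion_tokens,
    and total_tokens, following the OpenAI response shape.
    """
    usage = getattr(response, "usage", None)
    if usage is None:
        return None
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }


# Usage with the response from the call above:
#   print(summarize_usage(response))
```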
6. OpenAI compatibility: migrate existing code in minutes
Groq exposes an OpenAI-compatible endpoint. If you already have code written against the OpenAI SDK, you can point it at Groq instead by changing the base_url and swapping the key:
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1"
)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ]
)
print(response.choices[0].message.content)
Your request shape, your message format, and your response parsing all stay the same. Only the key, the base URL, and the model name change. This makes it easy to benchmark Groq vs OpenAI on the same task without rewriting your app.
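For side-by-side benchmarking, the provider switch can live in one small table instead of being scattered through the app. This is a sketch under assumptions: PROVIDERS and client_kwargs are hypothetical names, OPENAI_API_KEY is assumed to be where your OpenAI key is stored, and base_url=None is assumed to leave the OpenAI SDK on its default endpoint.

```python
import os

# Hypothetical provider table for illustration. The Groq base URL is the one
# shown above; base_url=None leaves the OpenAI SDK on its default endpoint.
PROVIDERS = {
    "groq": {
        "key_env": "GROQ_API_KEY",
        "base_url": "https://api.groq.com/openai/v1",
    },
    "openai": {"key_env": "OPENAI_API_KEY", "base_url": None},
}


def client_kwargs(provider: str, env=os.environ) -> dict:
    """Build keyword arguments for OpenAI(...) for the chosen provider."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    config = PROVIDERS[provider]
    return {"api_key": env[config["key_env"]], "base_url": config["base_url"]}


# Usage, assuming the openai package is installed:
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("groq"))
```

The model name still has to match the provider, since each one serves a different catalog.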
7. Rate limits on the free tier
For llama-3.3-70b-versatile on the free tier:
| Limit | Value |
|---|---|
| Requests per minute | 30 |
| Requests per day | 1,000 |
| Tokens per minute | 12,000 |
| Tokens per day | 100,000 |
These are hard limits. Hit them and you get a 429 Too Many Requests response.
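A quick way to see which limit bites first is to work out how many requests of a given size fit in each window. The sketch below copies the numbers from the table above. It assumes the token limits count prompt and completion tokens together, and the 800-token request size is an illustrative figure.

```python
# Free-tier limits for llama-3.3-70b-versatile, copied from the table above.
LIMITS = {
    "requests_per_minute": 30,
    "requests_per_day": 1_000,
    "tokens_per_minute": 12_000,
    "tokens_per_day": 100_000,
}


def max_requests(tokens_per_request: int, limits=LIMITS) -> dict:
    """How many requests of a given total size fit in each window."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    per_minute = min(
        limits["requests_per_minute"],
        limits["tokens_per_minute"] // tokens_per_request,
    )
    per_day = min(
        limits["requests_per_day"],
        limits["tokens_per_day"] // tokens_per_request,
    )
    return {"per_minute": per_minute, "per_day": per_day}


# Assumed request size: 800 tokens (prompt plus completion).
print(max_requests(800))  # {'per_minute': 15, 'per_day': 125}
```

At that size the token budgets, not the request counts, are the binding constraint: 125 requests a day rather than 1,000.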
Design for this from the start:
- Set max_tokens on every call to cap output length and avoid burning through your TPM budget on one long response
- Add retry logic with exponential backoff for 429 errors
- If you need more headroom, Groq’s paid Developer plan has significantly higher limits
The exact numbers can change — always verify at console.groq.com/settings/limits.
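The retry advice above can be sketched as a small wrapper. It is deliberately SDK-agnostic: call_with_backoff is a hypothetical helper, and it assumes a rate-limit failure surfaces as an exception carrying status_code == 429. Check how your client library reports HTTP errors before relying on it.

```python
import random
import time


def call_with_backoff(make_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call make_request(), retrying with exponential backoff on HTTP 429.

    Assumption: rate-limit errors are exceptions with status_code == 429.
    Any other exception is re-raised immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception as exc:
            is_rate_limit = getattr(exc, "status_code", None) == 429
            if not is_rate_limit or attempt == max_retries:
                raise
            # Wait 1s, 2s, 4s, ... plus up to 25% random jitter so that
            # parallel workers do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))


# Usage with a real client:
#   response = call_with_backoff(
#       lambda: client.chat.completions.create(
#           model="llama-3.3-70b-versatile",
#           messages=[{"role": "user", "content": "Tell me about Baahubali"}],
#           max_tokens=256,
#       )
#   )
```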
8. Streaming responses
Part 4 covered streaming with the OpenAI SDK. The pattern is identical on Groq:
import os
from groq import Groq
client = Groq(api_key=os.environ["GROQ_API_KEY"])
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
With Groq’s fast token generation, streaming feels especially responsive — words appear almost immediately. For any user-facing interface, always stream.
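In practice you usually want the full text as well as the live output, for example to store it or append it to the conversation history. A minimal sketch follows; collect_stream is a hypothetical helper, and the guard for chunks without choices is a defensive assumption, not documented Groq behavior.

```python
def collect_stream(stream, write=print):
    """Print streamed text as it arrives and return the full response."""
    parts = []
    for chunk in stream:
        # Defensive assumption: some OpenAI-style streams include chunks
        # that carry metadata only and have no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            write(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)


# Usage with the stream created above:
#   full_text = collect_stream(stream)
```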
9. Evaluating speed fairly
Groq benchmarks often show faster token generation than other providers. That’s real — LPUs are purpose-built for this.
But the comparison is only fair when you’re looking at the same model. A Llama 70B on Groq versus GPT-4o is not a fair speed comparison — those models have very different capacities.
Evaluate on two dimensions independently:
- Speed — where Groq often wins for the same model size
- Quality — which depends on the model, not the infrastructure
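To measure the speed dimension on its own, time the same model, prompt, and max_tokens on each provider and compare output throughput. The sketch below uses wall-clock time, so it includes network latency and time to first token. tokens_per_second and benchmark are hypothetical helpers, and call is any function you supply that makes one request and returns its completion token count.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Output throughput for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def benchmark(call, runs=3, clock=time.perf_counter):
    """Run call() several times and return the middle tokens/sec value.

    call() must return the number of completion tokens it produced.
    With an odd number of runs the result is the median.
    """
    rates = []
    for _ in range(runs):
        start = clock()
        completion_tokens = call()
        rates.append(tokens_per_second(completion_tokens, clock() - start))
    rates.sort()
    return rates[len(rates) // 2]


# Usage with a real client, assuming the response reports usage:
#   def one_call():
#       r = client.chat.completions.create(
#           model="llama-3.3-70b-versatile",
#           messages=[{"role": "user", "content": "Tell me about Baahubali"}],
#           max_tokens=256,
#       )
#       return r.usage.completion_tokens
#   print(benchmark(one_call))
```

Taking the middle value over several runs keeps one slow or queued request from skewing the comparison.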
Groq is an excellent choice when you need open-source model quality with API convenience and low latency. It is not a replacement for frontier closed models on tasks that need them.
What’s next
Part 7 goes one step further: running open-source models entirely locally on your laptop — no API, no cloud, no data leaving your machine. That post covers the VRAM math, quantization, Ollama, and what’s realistically runnable on consumer hardware.