Groq + Open-Source Models: Fast API Inference Without Hosting

This is Part 6 of the AI Agents series. Parts 1–5 covered how LLMs work, practical usage, open-source vs paid models, making API calls, and controlling output with parameters. This post solves a specific problem: how do you use open-source models via API without hosting them yourself?

1. The infrastructure problem with open-source models

Open-source models are free to download. Running them is not.

A 70B-parameter model needs significant GPU memory (about 140 GB for the weights alone at 16-bit precision), serious compute, and continuous uptime if your app is live. If you spin up a cloud GPU instance for that, you pay for it around the clock, whether you have traffic or not. At low request volumes, that cost doesn’t make sense.
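
A rough way to size “significant GPU memory” is parameters × bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the helper name is made up for illustration, and it ignores the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# 70B parameters at 16-bit precision (2 bytes per parameter)
fp16_gb = weight_memory_gb(70, 2)
# The same model quantized to 4 bits (0.5 bytes per parameter)
int4_gb = weight_memory_gb(70, 0.5)

print(f"70B @ 16-bit: ~{fp16_gb:.0f} GB")
print(f"70B @ 4-bit:  ~{int4_gb:.0f} GB")
```

Even the 4-bit estimate is more than a single 24 GB consumer GPU holds, which is why hosting a model of this size yourself usually means renting multi-GPU hardware.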

You need a middle path: open-source models, API access, someone else’s infrastructure.

2. What Groq is and why the speed matters

Groq is an inference company that hosts open-source models and exposes them via API. You send a request, they run it on their hardware, you get a response.

What makes Groq worth knowing about is token generation speed. Groq runs models on LPUs (Language Processing Units) — custom chips designed specifically for token generation. The result is noticeably faster output compared to GPU-based inference at the same model size.

For a user watching words appear on screen, the difference between 200 tokens/sec and 50 tokens/sec is the difference between “this feels fast” and “this feels broken.”
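
To make those rates concrete, here is the wall-clock arithmetic for a single response. The 500-token response length is an assumed figure for illustration, not a measurement, and the calculation ignores network latency and prompt processing.

```python
def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate num_tokens at a steady tokens_per_sec rate."""
    return num_tokens / tokens_per_sec

RESPONSE_TOKENS = 500  # assumed length of one answer (illustrative)

for rate in (50, 200):
    secs = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{rate} tokens/sec -> {secs:.1f} s for {RESPONSE_TOKENS} tokens")
```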

Groq also offers a free tier, which means you can get started without a credit card.

3. Setup: API key and environment variable

Create an account at console.groq.com — sign in with Google or GitHub. Generate an API key from the dashboard.

Never put the key directly in your code. Store it as an environment variable:

export GROQ_API_KEY="your-key-here"

Then read it in Python (this needs the groq package; the pip install groq command is shown in section 5):

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
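
Note that os.environ["GROQ_API_KEY"] raises a bare KeyError when the variable is missing. A small wrapper can turn that into a readable message. This is only a sketch; load_api_key is a hypothetical helper name, not part of the Groq SDK.

```python
import os

def load_api_key(env=None, name="GROQ_API_KEY"):
    """Return the API key from the environment, or raise a readable error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f'{name} is not set. Run: export {name}="your-key-here"'
        )
    return key

# Usage (requires the groq package and a real key):
# from groq import Groq
# client = Groq(api_key=load_api_key())
```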

4. Available models

Groq’s production lineup as of May 2026:

Model ID	Parameters	Use case
`llama-3.1-8b-instant`	8B	Fast, lightweight
`llama-3.3-70b-versatile`	70B	Strong general-purpose
`openai/gpt-oss-20b`	20B	OpenAI open-weight, efficient
`openai/gpt-oss-120b`	120B	OpenAI open-weight flagship
`whisper-large-v3`	—	Speech-to-text

Preview models (not production-stable, may be removed without notice): Llama 4 Scout, Llama Prompt Guard 2, Qwen3-32B.

Note on model versions: The models above are accurate as of May 2026. Groq’s lineup changes frequently — models get added, deprecated, or moved between preview and production tiers. Always verify the current list at console.groq.com/docs/models before hardcoding a model name in production code.
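
One way to avoid hardcoding a single model name is to keep an ordered preference list and pick the first entry the API currently reports. The sketch below uses stand-in data so it runs offline. pick_model is a hypothetical helper, and the commented line assumes the SDK's client.models.list() call returns an object with a .data list whose items have an .id.

```python
def pick_model(available_ids, preferred):
    """Return the first preferred model ID that is currently available."""
    available = set(available_ids)
    for model_id in preferred:
        if model_id in available:
            return model_id
    raise LookupError(f"None of {list(preferred)} are currently available")

# With a real client, the IDs would come from the models endpoint:
# available_ids = [m.id for m in client.models.list().data]
available_ids = ["llama-3.1-8b-instant", "openai/gpt-oss-20b"]  # stand-in data

model = pick_model(
    available_ids,
    ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"],  # most preferred first
)
print(model)
```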

5. Your first Groq API call

pip install groq

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ]
)

print(response.choices[0].message.content)

The response object is identical in shape to what OpenAI returns. Same .choices[0].message.content path.
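
Because the shape matches, the same parsing helper can serve both providers. The sketch below also checks finish_reason, which in the OpenAI-style response is "length" when the model stopped at the token cap. extract_text is a hypothetical helper, and the fake object is a stand-in that only mimics the attribute path.

```python
from types import SimpleNamespace

def extract_text(response):
    """Return (text, truncated) from a chat completion response."""
    choice = response.choices[0]
    text = choice.message.content or ""
    # finish_reason == "length" means generation stopped at the token cap
    truncated = choice.finish_reason == "length"
    return text, truncated

# Stand-in object with the same attribute path as a real response
fake = SimpleNamespace(choices=[SimpleNamespace(
    message=SimpleNamespace(content="partial answer"),
    finish_reason="length",
)])

text, truncated = extract_text(fake)
print(text, "(truncated)" if truncated else "")
```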

6. OpenAI compatibility: migrate existing code in minutes

Groq exposes an OpenAI-compatible endpoint. If you already have code written against the OpenAI SDK, you can point it at Groq by changing the base_url, swapping the key, and setting model to one that Groq hosts:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ]
)

print(response.choices[0].message.content)

Your request shape, message format, and response parsing all stay the same. This makes it easy to benchmark Groq vs OpenAI on the same task without rewriting your app.
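
For side-by-side runs, one option is to keep the per-provider differences in a small config table and build the client from it. This is a sketch under assumptions: PROVIDERS and client_settings are hypothetical names, OPENAI_API_KEY is the conventional variable name for an OpenAI key, and gpt-4o is used only as an example model on the OpenAI side.

```python
PROVIDERS = {
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "api_key_env": "GROQ_API_KEY",
        "model": "llama-3.3-70b-versatile",
    },
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY",
        "model": "gpt-4o",  # example only
    },
}

def client_settings(provider, env):
    """Return (client keyword arguments, model ID) for the named provider."""
    cfg = PROVIDERS[provider]
    kwargs = {"api_key": env[cfg["api_key_env"]], "base_url": cfg["base_url"]}
    return kwargs, cfg["model"]

# Usage (requires the openai package and real keys):
# import os
# from openai import OpenAI
# kwargs, model = client_settings("groq", os.environ)
# client = OpenAI(**kwargs)
```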

7. Rate limits on the free tier

For llama-3.3-70b-versatile on the free tier:

Limit	Value
Requests per minute	30
Requests per day	1,000
Tokens per minute	12,000
Tokens per day	100,000

These are hard limits. Hit them and you get a 429 Too Many Requests response.
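
It helps to work out which limit binds first for your request size. The sketch below uses the numbers from the table above and assumes, as a simplification, that both prompt and completion tokens count against the token limits and that every call uses its full max_tokens. max_requests is a hypothetical helper.

```python
def max_requests(request_limit, token_limit, tokens_per_request):
    """Worst-case number of requests allowed by both limits in one window."""
    return min(request_limit, token_limit // tokens_per_request)

TOKENS_PER_REQUEST = 200 + 300  # assumed: 200-token prompt, max_tokens=300

per_minute = max_requests(30, 12_000, TOKENS_PER_REQUEST)
per_day = max_requests(1_000, 100_000, TOKENS_PER_REQUEST)

print(f"Per minute: {per_minute} requests")
print(f"Per day:    {per_day} requests")
```

Under these assumptions the daily token limit, not the request count, is what runs out first.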

Design for this from the start:

Set max_tokens on every call to cap output length and avoid burning through your tokens-per-minute (TPM) budget on one long response
Add retry logic with exponential backoff for 429 errors
If you need more headroom, Groq’s paid Developer plan has significantly higher limits

The exact numbers can change — always verify at console.groq.com/settings/limits.
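
A minimal sketch of retrying on 429 with exponential backoff follows. It is not the SDK's built-in behavior. call_with_backoff is a hypothetical helper, and it assumes the raised exception exposes the HTTP status as a status_code attribute. The sleep function is injectable so the logic can be tested without waiting.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a 429, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Assumption: the error carries the HTTP status as .status_code
            is_rate_limit = getattr(exc, "status_code", None) == 429
            if not is_rate_limit or attempt == max_retries:
                raise  # not a rate limit, or out of retries
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))

# Usage (requires a configured client):
# response = call_with_backoff(lambda: client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": "Tell me about Baahubali"}],
#     max_tokens=300,
# ))
```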

8. Streaming responses

Part 4 covered streaming with the OpenAI SDK. The pattern is identical on Groq:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Tell me about Baahubali"}
    ],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

With Groq’s fast token generation, streaming feels especially responsive — words appear almost immediately. For any user-facing interface, always stream.
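
In an app you usually want the full text as well as the live output, and time to first token is a handy number to log. The sketch below wraps the same loop. consume_stream is a hypothetical helper, the guard for chunks without choices is a defensive assumption, and the fake stream is stand-in data that only mimics the chunk shape.

```python
import time
from types import SimpleNamespace

def consume_stream(stream, clock=time.perf_counter):
    """Print chunks as they arrive; return (full_text, seconds_to_first_token)."""
    start = clock()
    first_token_at = None
    parts = []
    for chunk in stream:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = clock() - start
            parts.append(delta)
            print(delta, end="", flush=True)
    return "".join(parts), first_token_at

def _chunk(text):
    """Build a stand-in chunk with the same attribute path as a real one."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

fake_stream = [_chunk("Hello"), SimpleNamespace(choices=[]), _chunk(None), _chunk(" world")]
text, ttft = consume_stream(fake_stream)
```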

9. Evaluating speed fairly

Groq benchmarks often show faster token generation than other providers. That’s real — LPUs are purpose-built for this.

But the comparison is only fair when you’re looking at the same model. A Llama 70B on Groq versus GPT-4o is not a fair speed comparison: they are different models with different capabilities.

Evaluate on two dimensions independently:

Speed — where Groq often wins for the same model size
Quality — which depends on the model, not the infrastructure
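
For the speed dimension, a simple approach is to time the same prompt on the same model with each provider and divide completion tokens by elapsed time. This is a rough sketch: tokens_per_second is a hypothetical helper, it relies on the OpenAI-style usage.completion_tokens field, and the result includes network latency and prompt processing, so treat it as an end-to-end figure. The fake response and clock are stand-ins so the example runs offline.

```python
import time
from types import SimpleNamespace

def tokens_per_second(call, clock=time.perf_counter):
    """Time one non-streaming call; return completion tokens per second."""
    start = clock()
    response = call()
    elapsed = clock() - start
    return response.usage.completion_tokens / elapsed

# Stand-ins: a response reporting 400 completion tokens, and a clock that
# reads 0.0 s before the call and 2.0 s after it.
fake_response = SimpleNamespace(usage=SimpleNamespace(completion_tokens=400))
ticks = iter([0.0, 2.0])

rate = tokens_per_second(lambda: fake_response, clock=lambda: next(ticks))
print(f"{rate:.0f} tokens/sec")
```

Run it several times per provider and compare medians, since a single call is noisy.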

Groq is an excellent choice when you want open-source models with API convenience and low latency. It is not a replacement for frontier closed models on tasks that need them.

What’s next

Part 7 goes one step further: running open-source models entirely locally on your laptop — no API, no cloud, no data leaving your machine. That post covers the VRAM math, quantization, Ollama, and what’s realistically runnable on consumer hardware.

Full video walkthrough is embedded above.