Your First LLM API Call: OpenAI, Streaming, and System Prompts
Stop using the chat UI — connect to LLMs directly in Python. Covers OpenAI and Groq setup, streaming vs non-streaming, picking the right model for cost, and controlling behavior with system prompts.
This is Part 4 of the AI Agents series. Parts 1–3 covered how LLMs work, how to use them practically, and when to choose open-source vs paid. Now we write actual code.
This is where the series shifts from theory to building. Everything from here on is hands-on.
Two platforms to know
OpenAI: paid, closed-source models (GPT-4o, GPT-4o-mini, o3). You pay per token, and these models typically score near the top of public benchmarks.
Groq: an inference company that hosts open-source models (Llama, Mistral, Gemma). A free tier is available, which makes it great for experimenting without a credit card.
This post uses the OpenAI SDK, but the patterns — streaming, system prompts, roles — apply to every provider.
Step 1: Get your API key
Go to the OpenAI developer platform → your profile → API keys → Create a new secret key. Give it a name like test-key.
Copy it immediately. The platform won’t show it again. If it gets compromised, delete it and generate a new one.
Step 2: Install the SDK and create a client
pip install openai
from openai import OpenAI
client = OpenAI(api_key="your-api-key-here")
That client object is your entry point to every model OpenAI offers.
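Hardcoding the key is fine for a first test, but it is easy to commit to version control by accident. A minimal sketch, assuming the key has been exported as the OPENAI_API_KEY environment variable (the name the OpenAI SDK looks up by default when no api_key is passed). The helper names load_api_key and make_client are illustrative and not part of the SDK:

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running this script.")
    return key


def make_client():
    """Build the client so the key never appears in source code."""
    # Imported here so load_api_key() can be used even without the SDK installed.
    from openai import OpenAI

    return OpenAI(api_key=load_api_key())
```

Calling make_client() then gives the same kind of client object as above, with the secret kept out of the script.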
Step 3: Pick the right model
Two common choices and why cost matters:
| Model | Relative cost (per 1M tokens) | When to use |
|---|---|---|
| gpt-4o | Higher | Complex reasoning tasks |
| gpt-4o-mini | Much cheaper | Most use cases; start here |
For experimentation and most real applications, start with gpt-4o-mini. It’s cheap enough that mistakes don’t hurt. If you hit quality limits on a specific task, upgrade.
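To see why the model choice matters, it helps to put numbers on a workload. The sketch below shows only the arithmetic: the prices are placeholder values, not real rates (check the provider's pricing page for current ones), and estimate_cost is a hypothetical helper.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars, given per-1M-token prices for input and output."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Placeholder prices in dollars per 1M tokens. These are NOT real rates.
PLACEHOLDER_PRICES = {
    "large-model": {"input": 10.0, "output": 30.0},
    "small-model": {"input": 0.5, "output": 1.5},
}

# Hypothetical workload: 10,000 requests with 500 input and 300 output tokens each.
num_requests = 10_000
for name, price in PLACEHOLDER_PRICES.items():
    cost = estimate_cost(num_requests * 500, num_requests * 300,
                         price["input"], price["output"])
    print(f"{name}: ${cost:.2f}")
```

Swapping in real prices and your own token counts gives a quick estimate before committing to a model.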
Step 4: Make your first call
response = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me about Bahubali"
)
print(response.output_text)
This works — but there’s a problem with how it feels to use.
Streaming vs Non-Streaming
By default, the API waits until the entire response is generated before returning anything. For a 1000-word answer, that can mean many seconds of a blank screen, long enough that a user may assume the app is broken.
LLMs generate text token-by-token internally — streaming just surfaces that in real time.
Non-streaming (default):
response = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me about Bahubali"
)
print(response.output_text)
# → nothing until generation finishes, then the full answer appears at once
Streaming:
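stream = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me about Bahubali",
    stream=True,  # return events as they are generated
)
for event in stream:
    # Text arrives in small "delta" events; other event types are skipped
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
# → words appear immediately as they generate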
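In practice you usually want to keep the streamed text as well as display it. A minimal sketch, assuming each event exposes a type and, for text deltas, a delta string (the shape of the Responses API's response.output_text.delta events). print_and_collect is a hypothetical helper, and the fake events stand in for a live stream so the block runs without an API key:

```python
from types import SimpleNamespace
from typing import Iterable


def print_and_collect(events: Iterable) -> str:
    """Print text deltas as they arrive and return the full text at the end."""
    parts = []
    for event in events:
        # Ignore lifecycle events (created, completed, ...); keep only text deltas.
        if getattr(event, "type", None) == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            parts.append(event.delta)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Fake events standing in for a real stream (assumed shapes, for illustration).
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Bahubali is "),
    SimpleNamespace(type="response.output_text.delta", delta="a two-part film."),
    SimpleNamespace(type="response.completed"),
]
full_text = print_and_collect(fake_events)
```

With a live client, the same function could be handed the object returned by client.responses.create(..., stream=True).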
When to use which
| Mode | Use case |
|---|---|
| Streaming | Any user-facing output — chat, Q&A, real-time generation |
| Non-streaming | Background tasks where nobody is watching — auto-generating emails, batch processing |
If you’re building something a human will read, stream it.
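The number that matters for perceived speed is the time until the first chunk appears, not the total time. A small sketch for measuring both: fake_model is a simulated stand-in for a real stream, the delay is an arbitrary assumption, and measure_stream is a hypothetical helper that accepts any iterable of text chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, str]:
    """Return (seconds until first chunk, total seconds, full text)."""
    start = clock()
    first_at = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            first_at = clock() - start  # the moment the user first sees output
        parts.append(chunk)
    total = clock() - start
    if first_at is None:  # empty stream: nothing ever arrived
        first_at = total
    return first_at, total, "".join(parts)


def fake_model(tokens: Iterable[str], delay: float = 0.01) -> Iterator[str]:
    """Simulated model: yields one token every `delay` seconds."""
    for token in tokens:
        time.sleep(delay)
        yield token


first, total, text = measure_stream(fake_model(["Stream", "ing ", "feels ", "faster"]))
print(f"first chunk after {first:.3f}s, finished after {total:.3f}s")
```

Without streaming, the user waits the full total before seeing anything; with streaming, they wait only until the first chunk.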
Model reasoning: small vs large
Ask gpt-4o-mini about “Bahubali”:
Returns the plot of the movie franchise directly.
Ask a reasoning-capable model (like o3) the same question:
Pauses, considers that “Bahubali” could refer to the blockbuster film series or to a figure in Indian religious history — then addresses both before answering.
Neither is wrong. The reasoning model is more thorough but slower and more expensive.
Rule of thumb:
- Simple, well-defined tasks → small cheap model
- Ambiguous questions, multi-step reasoning, nuanced decisions → reasoning model
Getting this right keeps your costs down and your margins healthy.
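The rule of thumb can be written down as a tiny routing function. This is only a sketch: choose_model and its two flags are hypothetical, and in a real application the flags would have to come from your own heuristics or from the feature that is making the call.

```python
def choose_model(ambiguous: bool = False, multi_step: bool = False) -> str:
    """Return a model name following the rule of thumb above."""
    if ambiguous or multi_step:
        return "o3"  # reasoning model: more thorough, but slower and pricier
    return "gpt-4o-mini"  # default: cheap and fast


print(choose_model())                # simple, well-defined task
print(choose_model(ambiguous=True))  # question with several possible readings
```

Keeping the decision in one place makes it easy to change the default later without touching every call site.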
Controlling behavior with System Prompts
By default, the model answers anything within its knowledge. You can constrain and shape that behavior using roles.
The API accepts several roles (assistant, for example, carries the model's earlier replies). Two matter here:
| Role | Purpose |
|---|---|
| user | The question or prompt |
| system | Instructions that define the model’s persona, scope, and constraints |
Problem: Ask a smart model about “Bahubali” and it might go into religious history — not what you want for a movie recommendation app.
Fix: give it a system prompt
response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a movie buff. You only have knowledge about films and cinema. Do not discuss religion, history, or anything outside movies.",
    input="Tell me about Bahubali"
)
print(response.output_text)
# → Only movie content, religious history ignored
Ask about “Sri Ramadasu” next — same constraint applies. The model stays in its lane.
System prompts are how you turn a general-purpose LLM into a specialized assistant for your product.
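Since the same instructions apply to every question, it is convenient to define them once and reuse them. A minimal sketch: movie_request is a hypothetical helper that only builds the keyword arguments for client.responses.create(), so it runs without an API key.

```python
MOVIE_INSTRUCTIONS = (
    "You are a movie buff. You only have knowledge about films and cinema. "
    "Do not discuss religion, history, or anything outside movies."
)


def movie_request(question: str, model: str = "gpt-4o-mini") -> dict:
    """Build keyword arguments for client.responses.create()."""
    return {"model": model, "instructions": MOVIE_INSTRUCTIONS, "input": question}


for question in ["Tell me about Bahubali", "Tell me about Sri Ramadasu"]:
    kwargs = movie_request(question)
    print(kwargs["input"], "->", kwargs["model"])
    # With a live client: response = client.responses.create(**kwargs)
```

Changing the persona for the whole app then means editing one string.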
Alternate syntax: Chat Completions API
There’s a second way to call the API: the Chat Completions endpoint. For this use case it does the same job, but it takes a messages list instead of separate input and instructions fields, and the reply text lives under choices[0].message.content:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a movie buff. Only discuss films."},
        {"role": "user", "content": "Tell me about Bahubali"}
    ]
)
print(response.choices[0].message.content)
Streaming works here too: pass stream=True and iterate over the chunks it returns.
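Each streamed Chat Completions chunk carries its piece of text under choices[0].delta.content, which can be None, and some chunks may have no choices at all. A sketch of handling that: collect_chat_stream and fake_chunk are hypothetical helpers, and the fake chunks only mimic the assumed shape so the block runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_chat_stream(chunks: Iterable) -> str:
    """Print streamed chat text as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content can be None or empty
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


def fake_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build an object shaped like a streamed chunk (for illustration only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chat_text = collect_chat_stream([
    fake_chunk(None),
    fake_chunk("Two "),
    fake_chunk("films."),
    SimpleNamespace(choices=[]),
])
```

With a live client, the iterable would be the object returned by client.chat.completions.create(..., stream=True).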
You’ll see both styles in real codebases. The Responses API (client.responses) is newer and cleaner; Chat Completions is older but more widely documented. Both are fine.
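The two styles map onto each other directly, which helps when reading code written in the other one. A small sketch of the mapping for the simple single-turn case; to_messages is a hypothetical helper, not an SDK function.

```python
from typing import Dict, List, Optional


def to_messages(user_input: str,
                instructions: Optional[str] = None) -> List[Dict[str, str]]:
    """Turn Responses-style input/instructions into a Chat Completions messages list."""
    messages = []
    if instructions:
        # The instructions field plays the part of the system message.
        messages.append({"role": "system", "content": instructions})
    messages.append({"role": "user", "content": user_input})
    return messages


print(to_messages("Tell me about Bahubali",
                  "You are a movie buff. Only discuss films."))
```

The result can be passed as the messages argument of client.chat.completions.create().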
What you can build from here
With client + model + system prompt + streaming, you have the core of almost any LLM-powered feature:
- A chatbot with a custom persona
- A document Q&A tool
- An automated content generator
- The backbone of an AI agent
The next post in the series connects the pieces — adding memory and tools to turn a single API call into a proper agent.