Ollama Integration¶
AdalFlow provides comprehensive support for Ollama, enabling you to run open-source LLMs locally without depending on external APIs. This integration supports both synchronous and asynchronous operations, including streaming responses.
Quick Start¶
Installation¶
# Install AdalFlow with Ollama support
pip install adalflow[ollama]
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama server
ollama serve
# Pull a model (e.g., qwen2:0.5b, mistral, or gpt-oss)
ollama pull qwen2:0.5b
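Before wiring Ollama into AdalFlow, you can confirm the server is reachable. A minimal sketch, assuming the default port 11434 and that the requests package is installed (the server answers its root endpoint with "Ollama is running"):
import requests
# Check that the local Ollama server is up; the root endpoint returns "Ollama is running"
resp = requests.get("http://localhost:11434")
print(resp.status_code, resp.text)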
Basic Usage¶
from adalflow.components.model_client import OllamaClient
from adalflow.core import Generator
# Initialize the Generator with OllamaClient
generator = Generator(
    model_client=OllamaClient(host="http://localhost:11434"),
    model_kwargs={"model": "qwen2:0.5b"}
)
# Generate a response
response = generator({"input_str": "Hello, what can you do?"})
print(response.data)
Model Capabilities¶
Text Generation¶
Basic text generation with various open-source models:
# Configure with specific model parameters
generator = Generator(
    model_client=OllamaClient(),
    model_kwargs={
        "model": "qwen2:0.5b",
        "options": {
            "temperature": 0.7,
            "num_predict": 512,
            "top_k": 40,
            "top_p": 0.9,
            "repeat_penalty": 1.1,
            "num_ctx": 2048,
        }
    }
)
response = generator({"input_str": "Explain quantum computing"})
print(response.data)
Streaming Responses¶
Real-time streaming for better user experience:
Synchronous Streaming:
# Enable streaming
stream_generator = Generator(
    model_client=OllamaClient(),
    model_kwargs={
        "model": "qwen2:0.5b",
        "stream": True
    }
)
output = stream_generator.call(
    prompt_kwargs={"input_str": "Tell me a story"}
)
# Access the raw streaming response
for chunk in output.raw_response:
    if "message" in chunk:
        print(chunk["message"]["content"], end='', flush=True)
Asynchronous Streaming:
import asyncio

async def stream_story():
    # Using async streaming
    output = await stream_generator.acall(
        prompt_kwargs={"input_str": "Tell me a story"}
    )
    # Access the raw async streaming response
    async for chunk in output.raw_response:
        if "message" in chunk:
            print(chunk["message"]["content"], end='', flush=True)

asyncio.run(stream_story())
Chat vs Generate API¶
Ollama supports two APIs for text generation:
# Chat API (default) - uses conversation format
chat_generator = Generator(
    model_client=OllamaClient(),
    model_kwargs={"model": "qwen2:0.5b"}
)
# Generate API - uses raw prompt
generate_generator = Generator(
    model_client=OllamaClient(),
    model_kwargs={
        "model": "qwen2:0.5b",
        "generate": True  # Use generate API instead of chat
    }
)
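Both generators are called the same way; only the underlying Ollama endpoint differs. For example, a short sketch reusing the two generators defined above:
# Same call pattern for both generators
question = {"input_str": "Summarize the benefits of running LLMs locally."}
print(chat_generator(question).data)       # served by the chat endpoint
print(generate_generator(question).data)   # served by the generate endpoint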
Text Embeddings¶
Generate embeddings for semantic search and similarity:
from adalflow.core import Embedder
embedder = Embedder(
    model_client=OllamaClient(),
    model_kwargs={"model": "nomic-embed-text"}
)
# Single text embedding
text = "This is a sample text for embedding"
embedding = embedder(input=text)
print(f"Embedding dimension: {len(embedding.data[0].embedding)}")
Advanced Features¶
Reasoning Models (GPT-OSS)¶
Use OpenAI’s GPT-OSS models locally with Ollama:
# Pull GPT-OSS model
# Run in terminal: ollama pull gpt-oss:20b
reasoning_gen = Generator(
    model_client=OllamaClient(),
    model_kwargs={
        "model": "gpt-oss:20b",
        "options": {
            "temperature": 0.7,
            "num_predict": 1024,
        }
    }
)
response = reasoning_gen({
    "input_str": "Solve this problem step by step: ..."
})
# Access reasoning process if available
if response.thinking:
    print("Thinking:", response.thinking)
print("Answer:", response.data)
Model Options¶
Commonly used options, set via the options dict in model_kwargs:

| Option | Default | Description |
|---|---|---|
| seed | 0 | Random seed for reproducible generation |
| num_predict | 128 | Maximum tokens to generate (-1 for infinite) |
| temperature | 0.8 | Creativity level (0.0-2.0) |
| top_k | 40 | Number of top tokens to consider |
| top_p | 0.9 | Cumulative probability cutoff |
| repeat_penalty | 1.1 | Penalty for repeated tokens |
| num_ctx | 2048 | Context window size |
| stop | [] | Stop sequences (e.g., ["\n", "user:"]) |
| mirostat | 0 | Mirostat sampling (0=disabled, 1/2=enabled) |
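These options are passed through the options dict of model_kwargs, as in the text-generation example above. As an illustration, a sketch of a reproducible, bounded generation (the specific values are only examples):
deterministic_gen = Generator(
    model_client=OllamaClient(),
    model_kwargs={
        "model": "qwen2:0.5b",
        "options": {
            "seed": 42,           # fixed seed for reproducible output
            "temperature": 0.0,   # greedy decoding
            "num_predict": 256,   # cap the response length
            "stop": ["\n\n"],     # stop at the first blank line
        },
    },
)
response = deterministic_gen({"input_str": "List three uses of local LLMs."})
print(response.data)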
Available Models¶
Popular models compatible with Ollama:
| Model | Size | Best For |
|---|---|---|
| qwen2:0.5b | 0.5B | Lightweight, fast inference, good for testing |
| llama3 | 8B | General purpose, balanced performance |
| mistral | 7B | Fast inference, good for coding |
| mixtral | 8x7B | High quality, mixture of experts |
| qwen2 | 0.5B-72B | Multilingual, various sizes |
| codellama | 7B-34B | Code generation and understanding |
| gpt-oss | 20B/120B | OpenAI's open-source reasoning model |
| nomic-embed-text | | Text embeddings for semantic search |
To see all available models, visit: https://ollama.com/library
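To check which models are already pulled on your machine, you can run ollama list in a terminal, or query the local server's tags endpoint. A sketch using requests against Ollama's HTTP API:
import requests
# List models already available on the local Ollama server
tags = requests.get("http://localhost:11434/api/tags").json()
for model in tags.get("models", []):
    print(model["name"])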