ollama_client#

Ollama ModelClient integration.

Functions

parse_generate_response(completion)

Parse the completion to a str.

parse_stream_response(completion)

Parse the completion to a str.

Classes

OllamaClient([host])

A component wrapper for the Ollama SDK client.

parse_stream_response(completion: Generator) → Any[source]#

Parse the completion to a str. We use the generate API with a prompt instead of the chat API with messages.

parse_generate_response(completion: GenerateResponse) → GeneratorOutput[source]#

Parse the completion to a str. We use the generate API with a prompt instead of the chat API with messages.

class OllamaClient(host: str | None = None)[source]#

Bases: ModelClient

A component wrapper for the Ollama SDK client.

To make a model work, you need to:

  • [Download Ollama app] Go to ollama/ollama to download the Ollama app (command line tool). Choose the appropriate version for your operating system.

  • [Pull a model] Run the following command to pull a model:

ollama pull llama3
  • [Run a model] Run the following command to run a model:

ollama run llama3

This model will be available at http://localhost:11434. You can also chat with the model at the terminal after running the command.
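With the server running, the client can be used from Python. A minimal sketch; the import path is an assumption that may differ by version, and the model must already be pulled as shown above:

# Assumption: the import path may differ across versions/installations.
from lightrag.components.model_client import OllamaClient

# Uses OLLAMA_HOST if set, otherwise the default http://localhost:11434.
client = OllamaClient()

# Or point at a specific Ollama server explicitly (URI is illustrative).
client = OllamaClient(host="http://localhost:11434")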

Parameters:

host (Optional[str], optional) – Optional host URI. If not provided, the client looks for the OLLAMA_HOST env variable; if that is also unset, the default host "http://localhost:11434" is used. Defaults to None.

Setting model_kwargs:

For LLM, expect model_kwargs to have the following keys:

model (str, required):

Use ollama list in your CLI or visit the Ollama model library at https://ollama.com/library

stream (bool, default: False) – Whether to stream the results.

options (Optional[dict], optional)

Options that affect model output.

If not specified, the following defaults will be assigned (a combined example is sketched after this list):

“seed”: 0, - Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt.

“num_predict”: 128, - Maximum number of tokens to predict when generating text. (-1 = infinite generation, -2 = fill context)

“top_k”: 40, - Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative.

“top_p”: 0.9, - Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text.

“tfs_z”: 1, - Tail free sampling, used to reduce the impact of less probable tokens in the output. A value of 1 disables this setting; see the Ollama documentation for specifics.

“repeat_last_n”: 64, - Sets how far back the model should look back to prevent repetition. (0 = disabled, -1 = num_ctx)

“temperature”: 0.8, - The temperature of the model. Increasing the temperature will make the model answer more creatively.

“repeat_penalty”: 1.1, - Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient.

“mirostat”: 0.0, - Enable Mirostat sampling for controlling perplexity. (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)

“mirostat_tau”: 0.5, - Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text.

“mirostat_eta”: 0.1, - Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive.

“stop”: [“\n”, “user:”], - Sets the stop sequences to use. When this pattern is encountered, the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile.

“num_ctx”: 2048, - Sets the size of the context window used to generate the next token.
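Putting the keys above together, a hedged example of model_kwargs for an LLM call; the values are illustrative, not recommendations, and the model must be pulled locally first:

# Illustrative values only; see the key descriptions above.
llm_model_kwargs = {
    "model": "llama3",   # required; must be pulled locally
    "stream": False,     # set True to stream the results
    "options": {
        "seed": 42,
        "num_predict": 256,
        "temperature": 0.8,
        "top_k": 40,
        "top_p": 0.9,
        "repeat_penalty": 1.1,
        "num_ctx": 2048,
        "stop": ["user:"],
    },
}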

For EMBEDDER, expect model_kwargs to have the following keys:

model (str, required):

Use ollama list in your CLI or visit the Ollama model library at https://ollama.com/library

prompt (str, required):

String that is sent to the Embedding model.

options (Optional[dict], optional):

See LLM args for defaults.
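Similarly, a hedged example of model_kwargs for the embedder; the model name is taken from the tested-models list below, and the prompt is normally supplied as the embedder input rather than hard-coded here:

# Illustrative values only.
embedder_model_kwargs = {
    "model": "jina/jina-embeddings-v2-base-en:latest",  # required
    # "options": {...},  # same keys and defaults as the LLM options above
}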

References

Tested Ollama models (as of 7/9/24):

  • internlm2:latest

  • llama3

  • jina/jina-embeddings-v2-base-en:latest

Note

We use the embeddings and generate APIs from the Ollama SDK. Please refer to ollama/ollama-python for model_kwargs details.
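For reference, the underlying SDK calls that the wrapper delegates to look roughly like this (a sketch of the ollama-python generate and embeddings APIs; model names are illustrative):

import ollama

sdk = ollama.Client(host="http://localhost:11434")

# generate: prompt in, generated text in the "response" field.
gen = sdk.generate(model="llama3", prompt="Why is the sky blue?")
print(gen["response"])

# embeddings: prompt in, vector in the "embedding" field.
emb = sdk.embeddings(
    model="jina/jina-embeddings-v2-base-en:latest",
    prompt="Why is the sky blue?",
)
print(len(emb["embedding"]))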

init_sync_client()[source]#

Create the synchronous client

init_async_client()[source]#

Create the asynchronous client

parse_chat_completion(completion: GenerateResponse | Generator) → GeneratorOutput[source]#

Parse the completion to a str. We use the generate API with a prompt instead of the chat API with messages.

parse_embedding_response(response: Dict[str, List[float]]) → EmbedderOutput[source]#

Parse the embedding response into a structure LightRAG components can understand. Pull the embedding from response['embedding'] and store it in the Embedding dataclass.

convert_inputs_to_api_kwargs(input: Any | None = None, model_kwargs: Dict = {}, model_type: ModelType = ModelType.UNDEFINED) → Dict[source]#

Convert the input and model_kwargs to api_kwargs for the Ollama SDK client.

call(api_kwargs: Dict = {}, model_type: ModelType = ModelType.UNDEFINED)[source]#

Subclasses use this to call the API with the sync client. model_type decides which API to use, such as chat.completions or embeddings for OpenAI. api_kwargs holds all the arguments the API call needs; subclasses should implement this method.

Additionally, subclasses can implement error handling and retry logic here. See OpenAIClient for an example.
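A hedged end-to-end sketch of the sync path, combining convert_inputs_to_api_kwargs, call, and parse_chat_completion; the import paths are assumptions:

# Assumption: import paths may differ across versions/installations.
from lightrag.components.model_client import OllamaClient
from lightrag.core.types import ModelType

client = OllamaClient()

api_kwargs = client.convert_inputs_to_api_kwargs(
    input="Why is the sky blue?",
    model_kwargs={"model": "llama3", "stream": False},
    model_type=ModelType.LLM,
)
completion = client.call(api_kwargs=api_kwargs, model_type=ModelType.LLM)
output = client.parse_chat_completion(completion)  # GeneratorOutput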

async acall(api_kwargs: Dict = {}, model_type: ModelType = ModelType.UNDEFINED)[source]#

Subclasses use this to call the API with the async client.
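The async path mirrors the sync sketch above, but acall is awaited; names and import paths from that sketch remain assumptions:

import asyncio

async def embed_async(client: OllamaClient, text: str):
    # Uses the documented EMBEDDER keys; the model name is illustrative.
    api_kwargs = client.convert_inputs_to_api_kwargs(
        input=text,
        model_kwargs={"model": "jina/jina-embeddings-v2-base-en:latest"},
        model_type=ModelType.EMBEDDER,
    )
    response = await client.acall(api_kwargs=api_kwargs,
                                  model_type=ModelType.EMBEDDER)
    return client.parse_embedding_response(response)  # EmbedderOutput

# asyncio.run(embed_async(OllamaClient(), "Why is the sky blue?"))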

classmethod from_dict(data: Dict[str, Any]) → OllamaClient[source]#

Create an instance from data previously serialized with the to_dict() method.

to_dict(exclude: List[str] | None = None) → Dict[str, Any][source]#

Convert the component to a dictionary.
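A brief round-trip sketch of the serialization helpers (host value is illustrative):

client = OllamaClient(host="http://localhost:11434")
state = client.to_dict()                   # plain dict, safe to persist
restored = OllamaClient.from_dict(state)   # equivalent OllamaClient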