ollama_client#

Ollama ModelClient integration.

Functions

parse_generate_response(completion)

Parse the completion to a str.

parse_stream_response(completion)

Parse the completion to a str.

Classes

OllamaClient([host])

A component wrapper for the Ollama SDK client.

parse_stream_response(completion: Generator) → Any[source]#

Parse the completion to a str. We use the generate API with a prompt instead of the chat API with messages.

parse_generate_response(completion: GenerateResponse) → GeneratorOutput[source]#

Parse the completion to a str. We use the generate API with a prompt instead of the chat API with messages.

class OllamaClient(host: str | None = None)[source]#

Bases: ModelClient

A component wrapper for the Ollama SDK client.

To make a model work, you need to:

  • [Download Ollama app] Go to ollama/ollama to download the Ollama app (command line tool). Choose the appropriate version for your operating system.

  • [Pull a model] Run the following command to pull a model:

ollama pull llama3
  • [Run a model] Run the following command to run a model:

ollama run llama3

This model will be available at http://localhost:11434. You can also chat with the model at the terminal after running the command.
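With the server running, the client can be used from Python. A minimal sketch; the import path is an assumption that may differ by version, and the model must already be pulled as shown above:

# Assumption: the import path may differ across versions/installations.
from lightrag.components.model_client import OllamaClient

# Uses OLLAMA_HOST if set, otherwise the default http://localhost:11434.
client = OllamaClient()

# Or point at a specific Ollama server explicitly (URI is illustrative).
client = OllamaClient(host="http://localhost:11434")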

Parameters:

host (Optional[str], optional) – Optional host URI. If not provided, the client looks for the OLLAMA_HOST env variable; if that is also unset, the default host "http://localhost:11434" is used. Defaults to None.

Setting model_kwargs:

For LLM, expect model_kwargs to have the following keys:

model (str, required):

Use ollama list in your CLI or visit the Ollama model library at https://ollama.com/library

stream (bool, default: False) – Whether to stream the results.

options (Optional[dict], optional)

Options that affect model output.

If not specified, the following defaults will be assigned (a combined example is sketched after this list):

“seed”: 0, - Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt.

“num_predict”: 128, - Maximum number of tokens to predict when generating text. (-1 = infinite generation, -2 = fill context)

“top_k”: 40, - Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative.

“top_p”: 0.9, - Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text.

“tfs_z”: 1, - Tail free sampling, used to reduce the impact of less probable tokens in the output. A value of 1 disables this setting; see the Ollama documentation for specifics.

“repeat_last_n”: 64, - Sets how far back the model should look back to prevent repetition. (0 = disabled, -1 = num_ctx)

“temperature”: 0.8, - The temperature of the model. Increasing the temperature will make the model answer more creatively.

“repeat_penalty”: 1.1, - Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient.

“mirostat”: 0.0, - Enable Mirostat sampling for controlling perplexity. (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)

“mirostat_tau”: 0.5, - Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text.

“mirostat_eta”: 0.1, - Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive.

“stop”: [“\n”, “user:”], - Sets the stop sequences to use. When this pattern is encountered, the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile.

“num_ctx”: 2048, - Sets the size of the context window used to generate the next token.
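Putting the keys above together, a hedged example of model_kwargs for an LLM call; the values are illustrative, not recommendations, and the model must be pulled locally first:

# Illustrative values only; see the key descriptions above.
llm_model_kwargs = {
    "model": "llama3",   # required; must be pulled locally
    "stream": False,     # set True to stream the results
    "options": {
        "seed": 42,
        "num_predict": 256,
        "temperature": 0.8,
        "top_k": 40,
        "top_p": 0.9,
        "repeat_penalty": 1.1,
        "num_ctx": 2048,
        "stop": ["user:"],
    },
}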

For EMBEDDER, expect model_kwargs to have the following keys:

model (str, required):

Use ollama list in your CLI or visit the Ollama model library at https://ollama.com/library

prompt (str, required):

String that is sent to the Embedding model.

options (Optional[dict], optional):

See LLM args for defaults.
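Similarly, a hedged example of model_kwargs for the embedder; the model name is taken from the tested-models list below, and the prompt is normally supplied as the embedder input rather than hard-coded here:

# Illustrative values only.
embedder_model_kwargs = {
    "model": "jina/jina-embeddings-v2-base-en:latest",  # required
    # "options": {...},  # same keys and defaults as the LLM options above
}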

References

Tested Ollama models (as of 7/9/24):

  • internlm2:latest

  • llama3

  • jina/jina-embeddings-v2-base-en:latest

Note

We use the embeddings and generate APIs from the Ollama SDK. Please refer to ollama/ollama-python for model_kwargs details.
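For reference, the underlying SDK calls that the wrapper delegates to look roughly like this (a sketch of the ollama-python generate and embeddings APIs; model names are illustrative):

import ollama

sdk = ollama.Client(host="http://localhost:11434")

# generate: prompt in, generated text in the "response" field.
gen = sdk.generate(model="llama3", prompt="Why is the sky blue?")
print(gen["response"])

# embeddings: prompt in, vector in the "embedding" field.
emb = sdk.embeddings(
    model="jina/jina-embeddings-v2-base-en:latest",
    prompt="Why is the sky blue?",
)
print(len(emb["embedding"]))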

init_sync_client()[source]#

Create the synchronous client

init_async_client()[source]#

Create the asynchronous client

parse_chat_completion(completion: GenerateResponse | Generator) → GeneratorOutput[source]#

Parse the completion to a str. We use the generate API with a prompt instead of the chat API with messages.

parse_embedding_response(response: Dict[str, List[float]]) → EmbedderOutput[source]#

Parse the embedding response into a structure LightRAG components can understand. Pull the embedding from response['embedding'] and store it in the Embedding dataclass.

convert_inputs_to_api_kwargs(input: Any | None = None, model_kwargs: Dict = {}, model_type: ModelType = ModelType.UNDEFINED) → Dict[source]#

Convert the input and model_kwargs to api_kwargs for the Ollama SDK client.

call(api_kwargs: Dict = {}, model_type: ModelType = ModelType.UNDEFINED)[source]#

Subclasses use this to call the API with the sync client. model_type decides which API to use, such as chat.completions or embeddings for OpenAI. api_kwargs holds all the arguments the API call needs; subclasses should implement this method.

Additionally, subclasses can implement error handling and retry logic here. See OpenAIClient for an example.
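A hedged end-to-end sketch of the sync path, combining convert_inputs_to_api_kwargs, call, and parse_chat_completion; the import paths are assumptions:

# Assumption: import paths may differ across versions/installations.
from lightrag.components.model_client import OllamaClient
from lightrag.core.types import ModelType

client = OllamaClient()

api_kwargs = client.convert_inputs_to_api_kwargs(
    input="Why is the sky blue?",
    model_kwargs={"model": "llama3", "stream": False},
    model_type=ModelType.LLM,
)
completion = client.call(api_kwargs=api_kwargs, model_type=ModelType.LLM)
output = client.parse_chat_completion(completion)  # GeneratorOutput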

async acall(api_kwargs: Dict = {}, model_type: ModelType = ModelType.UNDEFINED)[source]#

Subclasses use this to call the API with the async client.
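The async path mirrors the sync sketch above, but acall is awaited; names and import paths from that sketch remain assumptions:

import asyncio

async def embed_async(client: OllamaClient, text: str):
    # Uses the documented EMBEDDER keys; the model name is illustrative.
    api_kwargs = client.convert_inputs_to_api_kwargs(
        input=text,
        model_kwargs={"model": "jina/jina-embeddings-v2-base-en:latest"},
        model_type=ModelType.EMBEDDER,
    )
    response = await client.acall(api_kwargs=api_kwargs,
                                  model_type=ModelType.EMBEDDER)
    return client.parse_embedding_response(response)  # EmbedderOutput

# asyncio.run(embed_async(OllamaClient(), "Why is the sky blue?"))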

classmethod from_dict(data: Dict[str, Any]) → OllamaClient[source]#

Create an instance from data previously serialized with the to_dict() method.

to_dict(exclude: List[str] | None = None) → Dict[str, Any][source]#

Convert the component to a dictionary.
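A brief round-trip sketch of the serialization helpers (host value is illustrative):

client = OllamaClient(host="http://localhost:11434")
state = client.to_dict()                   # plain dict, safe to persist
restored = OllamaClient.from_dict(state)   # equivalent OllamaClient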