Parser#
Parser is the interpreter of the LLM output. We have three types of parsers:
String Parsers: it simply converts the string to the desired data type. They are located at core.string_parser.
Output Parsers: it orchestrates the parsing and output formatting(in yaml, json and more) process. They are located at components.output_parsers.outputs.
JsonOutputParser
andYamlOutputParser
can work with DataClass for structured output.DataClass Parser: On top of YamlOutputParser and JsonOutputParser,
DataClassParser
is the most compatible to work with DataClass for structured output.
Context#
LLMs output text in string format. Parsing is the process of extracting and converting the string to desired data structure per the use case. This desired data structure can be:
simple data types like string, int, float, boolean, etc.
complex data types like list, dict, or data class instance.
Code like Python, SQL, html, etc.
It honestly can be converted to any kind of formats that are required by the use case. It is an important step for the LLM applications to interact with the external world, such as:
to int to support classification and float to support regression.
to list to support multiple choice selection.
to json/yaml which will be extracted to dict, and optional further to data class instance to support support cases like function calls.
Scope and Design#
Right now, we aim to cover the simple and complext data types but the code.
Parse
The following list the scope of our current support of parsing:
int_str = "42"
float_str = "42.0"
boolean_str = "True" # json works with true/false, yaml works for both True/False and true/false
None_str = "None"
Null_str = "null" # json works with null, yaml works with both null and None
dict_str = '{"key": "value"}'
list_str = '["key", "value"]'
nested_dict_str = (
'{"name": "John", "age": 30, "attributes": {"height": 180, "weight": 70}}'
)
yaml_dict_str = "key: value"
yaml_nested_dict_str = (
"name: John\nage: 30\nattributes:\n height: 180\n weight: 70"
)
yaml_list_str = "- key\n- value"
In Python, there are various ways to parse the string:
Use built-in functions like int
, float
, bool
can handle the simple types.
We can use ast.literal_eval
and json.loads()
to handle the complex types like dict, list, and nested dict.
However, none of them is as robust as yaml.safe_load
. Yaml can:
Parse True/False and ‘true/false’ to boolean.
Parse None and ‘null’ to None.
Handle nested dict and list in both yaml and json format.
Thus, we will use yaml.safe_load
as the last resort for robust parsing to handle complex data types to get List and Dict data types.
We will use int, float, bool for simple data types.
Parser#
Our parser is located at core.string_parser. It handles both extracting and parsing to python object types. And it is designed to be robust.
Parser Class |
Target Python Object |
Description |
---|---|---|
|
Extracts the first boolean value from the text with |
|
|
Extracts the first integer value from the text with |
|
|
Extracts the first float value from the text with |
|
|
Extracts ‘[]’ and parses the first list string from the text. Uses both json.loads and yaml.safe_load. |
|
|
Extracts ‘[]’ and ‘{}’ and parses JSON strings from the text. It resorts to yaml.safe_load for robust parsing. |
|
|
Extracts ‘ |
Data Class Instance
If your parsed object is dictionary, you can define and use DataClass
instance.
With from_dict
method, you can easily convert the dictionary to data class instance.
Output Parsers#
The above parsers do not come with output format instructions.
Thus, we created OutputParser
to orchestrate both the formatting and parsing process.
It is an abstract component with two main methods:
format_instructions
: to generate the output format instructions for the prompt.call
: to parse the output string to the desired python object.
If you are targetting at dict
object, we already have DataClass
to help us describe any data class type and instance that can be easily used to interact with LLMs.
Thus, JsonOutputParser
and YamlOutputParser
both takes the following arguments:
data_class
: theDataClass
type.examples
: the examples of the data class instance if you want to show the examples in the prompt.exclude
: the fields to exclude from both the data format and the examples, a way to tell theformat_instructions
on which is the output field from the data class.
DataClass Parser#
To make things even easier for the developers, we created DataClassParser
which
understands __input_fields__ and __output_fields__ of the DataClass, and it is especially helpful to work on a training dataset where we will have both inputs and outputs.
Users do not have to use exclude/include fields to specify the output fields, it will automatically understand the output fields from the DataClass instance.
Below is an overview of its key components and functionalities.
Method |
Description |
Details |
---|---|---|
|
Initializes the DataClassParser |
Takes a DataClass type, whether to return the DataClass instance after parsing, and the output format type (JSON or YAML). |
|
Returns formatted instructions for input data |
Provides a string representation of the input fields defined in the DataClass. |
|
Returns formatted instructions for output data |
Generates a schema string for the output fields of the DataClass. |
|
Formats the input data as a string |
Converts a DataClass instance to either JSON or YAML based on the specified format type. |
|
Returns the task description string |
Retrieves the task description associated with the DataClass, useful for context in LLM prompts. |
|
Formats a list of example DataClass instances |
Generates a formatted string representation of examples, adhering to the specified |
|
Parses the output string to the desired format and returns parsed output |
Handles both JSON and YAML parsing, converting to the corresponding DataClass if specified. |
Parser in Action#
All of the parsers are quite straightforward to use.
BooleanParser#
from adalflow.core.string_parser import BooleanParser
bool_str = "True"
bool_str_2 = "False"
bool_str_3 = "true"
bool_str_4 = "false"
bool_str_5 = "1" # will fail
bool_str_6 = "0" # will fail
bool_str_7 = "yes" # will fail
bool_str_8 = "no" # will fail
# it will all return True/False
parser = BooleanParser()
print(parser(bool_str))
print(parser(bool_str_2))
print(parser(bool_str_3))
print(parser(bool_str_4))
The printout will be:
True
False
True
False
Boolean parsers will not work for ‘1’, ‘0’, ‘yes’, ‘no’ as they are not the standard boolean values.
IntParser#
rom adalflow.core.string_parser import IntParser
int_str = "42"
int_str_2 = "42.0"
int_str_3 = "42.7"
int_str_4 = "the answer is 42.75"
# it will all return 42
parser = IntParser()
print(parser(int_str))
print(parser(int_str_2))
print(parser(int_str_3))
print(parser(int_str_4))
The printout will be:
42
42
42
42
IntParser
will return the integer value of the first number in the string, even if it is a float.
FloatParser#
from adalflow.core.string_parser import FloatParser
float_str = "42.0"
float_str_2 = "42"
float_str_3 = "42.7"
float_str_4 = "the answer is 42.75"
# it will all return 42.0
parser = FloatParser()
print(parser(float_str))
print(parser(float_str_2))
print(parser(float_str_3))
print(parser(float_str_4))
The printout will be:
42.0
42.0
42.7
42.75
FloatParser
will return the float value of the first number in the string, even if it is an integer.
ListParser#
from adalflow.core.string_parser import ListParser
list_str = '["key", "value"]'
list_str_2 = 'prefix["key", 2]...'
list_str_3 = '[{"key": "value"}, {"key": "value"}]'
parser = ListParser()
print(parser(list_str))
print(parser(list_str_2))
print(parser(list_str_3))
The output will be:
['key', 'value']
['key', 2]
[{'key': 'value'}, {'key': 'value'}]
JsonParser#
Even though it can work on lists, it is better to only use it for dictionaries.
from adalflow.core.string_parser import JsonParser
dict_str = '{"key": "value"}'
nested_dict_str = (
'{"name": "John", "age": 30, "attributes": {"height": 180, "weight": 70}}'
)
list_str = '["key", 2]'
list_dict_str = '[{"key": "value"}, {"key": "value"}]'
parser = JsonParser()
print(parser)
print(parser(dict_str))
print(parser(nested_dict_str))
print(parser(list_str))
print(parser(list_dict_str))
The output will be:
{'key': 'value'}
{'name': 'John', 'age': 30, 'attributes': {'height': 180, 'weight': 70}}
['key', 2]
[{'key': 'value'}, {'key': 'value'}]
YamlParser#
Though it works almost on all of the previous examples, it is better to use it for yaml formatted dictionaries.
from adalflow.core.string_parser import YamlParser
yaml_dict_str = "key: value"
yaml_nested_dict_str = (
"name: John\nage: 30\nattributes:\n height: 180\n weight: 70"
)
yaml_list_str = "- key\n- value"
parser = YamlParser()
print(parser)
print(parser(yaml_dict_str))
print(parser(yaml_nested_dict_str))
print(parser(yaml_list_str))
The output will be:
{'key': 'value'}
{'name': 'John', 'age': 30, 'attributes': {'height': 180, 'weight': 70}}
['key', 'value']
Note
All parsers will raise ValueError
if it fails at any step. Developers should process it accordingly.
Output Parsers in Action#
We will create the following simple DataClass
with one example.
And we will demonstrate how to use JsonOutputParser
and YamlOutputParser
to parse another example to dict object.
from dataclasses import dataclass, field
from adalflow.core import DataClass
@dataclass
class User(DataClass):
id: int = field(default=1, metadata={"description": "User ID"})
name: str = field(default="John", metadata={"description": "User name"})
user_example = User(id=1, name="John")
JsonOutputParser#
Here is how to use JsonOutputParser
:
from adalflow.components.output_parsers import JsonOutputParser
parser = JsonOutputParser(data_class=User, examples=[user_example])
print(parser)
The structure of it:
JsonOutputParser(
data_class=User, examples=[json_output_parser.<locals>.User(id=1, name='John')], exclude_fields=None
(json_output_format_prompt): Prompt(
template: Your output should be formatted as a standard JSON instance with the following schema:
```
{{schema}}
```
{% if example %}
Examples:
```
{{example}}
```
{% endif %}
-Make sure to always enclose the JSON output in triple backticks (```). Please do not add anything other than valid JSON output!
-Use double quotes for the keys and string values.
-Follow the JSON formatting conventions., prompt_variables: ['example', 'schema']
)
(output_processors): JsonParser()
)
The output format string will be:
Your output should be formatted as a standard JSON instance with the following schema:
```
{
"id": " (int) (optional)",
"name": " (str) (optional)"
}
```
Examples:
```
{
"id": 1,
"name": "John"
}
________
```
-Make sure to always enclose the JSON output in triple backticks (```). Please do not add anything other than valid JSON output!
-Use double quotes for the keys and string values.
-Follow the JSON formatting conventions.
Call the parser with the following string:
user_to_parse = '{"id": 2, "name": "Jane"}'
parsed_user = parser(user_to_parse)
print(parsed_user)
The output will be:
{'id': 2, 'name': 'Jane'}
YamlOutputParser#
The steps are totally the same as the JsonOutputParser
.
from adalflow.components.output_parsers import YamlOutputParser
parser = YamlOutputParser(data_class=User, examples=[user_example])
print(parser)
The structure of it:
YamlOutputParser(
data_class=<class '__main__.yaml_output_parser.<locals>.User'>, examples=[yaml_output_parser.<locals>.User(id=1, name='John')]
(yaml_output_format_prompt): Prompt(
template: Your output should be formatted as a standard YAML instance with the following schema:
```
{{schema}}
```
{% if example %}
Examples:
```
{{example}}
```
{% endif %}
-Make sure to always enclose the YAML output in triple backticks (```). Please do not add anything other than valid YAML output!
-Follow the YAML formatting conventions with an indent of 2 spaces.
-Quote the string values properly., prompt_variables: ['schema', 'example']
)
(output_processors): YamlParser()
)
The output format string will be:
Your output should be formatted as a standard YAML instance with the following schema:
```
id: (int) (optional)
name: (str) (optional)
```
Examples:
```
id: 1
name: John
________
```
-Make sure to always enclose the YAML output in triple backticks (```). Please do not add anything other than valid YAML output!
-Follow the YAML formatting conventions with an indent of 2 spaces.
-Quote the string values properly.
Now, let us parse the following string:
user_to_parse = "id: 2\nname: Jane"
parsed_user = parser(user_to_parse)
print(parsed_user)
The output will be:
{'id': 2, 'name': 'Jane'}
DataclassParser in Action#
First, let’s create a new data class with both input and output fields.
@dataclass
class SampleDataClass(DataClass):
description: str = field(metadata={"desc": "A sample description"})
category: str = field(metadata={"desc": "Category of the sample"})
value: int = field(metadata={"desc": "A sample integer value"})
status: str = field(metadata={"desc": "Status of the sample"})
__input_fields__ = [
"description",
"category",
] # Define which fields are input fields
__output_fields__ = ["value", "status"] # Define which fields are output fields
Now, lets’ create a parser that will use the SampleDataClass to parse the output json string back to the data class instance.
from adalflow.components.output_parsers import DataClassParser
parser = DataClassParser(data_class=SampleDataClass, return_data_class=True, format_type="json")
Let’s view the structure of the parser use print(parser).
The output will be:
DataClassParser(
data_class=SampleDataClass, format_type=json, return_data_class=True, input_fields=['description', 'category'], output_fields=['value', 'status']
(_output_processor): JsonParser()
(output_format_prompt): Prompt(
template: Your output should be formatted as a standard JSON instance with the following schema:
```
{{schema}}
```
-Make sure to always enclose the JSON output in triple backticks (```). Please do not add anything other than valid JSON output!
-Use double quotes for the keys and string values.
-DO NOT mistaken the "properties" and "type" in the schema as the actual fields in the JSON output.
-Follow the JSON formatting conventions., prompt_variables: ['schema']
)
)
You can get the output and input format strings using the following methods:
print(parser.get_input_format_str())
print(parser.get_output_format_str())
The output for the output format string will be:
Your output should be formatted as a standard JSON instance with the following schema:
```
{
"value": " (int) (required)",
"status": " (str) (required)"
}
```
-Make sure to always enclose the JSON output in triple backticks (```). Please do not add anything other than valid JSON output!
-Use double quotes for the keys and string values.
-DO NOT mistaken the "properties" and "type" in the schema as the actual fields in the JSON output.
-Follow the JSON formatting conventions.
The input format string will be:
{
"description": " (str) (required)",
"category": " (str) (required)"
}
Convert a json string to a data class instance:
user_input = '{"description": "Parsed description", "category": "Sample Category", "value": 100, "status": "active"}'
parsed_instance = parser.call(user_input)
print(parsed_instance)
The output will be:
SampleDataClass(description='Parsed description', category='Sample Category', value=100, status='active')
Try the examples string:
samples = [
SampleDataClass(
description="Sample description",
category="Sample category",
value=100,
status="active",
),
SampleDataClass(
description="Another description",
category="Another category",
value=200,
status="inactive",
),
]
examples_str = parser.get_examples_str(examples=samples)
print(examples_str)
The output will be:
examples_str:
{
"description": "Sample description",
"category": "Sample category",
"value": 100,
"status": "active"
}
__________
{
"description": "Another description",
"category": "Another category",
"value": 200,
"status": "inactive"
}
__________
API References