This article has the same content as the one on my personal blog.

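While working on several tasks that use ChatGPT to structure text, I have come to think that the choice of output format is a very important consideration. JSON is the format you hear about most often, but when I reconsidered whether it is really the best fit for LLMs, my current answer is that TOML may be the best choice. This post explains how I arrived at that conclusion.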

I presented this topic as a lightning talk at the recent LLM Meetup Tokyo #2.

Event page: lu.ma

The slides from that talk are hosted at docs.google.com.

What is TOML again?

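TOML stands for Tom's Obvious, Minimal Language and is widely used as a configuration file format. It represents data as key/value pairs, optionally grouped into tables. Its simplicity and intuitive structure make it easy for people to read and write.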

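TOML is especially common as a configuration format in software projects. Some concrete examples:

Static site generators: Hugo and Jekyll can read their site and build settings from a TOML file.

Continuous integration (CI): some CI tooling is configured in TOML; GitLab Runner, for example, reads config.toml. (GitHub Actions workflows and .gitlab-ci.yml pipeline definitions themselves are written in YAML.)

Programming languages: Python (pyproject.toml) and Rust (Cargo.toml) use TOML for package and library configuration.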

Now let's look at how TOML differs from other popular data formats (JSON, YAML, and Markdown).

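Comparison with JSON

First, consider JSON.

Advantages

It can flexibly represent a variety of data types, including arrays, objects, numbers, and strings.
Many programming languages ship a JSON implementation as standard, so it can be used almost anywhere.

Disadvantages

It is not very human-readable. Newlines and double quotes inside strings must be escaped, so long text values are hard to follow until the data has been parsed.
The whole document must be wrapped in opening and closing brackets, which makes it a poor fit for stream processing: a partially received document is not valid JSON until the closing bracket arrives.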
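Both drawbacks can be seen with Python's standard `json` module. The following is a small sketch, not a benchmark; the sample text and the way the response is cut in half are assumptions made purely for illustration.

```python
import json

# A value containing double quotes and a newline, as LLM explanations often do.
text = 'She said "hi"\nand left.'

# Serializing escapes both the quotes and the newline.
encoded = json.dumps({"why": text})
print(encoded)

# Simulate a response that has only partially arrived over a stream.
partial = encoded[: len(encoded) // 2]
try:
    json.loads(partial)
    parsed_partial = True
except json.JSONDecodeError:
    # The closing quote and bracket have not arrived yet, so parsing fails.
    parsed_partial = False

print(parsed_partial)
```

A standard JSON parser cannot use any of the partial text; the caller has to buffer the full response (or use a specialized incremental parser) before reading a single field.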

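Comparison with YAML

Next, YAML.

Advantages

It inherits all of JSON's advantages.
Data does not have to be wrapped in opening and closing brackets, so it suits stream processing.

Disadvantages

Indentation is significant, so nested data spends tokens on whitespace that carries no content.
For multi-line strings, the indentation after parsing may differ from the indentation in the original text.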

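Comparison with Markdown

Finally, Markdown.

Advantages

It allows a flat representation with no opening and closing brackets, so it suits stream processing.
It is human-readable.

Disadvantages

It is not suited to complex data structures; nesting in particular is limited.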

Compared with these formats, what does TOML offer?

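Advantages of TOML

TOML's main advantage is that it avoids the drawbacks of the three formats above (JSON, YAML, and Markdown).

Human-readable: TOML is easy for people to read. Its multi-line strings ("""...""") and literal strings ('...') let text contain newlines and double quotes without the escaping JSON requires.
Stream-friendly: unlike JSON, the data does not have to be wrapped in opening and closing brackets, so each completed line can be processed as it arrives.
Handles complex data structures: tables and arrays of tables express nesting, which removes the limit Markdown runs into.
Avoids unnecessary token consumption: nesting is expressed with table headers rather than indentation, so it does not have YAML's indentation overhead.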
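To make the streaming point concrete, here is a toy sketch of consuming TOML-like output line by line. It is not a TOML parser: the function name `stream_flat_toml` is hypothetical, and it only understands `[table]` / `[[table]]` headers and single-line double-quoted string values, which is an assumption that happens to match the kind of output discussed in this post.

```python
import json
from typing import Iterable, Iterator, Tuple


def stream_flat_toml(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (table, key, value) as soon as each `key = "value"` line is complete.

    Toy subset only: table headers and one-line basic strings.
    """
    table = ""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            table = line.strip("[]").strip()  # remember the current table name
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line; ignore it
        value = value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            # For common escapes (\n, \", \\) a TOML basic string decodes
            # the same way as a JSON string, so json.loads is reused here.
            yield table, key.strip(), json.loads(value)


# Lines as they might arrive one at a time from a streaming response.
incoming = [
    'funny = "yes"',
    'why = "It is a pun."',
    '[[better_jokes]]',
    'joke = "Why did the scarecrow win an award?\\nBecause he was outstanding in his field!"',
]

events = list(stream_flat_toml(incoming))
for event in events:
    print(event)
```

Because every completed line is meaningful on its own, the `funny` field can be acted on before the rest of the response has been generated. A real implementation would need to handle multi-line strings, other value types, and inline tables.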

In short, TOML avoids some of the drawbacks of these formats while remaining flexible and readable. For these reasons, TOML looks like a good format for structured output from ChatGPT.

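Comparing the formats with a concrete example

This section shows how to build a tool that generates and evaluates jokes using guidance, a tool for controlling large language models more effectively and efficiently.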

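Building the joke generation and evaluation tool

First, we use the guidance package, a language developed by Microsoft for controlling large language models. With guidance, generation, prompting, and logical control can be combined into a single continuous flow that is structured to match how the language model actually processes text.

In this tool, we control a process in which the large language model generates a joke and then evaluates whether that joke is funny.

The code is as follows.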

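import guidance

# Uses the handlebars-style template API of early guidance releases (guidance.llms.*).
guidance.llm = guidance.llms.OpenAI('gpt-3.5-turbo')

# First turn (hidden block): generate a joke.
# Second turn: ask the model to evaluate it in the requested format.
prompt = guidance(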
'''{{#system~}}
You are a helpful assistant.
{{~/system}}
{{#block hidden=True~}}
{{#user~}}
Please tell me a joke
{{~/user}}
{{#assistant~}}
{{gen 'joke'}}
{{~/assistant}}
{{~/block~}}
{{#user~}}
Is the following joke funny? Why or why not?
{{joke}}

The format is {{format}}.
This should contain
"funny": "yes" or "no"
"why": "reason"
"better_jokes": list of joke text and why this is better
{{~/user}}
{{#assistant~}}
{{gen 'output'}}
{{~/assistant}}''')
# Run the same program once per output format and print the 'output' variable.
print("====format: yaml====")
print(prompt(format='yaml')['output'])
print("====format: json====")
print(prompt(format='json')['output'])
print("====format: toml====")
print(prompt(format='toml and use brackets inline tables')['output'])
print("====format: markdown====")
print(prompt(format='markdown')['output'])

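This code first has the large language model generate a joke, then has it evaluate whether the joke is funny. The evaluation is requested in a form that contains whether the joke is funny ("funny"), why it is or is not funny ("why"), and examples of better jokes with the reasons they are better ("better_jokes").

The output is requested in four different formats: YAML, JSON, TOML, and Markdown.

The results are shown below. (Because this is GPT-3.5, text other than the structured data is also emitted.)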

====format: yaml====

Here's the YAML format for the joke:

funny: "yes"
why: "The joke is funny because it plays on the idea of a tomato turning red due to embarrassment or shame, but instead it turns red because it sees salad dressing, which is unexpected and humorous."
better_jokes:
  - "Why did the scarecrow win an award? Because he was outstanding in his field."
    why: "This joke is better because it uses a pun to create a humorous play on words. The phrase 'outstanding in his field' is typically used to describe someone who is an expert in their profession, but in this case, it's taken literally to refer to a scarecrow in a field."
  - "Why don't scientists trust atoms? Because they make up everything."
    why: "This joke is better because it uses a clever twist on words to create a humorous punchline. The idea that atoms make up everything is a scientific fact, but the joke subverts this by suggesting that atoms are untrustworthy."

I hope this helps! Let me know if you have any other questions.

====format: json====

{
  "funny": "yes",
  "why": "The joke is funny because it plays on the idea of a tomato turning red when it's ripe, but in this case, it's because it's embarrassed. The punchline is unexpected and silly, which makes it amusing.",
  "better_jokes": [
    {
      "joke": "Why did the scarecrow win an award? Because he was outstanding in his field!",
      "why": "This joke is better because it has a clever play on words with the double meaning of 'outstanding.' It also has a clear setup and punchline structure."
    },
    {
      "joke": "Why don't scientists trust atoms? Because they make up everything!",
      "why": "This joke is better because it has a clever twist on the idea of atoms being the building blocks of everything. It also has a clear setup and punchline structure."
    }
  ]
}

====format: toml====

Here's the toml format for the joke:

[joke]
text = "Why did the tomato turn red?\nBecause it saw the salad dressing!"
funny = "yes"
why = "The joke plays on the idea of a tomato blushing or turning red when it sees something it likes, but in this case it's because it sees salad dressing. It's a simple and silly joke that can make people chuckle."

[[better_jokes]]
joke = "Why did the scarecrow win an award?\nBecause he was outstanding in his field!"
reason = "This joke is a play on words, using the double meaning of 'outstanding' to create a pun. It's a classic joke that many people find funny."

[[better_jokes]]
joke = "Why don't scientists trust atoms?\nBecause they make up everything!"
reason = "This joke is a clever play on words, using the double meaning of 'make up' to create a pun. It's a bit more sophisticated than the tomato joke, but still silly and fun."

I would say that the tomato joke is mildly funny, but it's a bit too simple and predictable. The punchline is easy to guess, so it doesn't have a lot of surprise or cleverness to it. However, some people might still find it amusing because it's cute and harmless.

As for the better jokes, I've included two examples that use puns to create humor. These jokes are a bit more clever and unexpected, which can make them more satisfying to hear. Of course, humor is subjective, so what one person finds funny might not work for someone else.

====format: markdown====

"funny": "yes"
"why": "The joke is funny because it plays on the idea of a tomato turning red when it's ripe, but in this case, it's because it's embarrassed by the salad dressing. It's a simple and silly joke that can make people chuckle."

"better_jokes": 
- Why did the scarecrow win an award? Because he was outstanding in his field.
  - This joke is better because it has a pun that is unexpected and clever.
- Why don't scientists trust atoms? Because they make up everything.
  - This joke is better because it has a clever twist on a common phrase and plays on the idea of atoms being the building blocks of everything.

Summary

This post covered how well TOML and LLMs fit together. If you think another format works better, I would be glad to hear about it.

ROXX is also actively hiring!

Engineer job listings - ROXX, Inc. (株式会社ROXX)