Jiaxin Huang Shixiang Shane Gu Le Hou Yuexin Wu Xuezhi Wang Hongkun Yu Jiawei Han

Abstract
Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought (CoT) prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4% → 82.1% on GSM8K, 78.2% → 83.0% on DROP, 90.0% → 94.4% on OpenBookQA, and 63.4% → 67.9% on ANLI-A3) and achieves state-of-the-art-level performance without any ground-truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
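The data-generation step described in the abstract can be sketched as follows: sample multiple chain-of-thought reasoning paths per unlabeled question, majority-vote over the final answers (self-consistency), and keep only high-confidence (question, rationale, answer) triples as fine-tuning targets. This is a minimal illustration, not the paper's implementation; the function names, the `threshold` parameter, and the toy model are assumptions for demonstration.

```python
from collections import Counter

def self_generate_training_data(model, questions, n_samples=32, threshold=0.6):
    """Sketch of self-improvement data generation: sample several CoT
    reasoning paths per question, majority-vote on the final answers
    (self-consistency), and keep only high-agreement examples.
    `model` is any callable mapping a question to (reasoning, answer)."""
    training_examples = []
    for q in questions:
        samples = [model(q) for _ in range(n_samples)]
        answers = [ans for _, ans in samples]
        majority_answer, count = Counter(answers).most_common(1)[0]
        confidence = count / n_samples
        if confidence >= threshold:
            # keep every sampled path that reached the majority answer
            for reasoning, ans in samples:
                if ans == majority_answer:
                    training_examples.append((q, reasoning, ans))
    return training_examples

# Toy stand-in for an LLM sampler, deterministic here just to show the data flow.
def toy_model(question):
    return ("Six times seven is forty-two.", "42")

data = self_generate_training_data(toy_model, ["What is 6 * 7?"], n_samples=4)
print(len(data))  # 4 triples, all agreeing on "42"
```

In the paper's setting the sampler would be a temperature-sampled LLM, so the paths (and sometimes the answers) differ across samples; the majority vote then acts as an unsupervised confidence filter before fine-tuning.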
Benchmarks
| Benchmark | Method | Results |
|---|---|---|
| Arithmetic Reasoning on GSM8K | PaLM 540B (Standard Prompting) | Accuracy: 17.9 |
| Arithmetic Reasoning on GSM8K | PaLM 540B (CoT Prompting) | Accuracy: 56.5 |
| Arithmetic Reasoning on GSM8K | PaLM 540B (Self-Consistency) | Accuracy: 74.4 |
| Arithmetic Reasoning on GSM8K | PaLM 540B (Self-Improvement, Standard Prompting) | Accuracy: 32.2 |
| Arithmetic Reasoning on GSM8K | PaLM 540B (Self-Improvement, CoT Prompting) | Accuracy: 73.5 |
| Arithmetic Reasoning on GSM8K | PaLM 540B (Self-Improvement, Self-Consistency) | Accuracy: 82.1 |
| Common-Sense Reasoning on ARC-Challenge | PaLM 540B (Standard Prompting) | Accuracy: 87.1 |
| Common-Sense Reasoning on ARC-Challenge | PaLM 540B (CoT Prompting) | Accuracy: 85.2 |
| Common-Sense Reasoning on ARC-Challenge | PaLM 540B (Self-Consistency) | Accuracy: 88.7 |
| Common-Sense Reasoning on ARC-Challenge | PaLM 540B (Self-Improvement, Standard Prompting) | Accuracy: 87.2 |
| Common-Sense Reasoning on ARC-Challenge | PaLM 540B (Self-Improvement, CoT Prompting) | Accuracy: 88.3 |
| Common-Sense Reasoning on ARC-Challenge | PaLM 540B (Self-Improvement, Self-Consistency) | Accuracy: 89.8 |
| Natural Language Inference on ANLI (test) | PaLM 540B (Standard Prompting) | A2: 55.8, A3: 55.8 |
| Natural Language Inference on ANLI (test) | PaLM 540B (CoT Prompting) | A2: 58.9, A3: 60.6 |
| Natural Language Inference on ANLI (test) | PaLM 540B (Self-Consistency) | A2: 64.5, A3: 63.4 |
| Natural Language Inference on ANLI (test) | PaLM 540B (Self-Improvement, Standard Prompting) | A2: 64.8, A3: 66.9 |
| Natural Language Inference on ANLI (test) | PaLM 540B (Self-Improvement, CoT Prompting) | A2: 65.3, A3: 67.3 |
| Natural Language Inference on ANLI (test) | PaLM 540B (Self-Improvement, Self-Consistency) | A2: 66.5, A3: 67.9 |
| Question Answering on DROP | PaLM 540B (Standard Prompting) | Accuracy: 60.0 |
| Question Answering on DROP | PaLM 540B (CoT Prompting) | Accuracy: 70.6 |
| Question Answering on DROP | PaLM 540B (Self-Consistency) | Accuracy: 78.2 |
| Question Answering on DROP | PaLM 540B (Self-Improvement, Standard Prompting) | Accuracy: 71.7 |
| Question Answering on DROP | PaLM 540B (Self-Improvement, CoT Prompting) | Accuracy: 76.2 |
| Question Answering on DROP | PaLM 540B (Self-Improvement, Self-Consistency) | Accuracy: 83.0 |
| Question Answering on OpenBookQA | PaLM 540B (Standard Prompting) | Accuracy: 84.4 |
| Question Answering on OpenBookQA | PaLM 540B (CoT Prompting) | Accuracy: 86.4 |
| Question Answering on OpenBookQA | PaLM 540B (Self-Consistency) | Accuracy: 90.0 |
| Question Answering on OpenBookQA | PaLM 540B (Self-Improvement, Standard Prompting) | Accuracy: 92.0 |
| Question Answering on OpenBookQA | PaLM 540B (Self-Improvement, CoT Prompting) | Accuracy: 93.0 |
| Question Answering on OpenBookQA | PaLM 540B (Self-Improvement, Self-Consistency) | Accuracy: 94.4 |