H2O-Danube-1.8B Technical Report
Summary:
The report introduces H2O-Danube-1.8B, a 1.8B-parameter language model trained on 1T tokens following the core principles of Llama 2 and Mistral. The model achieves highly competitive results across a range of benchmarks and is openly available under the Apache 2.0 license. A chat variant, H2O-Danube-1.8B-Chat, trained with supervised fine-tuning followed by direct preference optimization (DPO), is also released.
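For reference, the preference stage of the chat variant uses DPO. Below is a minimal sketch of the standard DPO loss from Rafailov et al. (2023), not the authors' actual training code; the per-response log-probability tensors are assumed to be precomputed by summing token log-probabilities under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each response: beta-scaled log-ratio between
    # the policy and the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the probability that the chosen response outranks the
    # rejected one under a Bradley-Terry preference model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up summed log-probabilities:
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
```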
Major Findings:
- Model Architecture:
  - The model is based on the Llama 2 architecture and is trained on 1T tokens drawn from diverse sources.
  - Sliding window attention, rotary positional embeddings, and grouped-query attention are used to improve efficiency and performance (see the configuration sketch after this list).
- Training:
  - The model is trained on a single node of 8x H100 GPUs using the AdamW optimizer with a cosine learning rate scheduler (a minimal optimizer sketch also follows the list).
  - Hyperparameters are selected through preliminary experiments on data subsets, with model sizes of up to 500M parameters, before the full training run.
- Results:
  - H2O-Danube-1.8B achieves consistently strong results across the evaluated benchmarks compared to other models of similar size.
  - The chat model, H2O-Danube-1.8B-Chat, also performs strongly across categories in multi-turn conversation evaluation.
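The architectural techniques named above map directly onto a standard decoder-only configuration. Below is a minimal sketch using Hugging Face's `MistralConfig`, which natively supports rotary positional embeddings, grouped-query attention, and sliding window attention; the hyperparameter values are illustrative assumptions sized to land near 1.8B parameters, not the exact published configuration:

```python
from transformers import MistralConfig, MistralForCausalLM

# Illustrative ~1.8B decoder-only config combining the techniques the
# report names; the values are assumptions, not the published ones.
config = MistralConfig(
    vocab_size=32_000,            # Llama 2 tokenizer vocabulary size
    hidden_size=2560,
    intermediate_size=6912,
    num_hidden_layers=24,
    num_attention_heads=32,       # rotary embeddings applied per head
    num_key_value_heads=8,        # grouped-query attention: 8 KV groups
    sliding_window=4096,          # local sliding window attention
    max_position_embeddings=16_384,
)
model = MistralForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```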
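To make the optimizer setup concrete, here is a minimal sketch of AdamW with a cosine learning-rate schedule in PyTorch; the learning rate, betas, weight decay, step count, and the stand-in model are assumed values, not the report's actual training configuration:

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the actual 1.8B model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
total_steps = 1_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=2e-5)

for step in range(total_steps):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # dummy objective for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()               # decay the LR along a cosine curve
    optimizer.zero_grad()
```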
Analysis and Critique:
The report presents a comprehensive overview of the development and performance of the H2O-Danube-1.8B language model and its chat variant. The model's competitive performance across benchmarks and its open-source availability under the Apache 2.0 license are significant strengths. However, the report lacks a detailed discussion of potential limitations, unanswered questions, or biases in the model's behavior. Additionally, the evaluation of the chat model rests primarily on MT-Bench, which may not fully capture real-world capabilities. Further research and evaluation are needed to assess the model's performance in practical applications and to identify potential shortcomings.
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | https://arxiv.org/abs/2401.16818v1 |
| HTML | https://browse.arxiv.org/html/2401.16818v1 |
| Truncated | False |
| Word Count | 9626 |