H2O-Danube-1.8B Technical Report

H2O-Danube-1.8B: Highly competitive language model trained on 1T tokens, openly available.
Authors

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

Published

January 30, 2024

Summary:

The article introduces H2O-Danube-1.8B, a 1.8B-parameter language model trained on 1T tokens following the core principles of Llama 2 and Mistral. The model exhibits highly competitive metrics across various benchmarks and is openly available under the Apache 2.0 license. Additionally, a chat model trained with supervised fine-tuning followed by direct preference optimization (DPO) is released.
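The preference-tuning step behind the chat variant can be illustrated with the standard DPO objective. The snippet below is a minimal sketch of that loss, not the authors' actual training code; the per-sequence log-probabilities and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on summed per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding log p(y | x) for the
    chosen / rejected responses under the trained policy or the frozen
    reference model. beta=0.1 is an illustrative default, not the value used
    for H2O-Danube-1.8B-Chat.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Maximize the margin by which the policy prefers the chosen response
    # relative to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```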

Major Findings:

  1. Model Architecture:
    • The model is based on the Llama 2 architecture and is trained on 1T tokens from diverse sources.
    • Techniques such as sliding window attention, rotary positional embeddings, and grouped-query attention are used to enhance the model’s efficiency and performance (see the configuration sketch after this list).
  2. Training:
    • The model is trained on a single node of 8×H100 GPUs using the AdamW optimizer with a cosine learning rate scheduler (a minimal training-loop sketch also follows this list).
    • Optimal settings are identified through preliminary runs on subsets of the data with model sizes of up to 500M parameters.
  3. Results:
    • H2O-Danube-1.8B exhibits consistently strong results across all benchmarks compared to other models of similar size.
    • The chat model, H2O-Danube-1.8B-Chat, also shows strong performance across MT-Bench categories in multi-turn conversation evaluation.
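To make the architecture bullet concrete, the following is a minimal sketch of how such a model could be instantiated with the Hugging Face `transformers` Mistral implementation, which already combines rotary positional embeddings, grouped-query attention, and sliding window attention. All hyperparameter values are illustrative placeholders chosen to land near 1.8B parameters, not the published H2O-Danube-1.8B configuration.

```python
from transformers import MistralConfig, MistralForCausalLM

# Illustrative sketch only: placeholder hyperparameters, not the paper's values.
config = MistralConfig(
    vocab_size=32000,
    hidden_size=2560,             # model width (assumed)
    intermediate_size=6912,       # MLP width (assumed)
    num_hidden_layers=24,         # depth (assumed)
    num_attention_heads=32,       # query heads (assumed)
    num_key_value_heads=8,        # grouped-query attention: fewer KV heads
    sliding_window=4096,          # sliding window attention span (assumed)
    rope_theta=10000.0,           # rotary positional embedding base
    max_position_embeddings=16384,
)
model = MistralForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

Likewise, the AdamW-plus-cosine-schedule setup mentioned under Training can be sketched as below. The learning rate, warmup steps, weight decay, and gradient clipping values are assumptions for illustration, not the paper's reported settings; `model` comes from the sketch above and `dataloader` is a hypothetical iterator over tokenized batches.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Assumed values for illustration; the paper's exact hyperparameters differ.
max_steps = 100_000
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=max_steps)

for step, batch in enumerate(dataloader):      # batches of token IDs + labels
    loss = model(**batch).loss                 # causal language-modeling loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step >= max_steps:
        break
```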

Analysis and Critique:

The article presents a comprehensive overview of the development and performance of the H2O-Danube-1.8B language model and its chat variant. The model’s competitive performance across various benchmarks and its open-source availability under the Apache 2.0 license are significant strengths. However, the article lacks a detailed discussion of potential limitations, unanswered questions, or biases in the model’s performance. Additionally, the evaluation of the chat model’s performance is primarily based on MT-Bench, which may not fully capture its real-world capabilities. Further research and evaluation are needed to assess the model’s performance in practical applications and to identify any potential shortcomings.

Appendix

Model: gpt-3.5-turbo-1106
Date Generated: 2024-02-26
Abstract: https://arxiv.org/abs/2401.16818v1
HTML: https://browse.arxiv.org/html/2401.16818v1
Truncated: False
Word Count: 9626