From 9eef954432cdfae128a9fd77e0faf91e8804f2fa Mon Sep 17 00:00:00 2001 From: Stella Biderman Date: Fri, 22 Dec 2023 03:36:50 -0500 Subject: [PATCH 01/10] Update README.md --- README.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 528447835..28bdbafa8 100644 --- a/README.md +++ b/README.md @@ -507,10 +507,12 @@ GPT-NeoX has been used by academic and industry researchers for a variety of hig EleutherAI and our collaborators have used it in the following publications: - Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, McDonell, Jason Phang, Michael Pieler, Prashanth, Shivanshu Purohit, Laria Reynolds, Jon Tow, Ben Wang, and Samuel Weinbach. "[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)." In *Proceedings of the ACL Workshop on Challenges \& Perspectives in Creating Large Language Models* (2022). - Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan et al. "[Pythia: A suite for analyzing large language models across training and scaling](https://arxiv.org/abs/2304.01373)." In _International Conference on Machine Learning_, pp. 2397-2430. PMLR (2023). - - Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. "[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https://arxiv.org/abs/2302.12433). *arXiv preprint arXiv:2302.12433* (2023). + - Zhangir Azerbayev, Bartosz Piotrowski, **Hailey Schoelkopf**, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. "[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https://arxiv.org/abs/2302.12433). *arXiv preprint arXiv:2302.12433* (2023). - Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. "[Emergent and predictable memorization in large language models.](https://arxiv.org/abs/2304.11158)" *arXiv preprint arXiv:2304.11158* (2023). - Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, and Sungho Park. "[A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models](https://arxiv.org/abs/2306.02254)." *arXiv preprint arXiv:2306.02254* (2023). - - Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats Leon Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. "[Continual Pre-Training of Large Language Models: How to re-warm your model?](https://arxiv.org/abs/2308.04014)" In _Workshop on Efficient Systems for Foundation Models @ ICML_ (2023). + - Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats Leon Richter, **Quentin Anthony**, Eugene Belilovsky, Irina Rish, and Timothée Lesort. "[Continual Pre-Training of Large Language Models: How to re-warm your model?](https://arxiv.org/abs/2308.04014)" In _Workshop on Efficient Systems for Foundation Models @ ICML_ (2023). + - **Zhangir Azerbayev**, **Hailey Schoelkopf**, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, **Stella Biderman**, and Sean Welleck. "[Llemma: An open language model for mathematics]([https://arxiv.org/abs/2308.04014](https://arxiv.org/abs/2310.10631))" In _Math-AI Workshop @ NeurIPS_ (2023). + - Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, **Stella Biderman**, **Quentin Anthony**, and **Louis Castricato**. 
"[trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback](https://aclanthology.org/2023.emnlp-main.530/)." _Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing_, 2023. ### External Publications The following publications by other research groups use this library: @@ -526,6 +528,12 @@ The following publications by other research groups use this library: - Jean Kaddour and Qi Liu. "[Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models](https://arxiv.org/abs/2310.01119)." _arXiv:2310.01119_ (2023). - Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. "[Efficient Online Data Mixing For Language Model Pre-Training](https://alon-albalak.github.io/images/Online_Data_Mixing.pdf)." _preprint_ (2023). - Eghbal A. Hosseini and Evelina Fedorenko. "[Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language](https://www.biorxiv.org/content/10.1101/2023.11.05.564832v1)." _bioRxiv_ (2023). +- Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun Shankar. "[FORGE: Pre-Training Open Foundation Models for Science](https://dl.acm.org/doi/abs/10.1145/3581784.3613215). _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, 1-13, 2023. +- Jean Kaddour and Qi Liu. "[Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models](https://arxiv.org/abs/2310.01119)." _arXiv preprint arXiv:2310.01119_, 2023. +- Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, and Xianying Zhu. "[CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model](https://arxiv.org/abs/2310.06266)." _arXiv preprint arXiv:2310.06266_, 2023. +- Nikitha Rao, Kush Jain, Uri Alon, Claire Le Goues, and Vincent J Hellendoorn. "[CAT-LM Training Language Models on Aligned Code And Tests](https://arxiv.org/abs/2310.01602)." _38th IEEE/ACM International Conference on Automated Software Engineering (ASE)_, pp. 409-420. IEEE, 2023. + + ### Models The following models were trained using this library: From a48e09e6e60409d3b49b553912d57406e0585e0f Mon Sep 17 00:00:00 2001 From: Stella Biderman Date: Fri, 22 Dec 2023 03:37:05 -0500 Subject: [PATCH 02/10] Update README.md --- README.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 528447835..28bdbafa8 100644 --- a/README.md +++ b/README.md @@ -507,10 +507,12 @@ GPT-NeoX has been used by academic and industry researchers for a variety of hig EleutherAI and our collaborators have used it in the following publications: - Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, McDonell, Jason Phang, Michael Pieler, Prashanth, Shivanshu Purohit, Laria Reynolds, Jon Tow, Ben Wang, and Samuel Weinbach. "[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)." In *Proceedings of the ACL Workshop on Challenges \& Perspectives in Creating Large Language Models* (2022). 
- Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan et al. "[Pythia: A suite for analyzing large language models across training and scaling](https://arxiv.org/abs/2304.01373)." In _International Conference on Machine Learning_, pp. 2397-2430. PMLR (2023). - - Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. "[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https://arxiv.org/abs/2302.12433). *arXiv preprint arXiv:2302.12433* (2023). + - Zhangir Azerbayev, Bartosz Piotrowski, **Hailey Schoelkopf**, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. "[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https://arxiv.org/abs/2302.12433). *arXiv preprint arXiv:2302.12433* (2023). - Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. "[Emergent and predictable memorization in large language models.](https://arxiv.org/abs/2304.11158)" *arXiv preprint arXiv:2304.11158* (2023). - Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, and Sungho Park. "[A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models](https://arxiv.org/abs/2306.02254)." *arXiv preprint arXiv:2306.02254* (2023). - - Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats Leon Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. "[Continual Pre-Training of Large Language Models: How to re-warm your model?](https://arxiv.org/abs/2308.04014)" In _Workshop on Efficient Systems for Foundation Models @ ICML_ (2023). + - Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats Leon Richter, **Quentin Anthony**, Eugene Belilovsky, Irina Rish, and Timothée Lesort. "[Continual Pre-Training of Large Language Models: How to re-warm your model?](https://arxiv.org/abs/2308.04014)" In _Workshop on Efficient Systems for Foundation Models @ ICML_ (2023). + - **Zhangir Azerbayev**, **Hailey Schoelkopf**, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, **Stella Biderman**, and Sean Welleck. "[Llemma: An open language model for mathematics]([https://arxiv.org/abs/2308.04014](https://arxiv.org/abs/2310.10631))" In _Math-AI Workshop @ NeurIPS_ (2023). + - Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, **Stella Biderman**, **Quentin Anthony**, and **Louis Castricato**. "[trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback](https://aclanthology.org/2023.emnlp-main.530/)." _Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing_, 2023. ### External Publications The following publications by other research groups use this library: @@ -526,6 +528,12 @@ The following publications by other research groups use this library: - Jean Kaddour and Qi Liu. "[Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models](https://arxiv.org/abs/2310.01119)." _arXiv:2310.01119_ (2023). - Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. "[Efficient Online Data Mixing For Language Model Pre-Training](https://alon-albalak.github.io/images/Online_Data_Mixing.pdf)." _preprint_ (2023). - Eghbal A. Hosseini and Evelina Fedorenko. 
"[Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language](https://www.biorxiv.org/content/10.1101/2023.11.05.564832v1)." _bioRxiv_ (2023). +- Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun Shankar. "[FORGE: Pre-Training Open Foundation Models for Science](https://dl.acm.org/doi/abs/10.1145/3581784.3613215). _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, 1-13, 2023. +- Jean Kaddour and Qi Liu. "[Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models](https://arxiv.org/abs/2310.01119)." _arXiv preprint arXiv:2310.01119_, 2023. +- Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, and Xianying Zhu. "[CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model](https://arxiv.org/abs/2310.06266)." _arXiv preprint arXiv:2310.06266_, 2023. +- Nikitha Rao, Kush Jain, Uri Alon, Claire Le Goues, and Vincent J Hellendoorn. "[CAT-LM Training Language Models on Aligned Code And Tests](https://arxiv.org/abs/2310.01602)." _38th IEEE/ACM International Conference on Automated Software Engineering (ASE)_, pp. 409-420. IEEE, 2023. + + ### Models The following models were trained using this library: From 613e5a62a491aded6d7a7f95eb38d49f066862ff Mon Sep 17 00:00:00 2001 From: github-actions Date: Fri, 22 Dec 2023 08:38:04 +0000 Subject: [PATCH 03/10] Update NeoXArgs docs automatically --- configs/neox_arguments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md index 6003d15cc..a5210bb52 100644 --- a/configs/neox_arguments.md +++ b/configs/neox_arguments.md @@ -111,7 +111,7 @@ Logging Arguments - **git_hash**: str - Default = a279fc8 + Default = a48e09e current git hash of repository From be7eeda60341f9e39354990d2629a5b8bec2fd4d Mon Sep 17 00:00:00 2001 From: Stella Biderman Date: Fri, 22 Dec 2023 03:56:47 -0500 Subject: [PATCH 04/10] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 28bdbafa8..1a3ea819e 100644 --- a/README.md +++ b/README.md @@ -523,6 +523,7 @@ The following publications by other research groups use this library: - Eghbal A. Hosseini, Martin A. Schrimpf, Yian Zhang, Samuel Bowman, Noga Zaslavsky, and Evelina Fedorenko. "[Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training.](https://www.biorxiv.org/content/10.1101/2022.10.04.510681)" _BioRxiv_ (2022). - Byung-Doh Oh and William Schuler. "[Transformer-Based LM Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens](https://arxiv.org/abs/2304.11389)." *arXiv preprint arXiv:2304.11389* (2023). - Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. "[Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis](https://aclanthology.org/2023.acl-long.756/)." In _Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)_, pp. 13522-13537 (2023). 
+- Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander Rudnicky, and Peter Ramadge. "[Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings](https://aclanthology.org/2023.acl-short.102/)." In _Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)_, pp. 13522-13537 (2023). - Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. "[ChessGPT: Bridging Policy Learning and Language Modeling.](https://arxiv.org/abs/2306.09200)" _arXiv preprint arXiv:2306.09200_ (2023). - Orion Walker Dollar, Sameera Horawalavithana, Scott Vasquez, W. James Pfaendtner, and Svitlana Volkova. "[MolJET: Multimodal Joint Embedding Transformer for Conditional de novo Molecular Design and Multi-Property Optimization.](https://openreview.net/pdf?id=7UudBVsIrr)" _preprint_ (2023). - Jean Kaddour and Qi Liu. "[Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models](https://arxiv.org/abs/2310.01119)." _arXiv:2310.01119_ (2023). From 2117afcf00aa8b92eb3dce5ae5f4405176b4e25a Mon Sep 17 00:00:00 2001 From: Stella Biderman Date: Fri, 22 Dec 2023 13:01:06 -0500 Subject: [PATCH 05/10] Update README.md --- README.md | 190 +++++++++++++++++++++++++++++------------------------- 1 file changed, 101 insertions(+), 89 deletions(-) diff --git a/README.md b/README.md index 28bdbafa8..eaaaad55e 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ # GPT-NeoX -This repository records [EleutherAI](https://www.eleuther.ai)'s library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's [Megatron Language Model](https://github.com/NVIDIA/Megatron-LM) and has been augmented with techniques from [DeepSpeed](https://www.deepspeed.ai) as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training. This library is in widespread use in [academic, industry, and government labs](https://github.com/EleutherAI/gpt-neox#adoption-and-publications), including by researchers at Oak Ridge National Lab, CarperAI, Stability AI, Carnegie Mellon University, and the University of Tokyo. Uniquely among similar libraries GPT-NeoX supports a wide variety of systems and hardwares, including launching via Slurm, MPI, and the IBM Job Step Manager, and has been run at scale on [AWS](https://aws.amazon.com/), [CoreWeave](https://www.coreweave.com/), [ORNL Summit](https://www.olcf.ornl.gov/summit/), [ORNL Frontier](https://www.olcf.ornl.gov/frontier/), [LUMI](https://www.lumi-supercomputer.eu/), and others. +This repository records [EleutherAI](https://www.eleuther.ai)'s library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's [Megatron Language Model](https://github.com/NVIDIA/Megatron-LM) and has been augmented with techniques from [DeepSpeed](https://www.deepspeed.ai) as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training. 
This library is in widespread use in [academic, industry, and government labs](https://github.com/EleutherAI/gpt-neox#adoption-and-publications), including by researchers at Oak Ridge National Lab, CarperAI, Stability AI, Together.ai, Korea University, Carnegie Mellon University, and the University of Tokyo, among others. Uniquely among similar libraries, GPT-NeoX supports a wide variety of systems and hardware, including launching via Slurm, MPI, and the IBM Job Step Manager, and has been run at scale on [AWS](https://aws.amazon.com/), [CoreWeave](https://www.coreweave.com/), [ORNL Summit](https://www.olcf.ornl.gov/summit/), [ORNL Frontier](https://www.olcf.ornl.gov/frontier/), [LUMI](https://www.lumi-supercomputer.eu/), and others.

**If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the Hugging Face `transformers` library instead, which supports GPT-NeoX models.**

GPT-NeoX leverages many of the same features and technologies as the popular Megatron-DeepSpeed library, but with substantially increased usability and novel optimizations. Major features include:
* Distributed training with ZeRO and 3D parallelism
* A wide variety of systems and hardware, including launching via Slurm, MPI, and the IBM Job Step Manager, and the ability to run at scale on [AWS](https://aws.amazon.com/), [CoreWeave](https://www.coreweave.com/), [ORNL Summit](https://www.olcf.ornl.gov/summit/), [ORNL Frontier](https://www.olcf.ornl.gov/frontier/), [LUMI](https://www.lumi-supercomputer.eu/), and others.
* Cutting-edge architectural innovations including rotary and alibi positional embeddings, parallel feedforward attention layers, and flash attention.
-* Predefined configurations for popular architectures including Pythia, PaLM, Falcon, and LLaMA 1 & 2
+* Predefined configurations for popular architectures including Pythia, PaLM, Falcon, and LLaMA 1 \& 2
* Curriculum Learning
* Easy connections with the open source ecosystem, including Hugging Face's [tokenizers](https://github.com/huggingface/tokenizers) and [transformers](https://github.com/huggingface/transformers/) libraries, logging via [WandB](https://wandb.ai/site), and evaluation via our [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).
@@ -39,27 +39,43 @@ Prior to 3/9/2023, GPT-NeoX relied on [DeeperSpeed](https://github.com/EleutherA # Contents -* [Quick Start](#quick-start) +- [GPT-NeoX](#gpt-neox) + * [Why GPT-NeoX?](#why-gpt-neox) + * [News](#news) + * [Versions](#versions) +- [Contents](#contents) +- [Quick Start](#quick-start) * [Environment and Dependencies](#environment-and-dependencies) + + [Host Setup](#host-setup) + + [Flash Attention](#flash-attention) + + [Multi-Node Launching](#multi-node-launching) + + [Containerized Setup](#containerized-setup) * [Usage](#usage) -* [Configuration](#configuration) -* [Datasets](#datasets) +- [Configuration](#configuration) +- [Datasets](#datasets) * [Preconfigured Datasets](#preconfigured-datasets) * [Using Custom Data](#using-custom-data) -* [Training and Finetuning](#training-and-finetuning) - * [Select Pretrained Models](#pretrained-models) - * [GPT-NeoX-20B](#gpt-neox-20b) - * [Pythia](#pythia) - * [Polyglot](#polyglot) -* [Inference](#inference) -* [Evaluation](#evaluation) -* [Exporting to Hugging Face](#exporting-to-hugging-face) -* [Monitoring](#monitoring) - * [Weights & Biases](#wandb) +- [Training and Finetuning](#training-and-finetuning) + * [Pretrained Models](#pretrained-models) + + [GPT-NeoX-20B](#gpt-neox-20b) + + [Pythia](#pythia) + + [Polyglot](#polyglot) +- [Inference](#inference) +- [Evaluation](#evaluation) +- [Exporting to Hugging Face](#exporting-to-hugging-face) +- [Monitoring](#monitoring) + * [Weights and Biases](#weights-and-biases) * [TensorBoard](#tensorboard) -* [Administrative Notes](#administrative-notes) +- [Running on multi-node](#running-on-multi-node) +- [Adoption and Publications](#adoption-and-publications) + * [Publications](#publications) + * [Models](#models) + + [English LLMs](#english-llms) + + [Non-English LLMs](#non-english-llms) + + [Code Models](#code-models) + + [Other Modalities](#other-modalities) +- [Administrative Notes](#administrative-notes) * [Citing GPT-NeoX](#citing-gpt-neox) - * [Adoption and Publications](#adoption-and-publications) * [Licensing](#licensing) * [Acknowledgements](#acknowledgements) @@ -452,7 +468,7 @@ Note, however, that this compatibility is not one-to-one, and only certain confi In addition to storing logs locally, we provide built-in support for two popular experiment monitoring frameworks: [Weights & Biases](https://wandb.ai/site) and [TensorBoard](https://www.tensorflow.org/tensorboard/) -
## Weights & Biases
+## Weights and Biases EleutherAI is currently using [Weights & Biases to record our experiments](https://wandb.ai/eleutherai/neox). If you are logged into Weights & Biases on your machine—you can do this by executing `wandb login`—your runs will automatically be recorded. There are two optional fields associated with Weights & Biases: wandb_group allows you to name the run group and wandb_team allows you to assign your runs to an organization or team account. @@ -464,6 +480,73 @@ We also support using TensorBoard via the tensorboard-dir Date: Fri, 22 Dec 2023 18:01:19 +0000 Subject: [PATCH 06/10] Update NeoXArgs docs automatically --- configs/neox_arguments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md index a5210bb52..c05606e05 100644 --- a/configs/neox_arguments.md +++ b/configs/neox_arguments.md @@ -111,7 +111,7 @@ Logging Arguments - **git_hash**: str - Default = a48e09e + Default = 2117afc current git hash of repository From f161245b7c3848811f7e8e092977fbaeea12d283 Mon Sep 17 00:00:00 2001 From: Lintang Sutawika Date: Sat, 23 Dec 2023 01:05:34 +0700 Subject: [PATCH 07/10] Add QK Normalization (#1100) * add qk normalization * Update NeoXArgs docs automatically * Update NeoXArgs docs automatically --------- Co-authored-by: github-actions Co-authored-by: Quentin Anthony --- configs/neox_arguments.md | 10 +++++++++- megatron/model/transformer.py | 15 +++++++++++++++ megatron/neox_arguments/neox_args.py | 5 +++++ 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md index 6003d15cc..722756d6f 100644 --- a/configs/neox_arguments.md +++ b/configs/neox_arguments.md @@ -111,7 +111,7 @@ Logging Arguments - **git_hash**: str - Default = a279fc8 + Default = 1fc0521 current git hash of repository @@ -261,6 +261,14 @@ Model Arguments +- **use_qk_layernorm**: bool + + Default = False + + Use QK Normalization + + + - **layernorm_epsilon**: float Default = 1e-05 diff --git a/megatron/model/transformer.py b/megatron/model/transformer.py index 63f4122e2..195e57925 100644 --- a/megatron/model/transformer.py +++ b/megatron/model/transformer.py @@ -284,6 +284,16 @@ def __init__( neox_args.num_attention_heads, world_size ) self.pos_emb = neox_args.pos_emb + self.use_qk_layernorm = neox_args.use_qk_layernorm + if self.use_qk_layernorm: + norm, eps = get_norm(neox_args) + self.qk_layernorm = norm( + [ + self.num_attention_heads_per_partition, + self.hidden_size_per_attention_head, + ], + eps=eps, + ) # Strided linear layer. self.query_key_value = mpu.ColumnParallelLinear( @@ -639,6 +649,11 @@ def forward(self, hidden_states, attention_mask, layer_past=None): mixed_x_layer, 3 ) + # QK Normalization https://arxiv.org/abs/2302.05442 + if self.use_qk_layernorm: + query_layer = self.qk_layernorm(query_layer) + key_layer = self.qk_layernorm(key_layer) + if exists(self.rotary_emb): if exists(self.rotary_ndims): # partial rotary diff --git a/megatron/neox_arguments/neox_args.py b/megatron/neox_arguments/neox_args.py index 324a379d4..2cfed465d 100644 --- a/megatron/neox_arguments/neox_args.py +++ b/megatron/neox_arguments/neox_args.py @@ -125,6 +125,11 @@ class NeoXArgsModel(NeoXArgsTemplate): Normalization layer to use. Choose from "layernorm", "rmsnorm", "scalenorm". """ + use_qk_layernorm: bool = False + """ + Use QK Normalization + """ + layernorm_epsilon: float = 1.0e-5 """ Layer norm epsilon. 
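For readers skimming the fragments above: the change applies a LayerNorm to the query and key projections before the attention dot product ("QK normalization", https://arxiv.org/abs/2302.05442), which keeps attention logits from growing uncontrollably as models scale. Below is a minimal, self-contained sketch of the idea in plain PyTorch. It is illustrative only and not the GPT-NeoX code path: it assumes a single attention head and plain `nn.LayerNorm`, whereas the patch normalizes per attention head using whatever norm class `get_norm(neox_args)` returns.

```python
# Illustrative sketch of QK normalization; not the GPT-NeoX implementation.
# Assumptions: single attention head, plain nn.LayerNorm, no masking or dropout.
import torch
import torch.nn as nn


class QKNormSelfAttention(nn.Module):
    """Single-head self-attention with a shared LayerNorm over queries and keys."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        # One norm module applied to both q and k, mirroring the shared
        # qk_layernorm added to megatron/model/transformer.py in the patch above.
        self.qk_norm = nn.LayerNorm(hidden_size, eps=eps)
        self.scale = hidden_size ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # The only difference from vanilla attention: normalize q and k
        # before computing the attention scores.
        q, k = self.qk_norm(q), self.qk_norm(k)
        scores = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out(scores @ v)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)  # (batch, sequence, hidden)
    print(QKNormSelfAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```

In GPT-NeoX itself the feature is opt-in: the `use_qk_layernorm` argument added in `megatron/neox_arguments/neox_args.py` defaults to `False`, and the normalization is applied only when that flag is enabled in the model configuration.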
From 7fb3b3c79bc460c12310af042e2ab9883e964af9 Mon Sep 17 00:00:00 2001 From: Stella Biderman Date: Fri, 22 Dec 2023 13:07:32 -0500 Subject: [PATCH 08/10] Update README.md --- README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/README.md b/README.md index eaaaad55e..a8f6f7c1e 100644 --- a/README.md +++ b/README.md @@ -577,8 +577,6 @@ To cite the 20 billion parameter model named `GPT-NeoX-20B`, please use } ``` -Citation instructions for other pretrained models can be found [in the appropriate repository](#pretrained-models). - ## Licensing This repository hosts code that is part of EleutherAI's GPT-NeoX project. Copyright (c) 2021, EleutherAI. Licensed under the Apache License: From a7509f0e076152036ce5f3e534a153ff2022c718 Mon Sep 17 00:00:00 2001 From: Stella Biderman Date: Fri, 22 Dec 2023 13:14:14 -0500 Subject: [PATCH 09/10] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a8f6f7c1e..a4724c882 100644 --- a/README.md +++ b/README.md @@ -526,7 +526,7 @@ The following models were trained using this library: - Together.ai's [RedPajama-INCITE (3B and 7B)](https://together.ai/blog/redpajama-models-v1) - Carnegie Mellon University's [proofGPT (1.3B and 6.7B)](https://huggingface.co/hoskinson-center/proofGPT-v0.1-6.7B) - Dampish's [StellarX (2.8B and 4B)](https://huggingface.co/Dampish/StellarX-4B-V0.2) -- Oak Ridge National Lab's [FORGE (26B)](https://dl.acm.org/doi/10.1145/3581784.3613215) +- Oak Ridge National Lab's [FORGE (26B)](https://github.com/at-aaims/forge) ### Non-English LLMs - EleutherAI's [Polyglot-Ko (1.3B through 12.8B)](https://github.com/EleutherAI/polyglot) (Korean) From 4d5a8115752b342aa922cf406dae4d13a7a056c0 Mon Sep 17 00:00:00 2001 From: github-actions Date: Fri, 22 Dec 2023 18:15:21 +0000 Subject: [PATCH 10/10] Update NeoXArgs docs automatically --- configs/neox_arguments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md index 4dacafa0a..06656d3d8 100644 --- a/configs/neox_arguments.md +++ b/configs/neox_arguments.md @@ -111,7 +111,7 @@ Logging Arguments - **git_hash**: str - Default = 2117afc + Default = 8eaac4e current git hash of repository