This repo contains transformer training and inference code written in C++ and CUDA.
TScale is designed to run on consumer hardware. To achieve the best results it features:
- Optimized transformer architecture with faster convergence and ~2x reduced attention costs
- Support for fp8 and int8 precision for model weights and activations
- Optimizations for consumer NVIDIA GPUs, including fast reduced-precision training without sacrificing model quality
- CPU offload, which reduces GPU memory requirements for training
- Synchronous distributed training across several identically configured hosts
- 1-bit gradient compression, which makes regular Ethernet links sufficient as the interconnect (see the sketch after this list)
- Asynchronous distributed training on arbitrary hosts with negligible network traffic; in this mode training can run on geographically separated hosts
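This README does not document TScale's exact compression scheme, but 1-bit gradient compression is commonly implemented as sign quantization with a per-tensor scale plus an error-feedback buffer that carries the quantization error into the next step. The sketch below is a minimal illustration of that general technique under those assumptions; it is not TScale's actual code, and all names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal 1-bit gradient compression sketch (sign + per-tensor scale),
// with error feedback so quantization error is carried to the next step.
// Illustrative only; TScale's actual scheme may differ.
struct CompressedGrad {
    float scale;                 // mean |g|, used to reconstruct magnitudes
    std::vector<uint8_t> bits;   // one sign bit per gradient element
};

// residual must have the same size as grad and start zero-initialized.
CompressedGrad Compress(std::vector<float>& grad, std::vector<float>& residual) {
    CompressedGrad out;
    double sumAbs = 0;
    for (size_t i = 0; i < grad.size(); ++i) {
        grad[i] += residual[i];          // apply error feedback
        sumAbs += std::fabs(grad[i]);
    }
    out.scale = static_cast<float>(sumAbs / grad.size());
    out.bits.assign((grad.size() + 7) / 8, 0);
    for (size_t i = 0; i < grad.size(); ++i) {
        float q = (grad[i] >= 0 ? out.scale : -out.scale);
        if (grad[i] >= 0)
            out.bits[i / 8] |= (1u << (i % 8));
        residual[i] = grad[i] - q;       // remember what the 1-bit code lost
    }
    return out;
}

std::vector<float> Decompress(const CompressedGrad& c, size_t n) {
    std::vector<float> grad(n);
    for (size_t i = 0; i < n; ++i)
        grad[i] = ((c.bits[i / 8] >> (i % 8)) & 1) ? c.scale : -c.scale;
    return grad;
}
```

Each element travels as one sign bit plus a shared scale, a roughly 32x reduction versus fp32 gradients, which is what makes commodity Ethernet viable as an interconnect.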
By using inexpensive GPUs and the async distributed mode, TScale trains LLMs fast and affordably. Log loss for the 1.5B model trained on fineweb-edu for 2 days and $500 on several spot instances with 4090s:

[log-loss curve plot]
Training a 1T (!) model in your kitchen

A 1T model sounds beyond reach for most people and even organisations. However, with a creative way of counting model size, nothing is impossible: here we build a 1T index that is looked up for every token, so the prediction itself is made by a much smaller model. In terms of log loss and perplexity this construction easily achieves stellar results. The index for fineweb-edu occupies about 1T of disk space, and a training run of a 125M model with this ~1T index achieves an ~8x perplexity reduction:

Model | Perplexity
---|---
125M | 19.02
125M + 1T index | 2.28

See also the notes on model and compute precision.
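This README does not spell out how the index is combined with the small model. One well-known construction with the same flavor is kNN-LM-style interpolation: the model's next-token distribution is mixed with a distribution derived from nearest-neighbor lookups in a large datastore. The sketch below illustrates only that mixing step, assuming both distributions are already computed; it is hypothetical and not TScale's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kNN-LM-style mixing: interpolate the small model's
// next-token distribution with one derived from index lookups.
// Illustrative only; TScale's 1T index mechanism is not documented here.
std::vector<float> MixWithIndex(
    const std::vector<float>& modelProbs,  // p_model(token), sums to 1
    const std::vector<float>& indexProbs,  // p_index(token) from neighbor votes
    float lambda)                          // weight of the index, e.g. 0.5
{
    std::vector<float> mixed(modelProbs.size());
    for (size_t t = 0; t < modelProbs.size(); ++t)
        mixed[t] = (1.0f - lambda) * modelProbs[t] + lambda * indexProbs[t];
    return mixed;  // convex combination of two distributions, still valid
}
```

Because the result is a convex combination of two valid distributions, it is itself a valid distribution, so the small model's sampling machinery can be reused unchanged.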
To build the code, CUDA v12.3 and a C++ compiler are required: MSVC on Windows, cmake plus clang on Linux. For cross-platform build file generation this repo uses fo, a lightweight solution/build file generator. To generate build files, compile fo/fo.cpp and run it with two arguments: the root of the source tree and the directory to store the build files in.
D:\TScale>fo.exe code sln
Then open D:\TScale\sln\code.sln.
To compile TScale for Linux, compile fo.cpp, generate the CMakeLists.txt file, run cmake, then run make:
~/TScale/fo$ clang++-17 fo.cpp -o fo
~/TScale/fo$ cd ..
~/TScale$ ./fo/fo code make.dir
~/TScale$ cd make.dir
~/TScale/make.dir$ cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo .
~/TScale/make.dir$ make
Examples in the code use the enwik9 dataset and its truncated version enwik8. The examples also use the Hugging Face-hosted datasets openwebtext, ontocord/CulturaY, and danasone/librusec. To import them, use hf_import.
gpt_train is used to train a model. It is controlled by a train script and a data script; the default scripts are stored in main_gpt.cpp. To load the scripts from files, run gpt_train with the '-d data_script.txt -s train_script.txt' arguments.
Compile gpt-train. Run it in the root directory:
~/TScale$ ./make.dir/gpt-train
Currently, training can be distributed only across a power-of-two number of worker hosts.
To start a worker process, run gpt_train with the '-w 10000' argument; 10000 specifies the port number to use.
To run the master process, call the net_train('worker.txt') function in the train script, and list the worker IP addresses in the file passed to net_train().
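The README only says that this file lists worker IP addresses, so the exact format is an assumption; a minimal worker.txt for two hosts might look like this (addresses are hypothetical, one per line):

```
192.168.1.10
192.168.1.11
```

Each listed host should already be running a worker process started with the -w argument so the master can connect to it.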
To use multiple GPU devices, set the DEVICE_COUNT variable in the train script to the number of GPUs to use. For distributed runs DEVICE_COUNT applies to each worker; heterogeneous configurations are not supported.
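Only the DEVICE_COUNT variable and the net_train('worker.txt') call are named in this README; the assignment syntax below is an assumption, so treat this fragment as a hypothetical sketch of a distributed train-script excerpt rather than a verbatim one:

```
DEVICE_COUNT = 4
net_train('worker.txt')
```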
Descriptions of the scripts used in training: data script, train script.
To try inference with a trained model, use gpt_infer. It runs a basic HTTP server on port 11311 and allows sampling continuations from the model. The current implementation is slow and intended for demonstration purposes only.
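"Sampling continuations" here means repeatedly drawing the next token from the model's output distribution and feeding it back in. As background, a minimal temperature-based sampler over a logits vector might look like the sketch below; it is illustrative only, not gpt_infer's actual code, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Hypothetical temperature sampler over a logits vector; illustrative
// background for "sampling continuations", not gpt_infer's actual code.
int SampleToken(const std::vector<float>& logits, float temperature,
                std::mt19937& rng) {
    // Exponentiate scaled logits; subtract the max for numerical stability.
    float maxLogit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> weights(logits.size());
    for (size_t i = 0; i < logits.size(); ++i)
        weights[i] = std::exp((logits[i] - maxLogit) / temperature);
    // discrete_distribution normalizes the weights and draws an index
    // with probability proportional to its weight.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Calling this once per step and appending each sampled token to the context produces a continuation; lower temperatures make the sampling greedier.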
License: MIT