✂️Cut-Cross-Entropy-Pytorch

A PyTorch implementation of Cut-Cross-Entropy (CCE):

- GEMM
- LSE-style CCE Forward
- LSE CCE Backward
- Linear-Cross-Entropy backward

A blog post explaining CCE is available on zhihu.

Reference

Cut Your Losses in Large-Vocabulary Language Models
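
For orientation, the idea behind the LSE-style forward pass can be sketched in plain PyTorch. This is a minimal illustration, not this repository's kernels: the function name `chunked_linear_cross_entropy`, its arguments, and the `chunk_size` default are all hypothetical. It computes the loss as log-sum-exp over the vocabulary minus the target logit, processing the classifier in chunks so the full `(N, V)` logit matrix is never built at once.

```python
import torch


def chunked_linear_cross_entropy(e, c, targets, chunk_size=1024):
    """Mean cross-entropy of logits = e @ c.T, computed chunk by chunk.

    Hypothetical helper for illustration only.
    e: (N, D) embeddings, c: (V, D) classifier weights, targets: (N,) class ids.
    """
    # Logit of the correct class for each token: dot(e_i, c_{y_i}).
    target_logit = (e * c[targets]).sum(dim=-1)

    # Running log-sum-exp over the vocabulary, one block of classes at a time.
    lse = torch.full((e.shape[0],), float("-inf"), dtype=e.dtype, device=e.device)
    for start in range(0, c.shape[0], chunk_size):
        block_logits = e @ c[start:start + chunk_size].T  # (N, <= chunk_size)
        lse = torch.logaddexp(lse, torch.logsumexp(block_logits, dim=-1))

    # Cross-entropy per token is LSE minus the target logit.
    return (lse - target_logit).mean()
```

The sketch only covers the forward value. Under autograd, each chunk's logits are still retained for the backward pass, so the memory savings of a real CCE implementation depend on a custom backward that recomputes them.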