Mamba Paper No Further a Mystery
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models.

When operating on byte-sized tokens, transformers scale poorly because every token must "attend" to every other token, leading to O(n²) scaling laws. As a result, Transformers opt for subword tokenization to reduce the number of tokens in text.
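To make that scaling argument concrete, here is a minimal sketch (not from the model's documentation; the sequence lengths and dimensions are arbitrary) showing that the attention score matrix has n × n entries, so compute and memory grow quadratically with the sequence length. This is why operating directly on raw bytes, which makes n much larger for the same text, is so costly for attention:

```python
# Minimal sketch: the attention score matrix is n x n, so cost grows as O(n^2).
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention scores for a single head."""
    d = q.shape[-1]
    return (q @ k.transpose(-2, -1)) / d ** 0.5  # shape: (n, n)

d_model = 64
for n in (256, 1024, 4096):  # sequence length in subword tokens or raw bytes
    q = torch.randn(n, d_model)
    k = torch.randn(n, d_model)
    scores = attention_scores(q, k)
    # Number of score entries grows roughly 16x each time n grows 4x.
    print(f"n={n:5d}  score matrix entries={scores.numel():>10d}")
```

Subword tokenization keeps this quadratic term manageable by shrinking n for the same text, while Mamba's selective state-space layers avoid it altogether by scaling linearly in sequence length.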