treating text as a raw stream of letters instead of pretending words are the atom of meaning.
means a way of processing or modeling language where the fundamental unit is the individual character, not the word or subword token.
from comes from early NLP and compression research where text had to be handled byte by byte before tokenizers existed; char-level neural nets got famous around 2015 when karpathy's char-rnn wrote fake shakespeare one letter at a time.
karpathy char-rnn — 2015 project generating fake shakespeare and linux code letter by letter
byte-level bpe gpt2 — falls back to raw bytes so no input is ever unknown
canine model — google's 2021 tokenizer-free encoder operating on unicode codepoints
charcnn — 2016 paper using char-level convnets for text classification