the.com/subword

the piece of a word that's not quite a word but not quite nothing either.

means a chunk of text smaller than a word and often bigger than a letter, used by language models to turn any string into manageable, recyclable units.

from compression tricks in information theory, then drafted by NLP researchers around 2016 to solve the problem of rare and made-up words breaking fixed vocabularies.

for instance

gpt tokenizer splits "unfamiliar" into "un" + "familiar", roughly

bert wordpiece (google, 2018) uses a ## prefix for continuations

sentencepiece (google, 2018 tool) treats spaces as ordinary characters
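
a rough sketch of two of the ideas above, not any library's actual code: a greedy longest-match splitter that marks continuation pieces with ##, and a space marker in the spirit of sentencepiece. the function names and the toy vocabulary are invented for illustration; real tokenizers learn their vocabularies from large corpora and handle many more edge cases.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


SPACE_MARK = "\u2581"  # the visible stand-in character for a space


def mark_spaces(text):
    """Turn spaces into an ordinary, visible character."""
    return text.replace(" ", SPACE_MARK)


def unmark_spaces(text):
    """Undo mark_spaces, recovering the original string."""
    return text.replace(SPACE_MARK, " ")


# Toy vocabulary, invented for this example.
TOY_VOCAB = {"un", "familiar", "##familiar", "##s", "token", "##izer"}

print(wordpiece_split("unfamiliar", TOY_VOCAB))  # ['un', '##familiar']
print(wordpiece_split("tokenizers", TOY_VOCAB))  # ['token', '##izer', '##s']
print(wordpiece_split("xyz", TOY_VOCAB))         # ['[UNK]']
print(unmark_spaces(mark_spaces("new york")))    # new york
```

because the space survives as a regular symbol, the marked text can be cut anywhere and still be glued back into the exact original string.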

byte pair encoding
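
a minimal sketch of how byte pair encoding can learn its merges, assuming a tiny word list and character-level starting symbols. the function name learn_bpe and the sample words are made up for this example; production tokenizers typically start from bytes and add pre-tokenization rules on top.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


# Hypothetical sample words, chosen so the top pair is never tied.
print(learn_bpe(["hug", "hug", "hug", "pug", "hugs"], 2))
# [('u', 'g'), ('h', 'ug')]
```

each learned merge becomes one more reusable unit, which is why frequent words end up whole while rare ones stay in pieces.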
