Character-Level Tokenizer (stoi/itos/BOS)

#374 · Deep Learning · Easy

⊣ Solve on deep-ml.com

Problem

Implement a character-level tokenizer with string-to-index (stoi) and index-to-string (itos) mappings. Include a special BOS (beginning of sequence) token. The tokenizer should encode text to integer sequences and decode back.

Solution

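class CharTokenizer:
    """Character-level tokenizer with a reserved BOS token at index 0."""

    def __init__(self, text: str, bos_token: str = "<BOS>"):
        self.bos_token = bos_token
        # Unique characters, sorted by Unicode code point for a deterministic vocabulary.
        chars = sorted(set(text))
        # Index 0 is reserved for BOS; characters take indices 1..len(chars).
        self.stoi = {bos_token: 0}
        for i, ch in enumerate(chars):
            self.stoi[ch] = i + 1
        # Inverse mapping: index -> token string.
        self.itos = {v: k for k, v in self.stoi.items()}
        self.vocab_size = len(self.stoi)

    def encode(self, text: str, add_bos: bool = True) -> list[int]:
        # Raises KeyError if text contains a character not seen at construction.
        tokens = []
        if add_bos:
            tokens.append(self.stoi[self.bos_token])
        for ch in text:
            tokens.append(self.stoi[ch])
        return tokens

    def decode(self, tokens: list[int]) -> str:
        # Unknown indices are skipped; BOS decodes to its literal string.
        return "".join(self.itos[t] for t in tokens if t in self.itos)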

Explanation

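  1. Build the vocabulary from all unique characters in the training text, sorted by Unicode code point (so, for example, uppercase letters come before lowercase ones). Sorting makes the index assignment deterministic.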
  2. Reserve index 0 for the BOS (beginning of sequence) special token.
  3. Create bidirectional mappings: stoi maps characters to indices, itos maps indices back to characters.
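  4. encode converts text to a list of integer indices, optionally prepending BOS; it raises KeyError for any character that was not in the training text. decode converts indices back to a string, silently skipping indices that are not in itos. Because BOS decodes to its literal string, decode(encode(text)) with add_bos=True returns "<BOS>" followed by the text, not the text alone.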
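
The steps above can be traced by hand on a small input. The following is a minimal sketch that rebuilds the same mappings with plain dicts, so the index assignment is visible without the class. The example text "hello" and the final BOS-stripping variant are assumptions added for illustration; they are not part of the solution above.

```python
# Illustrative sketch: the same mappings built with plain dicts for a tiny example text.
text = "hello"          # example input, chosen for illustration
bos_token = "<BOS>"

chars = sorted(set(text))                 # ['e', 'h', 'l', 'o']
stoi = {bos_token: 0}                     # index 0 reserved for BOS
stoi.update({ch: i + 1 for i, ch in enumerate(chars)})
itos = {i: s for s, i in stoi.items()}    # inverse mapping

# Encode with BOS prepended, mirroring encode(text, add_bos=True).
ids = [stoi[bos_token]] + [stoi[ch] for ch in text]
print(ids)                                # [0, 2, 1, 3, 3, 4]

# Decode exactly as the solution does: BOS comes back as its literal string.
decoded = "".join(itos[i] for i in ids if i in itos)
print(decoded)                            # <BOS>hello

# Hypothetical variant: drop the BOS index before joining to recover the original text.
round_trip = "".join(itos[i] for i in ids if i in itos and i != stoi[bos_token])
print(round_trip)                         # hello
```

Whether decode should keep or drop the BOS string depends on what the caller expects; the solution above keeps it, so callers who need an exact round trip can encode with add_bos=False or filter index 0 as in the last line.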

Complexity

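  • Time: O(n) for encoding or decoding a sequence of length n; building the vocabulary takes O(N + V log V) for a training text of length N, since the V unique characters are sorted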
  • Space: O(V) for the vocabulary mappings where V is the number of unique characters