Here's how OpenAI Token count is computed in Tiktokenizer - Part 3

In this article, we continue our review of how the OpenAI token count is computed in Tiktokenizer. We will look at:

  1. OpenSourceTokenizer class

For more context, read part 2.

OpenSourceTokenizer class

In tiktokenizer/src/models/tokenizer.ts, at line 82, you will find the following code:

export class OpenSourceTokenizer implements Tokenizer {
  constructor(private tokenizer: PreTrainedTokenizer, name?: string) {
    this.name = name ?? tokenizer.name;
  }

  name: string;

  static async load(
    model: z.infer<typeof openSourceModels>
  ): Promise<PreTrainedTokenizer> {
    // use current host as proxy if we're running on the client
    if (typeof window !== "undefined") {
      env.remoteHost = window.location.origin;
    }
    env.remotePathTemplate = "/hf/{model}";
    // Set to false for testing!
    // env.useBrowserCache = false;
    const t = await PreTrainedTokenizer.from_pretrained(model, {
      progress_callback: (progress: any) =>
        console.log(`loading "${model}"`, progress),
    });
    console.log("loaded tokenizer", model, t.name);
    return t;
  }

  // tokenize() is covered in a later section
}

This class implements Tokenizer, an interface defined in the same file:

export interface Tokenizer {
  name: string;
  tokenize(text: string): TokenizerResult;
  free?(): void;
}
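To make the interface concrete, here is a hypothetical toy implementation that satisfies the same shape, using naive whitespace splitting instead of a real tokenizer. The TokenizerResult here is simplified (segments are omitted); it is only a sketch of the contract, not tiktokenizer's actual code.

```typescript
// Simplified result shape, assumed for illustration (segments omitted).
interface TokenizerResult {
  name: string;
  tokens: number[];
  count: number;
}

interface Tokenizer {
  name: string;
  tokenize(text: string): TokenizerResult;
  free?(): void;
}

// Hypothetical toy tokenizer: splits on whitespace and assigns each
// distinct word an integer id, mimicking token ids.
class WhitespaceTokenizer implements Tokenizer {
  name = "whitespace";

  tokenize(text: string): TokenizerResult {
    const words = text.split(/\s+/).filter(Boolean);
    const ids = new Map<string, number>();
    const tokens = words.map((w) => {
      if (!ids.has(w)) ids.set(w, ids.size);
      return ids.get(w)!;
    });
    return { name: this.name, tokens, count: tokens.length };
  }
}
```

Any class conforming to this interface can be swapped in by the UI, which is why both the tiktoken-backed and Hugging Face-backed tokenizers implement it.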

constructor

The constructor has the following code:

 constructor(private tokenizer: PreTrainedTokenizer, name?: string) {
    this.name = name ?? tokenizer.name;
  }

This constructor only sets this.name, preferring the explicit name argument and falling back to tokenizer.name.
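The fallback relies on the nullish coalescing operator: `??` takes the right-hand side only when the left-hand side is null or undefined. A small standalone sketch (the names here are made up for illustration):

```typescript
// `??` falls back only on null/undefined, so an explicit name wins
// over the tokenizer's own name.
const tokenizerName = "gpt2";

const explicitName: string | undefined = "gpt2-custom";
const chosen = explicitName ?? tokenizerName;   // explicit name wins

const missing: string | undefined = undefined;
const fallback = missing ?? tokenizerName;      // falls back to tokenizerName
```

Note that, unlike `||`, the `??` operator would keep an empty string `""` as a valid name rather than falling through.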

The tokenizer parameter is of type PreTrainedTokenizer, which is imported as shown below:

import { PreTrainedTokenizer, env } from "@xenova/transformers";

static load

The OpenSourceTokenizer class has a static method named load, which contains the following code:


  static async load(
    model: z.infer<typeof openSourceModels>
  ): Promise<PreTrainedTokenizer> {
    // use current host as proxy if we're running on the client
    if (typeof window !== "undefined") {
      env.remoteHost = window.location.origin;
    }
    env.remotePathTemplate = "/hf/{model}";
    // Set to false for testing!
    // env.useBrowserCache = false;
    const t = await PreTrainedTokenizer.from_pretrained(model, {
      progress_callback: (progress: any) =>
        console.log(`loading "${model}"`, progress),
    });
    console.log("loaded tokenizer", model, t.name);
    return t;
  }

This function returns a variable named t, which is assigned the value returned by PreTrainedTokenizer.from_pretrained, as shown below:

const t = await PreTrainedTokenizer.from_pretrained(model, {
  progress_callback: (progress: any) =>
  console.log(`loading "${model}"`, progress),
});
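Before that call, load() points Transformers.js at the current origin (env.remoteHost) and a "/hf/{model}" path template, so model files are fetched through the app's own server rather than directly from Hugging Face. The function below is a hypothetical sketch of how such a template could resolve to a URL; the real resolution logic lives inside @xenova/transformers.

```typescript
// Hypothetical sketch of path-template resolution: substitute {model}
// into the template and prepend the remote host. Illustrative only,
// not the library's actual implementation.
function resolveModelUrl(
  remoteHost: string,
  remotePathTemplate: string,
  model: string,
  file: string
): string {
  const path = remotePathTemplate.replace("{model}", model);
  return `${remoteHost}${path}/${file}`;
}

const url = resolveModelUrl(
  "https://example.com",
  "/hf/{model}",
  "gpt2",
  "tokenizer.json"
);
```

Proxying through the app's origin avoids CORS issues and lets the server cache or restrict which model files are served.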

tokenize

tokenize has the following code:

tokenize(text: string): TokenizerResult {
    // const tokens = this.tokenizer(text);
    const tokens = this.tokenizer.encode(text);
    const removeFirstToken = (
      hackModelsRemoveFirstToken.options as string[]
    ).includes(this.name);
    return {
      name: this.name,
      tokens,
      segments: getHuggingfaceSegments(this.tokenizer, text, removeFirstToken),
      count: tokens.length,
    };
  }

It returns an object containing name, tokens, segments, and count, which is the same shape as the object returned by TiktokenTokenizer at line 26.
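Because both tokenizers return the same result shape, calling code can consume either one uniformly. A hypothetical consumer (the names and sample token ids below are made up for illustration):

```typescript
// Minimal result shape shared by both tokenizers (segments omitted
// here for brevity).
interface TokenizerResultLike {
  name: string;
  tokens: number[];
  count: number;
}

// Hypothetical UI helper: works with results from either tokenizer,
// since it only depends on the shared shape.
function describe(result: TokenizerResultLike): string {
  return `${result.name}: ${result.count} tokens`;
}

const sample: TokenizerResultLike = {
  name: "gpt2",
  tokens: [15496, 995],
  count: 2,
};
const summary = describe(sample);
```

This is the payoff of the Tokenizer interface: the UI never needs to know whether a result came from tiktoken or from a Hugging Face tokenizer.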

About me:

Hey, my name is Ramu Narasinga. I study codebase architecture in large open-source projects.

Email: [email protected]

Want to learn from open-source projects? Solve challenges inspired by open-source projects.

References:

  1. https://github.com/dqbd/tiktokenizer/blob/master/src/models/tokenizer.ts#L82

  2. https://github.com/dqbd/tiktokenizer/blob/master/src/models/tokenizer.ts#L26
