MiniGPT: Building a functional ChatGPT clone
In as few lines of code as possible, using Llama 2, Cloudflare Workers AI, and our beloved deployment tool Vercel.
This is the first post in a series that goes in depth on building MiniGPT.
The other day I read a blogpost from Cloudflare titled “Writing poems using Llama 2 on Workers AI”. It shows how, using Workers AI (Cloudflare’s serverless functions), you can very easily spin up endpoints that interact with LLMs (large language models). In roughly 9 lines of JavaScript, you can send a prompt to an AI model.
The magic happens here:
import { Ai } from "@cloudflare/ai";
...
const response = await ai.run("@cf/meta/llama-2-7b-chat-int8", body);
All the complexity is abstracted away in the “ai” package.
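For context, the full worker from that blogpost looks roughly like this. This is my reconstruction rather than Cloudflare’s exact code, and the hard-coded prompt is just an illustration:

import { Ai } from "@cloudflare/ai";

export default {
  async fetch(request, env) {
    // env.AI is the Workers AI binding you enable for the worker
    const ai = new Ai(env.AI);
    const response = await ai.run("@cf/meta/llama-2-7b-chat-int8", {
      prompt: "Write me a poem about the edge", // illustrative prompt
    });
    return Response.json(response);
  },
};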
This got me thinking, how hard would it be to make a little ChatGPT application that uses Llama?
A weekend side project
I like to use Notion to track my little side projects. The only way to complete a side project is by tackling the hardest part first. It’s easy to get distracted by the things you’re already good at. If you can get over the initial challenging part, it’ll give you enough velocity to get to the finish line. In this case, I was relatively unfamiliar with Cloudflare workers and interacting with Llama, so that’s where I decided to start.
Using serverless AI functions on the edge
So many techy buzzwords! Let me explain:
We’re gonna use the @cloudflare/ai package to run our AI function. These AI functions (called workers) are distributed to the Cloudflare edge network. It’s impressive how quickly the function was available on prod after I hit save.
⚠️ When launching a new worker AI function, Cloudflare makes it quite clear that this is a beta feature not intended for production traffic. We’re gonna scroll past that warning so fast, it’s like it was never even there.
After going through the Cloudflare worker creation wizard, you have the ability to “Quick edit” the code right in the browser. No need to clone any git repos. The Cloudflare worker editor has a built-in code editor (powered by Monaco, which is basically VS Code for the web) and a mini “Postman” area where you can make web requests to your function. Super easy to use.
Here’s the code for the function that uses Meta’s llama-2-7b-chat-int8 model.
import { Ai } from './vendor/@cloudflare/ai.js'
import { handleRequest } from './cors.js';

async function miniLLM(request, env) {
  // the POST body is the conversation so far
  const body = await request.json();
  const messages = body.messages;

  const ai = new Ai(env.AI);
  const response = await ai.run("@cf/meta/llama-2-7b-chat-int8", {
    messages
  });

  return response;
}

export default {
  async fetch(request, env) {
    // cors.js handles preflight requests and wraps the JSON response
    return handleRequest(request, env, miniLLM);
  },
};
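The object ai.run resolves to is tiny; for text generation models it’s essentially just the generated text (worth double-checking against the beta docs, since this is from memory):

{ "response": "Climate change is not lit, ..." }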
The shape of the incoming POST body is a conversation between the user and the system:
[
  {
    "role": "user",
    "content": "What is climate change, but tell me in a funny way?"
  },
  { "role": "system", "content": "Climate change is not lit, ..." }
]
This is how most LLMs represent their back-and-forth conversations with users.
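If it helps, here’s that shape pinned down in TypeScript (these type names are mine, not from the MiniGPT code):

export type Role = "user" | "system";

export interface ChatMessage {
  role: Role;
  content: string;
}

// the POST body the worker expects: the whole conversation so far
export interface PromptBody {
  messages: ChatMessage[];
}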
Even though I was able to test out my function in the Cloudflare editor, I was unable to make fetch() requests to it from my local frontend. It was a classic CORS issue. Here’s the code that lets you whitelist specific domains for your worker function:
const corsHeaders = {
  'Access-Control-Allow-Headers': '*',
  'Access-Control-Allow-Methods': 'POST',
  'Access-Control-Allow-Origin': 'https://gpt.shahzeb.co',
};

export async function handleRequest(request, env, cb) {
  if (request.method === "OPTIONS") {
    // CORS preflight: reply with the headers so the browser allows the POST
    return new Response("OK", {
      headers: corsHeaders
    });
  } else if (request.method === 'POST') {
    const res = await cb(request, env);
    return new Response(JSON.stringify(res), {
      headers: {
        'Content-type': 'application/json',
        ...corsHeaders
      }
    });
  } else {
    return new Response("Method not allowed", {
      status: 405,
      headers: corsHeaders
    });
  }
}
Not including the CORS helper function, we can now make POST requests to an endpoint that is globally distributed. All in about 18 lines of code.
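One caveat worth flagging: Access-Control-Allow-Origin only accepts a single origin (or *), so the helper above whitelists exactly one domain. If you also want to allow a local dev origin, a common pattern (a sketch, not what’s in my worker; the localhost port is hypothetical) is to echo back the request’s origin when it’s on an allowlist:

const allowedOrigins = [
  "https://gpt.shahzeb.co",
  "http://localhost:3000", // hypothetical local dev origin
];

function corsHeadersFor(request) {
  const origin = request.headers.get("Origin");
  return {
    "Access-Control-Allow-Headers": "*",
    "Access-Control-Allow-Methods": "POST",
    // echo the origin back only if it's allowed; otherwise fall back to prod
    "Access-Control-Allow-Origin": allowedOrigins.includes(origin)
      ? origin
      : allowedOrigins[0],
  };
}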
Prompt Engineering 101
I’ve been trying to burn down my Letterboxd watchlist. One movie that’s been on my list forever is the 2005 movie Pride & Prejudice starring Tom Wambsgans from the hit HBO drama series Succession. It’s a solid movie with some really pretty writing (shout out Jane Austen). This got me thinking: can I force the LLM to respond to me as a specific author?
Welcome to a quick prompt engineering crash course. Before we make a request to our AI function, we are going to modify the message we send to the backend with some additional text.
On the UI, when a user selects Jane Austen as the desired voice, we change the prompt before making the request. If the user prompts “What is the best drink?”, that becomes: “In the style of Jane Austen, answer the following prompt: What is the best drink?” It’s really that simple. Functionally speaking, typing that pre-prompt into the prompt textarea yourself and selecting Jane Austen from the dropdown will get you the same result.
Here’s a snippet from the frontend of how we convert the prompt voices:
export type voices = ("none" | "brevity") | authors;

export type authors =
  | "Jane Austen"
  | "Ernest Hemingway"
  | "John Steinbeck"
  | "Mark Twain";

export const PROMPT_VOICE = (prompt: string, voice: voices): string => {
  switch (voice) {
    case "brevity":
      return `Within 4 sentences or less, answer the following prompt: ${prompt}`;
    case "Jane Austen":
      return `In the style of Jane Austen, answer the following prompt: ${prompt}`;
    // ...
    default:
      return prompt;
  }
};
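Calling it with the earlier example:

PROMPT_VOICE("What is the best drink?", "Jane Austen");
// => "In the style of Jane Austen, answer the following prompt: What is the best drink?"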
I added the “Brevity” option as a voice because a shorter response means the AI runs faster and is less likely to get cut off mid-sentence.
In our client, where we make the request to our backend, we use this PROMPT_VOICE function to reshape the prompt in our userInput:
// Response here is our own message type (role + content), not the DOM Response
const userInput: Response = {
  content: PROMPT_VOICE(prompt, promptVoice),
  role: "user",
};

const req = [...prevResponses, userInput];
const res = await promptRequest({ messages: req }); // does the fetch()
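promptRequest itself is just a thin wrapper around fetch(). It’s not shown here, but a minimal version would look something like this (the worker URL is a placeholder, swap in your own):

// placeholder URL; point this at your own worker
const WORKER_URL = "https://minigpt.example.workers.dev";

async function promptRequest(body: {
  messages: { role: string; content: string }[];
}) {
  const res = await fetch(WORKER_URL, {
    method: "POST",
    headers: { "Content-type": "application/json" },
    body: JSON.stringify(body),
  });
  return res.json();
}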
At this point we have a functional deployed backend we can have a conversation with.
I hate to leave you on a cliffhanger…
…but this post is getting very long. Substack is yelling at me that it will not fit in an email. The whole time I’m thinking, who reads email?
Hopefully you do, because if you subscribe (it’s free, just put in your email below) I’ll be sharing two new posts in this series:
- The frontend for MiniGPT
  - Tools used to build it
  - Deploy it
  - How to debug the most pain-in-the-ass mobile responsive UI bug with a CSS hack
  - ‼️ Link to the code ‼️
- The best side project database I’ve found
  - I’m using it to track the total number of prompts so far on the homepage
  - Avoid AWS at all costs
  - Spend like $5 on it
As always, thanks.