Okay, let's untack this. What if you could take the raw power behind tools like chat GPT and mold it to your exact needs. We're talking about going beyond just you know, chatting with an AI to actually building intelligent applications and specialized assistants.
Exactly Today, our deep dive is all about the open Ai API, focusing on how it empowers well anyone really to create custom AI solutions. Our main source for this is Henry Habib's Open Ai API Cookbook, which Packed Publishing put out in March twenty twenty four.
And Henry Habib he knows his stuff over a decade and AI and productivity right, and he's a big believer in this citizen developer idea that you don't need to be like a hardcore coder to build amazing things. He's also the guy behind the Intelligent Worker newsletter.
That's right, and Sam Mackay, the CEO of Enterprise DNA. He actually calls the book an essential guide for knowledge workers eager to harness the power of open AI and chat GPT to build intelligent applications and solutions. High praise.
So your mission for this, a deep Dive listener, is simple, get a shortcut really understand how to use the open AIAPI. We're focusing on the practical stuff, those real aha moments. You've heard of AI chat GPT. They're everywhere, constantly talked about. But what's cool here is how actionable they are. We're going to show you how to turn ideas into reality. Let's start with the basics. Why the API matters. It's more than just the chat box you see online. I mean,
chat GPT's growth was just insane, wasn't it. One hundred million users in two months. That's faster than TikTok, which took what nine months. It really brought natural language processing NLPE to the masses.
Absolutely, and the API it takes that democratization way further. It's a genuine paradigm shift. It means anyone can generate really human like text from simple prompts. You don't need a PhD in machine learning anymore. It's not just for the big players like Typeface or Jasper Ai building on top of it. It's for you integrating that power into your own stuff.
And the open II Playground is kind of the perfect place to start messing around. Yeah, like a sandbox. It's got three main parts the system message, the chat log, and then the parameters. The system message is where you tell the AI who it should be, like you are an assistant that creates marketing slogans, simple as that shapes its whole persona right.
And it's fascinating because the model isn't understanding like we do no thoughts, no feelings. Think of it more like super advanced autocomplete. It predicts the next word based on patterns from tons of text data. So you put examples in the chat log, say you give it company makes ice cream, and then that apply sham the ice cream that never melts. You're guiding those predictions. You're kind of training it right there to follow patterns like starting with sham and ending with an exclamation.
Mark that makes sense, guiding the probabilities. Okay, so once you've got your prompts working well in the playground, you move on to making real API requests, maybe using something like postmam. And this is where it gets really powerful because you're not just watching it work, you're controlling it with code programmatically. And for an API request, there are
like four main things you need. Right First is the endpoint that's URL, the address you're sending the request to like https, dot API, dot openI, dot com, forward slash v one chat completions exactly.
Then there's the header. Think of this as containing important metadata. It tells open Ai what you're sending, usually content type dot application JSON because Jason is just a standard way for systems to swap structured data. And critically, it says who you are with your authorization bearer, your API key. That's your secret handshake with open Ai.
Okay, So header is like the envelope details, and the body is what's inside the envelope.
Correct. A body is a Jason object. It holds the specifics like which model you want to use, and the messages that's your system message and chat log content. And finally you get the response back from open Ai. That's also Jason containing the AI's output. It's choices and usage data like how many tokens you used?
Cool? But okay, let's break out of just text. The open Ai API can do more than just words, can't it? Multimodal stuff? Oh?
Absolutely, Beyond text. You've got image generation with Dally. The newer versions Dally two and three use this technique called diffusion. You can kind of picture it like starting with TV static and slowly clearing it up until an image appears. It's pretty neat. But the key with images, unlike text maybe, is you have to be super specific in your prompts.
Just saying a dog gets you, well, a random dog, but a brown, furry, medium sized CORKI doog on a green grass field profile view that gets you much closer to what you actually want. It raises an interesting point. Text generation can infer context sometimes, but image generation it needs precise descriptive language. Ambiguity is your enemy here.
Good point need to be crystal clear. And it does audio too. Transcripts.
Yeah, the audio endpoint uses the Whisper model for that. It transcribes audio files.
Ah and technically for file uploads you need to use form data instead of JSON in.
The request right exactly. Jason is great for text data, but form data is built for sending files, kind of like attaching something to an email. It handles lots of formats dot MP three, dot MP four, dot MPEG, dotwave, dot web, dot WebM quite a few.
So you could transcribe a meeting maybe easily.
And the real magic starts when you chain these things together. Imagine a voice assistant. Voice comes in whisper transcribes it, chat Api figures out a response, maybe Dali even generates relevant image.
Okay, that's starting to sound really powerful. Now, let's talk about fine tuning the dials and knobs as you called them in the book.
The parameters, right, The parameters let you control the AI's behavior, and the model parameter is probably the biggest one. Usually you're choosing between GPT three point five and GPT four. GPT three point five has what one hundred and seventy five billion parameters. GPT four is estimated to be way larger, maybe over one hundred trillion parameters across a bunch of models working together. More parameters generally means the model is
better at capturing subtle patterns and understanding complex instructions. So GPT four tends to be more reliable, better with nuance. It actually scores higher on things like standardized tests, EP calculus.
The lsat Wow, and you can see that difference in the outputs. Can't you like that example in the book asking for a sentence about Mars with six five letter words, GPT three point five up the word count right, It gives our Mars strip felt vast, new, cold.
Hard, grand grand isn't five letters.
Exactly If GPT four gets it, Mars Red World, Brave Crew, Deep Space finds life. Perfect for the cigarette question how many chemicals? How many harmful? How many cause cancer? Just the numbers. GPT three point five gives you a paragraph. GPT four just answers two hundred and fifty sixty concise even logic puzzles. GPT four tends to reason more accurately than three point five, and GPT four has a bigger memory too. The context win much bigger.
Like GPT four thirty two k can handle around thirty two thousand tokens maybe twenty four thousand words. GPT three point five max is out around four thousand tokens about three thousand words. Big difference if you're feeding it long documents.
Okay, but there's a catch, isn't there cost?
Huge catch? GPT four can be twenty to forty times more expensive per token than GPT three point five. It's significant. So the practical advice for you is always start with GPT three point five. If it does the job great, you save a lot of money. Only upgraded GPT four if you absolutely need that extra reasoning power or the larger context window.
That's a massive cost difference. Why is it so much more just the size.
Primarily, Yeah, it's a much bigger, more complex model. Just takes way more computing power to run each request. Think supercomputer versus calculator.
Gotcha? Okay, another parameter dot N that controls how many answers you get back right.
N sets the number of responses can be any whole number for chat, but max ten for images. Super useful for brainstormings, logans, getting different options, or for checking consistency, maybe ab testing outputs.
And the interesting thing you mentioned is the cost isn't linear like N three isn't three times the price?
No, it's often much less, maybe sixty percent more, not two hundred percent more, which tells you something cool. The AI isn't just running the request three times separately. It's likely batching the computation somehow finding efficiencies. It's an optimization hint clever.
Okay, what about temperature? That one sounds a bit abstract. Controls creativity.
Yeah, temperature basically controls the randomness or let's say, creativity of the output. It goes from point zero to two point zero th of it, like tuning a radio. Low temperature maybe twoint zero too point eight is like a sharp, clear signal, very focused, consistent factual responses. Good for things like code generation data analysis where you want deterministic.
Output and higher temperature more static, more like an eclectic mix station.
Yeah, higher temps, say one point two to two point zero, make the AI take more risks with word choices. It flattens the probability curve for the next word, so you get more diverse, unexpected sometimes more creative results. Great for brainstorming, writing stories, generating slogans.
So for general use like a chatbot, maybe somewhere in the middle point eight to one point two exactly.
Balance is making sense with being interesting.
So the advice is start around one point zero and tweak it by like zero point two increments.
That's a good practical approach. Yeah, see what works for your specific need.
Okay, makes sense. Now let's shift gears to building real applications. Usually you don't just have your app talk directly to open AI, right, there's often a back end layer in between.
That's right. The typical flow is from tend what the user sees talks to your back end, and your back end talks to the open AIAPI. This back end layer is crucial. First security, it keeps your precious API key safe hidden from the user's browser. Second control, you can process the input before it goes to open AI or clean up the output after it comes back. Plus it lets you integrate other services, hand logins, all that stuff.
And for that back end, serverless options like Google Cloud functions are pretty popular.
Very popular, yeah, because you don't have to manage servers. It just scales automatically. You write your code, upload it, and Google handles the rest. You set up an HTTP trigger so it could be called like a web address. Allow unauthenticated calls maybe for testing, but be careful in production and define your entry point function.
And then for the front end the user interface. You can use no code tools like Bubble, so anyone can build the app part exactly.
Bubble lets you visually design your web app and connect buttons and inputs directly to your back end cloud function. It's incredibly empowering.
Let's walk through an example, like that email reply wrapper from the book. You could do it in chat GPT, sure, but building it yourself really teaches you the whole process. So you start in the playground testing proms, get the Python code, then you put that logic into a Google Cloud function that's your back end. It takes the email
text as input, adds your API key. Secretly, you'd tell it to use say GPT four, maybe a higher temperature like one point four for creator replies, set N three to get three options, maybe limit topens to five to one.
Right, and then you'd use Postman to test that cloud function directly, make sure it actually returns three email replies in the format you expect. Once that's working, you jump into Bubble. You build the input box for the original email, a button to generate replies, and maybe three textboxes to display choice one, choice two, choice three. Use bubbles API connector to link the button press to your cloud function URL and display the return choices. And really understanding this
whole playground, Cloud function, Postman, Bubble. That's the fundamental pattern. Master this and you can pretty much any intelligent app.
That's a great point. It's the core loop. What's a common sticking point when people first try this? Getting the data flow right often.
Yeah, getting the JSON right in the requests and responses, making sure API keys are correct and secure, little syntax things. Postwind really helps debug that before you even touch the frontend.
Okay, so that's a solid foundation. But let's get to something really cool, something you can't just do in the standard chat GPT interface easily. The multimodal travel itinerary app. That sounds awesome.
It really shows the power of orchestrating multiple API calls. The idea user toxicity gets back a detailed one day plan morning, afternoon, evening activities and three AI generated images matching those activities.
Wow, okay, how does that work behind the scenes in the cloud function.
So first, because this involves multiple calls, including image generation, which can be slow, you need to increase the cloud function's timeout limit maybe to three hundred seconds five minutes, just to be safe.
Good practical tip.
Then one uber one uses the chat api GPT four. Specifically, it takes the city name. Crucially, you give it a detailed chat log with examples what the book calls fu shot prompting. You showed examples for Rome, Lisbon, et cetera. Format it exactly how you want warning activity, afternoon activity, evening activity. This force is GPT four to follow that structure precisely. It stores the resulting itinerary text.
Got it. So the structure comes from good prompting and examples. How do the images get generated?
That's call number two, also chat API, but this time using GPT three point five Turbo one one oh six. Its only job is to take the itinerary text from call one and create three short descriptive prompts suitable for DELI. Like if the itinerary mentioned the Colisseum, Vatican and Trevy Fountain, it might output Colisseum and Rome, Vatican City Interior, Trevy Fountain at night. Just the prompts separated by a pipe symbol.
Ah. And you use GPT three point five here because it's cheaper and the task is simple. It doesn't need GPT four's nuance exactly.
The user never sees this intermediate p output, only the final images, so three point five is perfectly adequate and much more cost effective for this specific step.
Smart resource use nice optimization. Okay, So now you have the itinerary text and three image prompts.
Right, So call number three hits the images API using DELI THII. Your code loops through the three prompts from call too, making a separate API call for each one to generated image. It collects the URLs of the three generated images image rolls Finally, the cloud function bundles everything up and returns a single Jason response containing the itinerary text and the URLs for morning image, afternoon image, and evening image.
And then in bubble you just connect those pieces input for city button, a big text area for the itinerary, and three image elements. You map the JSON fields from the cloud function response directly to those elements. That's really slick, combining text and custom images on the fly like that.
Very cool. Okay, let's switch tracks slightly. Building knowledge assistance this is huge. Standard chat GPT is great, but its knowledge is kind of frozen in time right, and it can sometimes just make stuff up hallocin. You can't easily to only use this specific document precisely.
That's where building your own assistant comes in, using the API combined with your specific trusted knowledge source. A basic way to do this covered in the book is PDF analysis. Your app takes a PDF link and a question. The cloud function fetches the pdf, uses a library like pipdf two to scrabe all the text out of it. Then it stuffs that entire text into the prompt along with the user's question, and sends it off to GPT four so it just crams the.
Whole PDF into the context window every single time. Yeah, coefficient, it can be. It works, but yeah, limitations. It only gets text, no images from the pdf. It struggles with really huge documents, and the biggest issue is that context window limit. If your PDF has more words then the model can handle like those three thousand words for GPP three point five or twenty four thousand for GPT four thirty two. K. It just won't work properly, right.
But there's a better way now, isn't there with the newer assistance API.
Oh yeah, the assistants APIs specifically with its built in knowledge retrieval tool, is a total.
Game changer for this What makes it so different.
It's incredibly smart. When you upload your documents, PDFs, word docs, etc. To an assistant with retrieval enabled, open AI automatically handles the hard parts. It breaks the documents into manageable chunks, creates embeddings for each chunk, those unique numerical fingerprints we talked about, and stores them efficiently. Then when you ask a question, it uses vector search to instantly find only the most relevant chunks of texts from your documents related to your question.
So It doesn't read the whole document every time, It just finds the relevant paragraphs.
Exactly, which means there's effectively no context window limit for your knowledge base. You can upload massive files or hundreds of documents and the assistant intelligently retrieves only the necessary snippets to answer the question. Incredibly efficient.
That sounds amazing. How do you set that up? Still?
Start in the playground, yep, The playground is great for creating the assistant itself. You give it a name US Constitution Expert Instructions answer questions based only on the provided constitution document. Choose a model like GPT four to eleven oh six Preview, which is good for this. Then the crucial step you toggle on the retrieval tool and then you upload your knowledge file like a PDF of the US Constitution. Once it's created, you grab the unique assistant ID.
Okay, assistant created, knowledge uploaded. Then the cloud function code uses this assistant ID.
Correct. The Python code for your cloud function becomes a bit different using the assistants API. First, you create a thread. Think of thread as a single conversation session. Then you add the user's question as a message to that thread. Next, you tell the assistant to run on that thread, providing the assistant ID and the thread ID. Now here's a key detail for the book's code. You need to wait
a bit. The assistant needs time to process, search the knowledge and formulate the answer, so you might add a time dot sleep or similar pause. After the pause, you retrieve the list of messages from the thread and the assistem's answer will be the newest message.
Okay, that pause is important. And the bubble front end for this probably simpler.
Much simpler for this use case. Yeah, just an input boxer the user's question, a button and a text box to display the answer returned by the cloud function.
And the result is you can ask specific questions like how many senators are there or what's the age requirement for a senator and it pulls the answer directly from that constitution pdf you uploaded exactly.
It grounds the AI in your specific source material. It's incredibly powerful for legal teams, medical info, company knowledge bases, educational tools, anywhere you need reliable answers from a defined set of information.
Wow, we've covered a lot, from just understanding the API basics to playing in the playground, making direct calls, adding images and audio. Then building actual apps with back ends and frontends, optimizing costs, and finally creating these powerful knowledgeable assistance tied to specific documents. You've really gone from just using chat GPT to understand how to build with its underlying power. You're equipped now to actually create things.
Yeah, and it brings to mind something Paul Siegel, a tech entrepreneur, wrote in the forward to Henry's book, You said, Essentially, I strongly encourage you to use this knowledge to create your next successful app or business, or simply to enrich your thinking about how to innovate. Dream on it, then fashion your dreams into a reality with the tools you've gained here. I think that sums it up nicely.
Great final thoughts, So the message is clear, don't just use AI, build with it, Go experiment, see what you can create.
