Prompting Fundamentals and How to Apply them Effectively


Hey friends,

I've been helping teams with their prompts lately and was sad to see how they didn't have a good understanding of the basics, even as they reached for advanced techniques and complicated prompting tools. This spurred me to write this piece on the fundamentals of prompting. By mastering these, we should get 80 - 90% of they way to the optimal prompt.

Aside: My friend Hamel Husain is organizing an LLM Conference + Finetuning Workshop:

  • 11 talks by world-class practitioners like Jeremy Howard, Simon Willison, etc
  • 4 workshops on productionizing LLMs, including evals, finetuning, and serving
  • $3,250 in credits from Modal, Replicate, HuggingFace, LangSmith, OpenAI, etc
  • Access to student-only Discord with instructors and guest speakers
  • Access to all material and recordings, including past talks and workshops

Signups close Wed 29 May 2359 PST (affiliate link)

I appreciate you receiving this, but if you want to stop, simply unsubscribe.


👉 Read in browser for best experience (web version has extras & images) 👈

Writing good prompts is the most straightforward way to get value out of large language models (LLMs). However, it’s important to understand the fundamentals even as we apply advanced techniques and prompt optimization tools. For example, there’s more to Chain-of-Thought (CoT) than simply adding “think step by step”. Here, we’ll discuss some prompting fundamentals to help you get the most out of LLMs.

Aside: We should know by now that, before doing any major prompt engineering, we need reliable evals. Without evals, how would we measure improvements or regressions? Here’s my usual workflow: (i) manually label ~100 eval examples, (ii) write initial prompt, (iii) run eval, and iterate on prompt and evals, (iv) eval on held-out test set before deployment. Here are write-ups on practical evals for key tasks and how to build evals with a case study.

We’ll use the Claude Messages API for the prompt and code examples below. The prompts are deliberately kept simple and can be further optimized. The API provides specific roles for the user and assistant, as well as a system prompt.

Mental model: Prompts as conditioning

At the risk of oversimplifying, LLMs are essentially sophisticated probabilistic models. Given an input, they generate probable outputs based on patterns learned from data.

Thus, at its core, prompt engineering is about conditioning the probabilistic model to generate our desired output. Thus, each additional instruction or piece of context can be viewed as conditioning that steers the model’s generation in a particular direction. This mental model also applies to image generation too.

Consider the prompts below. The first will likely generate a response about Apple the tech company. The second will describe the fruit. And the third will explain the idiom.

By simply adding a few tokens, we have conditioned the model to respond differently. By extension, prompt engineering techniques like n-shot prompting, structured input and output, CoT, etc. are simply more sophisticated ways of conditioning the LLM.

Assign roles and responsibilities

One way to condition the model’s output is to assign it a specific role or responsibility. This provides it with context that steers its responses in terms of content, tone, style, etc.

Consider the prompts below: Because the assigned roles vary, we can expect very different responses. The preschool teacher will likely respond with simple language and analogies while the NLP professor may dive into the technical details of attention mechanisms.

Roles and responsibilities can also improve accuracy on most tasks. Imagine we’re building a system to exclude NSFW image generation prompts. While a basic prompt like prompt 1 might work, we can improve the model’s accuracy by providing it with a role (prompt 2) or responsibility (prompt 3). The additional context in prompts 2 and 3 encourages the LLM to scrutinize the input more carefully, thus increasing recall on more subtle issues.

Structured input and output

Structured input helps the LLM better understand the task and input, thus improving the quality of output. Structured output makes it easier to parse responses, thus simplifying integration with downstream systems. For Claude, XML tags work particularly well while other LLMs may prefer Markdown, JSON, etc.

In this example, we ask Claude to extract attributes from a product <description>.

Claude can reliably follow these explicit instructions and almost always generates output in the requested format.

We can scale this to process multiple documents at once. Here’s an example where we provide product reviews as an array of dicts which we then convert to XML input. (While the example only shows three documents, we can increase the input to dozens, if not hundreds of documents).

This gives us the following <reviews> XML.

We can then prompt Claude to provide a <summary> of the <reviews>, with references to the relevant <id> tags, which gives us the following output.

We can also prompt it to extract the <aspect>, <sentiment>, and corresponding review <id>, leading to the following:

Overall, while XML tags make take a bit to get used to, it allows us to provide explicit instructions and fine-grained control over structured input and output.

Prefill Claude’s responses

Prefilling an LLM’s response is akin to “putting words in its mouth”. For Claude, this guarantees that the generated text will start with the provided tokens (at least in my experience across millions of requests).

Here’s how we would do this via Claude’s Messages API, where we prefill the assistant’s response with <response><name>. This ensures that Claude will start with these exact tokens, and also make it easier to parse the <response> downstream.

n-shot prompting

Perhaps the single most effective technique for conditioning an LLM’s response is n-shot prompting. The idea is to provide the LLM with n examples that demonstrate the task and desired output. This steers the model towards the distribution of the n-shot examples and usually leads to improvements in output quality and consistency.

But n-shot prompting is a double-edged sword. If we provide too few examples, say three to five, we risk “overfitting” the model (via in-context learning) to those examples. As a result, if the input differs from the narrow set of examples, output quality could degrade.

I typically have at least a dozen samples or more. Most academic evals use 32-shot or 64-shot prompts. (This is also why I tend not to call this technique few-shot prompting because “few” can be misleading on what it takes to get reliable performance.)

We also want to ensure that our n-shots are representative of expected production inputs. If we’re building a system to extract aspects and sentiments from product reviews, we’ll want to include examples from multiple categories such as electronics, fashion, groceries, media, etc. Also, take care to match the distribution of examples to production data. If 80% of production aspects are positive, the n-shot prompt should reflect that too.

That said, the number of examples needed will vary based on the complexity of the task. For simpler goals such as enforcing output format/structure or response tone, as few as five examples may suffice. In such instances, we may only need to provide the desired output as examples rather than the usual input-output pairs.

Diving deeper into Chain-of-Thought

The basic idea of CoT is to give the LLM “space to think” before generating its final output. The intermediate reasoning allows the model to break down the problem and condition its own response, often leading to better results, especially if the task is complex.

The standard approach is to simply add the phrase “think step by step”.

However, we can do more to improve the effectiveness of CoT.

One idea is to contain the CoT within a designated <sketchpad>, and then generate the <summary> based on the sketchpad. This makes it easier to parse the final output and exclude the CoT if needed. To ensure we start with the sketchpad, we can prefill Claude’s response with the opening <sketchpad> tag.

Another way to improve CoT is to provide more specific instructions for the reasoning process. For example:

By guiding the model to look for specific information and verify its intermediate outputs against the source document, we can significantly improve factual consistency (i.e., reduce hallucination). In some cases, we’ve observed that adding a sentence or two to the CoT prompt removed the majority of hallucinations.

Split catch-all prompts into multiple smaller ones

We can sometimes improve performance by refactoring a large, catch-all prompt into several, single-purpose prompts (akin to having small, single responsibility functions). This helps the model focus on only one task at each step, increasing performance at each step and thus, final output quality. While this will increase total input token count, the overall cost need not be higher if we use smaller models for some simpler steps.

Here’s how we might split our meeting transcript summarizer above into multiple prompts. First, we’ll use Haiku to extract the decisions, action items, and owners.

Then, we can verify that the extracted items are consistent with the transcript via Sonnet.

Finally, we can use Haiku to format the extracted information.

As an example, AlphaCodium shared that by switching from a single direct prompt to a multi-step workflow, they increased gpt-4 accuracy (pass@5) on CodeContests from 19% to 44%. Their coding workflow had multiple steps/prompts including:

  • Reflecting on the problem
  • Reasoning on the public tests
  • Generating possible solutions
  • Ranking possible solutions
  • Generating synthetic tests
  • Iterating on the solution with public and synthetic tests

Optimal placement context

I’m often asked where to put the document or context within the prompt. For Claude, I’ve found that putting the context near the beginning tends to work best, with a structure like:

  • Role or responsibility (usually brief)
  • Context/document
  • Specific instructions
  • Prefilled response

This aligns with the role-context-task pattern used in many of Anthropic’s own examples.

Nonetheless, the optimal placement may vary across different models depending on how they were trained. If you have reliable evals, it’s worth experimenting with different context locations and measuring the impact on performance.

Crafting effective instructions

Short, focused sentences separated by new lines tends to work best. I haven’t found other formats like paragraphs, bullet points, or numbered lists to work as well. Nonetheless, the meta on writing instructions is constantly evolving so it’s good to keep an eye on the latest system prompts. Here’s Claude 3’s system prompt; and here’s ChatGPT’s.

Also, it’s natural to add more and more instructions to our prompts to better handle edge cases and eke out more performance. But just like software, prompts can get bloated over time. Before we know it, our once-simple prompt has grown into a hundred lines. To add injury to insult, the Frankenstein-ed prompt actually performs worse on more common and straightforward inputs! Thus, periodically refactor prompts—just like software—and prune instructions that are no longer needed.

Dealing with hallucinations

This is a tricky one. While some techniques help with hallucinations, none are foolproof. Thus, do not assume that applying these will completely eliminate hallucinations.

For tasks involving extraction or question answering, include an instruction that allows the LLM to say “I don’t know” or “Not applicable”. Additionally, try instructing the model to only provide an answer if it’s highly confident. Here’s an example:

For tasks that involve more reasoning, CoT can help reduce hallucinations. By providing a <sketchpad> for the model to think and check its intermediate output before providing the final answer, we can improve the factual grounding of the output. The previous example of summarizing meeting transcripts (reproduced below) is a good example.

Using the stop sequence

The stop sequence parameter allows us to specify words or phrases that signal the end of the desired output. This prevents trailing text, reduces latency, and makes the model’s responses easier to parse. When working with Claude, the convenient option is to use the closing XML tag (e.g., </response>) as the stop sequence.

Selecting a temperature

The temperature parameter controls the “creativity” of a model’s output. It ranges from 0.0 to 1.0, with higher values resulting in more diverse and unpredictable responses while lower values produce more focused and deterministic outputs. (Confusingly, OpenAI APIs allow temperature values as high as 2.0, but this is not the norm.)

My rule of thumb is to start with a temperature of 0.8 and then lower it as necessary. What we want is the highest temperature that still leads to good results for the specific task.

Another heuristic is to use lower temperatures (closer to 0) for analytical or multiple-choice tasks, and higher temperatures (closer to 1) for creative or open-ended tasks. Nonetheless, I’ve found that too low a temperature reduces the model’s intelligence (thus my preferred approach of starting from 0.8 and lowering it only if necessary).

What doesn’t seem to matter

There are a few things that, based on my experience and discussions with others, don’t have a practical impact on performance (at least for recent models):

  • Courtesy: Adding phrases like “please” and “thank you” doesn’t affect output quality much, even if it might earn us some goodwill with our future AI overlords.
  • Tips and threats: Recent models are generally good at following instructions without the need to offer a “$200 tip” or threatening that we will “lose our job”.

Of course, it doesn’t hurt to be polite or playful in our prompts. Nonetheless, it’s useful to know that they’re not as critical for getting good results.

• • •

As LLMs continue to improve, prompt engineering will remain a valuable skill for getting the most out of LLMs (though we may soon transition to “dictionary learning”). What other prompting techniques have you found useful? Please comment below or reach out!

Eugene Yan

I build ML, RecSys, and LLM systems that serve customers at scale, and write about what I learn along the way. Join 7,500+ subscribers!

Read more from Eugene Yan

Hey friends, I've been thinking and experimenting a lot with how to apply, evaluate, and operate LLM-evaluators and have gone down the rabbit hole on papers and results. Here's a writeup on what I've learned, as well as my intuition on it. It's a very long piece (49 min read) and so I'm only sending you the intro section. It'll be easier to read the full thing on my site. I appreciate you receiving this, but if you want to stop, simply unsubscribe. 👉 Read in browser for best experience (web...

Hey friends, Just got back from the AI Engineer World's Fair and it was a blast! I had the opportunity to give the closing keynote, as well as host GitHub CEO Thomas Dohmke for a fireside chat. Along the same lines, I've been thinking about how to interview for ML/AI engineers and scientists, and got together with Jason to write about the technical and non-technical skills to look for, how to phone screen, run interview loops, and debrief, and some tips for interviewers and hiring managers....

Hey friends, Recently a couple of friends and I got together to write about some challenges and hard-won lessons from a year of building with LLMs. One thing led to another and this is now published on O'Reilly in three sections: Tactics: Prompting, RAG, workflows, caching, when to finetune, evals, guardrails Ops: Looking at data, working with models, product and risk, building a team Strategy: "No GPUs before PMF", "the system not the model", how to iterate, cost We have a dedicated site...