How to Make AI Actually Do What You Want
- Eric Lott
- Jul 1
- 5 min read

We’ve all seen what happens when a system optimizes the wrong thing.
In 2018, YouTube’s recommendation algorithm pushed viewers into conspiracy rabbit holes, all in the name of increasing watch time. The system didn’t fail; it did exactly what it was told. The failure was in the incentive.
This is reward hacking: a system optimizes the literal instructions, often through shortcuts, rather than the intended goal.
It’s not a bug. It’s an unintended consequence of poorly defined objectives. And with LLMs and AI agents, the risks scale faster than ever. Let’s explore how to craft prompts and build systems that don’t just get results, but get the right results.
Historical Cases of Reward Hacking
🐍 The Cobra Effect
The British colonial government in India offered bounties for every dead cobra. Locals responded by breeding them. When the program ended, they released the cobras. The population exploded.
⏳ Watch Time Optimization
Incentivizing "time on site" led platforms like YouTube to prioritize extreme or addictive content. The algorithm wasn’t broken. It was just obeying orders.
🏁 Racing Game AI
One AI figured out it could rack up points by spinning in circles near checkpoints instead of finishing the race. Technically successful. Totally misaligned.
Takeaway: Misaligned incentives don’t just create weird results. They create highly optimized weird results.
Alignment and Why It Matters in Prompt Engineering
When you write a prompt for an AI, you’re defining its success criteria, explicitly or implicitly.
Every word in your system message is a design decision. You're not just asking the model to complete a task, you're defining what it should value. The closer the output is to that target value, the more aligned the prompt (or the agent's LLM) is.
For example:
Prompt: "Write a listing that will sell this item fast."
Result: Exaggeration, vague claims, urgency without substance.
Compared with:
Prompt: "Write an honest, accurate listing that helps the right buyer understand what they’re getting."
Result: Clarity, relevance, and trust.
One prompt asks for any result; the other defines a successful, aligned result.
Design Principles: Writing Incentive-Aware Prompts
🎯 1. Name the Goal, Not the Mechanism
Your prompt should describe what success looks like, then leave it to the agent or LLM to figure out how to achieve it.
✅ Better:
"Help the buyer trust the description."
"Make the summary easy for a busy manager to scan and act on."
"Give a clear explanation that builds the reader’s confidence in the product."
❌ Worse:
"Use more adjectives."
"Include at least three bullet points."
"Make the tone friendlier by using emojis."
Telling the model what to achieve lets it reason toward the best way to get there. Telling it exactly how tends to produce blind, misaligned optimization.
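As a quick sketch, the contrast can be made concrete in code. Assume a chat-style API that takes a system message; the prompt texts and function name here are illustrative:

```python
# Two ways to frame the same task as a system message.
# GOAL_PROMPT names the outcome; MECHANISM_PROMPT dictates tactics.
GOAL_PROMPT = (
    "You are writing a product listing. Success means the buyer "
    "trusts the description and understands exactly what they are getting."
)

MECHANISM_PROMPT = (
    "You are writing a product listing. Use at least five adjectives, "
    "three bullet points, and two emojis."
)

def build_messages(system_prompt: str, task: str) -> list[dict]:
    """Assemble a chat-style message list from a system prompt and a task."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task},
    ]
```

Swapping one constant for the other changes nothing mechanically, but it changes everything the model is optimizing for.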
📏 2. Define What Good Looks Like
Set clear expectations for tone, structure, and purpose.
✅ Examples:
"Write a 3-paragraph email that is warm but professional, avoids jargon, and ends with a clear next step."
"Summarize this article for a non-technical audience in under 150 words, focusing on impact and implications."
"List pros and cons from this customer review, preserving the original tone."
You can even include "bad" and "good" samples to anchor output.
🛠️ 3. Optimize for Outcomes, Not Metrics
Don’t incentivize superficial metrics like length, positivity, or word count.
✅ Better:
"Make the response useful for someone genuinely trying to make a decision."
"Prioritize clarity, honesty, and helpfulness over persuasion."
"Focus on reducing confusion, not just increasing engagement."
❌ Worse:
"Write 500 words."
"Make it sound very exciting."
"Use SEO keywords throughout, no matter what."
Good content earns metrics. Chasing metrics directly can lead to misleading results.
🕵️ 4. Interrogate the Output
Build prompts that fail well by stress-testing the logic underneath.
Ask yourself:
What shortcut would an AI take if it didn’t care about the spirit of this task?
What’s the minimum viable answer that technically works but fails the goal?
Examples:
Prompt: "Summarize this meeting."
Minimum Viable Answer: "This was a meeting on Tuesday with 5 participants."
Prompt: "Write a brief review for this product."
Minimum Viable Answer: "Great product, highly recommend."
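You can even automate a first pass at catching these. Below is a rough heuristic filter for minimum viable answers; the word threshold and stock-phrase list are illustrative, not a rigorous metric:

```python
# Stock phrases that often signal a lazy, technically-valid answer.
GENERIC_PHRASES = {
    "great product",
    "highly recommend",
    "this was a meeting",
    "good quality",
    "works as expected",
}

def looks_minimum_viable(answer: str, min_words: int = 15) -> bool:
    """Flag answers that are suspiciously short or lean on stock phrases."""
    text = answer.lower()
    too_short = len(text.split()) < min_words
    too_generic = any(phrase in text for phrase in GENERIC_PHRASES)
    return too_short or too_generic
```

A filter like this won't catch subtle misalignment, but it cheaply flags the "Great product, highly recommend" class of shortcut during testing.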
Think Like a White Hat Hacker
One of your most valuable tools isn’t technical, it’s philosophical and psychological.
Approach prompt design like a white hat hacker: someone who tries to break a system in order to make it more resilient. Think of the "AI Learns to Walk" videos, where reinforcement learning models flail their way to the finish line because the goal was "get there" instead of "walk well".
Ask yourself:
Is my prompt asking leading questions?
Example:
❌"Why is this product so popular?" (Injects bias)
✅"What do customer reviews say about this product's strengths and weaknesses?"
Does this prompt reward hallucination or filler?
Example:
❌"Write a 300+ character description of this product" (Incentivizes filler)
✅"Write a meaningful and complete description of this product"
Is the model shortcutting comprehension?
Example:
❌"Summarize this email" (Too vague, leads to inconsistent output)
✅"List all of the key sentences and all of their action items from this email"
Am I over-optimizing for a surface-level metric?
Example:
❌"Make this headline more exciting." (Leads to clickbait)
✅"Rewrite this headline to improve engagement while staying factual and clear."
Pro tip: Feed in messy, contradictory, or incomplete inputs. If your prompt still holds up, you’ve built something robust. If it confidently lies, rethink the design.
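That pro tip can be turned into a small harness. Here's a sketch: `run_prompt` is a stand-in for whatever model call you actually use, and the edge cases and word threshold are just examples:

```python
# Messy, contradictory, and empty inputs to probe a prompt's robustness.
EDGE_CASES = [
    "",  # empty input
    "asdf qwer zxcv",  # gibberish
    "The meeting is Tuesday. No wait, Wednesday. Actually it was cancelled.",
]

def stress_test(system_prompt, run_prompt, min_words=10):
    """Run every edge case through the prompt and return suspicious results.

    run_prompt(system_prompt, user_input) -> response text. Responses
    shorter than min_words are flagged as likely shallow filler.
    """
    failures = []
    for case in EDGE_CASES:
        response = run_prompt(system_prompt, case)
        if len(response.split()) < min_words:
            failures.append((case, response))
    return failures
```

The length check is deliberately crude; the point is having a repeatable loop so you notice when a prompt "confidently lies" on garbage input instead of asking for clarification.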
Use OpenAI’s Playground to Stress-Test
Want to see what your prompt really rewards? Use the OpenAI Playground.
It’s more than a dev tool. It’s a sandbox. Adjust your system message, temperature, and inputs. Push the prompt to failure. Then improve it.
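The same experiment can be scripted with the OpenAI Python SDK so you can sweep temperature settings. A sketch, assuming the `openai` package is installed; the model name is a placeholder, and the live call only runs if an API key is set:

```python
import os

def make_request_params(system_prompt, user_input, temperature):
    """Build the keyword arguments for a chat completion call."""
    return {
        "model": "gpt-4o-mini",  # placeholder; use whichever model you test
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    # Sweep temperature to see how far the prompt's behavior drifts.
    for temp in (0.0, 0.7, 1.2):
        params = make_request_params(
            "You are a support agent who gives accurate, concise answers.",
            "My order never arrived and the tracking page is blank.",
            temp,
        )
        reply = client.chat.completions.create(**params)
        print(temp, reply.choices[0].message.content)
```

Seeing the same prompt at three temperatures side by side is often the fastest way to spot which instructions the model treats as hard constraints and which it treats as suggestions.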
Five Better-Aligned Prompts
| Use Case | Poor Prompt | Better-Aligned Prompt | Notes |
| --- | --- | --- | --- |
| Customer Support | You are a helpful support agent. | You are a support agent who gives accurate, concise answers. Ask clarifying questions if needed. Avoid filler. | Explicitly bans filler and encourages clarifying questions. |
| Product Description | Write a product description. | Write an honest, clear description that helps the right buyer decide. Include key features and avoid exaggeration. | Targets the "right buyer"; those two words alone could reduce returns and negative reviews. |
| Meeting Summarizer | Summarize this transcript. | Summarize the key takeaways and action items for someone who missed the meeting. Be factual and concise. | Rewards factual output, reducing hallucinations. |
| Email Generator | Write a professional email. | Write a warm, clear email under 150 words that addresses the client’s concern and outlines next steps. | "Under 150 words" rewards concision; naming the client's concern aligns the content. |
| AI Sales Assistant | Convince the user to buy. | Ask 1 or 2 clarifying questions before recommending. Only suggest a product if it truly meets the user's needs. No upselling. | "No upselling" keeps recommendations tied to the user's request instead of pushing pricier products. |
Final Thought: Read the Output Like a Clue
Every model response is a reflection of your prompt’s design. Each token it chooses reveals what it thinks matters to you. So when something feels off, don’t just blame the model.
Ask yourself: "What game did I accidentally teach it to play?"
👟 Next Steps
Before you launch your next AI feature:
Take one prompt and run it through the Playground.
Feed it weird inputs and edge cases.
Watch for shortcuts or shallow outputs.
Rewrite the prompt based on the outcomes you actually want.
Good prompt engineering isn’t just about getting output. It’s about getting the right output, for the right reasons.