Creating an Experiment Doc
Organize and structure experiments with the template I've used at over a dozen companies
Hi there, it’s Adam. 🤗 Welcome to my (almost) weekly newsletter. I started this newsletter to provide a no-bullshit, guided approach to solving some of the hardest problems for people and companies. That includes Growth, Product, company building and parenting while working. Subscribe and never miss an issue. Questions? Ask them here.
EXPERIMENTS! HYPOTHESES! PREDICTIONS! METRICS! SCIENCE!
That’s what we’re all doing in the world of Growth, right? Eh… not always. In fact, most teams still aren’t. Or worse, if they are, it’s not done in a structured, systematic way.
I’ve created some version of an “Experiment Doc” at every company I’ve worked at for the last 15 years. Back when people (like me) were running experiments with Offermatica (which became Omniture Test & Target, and later Adobe Target), I was already creating and democratizing experiments through documentation.
In this newsletter post I’ll discuss:
Why you need a way to document experiments
The four parts to an experiment doc
Part 1: The Why
Part 2: The Plan
Part 3: The Results
Part 4: The Checklist
A template you can use today to track your experiments
Why you need a way to document your experiments
There are six important reasons to capture your experiments in a structured, written form.
Educate the team on what great experimentation looks like. (h/t Ben Williams)
Share learnings with the organization.
Ensure you don’t F-up the experiment setup.
Align the whole team on what’s happening.
Get feedback on your hypothesis and test solution.
Establish a plan for what you’ll do after the results come in.
Let’s break each of these down.
Educate the team on what great looks like
This first reason came from Ben Williams, the PLGeek himself.
Ben says,
“Great experiment docs act as a wonderful educational tool around concepts that may not be familiar to many on the team. And the value of consistency in the approach that an experiment template brings is huge.”
If you didn’t read that with a British accent please go back and reread it. This time with FEELING!
Ben is, of course, 100% correct here. An experiment doc introduces concepts that an organization may not have seen before – hypothesis creation, test and control, variants, runtime, minimum-detectable-effect (MDE), statistical significance and measurement plans.
If you’re trying to build Growth Culture in an organization then the experiment document is a tool to do so. Team members who work on experiments can show others in their functional team the experiment doc as an artifact, set a standard for what an experiment should look like, and reduce friction to defining one by empowering others to follow the template.
Share learnings with the organization
A Growth organization is a learning organization and one of the best tools for learning is a well-structured experiment.
You know that saying, “Those who cannot remember the past are condemned to repeat it?” An experiment document helps you remember the past and, more importantly, helps your teammates remember it too.
In my advisory work I’ll often hear companies say, “Oh, we tested that already.” To which I respond, “That’s great, what’d you learn? Can I see the experiment writeup?” Oftentimes that’s met with an uncomfortable silence because they never. wrote. anything. down.
“But Adam, we didn’t have time to write it down because we were too busy.”
Don’t be the person who’s too busy to document your experiments. There’s a very good chance that in running those experiments you learned something: about your users or customers, your product experience, even your ability to set up an experiment properly. You’re going to want those valuable lessons as your organization and customer base grow.
Ensure you don’t F-up the experiment setup
This happens a lot more than you might think. It has happened to me countless times in my career, and each time it has resulted in more structured and thorough experiment documents.
In fact, it’s one of the reasons that we had such a detailed experiment description in the document we used at Patreon. We wanted our data science and engineering teams to be clear about the implementation and measurement so we didn’t have to re-run the experiment. If you think you’re too busy to document an experiment then you’re definitely too busy to re-run it!
It’s one thing to have a failed test because your hypothesis was incorrect, but it’s quite a different feeling when your test fails because you didn’t implement it properly.
Align the whole team on what’s happening
Let’s see if you’re familiar with this scenario:
You ship an experiment. Yay!
A few hours go by…
Frantic message from Customer Support:
“We’re getting customers asking about X and we have no idea about it.”
😬
Or…
You ship an experiment. Yay!
Fifteen minutes go by…
Engineering:
“Our smoke tests are failing in production because you removed that step and didn’t tell us.”
😥
Or what about…
You ship an experiment. Yay!
Half a day goes by…
Another product team:
“Our metrics on [random other part of the experience] are looking really weird. What’s going on?!?”
You can avoid so much of this by using your experiment document to align the whole team on what’s happening. Do yourself a favor and don’t get yelled at by customer support, engineering, or other product teams.
Get feedback on your hypothesis and test solution
They say feedback is a gift and nowhere is that more true than when it comes to your experiment documents. Sharing your experiment ideation, captured in an experiment document, is a great way to improve the quality of your hypotheses and proposed solutions.
Going back fifteen years to the first company where I ran experiments, we would debate whether we had the right hypothesis or whether our experiment idea was testing it appropriately. Fast forward a decade, and we were leveraging the experiment document at Patreon to decide whether it even made sense to test something at all!
Establish a plan for what you’ll do after the results come in
One final reason to leverage an experiment document is to create a home for “next steps” ahead of receiving the results. Oftentimes, when experiment results come in, you’ll see endless debates about how to interpret those results and what to do next. I prefer to outline some of those next steps up front. You’ll see in the template at the end that we had a section called “ITWWS,” or “If this works we should…,” which outlined a series of next steps and follow-up experiments that we wanted to pursue if we proved (or disproved) our hypothesis. Having these next steps in place enabled speed and served as a forcing function to think through the possible outcomes.
Now that we know “why” we should create and maintain experiment documents let’s look at how to create them.
The four parts to an experiment doc
As the headline suggests, there are four distinct parts to any meaningful experiment document:
The Why
The Plan
The Results
The Checklist
Let’s break them down further!
The Why (aka Experiment Proposal)
This first section helps you kick off the experiment design and leads into the planning section.
There are six parts to this section:
Summary
The summary contains a concise statement of the experiment you plan to run using a plain language explanation.
We plan to change this part of the product by altering this property of it for this subset of users. Our goal is to improve a metric we care about, without hurting this other metric that we also care about.
Experiment owners
At your company this could vary, but I typically like to identify the following leads for the experiment:
Product Manager
Data Scientist / Analyst
Engineer
Design
You may also choose to identify some key stakeholders here – like a customer support representative, sales, success, or marketing.
The problem we are solving
In this section you are stating the problem that this current experiment is attempting to address.
We want to understand the impact of the presence of guest checkout in our purchase flow.
The hypothesis
Here you’ll state the underlying hypotheses about user behavior that support these product changes. You’re making a statement (an educated guess, really) about how two factors are related.
We believe that this thing will have an impact on the metric because of a belief which was informed by this briefly presented or linked evidence. We will know whether our hypothesis was right when we observe this change in a metric, and our hypothesis will be refuted if we observe this other change.
Following up on our example above related to guest checkout, we might write the hypothesis like this:
The addition of a guest checkout option on our e-commerce platform will result in a measurable increase in the purchase conversion rate, as it eliminates the need for customers to create an account prior to purchase, thus reducing purchase friction.
What are we hoping to learn from this
This can otherwise be read as, “why should these changes be run as an experiment?”
This can be a checklist but in general you want to state:
We need to know the impact with precision.
This change has potential downside.
What we learn from this change will determine or impact future product decisions.
We want to pause at this point in the document and sit with this question. Not everything needs to be run as an experiment. A lot of product teams over-rely on experiments, which can be quite costly. Even simple experiments create complexity and overhead. If you're going to ship a change regardless of the experiment results, save yourself the complexity and skip the experiment.
Supporting evidence
This is a very important aspect of your experiment document. Here you want to include what other information you have that has informed your thinking. For example, do you have other qualitative or quantitative data from your customers? Have you conducted other experiments in the past that might inform this? Have you done any other external research or do you have competitive insights that support this experiment?
Once you’ve completed this first part of your experiment document you should do an alignment check – does everyone agree with the elements we’ve included thus far? Do we feel that the hypothesis is strong? Should we even run this as an experiment – is the supporting evidence so strong that it’s not necessary or is this a low-stakes change?
I have seen many experiments die at this point in the process for the last reason: we don’t need to run this as an experiment OR there’s a better way of getting this information that doesn’t require an experiment.
Assuming that everyone is aligned on the elements of The Why it’s time to move to designing your experiment.
The Plan
Within this section you’ll identify the following:
Experiment Design
Description of Variants
Runtime
Risks
Post-experiment Decisions
Some elements of this section will be overkill for you and your specific situation. In the case of a business like Patreon (where we used the template at the end of this newsletter) we built our own experimentation tooling and wanted to make sure we spelled out all the details for the design of the experiment.
Experiment design
When teams have problems with their experiments (for example, they can’t properly interpret results or they introduce a bad user experience), it’s usually attributable to one of the elements of experiment design captured below. Remember that it’s not a failed experiment if the treatment underperforms control, but it IS a failed experiment if you can’t learn something reliable due to poor design.
The five critical elements of experiment design are:
Randomization unit
Platform
Eligibility criteria
Assignment criteria
Response metric
Randomization Unit
For the randomization unit you identify how the experiment will be triggered and assigned to the different treatment groups. This choice can greatly affect the results and interpretation of your experiment.
You can randomize on the individual (each person is randomly assigned), a cluster (groups of people or segments are randomly assigned), or you can conduct cross randomization (combinations of factors such that each combination is tested).
In the case of Patreon we had to decide if we were randomizing based on creators, patrons, or site visitors (prior to knowing whether they were a creator or patron), or email recipients (not even on the site). And then we needed to figure out how we’d identify them – UserID, DeviceID, CampaignID, etc.
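To make this concrete, here’s a minimal sketch (in Python) of deterministic, individual-level assignment: hash the randomization unit’s ID together with the experiment name so the same person always lands in the same bucket. The IDs, experiment name, and 50/50 split are illustrative assumptions, not a prescription for any particular tool.

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants=("control", "treatment"),
                   weights=(0.5, 0.5)) -> str:
    """Deterministically assign a randomization unit (e.g., a UserID) to a variant.

    Hashing the unit ID together with the experiment name keeps assignment
    stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]

# The same user always gets the same bucket for this experiment.
print(assign_variant("user_12345", "guest_checkout_v1"))
```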
Platform
At most of the companies I’ve worked for we had several different device types to consider: mobile vs. desktop, Android vs. iOS, and mobile web vs. mobile app, to name a few. Plus the all-important Windows Phone 😬.
Platform can be representative of a certain slice of your user base so it’s very important to identify this in your experiment design. For example, you may have a much higher percentage of new users on mobile web and a greater percentage of repeat users on your mobile app. You may also see very meaningful and different demographic profiles by device type. Age, income, locale and language (to name a few) can all vary widely depending on platform usage. Testing on one platform could be broadly applicable to others, but that often won’t be the case.
Eligibility Criteria
This identifies what someone has to do to be included in the experiment. It could be that they need to be part of a specific segment, or visit a particular part of the experience, or both.
Assignment Criteria
This answers the question: when do you bucket someone into either the control or the treatment group? It is a critical point of QA for your data and engineering teams to make sure that you’re capturing people at the correct part of the experience. You can collect a lot of bad data if you’re capturing someone too early who never has a chance to encounter your experiment – for example on session start or login when the response metric is deep in the experience.
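As a rough sketch of what bucketing at the right point can look like in code, the snippet below assigns and logs exposure only when someone actually reaches the eligible surface (the checkout page, continuing the guest-checkout example). The eligibility rule is hypothetical, `assign_variant` is the helper sketched above, and `log_exposure` is a stand-in for whatever event pipeline you actually use.

```python
def log_exposure(user_id: str, experiment: str, variant: str) -> None:
    # Stand-in for your real analytics/event pipeline.
    print(f"exposure: user={user_id} experiment={experiment} variant={variant}")

def on_checkout_page_view(user_id: str, is_logged_in: bool) -> str:
    """Bucket and record exposure at the point where the user could actually
    see the treatment, not at session start or login."""
    # Eligibility check first (hypothetical criterion for the guest-checkout example).
    if is_logged_in:
        return "not_eligible"

    variant = assign_variant(user_id, "guest_checkout_v1")
    log_exposure(user_id, "guest_checkout_v1", variant)
    return variant
```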
Response Metric
The response metric identifies what we’ll measure for each bucket of the experiment. This could be a conversion, an amount, or something else entirely. A key consideration for the response metric is how long you set the conversion window. If you have a product where people engage across multiple sessions, or go through an extended consideration period before converting, then it’s important to identify that in this section. You’ll want to capture data across a long enough period of time for the results to be meaningful for your experience.
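For illustration, here’s a tiny sketch of what enforcing a conversion window can look like when computing the response metric. The 7-day window and the dates are made up; pick a window that matches your product’s real consideration cycle.

```python
from datetime import datetime, timedelta

CONVERSION_WINDOW = timedelta(days=7)  # assumed window, not a recommendation

def converted_within_window(exposure_time: datetime,
                            conversion_times: list[datetime]) -> bool:
    """Count a conversion only if it happens within the window after exposure."""
    deadline = exposure_time + CONVERSION_WINDOW
    return any(exposure_time <= t <= deadline for t in conversion_times)

exposed_at = datetime(2024, 1, 1)
print(converted_within_window(exposed_at, [datetime(2024, 1, 4)]))   # True: 3 days later
print(converted_within_window(exposed_at, [datetime(2024, 1, 11)]))  # False: outside the window
```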
After you’ve identified the experiment design above you’ll want to accurately describe your variants next.
Description of Variants
Here we want to focus on the minimal experiment that can validate our hypothesis. It’s your minimum sufficient test or minimum viable test. Explain it in plain language and link out to your sketches, wireframes, screenshots or pull requests. Be diligent about capturing screenshots and mockups here – they’ll serve as both a valuable teaching tool for new team members as well as helpful for the existing team to remember what they tested.
The minimal experiment is important for a few reasons. First, you can move faster and conduct more experiments (thereby learning more things) if you keep your experiments to a minimal scope. Second, learning is clearer for simple experiments. Changing too many variables at once makes it harder to pinpoint what in particular caused the behavior change you’re observing.
Of course this isn’t always possible and sometimes your experiments need to be larger and more complex. For example, when testing an onboarding flow you might not want to iteratively test each step of the experience in isolation. It could impact your ability to observe how different steps interact with one another and you may not be able to see changes to the behaviors that really matter. In this case, go big or go home!
I’ve also found it useful here to call out some of the other variations that we may want to test in the future and specifically identify them as being in the “later” or “next” bucket. This signals that you are aware that there are experiments waiting in the wings but it’s too complex to run them at this time.
Runtime
One of the questions that stakeholders always ask is “how long will this take?” A runtime calculator can tell you. You start with the existing performance, then identify the minimum lift (or minimum detectable effect, MDE) you want to see. Add in the planned variants, the percentage allocation, the statistical power, and the significance level, and you can calculate the sample size required for the test. Once you know that, compare it to the amount of traffic you get to estimate how long it will take to reach that sample size.
There are plenty of online calculators out there that can help you with this. At Patreon we built our own calculator as part of our experiment system, but for the vast majority of people an online calculator is sufficient. We used this information to determine when it was safe to look at the results of a test.
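If you want to see roughly what those calculators do under the hood, here’s a back-of-the-envelope sketch using the standard two-proportion sample-size formula. The baseline rate, MDE, and traffic numbers are made up, and this is not the Patreon tool, just an illustration.

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per variant for a two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance level
    z_beta = norm.ppf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# Made-up numbers: 5% baseline conversion, detect a 10% relative lift,
# 4,000 eligible visitors per day split evenly across two variants.
n = sample_size_per_variant(0.05, 0.10)
days = n / (4000 / 2)
print(f"~{n:,} users per variant, or roughly {days:.0f} days of traffic")
```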
The other aspect that runtime helps with is planning. If you know how long it’ll take to reach your MDE and you are trying to sequence several overlapping experiments then you can organize the timed release of the different experiments and be fairly confident you won’t have collisions.
Risks
Every experiment comes with certain risks. Mitigating risk is one of the reasons that we experiment. But it’s possible that an experiment could have adverse effects that we’re not measuring directly and it’s important to call those out here.
You can also use this as a way to identify which other teams need to be aware of the experiment and its risks. For example, at Patreon (which is a B2B2C business) we could change something on the patron side that negatively affected conversion, and a creator might notice and contact us about it. We wanted our creator-facing teams to be prepared with the knowledge of which experiments were running, AND we didn’t always want to tell creators when an experiment was live for fear that they might tell their fans and bias the results.
Post Experiment Decisions (aka If This Works We Should)
Here we wanted to identify what the conditions of a go / no-go decision would be. It’s really important to put a stake in the ground before you run the experiment and call out how you will make decisions based on the results. There are some changes that you’ll want to ship even if they’re neutral or slightly negative on the metrics you care about. Other changes would need to show a BIG win in order to ship the changes because of how they might impact the overall product experience. Articulating this ship/no-ship decision in advance helps drive alignment. Have the hard conversations up front before you feel attached to the results.
I also recommend that you ask about follow-ups that depend on the results of the experiment. This is useful since teams may have parts of their roadmap that are contingent on an experiment’s results. If one experiment informs subsequent parts of the roadmap, and that experiment fails or gets delayed, then the team will likely have to have conversations with leadership about changing priorities or timelines. This also serves as a forcing function to create visibility among teams in case one team has a dependency on another team’s experiments OR in the event that an experiment creates work for another team – like a platform team, or an operations team.
Now that we have an idea of what we’ll do with the results it’s time for… THE RESULTS.
The Results
Here we want to analyze the outcome of our experiment. Specifically, we care about:
What was the final outcome?
Which version won and by how much?
Why did we observe what we observed?
What is the final recommendation and conclusion?
Are any follow-ups needed?
What were the takeaways of our metrics assessment? How were our tradeoff metrics impacted? Did we observe any other interesting behavior?
We include links to analysis, cross-linking to other experiment docs that we’ll use to build off of this one, and links to any additional pull requests for the results analysis or other experiments.
The most important aspect of the results is the “why.” Why did we observe what we observed? Understanding the why of the results means you’ve achieved a learning outcome that can be reinvested back into the business, shared with other teams, and generally improve the likelihood of success of others inside the company. Don’t skimp on this important area.
The Checklist
Now you might be thinking: “Adam, an experiment checklist sounds very project manager-like. I do product management.” Well, once you F-up the details on an experiment and have to re-run it later you’ll appreciate a little dose of project management in the form of this checklist.
Your checklist should identify the steps necessary to:
Complete The Why section in its entirety.
Align on the decision that this should, in fact, be an experiment.
Build your experiment plan.
Share the experiment proposal and plan more broadly for feedback.
Get feedback on instrumentation and implementation.
Notify any important stakeholders.
Set up and turn on the experiment within your system of record.
Notify people that it’s live.
Conduct analysis and share out the results.
CLEAN. UP. THE. CODE.
My friends in engineering will appreciate this last one: post-experiment, don’t leave the old, unused code sitting around cluttering things up. Clean it up! That means you might be implementing the changes (or not), but at the very least you’re removing the old experiment code. Experiments aren’t finished until this part is done.
Wrapping it all up
So this is how a bill becomes a law, or rather, how an idea becomes an experiment.
My hope is that this helps you bring some order to the chaos, transform your learning, and overall become better at building high impact experiments.
In closing I’ll share a template that you can use, based on the examples provided in this newsletter. My team at Patreon used this doc to help our product and growth teams plan experiments much more thoroughly and avoid some of the mishaps I’ve identified throughout. We also used it as a forcing function to have the conversation about whether we should run something as an experiment at all, and to align teams across product, engineering, data science, marketing, sales, and support.
I have used adapted forms of this doc with at least a dozen companies and the impact has been quite significant: better experimentation, better organizational learning, and very productive teams.
Happy experimenting!
Link to Experiment Document Template in Google Docs: https://docs.google.com/document/d/1Y0POGrrKL0VhqSJE-OuPp7liY-5ZAgr63cNLaD8L2H0/edit?usp=sharing