Considering all the recent innovation in paid search regarding targeting and automation, it’s easy to forget the importance of testing ad creative. With Google continuously moving towards a more automated approach – often based on machine learning – marketers can't afford to overlook the importance of optimising their ad creative.

**The Problem**

Anyone running a medium to large account knows how much manual work goes into analysing ad creative to make effective optimisation decisions. In fact, Google encourages marketers to test three to five ads per ad group. The company even states, “*The more ads that are present in an ad group, the more options you’ll have for success in an auction.*”

This could result in upwards of thousands of ads running across an account. Add in the fact that they often have different start dates and you can end up in an Excel nightmare of trying to calculate in-sample split tests to reliably establish performance.

Additionally, it’s rare that marketers consider (or report on) the potential loss of *not* taking action. Whilst running several ad copies simultaneously may yield more “options for success”, it also opens up the possibility of wasting visibility on poorly performing ad creative. Given the same ad space, creative with better messaging could have earned a click and a subsequent conversion.

While hedging our bets by running several ad copies at the same time is important for speaking to a broader audience, we also need to know when it’s not working well and losing potential revenue.

To further illustrate the cost of not acting, let’s say we have two running ads: Ad A and Ad B. Ad A has twice the click-through rate of Ad B, and we have established the difference as statistically significant. If left unoptimised, we could miss out on a 33% potential increase in traffic (*see logic below*).

$latex \text{traffic}_A = \text{impressions} \times \text{ctr}_A$

$latex \text{traffic}_B = \text{impressions} \times \frac{\text{ctr}_A}{2}, \text{traffic}_A = 2 \times \text{traffic}_B$

$latex \text{Unoptimised traffic} = \text{traffic}_A + \text{traffic}_B = 3 \times \text{traffic}_B$

$latex \text{Optimised traffic} = 2 \times \text{traffic}_A = 4 \times \text{traffic}_B$

$latex \text{Potential increase by optimisation} = \frac{4 \times \text{traffic}_B - 3 \times \text{traffic}_B}{3 \times \text{traffic}_B} = \frac{1}{3} \approx 33\%$
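The derivation can be sanity-checked numerically. The impression and CTR figures below are illustrative assumptions, not data from the article; only the ratio matters.

```python
# Numeric check of the 33% uplift logic, with illustrative numbers.
impressions = 10_000        # impressions served to EACH ad (assumed equal split)
ctr_a = 0.04                # Ad A's click-through rate (hypothetical)
ctr_b = ctr_a / 2           # Ad B performs half as well, as in the example

traffic_a = impressions * ctr_a        # 400 clicks
traffic_b = impressions * ctr_b        # 200 clicks

unoptimised = traffic_a + traffic_b    # both ads left running: 600 clicks
optimised = 2 * traffic_a              # all impressions go to Ad A: 800 clicks

uplift = (optimised - unoptimised) / unoptimised
print(f"{uplift:.1%}")  # 33.3%
```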

This is amplified further if the ad is fundamental to conversions. You’re losing traffic volume and, even worse, conversion volume.

Of course, the estimated loss is based on some heavy assumptions, hence the frequent use of the word ‘potential.’ However, the metric does give a good idea of the overall ‘effectiveness’ of ad groups. When combined with the total visibility for a collection of ads, we can estimate the ‘sum of loss of potential’ in traffic. This can then be extended further to loss of conversions or even loss of spend.

Reporting on these metrics should only be done for ad groups where we have established statistically significant differences in ad performance. The metrics can then serve to prioritise ad groups by their relative need for optimisation, as well as to predict the potential uplift that recommended actions may yield.

Below is an example of expected clicks lost over a 30-day period if the ads are left to run, based on poorly performing ad creative on an account:
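A projection of this kind can be sketched as follows. All figures are hypothetical, chosen only to show the shape of the calculation: clicks the weaker ad’s impressions would have earned at the winner’s CTR, minus the clicks it actually earns.

```python
# Illustrative 30-day projection of clicks lost by leaving a weaker ad running.
# All figures are hypothetical, not taken from any real account.
daily_impressions = 2_000   # impressions the ad group receives per day
ctr_winner = 0.050          # CTR of the best-performing ad
ctr_loser = 0.030           # CTR of the underperforming ad
share_loser = 0.5           # share of impressions the weaker ad absorbs
days = 30

# Clicks forgone: the weaker ad's impressions, valued at the CTR gap.
lost_clicks = daily_impressions * share_loser * (ctr_winner - ctr_loser) * days
print(round(lost_clicks))  # 600
```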

**The Challenge(s)**

Accounts that are highly targeted because of granular account structures can have hundreds, if not thousands, of running ads. It’s crucial to find a way to comb through them all to infer whether each ad’s performance is at a stage to prompt an action.

There’s also the requirement of using data that’s in-sample. There is no point in comparing ads’ performance unless they have been active over the same period. If you fail to do this, you will likely end up with skewed and misleading results.
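One way to enforce in-sample data is to restrict every comparison to the window in which all tested ads were active. A minimal sketch, with hypothetical ad records:

```python
from datetime import date

# Sketch: only compare ads over the period in which ALL of them were running.
# The ad records below are hypothetical.
ads = [
    {"id": "ad_a", "start": date(2019, 3, 1),  "end": date(2019, 5, 31)},
    {"id": "ad_b", "start": date(2019, 3, 15), "end": date(2019, 5, 31)},
    {"id": "ad_c", "start": date(2019, 2, 1),  "end": date(2019, 5, 10)},
]

window_start = max(ad["start"] for ad in ads)   # latest start date
window_end = min(ad["end"] for ad in ads)       # earliest end date

if window_start > window_end:
    raise ValueError("No overlapping runtime; ads cannot be compared in-sample")

print(window_start, window_end)  # 2019-03-15 2019-05-10
```

Performance data would then be pulled only for that shared window before any testing.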

Next, you must use a reliable approach to establish whether differences in performance actually exist, and no, eyeballing results to see what’s working is not one of them. Traditional split tests typically take the frequentist approach, though the Bayesian approach is on the rise.

The frequentist approach is based on the p-value, which can be interpreted as answering: “if the null hypothesis is true, how likely are we to see a result at least as extreme as the one we observed?” In our case, the null hypothesis is that the tested ads’ performances are not different. If the chance of obtaining the observed outcome is below a predetermined threshold—the significance level, usually 5%—we reject the null hypothesis, conclude that the ads’ performances differ, and label the difference as statistically significant.
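For CTR comparisons, the frequentist test described above is commonly a two-proportion z-test. A minimal stdlib-only sketch, with illustrative click and impression counts:

```python
import math

# Two-proportion z-test on CTR (a common frequentist choice for this problem).
# Click/impression counts are illustrative.
def two_proportion_p_value(clicks_a, imps_a, clicks_b, imps_b):
    p_a, p_b = clicks_a / imps_a, clicks_b / imps_b
    p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)   # pooled CTR under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
    z = (p_a - p_b) / se
    # Two-sided p-value via the standard normal survival function.
    return math.erfc(abs(z) / math.sqrt(2))

p = two_proportion_p_value(500, 10_000, 400, 10_000)   # 5.0% vs 4.0% CTR
print(p < 0.05)  # True: reject the null at the 5% significance level
```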

This is a somewhat simplified description and doesn’t touch on the required sample sizes or the ‘power of the test’ (the probability of rejecting a false null hypothesis), both of which must be determined prior to the analysis. Another important aspect—and one that is frequently abused—is that no ‘peeking’ is allowed. This means that the test must be run in full before we can look at the results.

There’s a common misconception that we can simply wait for significance and then end the test. This violates the test’s fundamental assumptions and leads to invalid results. Additionally, testing many variations simultaneously requires p-value corrections, which makes results harder to interpret.
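The simplest such correction is Bonferroni: divide the significance level by the number of comparisons. A sketch with illustrative p-values:

```python
# Bonferroni correction for multiple comparisons: the per-test threshold
# shrinks with the number of tests. The p-values below are illustrative.
alpha = 0.05
p_values = [0.003, 0.020, 0.040, 0.300]    # one test per ad pair, say

threshold = alpha / len(p_values)          # 0.05 / 4 = 0.0125
significant = [p for p in p_values if p < threshold]
print(significant)  # [0.003] - only the strongest result survives correction
```

Note that 0.020 and 0.040 would pass an uncorrected 5% threshold, which is exactly the trap the correction guards against.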

In short, the frequentist approach is a powerful test when performed correctly. However, there are many rules to follow, and interpreting test results can be difficult without prior training in statistics.

Which brings us to the Bayesian approach. Instead of asking the data ‘*how likely are we to end up with this specific result if there is no difference between the tested variations*?’ it’s more intuitive to ask, ‘*what is the probability that each variation is the best*?’. That question is precisely the reported test result with the Bayesian approach.

It also conveniently quantifies the risk of making wrong decisions. It returns what can be interpreted as a worst-case scenario of how much a ‘winning’ variation can be beaten *if* it is in fact not the best.

This means that even if there is a high probability that an ad is best, before declaring it a winner we also consider the ‘risk of loss’ given by this worst-case scenario. If the risk is high, we can choose not to declare the ad a winner until we have collected more information.

The ‘risk of loss’ essentially means that if we're wrong, the expected amount to lose is X% of the CTR. This can be seen in the example below under the ‘expected loss’ column.
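Both Bayesian quantities, the probability that each ad is best and its expected loss, can be estimated with a short Monte Carlo simulation over Beta posteriors. This is an illustrative sketch, not the platform’s actual implementation, and the click counts are hypothetical.

```python
import random

random.seed(0)

# Monte-Carlo estimate of 'probability to be best' and 'expected loss' (in CTR)
# for each ad, using Beta(1 + clicks, 1 + misses) posteriors. Counts are
# hypothetical; this is a sketch, not a production implementation.
ads = {"ad_a": (120, 2_000), "ad_b": (90, 2_000)}   # (clicks, impressions)
n_draws = 100_000

wins = {name: 0 for name in ads}
loss = {name: 0.0 for name in ads}
for _ in range(n_draws):
    # Draw one plausible CTR per ad from its posterior.
    sample = {
        name: random.betavariate(1 + clicks, 1 + imps - clicks)
        for name, (clicks, imps) in ads.items()
    }
    best = max(sample.values())
    for name, ctr in sample.items():
        if ctr == best:
            wins[name] += 1
        loss[name] += best - ctr   # CTR forgone by serving this ad instead

prob_best = {name: wins[name] / n_draws for name in ads}
expected_loss = {name: loss[name] / n_draws for name in ads}
print(prob_best, expected_loss)
```

With these numbers, ad_a wins the vast majority of draws and its expected loss is tiny, so declaring it the winner carries little risk.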

To summarise, both the ‘frequentist’ and ‘Bayesian’ approaches have their advantages. The key takeaway is that the Bayesian approach is easier to comprehend (the probability that an ad is winning). It’s arguably the preferable option when testing multiple variations, too, because no corrections need to be made.

Furthermore, it’s one thing to decide on a reliable test approach, but it will never be better than its data. In addition to finding an appropriate test function which deals with uncertainty, you need to consider several factors:

- Natural variation
- Seasonality
- Shift in budgets or strategy

Each of these can potentially skew test results and needs to be dealt with properly. Seasonality and shifts in budget or strategy are controlled for by making sure the runtime is consistent between tested ad creative. Natural variation is trickier, though.

In theory, a normal test—Bayesian or frequentist—will give a false positive 5% of the time by construction of the significance level. This means that for the full sample space (all running ads in an account), we can expect about 5% of ads to show significant differences based on chance alone.

Since this approach essentially searches the entire sample space for significant differences by running individual tests for each ad group, we need to be wary of this. By comparing the frequency of significant results we expect to see by chance against the frequency we have observed, we know whether we should be cautious about drawing conclusions without digging deeper into the data.
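That comparison can itself be made with an exact binomial tail probability: how surprising is the observed count of significant ad groups if each test has only a 5% false-positive rate? A stdlib sketch with illustrative counts:

```python
import math

# How surprising is the observed number of 'significant' ad groups if each
# test has a 5% false-positive rate? Counts below are illustrative.
n_tests = 200    # ad groups tested across the account
observed = 22    # groups flagged as significant
alpha = 0.05     # per-test false-positive rate (expect ~10 by chance)

# Exact upper tail: P(X >= observed) for X ~ Binomial(n_tests, alpha)
tail = sum(
    math.comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)
    for k in range(observed, n_tests + 1)
)
print(tail < 0.01)  # True: far more flagged groups than chance would produce
```

A tiny tail probability suggests the flagged differences are mostly real; a tail near 0.5 would suggest they are largely noise.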

**The Solution**

In May, we announced the release of our data science platform, Ayima Intelligence. We wanted to bring the power of algorithmic optimisation to our teams of consultants and simplify the process of data analysis. One of the modules we created was ad analysis and optimisation, and in this case, it gives a straightforward analysis with interpretable results through the Bayesian “multi-armed bandit” model.

When run at the account level, the analysis immediately shows us which ads are likely to win, and just as importantly, which ads are losing. This enables us to act regardless of how many ads are in an ad group. By weeding out poor ad creative, we can keep the recommended number of running ads in each ad group, preserving our “options for success” while simultaneously optimising for future traffic volume.
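The core idea of a Bayesian multi-armed bandit can be illustrated with a Thompson-sampling selection step: each impression goes to the ad whose sampled CTR is highest, so traffic naturally drains away from weak creative. This is a generic sketch with hypothetical counts, not Ayima Intelligence’s actual model.

```python
import random

random.seed(1)

# Thompson-sampling step for a multi-armed bandit over ads: sample each ad's
# CTR from its Beta posterior and serve the highest draw. Counts are
# hypothetical; this illustrates the technique, not the platform's model.
stats = {"ad_a": (40, 1_000), "ad_b": (15, 1_000)}   # (clicks, impressions)

def choose_ad(stats):
    sampled = {
        name: random.betavariate(1 + clicks, 1 + imps - clicks)
        for name, (clicks, imps) in stats.items()
    }
    return max(sampled, key=sampled.get)

# Over many simulated selections, the stronger ad dominates while the weaker
# one still gets occasional exploratory traffic.
picks = [choose_ad(stats) for _ in range(1_000)]
print(picks.count("ad_a") / len(picks))
```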

Below we can see each ad represented by a bar (bars with matching colour belong to the same ad group).

Further, we forecasted the performance of each ad group (see image 1) so we can understand which ad groups to prioritise for maximum return. This makes planning optimisation tasks much simpler in the process—a huge bonus for the time-poor.

We can also zoom in on each ad group and the consistency of winning probabilities to ensure we’re not dealing with one of those devious false positives we can expect to find. Below is a case of a clear and consistent result over time.

But what about the results?

Well, we decided to isolate the ads that our model had highlighted as ‘underperformers’ and then performed further split tests to be sure that they were indeed different. We discovered statistically significant uplifts of **+26%** for click-through rate and **+44%** for conversion rate.

Successful ad tests like these are the lifeblood of a successful campaign, and we hope that sharing our approach can help bolster your own testing routine.