How to Choose an Incrementality Testing Tool: 31 Evaluation Criteria
Incrementality testing has moved from a niche capability to a mainstream requirement for serious marketing measurement programs. Yet most organizations struggle with the same problem when evaluating tools: vendor comparison guides are shallow, RFP templates are vague, and it's genuinely hard to know what good looks like across geo tests, A/B tests, conversion lift tests, and the platforms that unify them.
This article presents a research-backed evaluation framework of 31 criteria across 7 categories for assessing incrementality testing tools. The criteria were derived from primary research, not vendor marketing materials, and are designed to help marketing analytics teams and procurement professionals ask the right questions.
If you're looking for a vendor comparison applying this framework, see our separate article: Best Incrementality Testing Tools in 2026: In-depth Vendor Comparison.

Table of Contents
- What is incrementality testing?
- The three primary incrementality test types
- How we developed the evaluation criteria
- The 31 evaluation criteria across 7 categories
- How to use this framework in your own evaluation
- Frequently asked questions
- Further reading
What is Incrementality Testing?
Incrementality testing is a method for measuring the true causal impact of advertising: the additional sales, conversions, or revenue that would not have happened without a specific marketing activity.
Compared to Marketing Mix Modeling (MMM), which estimates incrementality by analyzing historical time-series data, incrementality testing is an active method. A marketing intervention is designed (such as stopping spend on a channel in one geography), then executed, then analyzed. The objective is a clean, causal read of the true incremental lift from a specific marketing activity.
Incrementality testing is the most accurate approach for estimating the true incremental sales impact of a channel at a specific point in time and spend level. But it has limitations: the estimate applies only to the tested channel, at that specific moment, and at that specific spend level. It does not provide continuous marketing measurement or show how Incremental ROAS changes as spend changes. This is why the primary use case for incrementality testing is to calibrate a Marketing Mix Model, which provides continuous, cross-channel measurement of Incremental ROAS (iROAS) and Marginal Incremental ROAS (miROAS).
The Three Primary Incrementality Test Types
Modern incrementality programs rely on three distinct test designs, each suited to different questions and contexts. Understanding all three is essential to evaluating whether a tool genuinely covers your measurement program or only part of it.
Geo Tests
Geo tests compare treatment and control geographies to derive causal estimates of incremental lift. They are the workhorse of modern incrementality measurement and are well-suited for any channel where spend can be varied by region. A geo test produces an iROAS estimate with a confidence interval for the tested channel, geography, and spend period.
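To make the "iROAS estimate with a confidence interval" output concrete, here is a minimal sketch of the arithmetic. It assumes a deliberately simple design: control-geo sales are scaled by a pre-test sales ratio to form the counterfactual, and daily lifts are treated as independent. The function name `geo_iroas` and its parameters are hypothetical, and production tools typically use richer methods such as synthetic control.

```python
import math
from statistics import stdev

def geo_iroas(treated_sales, control_sales, pre_ratio, incremental_spend, z=1.96):
    """Illustrative iROAS estimate from daily sales during the test period.

    treated_sales, control_sales: daily sales in treatment and control geos.
    pre_ratio: treated-to-control sales ratio observed before the test
               (assumption: it would have held without the intervention).
    incremental_spend: extra media spend in the treatment geos during the test.
    """
    # Daily lift = observed treated sales minus the scaled-control counterfactual.
    daily_lift = [t - pre_ratio * c for t, c in zip(treated_sales, control_sales)]
    total_lift = sum(daily_lift)
    # Standard error of the summed lift, assuming independent daily lifts.
    se_total = stdev(daily_lift) * math.sqrt(len(daily_lift))
    iroas = total_lift / incremental_spend
    ci = ((total_lift - z * se_total) / incremental_spend,
          (total_lift + z * se_total) / incremental_spend)
    return iroas, ci

# Hypothetical numbers, for illustration only.
iroas, (ci_low, ci_high) = geo_iroas(
    treated_sales=[120, 130, 125, 135],
    control_sales=[100, 100, 100, 100],
    pre_ratio=1.0,
    incremental_spend=50,
)
```

The sketch ignores autocorrelation in daily sales and uncertainty in the pre-test ratio, both of which would widen the interval in practice.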
A/B Tests for Owned Media
A/B tests randomize at the user or audience level, making them best suited for owned media such as catalog mailings, email campaigns, or push notifications. They produce the same core output as geo tests (an iROAS estimate with a confidence interval), but at the individual exposure level rather than the geographic level.
Conversion Lift Tests
Conversion Lift tests are run inside ad platforms (Meta, Google, TikTok). The platform randomly withholds ads from a portion of the target audience, then compares conversion rates between the exposed and withheld groups. Conversion Lift tests are widely used by marketing teams but treated very differently by incrementality testing vendors: some dismiss them entirely, others natively ingest and normalize results across platforms.
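The comparison between exposed and withheld groups can be sketched as a difference in conversion rates. The example below is an assumption-laden illustration, not how any ad platform computes its reported lift: it uses a normal approximation for the difference of two proportions, and the function name, parameters, and the fixed `value_per_conversion` are hypothetical.

```python
import math

def conversion_lift(test_users, test_conversions, holdout_users, holdout_conversions,
                    spend, value_per_conversion, z=1.96):
    """Illustrative lift and iROAS from a randomized holdout test."""
    p_test = test_conversions / test_users
    p_holdout = holdout_conversions / holdout_users
    diff = p_test - p_holdout
    # Normal-approximation standard error for a difference of two proportions.
    se = math.sqrt(p_test * (1 - p_test) / test_users
                   + p_holdout * (1 - p_holdout) / holdout_users)
    # Converts a conversion-rate difference into revenue per unit of spend.
    scale = test_users * value_per_conversion / spend
    return {
        "relative_lift": diff / p_holdout,
        "iroas": diff * scale,
        "ci": ((diff - z * se) * scale, (diff + z * se) * scale),
    }

# Hypothetical numbers, for illustration only.
result = conversion_lift(
    test_users=100_000, test_conversions=2_200,
    holdout_users=100_000, holdout_conversions=2_000,
    spend=10_000, value_per_conversion=50,
)
```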
Each test type produces a point estimate of iROAS with a confidence interval. A mature incrementality testing program typically runs all three, and the best tools support all three in a unified library.
How We Developed the Evaluation Criteria
The 31 criteria in this framework were built from primary research across four sources.
1. 700+ discussions with marketers and analytics professionals
We analyzed comments and requirements related to incrementality testing from more than 700 discussions with Sellforte customers and prospects, including marketers, marketing analytics leads, and data scientists working in advertising-heavy industries such as retail, ecommerce, DTC, travel and hospitality, and restaurants.
2. Enterprise RFP documentation
We reviewed the requirements documentation from more than ten enterprise RFPs explicitly specifying requirements for incrementality testing platforms. Enterprise RFPs tend to be more precise than vendor marketing materials about what actually matters in procurement.
3. Internal practitioner interviews
We interviewed Customer Success and Data Science team members who work with incrementality testing in production across dozens of enterprise advertiser implementations, giving us ground-level insight into what differentiates tools in real use versus on paper.
4. Desk research and LLM-assisted analysis
We complemented primary research with desk research and LLM-assisted investigation to identify gaps and pressure-test assumptions against publicly available product documentation, technical specifications, and analyst coverage.
The result: 31 evaluation criteria across 7 categories.
The 31 Evaluation Criteria Across 7 Categories
The 7 categories reflect the full scope of what a mature incrementality testing platform needs to do: from analyzing individual test types to unifying experiments, integrating with MMM, and meeting enterprise requirements.
Category 1: Geo Test Analysis
Geo testing is the workhorse of modern incrementality measurement. This category measures the depth and quality of a tool's geo test analysis capabilities, from the statistical methodology behind the analysis to the user interface that makes it accessible without analyst support.
| ID | Criterion | What it means |
|---|---|---|
| 1.1 | Analyzes geo tests with synthetic control method, providing iROAS and confidence interval | Uses synthetic control methodology to compare treatment vs. matched control geos, outputting incremental ROAS with statistical confidence intervals to quantify causal lift. |
| 1.2 | Self-serve UI for analyzing & reviewing geo test results | Marketers can upload data, run analyses, and review geo test results through a web interface without needing a data scientist or analyst to write code. |
| 1.3 | Estimates media counterfactual for lost/incremental spend | Models what media spend would have been in the absence of the test, so iROAS reflects actual incremental spend rather than nominal budget changes. |
| 1.4 | Configurable default post-test treatment / measurement window | User can set default treatment and measurement windows (e.g., test duration, post-test cooldown) that apply across tests, with the option to override per test. |
| 1.5 | Executes geo experiments on ad platform | Tool can launch and manage geo experiments directly on ad platforms (e.g., Meta, Google) via API, rather than only analyzing tests configured elsewhere. |
| 1.6 | Automatically detects geo tests from media and sales data | Identifies likely geo experiments from observed media and sales patterns automatically, without requiring users to manually flag test periods or geos. |
Why it matters: Most incrementality tools cover the geo testing basics (criteria 1.1–1.4). The differentiators are criteria 1.5 and 1.6 (native ad platform execution and automatic test detection). Both substantially reduce the analyst overhead required to run a mature geo testing program.
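For readers unfamiliar with the synthetic control method named in criterion 1.1, the core idea can be sketched in a few lines: find non-negative weights summing to one so that a weighted blend of control geos tracks the treatment geo before the test, then use that blend as the counterfactual during the test. The sketch below uses projected gradient descent in plain Python. It illustrates the general technique only, not any vendor's implementation, and all function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def fit_synthetic_control(control_pre, treated_pre, steps=2000):
    """Fit weights on pre-test data.

    control_pre: list of pre-test sales series, one per control geo.
    treated_pre: pre-test sales series for the treatment geo.
    """
    k, n = len(control_pre), len(treated_pre)
    sum_sq = sum(x * x for series in control_pre for x in series)
    lr = n / (2.0 * sum_sq)  # conservative step size for the squared-error loss
    w = [1.0 / k] * k
    for _ in range(steps):
        resid = [sum(w[j] * control_pre[j][t] for j in range(k)) - treated_pre[t]
                 for t in range(n)]
        grad = [2.0 / n * sum(resid[t] * control_pre[j][t] for t in range(n))
                for j in range(k)]
        w = project_to_simplex([w[j] - lr * grad[j] for j in range(k)])
    return w

def counterfactual(control_test, weights):
    """Weighted blend of control geos during the test period."""
    n = len(control_test[0])
    return [sum(w * series[t] for w, series in zip(weights, control_test))
            for t in range(n)]

# Hypothetical pre-test data: the treated geo is an exact 70/30 blend.
geo_a = [100, 110, 105, 120, 115, 130]
geo_b = [80, 70, 90, 75, 95, 85]
treated = [0.7 * a + 0.3 * b for a, b in zip(geo_a, geo_b)]
weights = fit_synthetic_control([geo_a, geo_b], treated)
```

Incremental sales are then the treated geo's observed test-period sales minus the `counterfactual` series. Real data will not fit this cleanly, so pre-test fit quality is worth inspecting before trusting the result.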
Category 2: A/B Test Analysis for Owned Media
A/B tests at the user or audience level are common among large ecommerce businesses testing owned media such as catalogs and email. From an analysis perspective they mirror geo tests, with the same expectations around iROAS estimation and confidence intervals, but dedicated A/B test analysis is rare among incrementality testing tools, which makes this category a meaningful differentiator.
| ID | Criterion | What it means |
|---|---|---|
| 2.1 | Analyzes owned media A/B tests with synthetic control method, providing iROAS and confidence interval | Applies synthetic control methodology to user-level or audience A/B tests, producing incremental ROAS with confidence intervals. |
| 2.2 | Self-serve UI for analyzing & reviewing owned media A/B test results | Marketers can upload data, run analyses, and review A/B test results through a web interface without analyst support. |
| 2.3 | Estimates media counterfactual for A/B tests | Models counterfactual media spend so iROAS reflects true incremental investment rather than just the budget delta between cells. |
| 2.4 | Configurable default post-test treatment / measurement window for A/B tests | User can set default treatment and measurement windows that apply across A/B tests, with per-test overrides. |
Why it matters: Organizations with significant catalog or email programs need incrementality measurement for owned media, not just paid channels. If your tool only covers geo tests, you're missing a major measurement surface and you'll need a separate solution or manual analysis for owned-media experiments.
Category 3: Conversion Lift Test Analysis
Conversion Lift tests are platform-native experiments run inside Meta, Google, TikTok, and other ad platforms. They are widely run by marketing teams, often as a standard practice on major paid media channels, but they receive very different treatment from incrementality testing vendors. Some dismiss them citing platform bias; others natively ingest, normalize, and feed them into MMM calibration. This category separates the two approaches.
| ID | Criterion | What it means |
|---|---|---|
| 3.1 | Ingests Conversion Lift test results, providing iROAS and confidence interval | Imports Conversion Lift test results from ad platforms (Meta, Google, etc.) and reports iROAS with confidence intervals. |
| 3.2 | Self-serve UI for analyzing Conversion Lift test results | Marketers can review Conversion Lift test results in a web interface without needing to pull raw data from each ad platform separately. |
| 3.3 | API connectors for automated conversion lift test ingestion | Pre-built API connectors automatically pull Conversion Lift results from ad platforms, eliminating manual exports. |
| 3.4 | Results comparable to ad platform data on campaign and ad set level | Normalizes Conversion Lift outputs so iROAS and lift can be compared at the campaign and ad set level across platforms on a like-for-like basis. |
| 3.5 | Daily snapshot of conversion lift test progress, including iROAS and confidence interval | Provides daily updated views of in-flight tests, including running iROAS and confidence interval estimates, so users can monitor progress before completion. |
Why it matters: Most marketing teams run Conversion Lift tests as part of their standard paid media workflow. A tool that doesn't ingest them forces teams to manage a separate data stream outside the experiment library, breaking the unified view of incrementality across all test types. Criterion 3.5 on daily in-flight snapshots is particularly underrated: waiting until test completion to check results is operationally inefficient and risks wasted spend on underperforming tests.
Category 4: Experiment Recommendations & Insights
Beyond analyzing experiments you've already run, the best incrementality testing tools actively help you get more value from your measurement program by recommending what to test next, designing statistically rigorous experiments, and translating technical outputs into narratives that non-technical stakeholders can act on.
| ID | Criterion | What it means |
|---|---|---|
| 4.1 | Platform recommends which channels to test | Suggests which channels are highest-priority to test next based on uncertainty in current measurement, spend levels, or expected learning value. |
| 4.2 | Platform recommends control & test groups | Recommends which geos, audiences, or users to assign to control vs. test based on similarity, balance, and statistical power. |
| 4.3 | Platform recommends test design (type, methodology) and predicts test success | Recommends the best test type (geo, A/B, conversion lift) and methodology for the question at hand, and predicts statistical power / probability of detecting a meaningful effect. |
| 4.4 | AI-generated plain-language readouts / executive summaries | Generates plain-language summaries of test results suitable for non-technical stakeholders and executive review, without requiring manual write-up. |
| 4.5 | Conversational AI for discussing experiments | Built-in AI assistant lets users ask natural-language questions about experiments, results, and learnings across the library. |
Why it matters: Criteria 4.1–4.3 reduce the data science overhead required to run a well-designed measurement program. Criteria 4.4–4.5 reduce the communication overhead: translating iROAS and confidence intervals into executive-readable narratives is time-consuming work that AI can handle at scale.
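Criterion 4.3 mentions predicting statistical power. As a rough sketch of what such a prediction involves, the example below computes the power of a two-sided z-test from an expected lift and the standard error the design is expected to achieve. Both inputs, and the function name `detection_power`, are assumptions for illustration; actual tools may rely on simulation instead of a closed-form approximation.

```python
from statistics import NormalDist

def detection_power(expected_lift, standard_error, alpha=0.05):
    """Probability that a two-sided z-test at level alpha detects the lift."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = abs(expected_lift) / standard_error
    # Probability the test statistic falls in either rejection region.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# Hypothetical design: expected lift is 2.8 standard errors from zero.
power = detection_power(expected_lift=2.8, standard_error=1.0)
```

An expected lift of roughly 2.8 standard errors corresponds to about 80% power at the 5% significance level, a common planning target.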
Category 5: Unified Experiment Library
For organizations with mature incrementality programs, experiment management becomes a significant challenge. Large advertisers may run dozens or hundreds of experiments annually across channels, geographies, teams, and test types. A unified library that stores all of them in one searchable place prevents duplicate testing, compounds organizational learning, and enables governance at scale.
| ID | Criterion | What it means |
|---|---|---|
| 5.1 | Central library covers all experiments regardless of type | Single repository stores results from all experiments (geo tests, A/B tests, and conversion lift tests) across channels, teams, and methodologies in one searchable place. |
| 5.2 | Filterable by country, channel, brand, campaign, team, date | Library supports filtering of experiments by key attributes such as country, channel, brand, campaign, team, and date range. |
| 5.3 | Role-based access & governance for the experiment library | Supports role-based access control and governance so different users see appropriate experiments and have appropriate edit rights, which is critical for multi-team, multi-region organizations. |
Why it matters: Without a unified library, experiment results scatter across spreadsheets, slide decks, and platform dashboards. Teams often re-run experiments already completed elsewhere in the organization, and learnings don't accumulate. Criterion 5.1, which covers all three test types rather than just geo tests, is the critical gate. Most tools cover only geo tests in their library, leaving conversion lift and A/B test results unmanaged.
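As a small illustration of criteria 5.1 and 5.2, the sketch below stores experiments of all three types in one record shape and filters them by attribute. The `Experiment` fields and the `filter_experiments` helper are hypothetical and are not drawn from any product's data model.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Experiment:
    name: str
    test_type: str  # "geo", "ab", or "conversion_lift"
    country: str
    channel: str
    team: str
    end_date: date
    iroas: float

def filter_experiments(library, **criteria):
    """Return experiments whose attributes equal every given criterion."""
    return [e for e in library
            if all(getattr(e, key) == value for key, value in criteria.items())]

# Hypothetical entries, for illustration only.
library = [
    Experiment("Paid social holdout", "geo", "US", "paid_social", "growth", date(2025, 3, 31), 2.1),
    Experiment("Catalog mailing", "ab", "US", "catalog", "crm", date(2025, 5, 15), 1.4),
    Experiment("Video lift study", "conversion_lift", "DE", "video", "brand", date(2025, 6, 30), 0.9),
]
us_geo_tests = filter_experiments(library, test_type="geo", country="US")
```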
Category 6: MMM Integration
Incrementality testing and Marketing Mix Modeling are complements, not substitutes. Experiments provide ground-truth point estimates for specific channels and time windows. MMM provides continuous, cross-channel measurement of incremental ROAS over time. The integration between the two is what makes each more valuable. This category assesses how deeply a tool supports that integration.
| ID | Criterion | What it means |
|---|---|---|
| 6.1 | Bayesian MMM that can be calibrated by the user | Includes a Bayesian Marketing Mix Model that users can calibrate with their own priors and assumptions, rather than a black-box model with fixed parameters. |
| 6.2 | UI tool where the user can connect experiment results to MMM | User interface for connecting experiment results into the MMM as calibration inputs, without requiring custom code or spreadsheet workarounds. |
| 6.3 | Experiment-based priors comparable to attribution-based priors | Allows side-by-side comparison of priors derived from experiments vs. priors derived from attribution data, so users can make informed decisions about model calibration inputs. |
Why it matters: The quality of MMM calibration depends on how well experiment results feed into the model. Criterion 6.2 separates tools where experiment-to-MMM integration is a UI-based workflow from those where it's a manual, code-required process. Criterion 6.3 adds analytical value: being able to compare experiment-based priors against attribution-based priors in the same interface gives teams full visibility into the assumptions driving their measurement model.
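To show what "feeding experiment results into the model" can mean numerically, here is a simplified sketch. It converts an iROAS point estimate and confidence interval into a normal prior, then applies a precision-weighted (conjugate normal) update with a second estimate. This is a textbook simplification under a normality assumption. Real Bayesian MMMs are considerably richer, and the names `NormalPrior`, `prior_from_experiment`, and `combine` are hypothetical.

```python
from dataclasses import dataclass
from statistics import NormalDist

@dataclass
class NormalPrior:
    mean: float
    sd: float

def prior_from_experiment(iroas, ci_low, ci_high, confidence=0.95):
    """Turn an experiment's iROAS and symmetric CI into a normal prior."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return NormalPrior(mean=iroas, sd=(ci_high - ci_low) / (2 * z))

def combine(prior, estimate_mean, estimate_sd):
    """Precision-weighted update of a normal prior with a normal estimate."""
    w_prior = 1 / prior.sd ** 2
    w_est = 1 / estimate_sd ** 2
    mean = (w_prior * prior.mean + w_est * estimate_mean) / (w_prior + w_est)
    return NormalPrior(mean=mean, sd=(w_prior + w_est) ** -0.5)

# Hypothetical numbers: a geo test result and a model-side estimate.
experiment_prior = prior_from_experiment(iroas=2.0, ci_low=1.2, ci_high=2.8)
posterior = combine(experiment_prior, estimate_mean=3.0, estimate_sd=0.8)
```

The narrower the experiment's confidence interval, the more strongly the resulting prior pulls the combined estimate toward the experimental result. Building a second prior from attribution data in the same way would allow the side-by-side comparison described in criterion 6.3.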
Category 7: Enterprise-Grade Platform
The final category assesses whether the platform can operate in an enterprise environment. These criteria appear repeatedly in RFPs from large advertisers and often act as hard filters in procurement processes.
| ID | Criterion | What it means |
|---|---|---|
| 7.1 | At least 10 public reference customers from $1B+ revenue brands | Proven track record with large, sophisticated advertisers, not just mid-market or DTC brands. Publicly verifiable, not claimed. |
| 7.2 | SOC 2, ISO 27001, or audited IT security by a third-party cyber security auditor | Independently verified security posture, a baseline requirement for enterprise IT procurement and often a hard gate in vendor approval processes. |
| 7.3 | Data residency: geography option between US and EU | Customers can choose whether their data is stored and processed in US or EU regions, which is critical for GDPR compliance and data sovereignty requirements in Europe. |
| 7.4 | Multi-cloud: option between AWS, GCP, and Azure | Customer can choose between major cloud providers to align with their IT infrastructure and existing enterprise agreements. |
| 7.5 | Single sign-on (SSO) for enterprises | Supports enterprise authentication via SSO, required by most large-company IT security policies. |
Why it matters: Data residency, multi-cloud, SSO, and security certifications are not aspirational features for large advertisers; they are procurement requirements. Mid-sized organizations that don't yet need all of these will eventually grow into them. Evaluating a platform's enterprise readiness today prevents a disruptive migration later. Criterion 7.1 on publicly verifiable $1B+ references is also worth noting: claimed customer lists and publicly verifiable reference customers are very different things, and the distinction matters when you're asking for internal IT approval.
How to Use This Framework in Your Own Evaluation
Not all 31 criteria carry equal weight for every organization. Here's how to prioritize based on your context.
If you run geo tests only, start with Category 1. But plan for Categories 2 and 3 as well, as most mature measurement programs add conversion lift tests and owned-media A/B tests over time. Choosing a platform that only covers geo tests now may force a migration later.
If you run conversion lift tests (Meta, Google, TikTok) as part of your standard practice, Category 3 becomes a hard gate. Many tools in the market score zero on conversion lift analysis. If a vendor dismisses conversion lift tests as "too biased to be useful," ask whether that's a principled methodological position or a capability gap rationalized after the fact.
If you have a mature, multi-team incrementality program, Category 5 (Unified Experiment Library) becomes critical. At scale, an experiment library without role-based access, cross-type coverage, and robust filtering creates organizational risk: teams duplicate tests, learnings fragment, and governance breaks down.
If your incrementality program is primarily designed to calibrate an MMM, Category 6 is the integration you should scrutinize most carefully. A UI-based workflow for connecting experiment results to model priors (criterion 6.2) versus a manual spreadsheet process is a substantial difference in operational overhead at scale.
If you're in enterprise procurement, apply Category 7 as a filter early. SOC 2, data residency, SSO, and multi-cloud are often hard requirements that take months to evaluate. Disqualifying a vendor on these dimensions early saves time on the full evaluation.
For any evaluation: ask every vendor you shortlist to demonstrate their capabilities in a live product session using realistic scenarios from your own measurement program, not a scripted tour. A vendor confident in their product will do this willingly. A vendor who defers to slides and case studies probably cannot.
To see how five incrementality testing tools score against all 31 criteria, see the full vendor comparison: Best Incrementality Testing Tools in 2026: In-depth Vendor Comparison.
Frequently Asked Questions
How do I choose an incrementality testing tool?
Evaluate candidates against seven dimensions: geo test analysis, A/B test analysis for owned media, conversion lift test analysis, experiment recommendations and insights, unified experiment library, MMM integration, and enterprise-grade platform requirements. The 31 specific criteria in this article define what good looks like in each dimension. Prioritize based on which test types you run today and plan to run in the future. A platform that covers only geo tests may require a migration if your program expands to conversion lift tests or owned-media A/B tests.
What is the difference between geo tests, A/B tests, and conversion lift tests?
Geo tests compare treatment and control geographies to estimate incremental lift, and they work for any paid channel where spend can vary by region. A/B tests randomize at the user or audience level, making them best suited for owned media like catalogs and email. Conversion Lift tests are run inside ad platforms (Meta, Google, TikTok) and compare conversion rates between the audience eligible to see the ads and a randomly withheld holdout group. All three produce an iROAS estimate with a confidence interval, but they differ in methodology, use case, and the vendor support they receive.
Why does the unified experiment library matter?
Large organizations run dozens or hundreds of experiments per year across channels, regions, and teams. Without a unified library, results scatter across spreadsheets and platform dashboards. Teams re-run experiments already completed elsewhere, learnings don't compound over time, and governance becomes impossible. A unified library that covers all test types (geo, A/B, and conversion lift) in one searchable, filterable, role-controlled repository is the infrastructure that makes an incrementality program scale.
How do incrementality testing tools integrate with Marketing Mix Modeling?
Experiments provide point estimates of iROAS for specific channels, time windows, and spend levels. These estimates serve as calibration inputs (called priors) that inform a Bayesian MMM's estimates of incremental ROAS across all channels continuously. The best tools provide a UI-based workflow for connecting experiment results directly to model priors, with the ability to compare experiment-derived priors against attribution-derived priors side by side. Tools without this integration require manual, code-based calibration processes that are slow and error-prone.
Should I include conversion lift tests in my evaluation, even if I'm skeptical of their accuracy?
Yes. Even if you discount conversion lift test results or treat them as directional rather than definitive, having them in a unified experiment library alongside geo tests and A/B tests gives you a more complete picture of incrementality across your media mix. The appropriate response to platform bias concerns is methodological awareness and careful interpretation, not excluding a widely run test type from your measurement framework entirely. The best platforms let you decide how much weight to put on conversion lift results when feeding them into MMM calibration.
Further Reading
- Best Incrementality Testing Tools in 2026: In-depth Vendor Comparison
- What is Incrementality Testing? Guide for Marketers
- What is Marketing Mix Modeling?
- Calibrating Marketing Mix Models with Experiments and Attribution Data
- Marginal Incremental ROAS (miROAS) explained
- ROAS, iROAS, miROAS: Choosing the Right KPI for Optimizing Media Spend
- How to Integrate Experiments Into an MMM Platform: A Practical Guide
- 7 Best AI Tools for MMM and Incrementality Testing in 2026
Authors

Lauri Potka is the Chief Operating Officer at Sellforte, with over 15 years of experience in Marketing Mix Modeling, marketing measurement, and media spend optimization. Before joining Sellforte, he worked as a management consultant at the Boston Consulting Group, advising some of the world’s largest advertisers on data-driven marketing optimization. Follow Lauri on LinkedIn, where he is one of the leading voices in MMM and marketing measurement.

Kacper Solarski is a Lead Data Scientist at Sellforte, focused on developing Sellforte's Experiments product. Kacper is one of the most senior data scientists and developers at Sellforte, where he has implemented Marketing Mix Models and incrementality testing solutions for Sellforte customers while also developing Sellforte's platform. Follow Kacper on LinkedIn.
Juha Nuutinen is the Chief Executive Officer and co-founder of Sellforte, with over 15 years of experience in optimizing marketing spend and promotional activity for the largest advertisers in the world. Before co-founding Sellforte, he worked as a management consultant at the Boston Consulting Group, specializing in promotion optimization. Follow Juha on LinkedIn, where he actively shares his views on marketing measurement.
