A backtest is a hypothesis test, and you are the adversary

Strip away the charts and a backtest is a single statistical claim: this rule, applied to this history, produced returns whose mean is distinguishable from zero. That is a hypothesis test. And like every hypothesis test it has a null — the rule has no edge, the returns are noise — and a probability of rejecting that null by accident. In a clean scientific experiment you fix the hypothesis before you see the data. In backtesting you do the opposite: you look at the data, try a rule, look again, adjust the rule, and repeat until the equity curve points up and to the right. Every one of those looks is a silent hypothesis test, and you are running them against yourself.

The headline number you finally report is the Sharpe ratio — annualised excess return divided by annualised volatility:

SR  =  ( μ − rf ) / σ

It looks like a clean measurement of skill. It is nothing of the kind until you know how many rules you tried before you kept this one. A Sharpe of 2.0 found on the first attempt and a Sharpe of 2.0 selected as the best of five hundred attempts are not the same evidence. They are not even close. And the natural-language interface that makes research feel effortless is, mechanically, a machine for running the second kind of search at a speed no human desk ever could.
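
For concreteness, here is a minimal sketch of how that headline number is usually computed from per-period returns. The function name, the 252-day year, and the per-period risk-free rate are assumptions made for illustration, not anything a particular platform is known to use.

```python
import math
import statistics

def annualised_sharpe(period_returns, rf_per_period=0.0, periods_per_year=252):
    """Mean excess return over its sample stdev, scaled to a yearly figure."""
    excess = [r - rf_per_period for r in period_returns]
    mu = statistics.fmean(excess)
    sigma = statistics.stdev(excess)  # sample standard deviation (n - 1)
    # Mean scales with time and volatility with its square root, hence sqrt().
    return (mu / sigma) * math.sqrt(periods_per_year)
```

Nothing in that calculation knows how many rules were tried before this one, which is exactly the problem.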

The math of finding alpha by accident

Here is the uncomfortable part, and it is not a vibe; it is a theorem. Suppose you test N independent strategies, every one of which has a true edge of exactly zero. Over a sample of T years, each one's measured annualised Sharpe is a random draw, roughly normal, centred on zero with a standard error of about 1/√T. Now take the best of the N. The expected maximum of N such draws is, to leading order in N (the bound runs somewhat high when N is small),

E[ max SR ]  ≈  √( 2 ln N ) · ( 1 / √T )

Read what that says. With ten years of daily data and a hundred strategies tried, the best one is expected to post a Sharpe of about √(2 ln 100) / √10 ≈ 0.96 — a number most retail platforms would render in confident green — purely from luck, with zero real edge anywhere in the set. Push the search to ten thousand strategies and the expected lucky best climbs past 1.3. The grim feature of that formula is the √(ln N): the more you search, the higher the bar that pure noise will clear, and it keeps climbing without limit. An AI that cheerfully drafts a hundred variations of your idea before lunch is not helping you find an edge. It is helping you manufacture a false one, and then handing you the prettiest sample of the batch.
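
The arithmetic above is easy to check. The sketch below evaluates the formula and, as a cross-check, simulates the best of N zero-edge strategies directly. The function names, the seed, and the number of simulations are arbitrary choices, and the normal-draw model is the same simplification the text makes.

```python
import math
import random
import statistics

def expected_max_sharpe_approx(n_trials, years):
    """Leading-order formula from the text: sqrt(2 ln N) / sqrt(T)."""
    return math.sqrt(2.0 * math.log(n_trials)) / math.sqrt(years)

def simulated_max_sharpe(n_trials, years, n_sims=2000, seed=0):
    """Average best-of-N Sharpe when every strategy has zero true edge."""
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(years)  # standard error of an annualised Sharpe
    best = [max(rng.gauss(0.0, se) for _ in range(n_trials))
            for _ in range(n_sims)]
    return statistics.fmean(best)

for n in (100, 10_000):
    print(n, round(expected_max_sharpe_approx(n, 10), 2))
# 100 0.96
# 10000 1.36

print(round(simulated_max_sharpe(100, 10), 2))
```

The simulated figure for a hundred trials lands somewhat below the formula, at roughly 0.8, because √(2 ln N) is an upper-side approximation that only becomes tight as N grows. The conclusion survives the correction: a Sharpe near 0.8 from pure noise is still a number people trade on.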

The haircut nobody applies

This problem has a name in the literature, the multiple-testing problem, and a defence. Bailey and López de Prado's Deflated Sharpe Ratio asks the only question that matters: given that I tried N things, how likely is a Sharpe this high under the null that none of them work? It deflates the observed Sharpe against the expected maximum of the search, and it bends the haircut further for the two things that make financial returns lie, negative skew and fat tails (excess kurtosis), which are the signatures of strategies that earn pennies for years and then give it all back in a week:

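DSR  =  Z [ ( SR − E[ max SR ] ) · √(T−1) / √( 1 − γ3·SR + (γ4−1)/4·SR² ) ]

Here Z is the standard normal CDF, γ3 and γ4 are the skewness and kurtosis of the returns, and SR and T are per observation: T counts return observations rather than years, and SR is not annualised.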

You do not need to memorise the form. You need exactly one habit it encodes: a Sharpe ratio reported without the number of trials behind it is not a result, it is a screenshot. The same raw SR can deflate to a DSR that says "almost certainly real" or one that says "indistinguishable from noise," and the only thing that moved was honesty about how hard you looked. Virtually no consumer backtesting tool tracks N, because tracking it means routinely telling the user their brilliant idea is a coin that landed heads. That is a bad demo and a good product. I would rather show you a lower, truthful Sharpe than a flattering one, and the deflated number is how you keep that promise mechanically rather than as a slogan.
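
To see the same raw Sharpe deflate to two different verdicts, here is a rough sketch of the formula above. It rests on a loud assumption: the benchmark E[max SR] comes from the simple √(2 ln N) rule earlier in this piece, converted to per-observation units, whereas the published method estimates it from the variance of Sharpe ratios across the trials. The function names and the ten-years-of-daily-data example are illustrative.

```python
import math
from statistics import NormalDist

def expected_max_sr_per_obs(n_trials, n_obs):
    """Assumed benchmark: sqrt(2 ln N) / sqrt(number of observations)."""
    return math.sqrt(2.0 * math.log(n_trials)) / math.sqrt(n_obs)

def deflated_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """DSR statistic from the formula in the text.

    sr and sr_benchmark are per observation (not annualised),
    n_obs is the number of returns, kurt is raw kurtosis (3.0 if normal).
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)

n_obs = 2520                         # ten years of daily returns
sr_daily = 1.0 / math.sqrt(252)      # an annualised Sharpe of 1.0
first_try = deflated_sharpe(sr_daily, expected_max_sr_per_obs(1, n_obs), n_obs)
best_of_500 = deflated_sharpe(sr_daily, expected_max_sr_per_obs(500, n_obs), n_obs)
print(round(first_try, 3), round(best_of_500, 3))
```

Under these assumptions the first-attempt case comes out close to 1 and the best-of-500 case falls below one half. The returns are identical in both; only N moved.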

Before the statistics, the data has to stop lying

All of the above assumes the returns going in are real. Most of the time they aren't, and in ways that flatter you by construction. Two biases do almost all the damage.

Survivorship. Pull "the S&P 500" as it exists today and backtest a strategy on it across twenty years, and you have quietly restricted the universe to the companies that made it. Every Lehman, every Enron, every delisted shell has been swept out of the history before your rule ever saw it. You have run a strategy on a world where bankruptcy was abolished retroactively. The bias is large, it is always in the direction of better-looking returns, and it is invisible unless the data is explicitly survivorship-free — built from the index as it actually stood on each date, dead names and all.
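
In data terms, survivorship-free means the universe is a function of the date. The sketch below assumes a membership table with add and remove dates. The tickers and dates are invented, and the table layout is a hypothetical one chosen for illustration, not a description of any vendor's schema.

```python
from datetime import date

# Hypothetical records: (ticker, date_added, date_removed or None if current).
MEMBERSHIP = [
    ("AAA", date(2000, 1, 3), None),
    ("DEAD", date(2001, 6, 1), date(2008, 9, 15)),  # invented delisted name
    ("NEW", date(2015, 3, 2), None),
]

def universe_as_of(as_of, membership=MEMBERSHIP):
    """Tickers that were members on `as_of`, including names later removed."""
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= as_of and (removed is None or as_of < removed)
    )

print(universe_as_of(date(2005, 1, 3)))   # ['AAA', 'DEAD']
print(universe_as_of(date(2020, 1, 2)))   # ['AAA', 'NEW']
```

A backtest that asks for today's list and applies it to 2005 never sees DEAD at all, which is the bias in miniature.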

Look-ahead. The subtler killer. A company reports Q1 earnings in May, but if your database stamps that figure to March 31 — the quarter it describes rather than the day it became public — your backtest "knew" the number six weeks before the market did. Trade on it and the equity curve is spectacular and entirely fictional. The only defence is point-in-time data: every figure carried as it was known on each historical date, with restatements and revisions preserved rather than overwritten. The uncomfortable truth is that most freely scraped financial data is neither survivorship-free nor point-in-time, which means most backtests built on it are measuring a market that could not have been traded. Get this layer wrong and the Deflated Sharpe is rigour applied to fiction.
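
The point-in-time rule reduces to one filter: never read a row whose publication date is later than the date you are simulating. The sketch below assumes each fundamentals row carries both the period it describes and the day it became public. The figures are invented and the row layout is hypothetical.

```python
from datetime import date

# Hypothetical rows: (ticker, period_end, published_on, eps). Values invented.
FUNDAMENTALS = [
    ("AAA", date(2023, 12, 31), date(2024, 2, 10), 1.10),
    ("AAA", date(2024, 3, 31), date(2024, 5, 8), 1.25),
    ("AAA", date(2024, 3, 31), date(2024, 8, 1), 1.18),  # later restatement
]

def latest_known_eps(ticker, as_of, rows=FUNDAMENTALS):
    """Most recent figure that had been published on or before `as_of`."""
    known = [r for r in rows if r[0] == ticker and r[2] <= as_of]
    if not known:
        return None
    # Prefer the latest fiscal period, then the latest publication of it.
    return max(known, key=lambda r: (r[1], r[2]))[3]

print(latest_known_eps("AAA", date(2024, 4, 15)))  # 1.1  (Q1 not yet public)
print(latest_known_eps("AAA", date(2024, 5, 8)))   # 1.25
print(latest_known_eps("AAA", date(2024, 9, 1)))   # 1.18 (restated value)
```

Filtering on the period-end date instead of the publication date would return 1.25 on 15 April, three weeks before anyone could have known it.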

Out-of-sample is the only opinion that counts

So how do you actually test a rule when you, the researcher, are the source of the bias? You hold data back. You fit on one window and you measure on a window the rule has never seen, and you do it the way the future will actually arrive — forward in time, never letting later data leak into an earlier decision. That is walk-forward analysis: fit on years one through five, test on year six; roll forward, fit two through six, test seven; and so on. The out-of-sample track record you stitch together is the only performance estimate that wasn't contaminated by the search that produced the rule.
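
The rolling scheme just described can be sketched as a small generator. The five-year fit window and one-year test window mirror the example in the text, while the function and parameter names are my own.

```python
def walk_forward_splits(periods, train_len=5, test_len=1):
    """Yield (train, test) windows that roll forward through time."""
    start = 0
    while start + train_len + test_len <= len(periods):
        train = periods[start:start + train_len]
        test = periods[start + train_len:start + train_len + test_len]
        yield train, test
        start += test_len  # advance by one test window, never backwards

for train, test in walk_forward_splits(list(range(1, 9))):
    print(train, "->", test)
# [1, 2, 3, 4, 5] -> [6]
# [2, 3, 4, 5, 6] -> [7]
# [3, 4, 5, 6, 7] -> [8]
```

Each test window sits strictly after its training window, so nothing from the future can reach an earlier fit. The stitched test windows, here years six through eight, are the out-of-sample record.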

Bailey, Borwein, López de Prado and Zhu formalise the failure mode it guards against as the probability of backtest overfitting: across all your trials, how often does the configuration that looked best in-sample underperform the median out-of-sample?

PBO  =  P ( rankOOS below median | rankIS best )

When that probability is high, your selection process has no predictive content whatsoever — you are picking next year's strategy by last year's luck, which is worse than picking at random because it costs you turnover to do it. A PBO you never compute is a PBO you implicitly assume is zero, and it almost never is. The honest output of a research session is not the in-sample equity curve. It is the out-of-sample one, with the in-sample number shown next to it precisely so you can see the gap between the story and the evidence.
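
A simplified way to put a number on that probability is to count. The sketch below takes a set of in-sample and out-of-sample scores per configuration and reports how often the in-sample winner lands below the out-of-sample median. It is a counting approximation only: the published procedure builds its splits combinatorially and works with the logit of the relative rank, which this does not attempt. The names and the input layout are assumptions.

```python
import statistics

def pbo_estimate(splits):
    """Share of splits where the in-sample best is below the OOS median.

    `splits` is a list of (is_scores, oos_scores) pairs, with one score
    per configuration in the same order in both lists.
    """
    below = 0
    for is_scores, oos_scores in splits:
        best = max(range(len(is_scores)), key=is_scores.__getitem__)
        if oos_scores[best] < statistics.median(oos_scores):
            below += 1
    return below / len(splits)

example = [
    ([1, 2, 3], [3, 2, 1]),   # in-sample winner is worst out-of-sample
    ([1, 2, 3], [1, 2, 3]),   # in-sample winner holds up
    ([3, 1, 2], [0, 5, 9]),   # in-sample winner is worst out-of-sample
    ([0, 1, 0], [1, 1, 1]),   # in-sample winner ties the median
]
print(pbo_estimate(example))  # 0.5
```

A value around one half means that picking the in-sample winner tells you about as much as drawing a configuration from a hat.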

And then the market charges you to trade it

Suppose a rule survives all of that — deflated, point-in-time, walked forward, still breathing. It is still a gross number. I have written elsewhere about how the entire ranking of equity factors reshuffles once you divide through by what it costs to trade them, and the same equation governs every backtest ever run:

αnet  =  αgross  −  T · c

Turnover T times all-in cost c — spread, fees, impact, and the taxes that fall on trading rather than profit — is subtracted from every strategy, and it falls hardest on exactly the high-turnover rules that tend to post the gaudiest gross Sharpes. A backtest that fills at the mid, ignores slippage, and assumes infinite liquidity is not optimistic. It is measuring a different market — one with no frictions, where you are the only participant and your orders move nothing. The honest tearsheet models the fill, not the fantasy.
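
A two-line example shows how the subtraction reorders a ranking. The gross alphas, turnovers, and the 20 basis point all-in cost are invented round numbers, and a flat cost per unit of turnover is itself a simplification, since impact grows with trade size.

```python
def net_alpha(gross_alpha, turnover, cost_per_unit):
    """alpha_net = alpha_gross - turnover * cost, all on a per-year basis."""
    return gross_alpha - turnover * cost_per_unit

slow = net_alpha(0.04, turnover=1, cost_per_unit=0.002)    # about 0.038
fast = net_alpha(0.10, turnover=50, cost_per_unit=0.002)   # about 0.0
print(round(slow, 4), round(fast, 4))
```

On gross numbers the fast rule wins by a wide margin. Net of costs it earns nothing and the slow rule keeps nearly all of its edge.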

What Polaris actually is

Stack those four cuts together — multiple-testing deflation, clean point-in-time data, walk-forward out-of-sample, net-of-cost fills — and you have the difference between a backtest that informs a decision and one that funds someone else's. That stack is what Polaris is. The plain-English interface that turns a question into SQL across twenty years of US equities, crypto, FX, indices, and macro is the part everyone will notice first, and it is genuinely the easy half. It removes the access wall. It is not the product.

That access layer is, admittedly, broad — and useful long before you ever run a backtest. You can ask the market a question and get a chart, the rows, and the exact query back: compare two assets over any window, pull the ten most volatile names in the universe right now, chart a token against a hundred-plus FRED macro series. The same data is directly queryable — SEC fundamentals, institutional 13F filings, analyst consensus, news headlines and sentiment — so you can read what the big holders bought last quarter or screen on a fundamental the way you'd screen on price. Live dashboards and per-asset pages sit on top, and conditional alerts let you arm a rule — tell me if this crosses that — with the compute cost shown before you commit to it. All of that is the surface. It is the part a Bloomberg seat used to gate, now reachable in plain English. It is necessary, and it is not where the edge or the danger lives.

The product is what happens when you hand a promising query to Vega, the research analyst underneath: it runs the multi-step study with cited sources, stays aware of a portfolio you @-mention, and when you ask it to backtest the idea it backtests it like someone who has been burned before. Walk-forward by default, not as an advanced toggle. Tearsheets that report Sharpe, Sortino, and Calmar next to their deflated and probabilistic cousins, so the haircut is in your face rather than buried. Data that is survivorship-free and point-in-time at the source, so look-ahead never enters the room. Paper trading that is always and only simulated, so the line between research and a track record stays bright. The defaults are built so the path of least resistance is the honest one — because the whole lesson of the math above is that the moment honesty becomes optional, the search process quietly opts out.

I have spent a decade pricing the gap between a model's fill and a real one, and the single most expensive mistake I have watched people make — funds and retail alike — is mistaking an in-sample curve for a discovery. AI does not fix that. Left alone it makes it worse, because it drops the cost of the search that creates the illusion to nearly zero while leaving the illusion fully intact. The only thing that fixes it is a tool that does the unglamorous accounting on every result, automatically, and is willing to come back with: you tried four hundred things, this is the best of them, and once we deflate for that and trade it out-of-sample net of costs, there is nothing here. That sentence is the most valuable output a research engine can produce, and almost nothing on the market will say it to you. Polaris will. That is the point.

It goes live this week at tradepolaris.com. You can ask the market a question without signing up and watch the query and the data come back in plain sight. If you run something systematic, or you are about to and want to know whether the thing you found is real before you trade it, you are who I'm building it for.

— Rohan Rathod, London, June 2026

I trade for a living and have spent a decade in market microstructure and quantitative research. This is a structural and statistical argument, not investment advice; backtested and simulated results are hypothetical and not indicative of future performance. I'm building Polaris — a natural-language research and walk-forward backtesting engine for systematic equity strategies — alongside Solistic Finance. Reach me at r@solistic.finance or @ro_lend.