Great things are done by a series of small things brought together. — Vincent Van Gogh.

The last few changes (adding null move pruning, hashing, and a quiescence search) have had major impacts on playing strength, which was expected. But inevitably, not all changes can be big winners, or even big losers. Some will have a very minor effect, for good or ill. The change I deal with in this post is one of those. The idea being tested: should the iterative deepening loop be stopped early if, after completing an iteration, more than half the allotted time has been used? The rationale is that, due to the exponential growth of search trees, if you have less than half your time left when starting a new iteration, you're unlikely to complete it. Perhaps it would be better to save the time for future use. (This would not apply to games with a fixed time-per-move.)

The idea is so simple I hesitate to even show the (pseudo) code below:

```
boolean stopSearching = false;
do {
    ++depth;
    // DO SEARCH
    score = search.search(board, depth ...);
    if (search.isStopped()) {
        break;
    }
    long elapsed = System.currentTimeMillis() - startTime;
    // .... OTHER STOP CONDITIONS ....
    // if we've used more than half our time, don't start a new iteration
    if (earlyExitOk && !skipTimeChecks && (elapsed > maxTimeMs / 2)) {
        stopSearching = true;
    }
} while (!stopSearching);
```
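To see why half the budget is the natural threshold: with an effective branching factor around 2, each iteration costs roughly as much as all the previous iterations combined, so a new iteration started with less than half the time remaining will almost certainly be cut off before it finishes. A toy simulation (the class, method, and numbers are mine for illustration, not from chess4j):

```java
// Toy simulation of the half-time heuristic. Assumes each iteration takes
// about twice as long as the previous one (effective branching factor ~2).
public final class HalfTimeDemo {

    // Returns {completedDepth, elapsedMs} for a simulated search.
    static long[] run(long maxTimeMs) {
        long elapsed = 0;
        long iterationCost = 1;   // ms to search depth 1
        long depth = 0;
        while (true) {
            depth++;
            elapsed += iterationCost;  // "search" this depth
            iterationCost *= 2;        // the next iteration costs ~2x
            // Half-time rule: the next iteration would roughly double the
            // elapsed time, so only start it if half the budget remains.
            if (elapsed > maxTimeMs / 2) break;
        }
        return new long[] { depth, elapsed };
    }

    public static void main(String[] args) {
        long[] r = run(1000);
        System.out.println("completed depth " + r[0] + " in " + r[1]
                + "ms of a 1000ms budget");
    }
}
```

With a 1,000ms budget the simulated search stops after depth 9 at 511ms. Starting depth 10 would have pushed the total to 1,023ms, over the budget, with the extra time spent on an iteration that never finishes.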

Up until this point I've been running 2,000-game self-play matches to gauge whether a change is good or not. 2,000 games is enough to get an Elo difference with error bars of around 12-13 points, which has been "good enough." So my starting point here was the same: two matches, running concurrently on my two-core testing machine, each match being 2,000 games.

| Wins | Losses | Draws | Pct | Elo | Error |
|------|--------|-------|-------|-------|-------|
| 634 | 589 | 777 | 51.1% | +7.82 | 11.90 |
| 634 | 614 | 752 | 50.5% | +3.47 | 12.02 |
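For reference, the Elo and error columns in these tables can be reproduced from the raw win/loss/draw counts using the standard logistic Elo model. This is my own sketch (the class and method names are mine), using a normal approximation for the 95% error bar:

```java
// Sketch: estimating an Elo difference and a ~95% error bar from match
// results, under the standard logistic model. Not code from chess4j.
public final class EloEstimator {

    // Convert a score fraction p = (wins + draws/2) / games into Elo.
    static double eloFromScore(double p) {
        return -400.0 * Math.log10(1.0 / p - 1.0);
    }

    // Half-width of a ~95% confidence interval on the Elo estimate.
    static double eloError(int wins, int losses, int draws) {
        int n = wins + losses + draws;
        double p = (wins + 0.5 * draws) / n;
        // Per-game variance of the score (outcomes are 1, 1/2, 0).
        double var = (wins + 0.25 * draws) / n - p * p;
        double margin = 1.96 * Math.sqrt(var / n);   // 95% CI on the score
        return (eloFromScore(p + margin) - eloFromScore(p - margin)) / 2.0;
    }

    public static void main(String[] args) {
        // First 2,000-game match above: 634 wins, 589 losses, 777 draws.
        double p = (634 + 0.5 * 777) / 2000.0;
        System.out.printf("Elo: %+.2f  Error: %.2f%n",
                eloFromScore(p), eloError(634, 589, 777));
    }
}
```

For the first match above (634-589-777) this gives roughly +7.8 Elo with an error bar of about 11.9, matching the table. Note that the error bar shrinks only as 1/sqrt(games): quadrupling the games merely halves it.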

Judging by the results of this test, it appears likely the change is a good one, albeit a minor one. The problem is the error bars. If the error bars are to be believed, it's possible the change is actually neutral, or even a minor loss. The only way to tell would be to play more games, a lot more. People who have done the math say that to measure a 5 Elo change, you need something like 10,000 games; to measure a 2 Elo change requires 60,000 games! I decided I would run two 20,000-game matches concurrently, for a combined 40,000 games. That should be enough to make a decision. The results of that experiment are below.

| Wins | Losses | Draws | Pct | Elo | Error |
|------|--------|-------|-------|--------|------|
| 6492 | 5782 | 7726 | 51.8% | +12.34 | 3.77 |
| 6334 | 6151 | 7515 | 50.5% | +3.18 | 3.80 |

These results were very confusing. If you take the Elo for each match and use the error bars to create an interval, the two intervals don't even overlap! I have enough of an appreciation of the statistics to know this is not impossible, but it seems extremely suspect. I ended up creating a post about it on a computer chess forum: http://talkchess.com/forum3/viewtopic.php?f=7&t=74685 . One poster, H.G. Muller, gave a possible explanation. Since I am not restarting the engines between matches (chess4j is a Java engine, and the JVM has a "warm up" cost that I'm avoiding by not restarting it), it's possible that memory (cache) is being mapped in a way that is harmful to one process, and not restarting causes that condition to persist. I don't know if that's the case or not, but it does make some sense, and it raises the question of whether it really is OK to run two matches concurrently (or, more generally, N matches on an N-core machine, especially if you don't restart between matches). I decided to run another test, but just one match this time, not two.

| Wins | Losses | Draws | Pct | Elo | Error |
|------|--------|-------|-------|-------|------|
| 6179 | 6137 | 7684 | 50.1% | +0.73 | 3.77 |

My conclusion is that running two matches concurrently is not as stable as I originally thought. Restarting between matches could help, but again, for a JVM engine that's not very appealing either. I could switch to testing first with Prophet, since it's written in C and could be restarted between matches, but one of the points of having a Java engine was quick prototyping. It's also possible that a machine with more cores would resolve the issue, if at least one core is left free. I'll revisit this in the future. For now, it's one match at a time.
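Since the error bar on an Elo estimate shrinks like 1/sqrt(games), the game counts quoted earlier (roughly 10,000 games to measure 5 Elo, 60,000 for 2 Elo) can be sanity-checked with a quick back-of-the-envelope calculation, taking the ~12 Elo error bar observed at 2,000 games as the baseline (the exact constant depends on the draw rate; this class is mine, for illustration):

```java
// Rough sample-size scaling: the Elo error bar shrinks like 1/sqrt(games),
// so halving the error bar costs 4x the games. The baseline of ~12 Elo
// error at 2,000 games comes from the self-play matches above.
public final class GamesNeeded {

    // Games needed to shrink the error bar to targetError Elo.
    static long gamesFor(double targetError) {
        double baseGames = 2000.0;
        double baseError = 12.0;
        double ratio = baseError / targetError;
        return Math.round(baseGames * ratio * ratio);
    }

    public static void main(String[] args) {
        System.out.println("5 Elo resolution: ~" + gamesFor(5) + " games");
        System.out.println("2 Elo resolution: ~" + gamesFor(2) + " games");
    }
}
```

This gives about 11,500 games for 5 Elo resolution and 72,000 for 2 Elo, the same ballpark as the figures quoted earlier.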

As far as the effect of the change on strength, it appears to be worth a very small but positive amount of Elo. I think it's a keeper, but just barely. Still, running tens of thousands of games (especially when I'm only able to use a single core!) is not tenable. This would be a good time to look into some statistical methods. Many engine authors use the Sequential Probability Ratio Test (SPRT). If all you want to know is "is this engine better than that engine?", and not necessarily the Elo difference between the two, SPRT can give a faster result.

The starting point for SPRT is to form a hypothesis. If player PA is my new version, and player PB is the current best known version, my hypothesis is *"PA is stronger than PB."* That is almost complete; that statement is really equivalent to *"PA is stronger than PB by at least 1 Elo."* That's OK, but it would take quite a long time to prove or disprove. In the end, I decided that for expediency I would accept *"PA is stronger than PB by at least 5 Elo points."* That would take significantly less time to prove.

You also need a “null hypothesis” – the hypothesis that there is no significant difference between players. My null hypothesis is *“PA is NOT stronger than PB by at least 1 Elo point.”*

Finally, since this is a statistical test, you need to determine your tolerance for error. There are two types of errors to consider. One is the likelihood of accepting a change that is bad (a type I error, since it means wrongly rejecting the null hypothesis). The other is the likelihood of rejecting a change that is good (a type II error). I decided to accept a 5% likelihood for each.

Putting all that together:

H1: *PA is stronger than PB by at least 5 Elo.*

H0: *PA is not stronger than PB by at least 1 Elo.*

α (alpha): 0.05

β (beta): 0.05

Note that SPRT does not necessarily mean the match will terminate early. You can read about the math behind it at https://en.wikipedia.org/wiki/Sequential_probability_ratio_test , but the idea is that if either H1 or H0 is "accepted," the match terminates; if not, it continues until the maximum number of games is reached. In my example, if the change being tested were actually worth +3 Elo, the match would not terminate early. If it were worth, say, +10 Elo, it would terminate early, and likewise if it were worth -3 Elo.
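To make the termination rule concrete, here is a sketch of the log-likelihood ratio (LLR) computation behind a trinomial SPRT approximation. This is my own simplification, not necessarily what any particular testing tool implements: the draw ratio is held at its observed value, and the win/loss probabilities under each hypothesis come from the logistic Elo model. The match stops when the LLR falls below ln(β/(1-α)) (accept H0) or rises above ln((1-β)/α) (accept H1):

```java
// Sketch of a trinomial SPRT log-likelihood ratio for engine matches.
// My own simplification: the draw ratio is fixed at its observed value,
// and win/loss probabilities come from the logistic Elo model.
public final class SprtDemo {

    // Expected score for an Elo advantage e (logistic model).
    static double expectedScore(double elo) {
        return 1.0 / (1.0 + Math.pow(10.0, -elo / 400.0));
    }

    // Log-likelihood ratio of H1 (elo1) vs H0 (elo0) given W/L/D counts.
    static double llr(int wins, int losses, int draws, double elo0, double elo1) {
        double n = wins + losses + draws;
        double pDraw = draws / n;            // draw ratio held fixed
        double w0 = expectedScore(elo0) - pDraw / 2.0;
        double l0 = 1.0 - expectedScore(elo0) - pDraw / 2.0;
        double w1 = expectedScore(elo1) - pDraw / 2.0;
        double l1 = 1.0 - expectedScore(elo1) - pDraw / 2.0;
        // The draw term cancels, since pDraw is the same under both hypotheses.
        return wins * Math.log(w1 / w0) + losses * Math.log(l1 / l0);
    }

    public static void main(String[] args) {
        double lower = Math.log(0.05 / (1 - 0.05)); // accept H0 below this
        double upper = Math.log((1 - 0.05) / 0.05); // accept H1 above this
        double ratio = llr(6492, 5782, 7726, 1.0, 5.0);
        System.out.printf("LLR=%.2f bounds=[%.2f, %.2f]%n", ratio, lower, upper);
    }
}
```

Plugging in the first 20,000-game match above (6492-5782-7726) gives an LLR well above the upper bound of about 2.94, so with elo0=1 and elo1=5 that match would have accepted H1 and stopped early. The second match (6334-6151-7515) lands between the bounds and would have run to completion, consistent with the point above that a ~+3 Elo change would not terminate early.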

As it turns out, the program I use to drive the testing process, Cutechess, can terminate a match using SPRT. I had to experiment with it a little bit; my first attempt ended in an 18,000-game match where H0 was accepted! Not good. But I had some parameters flipped, so those results are invalid. I decided it would be more productive to experiment with a very large and obvious change, like disabling null move. The correct flags are:

```
-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05
```

Given the previous results, I am not going to repeat the experiment yet again to see if it would terminate early; it very likely wouldn't, and I'd end up losing a week of processor time. I will, however, be using SPRT going forward.

Since the change was a good one in chess4j, I've ported it to Prophet4 as well, and for confirmation ran a 2,000-game P4 vs. P3 match.

| Wins | Losses | Draws | Pct | Elo | Error |
|------|--------|-------|-------|--------|------|
| 233 | 1153 | 614 | 27.0% | -172.8 | 13.4 |

That is marginally worse than the last run by 1.0%, but well within the margin of error.

Edit: I ran the P4 vs P3 match again out of curiosity:

| Wins | Losses | Draws | Pct | Elo | Error |
|------|--------|-------|-------|--------|------|
| 250 | 1139 | 611 | 27.8% | -166.0 | 13.4 |

Once again, the error bars are large enough in a 2000 game match that there is a lot of variability between runs, so I’m not making decisions based on these results, but I do like to see how P4 is measuring up against P3 as changes are made.

Next up: aspiration windows.