A Chess Computer Test Set

Abstract

How can we live without a set of test positions as a tool to measure a program's strength? I missed such a test in 1984 and started to make one. The test is not perfect, but with my test and my 5 years of experience with it, described in this paper, as a base, I hope a 'perfect' test will be worked out in the future.

The purpose of a test set

1) Generally accepted test positions make it easier to compare programs and their performance, to compare mainframes with commercial computers, to discuss hash tables and singular extensions etc.
2) If the test can calculate a reliable rating for a computer, you don't have to guess or play hundreds of games. Computers outside the Swedish rating list can also have their rating determined.
3) The test can be used as a diagnostic tool. When you have tested a program, you really know the program's profile.
4) Programmers may use the test to verify whether changes made to a program improve it or not. For example, adding knowledge will make it play better positionally, but will also slow down its tactical speed. What is the overall benefit of this change? The most pleasant use is to let the program run the test automatically overnight, which Larry Kaufman has done to ensure that no bugs have appeared in REX. An advanced use is to calibrate the parameters of the chess program directly to the optimal values determined by the test. Similarly, Deep Thought uses 900 GM games to adjust 120 parameters.

Description of the test

The test is included in appendix A.

In 1985 I started making a test set for chess computers. The test consists of positional play, endgame, traps, time management, use of the opponent's time, opening library (size and transpositions) and tactics. Altogether there are 145 moves to test from 86 positions. A test is solved if the computer plays a certain move or, alternatively, avoids a certain move. Any other move is considered wrong.
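The pass/fail rule just described can be sketched in a few lines of Python. This is only an illustration: the class `MoveTest`, its field names and the example moves are assumptions made here and are not taken from the actual test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveTest:
    """One move to test. All names here are illustrative assumptions."""
    number: int
    move: str         # the move the test refers to
    must_avoid: bool  # False: the move must be played; True: it must be avoided


def is_solved(test: MoveTest, played: str) -> bool:
    """A test is solved if the required move is played,
    or if the forbidden move is avoided."""
    if test.must_avoid:
        return played != test.move
    return played == test.move


# Hypothetical examples; these are not positions from the real test.
play_it = MoveTest(number=1, move="Nf5", must_avoid=False)
avoid_it = MoveTest(number=2, move="Qxb2", must_avoid=True)

print(is_solved(play_it, "Nf5"))    # True
print(is_solved(play_it, "Rd1"))    # False
print(is_solved(avoid_it, "Rd1"))   # True
print(is_solved(avoid_it, "Qxb2"))  # False
```

Keeping 'play this move' and 'avoid this move' in one record means both kinds of test can be scored by the same routine.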
The infinite level is used and the calculation is stopped after 3 minutes. Tournament level is not used, as the time used varies too much. In the tactical tests the time is measured until the computer has solved the test. Just as a precaution, both white and black moves are tested.

As tactics are very important, these positions are constructed in such a way that you can be sure the computer has 'understood' the position when it chooses the right move. This is typically done by offering the program a free pawn, which it wants to capture until it finds the right combination (Diagram 67). Another important feature of the tactical part of my test is to see how good a program is at avoiding bad moves that allow the opponent a combination (Diagram 69). Richard Lang's programs do well here. Finally, combinations without sacrifices are included (Diagram 67). They are rather rare in tests and chess magazines, but are common in actual play and therefore important. Selective programs are good at these tests.

Each correctly solved test gives some Elo points, and the test tries to estimate a computer's rating once it has completed the test. The Swedish rating list is used as the truth, although it is not exactly the truth: with 61 computers on the list, the probability that all the computers are within their 95% limits, which are typically 25 points to each side, is less than 5% (0.95^61 is about 0.044, assuming independent errors). The points given to each test position have been adjusted many times to make the results fit this list as well as possible, while keeping the test as close as possible to an original set of points, which I determined myself and consider realistic. In this way the test is hoped to get better and better. The adjusting follows many rules, of which the most important is that many small differences are preferred to a few big differences.
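As a small sketch of how the points turn into a rating estimate, the category scores of a computer can simply be added and the sum compared with the Swedish list. The numbers below are the Mach III row and the 'Max. points' row from the prototype 4 table in this paper; the dictionary keys and the function name are assumptions chosen for this example.

```python
# Score categories, in the order of the columns of the prototype 4 table.
CATEGORIES = (
    "opening", "tactics", "positional", "endgame",
    "time", "traps_and_opp_time",
)


def estimated_rating(points: dict) -> int:
    """Estimated rating = plain sum of the points in all categories."""
    return sum(points[c] for c in CATEGORIES)


# Mach III row of the prototype 4 table.
mach_iii = {
    "opening": 40, "tactics": 1371, "positional": 374,
    "endgame": 160, "time": 24, "traps_and_opp_time": 42,
}
# 'Max. points' row of the same table.
max_points = {
    "opening": 44, "tactics": 1409, "positional": 585,
    "endgame": 172, "time": 32, "traps_and_opp_time": 46,
}

PLY_RATING_MACH_III = 2005  # Swedish list, Oct 90

rating = estimated_rating(mach_iii)
print(rating)                        # 2011
print(rating - PLY_RATING_MACH_III)  # 6
print(estimated_rating(max_points))  # 2288
```

The difference printed in the second line is the figure shown in the 'Difference' column of that table.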
One can argue that it is wrong simply to add the points for all the positions: imagine an extreme computer which is perfect in the endgame, but hopeless in the opening and the middlegame. It will never reach the endgame! Torben Osted has proposed using the results this way: each computer starts with 2000 points and has this value adjusted after each position, similar to the way human ratings are adjusted after each tournament. But the simple adding of points has been satisfactory until now.

The history of the test

The test was described in PLY until autumn 1988, and since then in the Austrian MODUL, which published the whole test in its 3rd issue of 1988.

The test started with 6 computers, and more and more computers were tested. After each computer's entry the points were adjusted.

The tactical tests were changed in October 1986. Some tactical tests were dropped, and that is why there are 'holes' in the numbering. 12 new tactical tests were added to fill the gap. The problem was that I had some mating problems where the time was measured in seconds until the computer played the move. It would only play the move if it had seen the mate. But some computers did not play the move immediately even though they knew it would lead to mate. This is easy to prove by forcing the computer to move, as it will then move and announce the mate. These computers would complete the current iteration before moving, and this difference in the construction of programs made the results impossible to compare with each other. I miss these mating problems today, as with their detailed timing they could distinguish between the performance of the very best commercial computers of today, which solve almost every tactical test within a minute. The same applies of course to mainframes: already in August 1988 Deep Thought solved every test within 15 seconds!

Someone may wonder why the tactical tests are only measured every minute?
The reason is that computers without the ability to show their best move instantly should also be testable. They are rare today, but were not 5 years ago. Larry Kaufman and others have, however, proposed a finer measurement, taking the time every 10th second during the first minute or two. This feature may be added in the future.

In August 1987 I wanted to keep the points fixed for some time and called the current version of the test prototype 1.

In February 1988 prototype 2 was made. I wanted two extra rules added to the adjusting:

1) No single test may be worth more than 50 points. That is still too much, but even more tactical tests must be added to lower this limit.
2) I realised that selective programs were harder to estimate, depending on how 'luckily' their selection rules fitted my still small number of tactical test positions. So selective programs do not affect the point adjusting as much as brute-force programs, and the ratings of future selective programs are expected to be harder to estimate.

Prototypes 3 and 4 were just adjustments of points, the latter to get nice figures for this article.

This table shows how well or badly the test has predicted a new computer's rating. The computers' 'first rating calculation' has been adjusted for the changes of level on the Swedish rating list to make a comparison possible.

***********************************************************
*                      * First    * PLY Elo  *            *
*                      * rating   * figures  *            *
* Computer             * calcul.  * Oct 90   * Difference *
***********************************************************
*1-6. Mephisto Amst.   * The first version of the test    *
*1-6. Excellence 3     * was based on these 6 computers.  *
*1-6. Turbostar        *                                  *
*1-6. Elite A/S        * It was ready in July 1986.       *
*1-6. Constell. 3.6    *                                  *
*1-6. Champion         *                                  *
***********************************************************
*  7. Super Constell.  *   1872   *   1719   *    +153    * Instant
*  8. Conchess 4       *   1777   *  (1712)  *     +65    * point-
*  9. Plymate 5.5      *   1771   *   1803   *     -32    * adjustment.
* 10. Elegance 5.0     *   1840   *  (1801)  *     +39    *
***********************************************************
* Oct 1986   The tactical part of the test was changed.   *
***********************************************************
* 11. Expert 4         *   1795   *   1782   *     +13    *
* 12. Constell 2.0     *   1575   *  (1592)  *     -17    *
* 13. Par Excellence   *   1825   *   1818   *      +7    *
* 14. Super Enterprise *   1765   *   1546   *    +219    * further
* 15. Meph. MM2 3.7    *   1790   *   1762   *     +28    * point-
* 16. Mephisto Dallas  *   1978   *   1971   *      +7    * adjustment.
* 17. Super Mondial    *   1687   *   1801   *    -114    *
* 18. Rebell           *   1583   *   1810   *    -227    *
* 19. Forte B          *   1887   *   1809   *     +78    *
* 20. Mephisto III 6.1 *   1502   *  (1455)  *     +47    *
***********************************************************
* Aug 1987   Prototype 1.                                 *
***********************************************************
* 21. Primo (VIP)      *   1598   *   1625   *     -27    * Fixed
* 22. Forte A          *   1764   *   1801   *     -37    * points
* 23. Mephisto MM IV   *   1746   *   1897   *    -151    * for
* 24. Stratos 6 MHz    *   1914   *   1801   *    +113    * prototype 1.
* 25. Mephisto Roma    *   1997   *   1967   *     +30    *
***********************************************************
* Feb 1988   Prototype 2.                                 *
***********************************************************
* 26. Psion Atari      *   1887   *   1874   *     +13    * Fixed
* 27. Excel 68000 Club *   1846   *   1851   *      -5    * points
* 28. M.Roma 68020     *   2081   *   2027   *     +54    * for
* 29. Excel Mach II C+ *   1930   *   1917   *     +13    * prototype 2.
* 30. Turbo S 24K      *   1381   *   1459   *     -78    *
* 31. Meph B&P 3.7     *   1695   *  (1695)  *      +0    *
* 32. Meph Mega IV     *   1760   *   1914   *    -154    *
* 33. Mach III         *   2044   *   2005   *     +39    *
* 34. Avantgarde       *   1862   *   1829   *     +33    *
* 35. Super Expert     *   1851   *   1824   *     +27    *
* 36. Almeria 68020    *   2119   *   2095   *     +24    *
* 37. Almeria 68000    *   2083   *   2018   *     +65    *
* 38. Meph. Academy    *   1878   *   1938   *     -60    *
* 39. Meph. II 6.1     *   1324   *  (1471)  *    -147    *
* 40. Sphinx Galaxy    *   1812   *   1875   *     -63    *
***********************************************************
* Jul 1989   Prototype 3.                                 *
***********************************************************
* 41. Polgar           *   1883   *   1982   *     -99    *
* 42. Portorose 68020  *   2092   *   2133   *     -41    *
* 43. Super Expert B   *   2097   *   1898   *    +199    *
* 44. Elite 2x68000    *   2023   *   2036   *     -13    *
* 45. Super Expert C   *   2010   *   1954   *     +56    *
***********************************************************
* Sep 1990   Prototype 4.                                 *
***********************************************************
* .......              *          *          * Future     *
* .....                *          *          * results... *
***********************************************************

This table shows how prototype 4 calculates the rating of the 45 commercial computers tested. Deep Thought was tested in September 1988. It scored the maximum in tactics, but its positional play and endgame were not at the level of the best commercial computers. It would be interesting to follow Deep Thought's development according to this test.

An 's' outside the table denotes a selective program, and a '-' denotes that the computer has not taken part in the point adjusting. This is caused by my BBC-B computer having only 32K RAM. A 'y' denotes 'yes' to handling transpositions in the opening. Use of the opponent's time gives 40 points, and the 3 traps give 2 points each.

  ************************************************************************************************
  *             * Opening library * Tac-  * Posi- * End- * Time * Traps *      * PLY    * Diffe- *
  *             * halfmoves and   * tics  * tio-  * game * dis- * opp.  * Sum  * rating * rence  *
  * COMPUTER    * transpos.* pts. *       * nal   *      * pos. * time  *      * Oct 90 *        *
  ************************************************************************************************
  * Max. points * 34.000 y *   44 *  1409 *   585 *  172 *   32 *    46 * 2288 *        *        *
  ************************************************************************************************
- * DeepThought *  6.000 y *   22 *  1409 *   412 *  109 *   25 *    44 * 2021 *  2400  *   -379 *
  ************************************************************************************************
s * M.Por.68020 * 60.000 y *   44 *  1377 *   442 *  143 *   29 *    42 * 2077 *  2133  *    -56 *
s * M.Alm.68020 * 60.000 y *   44 *  1373 *   437 *  149 *   30 *    40 * 2073 *  2095  *    -22 *
- * Eli 2x68000 * 64.000 y *   44 *  1391 *   379 *  143 *   32 *    42 * 2031 *  2036  *     -5 *
s * Roma 68020  * 35.000 y *   44 *  1357 *   423 *  162 *   27 *    42 * 2055 *  2027  *    +28 *
s * M.Alm.68000 * 60.000 y *   44 *  1356 *   425 *  149 *   26 *    40 * 2040 *  2018  *    +22 *
  ************************************************************************************************
  * Mach III    * 28.000 y *   40 *  1371 *   374 *  160 *   24 *    42 * 2011 *  2005  *     +6 *
s * Polgar      * 28.000 y *   40 *  1252 *   441 *  125 *   26 *    42 * 1926 *  1982  *    -56 *
s * Meph. Dall. * 35.000 y *   44 *  1339 *   386 *  137 *   26 *    40 * 1972 *  1971  *     +1 *
s * Meph. Roma. * 35.000 y *   44 *  1328 *   423 *  138 *   28 *    40 * 2001 *  1967  *    +34 *
s * S.Expert C  * 32.000 y *   42 *  1329 *   418 *  141 *   28 *    42 * 2000 *  1954  *    +46 *
  ************************************************************************************************
s * M. Academy  * 30.000 y *   41 *  1251 *   454 *  125 *   25 *    42 * 1938 *  1938  *     +0 *
s * Meph. Amst. * 24.000 y *   38 *  1290 *   383 *  109 *   26 *    40 * 1886 *  1923  *    -37 *
  * Exc M.II C+ * 16.000 y *   34 *  1314 *   360 *  157 *   25 *    42 * 1932 *  1917  *    +15 *
s * Meph MegaIV *  7.000 y *   22 *  1218 *   429 *  102 *   23 *    42 * 1836 *  1914  *    -78 *
s * S.Expert B  * 36.000 y *   44 *  1368 *   384 *  132 *   30 *    40 * 1998 *  1898  *   +100 *
  ************************************************************************************************
s * Meph. MM IV *  3.500 y *   18 *  1199 *   453 *  105 *   24 *    42 * 1841 *  1897  *    -56 *
s * Sphinx Gal. *  8.000 y *   22 *  1256 *   420 *   82 *   26 *    42 * 1848 *  1875  *    -27 *
s * Psion Atari * 24.000 y *   38 *  1261 *   361 *  113 *   25 *    40 * 1838 *  1874  *    -36 *
  * Exc Club    * 16.000 y *   34 *  1276 *   356 *  143 *   25 *    40 * 1874 *  1851  *    +23 *
  * Avantgarde  * 40.000 y *   44 *  1247 *   336 *  136 *   18 *    42 * 1823 *  1829  *     -6 *
  ************************************************************************************************
  * SuperExpert * 32.000 y *   42 *  1308 *   296 *  105 *   30 *    42 * 1823 *  1824  *     -1 *
  * Par Excell. * 16.000 y *   34 *  1265 *   349 *  136 *   19 *    42 * 1845 *  1818  *    +27 *
s * Rebell      *  3.000 y *   16 *  1187 *   410 *   89 *   24 *    42 * 1768 *  1810  *    -42 *
  * Forte B     * 20.000 y *   36 *  1306 *   301 *  118 *   25 *    42 * 1828 *  1809  *    +19 *
  * Plymate 5.5 *  3.000 y *   16 *  1370 *   226 *  126 *   25 *    42 * 1805 *  1803  *     +2 *
  ************************************************************************************************
  * Stratos 6   * 15.000 y *   33 *  1263 *   379 *  103 *   28 *    42 * 1848 *  1801  *    +47 *
  * Forte A     * 20.000 y *   36 *  1282 *   307 *  109 *   29 *    40 * 1803 *  1801  *     +2 *
  * Sup Mondial *  6.000 y *   22 *  1263 *   336 *   97 *   25 *    42 * 1785 *  1801  *    -16 *
- * Elegance5.0 *  3.000 y *   16 *  1262 *   405 *  131 *   18 *    42 * 1874 * (1801) *    +73 *
  * Expert 4    * 22.000 y *   37 *  1253 *   292 *  121 *   29 *    42 * 1774 *  1782  *     -8 *
  ************************************************************************************************
  * Meph. MM2   *  3.000 y *   16 *  1343 *   236 *  126 *   25 *    42 * 1788 *  1762  *    +26 *
s * Turbostar   * 10.000 y *   22 *  1230 *   316 *  109 *   25 *    40 * 1742 *  1756  *    -14 *
  * Excell 3.0  *  3.000 y *   16 *  1215 *   342 *  131 *   25 *    40 * 1769 *  1745  *    +24 *
  * Super Con.  * 20.000 y *   36 *  1223 *   311 *  101 *   28 *    42 * 1741 *  1719  *    +22 *
- * Conchess 4  *  3.000 y *   16 *  1356 *   225 *  108 *   24 *    40 * 1769 * (1712) *    +57 *
  ************************************************************************************************
- * Meph B&P3.7 *  3.000 y *   16 *  1340 *   209 *  108 *   22 *    40 * 1735 * (1695) *    +40 *
- * Elite A/S   *  8.160 y *   22 *  1085 *   324 *  134 *   30 *    40 * 1635 *  1663  *    -28 *
- * Const. 3.6  *  3.000 y *   16 *  1216 *   299 *   80 *   23 *    42 * 1676 *  1636  *    +40 *
- * Primo (VIP) *  2.000 y *   11 *  1176 *   313 *   64 *   29 *    42 * 1635 *  1625  *    +10 *
- * Const. 2.0  *  3.000 y *   16 *  1130 *   320 *   80 *   24 *    42 * 1612 * (1592) *    +20 *
  ************************************************************************************************
- * Enterprise  *  6.000 y *   22 *  1199 *   419 *   85 *    9 *    40 * 1774 *  1546  *   +228 *
- * M. II 6.1   *  3.000 y *   16 *  1031 *   240 *   89 *   24 *    40 * 1440 * (1471) *    -31 *
s-* Turbo S 24K *  5.000 y *   22 *  1010 *   315 *   78 *   32 *    40 * 1497 *  1459  *    +38 *
- * MephIII 6.1 *  3.500 y *   18 *  1116 *   288 *  102 *   21 *    42 * 1587 * (1455) *   +132 *
- * Champion    *  3.000 y *   16 *  1025 *   279 *   98 *   29 *    46 * 1493 * (1396) *    +97 *
  ************************************************************************************************

Conclusion

I assume that 300 test positions would be sufficient to calculate any computer's rating with a maximum difference of 50 Elo points. I compare this to the task of estimating another player's rating on the basis of 8 games, and the test must be better, as every move is important. But with more positions the test would be more precise and reliable. Many useful positions can be found in Modul, where many different tests have been published.

The test should be made dynamic, so that it can be modified instantly by adding missing positions and correcting or deleting positions.

To keep my test simple, I have only one move as the right solution in each position. You must also be sure that the move is chosen for the right reason, and these demands make it difficult to find satisfactory positions, or make them look strange when you construct them (Diagram 64). It is probably better to give points to several moves in a position, as in the Bratko-Kopec test. The optimum is perhaps to use the program's evaluation score to ensure that the program has understood the position. This facility is used in one of the tests in Modul.

The future

I hope some people, for example at a university, will try to make this complete test, and that the ICCA will support this work.
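The suggestion in the conclusion to give points to several moves in a position could be represented as in the sketch below. It is only one possible layout and is not the scoring of the Bratko-Kopec test itself: the function names, the example moves and their point values are hypothetical. The limit of 50 points per test is the rule introduced with prototype 2.

```python
from typing import Dict

MAX_POINTS_PER_TEST = 50  # limit introduced with prototype 2


def validate(move_points: Dict[str, int]) -> None:
    """Reject point tables that break the 50-point limit."""
    for move, pts in move_points.items():
        if not 0 <= pts <= MAX_POINTS_PER_TEST:
            raise ValueError(f"{move}: {pts} points is outside 0..{MAX_POINTS_PER_TEST}")


def score_move(move_points: Dict[str, int], played: str) -> int:
    """Points for the move played; moves not listed score nothing."""
    return move_points.get(played, 0)


# Hypothetical position: a best move and two acceptable alternatives.
example_position = {"Nf5": 30, "Rd1": 15, "h3": 5}
validate(example_position)

print(score_move(example_position, "Nf5"))  # 30
print(score_move(example_position, "Rd1"))  # 15
print(score_move(example_position, "a4"))   # 0
```

With such a table a position no longer has to be constructed so that exactly one move is right, at the price of having to decide how much each alternative is worth.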
My practical experience

The rest of this paper describes what I have come across in my work with such a test. I hope it will be of benefit to future test makers and users of a test. Don't let all these things lead you to think that it is impossible to make a test for chess computers. The moves of the computer tell everything about its strength!

Programs in several versions

My main problem is that some programs exist in several versions (Mach II, of which at least 4 versions have been established, Simultano etc.). If my test uses one version and the Swedes use a second version or a mix of versions, it may be difficult to make the results fit. PCs and home computers running with different processors, different clock speeds (MHz), different RAM sizes for hash tables, fast or slow RAM, and with or without a coprocessor give the same problems.

As a precaution, a small test could be made to ensure that several units of a computer model in use are likely to be identical. The test could be to measure the time for a mate in 3 positions. But it is too late to do that for the Swedish rating list.

A good but time-consuming method to avoid the problem would be to choose, for example, 16 different programs and let them play a number of games against each other until a sufficiently reliable rating relation between the programs has been determined. The very same programs should then do the test, the points should be adjusted etc.

Slightly different versions of a program

Computers that learn from their errors also make things a little more complicated, as a computer's rating increases the more it learns. The same goes for computers where you can add variations to the opening library or add extra opening modules or endgame modules.

Missing issues

My test does not measure how good a program is at using information such as killer moves, hash tables etc. from the calculation of earlier moves in the game.

Positions with castling are missing.

Knowledge about the wrong bishop together with an a- or h-pawn, and endgames like KQKR, KBBKN etc., are missing.

All computers get the same points for using the opponent's time. But someone has complained that Plymate 5.5 too often calculates on a strange, unlikely move. If the opponent thinks for a long time, it might be better to find an answer to several moves instead of using all the time on a single move, which the opponent may not play anyway. Hitech does this (ICCA JOURNAL June 1990 p. 112).

A program may have a small feature: it discovers that the score has turned bad compared to the previous move and uses some extra time to find 'something'. I don't know whether this improves the program's rating.

Psychological play is also missing, such as making an unexpected move which may not be objectively best, but is difficult to calculate and to respond to correctly in time trouble.

The opening library

The opening library is only measured by its size. But is it the number of variations, positions or plies that the manufacturers quote? What about the quality of the library? Does it contain only best moves, or are there odd and risky moves among them? Are the variations suited to the computer's style of play? Computers with opening variations dedicated to beating other computers will probably get better results against other computers than in my test. Super Expert B has even been suspected of having killer games in its library.

Random function

If a program has a random function, it may give different results.

Errors in testing the program

The tester of a computer may have done the test in a wrong way, set up some positions wrongly, written down the result wrongly etc. Only double testing can discover such errors.

If a program has some killer moves or hash-table entries in its memory from an earlier calculation on the same position (typically when double testing a position), the second run may be solved faster because of the stored information. If the tester becomes suspicious of the two different times measured and just does the test once more, he will get the same result as in test 2 and perhaps accept this result.
But it is the first result that is correct. Mephisto Almeria shows this behaviour and should be turned off before double testing.

If the tester lets the program show its thinking continuously, this might slow its execution. Even if the tester does it right and only checks the best move every minute, the timing may still be wrong, because some programs stop while this display function is used. So the tester should do it quickly.

Never trust a computer's internal clock before it has been checked. The clock of my Designer 2265 runs about 10% too slow.

Mephisto Almeria starts calculating as soon as you leave set-up mode, so if the tester is not aware of this, the computer might get some seconds or even minutes for free.

Infinite/tournament level

My test uses the infinite level and the Swedes use the tournament level. This may give a small difference. On the infinite level the program may save a little time, as it does not have to calculate whether it should move now or continue thinking. There is no difference if the lazy programmer lets the program do this calculation on all levels and either 1) just ignores the result on the infinite level or 2) has given the program 'infinite' time before the calculation starts, so that the thinking will never be stopped.

A more serious difference might occur if the programmer has used different iterative deepening schemes, for example the normal 1, 2, 3, 4 etc. on tournament level and 1, 3, 5, 7 etc. on the infinite level. Kaufman has mentioned several computers with different iterative deepening on these two levels.

The test has been used to make the program

If the test has been used in calibrating the parameters of a program, the program will probably perform relatively better in the test than against other computers. D. Kittinger has known my test for some years, and that may explain why Super Expert B did so well on its entry to my test.
Bugs

If a program has some bugs, they will probably not occur exactly as often in the test positions as in a large number of games. Sphinx Galaxy sometimes loses its queen, but does not do so in my test positions; one version of the Mach III finds nonexistent mates etc.

Computer versus computer or human versus computer

The Swedish rating list calculates a computer's rating against other computers, and play against humans may be quite different. Use of open positions with tactical play is probably best. Setting traps like the 3 in my test is also good. Even such things as long diagonal moves, complicated moves in the human's time trouble, moves depending on the opponent's strength, psychological moves etc. are yet to be incorporated in play against humans.

APPENDIX A: The test positions

Further information

You can get more information by sending a check to Jens Baek Nielsen, Daltoften 15, 8600 Silkeborg, Denmark.

The test with diagrams, explanation etc.: 4$. Specify English, German or Danish version.
How the adjusting is done: 3$ (diskette 6$). Specify paper or 3.5" IBM diskette.
Whatever you order, add 6$ for postage.

References

Donskoy, Schaeffer: ICCA JOURNAL 1989 no. 3 p. 160-161.
Marsland: ICCA JOURNAL 1990 no. 1 p. 15-19.
Paul Lu: ICCA JOURNAL 1990 no. 3 p. 155.
Private communication 1984-1990 with Thoralf Karlsson, Goran Grottling, Thomas Mally, Larry Kaufman and others.