The latest update to this file was made on 14 April 1991. This file contains information about what could be wrong in my rating test or what could be improved. It is not very well ordered, but if you want to read the text it should be possible! (The writing was concluded in autumn 1989, just before the Swedes lowered the rating list by 73 points.)

The most dangerous thing for my test (and for the Swedish rating list) is programs that exist in several versions. For example, Simultano has some very different versions and has been excluded from my test (which version(s) do the Swedes use??). The Mach II and others also have several versions. If my test uses one version and the Swedes use a second version (or a mix of versions), it may be difficult to get the results to fit! (If a person has a third version and describes it in MODUL/PLY/CSS, we are really going to have fun!) PCs and home computers running with different processors (models/versions), different MHz, different RAM sizes for hash tables, fast or slow RAM, and with or without a coprocessor give the same problems.

I have proposed a little test to both MODUL and PLY, so that readers could help discover more versions of a program. From time to time you see readers really doubting whether they have a bug-free and optimal version of a program. If you are making and selling chess computers and find an error in a program after having sold 1000 chess computers, you correct the error and sell 9000 error-free computers. Who will bother about the first 1000? But, very strangely to me, neither PLY nor MODUL is interested in this little test!?!

The opening library is measured only by its size. But is it the number of variations, positions or plies that the manufacturers quote? What about the quality of the library? Does it contain only best moves, or are there strange and risky moves among them? Are the variations suited to the computer's style of play? All these things are impossible to measure correctly.

All computers get the same credit for using the opponent's time. But someone has complained that, for example,
Plymate too often calculates on a strange, unlikely move... If the opponent thinks for a long time, it might be better to find answers to several moves instead of using all the time on a single move, which the opponent might not play anyway. Hitech does this (ICCA Journal, June 1990, p. 112).

If a program has a random function (so that it does not always play the same move when there are several almost equally good moves to choose between; you especially want this in the opening), it may give different results in a test position. The testers should run some tests several times and see whether this occurs.

See also the next 'trap': if a program has some killer moves or hash tables in its memory from an earlier calculation (especially if it is the same position!), too good a result might occur. Imagine a tester doing a test twice: the second run will be solved faster. If he then finds a mismatch between the two solution times, he does the test once more, gets the same result as in run 2, and accepts this result. But it is the first run that is right! Mephisto Almeria shows this behavior. Similarly, my test does not notice how good a program is at using information (killer moves, hash tables etc.) from the calculation of earlier moves in the game. Mephisto Almeria also starts calculating as soon as you leave set-up mode, so if the tester is not aware of this, the computer might get some seconds or even minutes for free!

My test uses the infinite level, and the Swedes use the tournament level. This may give a small difference, but I don't think it is important. On the infinite level the program saves a little time, as it does not have to calculate whether it should move now or continue thinking (perhaps the lazy programmer lets the program do this calculation on all levels and either 1: just ignores the result on the infinite level, or 2: gives the program infinite time before the calculation starts, so the thinking will never be stopped).

A more serious difference might occur if the programmer for some reason (??) uses different iterative search deepening on the two levels, for example the normal 1, 2, 3, 4... on the tournament level and 1, 3, 5, 7... on the infinite level. Kaufman has mentioned several computers with different iterative search deepening on the tournament level and the infinite level. A concrete example can be found in MODUL 91-1, p. 4-5, where Super Expert C gives different times on the two levels in one position.

The Swedish rating list is a very powerful tool for comparing the strength of the different computers with each other. I use it as the truth and try to make my test's calculations fit the ratings on the list. But please remember that the list is not EXACTLY the truth! For each computer the list shows that, with a probability of 95%, its rating is within a range of X points. Take this example: with 40 computers on the list you can calculate that the probability that they all are within their 95% limits is ca. 13% (0.95 raised to the power of 40)! In other words, the probability that one or more computers exceed the 95% limit is 87%!! (Besides this, the list may have other errors, such as wrongly reported results, wrong setup of computers, errors in the rating calculation program etc.)

I have always been of the opinion that the SPREAD on the Swedish rating list is too big. I think it is because all computers to some extent have the same STYLE of play (never making simple blunders etc.): even a small difference in strength, which is difficult to measure in a match against humans, will give a different result in a match between computers. I have had a Super Enterprise myself, and it would not surprise me if it could get 1700 or more against humans (it has 1585 on the list). Perhaps the Swedes should, instead of just lowering the level of the list by 73 points because the top computers cannot prove their ratings against humans, take each computer's distance to 1850 (a middle value) and reduce that distance to, for example, 75%. For example,
1550 and 2150 would become 1625 and 2075. But how could you find a scientific basis for such a trick?

Computers that learn from their errors also make things a little more complicated, as such a computer's rating is no longer fixed: its rating increases the more it learns! The same goes for computers where you can add variations to the opening library, or computers where you can add extra opening modules and endgame modules.

Computers with opening variations (or even games!) dedicated to killing other computers (Super Expert B against Mach III?!) will probably get better results against other computers than their results in my test.

The tester of a computer may have done the test in a wrong way, set up some positions wrongly, written down the result wrongly etc. Only double testing can discover such errors.

If my test has been used in adjusting the parameters of a program (Kittinger's Super Expert B, as Kittinger has had the test since 1988?!), it will probably perform relatively better in my test than against other computers!

If a program has some real bugs, they will probably not occur exactly as often in my few test positions as in a large number of games. For example, Sphinx Galaxy sometimes loses its queen, and I'm sure it does not in any of my test positions!

A program may have a little feature: it discovers that the score has turned bad compared to the previous move and uses some extra time here to find 'something'. I don't know whether this improves the rating of the program (or by how much). But I'm sure my test won't discover this time-allocation feature! The same goes for playing an unexpected move, which may not be objectively best but is difficult to calculate and respond to correctly when your opponent (especially a human) is in time trouble!

Mr. Hoffmann has put some questions to me which might interest other readers:

Test 13/14: Does any computer play Rd4!, and does any computer not play cxd? All Richard Lang's programs and Forte B play Rd4!
And Par Excel., Turbostar, Elite A/S and Champion do not play cxd!!

Test 59/60: Why are such different points given for exchanging pieces and for avoiding the exchange of pieces? I initially set both these tests to 9 points, but my BBC program's automatic point adjustment (described in MODUL 88-3) has changed these to 3 and 18 points. I admit this does not look realistic.

Test 91/92: Does any computer solve exactly one of these two similar tests (insufficient material with one knight or one bishop)? I had not been aware of this idea before, so I have checked my results. It was a surprise to see that Forte B plays Kxd2 in the first test and Kxf2 in the second! Perhaps this is an error by the tester... Readers with a Forte B can try this (and write to me).

Over the years I have received much advice, many complaints etc. It is not easy to construct a good and error-free test!

I constructed test 27 so that White should play d6! immediately, before Black could prevent it with his -,d6. Later I thought that the threat Nc7 winning the exchange, activated by a move like Rc1, gave Black no time to play -,d6, and so had to be judged a good move just like d6!. But now I know Black can play Bb4+ and castle and still has some hope. What is worse is that Tommy Nagel has pointed out that d6! wins a piece: 1.d6!,Kf7 2.Nc7,g6 3.Qd4! This changes the test from a positional one to a tactical one. Tommy also proved that test 82 also has 1.Ke5! as a correct solution.

Test 121: Most tactical test positions have been constructed so that there is an obvious move, like capturing an unprotected piece. Later the program understands that this move is not good or that there is a better move. This is necessary to prevent the program from choosing the right move by accident from the beginning without 'understanding' the position. But a side effect is often unnatural positions with a big difference in the material balance. Larry Kaufman tells that Hitech, even though -,Rxh4? does not prevent the mate, plays -,Rxh4!
for this reason: 'I am far behind in material, but I'll try to save my rook, capture the knight and hope that my (human) opponent does not see the mate. If he misses the mate, I am not so far behind in material any more'. But this argument is bad against another computer, and the test is designed for computer versus computer play, like the Swedish rating list. (Tests 62-64, however, are included to see whether the program plays optimally against humans.)

Too many of the tactical tests are concerned with mate (Kaufman), and there should be more tactical tests. During the first minute or two the time should be measured in seconds (also Kaufman), to distinguish better between the best computers and to give the test a higher maximum score, so that strong mainframes could score the points they deserve. More endgame tests, especially rook endings, are missing today. There is no test where castling is the important thing to do.

Test 94 is a blunder made by me when I was preparing the test. The move -,Kxh2! should have been h3! White was to move and AVOID insufficient material. There are many positions about insufficient material in the test; although they are not important, they are included to make the test complete (similarly, time allocation is included: it IS important, but probably impossible to evaluate with test positions).

Test 97-104: 8 tests with one position. For every right move you can find a number of moves that are almost as good and also win, and many people think it is random whether a computer chooses exactly the right move and scores the points. I can agree to some degree. When I constructed this position I thought that the right moves showed an understanding of the position like the one humans have. But this is too much to expect from the commercial computers of today.

One of my main worries, which no one has asked about, is the budget points of the tests, which the point adjustment function in my program tries to stay close to.
I have determined these points myself and would have preferred a group of good chess players to decide them. I have a rating of ca. 2000 and expect that the budget points have been set reasonably well, but probably not optimally.

What follows now is something I wrote to Thomas Mally, and it shows what kind of problems you can have in testing programs on PCs and home computers:

"As you might remember, I have had Chessmaster 2000 tested on both the Amiga and the Atari. But the results look wrong. The Amiga got 1582 and the Atari got 1518. The point is that the Amiga runs at 7.2 MHz and the Atari at 8.0 MHz. If you look at the tactical results, they show that the Amiga was faster in 14 tests, 7 tests were equal (4 of these were not solved) and the Atari was faster in 1 test.

I have tried test 140 to find out what could have been done wrong. If you let the program show its thinking continuously, it slows down its execution: 12'57 compared to 8 minutes are my results. The Amiga tester had 6 minutes as his result... Hmmmmm. (The Atari result was 9 minutes.) Perhaps the Atari tester let the computer show its thinking continuously... Do you perhaps know a careful person who could do the tactical test on the Atari? If you do it right and only check the best move every minute, it also gives a small error: while you look at the menus the program stops completely, so you should do it quickly!

Another point is this: I have read in PLY that the tester of the Atari complains that the clock is running too slow... I have checked this on my Amiga, and the clock runs ca. 16% too slow. Dave Kittinger did not know of this when I asked him... I have later wondered whether it is caused by the difference in the mains frequency of the power supply: it is 60 Hz in the USA and 50 Hz in Denmark. These figures fit very well with 16%. I don't know whether this is possible, or what the frequency is in Sweden, Austria, Germany etc. But it might give wrong test results if the tester uses this clock!
It is obvious that wrong results are very bad for my test. Not only does the test then give bad predictions, but the test is also adjusted on a wrong basis."
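
Three of the figures quoted in this file are simple arithmetic and can be rechecked with a few lines of Python: the 13% / 87% for 40 computers on the rating list, the 1550 and 2150 example for reducing the spread around 1850 to 75%, and the ca. 16% for a clock running on 50 Hz instead of 60 Hz. This is only a sketch. The function names are made up for illustration and are not part of the test program. The first calculation assumes that the errors in the 40 ratings are independent of each other, and the last one only checks the 50/60 ratio; it does not show that the mains frequency really is what slows the Amiga clock.

```python
# Rechecks three figures quoted in this file.
# All names here are illustrative assumptions, not taken from the test program.


def prob_all_within(n_computers, coverage=0.95):
    """Chance that every one of n ratings lies inside its own interval.

    Assumes the rating errors are independent of each other.
    """
    return coverage ** n_computers


def compress_spread(rating, centre=1850, factor=0.75):
    """Pull a rating towards `centre`, keeping `factor` of its distance."""
    return centre + factor * (rating - centre)


def clock_slowdown(actual_hz=50, expected_hz=60):
    """Fraction by which a clock counting mains cycles would run slow
    if it expects `expected_hz` but is fed `actual_hz`."""
    return 1 - actual_hz / expected_hz


if __name__ == "__main__":
    p = prob_all_within(40)
    print(f"all 40 inside the 95% range: {p:.1%}")
    print(f"at least one outside:        {1 - p:.1%}")
    print(compress_spread(1550), compress_spread(2150))
    print(f"clock slow by {clock_slowdown():.1%}")
```

Run as a script, this prints about 12.9% and 87.1% for the rating list, 1625.0 and 2075.0 for the compressed ratings, and 16.7% for the clock, which agrees with the rounded figures given in the text.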