File: ADJUST

The following was written in the summer of 1989. I wrote a description for MODUL of how the point-adjustment function in my rating test works in detail. It was never published, as it was too detailed and the Swedes were planning to change their rating list. But I wrote it, and now YOU are offered to read it!

Since the summer of 1989 the following has happened:

1) Test 143 has been discarded.
2) Prototype 3 in the form described here was never published, as the Swedes were planning to change their rating list, so we decided to wait and see.
3) Where I write "For a tactical test the difference is divided by 4 BEFORE the power raising", you should read 1.3 instead of 4. The value was changed to bring the points for the tactical tests nearer the budget.

I have spent many hours with my BBC program during the last month, and it has become smaller (more free space), faster, and easier to read and modify. It also gives more information, and even some minor errors have been found and corrected. The adjusting function is also faster now, as it is more intelligent: if the new weight improves the test, another attempt in the same direction is tried.

I have used this time because I have been thinking about my next article for MODUL - here is what I have:

In this issue I will try to explain the point-adjusting function of my BBC BASIC program in more detail than I did in MODUL-88-3, page 27. (Another reason for writing this description is in case someone takes over the test.)

The calculated rating of each computer should be close to the PLY ratings, and the points for each test should be close to my budget points. Tactical tests and selective computers are allowed bigger differences - how this is done is shown below. No test can get more than 50 points, and the lowest number of points is 10 for tactical tests and 3 for other tests. Some tests have only 1 or 2 points, mainly tests on insufficient material and traps, and are not adjusted. Time disposition is also not adjusted.

A total of 103 tests are adjusted; 22 of these are tactical tests. Notice that the published test had an error in test 106: it should have only 1 point, not 3.

When the point adjusting runs, it goes through all the tests and tries to adjust each weight 1 point up and 1 point down. If the test is improved, the new value is kept. For the tactical tests, random steps of up to 15 points are also tried, as there is a risk that a change of only one point disappears by being rounded off, so that the weight would never be changed even when a change was needed.

I calculate two numbers, which show how big the rating differences are (G) and how big the test point differences are (H).

G: The sum of the computers' rating differences, each raised to the power of 3. For a selective program the number is divided by four AFTER the power raising. Finally the sum is divided by the number of computers to get the AVERAGE. There are two reasons for using the average: the size of the number is independent of the number of computers involved (why this is important is shown later), and it is convenient that G ends up of a similar size to H.

H: The SUM over all 103 tests of the budget difference raised to the power of 3. For a tactical test the difference is divided by 4 BEFORE the power raising.

Raising to the power of 3 has the effect of avoiding single big unrealistic differences, even at the cost of a higher average difference. Example: the program prefers the three differences 5,5,5 to 2,2,8, even though the average 5 is bigger than the average 4. This is because the differences raised to the power of 3 sum to 375 for 5,5,5, which is less than the 528 for 2,2,8!

If both G and H go down, the test is improved, and if they both go up, the change is discarded. That is clear, but what if one goes up and the other down? For this I have invented a factor, which has the value 2: G * 2 = H. In this way it determines the RELATION between G and H (now you see why it is important to keep G at a constant size!!).
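
The two measures G and H described above can be sketched in a few lines of Python (not the original BBC BASIC program). This is only an illustration under stated assumptions: the function names and parameters are invented here, absolute values of the differences are assumed before cubing (the text does not say how signs are handled), and the rounding of points is not modelled.

```python
def rating_measure(rating_diffs, selective_flags):
    """G: average of the cubed rating differences.

    Assumption: absolute differences are used before cubing.
    """
    total = 0.0
    for diff, selective in zip(rating_diffs, selective_flags):
        term = abs(diff) ** 3
        if selective:
            term /= 4              # divided AFTER the power raising
        total += term
    return total / len(rating_diffs)   # the AVERAGE over the computers


def budget_measure(budget_diffs, tactical_flags, tactical_divisor=4):
    """H: sum of the cubed differences between test points and budget."""
    total = 0.0
    for diff, tactical in zip(budget_diffs, tactical_flags):
        d = abs(diff)
        if tactical:
            d /= tactical_divisor  # divided BEFORE the power raising
        total += d ** 3
    return total                   # a SUM, not an average


# The example from the text: 5,5,5 is preferred to 2,2,8.
even = budget_measure([5, 5, 5], [False] * 3)
spiky = budget_measure([2, 2, 8], [False] * 3)
print(even, spiky)  # prints 375.0 528.0
```

The `tactical_divisor` parameter is there only because the text mentions that the divisor 4 was later changed to 1.3.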

If H improves, the improvement must be twice as large as the worsening of G to be accepted. And similarly the other way round: if G improves, the improvement must be at least half as large as the worsening of H. By adjusting this factor, I can determine whether the program shall go nearer to the rating points or nearer to the budget points! See the two experimental versions in the table below, where the factor has been changed 200 times up and down.

When the program has adjusted points for 4-5 days(!!), it makes no more changes and a new prototype 3 has been born!

Testname           A     B    C    D    E     F       G         H     I
Budget points   80.4  90.5  269  0.0  0.0   0.0  742557       0.0   (0)
Experiment 2    42.8  53.9  161  2.9  1.0  10.1  122453    1143.6   0.1
Prototype 2     35.1  50.5  165  7.4  4.4  18.4   80423   33630.4     2
Prototype 3     18.7  24.1   52  6.4  3.5  17.1    5835   17895.8     2
Experiment 1     8.2  11.6   23  8.2  5.7  17.2     549  174075.5   400

A: How much the 30 computers on average differ from the rating in PLY-89-2.
B: The same for the 13 selective programs.
C: The difference for the computer with the biggest difference.
D: How much the 103 adjusted tests on average differ from my budget.
E: The same for the 81 non-tactical tests.
F: The same for the 22 tactical tests.
G & H: Numbers used in adjusting the test - explained above.
I: The factor used.

(A quick check of G: 19 raised to the power of 3 = 6859. For prototype 3, 19 is the average rating difference and G = 5835.)

I actually don't quite understand why the relation between G and H (H divided by G) became 3.07 in prototype 3 when I use a factor of 2...? I had expected them to be more equal.... In experimental version 2, G and H fit much better, with a relation of 0.00934!! Version 1 got a relation of 317, but you can't make all the computers' ratings fit no matter how much you adjust points. Version 1 actually did not run completely to the end - after 8 days(!!) I stopped it, as it made only very few changes!

I have some space problems on my BBC, which has only 32K RAM, so I have used only 30 of the computer results.

I have mainly used the best 30 computers, which should also mean that the test is better fitted to stronger computers. This sounds reasonable, as future computers are expected to be very strong compared to the earlier ones, of which the weakest are almost of no interest now.

The figures show that selective programs and tactical tests differ most - exactly what I wanted them to! It can also be seen that the new prototype 3 is very much better than prototype 2. In prototype 3 every computer starts with 991 points. (I have used the word prototype, as you can never know when the test is finished.....)

In the table above I have also shown the figures obtained if I simply used the budget points. It is interesting to notice that the selective programs actually differ more (90.5) than the brute-force programs, whose difference can be calculated to 72.7 - just as expected!

Some computers differ very much, and here are the worst:

Mephisto mega IV 4.9   -269 points
Mephisto MM IV         -213 points
Rebel                  -194 points
Sphinx galaxy          -152 points
Mephisto MM2 3.7       +148 points
Plymate 5.5            +141 points
Forte B                +127 points
Mephisto Academy       -104 points

It is reasonable to assume that it is these computers that cause the worst changes of the weights away from my budget points, in order to make their rating figures fit. Another thought, if you really believe in the test, could be that these computers have been tested in a wrong way..... Or there are several versions of these computers, and the Swedes use one version and the tester another (Simultano C..?)..... Or....

Who said it would be easy? No one! But let the future show how good Prototype 3 is at predicting new computers' ratings!
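
For anyone who takes over the test, the adjusting pass and the factor rule described earlier can be sketched in Python roughly as follows. This is not the BBC BASIC program, only one possible reading of it: the rule "an improvement in H must be twice the worsening of G" is interpreted here as "the weighted sum factor * G + H must go down", the callback `measure` (returning G and H for a list of weights) is hypothetical, a single pair of limits is used for all tests, and the random steps for tactical tests are left out.

```python
def combined(g, h, factor=2.0):
    """One number for the trade-off between G and H (an interpretation)."""
    return factor * g + h


def accept(old_g, old_h, new_g, new_h, factor=2.0):
    """Accept a change only if the weighted sum of G and H goes down."""
    return combined(new_g, new_h, factor) < combined(old_g, old_h, factor)


def adjust_pass(weights, measure, low=3, high=50, factor=2.0):
    """Try each weight one point up and down; keep accepted changes.

    After an accepted change, another step in the same direction is tried.
    `measure(weights)` is a hypothetical callback returning (G, H).
    """
    changed = False
    for i in range(len(weights)):
        for step in (+1, -1):
            while True:
                trial = weights[i] + step
                if not low <= trial <= high:
                    break                      # outside the allowed points
                candidate = weights[:i] + [trial] + weights[i + 1:]
                old_g, old_h = measure(weights)
                new_g, new_h = measure(candidate)
                if not accept(old_g, old_h, new_g, new_h, factor):
                    break                      # discard and stop this direction
                weights = candidate
                changed = True
    return weights, changed


def toy_measure(weights):
    """Made-up stand-in for the real test data, for illustration only."""
    g = abs(weights[0] - 12) ** 3               # ratings "want" weight 0 at 12
    h = sum(abs(w - 10) ** 3 for w in weights)  # budget is 10 for every test
    return g, h


weights, changed = adjust_pass([10, 14], toy_measure)
print(weights, changed)  # prints [11, 10] True
```

In the toy run the first weight stops at 11, between its budget (10) and what the ratings want (12); a larger `factor` pulls it towards 12, a smaller one back towards 10. The pass would be repeated until it reports no change.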