I decided to take a break from getting frustrated with the problems in DS2x86 and work on DSx86 instead. This version has various small fixes, mostly for specific games but all of these fixes were actual bugs in the emulation, so they might fix problems in other games as well. Here is the list of the fixes:
Battle Bugs also played PC Speaker sounds (before configuring the audio) and at times changed the speaker frequency so fast that it caused the ARM7 FIFO buffer to overflow, which in turn caused the touchpad to seem to stop working. I fixed this by changing the PC speaker frequency handling to use a shared memory location instead of the FIFO system. Now ARM7 reads that memory location 512 times per second (once per each AdLib music buffer fill cycle) and adjusts the PC Speaker channel frequency by the read value. This is not yet quite fast enough for PC Speaker digitized audio emulation, but at least it fixes the touchpad hanging problem.
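The switch from FIFO messages to a shared memory location can be illustrated with a small sketch. This is not the actual DSx86 code: the variable and function names are hypothetical, and the real setup of memory visible to both the ARM9 and the ARM7 is left out. The point is only that a single "latest value" slot cannot overflow, no matter how fast the game changes the frequency.

```c
#include <stdint.h>

/* Hypothetical shared slot. On real hardware this would have to live in
   memory that both CPUs can see; here it is just a global variable. */
static volatile uint16_t shared_speaker_freq = 0;

/* Frequency currently programmed into the (imaginary) sound channel. */
static uint16_t current_channel_freq = 0;

/* ARM9 side: an emulated speaker port write only overwrites the slot.
   There is no queue, so rapid changes simply replace each other. */
void arm9_set_speaker_freq(uint16_t hz)
{
    shared_speaker_freq = hz;
}

/* ARM7 side: called once per audio buffer fill cycle (512 times per
   second according to the post). Returns 1 if the channel frequency
   had to be updated, 0 if nothing changed. */
int arm7_poll_speaker_freq(void)
{
    uint16_t hz = shared_speaker_freq;
    if (hz == current_channel_freq)
        return 0;
    current_channel_freq = hz;
    return 1;
}

/* Helper for inspecting the value the ARM7 side last applied. */
uint16_t arm7_channel_freq(void)
{
    return current_channel_freq;
}
```

With this scheme, intermediate values written between two polls are lost, which matches the remark that the approach is not fast enough for digitized PC Speaker audio.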
Battle Bugs also needed various enhancements to the VGA 640x480 screen handling. It uses split screen mode and a non-standard display pitch value, both of which were still unsupported in my 640x480 mode screen blitting code. I added support for those features to the Zoom, Scale and Jitter modes, but did not have time to add them to the Smooth scaling algorithm yet. I don't actually know how to play Battle Bugs, or Populous II for that matter, so I don't know how playable those games are, but at least they start into the actual game.
Just before this weekend my NAS server machine died. I had all the backups and previous version source codes for DSx86 and DS2x86 in a RAID 5 disk array on that machine, along with all the debug logs and such you have sent me. It looks like the motherboard of that machine is dead, and I don't have a suitable RAID controller in my other machines, so I can not access the old version sources or debug logs at the moment. I have ordered a new motherboard, so hopefully I can get my RAID 5 array up and running again when the new motherboard arrives.
My summer vacation begins after next week, so then I should have more time to work on DSx86 and DS2x86. There are still a lot of things missing from DS2x86 that I would like to add to it, but before that I would want to fix the problems with the DSTwo I/O layer which are the source for the constant frustration. Perhaps I will have time to really dig into this problem during my summer vacation.
This version has no changes besides the rewritten and improved audio code. I managed to port the AdLib emulation code from ARM ASM to MIPS ASM, so this version also has AdLib audio support. I also implemented the SB digitized audio ADPCM sample formats, and improved the auto-init DMA behaviour.
However, the audio features are not yet fully working: there are problems in the ADPCM audio playing, and there is a rather big problem with the DSTwo I/O layer. I fought pretty much the whole of last week with the DSTwo I/O layer, as it keeps hanging very often when playing audio. I have tried all sorts of changes to my audio code, but it seems that I can only choose between the audio buffers occasionally hanging, and the whole I/O layer hanging! Thus, this version has the audio-buffer-hanging problem but it should not hang the whole I/O layer very often. I have no idea what exactly causes the I/O layer to freeze, so I have no idea how to stop this from happening, sorry.
This is the situation with audio on the games I have been extensively testing this version with:
Since this version has a high risk of breaking some games that used to work in the previous version, I'll keep the previous version 0.10 also available on my download page. I hope I can make some sense of the DSTwo I/O layer behaviour some day, as it has been very frustrating trying to circumvent weird problems that have nothing to do with my x86 emulation code!
For the past week I have been working on the new audio system for DS2x86. As I mentioned in the previous blog post, the code was mostly done by last Sunday, but the problem was that it sounded very bad and also seemed to cause the games to hang much more frequently than with the old audio code. The current status is that the new code is in use, and works at least as well (if not better) in the games that had audio also with the old code. The new code also has ADPCM audio supported, and it works fine in Warcraft, when the original code either failed to initialize or caused the game to crash very often.
I have been using four different games to test the new audio code, as each of these games use somewhat different SoundBlaster audio features:
I am using 22050Hz audio playing rate in DS2x86. In the original audio code I had initialized the DSTwo audio transfers to 3*128 samples per transfer (with the ds2io_initb() call), as that was the closest I could get to be able to send a new audio buffer within my 60Hz main emulation timer handler. That number was derived from the fact that I would need to send 367.5 (= 22050/60) new samples during each 60Hz timer interval, and the need of the transfer size to be divisible by 128. Using the 60Hz timer for this also meant that I could not emulate an SB IRQ faster than at 60Hz (which again corresponds to 367.5 samples at 22050Hz). For example, Doom wanted to get an SB IRQ after every 128 samples have been played. As Doom played the audio at 11kHz (which is very close to half the 22050Hz playing rate I use), I should have generated an SB IRQ after every 256 output samples, or at about 86Hz (= 11000/128). To make the audio in Doom work, I had coded a special hack into my audio code where I slowed down the sample rate if a game wants SB IRQs faster than my code could provide them. Obviously this was not a proper solution, so in my new audio code I wanted to handle this situation better.
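The arithmetic in the paragraph above can be checked with a few small helper functions. These are only an illustration with hypothetical names, not code from DS2x86. The 11025Hz rate used in the usage note is an assumption standing in for the "11kHz" mentioned in the text.

```c
/* Samples that have to be produced per timer tick at the given rate. */
double samples_per_tick(int sample_rate, int tick_hz)
{
    return (double)sample_rate / tick_hz;
}

/* Smallest multiple of 'block' that covers the samples needed per tick.
   For 22050Hz, a 60Hz timer and 128-sample blocks this gives 3*128. */
int transfer_size(int sample_rate, int tick_hz, int block)
{
    int needed = (sample_rate + tick_hz - 1) / tick_hz;  /* round up */
    return ((needed + block - 1) / block) * block;
}

/* Number of output samples between two SB IRQs when the game plays at
   'game_rate' and asks for an IRQ after every 'game_samples' samples. */
int output_samples_per_irq(int out_rate, int game_rate, int game_samples)
{
    return (int)((long)game_samples * out_rate / game_rate);
}
```

For example, samples_per_tick(22050, 60) is 367.5, transfer_size(22050, 60, 128) is 384, and output_samples_per_irq(22050, 11025, 128) is 256.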
The basic idea of my new audio code is that the emulation interrupt runs at 4*60Hz (240Hz) and calls the screen and keyboard handling only during every fourth interrupt. The audio emulation is handled at 240Hz, which should be fast enough for any SB IRQ frequency a game would need. By last Sunday I had implemented this new audio code, using 128 samples per transfer in ds2io_initb(), with a separate 1024-sample ring buffer that is filled in the 240Hz interrupt. Since 22050/240 = 91.875, I had designed the ring buffer filling algorithm so that it tried to fill the buffer with approximately 96 samples in each 240Hz interrupt, with the actual amount adjusted by the number of DSTwo I/O layer audio buffers in use. Since the DSTwo I/O layer has 4 audio buffers, I calculated the needed new samples as 128-ds2_checkAudiobuff()*32. And in every interrupt where the ring buffer had >=128 samples, I sent the buffer to the DSTwo I/O layer.
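A minimal sketch of the tick logic described above might look like the following. It only tracks sample counts instead of rendering audio, the structure and function names are hypothetical, and the buffers_in_use parameter stands in for the value returned by ds2_checkAudiobuff().

```c
#define RING_SIZE     1024  /* ring buffer size from the text            */
#define TRANSFER_SIZE 128   /* samples per transfer given to ds2io_initb */

typedef struct {
    int      count;  /* samples currently waiting in the ring buffer */
    unsigned tick;   /* number of 240Hz interrupts handled so far    */
} AudioState;

/* New samples wanted this tick: 128 when no DSTwo buffers are in use,
   96 with one in use, down to 0 when all four are in use. */
int samples_wanted(int buffers_in_use)
{
    return 128 - buffers_in_use * 32;
}

/* One 240Hz interrupt. Returns a bitmask:
   bit 0 = a 128-sample transfer should be sent to the I/O layer,
   bit 1 = this is a "fourth" tick, so handle screen and keyboard. */
int audio_tick(AudioState *s, int buffers_in_use)
{
    int result = 0;
    int n = samples_wanted(buffers_in_use);

    if (n > RING_SIZE - s->count)       /* never overfill the ring */
        n = RING_SIZE - s->count;
    s->count += n;                      /* real code would render n samples */

    if (s->count >= TRANSFER_SIZE) {
        s->count -= TRANSFER_SIZE;
        result |= 1;
    }
    if ((s->tick++ & 3) == 0)
        result |= 2;
    return result;
}
```

With one DSTwo buffer in use, the first tick queues 96 samples and only runs the screen and keyboard handling, and the second tick reaches 192 samples and sends one transfer.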
I thought this code was much better than the original one, however there were several problems that took me pretty much the whole of last week to fix:
I first debugged the behaviour in Doom, as that was one of the main problematic games I tried to fix with the new code. At last I found that the problem was in how the timer and SB IRQs interact in the Doom code. For some peculiar reason, Doom does not fill the 1024-sample DMA buffer during the SB IRQ, but instead in the timer IRQ. And with my new 240Hz maximum SB IRQ frequency, when the DSTwo audio buffers were empty, Doom could get two SB IRQs with no timer IRQ in between, and thus it skipped one 128-sample block in the DMA buffer, which in turn caused that block to play some old data that was in the buffer. I fixed this problem by fine-tuning the amount by which I fill the ring buffer, and never continuing to fill it after it was time to send the SB IRQ to the emulated game.
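One way to read "never continuing to fill it after it was time to send the SB IRQ" is that each fill step stops at the block boundary, so at most one SB IRQ is raised per step. The sketch below shows that reading only. The function and parameter names are hypothetical and the actual DS2x86 logic may differ.

```c
/* Decide how many samples to generate in this step.
   wanted            - samples the ring buffer logic would like to add
   samples_until_irq - in/out: samples left before the next SB IRQ is due
   block_len         - output samples per SB IRQ block (e.g. 256 for Doom)
   raise_irq         - out: set to 1 when the block boundary was reached
   Returns the number of samples to generate now. */
int fill_until_irq(int wanted, int *samples_until_irq, int block_len,
                   int *raise_irq)
{
    int n = wanted;

    *raise_irq = 0;
    if (n >= *samples_until_irq) {
        n = *samples_until_irq;         /* stop exactly at the boundary */
        *samples_until_irq = block_len; /* start counting the next block */
        *raise_irq = 1;
    } else {
        *samples_until_irq -= n;
    }
    return n;
}
```

Because the fill stops at the boundary, a second SB IRQ cannot be produced in the same step, which leaves room for the emulated timer IRQ to run in between.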
The second problem with the warbling sound seemed to be simply caused by the 128-sample transfer size. The DSTwo I/O layer does not seem to work very well with anything less than 512-sample transfer sizes. I tested all sizes between 128 and 512, and the 512-sample transfer size sounds best by a wide margin. It would be better to use smaller transfer sizes, as the larger the transfer size the bigger the delay between the game initializing audio playing and the time when the audio is actually heard. With the 512-sample buffer the delay is 23ms, which should still be small enough not to be noticeable.
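The delay figure follows directly from the transfer size and the playing rate. A tiny helper (hypothetical name, shown only to make the arithmetic explicit) gives the delay added by one transfer buffer:

```c
/* Time in milliseconds that one transfer of 'samples' samples covers
   at the given playing rate. */
double transfer_delay_ms(int samples, int sample_rate)
{
    return 1000.0 * samples / sample_rate;
}
```

At 22050Hz a 512-sample transfer covers roughly 23.2ms, while a 128-sample transfer would cover only about 5.8ms.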
The third problem was actually mostly fixed by fixing the second problem, using larger transfer sizes. I noticed that I got rid of the hangs by transferring the audio buffer in the same every fourth interrupt as where I handle the screen and keyboard stuff. It seems that sending the audio buffers too fast can hang the DSTwo I/O layer. I actually experienced an interesting hang once in Doom, where the screens stopped updating but the audio still continued, and based on the audio (gunshots and monster roars) the game continued fine in the background even when both the screens were completely frozen!
After I got those three problems fixed, I continued by adding the ADPCM sample routines for Duke Nukem 2. Those have now also been implemented, but for some still unclear reason Duke Nukem 2 seems to hang quite often. At times only the audio hangs and the game continues forward, at other times the whole system hangs. I have debugged the situation where the audio stops working, and in that situation the DSTwo I/O layer never releases the audio transfer buffer, and thus DS2x86 is unable to send the next transfer block. So, the problem is again somewhere in the DSTwo I/O layer, or more likely in some interaction between my emulation interrupt and the DSTwo I/O layer. It is rather frustrating to always fight with the DSTwo I/O layer to get rid of weird problems, but I suspect that is the price to pay for trying to bypass some limitations in the I/O layer.
Anyways, I got bored with debugging Duke Nukem 2, as Supaplex, Doom and Warcraft all now play nearly perfect SB digital audio and work fine without crashing for at least 15 minutes (that's the longest I have tested them). So, I started porting the AdLib audio code from DSx86 to DS2x86. The first step is to simply convert the assembler code from ARM to MIPS, and that is what I am currently doing. The bigger step is then to actually change the playing scheme. In DSx86 I could simply play each of the 9 AdLib channels using different NDS hardware audio channels, but in DS2x86 I need to mix all of these channels to the same output buffer that the SB digitized audio emulation uses. This might require some further changes to the AdLib emulation code, but I don't know for sure yet as I am now just converting the code. There is a tiny chance that the next version of DS2x86 might have AdLib audio, but it is more likely that the code does not work properly yet at that time.
Thanks again for your interest in DSx86 and DS2x86! The GBATemp Homebrew Bounty 2011 is currently in the voting phase, so I don't know yet whether DSx86 or DS2x86 will win anything. I am looking forward to seeing some results for that competition!
This version does not bring many improvements. It is mainly just a small maintenance update, for the Homebrew Bounty competition. I fixed a potential memory alignment problem in the EMS emulation, which could cause a BSOD exception. This happened in Colonization, at least. I also implemented several previously unsupported INT calls.
For the whole of this weekend I worked on a completely rewritten audio handling for DS2x86. Sadly, I was not able to make it work reliably yet, so I could not include that in this version. It has gotten quite frustrating, as the new code is much cleaner and faster than the current audio code, and it should work much better, but it just doesn't! I have been debugging it for several hours now, and everything seems to work as it should, but still the audio sounds very bad! Very frustrating!
Looks like I will need more time to thoroughly debug this problem, and compare the behaviour of the new code to that of the original audio code. The reason I want to rewrite the audio code is that it is nearly impossible to add other emulated audio sources (besides the SB digital audio) to the current code. I want to start working on the AdLib audio code soon, so I needed to rewrite the audio code so that it can be extended with other audio sources. But, the first step is obviously to make it run the plain SB digitized audio at least as well as the current code.
The Homebrew Bounty 2011 competition will close after tomorrow, so I believe DSx86 0.34 and DS2x86 0.10 are the final versions that I participate in the competition with. Any last minute changes have a high risk of breaking some previously working code, and I try to avoid releasing versions I haven't tested properly. Since both my emulators are still in a continuous "work-in-progress" state, the competition versions will in any case be kind of snapshots of the evolution of x86 emulation on the Nintendo DS. Now I'll just wait and see what happens, whether the judges feel my entries are worthy of some prizes. So, good luck, me! :-)
Sorry for this unscheduled release, but as this is the day of the original Homebrew Bounty deadline, I decided to release a quick fix version to address a couple of issues introduced in the previous 0.33 version. This version supports 'sudokuhax' for running in DSi mode again, and it also has the screen blitting and keyboard reading code in the same order as in the 0.32 version.
I had reports about DSx86 missing keyboard events in 0.33 when using Smooth scaling mode, and the only thing I had changed that could cause this was that I moved the keyboard reading before the screen blitting in the VBlank interrupt. The reason I did this was that reading the keys also reads the scrolling buttons L and R, and scrolling the screen must happen within the VBlank period or the screen will tear. However, adding the Smooth scaling options makes the screen blitting take so much time that the VBlank period has ended before the code gets to keyboard reading. I still don't quite see how swapping the order of the subroutines would cause key events to get missed, but it seems that changing the order back and just making sure the screen scrolling actually takes place only during the NEXT VBlank period fixes this problem.
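Deferring the scroll to the next VBlank can be sketched as a small "pending request" pattern. The names and the scroll directions assigned to L and R below are made up for the illustration, and the real DSx86 interrupt code is certainly more involved.

```c
typedef struct {
    int pending_dy;  /* scroll request recorded by the key reading code */
    int scroll_y;    /* scroll offset actually used by the display      */
} ScrollState;

/* Runs after the screen blit, possibly after VBlank has already ended.
   It only records the request and does not touch the display offset. */
void read_scroll_keys(ScrollState *s, int l_pressed, int r_pressed)
{
    if (l_pressed)
        s->pending_dy -= 1;
    if (r_pressed)
        s->pending_dy += 1;
}

/* Runs first thing in the NEXT VBlank interrupt, where changing the
   scroll offset is safe and will not tear the screen. */
void apply_pending_scroll(ScrollState *s)
{
    s->scroll_y += s->pending_dy;
    s->pending_dy = 0;
}
```

The key state is still sampled every frame, but its visible effect is delayed by one frame.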
The other problem introduced in version 0.33 was that I had to remove the DSi SD slot access code from libNDS 1.5.0 to make my AdLib emulation run without crashing. I now found a way to include that back in and still have my AdLib code run properly. I still need to dig into the original problem further to see what actually causes the hang. I originally got rid of the problem after I removed the SD reading code, the i2c handling code, and all the additional DSi mode ARM7 interrupt and FIFO handling. Now I managed to add the SD reading code (along with the FIFO stuff that that needs) back in, so I plan to check whether it is the interrupt, some remaining FIFO or i2c code that actually causes the problem. I have already tested the original libNDS 1.5.0 by only removing the i2c code, but that still caused a hang.
Anyways, I think I'll need to make some final touches also to DS2x86 before the actual Homebrew Bounty deadline, so this might very well be the final DSx86 Homebrew Bounty version.
The GBAtemp Homebrew Bounty 2011 is about to close, so I wanted to release a new version of DSx86 before the deadline. I had to choose between DSi and DS categories in the competition, which was a somewhat difficult decision as DSx86 does run in both. In the end I decided to put DSx86 into the DSi category, as pretty much all the other emulators in the competition are also in that category, and running DSx86 on a DSi does bring some worthwhile advantages.
Last week when I was testing DSx86 on my friend's DSi, I noticed that Supaplex demo kept hanging, same as Wing Commander II during the speech intro. First I thought the problem only affected DSi mode, but when I tested those games on my DS Lite, they also caused DSx86 to hang! That was a nasty surprise, with only a few days left before the bounty deadline! I tested all the older versions of DSx86, and noticed that version 0.23 (the last one built with libNDS 1.3.4) ran fine, while all the newer versions (built with libNDS 1.5.0 with DSi mode support) hang on both DSi and DS Lite!
So, the problem was obviously introduced with the new libNDS 1.5.0. I then began building different versions of libNDS 1.5.0, removing various stuff that had changed between 1.3.4 and 1.5.0 versions, to see what exactly caused the hang. Finally I got a version that ran DSx86 fine, but I had to remove quite a lot of stuff (like the DSi SD slot access, DSi i2c chip handling, new system FIFO commands, etc). I believe the problem could be caused by the ARM7 processor simply running out of stack space with the new ARM7 binary being so big. The libnds7.a from version 1.3.4 is 39 KB, while the libnds7.a from 1.5.0 is 193 KB! My AdLib emulation needs a lot of RAM so that I had to remove all WiFi and MAXMod stuff even in 1.3.4 version. It is quite possible that it just won't fit properly with all the extra stuff that 1.5.0 version has included.
Anyways, I am very sorry for all the recent versions of DSx86 having had this hanging problem! After fixing this I began working on some final touches to DSx86 so that I could be satisfied that it is as bug-free as I have time to make it. I mainly improved the EGA smooth scaling features, and I even added the most difficult 640x480 to 256x192 smooth scaling algorithm. This takes so much CPU power that it is only available in DSi mode, though. Smooth scaling in the lower resolutions, up to 640x400, is available also when running in DS Lite mode. Here below are some screen copies of Windows 3.00a and Castle of the Winds running on a DSi in 640x480 smooth scaling mode.
I also added a new DSx86.ini configuration option EMSSize to choose the emulated EMS (and thus also XMS) memory size. The size can be between 256KB and 4MB, and it needs to be in the [DSx86] section. For example, to select 1 MB as the EMS memory size (which leaves another 1 MB for XMS memory in DS Lite mode) you can configure it like this:
[DSx86]
LogFile=/data/dsx86/dsx86dbg.log
EMSSize=1

DSx86 is able to figure out the units automatically, so you can give the amount in either kilobytes or megabytes. If you mainly run Windows in DSx86 on DS Lite, you can turn the EMS memory size down so that Windows has more XMS memory, or if you run games that need more EMS memory, you can increase the EMS size (up to 4MB or however much RAM is available). Note that this setting is not configurable per game or another program you run, as 4DOS wants to keep track of the available memory.
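The post does not say how the units are detected. One plausible rule, given the allowed range of 256KB to 4MB, is to treat the values 1 to 4 as megabytes and anything larger as kilobytes. The sketch below implements that assumed rule. It is not taken from DSx86, and the function name is hypothetical.

```c
#include <stdlib.h>

/* Parse an EMSSize value and return the size in kilobytes, or -1 if it
   falls outside the 256KB..4MB range.
   ASSUMPTION: 1..4 means megabytes, larger numbers mean kilobytes. */
int parse_ems_size_kb(const char *value)
{
    long n = strtol(value, NULL, 10);
    long kb;

    if (n >= 1 && n <= 4)
        kb = n * 1024;   /* small numbers are taken as megabytes */
    else
        kb = n;          /* everything else is taken as kilobytes */

    if (kb < 256 || kb > 4096)
        return -1;
    return (int)kb;
}
```

Under this rule both EMSSize=1 and EMSSize=1024 select 1MB, and EMSSize=512 selects 512KB.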
So, to recap, here are the changes in DSx86 version 0.33:
This version has the following improvements:
Below are some examples for EGA screen scaling. Silpheed uses 640x200 mode, and it has used smooth scaling in DSx86 as well. EGATrek uses the highest resolution available on EGA displays, 640x350. Since the vertical resolution is not easily scalable to 192 vertical rows, I instead scale it 2:1 to 175 vertical rows. Thus, there are some black rows on the bottom of the screen. The mode used by Mahjong Fantasia (640x400) is not a proper graphics mode that the EGA/VGA BIOS support, instead the game initially switches to 640x200 mode, and then doubles the vertical resolution by directly accessing the EGA card registers. Finally, A-Train is an example of the VGA high resolution 640x480 mode.
Silpheed (640x200) | EGATrek (640x350) |
Mahjong Fantasia (640x400) | A-Train (640x480) |
I also debugged the Windows 3.00a crashing problem, but could not yet solve that issue. I worked also a little bit on Windows 3.1 support, which complained about there not being enough XMS memory. I found and fixed this problem, but it is using some new protected mode opcodes and features that I did not have time to code for this version, so it will crash with unsupported opcode errors. I'll see if I can make it run better in the next version. Making Windows 3.1 run should also help me in locating the problem in Windows 3.00a.
There are also other misbehaving games on my TODO list, and I'll continue debugging these and fixing the problems. Thanks again for all of you who have tested the games and sent me debug logs and other information to help me in fixing the problems in DS2x86!
After the 0.08 release I began debugging the games that hang or have other such problems where they do not crash into the debugger. The first game I debugged was X-COM UFO: Enemy Unknown. It had two problems: the intro hangs, and if you start the game directly, there is a problem with mouse handling when you should select the location for the home base. I debugged the intro first, and when breaking into the debugger (after it hung), continuing, and breaking again, it seemed to get stuck in the following small routine in its code:

mov ah,02
int 1A // INT 1A, AH=02 (Get Real-Time Clock Time)
mov bl,dh // Save DH (seconds value) into BL
15B10E:
mov ah,02
int 1A // INT 1A, AH=02 (Get Real-Time Clock Time)
cmp bl,dh // Is the saved seconds value the same as the new seconds value?
je 0015B10E // Back to loop if still the same second.

The immediately obvious thing that could cause a hang in that situation would be if the time reported by the software interrupt (BIOS routine) never changes. And indeed, when running this code, the time returned was always 01:06:53! That was strange, as I am using the exact same C code in DS2x86 as I used in DSx86, and I have never had a problem in DSx86 with that. The code looked like this:
case 0x02:
// TIME - GET REAL-TIME CLOCK TIME (AT,XT286,PS)
// Return:CF clear if successful
// CH = hour (BCD)
// CL = minutes (BCD)
// DH = seconds (BCD)
// DL = daylight savings flag (00h standard time, 01h daylight time)
// CF set on error (i.e. clock not running or in middle of update)
{
time_t unixTime = time(NULL);
struct tm* timeStruct = gmtime((const time_t *)&unixTime);
SET_CX((((timeStruct->tm_hour/10)%10)<<(8+4))|((timeStruct->tm_hour%10)<<8)|
(((timeStruct->tm_min/10)%10)<<(4))|((timeStruct->tm_min%10)));
SET_DX((((timeStruct->tm_sec/10)%10)<<(8+4))|((timeStruct->tm_sec%10)<<8));
}
CLR_CF;
return 0;
It uses the C standard library time functions to get the current time and then
modifies it to BCD values to be returned by the INT 1A function. However, it seems
that the DSTwo SDK has no support for using the C standard time functions! I needed
to replace the C standard function with the DSTwo-specific ds2_getTime() function
to get this routine (and also the similar DOS routine) to work properly! Beware
of this peculiarity if you are porting any existing C code to DSTwo!
case 0x02:
// TIME - GET REAL-TIME CLOCK TIME (AT,XT286,PS)
// Return:CF clear if successful
// CH = hour (BCD)
// CL = minutes (BCD)
// DH = seconds (BCD)
// DL = daylight savings flag (00h standard time, 01h daylight time)
// CF set on error (i.e. clock not running or in middle of update)
{
struct rtc tmp;
ds2_getTime(&tmp);
SET_CX(((((tmp.hours > 40 ? tmp.hours - 40 : tmp.hours)/10)%10)<<(8+4))|((tmp.hours%10)<<8)|
(((tmp.minutes/10)%10)<<(4))|((tmp.minutes%10)));
SET_DX((((tmp.seconds/10)%10)<<(8+4))|((tmp.seconds%10)<<8));
}
CLR_CF;
return 0;
After fixing this routine, the X-COM: UFO intro continued fine into the actual game.
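The BCD packing that both versions of the routine do inline can also be written with a small helper, which makes the register layout easier to see. This is a restatement for illustration only: the function names are hypothetical and the SET_CX/SET_DX macros of the emulator are replaced by plain output parameters.

```c
#include <stdint.h>

/* Convert a value 0..99 into packed BCD, e.g. 53 -> 0x53. */
uint8_t to_bcd(unsigned v)
{
    return (uint8_t)((((v / 10) % 10) << 4) | (v % 10));
}

/* Pack a time the way INT 1A, AH=02 returns it:
   CH = hours, CL = minutes, DH = seconds (all BCD), DL = 0. */
void pack_rtc_time(unsigned hours, unsigned minutes, unsigned seconds,
                   uint16_t *cx, uint16_t *dx)
{
    *cx = (uint16_t)((to_bcd(hours) << 8) | to_bcd(minutes));
    *dx = (uint16_t)(to_bcd(seconds) << 8);
}
```

For the stuck time 01:06:53 mentioned above, this gives CX = 0x0106 and DX = 0x5300.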
Next, I debugged the mouse input problem in X-COM UFO. It took me a while to trace into the code that handles the actual mouse input. The mouse interrupt just saves the mouse position and button state into some global variables in the game code, and the code that then reads the values and performs some actions is somewhere else. I could not get my debug version of DOSBox to run the game properly, so I had to search for the correct code with the DS2x86 debugger. When following the call stack upwards I finally found the main game loop, and luckily the mouse handling code was at the top level at the end of the main game loop. I debugged into that code, and soon found the routine that handles the conversion of the mouse coordinates into location on the Earth. The routine begins fine, but then starts to use a large number of floating point operations, none of which are yet supported in DS2x86. So, no wonder the coordinate conversion to the Earth globe does not work. Here below is the beginning of the coordinate conversion routine. All the opcodes starting with the letter f are FPU instructions, which are simply skipped in DS2x86. It is easy to see that the result of this function will most likely not be correct.
18B7E0
push ebx
push esi
push edi
sub esp,00000098
cbw
sub eax,00000080
mov [esp+00000094],eax
fild [esp+00000094]
fstp [esp+20]
movsx eax,dx
sub eax,64
mov [esp+00000094],eax
fild [esp+00000094]
fstp [esp+14]
fld [esp+20]
call 1962C8
fistp [esp+00000094]
mov eax,[esp+00000094]
mov ebx,FFFFFFFF
cwd
xor eax,edx
sub eax,edx
movsx edx,[001A8428]
mov [esp+00000090],bx
cmp eax,edx
jg 18BF59 ($+708)
fld [esp+14]
call 1962C8
fistp [esp+00000094]
mov eax,[esp+00000094]
cwd
xor eax,edx
sub eax,edx
movsx edx,[001A8428]
cmp eax,edx
jg 18BF59 ($+6DD)
fildw [001A8428]
fld1
fdivrp st(1), st
fld [esp+20]
fmul st(1)
fstp [esp+40]
fmul [esp+14]
fstp [esp+6C]
fld [esp+40]
fmul st(0)
fstp [esp+30]
fld [esp+6C]
fmul st(0)
fstp [esp+10]
fld [esp+30]
fadd [esp+10]
fld1
ficompw st(1)
fstsw ax
sahf
jb 18BF59 ($+698)
fld1
fsub [esp+30]
fld st, st(0)
fsub [esp+10]
fsqrt
fstp [esp+08]
fsqrt
fstp [esp+38]
test dword [esp+38],7FFFFFFF
fld [esp+08]
fdiv [esp+38]
fstp [esp+70]
fld1
fcomp [esp+70]
fstsw ax
sahf
jnc 18B900 ($+8)
mov dword [esp+70],3F800000
18B900
fld [esp+70]
...

I am thinking of adding a break into the debugger and a warning message whenever the first unsupported FPU opcode is encountered for the current game, with the option to continue. At that point it is likely that the game will fail to run correctly, and this warning message would be a clear indicator as to why the game will fail. However, I do plan to at least look into adding FPU support into DS2x86 in the future, as that would again be an interesting new feature (much like the protected mode support). Adding the FPU support will take months, as there are a lot of new opcodes to handle, so it won't come very soon. In any case, I plan to add audio support (which could also take months) before the FPU support.
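The "warn on the first unsupported FPU opcode" idea could be sketched as follows. On x86 the x87 FPU instructions use the escape opcode bytes D8 to DF, so a simple mask test is enough to recognize them. Everything else here (the function names, the one-shot flag, the message text) is hypothetical and only shows the shape of such a check.

```c
#include <stdint.h>
#include <stdio.h>

/* Set once the warning has been shown for the current game. */
static int fpu_warning_shown = 0;

/* x87 instructions start with an escape opcode byte in the range D8..DF. */
int is_fpu_escape(uint8_t opcode)
{
    return (opcode & 0xF8) == 0xD8;
}

/* Returns 1 the first time an FPU opcode is seen, so the caller can break
   into the debugger and let the user decide whether to continue.
   Returns 0 for non-FPU opcodes and for all later FPU opcodes. */
int check_unsupported_fpu(uint8_t opcode)
{
    if (!is_fpu_escape(opcode) || fpu_warning_shown)
        return 0;
    fpu_warning_shown = 1;
    fprintf(stderr, "Warning: unsupported FPU opcode %02X, "
                    "the game will probably not run correctly.\n", opcode);
    return 1;
}
```

In the listing above, the first fild instruction would trigger the warning, and the remaining FPU opcodes would then be skipped silently as before.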
I have had reports and have also myself noticed that the Windows 3.00a Standard Mode support in DS2x86 is not very stable. It seems to crash after a short while, and the crash seems to be caused by the stack either getting corrupt or the stack pointer pointing to somewhere else than where the stack actually is. I haven't yet determined which it is, as the crashes are so intermittent and never seem to happen at exactly the same place, so it takes a lot of debugging to find the cause. I am working on this, but it might not be fixed in the next version yet.
For the last couple of days I have been working on improving the smooth scaling features of DS2x86. I have so many partially implemented features in DS2x86 that it is getting frustrating, so I thought I'd start pruning my TODO list by implementing all the missing screen scaling modes. I started with the Mode-X modes, as they are quite straightforward (one input byte is one pixel). There are various screen configurations that can be used with Mode-X, the most common being 320x200 (scaling already supported) and 320x240 (which gives a nice 1:1 pixel aspect ratio). The vertical resolution can also be doubled for both of those, giving 320x400 and 320x480 modes, which are relatively uncommon (probably due to the awkward aspect ratio). Then there are also the weird 360-pixel wide modes, which are luckily even less common.
I have now managed to code the scaling support for all these Mode-X modes, using some test games I have noticed using these modes. The 320x200 mode has been supported already, it scales only the horizontal resolution with a 5:4 scaling, leaving the 8 vertical pixels to be scrolled if needed. The 320x400 mode also scales horizontally by 5:4, and also vertically by 2:1, so that each output row is averaged from two adjacent input rows. This also leaves 8 output pixels hidden and needing scrolling to get visible.
The 240-row and 480-row vertical resolutions need more work in scaling to 256x192, so they will also be slower. Both are in principle scaled 5:4 in both directions, with the 320x480 resolution first scaled to 320x240 using 2:1 vertical scaling. In the latter mode each output pixel might need 8 input pixels to be read and converted from palette index to RGB value, so it will obviously be rather slow. The game I used for testing this mode was my old Linewars II game, as it has the option to run either in MCGA 320x200 or Mode-X 320x480 mode.
The 360x240 mode is so rare that I am not willing to write specific scaling code for it; instead it simply scales 320 horizontal pixels to 256, leaving the rest to be scrolled. An exact horizontal scaling from 360 to 256 pixels would need a 45:32 scaling ratio, which is not simple to make fast. Here below are some screen copies of the scaling result from various games using the various Mode-X screen modes.
Trekmo (320x240) | Settlers (360x240) |
Destruction Derby (320x400) | LineWars II (320x480) |
The macro I am using to average two 16-bit ARGB color values into a new 16-bit ARGB value is below. This is based on the Quick Color Averaging article I mentioned on the April 3rd blog post. The advantage is that this macro can calculate the separate R, G and B color channels in one go.
.macro average ret in1 in2
.set noat
xor AT, \in1, \in2 // AT = (in1 ^ in2)
and AT, 0x7BDE // AT = (in1 ^ in2) & 0x7BDE (clean up the underflow pixels)
and \ret, \in1, \in2 // ret = (in1 & in2)
srl AT, 1 // AT = (((in1 ^ in2) & 0x7BDE) >> 1)
addu \ret, AT // ret = (((in1 ^ in2) & 0x7BDE) >> 1) + (in1 & in2) = (in1 + in2) / 2
.set at
.endm

In many cases I need a weighted (75%/25%) average instead of the unweighted average that this macro produces, so in those cases I need to call this macro twice, like this:

average t6, t3, t4 // t6 = average of colors t3 and t4.
average t6, t3, t6 // t6 = 75%/25% weighted average of colors t3 and t4.

If you are mathematically inclined and can figure out a faster method of calculating a weighted average (the weights are always 75%/25%) of two ARGB color values, please let me know!
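For readers who prefer C to MIPS assembly, the same averaging trick can be written as below. The average() and average_75_25() functions mirror the macro and its double use. The scale_5_to_4() function is only a guess at how a 5:4 horizontal scaling could be built from them: the choice of weights per output pixel is an assumption, not something stated in the post.

```c
#include <stdint.h>

/* Per-channel average of two 16-bit colors (5 bits per channel).
   Masking with 0x7BDE drops the lowest bit of each channel (and bit 15)
   from the XOR term, so the shift cannot leak into a neighbour channel. */
uint16_t average(uint16_t a, uint16_t b)
{
    return (uint16_t)((a & b) + (((a ^ b) & 0x7BDE) >> 1));
}

/* 75% of a and 25% of b, done as two unweighted averages. */
uint16_t average_75_25(uint16_t a, uint16_t b)
{
    return average(a, average(a, b));
}

/* ASSUMED weights: turn 5 input pixels into 4 output pixels. The outer
   pixels lean towards the edges, the inner ones blend their neighbours. */
void scale_5_to_4(const uint16_t in[5], uint16_t out[4])
{
    out[0] = average_75_25(in[0], in[1]);
    out[1] = average(in[1], in[2]);
    out[2] = average(in[2], in[3]);
    out[3] = average_75_25(in[4], in[3]);
}
```

Averaging white (0x7FFF) with black (0x0000) gives 0x3DEF, which is 15 in every channel, and the 75%/25% mix of the two gives 0x5EF7, which is 23 in every channel. Each step rounds down.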
The next step is to create similar proper scaling functions for the EGA screen modes, 640x200, 640x350, 640x400 and 640x480. This work I will begin next, using pretty much the same ideas I have used in these Mode-X scaling routines. I also still have some opcodes to implement, and then I will debug a few more misbehaving games. If I happen to find the problem in Windows Standard Mode before the next weekend, I'll fix that as well, but I fear that bug will take more time to hunt down.
Thanks again for all your debug logs and other material you have sent me! It is nice to know that you continue to be interested in this project!
This version has a lot of previously missing opcodes implemented, so a few more games might run again. I did not have time to add support for new INT calls or I/O ports into this version, so those errors will still happen in the games that previously had these errors. I hope to be able to handle these in the upcoming versions.
I have also debugged some ill-behaving games, but sadly that has been mostly frustrating work with no clear improvements in the behaviour. I have collected a lot of information and have been able to close in on the actual problems, but have not been able to solve them yet.
After spending many days with various problems in the games, and getting frustrated, I decided to see if I could get Windows 3.00a to run in Standard Mode also in DS2x86. This basically meant porting the 16-bit protected mode features I had coded into the original DSx86 to DS2x86. This work progressed better, so that this version can run Windows 3.00a in both Real and Standard mode. The Enhanced mode would need support for virtual memory, which is not coded in yet, so that mode does not work.
The problem in Windows 3.00a that I have been fighting with today was that opening some programs, like Paintbrush, gave an error "File PBRUSH.DLL not found" or similar. I checked the Windows directory and the file does exist, and I even checked my SD card for errors, but still I got that error. Next I checked my emulated DOS System File Table contents, which has room for 64 entries, but there were only 16 entries used so that was not the cause. I debugged my DOS file routines, and made sure the directory and file name are exactly correct, and then found out that it was the DS2 SDK fopen() function that returned NULL for the file. Finally it occurred to me to check the SDK file system functions (which luckily come with source) to see what the file table size is there, and it had only 16 entries! So, I doubled the size to 32 entries, and now Windows 3.00a seems to run. I think I could increase the table to 64, to match my emulated DOS file table, but I'll do that later when I have more time to test that everything still works.
This DSx86 version has only one minor fix, as I have been focusing on DS2x86 during this two-week period. I implemented the direct file reading into EGA VRAM, which is used by games like Rockford and Heimdall. Here is a screen copy of the title screen from Rockford, which stayed black in the previous versions as the graphics are read directly to video RAM.
Happy May 1st! Since releasing the 0.07 version of DS2x86 I have received several new debug log files. For the past week I have been adding the missing opcodes from the logs, and I have now added nearly all of them. The remaining issues in the logs are either due to missing virtual memory support (which is not coming very soon), or the game executing data instead of code. So, instead of adding new opcodes I can now move my focus to making the emulation itself more robust.
On Saturday I improved the exception (BSOD) handling again. I noticed that several games, for example Carmageddon and Destruction Derby, crashed within my ASM code, so that the call stack was not sufficient to determine which opcode caused the error. I added some checks to the exception handler to determine if the crash happened inside my ASM code, and if so, made the exception handler return gracefully to the debugger breakpoint exit. Thus, I know the opcode that caused the exception was the opcode immediately before the one the debugger points to. This has the added bonus that the exception info gets written to the dsx86dbg.log, so you don't need to write it down from the blue screen.
After that change, I then debugged both Carmageddon and Destruction Derby further. I noticed that both of them crash because of floating point opcodes not being supported. I don't plan to add floating point opcodes (as the DSTwo architecture does not have a floating point unit), and since I could not get Carmageddon to work on DOSBox (after turning off the floating point support in it as well) either, I decided to ignore Carmageddon. It will not work in DS2x86 in the foreseeable future. This is what the crash log in Carmageddon looks like with the new exception handling:
------------------- [STYLE] --------------------
Exception 2 at 801A33F0!
TLB miss on load from 08241B60!
CPU: PROT, USE32, CPL=0
GraphMode=03, EGAMode=00, Chain4=OFF
EAX=0029274D EBX=00281E46 ECX=00281E46 EDX=00000004
ESP=002C9B6C EBP=00133111 ESI=002C9C70 EDI=020906D8
DS=0168 ES=0168 SS=0168 CS=0160 FS=0000 GS=0020
NV UP EI PL NZ NA PE NC VM=0 IOPL=0
0160:1CCF83 3C00 cmp al,00
Disassembly of code around the location:
0160:1CCF63 E8F0970500 call 00226758 ($+597f0)
0160:1CCF68 DB9C24A4010000 fistp [esp+000001A4]
0160:1CCF6F 6BBC24A40100000E imul edi,[esp+000001A4],0E
0160:1CCF77 B894272900 mov eax,00292794
0160:1CCF7C 01C7 add edi,eax
0160:1CCF7E 57 push edi
0160:1CCF7F 8A06 mov al,[esi]
0160:1CCF81 8807 mov [edi],al
0160:1CCF83 3C00 cmp al,00
0160:1CCF85 7410 je 001CCF97 ($+10)
Looking at the above log, the crash happens at address 0160:1CCF81, where the AL register value is being written to the memory address in the EDI register. The EDI register contains the value 0x020906D8, which is above the 16MB of RAM that DS2x86 has available, thus the address is not within the mapped RAM. (In case you are wondering why the crash happens with address 0x08241B60 instead of 0x020906D8, that is because addresses outside the mapped memory area default to Mode-X/EGA RAM handling, where the address is first multiplied by 4 before it is used.) The value in EDI is calculated by multiplying the contents of memory address [ESP+000001A4] by 0x0E, and then adding 0x00292794 to that value. However, the content of memory address [ESP+000001A4] should be the result of the unsupported floating point operation fistp, so this memory address will still contain whatever it held before this unsupported opcode. Thus, multiplying a random value by 14 and then adding something to it can easily point outside the 16MB RAM area. There is no proper way of fixing this without adding support for floating point opcodes.
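The address arithmetic in the explanation above can be checked with a few lines of Python. This is only a worked example using the values from the log; the function names are made up, and the stale slot value 2246350 is simply one value that reproduces the logged EDI, not something read from the game.

```python
RAM_SIZE = 16 * 1024 * 1024  # the 16MB of RAM mentioned in the text


def edi_from_slot(slot_value: int) -> int:
    """imul edi,[esp+1A4],0E followed by add edi,00292794 (32-bit wrap)."""
    return (slot_value * 0x0E + 0x00292794) & 0xFFFFFFFF


def unmapped_address(address: int) -> int:
    """Addresses outside mapped RAM are multiplied by 4 before being used."""
    return (address * 4) & 0xFFFFFFFF


edi = 0x020906D8
print(edi >= RAM_SIZE)                 # True: EDI is above the 16MB area
print(hex(unmapped_address(edi)))      # 0x8241b60, the address in the TLB miss
print(hex(edi_from_slot(2246350)))     # 0x20906d8, the logged EDI value
```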
However, Destruction Derby seems to work on DOSBox, even though it attempts to run floating point opcodes there also. Thus, there is something else wrong in DS2x86 with this game. I spent the whole of Saturday debugging this problem, but did not yet find the cause. In the end I got frustrated with that issue, and looked into the Arkania hanging problem. I noticed that it hangs also in DSx86, so this issue seems to be something that is common to both my emulators. I have now deciphered the timer code that causes a never-ending loop. The game tries to measure the time it takes to draw a full VGA frame, using the timer. The code first checks that two adjacent measurements do not differ by more than 50 timer ticks (where the timer runs at 1,193,182 Hz, so 50 timer ticks is about 42 microseconds). If the difference is larger, the code loops to try again. Surprisingly, this does not seem to cause big problems in DSx86 nor in DS2x86; the code might need to try a few times but this does not cause a never-ending loop.
The next step in the timer check code is where the problem seems to be. The code gets the timer result, subtracts 1000 from it, divides the result by two, and then checks that the resulting value is larger than 14336. Counting backwards from the test value, it seems the original timer value must be larger than 29672 for the code to accept it. I don't quite understand how that would make sense, as that timer value corresponds to 25ms (or about 40Hz), and for the timer value to be bigger the screen refresh rate would need to be slower! The VGA screen in DSx86 and DS2x86 is refreshed at about 60Hz, so the code never sees slower than 40Hz screen refresh rate and thus hangs. However, in DOSBox the resulting timer value is about 34000 (28ms), which is what Arkania accepts. But that speed corresponds to a 35Hz VGA refresh rate! In DSx86 and DS2x86 the resulting timer value is around 23000 (20ms), which would be around 50Hz. I'll need to look deeper into DOSBox sources and the VGA documentation to make some sense of these values.
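To make the numbers above easier to follow, here is a small Python sketch of the arithmetic. It assumes the divide by two is an integer division (which is only a guess at what the game code does), and the function names are made up for illustration.

```python
PIT_HZ = 1_193_182  # PC timer input clock in Hz, as quoted in the text


def ticks_to_ms(ticks: int) -> float:
    """Convert a timer tick count to milliseconds."""
    return ticks * 1000.0 / PIT_HZ


def ticks_to_hz(ticks: int) -> float:
    """Refresh rate implied by one frame taking this many ticks."""
    return PIT_HZ / ticks


def arkania_accepts(ticks: int) -> bool:
    """The check described above: (ticks - 1000) / 2 must exceed 14336."""
    return (ticks - 1000) // 2 > 14336


for label, ticks in (("threshold", 29672), ("DOSBox", 34000), ("DSx86/DS2x86", 23000)):
    print(label, round(ticks_to_ms(ticks), 1), "ms",
          round(ticks_to_hz(ticks), 1), "Hz", arkania_accepts(ticks))
```

With these figures a frame measured at 60Hz takes far fewer ticks than the check wants, which matches the hang described above.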
Thanks again for the debug logs you have sent, and for the games to test that some of you have attached! I'll look into these during the next week, and hope to release a new version again next weekend.
Sorry for the one day delay to my normal Sunday updates, but as this was a long Easter weekend I wanted to use the one extra day to work on DS2x86 and get as many fixes implemented as possible.
Okay, this new version has the following fixes:
After the previous blog post I worked on Jazz Jackrabbit for a while, but could not get any notable progress with it. It still needs a lot of debugging to determine the problem, and so I decided to move on to some other issues that might be faster to solve.
First, I decided to add HIMEM.SYS emulation, as I could simply port the code from the most recent DSx86 version. This in itself didn't take long, although I ran into a very strange problem where the whole DS2x86 hung immediately when launching any game. It loaded 4DOS.COM fine, and I could give internal commands like "dir" or "memory", but immediately after launching any game it hung totally. I then began commenting out the new HIMEM.SYS stuff, until the only things remaining were a couple of static integer variables in the new C++ module! Commenting these variables out got rid of the problem, but when I added these static variables back, the problem came back. One of the variables had an initial value based on the addresses of two other variables, which seemed to cause the problem. I switched to using a #define, and the problem was solved, but I still don't quite understand how a static variable initialization can break the whole software.
After the HIMEM.SYS emulation started working, Chaos Engine began to run fine otherwise, except that the display had a moving line of black pixels, and scrolling the zoomed screen did not work properly. The game wraps the display around the A000 segment while scrolling (similarly to what Commander Keen 4 does in EGA mode), and this feature was not yet supported in my Mode-X screen blitting code. I took some time to make this work correctly in both Zoom and Scale screen modes, and now this type of wrap-around should cause no more visual glitches. This is what Chaos Engine now looks like, with a wrap around in the middle of the screen (but not showing :-).
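The wrap-around idea can be illustrated with a tiny Python sketch: when a row of source pixels runs past the end of the video memory area, the rest of the row continues from the start. This is only a model of the concept with a hypothetical helper name, not the actual blitting code, and it assumes a row is never longer than the whole buffer.

```python
def read_wrapped(plane: bytes, start: int, length: int) -> bytes:
    """Read `length` bytes from `plane` beginning at `start`, wrapping at the end.

    Assumes length <= len(plane), so at most one wrap is needed.
    """
    start %= len(plane)
    first = plane[start:start + length]   # part before the end of the buffer
    rest = plane[:length - len(first)]    # part that wrapped to the beginning
    return first + rest
```

For example, with an 8-byte buffer a 4-byte read starting at offset 6 returns bytes 6, 7, 0 and 1.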
With the HIMEM.SYS emulation in place, I wanted to see whether Windows 3.00a would run in Standard Mode. I had several problems with it, all due to various bugs in my protected mode implementation which I had to fix. There is still a problem with the Global Descriptor Table handling, as Windows uses a segment selector 0x0101 when returning from protected mode to real mode, but in DS2x86 the GDT entry points to a data segment, while in DSx86 and DOSBox it points correctly to a code segment. So, no Windows 3.00a in Standard Mode yet in this version, but I'll try to fix this problem in the near future.
Next, I was looking whether I had Hexen on my SD card. I didn't find it, but as there was Heimdall, I decided to test how it runs. It started fine, but when talking to a character I got a BSOD reporting an invalid store address of 0x600CE030. In this same situation DSx86 shows garbage on the screen, so I thought that this might be a chance to find out what exactly is wrong in my emulation code. I checked the address reported by the BSOD, but it pointed to inside the C library memcpy() function. Obviously, memcpy is used all around the code, so that was not of much help. I needed to find out what code it was that called the memcpy, so I implemented a simple stack traversal routine into my exception handler code. It attempts to find return addresses pointing to my own DS2x86 code by traversing the stack above the exception handler stack frame. When I ran Heimdall again, I got a stack trace that (after converting the hex addresses to function names using the dump file) displayed the following call hierarchy: run() -> INTHandler() -> int21h() -> DOSRead() -> fat_fread() -> memcpy().
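The stack traversal idea described above can be modelled in a few lines of Python: scan the words on the stack, keep the ones that look like return addresses inside the emulator's own code, and map them to function names using a symbol table. The real handler works on raw memory in the exception handler; all addresses, ranges and helper names below are invented for illustration.

```python
from bisect import bisect_right


def find_return_addresses(stack_words, code_start, code_end):
    """Keep word-aligned stack values that fall inside our own code range."""
    return [w for w in stack_words
            if code_start <= w < code_end and w % 4 == 0]


def symbolize(addresses, symbols):
    """Map addresses to names; `symbols` is a sorted list of (start, name)."""
    starts = [start for start, _ in symbols]
    names = []
    for addr in addresses:
        i = bisect_right(starts, addr) - 1   # last symbol starting at or before addr
        names.append(symbols[i][1] if i >= 0 else "?")
    return names


# Hypothetical symbol table and stack contents (innermost frame first).
SYMBOLS = [(0x80100000, "run"), (0x80101000, "INTHandler"),
           (0x80102000, "int21h"), (0x80103000, "DOSRead"),
           (0x80104000, "fat_fread")]
STACK = [0x80104010, 0x00000005, 0x80103024, 0x002C9B6C,
         0x80102100, 0x80101040, 0x12345678, 0x80100008]

hits = find_return_addresses(STACK, 0x80100000, 0x80105000)
print(" -> ".join(reversed(symbolize(hits, SYMBOLS))))
```

A scan like this can produce false positives when a data value happens to look like a code address, so the result is a hint rather than a guaranteed call chain.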
So, the problem was that the DOS int21 handler used an invalid memory address to store the data read from disk. However, since the address is calculated from the DS:DX registers, which can only point to the real mode memory within the first megabyte of DOS RAM, it should not be possible for the address to point outside the emulated DOS RAM area. Except, that I have mapped the graphics memory segment A000 differently, in order to trap accesses to graphics memory. It had not occurred to me that a game would read data from disk DIRECTLY to graphics memory, so I hadn't trapped the graphics memory address inside the DOS file routines. And, as it turned out, Heimdall loads data directly to EGA VRAM from disk, so trapping the address in the DOSRead routine fixed this problem. This is obviously also the reason why Heimdall in DSx86 displays garbage on the screen, it just does not crash like DS2x86 did. I'll fix this problem into DSx86 also for the next DSx86 version.
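As a rough illustration of the fix, the sketch below shows a file read routine that checks whether each destination address falls inside the A000 segment and, if so, hands the byte to a separate video RAM handler instead of writing it to normal RAM. It is a simplified Python model under stated assumptions: the names are hypothetical, addresses wrap at one megabyte, and a real routine would copy in blocks rather than byte by byte.

```python
VGA_START = 0xA0000   # linear address of segment A000
VGA_END = 0xB0000     # end of the A000 segment (exclusive)


def linear_address(segment: int, offset: int) -> int:
    """Real mode segment:offset to linear address (assumes 1MB wrap)."""
    return ((segment << 4) + offset) & 0xFFFFF


def dos_read(data: bytes, ds: int, dx: int, ram: bytearray, write_vram_byte) -> None:
    """Store file data at DS:DX, routing A000-segment bytes to the VRAM handler."""
    for i, byte in enumerate(data):
        addr = linear_address(ds, (dx + i) & 0xFFFF)
        if VGA_START <= addr < VGA_END:
            write_vram_byte(addr - VGA_START, byte)   # trapped graphics access
        else:
            ram[addr] = byte                          # ordinary DOS RAM
```

Here `write_vram_byte` stands in for whatever routine handles emulated graphics memory writes.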
At this point it was already Saturday evening, so I decided to work on the text mode cursor emulation on Sunday. I had tried to implement the cursor emulation earlier, but at that point it made the "stuck key" problem much worse, most likely because the text mode screen then needed to update much more often. (I use a dirty buffer approach with the text mode handling, so that I don't need to convert fonts to graphics for characters that have not changed.) But now with the 60Hz -> 59Hz screen refresh rate change, I thought I could try to add the cursor emulation again. It took me most of the Sunday to implement, as I had an unexpected bug in the dirty buffer handling. It had caused no symptoms until I implemented the cursor, so it took me a while to find the routine that contained the bug. In any case, by Sunday afternoon I got the cursor emulation working.
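For readers unfamiliar with the dirty buffer idea, here is a minimal Python sketch of it: the previous text screen contents are kept, and only the cells that differ (plus the old and new cursor cells when the cursor moves) are redrawn. This is a generic illustration with made-up names, not the actual DS2x86 routine.

```python
def cells_to_redraw(current, previous, cursor=None, old_cursor=None):
    """Return sorted indices of text cells that need to be drawn again."""
    # Cells whose character or attribute changed since the last refresh.
    dirty = {i for i, (new, old) in enumerate(zip(current, previous)) if new != old}
    # A moved cursor dirties both the cell it left and the cell it entered.
    if cursor != old_cursor:
        dirty.update(pos for pos in (cursor, old_cursor) if pos is not None)
    return sorted(dirty)
```

A blinking cursor would additionally need its cell redrawn on every blink phase change, which this sketch leaves out.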
On Sunday evening I then continued debugging Ascendancy, which hung DS2x86 completely (it is just one of many DOS4GW games that seem to exhibit similar behaviour). I had already a couple of days earlier found the high-level call that causes the hang, so now I just burrowed through to the lower level calls. Finally I found out that it was a simple and common REP MOVSW opcode that caused the hang. This was a bit strange, as this opcode is called thousands of times in all games, so why would it hang in this case? The addresses and counts looked fine; it tried to move 512 bytes from high memory to the beginning of a segment in low memory. I had to go through my helper macros one by one until I finally found the problem. In the code that tests whether the output offset wraps around the segment, the 32-bit routine (where the segment size can be a full 4GB), a temporary variable caused an overflow when the input offset was zero. It is rare for the 32-bit input offset to be zero, as usually the protected mode programs use flat segments starting at the beginning of the RAM. Since the beginning of the DOS RAM contains real mode interrupt vectors, the protected mode offsets normally do not start from zero. In this case the segment did not start at the beginning of the RAM, so a zero offset was OK. After fixing the helper macro I got rid of this hang, so in the future I can debug these games properly.
So, that's about the amount of work I managed to do for this version. I also added many missing opcodes from the debug logs you have sent, thanks again for sending me those! I think pretty much all the opcode errors in the logs have now been implemented, except for some obscure problems that are most likely caused by the program executing data. These I will continue working on.
After I got DSx86 version 0.31 released, I went back to improving DS2x86. It has been on a small hiatus, so I first needed to remind myself at what state I left it when I began working on the DSx86 enhancements. First I changed the Scaled screen mode to use the same speed enhancements as DSx86, as that was a simple and straightforward change. Next, I went through the debug logs that you have sent me for version 0.06, and implemented some simple still missing opcodes.
I also spent a couple of hours trying to get rid of the "stuck key" problem, that is, the intermittent problem where not all key presses/releases get recognized. I suspected that my use of the timer interrupt handler for performing the DSTwo SDK screen and audio refresh stuff might be the cause, so I hacked together a version where the timer interrupt only sets a variable to request a screen refresh, and the actual refresh routine is called in the main code. However, sadly this did not have any effect on the key problem, the keys still got stuck. But at least I found out that the timer interrupt does not cause this problem. The next thing I tested was changing the timer interrupt speed from the original 60Hz speed to 59Hz. Curiously, that seemed to help at least somewhat. It seems like the key reading issue is caused by some synchronization problem between the SDK timing and my internal timer. I think I will leave the timer interrupt running at 59Hz in the next version, and I'll let you test and report whether the "stuck keys" are still a problem.
The next step was to download some games that have some more difficult problems, like hanging, jumping to zero segment or BSOD exceptions. I found five games to test: Pinball Fantasies, Xargon, The Chaos Engine, Micro Machines and Zool 2. This is the current status with those test games:
Currently I am working on the Jazz Jackrabbit game, or more generally, the Borland RTM DOS extender support. I last worked on it during January, and it has not progressed further since then. I have now been debugging it for a few hours, and have determined that some data that it copies from the low memory to the extended memory does not contain the values that it should, but I haven't yet determined where the original data is loaded and why it is different in DS2x86 from what it contains in DOSBox.
During the next week I plan to continue working on Jazz Jackrabbit, and possibly implement the HIMEM.SYS emulation for DS2x86. There are also some opcodes whose implementation is still missing, so all these will keep me busy for the next week. After these fixes I think it is time to focus on the audio features. I hope to release a new version of DS2x86 during the next (Easter!) weekend.
This is mostly a small bug fix version, as the 0.30 version had a couple of annoying bugs caused by the new HIMEM.SYS emulation. I also added the Smooth scaling option to some new graphics modes. The main changes in this version are the following:
Thanks for the bug reports for 0.30 that brought the first two problems to my attention! I hope this version will run the Windows programs better, and does not cause so many problems in the games that used to run in the previous versions.
This is not version 0.26, but version 0.30! I decided to jump the version number, as this version has such a major change. This version emulates an 80286 processor, instead of the 80186 processor that all previous versions have emulated. The list of changes in this version is as follows:
Here below are some screen copies from the latest version, showing the Windows 3.00a About dialog in DS Lite and DSi mode, and two screen copies showing the result of the EGA 640x200 mode Smooth scaling.
Please test this new DSx86 version, as I might have broken some games with the extensive internal changes I had to do for protected mode support. Also feel free to test various 16-bit Windows games, many of those should now run (as long as they don't try to use some 386 processor features).
Next, I think I will get back to working on DS2x86. I would like to get started on the proper audio support for DS2x86 in the near future, so I'll probably look into that, along with trying to fix some bugs and test some misbehaving games. Thanks again for your continued interest in DSx86 and DS2x86!
For the past week I have been continuing my work on adding the 286 protected mode features to the original DSx86. Finally this morning I got Windows 3.00a to actually boot up in Standard Mode!
I have been doing this work by bundling all the files that Windows 3.00a needs into the DSx86.nds file itself, so that I can quickly test it using No$GBA and iDeaS emulators on my PC. So far I have had to include 885 kilobytes worth of files, and thus the amount of memory that DSx86 has available for the extended memory emulation is rather limited. Luckily Windows 3.00a is so compact that it is possible to bundle all the core files into DSx86 and still have just about enough memory free for the actual emulation. It also looks like I can make the Standard Mode available in DSx86 even when running on DS Lite, instead of requiring DSi mode. Of course it will run faster and have a lot more memory available when running in DSi mode.
Next I'll need to remove the bundled files and start testing it on real hardware with a proper Windows 3.00a installation. I have not yet implemented all the protected mode opcodes, and especially the code that handles task switching is still very limited, so I fear I still have a lot of work to do before the Standard Mode will be actually usable. There are also still problems with keyboard and mouse handling. But in any case, the core features have been implemented and seem to work, so things are looking quite good at the moment.
This version does not have any major new features, mostly just minor enhancements and improvements. A few new protected mode opcodes are supported, and a few more software interrupts are handled. I have not had time to work on DS2x86 (nor DSx86) much during the last week, as I have been busy with other things. Sorry about that, I hope to be able to work on both DSx86 and DS2x86 more during the next two weeks.
Next I plan to continue working on the DSx86 286-specific protected mode features, in an attempt to get Windows 3.0 running in Standard Mode.
For the past week I have been working on the original DSx86 instead of DS2x86. I began implementing the required changes in order to enable running in 286 protected mode (for Windows 3.0 Standard Mode). I have changed the memory access method to enable access to full 16MB of memory (of which around 12MB would be actually available in DSi mode, and some hundred kilobytes in the normal DS mode). It looks like Windows 3.0 only requires 24KB of extended memory to be able to run in Standard Mode, so this should make it possible to run Windows 3.0 in this mode also with DS Lite and the original "phat" DS. It will run much better on a DSi when using DSi mode, though.
The current status is that Windows 3.0 enters protected mode fine, does some setup operations, then returns to real mode (which on a 286 processor means resetting the processor using a Triple Fault exception). After that things start to go wrong somewhere in my code, but I need to add better protected mode debugging features before I can properly start working on this problem.
I had some trouble implementing proper triple fault handling, as my first information source, the DOSBox source code, was of no help. I tried to force DOSBox to report that the processor is an 80286 when Windows 3.0 checks for processor type, but when Windows 3.0 then causes a triple fault, DOSBox simply crashes with a stack overflow. Thus, I had to hunt the net for a better description of what exactly should happen when the CPU gets reset in an 80286 machine. The best source I have found so far is the Protected Mode Basics document by Robert Collins. However, it seems that the actual behaviour after a reset depends on the system BIOS, and this document only shows a method where the CMOS Shutdown Byte is set to a value of 0x05. Windows 3.0 however gives it a value of 0x09, which does not seem to work similarly. I think I have now been able to determine what the BIOS is supposed to do for Windows 3.0, but since something still goes wrong there I am not absolutely sure. I am still hunting for more information, and also debugging the Windows 3.0 code further.
Besides this problem, it has been quite easy and fun to work on the old DSx86 code, as I have been able to use No$GBA and iDeaS for testing, so that I haven't needed to copy anything to real hardware. With iDeaS I can even debug and trace through my code, which has been a great help when implementing the more difficult protected mode features. I even found a bug in code I had ported from DS2x86, so this will help me in improving DS2x86 protected mode features as well.
I will probably not be able to make the code run completely by the next weekend, so I plan to switch to working on DS2x86 for the next week. It has a lot of opcode and other work still remaining, thanks for all the debug logs you have been sending since the last version! Those should keep me busy for the next week.
The major improvements and fixes in this version are the following:
During the past week I also started working on the 286 protected mode features for the original DSx86, specifically when running in DSi mode. Since Windows 3.0 needs HIMEM.SYS to be installed when running in Standard Mode (meaning the 286 protected mode), I started by implementing the HIMEM.SYS features. The next (rather big) step is to change the memory access methods to support accessing memory beyond the first megabyte of RAM. This will sadly make the code slightly slower, as I can not keep all the needed variables in registers any more. This difference should not be anything major, though. I plan to release the next DSx86 version only after I have made this change, so no new DSx86 version today, sorry.
I had partly forgotten how easy and fast it is to work with devkitARM and libnds, after working for over half a year with the DSTwo SDK. You can build the software straight from the Programmer's Notepad, and after that you can test the build using No$GBA. The whole thing takes a few seconds. With DS2x86 it takes nowadays a bit over 8 minutes to FTP-transfer a new build to the DSTwo cart, which is the only place where the new build can be tested. So, I'm very much looking forward to moving my main development focus back to DSx86. I still need to work on improving DS2x86 also for quite a while, though.
Firstly, sorry that I had failed to mention in my previous blog posts that the Worms and Warcraft versions I used to test DS2x86 were shareware demo versions. It seems that the proper games do not yet work in the current DS2x86 version. Sorry about this. I managed to find a proper version of Warcraft, and am currently testing it, so there is a bigger chance that it at least will work in the next version.
Since last week I have been working on improving the opcode support. I started going through all the opcodes in order, and implementing the missing versions. There are practically two groups of opcodes, the normal opcodes 0x00-0x0E and 0x10-0xFF, and then the extended group of opcodes beginning with the 0x0F byte. I have now implemented opcodes 0x00-0x0E and 0x10-0x7F (that is, half of the normal opcodes) for all the 16bit/32bit and real/protected mode variations. I am currently working on the second part of the normal opcodes. I doubt I will have time to implement all of them before the next weekend, but some more games might again work in the next version.
I also increased the emulated EMS memory size to 4MB, mainly for The Elder Scrolls: Arena, for which there seems to be interest. After increasing the EMS memory and implementing a couple of new opcodes, it seems to at least start up. I have managed to create a character, save the game, load the game and walk around a little bit. Other than that I don't know how far it would get (as I don't actually know how to play it properly), but it is worth a try in the next version. Oh, and I believe this time I have the proper version of the game. :-)
I have also managed to get a little bit further along in Jazz Jackrabbit, as I implemented true General Protection Fault handling. However, it still fails to start, and it seems like the current problem is caused by it not copying correct data into memory. I am debugging it to determine the cause of this problem. I am also debugging Warcraft, which currently hangs when giving an order to dig for gold. This seems to be a "soft" hang: the routine it never returns from looks like code that determines the best route to take to reach the digging area, so I plan to compare the behaviour of that algorithm in DS2x86 emulation with that of DOSBox, and the difference should point out the misbehaving opcode in DS2x86.
So, my primary goal for the next version is to get more opcodes implemented, and some bugs in the opcodes fixed. No major new features planned in the next version yet.
The changes in this version are:
The longer term plans are to add 80286 protected mode features and, by taking advantage of the larger RAM in DSi mode, to be able to run Windows 3.0 in Standard mode on a DSi.
This version has a lot of new protected mode opcodes supported, based on the debug logs you have been sending. Thanks again for those! This version might now run a few more 386-specific games, for example I have been able to make Warcraft: Orcs & Humans start up into the actual game. Every now and then it fails with an unsupported I/O port, which seems to be caused by the game sometimes detecting the SoundBlaster as using DMA channel 3, while in reality it uses DMA channel 1. I suspect there are still some rather serious problems in my audio handling.
I have also made some minor performance improvements, the things mentioned in my previous blog post, and I also moved the temporary variables used by the Mode-X graphics mode opcodes into the small data segment which is accessed by the GP register. This makes the Mode-X graphics handling (as used in Doom, for example) slightly faster.
While making the performance improvements, I again ran into the weird keyboard reading problem I originally fought with at the beginning of this year. After various tests I was able to determine that when I used the new improved Mode-X code (which was slightly smaller), the keyboard behaved very erratically, but going back to the original slightly larger Mode-X code got rid of the problem! The weird thing is that this problem happens immediately in the 4DOS prompt, when none of the changed Mode-X routines have even been run yet!
So, in the end I had to add 1000 bytes of filler at the end of the Mode-X graphics code to make the keyboard reading work properly! This obviously makes absolutely no sense, and I will remove the extra filler bytes as soon as I can figure out what the real problem is. It seems like some sort of alignment problem in the DS2 SDK code that handles the communication between the ARM side and the MIPS side, but that is just a theory and without knowing the internals of the communications code I have no way of properly testing this theory.
I haven't been able to handle all the issues mentioned in the debug logs yet, but many games should at least progress further. Please send me the new debug logs again for this version, and I'll again try to implement as many fixes as possible to the next version.
During the past week DS2x86 has progressed well, and I have also started (or gone back to) working on improving the original DSx86. By the way, the version 0.24 of DSx86 was in fact not built with the absolute latest version of libnds, as a new version of libnds was released on the 5th, while I downloaded it on the 1st of February. I did not notice the new update before releasing it on the 6th. Anyways, my focus has still been on DS2x86, but I plan to slowly get up to speed with improving DSx86 as well.
As I mentioned in the previous blog post, I wanted to add a similar profiling system into DS2x86 to what I have been occasionally using with the original DSx86, to find the performance bottlenecks and to get a feel for the overall performance of my emulator. I first coded the main profiler system (calculating the number of times each opcode in the main opcode table is called, and saving this data into a file on the SD card after the most often called opcode has been executed around a million times). This time I coded the main profiler code in ASM, while in my DSx86 version it was coded in C (which slowed down the emulation quite a lot). Now DS2x86 runs with the profiler active still a little bit faster than DSx86 without the profiler.
The next step was to add a timer to count the number of CPU cycles it takes to execute each opcode, and this is where I ran into some difficulties. I had to test various methods before I found something that worked. Here are the things I tried, in order:
Here is the first profiling result, while running the Doom demo. The first table shows the opcodes with the lowest minimum tick counts (ordered by that value), and the second table shows the opcodes taking the most total number of ticks (again ordered by that value):
opcode | byte | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|---|
NOP | 90 | 47999 | 12 | 16.46 | 790106 | 0.1921% | No operation |
JNZ | 75 | 561135 | 13 | 16.65 | 9340201 | 2.2708% | Jump if not equal |
CLC | F8 | 5 | 14 | 14.00 | 70 | 0.0000% | Clear Carry flag |
DEC EDX | 4A | 30422 | 14 | 16.45 | 500370 | 0.1216% | Decrement EDX register |

opcode | byte | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|---|
??? r/m32,+imm8 | 83 | 607110 | 20 | 31.33 | 19021718 | 5.3878% | Operations with signed immediate byte |
??? r/m32,imm32 | 81 | 748927 | 21 | 28.09 | 21040513 | 5.9596% | Operations with immediate doubleword |
Size prefix | 66 | 406693 | 22 | 66.13 | 26894836 | 7.6179% | Operand-size prefix |
Opcode prefix | 0F | 616388 | 20 | 44.40 | 27370335 | 7.7525% | Various 386-opcodes |
MOV r32,r/m32 | 8B | 797681 | 17 | 37.42 | 29849028 | 8.4546% | Move to 32-bit register |
MOV r8,r/m8 | 8A | 1048576 | 19 | 34.11 | 35768283 | 10.1312% | Move to 8-bit register |
Not surprisingly, the fastest opcode is NOP, which is just a jump back to the opcode loop. Curiously though, even when the minimum ticks it takes is 12 (including the profiling overhead, which I estimate to be only 1 tick), on the average it takes over 4 ticks more! This might be due to some cache misses, but I'm a bit surprised that the cache misses happen so frequently that the effect is that big! But in any case, the 12 ticks is the baseline and that in principle shows how many ticks the main opcode loop takes.
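The columns of the tables above can be derived from three values per opcode. The following C sketch shows one way to accumulate them; the struct and function names are hypothetical, and how the tick value itself is measured on the hardware is not modelled here.

```c
#include <stdint.h>

/* Per-opcode timing data (hypothetical layout). */
typedef struct {
    uint32_t count;       /* how many times the opcode was executed */
    uint32_t min_ticks;   /* fastest single execution seen so far */
    uint64_t total_ticks; /* sum of the ticks of all executions */
} op_stats;

/* Record one execution that took the given number of ticks. */
static void stats_add(op_stats *s, uint32_t ticks)
{
    if (s->count == 0 || ticks < s->min_ticks)
        s->min_ticks = ticks;
    s->count++;
    s->total_ticks += ticks;
}

/* "avg ticks" column: total ticks divided by the call count. */
static double stats_avg(const op_stats *s)
{
    return s->count ? (double)s->total_ticks / s->count : 0.0;
}

/* "% of total" column: share of this opcode of all measured ticks. */
static double stats_percent(const op_stats *s, uint64_t all_ticks)
{
    return all_ticks ? 100.0 * (double)s->total_ticks / (double)all_ticks : 0.0;
}
```

For the NOP row this gives 790106 / 47999, or about 16.46 ticks on average, against the minimum of 12.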
Obviously, immediately after I had implemented the profiler, I wanted to use it to improve the speed of my emulation. The biggest improvement to the overall speed would be if I could improve the main opcode loop, as that would speed up everything. I had attempted to use self-modifying code back in October of last year, but could not get it to work reliably back then. I had attempted it again during my Xmas vacation, with the same results. However, I now thought that I finally understood what I did wrong during my previous attempts, and thus decided to try one more time.
The original opcode loop in DS2x86 looked like this (with some macros expanded for clarity):

    loop:
        lbu   t0, 0(cseip)              // Load the opcode byte from CS:EIP
        addu  cseip, 1                  // Increment the instruction pointer
        lw    t1, SP_OP(sp)             // Load address of the current opcode table from stack
        sll   t0, 2                     // t0 = 4*opcode
        addu  t1, t0
        lw    t1, 0(t1)                 // t1 = opcode_table[opcode]
        move  eff_seg, eff_ds           // Set DS to be the effective segment
        ori   flags, FLAG_SEG_OVERRIDE  // Fix the CPU flags, telling we have no segment prefix
        jr    t1                        // Jump to the opcode handler

After compilation the code stays pretty much the same, with some defines replaced and the assembler reordering the jump to fill the branch delay slot:

    800b6020 <loop>:
    800b6020:  93c80000  lbu    t0,0(s8)
    800b6024:  27de0001  addiu  s8,s8,1
    800b6028:  8fa9000c  lw     t1,12(sp)
    800b602c:  00084080  sll    t0,t0,0x2
    800b6030:  01284821  addu   t1,t1,t0
    800b6034:  8d290000  lw     t1,0(t1)
    800b6038:  01e0f821  move   ra,t7
    800b603c:  01200008  jr     t1
    800b6040:  37390002  ori    t9,t9,0x2

I wanted to get rid of the opcode table address load from stack (the lw t1, SP_OP(sp) line in the source snippet, which is lw t1,12(sp) in the dump). This address changes very infrequently: it only changes when an IRQ needs to be handled (a few hundred times per second), or when the processor switches between modes (real mode / protected mode / USE16 code segment / USE32 code segment). Thus it feels very wasteful to load the address from memory for every single opcode. In theory I could keep the address in a register, however it would then be very difficult to change the address from an interrupt handler, as that would require some obscure stack frame handling to make the interrupt return pop a different value to the register. Very ugly and error-prone.
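For readers less familiar with MIPS assembly, the same table-driven dispatch can be sketched in C. This is only a toy model under stated assumptions: the `cpu` struct, the handler names and the tiny opcode set (NOP, INC EDX, DEC EDX, HLT) are made up for illustration, and unsupported opcodes simply stop the loop.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct cpu cpu;
typedef void (*op_handler)(cpu *);

struct cpu {
    const uint8_t *ip;       /* stands in for the cseip register */
    const op_handler *table; /* current opcode table (the value loaded from the stack) */
    uint32_t edx;            /* the only emulated register in this toy model */
    bool halted;
};

static void op_nop(cpu *c)     { (void)c; }
static void op_inc_edx(cpu *c) { c->edx++; }
static void op_dec_edx(cpu *c) { c->edx--; }
static void op_hlt(cpu *c)     { c->halted = true; }

static op_handler demo_table[256];

/* Fill the table; every opcode without a handler stops the loop. */
static void init_demo_table(void)
{
    for (size_t i = 0; i < 256; i++)
        demo_table[i] = op_hlt;
    demo_table[0x90] = op_nop;     /* NOP */
    demo_table[0x42] = op_inc_edx; /* INC EDX */
    demo_table[0x4A] = op_dec_edx; /* DEC EDX */
    demo_table[0xF4] = op_hlt;     /* HLT */
}

/* Run code until a halt and return the final EDX value. */
static uint32_t run(const uint8_t *code)
{
    cpu c = { code, demo_table, 0, false };
    init_demo_table();
    while (!c.halted) {
        uint8_t opcode = *c.ip++; /* lbu + addu: fetch and advance */
        c.table[opcode](&c);      /* table lookup + jr: call the handler */
    }
    return c.edx;
}
```

The `c.table` pointer read on every iteration corresponds to the per-opcode memory load that the rest of this post tries to eliminate.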
A simpler solution would be to have the address as an immediate value of the opcode, as that is in memory and can be changed from the interrupt handler, and it does not need an extra memory load in the opcode loop. The problem I had been running into was that the processor did not always see that I had changed the code it was running. At the time I did not realize that I had not used the correct cache commands to force the data cache to write its value into memory, and to invalidate this address from the instruction cache.
I coded the new opcode loop handling and changed all the code that previously wrote the opcode table address to the stack to write the value directly to the immediate values of the opcodes, using the correct cache commands this time, and it worked! No hangs or other weird behaviour, everything seemed to work fine!
I'll show the new loop code in a moment, but as I also made another improvement before I profiled it, I'll show the cache handling first, then talk about the other optimization, and only after that show the new code and the resulting profiler info. The correct cache commands to force the flushing of the data cache and invalidating the instruction cache are the following (when the AT register contains the memory address of the opcode to invalidate):

    cache 0b10101, 0(AT)   // Primary Data Cache - Hit Writeback Invalidate - Address
    sync
    cache 0b10000, 0(AT)   // Instruction Cache - Hit Invalidate - Address

The MIPS architecture defines one register, the Global Pointer (GP), for use by the toolchain to speed up memory access of often-used variables. Since the MIPS architecture has only 16-bit immediate values, all 32-bit values that need to be put into registers have to be built from two 16-bit parts. Similarly with memory addresses, for example the assembler expands this:

    lw    t0, VGA_latch

into this:

    lui   t0, %hi(VGA_latch)
    lw    t0, %lo(VGA_latch)(t0)

That is, first the high 16 bits of the memory address are loaded into the t0 register, then it is used as a base register, with the low 16 bits of the variable address as an immediate 16-bit offset, to load the actual value. Thus, all simple-looking variable accesses actually take two CPU instructions to execute. To speed up the memory accesses, the toolchains use the GP register to point into the middle of a 64KB-sized memory area (a small data region with a segment name .sdata). Since the immediate 16-bit offset is a signed value, the GP register needs to point into the middle of this area to be able to access the full 64 kilobytes.
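The split into a high and a low part can be sketched in C as follows. The function names are made up for this illustration. One detail worth noting is that because the low part is sign-extended by the load instruction, the high part has to be rounded up by one whenever the low part is negative, which is what the added 0x8000 does.

```c
#include <stdint.h>

/* Low part: the signed 16-bit offset that goes into the lw instruction. */
static int16_t addr_lo(uint32_t addr)
{
    return (int16_t)(uint16_t)(addr & 0xFFFFu);
}

/* High part: the value loaded by lui. Adding 0x8000 first compensates
   for the sign extension of the low part. */
static uint16_t addr_hi(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* The effective address the lui + lw pair ends up using. */
static uint32_t addr_rebuild(uint16_t hi, int16_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)(int32_t)lo;
}
```

For example, an address of 0x80109000 splits into a high part of 0x8011 and a low part of -0x7000, and 0x80110000 - 0x7000 gives back the original address.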
I had purposefully left the GP register unused in my ASM code, as I did not know how best to take advantage of the 64KB area, and as I did not know whether the C modules already use this area. However, now that I was able to use self-modifying code, I thought that if I had the opcode tables in this small data area, I could simply change the low 16 bits of the address (in other words the immediate offset) to have the main opcode loop point to different opcode tables. Each opcode table has 256 entries, and I have 8 opcode tables plus the IRQ opcode table, totalling 4*256*9 = 9216 bytes. Well within the 64KB limit, and I could still fit some other frequently used variables in there, provided the C code does not take all of the space.
I then looked into how the SDK uses the GP register and the small data area, and somewhat to my surprise, the start.S does set up the GP register to point to the _gp memory address, which the link.xn linker script has created between the data and bss segments, but it looked like no code actually uses it for anything! The symbol dump file showed that the bss area began immediately after the _gp variable. So, the whole 64KB was free for my own use! Actually, it looked like the linker script creates the _gp variable at the beginning of the small data segment, so it can actually address only 32KB of memory. But even that will be plenty for my needs.
So with the self-modifying opcode table offset, and the GP register containing the start of the small data area, I was able to change my main opcode loop to look like this:

    loop:
        lbu   t0, 0(cseip)              // Load the opcode byte from CS:EIP
        addu  cseip, 1                  // Increment the instruction pointer
        sll   t0, 2                     // t0 = 4*opcode
        addu  t1, gp, t0                // t1 = small data segment address + opcode*4
    sm_op:
        lw    t1, 0(t1)                 // Imm16 offset self-modified to contain the opcode table offset within the small data segment
        move  eff_seg, eff_ds           // Set DS to be the effective segment
        ori   flags, FLAG_SEG_OVERRIDE  // Fix the CPU flags, telling we have no segment prefix
        jr    t1                        // Jump to the opcode handler

Or shown from the dump file:

    80101e20 <loop>:
    80101e20:  93c80000  lbu    t0,0(s8)
    80101e24:  27de0001  addiu  s8,s8,1
    80101e28:  00084080  sll    t0,t0,0x2
    80101e2c:  03884821  addu   t1,gp,t0
    80101e30 <sm_op>:
    80101e30:  8d290000  lw     t1,0(t1)
    80101e34:  01e0f821  move   ra,t7
    80101e38:  01200008  jr     t1
    80101e3c:  37390002  ori    t9,t9,0x2

This is the full macro that changes the opcode table offset to point to the opcode table, whose address is in the register given as a parameter to the macro:

    .macro set_current_opcode_table reg
        .set noat
        la    AT, _gp              // Get the address of the _gp variable
        subu  \reg, AT             // Subtract the _gp address from the table address, to get the imm16 offset
        la    AT, sm_op            // Get the address of the opcode we are to modify
        sh    \reg, 0(AT)          // Store the 16-bit (halfword) offset value into the opcode
        cache 0b10101, 0(AT)       // Primary Data Cache - Hit Writeback Invalidate - Address
        sync
        cache 0b10000, 0(AT)       // Instruction Cache - Hit Invalidate - Address
        .set at
    .endm

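What the macro does to the instruction word can be modelled in plain C. This is a sketch with made-up function names that works on a copy of the instruction in a variable; the range check is an addition of this example (the macro itself does not check anything), and the cache operations have no C equivalent here, so they are only mentioned in a comment. Note also that storing a halfword at offset 0 of the instruction only hits the immediate field on a little-endian CPU, which is assumed to be the case here.

```c
#include <stdint.h>
#include <stdbool.h>

/* Replace the low 16 bits (the immediate field) of a MIPS I-type instruction. */
static uint32_t patch_imm16(uint32_t insn, int16_t imm)
{
    return (insn & 0xFFFF0000u) | (uint16_t)imm;
}

/* Compute the offset of an opcode table from gp and patch it into the
   given lw instruction word. Returns false (and leaves the instruction
   untouched) when the table is out of reach of a signed 16-bit offset. */
static bool set_table_offset(uint32_t *insn, uint32_t gp, uint32_t table)
{
    int64_t off = (int64_t)table - (int64_t)gp;
    if (off < INT16_MIN || off > INT16_MAX)
        return false;
    *insn = patch_imm16(*insn, (int16_t)off);
    /* On the real hardware the data cache write-back and the
       instruction cache invalidate would have to follow here. */
    return true;
}
```

With the `lw t1,0(t1)` encoding 0x8d290000 from the dump, a table located 0x400 bytes after `_gp` turns the instruction into 0x8d290400, that is `lw t1,1024(t1)`.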
Okay, so now I was ready to run my profiler again, to see what kind of an effect this change had. I expected to see improved performance for every opcode. Here are the profiling results after the change:
opcode | byte | count | min ticks | avg ticks | total ticks | % of total | Improvement | command |
---|---|---|---|---|---|---|---|---|
NOP | 90 | 45790 | 11 | 13.78 | 630763 | 0.1853% | 16% | No operation |
JNZ | 75 | 560268 | 13 | 15.84 | 8876231 | 2.6082% | 5% | Jump if not equal |
CLC | F8 | 3 | 13 | 13.00 | 39 | 0.0000% | 7% | Clear Carry flag |
DEC EDX | 4A | 13658 | 13 | 16.50 | 225334 | 0.0662% | 0% | Decrement EDX register |

opcode | byte | count | min ticks | avg ticks | total ticks | % of total | Improvement | command |
---|---|---|---|---|---|---|---|---|
??? r/m32,+imm8 | 83 | 584328 | 19 | 30.07 | 17569712 | 5.1627% | 4% | Operations with signed immediate byte |
??? r/m32,imm32 | 81 | 772904 | 20 | 26.64 | 20586425 | 6.0492% | 5% | Operations with immediate doubleword |
Size prefix | 66 | 402726 | 21 | 65.60 | 26417998 | 7.7627% | 1% | Operand-size prefix |
Opcode prefix | 0F | 644778 | 20 | 42.16 | 27182175 | 7.9873% | 5% | Various 386-opcodes |
MOV r32,r/m32 | 8B | 787838 | 17 | 36.61 | 28843074 | 8.4753% | 2% | Move to 32-bit register |
MOV r8,r/m8 | 8A | 1048576 | 17 | 33.90 | 35550596 | 10.4463% | 1% | Move to 8-bit register |
All in all, the performance improvement was rather minor. I had hoped this change would have caused more of an improvement. In any case, now I am happy with the main opcode loop, it is now as fast as I can make it, so I need to look elsewhere for extra performance. The next performance improvement task I plan to do is to move all the EGA/VGA variables into the small data segment and access them using the GP register. I also plan to look into those most time-consuming operations more closely, to determine if there are some optimization possibilities there. The opcode prefix 0x66 at least could also be made to use the GP register, for example.
After working on the profiler and performance improvements for a few days, it was time to get back to adding the missing opcodes. Based on the log files you have been sending (thanks!), I selected a couple of new DOS4GW games to download and test myself. I first started with Worms by Team 17, and after several iterations of adding various missing opcodes, it started working! I actually have no idea how to play it, so I'm not sure if it works properly, but at least the beginning of the game looks to behave very similarly to how it behaves in DOSBox, so I'll leave the rest of the testing to you (when I release the next version).
The game I am currently working on is UFO: Enemy Unknown. It has some strange problem that it hangs after the intro when running the start.bat, but when running the go.com it progresses up to the start of the game, where it encounters an unsupported opcode. Around this opcode are a lot of floating point operations (which I do not plan to support), so I am not yet sure whether the actual game will work. We shall see.
All in all the opcode changes I have needed to do for DS2x86 have been quite straightforward, I haven't had to make any changes to the more difficult protected mode features. Looks like my emulation already handles the features the DOS4GW extender needs, so many games should run in DS2x86 after I get the plain opcode support more complete. That is very encouraging, it looks like DS2x86 might become quite useful in the near future.
I started working on the better scaling methods for the original DSx86. I began with the MCGA 320x200 256-color mode, as that is the easiest, which was a good thing as I had been working so long with the MIPS assembler that going back to the ARM assembly language was somewhat difficult. I couldn't immediately remember what opcode it was to store a byte into memory, how to branch after a subtraction if the result is zero, etc. These operations are so different in MIPS that it took a while to get back up to speed with the ARM assembly.
I kept coding the scaling routine during the 7-minute FTP transfers of the DS2x86 to the SD card via WiFi. Actually during one such transfer I was able to build, test, fix and build DSx86 again four times! Testing DSx86 is so much faster than testing DS2x86 that it felt pretty good getting back to working on it!
In any case, I managed to code the smooth scaling for the MCGA mode (which will replace the current Jitter mode, which I think has never been all that useful). However, at least in No$GBA DSx86 hangs immediately when using the smooth scaling mode and attempting to set the screen refresh rate to 60fps. This suggests that the smooth scaling routine takes more than 1/60th of a second to run, which also means that at 30fps it takes more than half of the available CPU cycles! The code is currently pretty much a direct port from the MIPS code in DS2x86, so I might be able to improve its performance a bit. However, it looks like the smooth scaling method will not be very useful unless you have a DSi and you are able to run DSx86 in DSi mode. In the smooth scaling version I still use hardware scaling to scale vertically from 200 to 192 rows, I'm not sure if I will keep it this way or if I will use the same system as in DS2x86, where you still need to scroll vertically even when using the scaled screen mode.
The following No$GBA screen copies are from my BIOS graphics routines test program, which I have used to test all the graphics modes of DSx86. They give some sort of an idea about the difference between the hardware scaling and the new smooth scaling.
Well, that's it for this blog post, which actually became quite long and full of various things I thought might be worth mentioning. Hope you didn't get bored reading it! Next weekend I plan to release DS2x86 0.04 and DSx86 0.25, if all goes well.
It has been a long time since I last released a version of the original DSx86. The version I released today, 0.24, is built with the latest libnds, so that it can run in DSi mode if you have a Nintendo DSi and a suitable flash cart that enables DSi mode. I am only aware of one such flash cart, CycloDS iEvolution. Running in DSi mode means that the CPU runs at 133MHz instead of the normal 66MHz, so the emulation runs at double speed (20MHz 286 instead of 10MHz 286). If you don't have such a flash cart or you run DSx86 on a DS Lite (or original DS Phat), this new version does not bring any enhancements, sorry. I plan to add the smoother screen scaling features, and other improvements on my TODO list, in the future, though.
The DS2x86 version 0.03 has a lot of work done in the protected mode features, so that it currently runs Doom. I was able to fix the problem I had last weekend with the textures (the cause was a bug in my 64-bit division algorithm), and I also added some preliminary audio support. The problem with the audio in Doom is that it requests an interrupt after every 128 samples, while the shortest interrupt interval my current SB emulation allows is 3*128 samples (but adjusted by the playing frequency). Thus, to make the audio in Doom work, I had to adjust the playing frequency to be only 22050/3 Hz, which makes the interrupts happen at about every 128 input samples. I will improve my audio support in the future, but I did not have time to code a better emulation method by today. There are some other minor improvements and bug fixes as well, but no major new features. The high-resolution screen modes and AdLib audio are still missing, for example. It is possible (though not very likely) that this version also runs other DOS4GW games, so feel free to test it!
The next things I plan to do are to look into enhancing the original DSx86 with some proper DSi mode support, and I also want to add profiler features to DS2x86 so that I can start improving its performance. I believe it should run Doom better than what it currently does, so I want to see what the most time-consuming operations are and try to improve the speed of those operations. I also want to continue work on the Borland DOS Extender (using the Jazz Jackrabbit game) and implement the higher-resolution screen modes.
Again, please send me the debug logs, as those will help me in developing DS2x86 (and DSx86) further!
This is a bit of an unscheduled blog post, but as my friend just lent me his Nintendo DSi, I decided to immediately test how DSx86 runs in the CycloDS iEvolution flash cart in DSi mode. The current (old) DSx86 version 0.23 does start fine, but the touchscreen does not work so it is pretty much useless. I believe CycloDS is working on a compatibility layer that might make it work, but my understanding is that making DSx86 run in DSi mode might simply need a recompilation with the latest libnds. Thus, I downloaded the latest libnds version and recompiled DSx86 with it. No errors when building the software, and indeed the brand new DSx86 version 0.24 does run fine in DSi mode!
The speed is about twice that of the "DSL mode" (as the CycloDS firmware calls the normal working mode). I'm not sure if this speed is yet enough to warrant adding 386-opcodes, but at least the smoother screen scaling features should work fine in DSi mode in the original (meaning non-DSTwo-specific) DSx86. I'll probably release the newly built DSx86 version 0.24 next weekend, so you can test that the latest libnds version did not break anything that used to work in 0.23 version. If/when you get the CycloDS iEvolution flash cart (or in case the DSi mode gets enabled in some other flash carts) you can then run DSx86 at double speed. The added performance will certainly help in some games that have been running too slowly in the current 0.23 version.
I doubt I will have time to add any enhancements (like the smooth scaling methods) by the next weekend yet, but I'll see if I can work on both DSx86 and DS2x86 side by side from now on, enhancing both of them simultaneously with new features.
Okay, I'm back from my trip but somewhat tired so I don't think I will get much programming done today. However, just before I went on my trip I got Doom to actually run in DS2x86! One milestone reached! It does not play any sounds yet, which makes it not all that immersive or even playable yet, but it does run and it is possible to evaluate the performance of my protected mode 32-bit emulation with it. Doom runs only at a marginally playable framerate using the default settings, which is not all that unexpected considering that the emulation speed is only about 25MHz 486. I remember when I had a 486/33 machine and played Doom against a friend who had a 486/66 machine, and I usually lost simply because I had a slower machine. Things improved when I also got a 486/66 machine. However, some settings in both DS2x86 and Doom can be adjusted to make it run better, and the best settings I have found so far seem to be the following:
There are still problems with the texture mapping of the sprites, at times the texture is not mapped correctly but has a weird vertical wrapping problem. Also, I want to look into adding some audio support (if not very difficult) for Doom, so I won't release the DS2x86 version 0.03 until the next weekend. Sorry for the wait, but at least you now have something specific to look forward to in the new version. :-)
I also hope to implement some fixes to the problems in the DS2x86 debug logs you have been sending, thanks again for those! I have been skipping them when trying to make Doom run, so I think it is time I look into those as well.
I also received my pre-release developer copy of the CycloDS iEvolution flash cart last week. I haven't yet had time to do anything with it, and since I don't even have a DSi (only a DS Lite) myself, I need to wait for a friend of mine to lend his DSi to me while I look into taking advantage of the DSi mode with the original DSx86. If I understand correctly what the people in the thread at http://www.teamcyclops.com/forum/showthread.php?t=10826&page=3 talk about, there are still some problems with the ARM9/ARM7 FIFO handling when trying to take advantage of the DSi mode in homebrew software. I trust these issues will be fixed in the near future, but I think I will still work on the DSTwo version until libnds and iEvolution work fine together in DSi mode. It looks like I don't necessarily need to do all that much work in DSx86 to have it running at twice the current speed on a DSi, but I will know more after I have studied and understood this issue better.
It has been two weeks since I released the 0.02 version, and I had planned to release the 0.03 version today. However, I have only been working on the protected mode opcodes and features for Doom, which does not yet run, so I decided against releasing a version that has pretty much no noticeable improvements. This weekend would have been a good release weekend, as I need to take a trip next weekend and won't be able to release a new version then either. So, you need to wait for at least two more weeks for the next release. Sorry about that, but things don't always work according to plan.
I have managed to make Doom progress a lot further than what it did last weekend, though. It performs the DOS4GW protected mode stuff fine, and begins running the Doom-specific initialization stuff. I am currently at the machine state initialization (which detects mouse, joystick and other hardware of the system). It uses some new protected mode opcodes that are a little bit more difficult (and error-prone) to add, so I must not rush when adding them. This is what the Doom startup screen now looks like in DS2x86 (my BIOS output routine seems not to handle the TAB character properly, instead of moving the cursor it draws the font of ASCII character code 9):
I have also gone through the debug logs that you have sent, and have added the problems in them into my TODO list. I haven't actually implemented any of the needed fixes yet, though, as I have been concentrating on getting DOOM to run. There are many unsupported opcodes in the logs which I have added for Doom, though, so many of the games you have been testing will progress further in the version that is able to run Doom. Thanks for taking the time to test DS2x86 and send me the error logs!
I hope I can get Doom and some other protected mode games running during the next two weeks. I also need to add proper unit tests for the new protected mode opcodes in the near future, as I have spent a couple of days hunting for bugs that a unit test would have found immediately. Creating the unit tests will take several weeks, though, and I'd rather get a new version released first and then start working on the unit tests.
Thanks again for your interest in DS2x86, and sorry for the no-release blog post!
For the past week I have been adding the protected mode opcodes for the games I have been using for my tests. The current status of the test games is as follows:
The problem in Doom that I mentioned in the previous blog post, where it jumped to a row of INT 3 opcodes, turned out to be a simple issue. I had forgotten to remove an instruction pointer masking (to a 16-bit value) from one of the jump opcodes I copied from the 16-bit protected mode code, so when Doom attempted to jump to offset 0x00101234 (near the beginning of the extended memory) it jumped to 0x00001234, which happened to have some data containing 0xCC (INT 3) bytes. However, after fixing this issue and adding a few new opcodes, I fought for two days with a problem where Doom suddenly attempted to return from a subroutine to a segment that was not marked executable. Finally after a lot of debugging I found the problem in my LEA opcode handling. I had used the output register as a temporary register in the opcode handler, which was not a smart thing to do when the input and output registers could be the same register! For example, with opcode LEA EBP, [EBP+0x12], both the input and the output register are EBP. I first loaded the immediate byte 0x12 into the output register (EBP), then added the input register (EBP) value to that and finally put the result to the output register (EBP)! After fixing this problem I have been able to add opcode after opcode without any new problems.
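Both bugs are easy to reproduce in a few lines of C. The sketch below is only a model of the two mistakes, not the actual handler code (which is MIPS assembly); the register array and the function names are made up for this illustration.

```c
#include <stdint.h>

/* Register indices in the x86 encoding order. */
enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* Buggy order: the destination register doubles as a scratch register,
   so the base value is lost when dst and base are the same register. */
static void lea_buggy(uint32_t regs[8], int dst, int base, int8_t disp)
{
    regs[dst] = (uint32_t)(int32_t)disp; /* clobbers the base if dst == base */
    regs[dst] += regs[base];
}

/* Fixed order: compute into a real temporary, write the result last. */
static void lea_fixed(uint32_t regs[8], int dst, int base, int8_t disp)
{
    uint32_t tmp = regs[base] + (uint32_t)(int32_t)disp;
    regs[dst] = tmp;
}

/* The leftover 16-bit masking of a 32-bit jump target. */
static uint32_t jump_target_masked(uint32_t target)
{
    return target & 0xFFFFu;
}
```

With EBP = 0x1000 and a displacement of 0x12, the buggy version leaves 0x24 in EBP (the displacement added to itself), while the fixed version gives the expected 0x1012.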
Pretty much all of yesterday went to other things besides working on DS2x86. I rearranged the furniture in my apartment, which also meant rerouting all the cabling for my home theater system. That took quite a long time, so I had practically no time left to do any programming. Finally today I have been adding the opcodes that Doom needs, but it keeps encountering new opcodes that I haven't yet implemented for 32-bit protected mode.
That's it for this short update, let's see if I can get Doom actually running by the next weekend!
This version does not have any major new features, mainly just some minor fixes and enhancements compared to 0.01. Here are the most notable changes:
I have continued adding the missing opcodes and other features for DOS4GW and other DOS protected mode extenders during the last week. After adding a few opcodes for each of the three games I tested, Zone 66, Jazz Jackrabbit and Doom, I had to drop Zone 66 from the list. That was because Zone 66 seems to go to Virtual 8086 mode to handle its DOS needs, and I did not want to tackle that mode yet, until I have progressed much further with the plain protected mode handling. So, I continued adding features for Jazz Jackrabbit and Doom.
Curiously, both Jazz Jackrabbit (using Borland's RPM extender) and Doom (using DOS4GW) have stayed in the 16-bit 286-compatible protected mode, and have not gone into 32-bit protected mode at all. Zone 66 went into 32-bit protected mode immediately, so I had assumed that the extenders would use the 32-bit mode if it was available. However, it looks like they have been made to be compatible with 80286 processors, and thus use only the 16-bit protected mode. When I originally got my Trekmo demo running, I only needed to add 32-bit protected mode features, so I have had to add quite a lot of new things to support 16-bit protected mode.
The current status with the three test games is the following:
I'll continue working on the protected mode features for now, so I hope you can wait a bit further for other enhancements (like high-resolution VGA screen and audio support). Those are coming, but I am currently more interested in getting some games that can not be run at all in the original DSx86 running in DS2x86.
Happy New Year! Thanks for all the feedback and bug reports you have been sending from the DS2x86 alpha 0.01 version. Those will help me focus my development efforts.
Sadly, the past week was mostly spent fighting with the DS2 SDK. Just before I released the alpha version, I noticed a problem with the key reading. I could not figure out what caused the problem, but I noticed that setting the screen refresh rate to 15fps made the problem much less severe, so I did that as a first aid fix in order to get the alpha version released. On Monday this week I then began working on a small test program that would display similar problematic key reading behaviour, and after a few hours of work I managed to get exactly the same symptoms.
When quickly pressing keys (for example D-Pad up/down keys), every now and then (every 15 seconds or so) there was a period of almost half a second when no key events (presses or releases) got recognized. I first noticed this when testing Wolfenstein 3D, in which a problem like that is very annoying. My small test program exhibited the same behaviour, and also stopped updating the lower screen during some testing runs, at about the same time when the first key reading problem for that test run appeared. I sent my test program as an email attachment to the SuperCard SDK contact person "king d", but haven't yet received a reply.
I assumed the most probable cause for the problem was my using the timer interrupt to update the screens, but curiously, after I changed the test program to only update a flag in the timer interrupt and perform the actual screen update in the main loop (when the flag is on), the problem continued to appear. At that point I had no theories what could cause the problem, so I began digging into the SDK internals in an attempt to get a better understanding of how it works and what I should do to get rid of this problem.
I spent all of Tuesday deciphering the dump file and testing and debugging various things, and learned quite a lot of interesting information about the SDK internals. I first started by hooking into the main interrupt handler (the source code for which is provided by the included specs/start.S file in the SDK). I found out that the actual interrupts that the SDK uses have something to do with the GPIO2 I/O system of the processor, and are numbered 155 and 156, and the handlers for these interrupts are called cmd_line_interrupt and data_line_interrupt. I found out that the cmd_line_interrupt is called on the average 128 times per second in my test program, while the data_line_interrupt is called over 3000 times per second.
I looked at the dump file for the cmd_line_interrupt, and noticed that it reads 4 halfwords (8 bytes) from address 0xB4000000 (which is not documented in the JZ4740.h header file, so I believe it is something specific to the DSTwo), and stores these into memory area called cmd_buf32. Then it jumps to different locations in the code based on the first byte of this command buffer. Looking at what happens in the different command handlers (and testing the contents of the command buffer using my test program) I was able to determine that command 0xC3 is a key event command, command 0xC1 seems to clear the "buffer busy" flag (meaning it is some sort of an acknowledgement for a received screen or audio buffer), and command 0xC5 gets sent after every 30 seconds or so if nothing else is happening (so it might be some sort of a keep-alive or idle command).
So, finding the command 0xC3 gave me the idea of hooking directly into this interrupt (instead of using my timer interrupt) to handle key presses and releases. This should mean no missed key events, as immediately when I get the interrupt from the SDK interface I can call the ds2_getrawInput() function of the SDK, put the key event into a buffer and launch an emulated keyboard interrupt (IRQ1, which the x86 PC delivers through INT 9). I did this, and the key input began to work properly in the test program, but the weird occasional hangs still remained.
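The command decoding and the key event hook described above could look roughly like the following C sketch. Everything here is an assumption made for illustration: the enum and function names, the queue size, and the `raw_keys` parameter, which stands in for whatever the SDK's key reading function returns. Only the three command bytes come from the observations above, and their meanings are the guesses stated there.

```c
#include <stdint.h>
#include <stdbool.h>

/* Command kinds, named after their guessed meanings. */
typedef enum { CMD_BUFFER_ACK, CMD_KEY_EVENT, CMD_IDLE, CMD_UNKNOWN } cmd_kind;

/* Classify an 8-byte command buffer by its first byte. */
static cmd_kind classify_command(const uint8_t cmd[8])
{
    switch (cmd[0]) {
    case 0xC1: return CMD_BUFFER_ACK; /* seems to clear the "buffer busy" flag */
    case 0xC3: return CMD_KEY_EVENT;  /* key press or release */
    case 0xC5: return CMD_IDLE;       /* possibly a keep-alive */
    default:   return CMD_UNKNOWN;
    }
}

#define KEY_QUEUE_SIZE 16
static uint32_t key_queue[KEY_QUEUE_SIZE]; /* hypothetical ring buffer */
static unsigned key_head, key_tail;
static bool keyboard_irq_pending;          /* hypothetical emulated IRQ flag */

/* Hook called from the command interrupt: on a key event, queue the raw
   key state and request an emulated keyboard interrupt. */
static void on_command(const uint8_t cmd[8], uint32_t raw_keys)
{
    if (classify_command(cmd) != CMD_KEY_EVENT)
        return;
    unsigned next = (key_head + 1) % KEY_QUEUE_SIZE;
    if (next == key_tail)
        return; /* queue full: drop the event */
    key_queue[key_head] = raw_keys;
    key_head = next;
    keyboard_irq_pending = true;
}
```

The emulator core would then check `keyboard_irq_pending` between opcodes and consume entries starting from `key_tail`.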
On Wednesday I then continued looking into the hanging problem, and after various failed tests I suddenly noticed that I had used the wrong offset into the pmain_buf variable when checking whether the lower buffer is free! Argh! Well, I fixed this to use the correct offset, and after that my test program began to work correcly, even when updating the screen from inside the timer interrupt.
I then made the same changes to DS2x86 itself (changed the key input to use the cmd_line_interrupt hook, and used the correct offset in the pmain_buf to check whether the lower screen buffer is free). Quite frustratingly, though, this did not fix the problems in DS2x86 itself. I am currently pretty much at a loss as to what exactly is wrong in my method of using the SDK, as no changes I make seem to fix the problems completely. By Thursday morning I got fed up with this problem, and reverted to the same code I used in the Alpha 0.01 version (but with the pmain_buf access fix), so that it still keeps missing the key presses every now and then, but the lower screen seems to update properly. I hope I will eventually get an idea about how to fix the key input.
There is already simple exception handling code in the 0.01 Alpha version, but I have now improved it a bit further. This new exception handling has helped me a lot while coding, as I occasionally write some severe bug in the code that crashes the system completely. Earlier, when I did not have the exception handler installed, the DSTwo would just hang, and then I had a lot of trouble trying to guess which of my changes caused this and why. Now I get an exact address in the code, and this week I also added some more information, such as a text message describing the reason: in addition to printing "Exception 5", I now print "Address error on store at address 0x12345678!" or something like that. That also shows me the faulting data address together with the code address. I can then check the dump file to see exactly where in the code the problem is. I found a good list of the possible exception codes in some Harvard University course notes.
This exception message printing only works because I use the timer interrupt to handle the screen updating. Even when the actual emulator code has crashed and will not progress further, the timer interrupts still run and can send data to the NDS side from the MIPS side. I added some code to the specs/start.S code provided with the SDK to store the exception address, cause and failing address to global variables, and in my timer interrupt handler I can then check whether these variables are set, and if so, show the "Blue Screen of Death" on the lower screen. The code that I added to the start.S exception_handler routine looks like this:
    mfc0 k0, C0_CAUSE
    ori k1, zero, (0x08<<2) //Only detect SYSTEM CALL exception
    andi k0, k0, (0x1F<<2)
    beq k1, k0, 1f //is SYSTEM CALL exception
    move a0,sp
    // ----- DS2x86 addition -----
    la AT, ds2x86_exception_address // Get address of the exception address store
    sw k0, 4(AT) // Save exception cause
    lw k1, (4*30)(sp) // Get the exception address
    sw k1, 0(AT) // Save exception address
    mfc0 k1, C0_BADVADDR // Failing address (on certain exceptions)
    sw k1, 8(AT) // Save failing address
    lw AT, (4*27)(sp) // Restore AT
    // ----- DS2x86 addition -----

and in the same start.S source code I added the global variables that can then be accessed from my timer interrupt handler:
// ----- DS2x86 addition -----
.global ds2x86_exception_address
ds2x86_exception_address: .word 0
.global ds2x86_exception_cause
ds2x86_exception_cause: .word 0
.global ds2x86_exception_vaddr
ds2x86_exception_vaddr: .word 0
// ----- DS2x86 addition -----
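On the C side, the check done in the timer interrupt handler could look something like the sketch below. This is not the actual DS2x86 code: the function names and the exact message wording are my own for illustration, and the three values are passed in as parameters so the snippet stands on its own (in the real program they would come from the three globals above, declared as extern volatile words). Note that the assembly stores the cause still masked with (0x1F<<2), so the exception code needs to be shifted down by two bits.

```c
#include <stdint.h>
#include <stdio.h>

/* Text for the common MIPS exception codes (Cause register ExcCode field).
 * The wording is mine; the numbering follows the MIPS32 architecture. */
static const char *exc_name(unsigned code)
{
    switch (code) {
    case 0:  return "Interrupt";
    case 1:  return "TLB modification";
    case 2:  return "TLB miss on load or fetch";
    case 3:  return "TLB miss on store";
    case 4:  return "Address error on load or fetch";
    case 5:  return "Address error on store";
    case 6:  return "Bus error on instruction fetch";
    case 7:  return "Bus error on data access";
    case 8:  return "System call";
    case 9:  return "Breakpoint";
    case 10: return "Reserved instruction";
    case 11: return "Coprocessor unusable";
    case 12: return "Arithmetic overflow";
    default: return "Unknown exception";
    }
}

/* Build the message for the lower screen.
 * Returns 0 if no exception has been recorded (address still zero). */
static int format_exception(char *out, size_t n, uint32_t epc,
                            uint32_t cause, uint32_t badvaddr)
{
    unsigned code = (unsigned)((cause >> 2) & 0x1F);

    if (epc == 0)
        return 0;
    if (code >= 1 && code <= 5)   /* BadVAddr is only valid for these */
        snprintf(out, n, "Exception %u: %s at address 0x%08lX! (code at 0x%08lX)",
                 code, exc_name(code),
                 (unsigned long)badvaddr, (unsigned long)epc);
    else
        snprintf(out, n, "Exception %u: %s! (code at 0x%08lX)",
                 code, exc_name(code), (unsigned long)epc);
    return 1;
}
```

The timer handler would call something like format_exception() on every tick and switch to the "Blue Screen of Death" display the first time it returns nonzero.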
On Thursday, after I got fed up with the SDK problems, I then went through some of the log files you have been sending (thanks again for those!), and downloaded a couple of games to use as a test bench when improving DS2x86. I decided to try and get the protected mode features working a bit further, so I selected three games, Zone 66 which uses a newer version of the same PMODE header that my Trekmo demo used, Jazz Jackrabbit, which uses some Borland DPMI extender, and of course Doom, which uses the DOS4GW extender used by many other DOS 386-specific games as well. To save time (as the DS2x86.plg has grown so big that it now takes about 7 minutes to FTP-transfer to my SD card), I test each of those three games, and then add all three opcodes (or other required features) before testing the three games again. Currently all three are in protected mode, and I just added support for changing the interrupt vector start address (for Zone 66), and am about to add protected mode LES opcode handling (for Jazz Jackrabbit) and the protected mode LAR opcode handling (for Doom).
Before I started work on these games, though, I increased the emulated PC RAM size, which also meant increasing the page map table. Now I emulate 16MB of RAM (1MB conventional and 15MB extended), so the 4DOS memory command shows the following. I had to fix the CPU flags handling before the 4DOS "memory" command started working at all, as my flags worked like Pentium flags so that 4DOS thought it was running on a Pentium and tried to use the cpuid opcode.
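For those curious about how a program can tell CPU generations apart from the flags alone, the usual detection trick is to try to toggle the ID bit (bit 21) of EFLAGS: if the bit can be changed, the program assumes the cpuid opcode exists. The sketch below shows the idea in C. It is not the DS2x86 code (which is in assembly), and I have not checked that 4DOS uses exactly this test; the function names are made up, and treating the emulated CPU as a 386 that keeps both the AC bit (bit 18) and the ID bit clear is an assumption.

```c
#include <stdint.h>

#define EFLAGS_FIXED1 (1u << 1)   /* bit 1 always reads as 1 */
#define EFLAGS_AC     (1u << 18)  /* cannot be set on a 386 */
#define EFLAGS_ID     (1u << 21)  /* can be toggled only if cpuid exists */

/* What the flags register holds after the guest writes 'requested',
 * when the emulated CPU is supposed to look like a 386. */
static uint32_t write_eflags_386(uint32_t requested)
{
    return (requested & ~(EFLAGS_AC | EFLAGS_ID)) | EFLAGS_FIXED1;
}

/* The buggy behaviour: every bit the guest writes is stored as-is. */
static uint32_t write_eflags_passthrough(uint32_t requested)
{
    return requested | EFLAGS_FIXED1;
}

/* The guest-side test: write the flags with ID clear, then with ID set,
 * and see whether the stored value followed the change. */
static int guest_thinks_cpuid_exists(uint32_t (*write_flags)(uint32_t))
{
    uint32_t before = write_flags(EFLAGS_FIXED1);
    uint32_t after  = write_flags(EFLAGS_FIXED1 | EFLAGS_ID);
    return ((before ^ after) & EFLAGS_ID) != 0;
}
```

With the pass-through version the guest test returns true and the program goes on to execute cpuid, which is the kind of situation I ran into; with the masking version it returns false.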
Curiously, after I had increased the PC RAM size, the keyboard reading problem got much worse. Almost every second keypress was not recognized even on the DOS prompt, and playing Wolfenstein 3D was pretty much impossible, as you had to keep pressing the keys several times before any key events happened. This was quite weird, and it began to look like the key reading problem has something to do with the size of the plg file, and especially with the .bss section size of the file. I looked at the symbol dump, and noticed that the main irq_table was AFTER my emulated PC RAM area (which now was 16MB, half of the total memory size of the DSTwo). So, I then spent some time looking into ways to make the linker put my emulated RAM area last in the uninitialized data section. I finally found a GNU LD manual that showed me a way to change the specs/link.xn file so that my RAM area was put last. Immediately after I did this, the key reading began to work like it did before my RAM increase, so that the DOS prompt seems to work fine but Wolfenstein 3D still experiences some problems. Interesting that the location of variables in the memory causes such problems! Perhaps this is the reason why my test program works fine, as it is so much smaller. But, in any case, now I have had to make changes to both of the files in the specs directory, which will probably get overwritten when installing a new SDK version, so I need to make sure I have my own copies of these files in a safe place.