DSx86 - Blog

Dec 25th, 2011 - DS2x86 progress

Merry Christmas! Since this Sunday happens to fall on Christmas Day, I decided to write most of the post beforehand, and then just add a few words and post it on the actual Christmas Day. I don't want to spend all of my Christmas working on DS2x86; I need some free time to enjoy the holidays as well. :-)

Doom audio work

During my first Xmas vacation week I have continued working on the SB digitized audio handling for the new transfer system. On Monday, as I was testing the clicking problem, it occurred to me that it would help me in pinpointing the location of the clicks if I could see the actual waveform produced. Thus I began recording the audio from the Nintendo DS headphone connector to my PC. This was a big help, for example when I added four samples of value 0x90 (where the SB middle point is at 0x80) into a certain position of every 128-byte sample block, I could see that the clicks always happened at offset 32 into this 128-byte block. Due to the fact that Doom sound initialization plays first a small sample before starting the auto-init looping mode, offset 32 meant that the clicks were always at the beginning of the transfer buffer. The actual clicks looked like the following (with the 4 samples of 0x90 value being at the beginning of each 128-sample block):

Finally, by Monday evening I found the cause for the clicks in the audio. These were caused by the card FIFO containing an extra word in it before the actual audio sample data, and as the receiving routine on the ARM side assumed the whole transfer buffer consisted of audio samples, these 4 invalid samples caused the clicks. The actual reason why there was an extra word in the FIFO is still a bit of a mystery to me. After spending most of Tuesday experimenting with this, it seemed that the extra word was the result of a previous command to the card when the MIPS side was not able to respond quite as fast as the ARM side expected. The problem got worse when I reduced the MIPS CPU speed, which also pointed to a timing problem. After some more experimenting I finally managed to adjust my new transfer system so that it now seems to work with all possible CPU speeds and with the card FIFO always containing only correct data.

After I got rid of the clicks, I decided to record the whole Doom demo, so that I could compare the audio quality of the original (current) DS2x86 transfer system with the new system. I noticed that in the old system (with the audio playing at a constant 22kHz rate) the audio had repeating 512-sample dropouts, which I had not been able to fix using the original DSTwo audio transfer buffering system. Here below are images of the first AdLib note in Doom, on top is the waveform from the old transfer system (with one 512-sample dropout selected) and below it the waveform from the new system (where AdLib audio is generated on ARM7 and played at 32kHz, so there are no dropouts and the quality is in general much better).

After getting Doom to work, I continued with getting the SB digitized audio of my own LineWars II working. It does not use auto-init audio, instead it does the buffer swapping in software. It re-starts the audio playing from the other buffer inside the SB IRQ (which happens when playing the current buffer finishes). It then prepares the next buffer inside the same IRQ handler, and then returns to the actual game play. This system is currently almost working, but there are gaps in the audio, which I believe occur when the SB IRQ on the MIPS side happens just after the previous 1/30th second block has been transferred to the ARM side. I think I need to delay the audio playing so that I don't keep running out of the currently playing buffer on the ARM7 side before the next buffer transfer has started from the MIPS side to the ARM9 side.

New transfer system performance

As I was recording the Doom demos from both the old system and the new transfer system, I noticed that the recording from the old system was considerably longer than the recording from the new system. Seems like this is one way to calculate the real-world speedup that I was able to achieve with the new transfer system. The new system is not fully optimized yet, but it does transfer all graphics and audio data, so the results should be quite comparable. The time it took from the start of the first AdLib note to the end of the final sound was 3:34.90 (or 214.90 seconds) using the old transfer system, and 3:06.94 (186.94 seconds) using the new transfer system. That is, with the new system Doom uses only 87% of the time to run the whole demo, so the new system does give a nice performance boost. Both of these timings were taken with the 360MHz CPU speed and 30fps screen refresh rate with scaling mode Zoom.
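The 87% figure follows directly from the two recording lengths. A minimal sketch of that arithmetic is below; the function names are invented for this example and are not part of DS2x86.

```c
/* Fraction of the old run time that the new system needs
   (a value below 1.0 means the new system is faster). */
double time_ratio(double old_seconds, double new_seconds)
{
    return new_seconds / old_seconds;
}

/* The same comparison expressed as a speedup factor. */
double speedup_factor(double old_seconds, double new_seconds)
{
    return old_seconds / new_seconds;
}
```

With the timings above, time_ratio(214.90, 186.94) comes out at roughly 0.87, and speedup_factor(214.90, 186.94) at roughly 1.15.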

Plans for the next week

During the remainder of my Xmas vacation I plan to first work on the more essential missing features of the new transfer system, like the whole EGA graphics mode support, and then hopefully still improve the SB digitized audio somewhat. I believe the SB audio support is not quite essential to be included in the next version, but the EGA support pretty much is. I also just added PC speaker audio support to DS2x86, as that is a new feature that has until now been completely missing. Using the old transfer system I would have needed to create a waveform and mix it into the 16-bit stereo buffer (which already had AdLib and SB digitized audio). I thought this was too much work for such a small feature so I never got around to implementing it. However, with the new transfer system all I need to do is to send the frequency value (a single 16-bit value) from the MIPS side, and the ARM7 can then use the PSG sounds to create a beep at that frequency. Very simple, and very similar to the way it works in the original DSx86. The only difference is that in DS2x86 the speaker frequency will only change at most 30 times a second, while in DSx86 it can change much faster.

The next released version of DS2x86 will have the new transfer system, but it will very likely still miss a lot of features. One feature that will probably be completely removed is the ability to take screen copies. This is because the actual screen image is now created on the ARM9 side, and there is no (fast) way to transfer it to the MIPS side for writing to the SD card. If I wanted to have screen copies, I should duplicate the screen handling to the MIPS side, and that would just increase the size of the code too much. In any case, the only thing I can promise for sure is that the next version (with version number 0.30) will run Doom faster than ever before, with much improved audio quality. Any other game or software might not work properly, so please be prepared to have many games still fail to run in the next version. But, I hope that when I get the next version released, you would be interested enough to test your favourite games and report to me in what way they fail to run. :-)

Anyways, have a nice holiday and the end of this year! It might be that I will release the next version on January 1st, 2012!

Dec 18th, 2011 - DS2x86 progress

Since the last blog post I have focused on getting the SB digitized audio working. That has been much more difficult than I originally thought. First, it took me several days to come up with a data transfer method that might support all the features that the SB digitized audio needs. Then I began implementing the auto-init DMA version as used in Doom. It took me several days to even get any sound out of that system, and the current status is that the audio is full of clicks and pops, and severe distortion.

I think the biggest problem in getting the audio working is the timing issue. Playing audio is very time-critical: the new samples need to be in the ARM7 playing buffer some time before they are actually played. I can only transfer new audio data from the MIPS side to the ARM9 side when the graphics are not being transferred, which means I can only transfer new audio samples about 30 times per second. Using a 512-sample FIFO transfer buffer, that gives a maximum sample rate of 512 * 30 = 15360Hz. Doom actually uses SB TIME_CONSTANT of 0xA6, which corresponds to (1000000 / (256-TIME_CONSTANT)) = 11111Hz. So, during every 1/30 second transfer window I should move around 370 samples. My buffer copying works in 128-byte increments, so most of the time the transfer handles 384 samples, and occasionally only 256 samples.
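The numbers in the paragraph above can be reproduced with a few lines of C. This is only a sketch: the function names are made up for this example, and next_transfer_size() is one hypothetical way of choosing between 256 and 384 samples per window (carrying the remainder forward), not necessarily how DS2x86 does it.

```c
#include <stdint.h>

/* Sample rate in Hz for an 8-bit SB time constant, using the formula
   quoted in the text: 1000000 / (256 - TIME_CONSTANT). */
uint32_t sb_sample_rate(uint8_t time_constant)
{
    return 1000000u / (256u - time_constant);
}

/* Average number of samples that must be moved in each transfer window. */
uint32_t samples_per_window(uint32_t sample_rate, uint32_t windows_per_second)
{
    return sample_rate / windows_per_second;
}

/* Highest sample rate that one FIFO buffer per window can sustain. */
uint32_t max_sample_rate(uint32_t fifo_samples, uint32_t windows_per_second)
{
    return fifo_samples * windows_per_second;
}

/* Hypothetical scheduler: send whole 128-sample blocks only, and remember
   the samples still owed so the long-term average matches the real rate. */
uint32_t next_transfer_size(uint32_t *owed, uint32_t per_window)
{
    uint32_t send;

    *owed += per_window;
    send = (*owed / 128u) * 128u;   /* round down to a 128-sample multiple */
    *owed -= send;
    return send;
}
```

For Doom's time constant 0xA6 this gives 11111Hz and 370 samples per window. Starting with nothing owed, the scheduler sends 256 samples in the first window and 384 in most of the following ones.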

Currently I use the same ring buffer on the NDS side for receiving the data from the MIPS FIFO (by the ARM9 processor), and for filling the audio playing buffer (by the ARM7 processor). I believe the clicks and pops are caused by this sharing of the buffer. The ARM9 does not know the position where the ARM7 is copying data from the buffer, so when the timing of the audio playing differs from the timing of the buffer filling (which will always happen as the ARM and MIPS have different base speeds for their timers), occasionally wrong samples are getting copied to the playing buffer.

I think I need to add a new ring buffer to the ARM9 side, and also attempt to sync the ARM7 and ARM9 buffer accesses so that I avoid accessing the same area of the ring buffer from both processors simultaneously. I am not yet sure how to avoid buffer under- or overflows when the actual audio playing speed differs from the buffer filling speed, and the audio is looping for long periods of time. I might need to do some on-the-fly adjusting of the playing frequency, or some such trickery.
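One common way to keep a producer and a consumer from touching the same area of a ring buffer is to give each side its own position counter. The sketch below illustrates that general idea only. The names and the buffer size are hypothetical, it is not the DS2x86 code, and it ignores the cache and shared-memory details that matter between the real ARM9 and ARM7.

```c
#include <stdint.h>

#define RING_SIZE 2048u  /* hypothetical size, must be a power of two */

typedef struct {
    uint8_t data[RING_SIZE];
    volatile uint32_t head;  /* total bytes written so far (producer only) */
    volatile uint32_t tail;  /* total bytes read so far (consumer only) */
} ring_t;

/* Bytes currently waiting in the buffer; unsigned wrap-around is harmless. */
uint32_t ring_used(const ring_t *r) { return r->head - r->tail; }
uint32_t ring_free(const ring_t *r) { return RING_SIZE - ring_used(r); }

/* Producer side: copy in as much as fits, return the number of bytes stored. */
uint32_t ring_write(ring_t *r, const uint8_t *src, uint32_t len)
{
    uint32_t n = ring_free(r);
    uint32_t i;

    if (len < n)
        n = len;
    for (i = 0; i < n; i++)
        r->data[(r->head + i) & (RING_SIZE - 1u)] = src[i];
    r->head += n;  /* publish only after the data is in place */
    return n;
}

/* Consumer side: copy out up to len bytes. On underrun the rest of dst is
   filled with 0x80 (the SB 8-bit midpoint) so the gap is silent. */
uint32_t ring_read(ring_t *r, uint8_t *dst, uint32_t len)
{
    uint32_t n = ring_used(r);
    uint32_t i;

    if (len < n)
        n = len;
    for (i = 0; i < n; i++)
        dst[i] = r->data[(r->tail + i) & (RING_SIZE - 1u)];
    r->tail += n;
    for (i = n; i < len; i++)
        dst[i] = 0x80;
    return n;
}
```

Because ring_used() is always known, the fill level could also serve as the input for the on-the-fly playing frequency adjustment mentioned above.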

All in all, this is quite a difficult challenge to get working properly. Luckily, I have a two-week Christmas vacation starting tomorrow, so besides the family- and Christmas-related things to do, I should now have more time to work on DS2x86. I still hope to get a version released during this year, although it looks like that version will not yet support all the graphics and audio features that the current DS2x86 does. But we shall see.

Dec 11th, 2011 - DS2x86 progress

For the past week I have been working on the new transfer system for the DSTwo SDK. The current status is that some of the graphics (TEXT, CGA and MCGA) work, AdLib audio works, and keyboard and touchpad handling works using the new transfer system. I am currently working on getting the SoundBlaster digitized audio working, as that is the most challenging data to transfer.

The original transfer system of the SDK is driven by the MIPS side, so that the MIPS side sends an interrupt to the ARM side whenever it has some data to send, and the ARM side then sends commands and requests data from the MIPS side. My new system works in a completely opposite way, the MIPS side does no transfers unless the ARM side requests data from it. The basic idea of the new transfer system is this:

1. Both the ARM and MIPS processors start running their code at about the same time. Both sides perform their initialization work etc.
2. The ARM side hooks into IRQ_CARD_LINE interrupt signal and begins waiting for an interrupt on that line.
3. The MIPS side sends an interrupt to the ARM side as the last operation of the initialization routine, immediately before it calls the user's main() function. This allows the ARM side to proceed.
4. The ARM side sends command 0xC5, which contains the current Real Time Clock values and works just like in the original SDK. The MIPS side receives this command in the cmd_line_interrupt handler and stores the RTC values into variables, for later use.
5. The ARM side waits for VBlank interrupt, the MIPS side runs the CPU emulation.
6. The ARM side sends command 0xC2 at every other VBlank interrupt (in the future I hope to make this selectable for 15/30/60fps screen refresh rate, but for now it is fixed to 30fps). This command was originally used in the SDK for sending audio data. In my new system the command parameters contain the current key and touchpad status, RTC seconds value, and a flag byte.
7. The MIPS side interrupt handler then sends the current palette data, current configuration state, and AdLib buffer data in the first 1024-byte block. It also tells the FPGA to send data_line_interrupt when the FPGA FIFO becomes empty.
8. The ARM side reads this whole 1024-byte block (where the configuration status contains the current graphics mode, for example), and goes to the appropriate screen blitting routine to wait for the MIPS side to send the actual graphics data.
9. The MIPS side gets a data_line_interrupt interrupt when the ARM side has read the full 1024-byte FIFO buffer, so it can fill the next 1024-byte block with the graphics data.
10. The ARM side keeps reading the graphics data until it has received enough for the full screen, at which point it sends command 0xC2 again, with the flag byte telling the MIPS side to stop transferring more graphics data.
11. The MIPS side receives this command, clears the FIFO and turns off the data_line_interrupt.
12. At this point the MIPS side just runs the CPU emulation, the ARM9 waits for the next VBlank interrupt, and the ARM7 plays AdLib audio, if any. The transfers will continue from the phase 6 above.

Adding the SB digitized audio into this system is rather difficult, as the audio buffers can be of any length, and they can be either one-shot or looping buffers, and at various sample rates. I am currently trying to add the SB audio transfer to the buffer end command (phase 10 above), so that the digitized audio would get transferred always after the graphics have been transferred.

With this new system everything related to data transfer on the MIPS side is handled within the cmd_line_interrupt and data_line_interrupt handlers, so that there is never any need to call the update_buf or such functions of the original SDK. I have actually removed most of the unneeded functions, but some of these still remain. I plan to eventually get rid of all the unnecessary overhead.

One additional advantage I got from this change, was that I don't need to allocate a timer interrupt on the MIPS side to emulate the VBlank signal. I can now get the timing from the interrupt the MIPS side receives from the ARM side, so that I can use the free timer for better SoundBlaster IRQ emulation with much more accurate timing.

Dec 4th, 2011 - DS2x86 progress

The only big improvement regarding DS2x86 I have achieved since the last blog post is that I have managed to get the AdLib audio emulation working. The AdLib audio emulation now runs fully on the ARM7 processor also in DS2x86, using the exact same code as in the original DSx86. Only the data transfer system is different (and it is actually still just a quick hack, until I get the whole MIPS-to-NDS transfer system redone). The primary goal for this enhancement was to make sure that the ARM7 audio system does run normally and that it is possible to handle the AdLib emulation on it.

I am currently working on the transfer system restructuring. I began by removing code that seemed non-essential, checking that the system still works, then removing some more code, and so on. It is nice to see how the plugin size gets smaller and smaller, with no change in the behaviour. There is actually A LOT of unneeded code and handshaking for the card interface. I'm not sure if much of it is there simply as a leftover from the early experimenting by SuperCard, or if there are some situations where all of that is needed. Almost none of it seems to be needed in DS2x86, though.

I have a long weekend to work on DS2x86, as next Tuesday is the Independence Day of Finland, and I also took the Monday off from work. I hope to be able to finish the new simplified transfer system during this time, so that I can then continue adding all the required functionality to take advantage of the new transfer system. It still looks like the new transfer system will be MUCH better than the original used by the DSTwo SDK. The only frustrating thing with the new system is the fact that I have spent a year fighting with the original SDK! During all that time the possibility to properly use the available hardware features existed, but SuperCard just did not make that available! Well, better late than never, I suppose.

Nov 27th, 2011 - DS2x86 progress

Yet another busy week at the office, so again the progress with DS2x86 has been rather limited. I have however managed to get the TEXT, CGA and MCGA mode graphics transfer to use the new faster system, where the MIPS side simply transfers the emulated VRAM contents using DMA to the DSTwo FPGA FIFO, and the ARM9 then does the actual work of translating the linear graphics buffer data to the NDS VRAM. Oops, quite a lot of acronyms in that sentence, hope it all made sense. :-)

The Mode-X graphics copying also partially works already, but I have not yet done anything for the EGA (16-color) graphics modes. I coded the small Mode-X enhancement to the MCGA mode so that I could get Doom graphics working. Doom seems to use triple-buffered Mode-X graphics mode, so all I needed to do was to send the 320x200 area starting from the correct position within the virtual 320x819 (= 256KB) frame buffer area. I was interested in getting Doom to work in the new transfer system, as it is currently running a bit too slow for it to be properly playable. I am happy to report that Doom does run noticeably faster with the new transfer system! There is no audio support yet, so adding that back might make it run marginally slower than without any audio, but that slowdown would only mean copying a few hundred bytes at 60 times a second or so, which will be pretty much unnoticeable. I don't have a proper method of calculating the actual speed difference, so sadly I can't show any numbers. The DOS SYSINFO that I have used for calculating the emulation speed does not show any difference, as I stop the emulation timers while the graphics and audio data are copied from the MIPS side to the NDS side. In the new system that copying takes so little time that I don't actually need to stop the timers, so it might be that SYSINFO will actually show the new system being slower, while in reality it is much faster! :-)
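The 320x819 figure is simply the 256KB of VGA memory divided into 320-pixel rows. The sketch below shows that calculation, plus one way to find where a visible page starts. It assumes the emulated VRAM is kept as one linear byte-per-pixel buffer and that the page position comes from the VGA start address, which in unchained Mode-X counts units of four pixels. Both assumptions and all names are for illustration only and may not match DS2x86 internals.

```c
#include <stdint.h>

#define VRAM_BYTES (256u * 1024u)  /* total VGA memory, one byte per pixel */
#define SCREEN_W   320u
#define SCREEN_H   200u

/* Number of complete 320-pixel rows that fit into the 256KB area. */
uint32_t virtual_rows(void)
{
    return VRAM_BYTES / SCREEN_W;
}

/* Assumption: each start address unit covers four pixels (one per plane). */
uint32_t page_pixel_offset(uint16_t start_address)
{
    return (uint32_t)start_address * 4u;
}

/* True when a full 320x200 page starting at this address stays inside VRAM. */
int page_fits(uint16_t start_address)
{
    return page_pixel_offset(start_address) + SCREEN_W * SCREEN_H <= VRAM_BYTES;
}
```

virtual_rows() returns 819, matching the virtual frame buffer height mentioned above.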

I ran into a small problem with the DSTwo transfer system when I first started working on the MCGA mode graphics copying. The card interface allows the transfer of 0, 4, 512 or 1024 bytes with one command. In addition, the MIPS side gets an interrupt when the NDS side has read all the data from the FIFO, so sending more than 1024 bytes is also possible. I wanted to first transfer the palette (256 * 16-bit palette value), and then the 64KB of screen VRAM contents. I thought I could simply use a 512-byte transfer first, and then continue with the 1024-byte transfers whenever the MIPS side gets the interrupt. However, I never got the interrupt when I attempted to do that. It turned out that the interrupt only happens after sending a full 1024-byte block, not after any other size. So, I had to send the palette in the first 1024-byte block, with nothing useful in the second half of that block, to get the MCGA mode to work. I hope to use that second part of the first 1024-byte block to send SoundBlaster audio data in the future, so that space won't get wasted.
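To make the block arithmetic concrete, here is a small sketch that counts the 1024-byte blocks in one MCGA frame under the scheme described above (palette padded to a full block, followed by 64KB of pixels). The names are invented for this example.

```c
#include <stdint.h>

#define BLOCK_BYTES 1024u  /* the only size that triggers the FIFO interrupt */

/* Number of full 1024-byte blocks needed for a payload, rounding up. */
uint32_t blocks_needed(uint32_t payload_bytes)
{
    return (payload_bytes + BLOCK_BYTES - 1u) / BLOCK_BYTES;
}

/* Unused bytes left in the last block of a payload. */
uint32_t spare_bytes(uint32_t payload_bytes)
{
    return blocks_needed(payload_bytes) * BLOCK_BYTES - payload_bytes;
}

/* One MCGA frame: 256 16-bit palette entries, then 64KB of VRAM. */
uint32_t mcga_frame_blocks(void)
{
    return blocks_needed(256u * 2u) + blocks_needed(64u * 1024u);
}
```

This gives 65 blocks per frame, with 512 spare bytes in the palette block, which is the space that could later carry the SoundBlaster data.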

I have also been cleaning up the whole transfer system, with the goal of eventually having a very simple system where I only need to send a single command at 60fps, with a single reply containing all the video, audio and configuration data I need. The commands sent from the NDS side can have 8 bytes worth of parameters. I would need to send key presses, touch screen coordinates, and RealTimeClock values to the MIPS side, and I think I should be able to fit all of those into the 8 bytes. The current DSTwo transfer system uses separate commands for all of those, and the video/audio transfer even has two commands, with the first one being a sort of "I see you have something to send, what is it?" command, for which the MIPS side then replies with "I want to send video data" or "I want to send audio data" or such, and then the ARM side sends a second command requesting that specific data, with the MIPS then finally sending the actual data.
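As a sketch of how the key, touch screen and clock values might fit into the 8 parameter bytes, consider the layout below. The field sizes and their order are purely hypothetical and chosen only to show that everything fits; the actual layout used by DS2x86 is not given here.

```c
#include <stdint.h>

/* Hypothetical input snapshot sent with every command. */
typedef struct {
    uint16_t keys;     /* one bit per button */
    uint8_t  touch_x;  /* 0..255 */
    uint8_t  touch_y;  /* 0..191 */
    uint8_t  hours;    /* RTC time of day */
    uint8_t  minutes;
    uint8_t  seconds;
    uint8_t  flags;    /* spare control bits */
} input_state_t;

/* Serialize into exactly 8 bytes, low byte of keys first. */
void pack_params(const input_state_t *in, uint8_t out[8])
{
    out[0] = (uint8_t)(in->keys & 0xFFu);
    out[1] = (uint8_t)(in->keys >> 8);
    out[2] = in->touch_x;
    out[3] = in->touch_y;
    out[4] = in->hours;
    out[5] = in->minutes;
    out[6] = in->seconds;
    out[7] = in->flags;
}

/* Reverse of pack_params(), as the receiving side would do it. */
void unpack_params(const uint8_t in[8], input_state_t *out)
{
    out->keys    = (uint16_t)(in[0] | (in[1] << 8));
    out->touch_x = in[2];
    out->touch_y = in[3];
    out->hours   = in[4];
    out->minutes = in[5];
    out->seconds = in[6];
    out->flags   = in[7];
}
```

Packing byte by byte, instead of copying the struct directly, keeps the wire format independent of struct padding and of the byte order on either processor.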

Anyways, quite a bit of work still remaining, but it was interesting to see Doom running reasonably fast using the new graphics transfer buffer. I do believe my reworking the transfer system will make DS2x86 much better than what it was. :-)

Nov 20th, 2011 - DSx86 version 0.40 released!

DSx86 0.40 release notes

Sverx has again been working on improving the screen scaling algorithms for DSx86. This time he figured out a smart new way to take advantage of the NDS hardware scaling and blending features in the Jitter mode. The new and improved Jitter mode in this version is just as fast as the plain Scale mode (as it is handled completely in hardware), but it also produces a result that is very close to the software-based Smooth scaling algorithm (in all the low-resolution modes)! Big thanks again to Sverx for his ingenious new scaling method!

DS2x86 progress

I have not made huge progress with DS2x86 during the past week, as I have been busy with some work-related things, including a business trip. I have however managed to get the text mode screen handling moved from the MIPS side to the ARM side. The strange palette problem I mentioned in the previous blog post was actually not related to palette handling, instead, I found an interesting feature in the FPGA code that the original DSTwo graphics transfer code uses. When sending the video data (with command 0xC1), the MIPS side code in the cmd_line_interrupt() routine in game.c module has the following code:

    case 0xc1://VIDEO 512*n
        isc1cmd = cmd_buf[7] & (( 1 << enable_fix_video_bit) | ( 1 << enable_fix_video_rgb_bit));

    case 0xc2://AUDIO 512*n
        //*(fpgaport*)write_addr_cmp_addr = (0x400-0x380) ;

        SET_ADDR_GROUP(GPIO_ADDR_GROUP1);
        *(fpgaport*)(cpld_base_addr + cpld_base_step) = (0x400-0x380) ; //(0x400-0x300) ;
        SET_ADDR_DEFT();

        *(fpgaport*)cpld_ctr_addr = (1<<fpga_mode_bit) |  (1<<fifo_clear_bit) | isc1cmd;
        *(fpgaport*)cpld_ctr_addr = (1<<fpga_mode_bit) | isc1cmd;

In the routine MP4_init_module() at the end of the same source file, where the transfer buffers and commands are prepared, is the following code:

    buf_st_temp.nds_cmd=(((1 <<enable_fix_video_bit ) |(1 <<enable_fix_video_rgb_bit ))<<24) | (buf_video_up_0<<16) | (VIDEO_UP <<8) | 0xc1;
    buf_st_temp.type= 0;
    pmain_buf->buf_st_list[buf_video_up_0] = buf_st_temp;

The transfer system on the NDS side sends the highest byte of the nds_cmd field of the buffer struct as the cmd_buf[7] content when requesting the video data, and thus the command 0xC1 always has those enable_fix_video_bit and enable_fix_video_rgb_bit bits set. My understanding is that when those bits are set, the FPGA code always turns on the highest bit of every 16-bit halfword in the transfer buffer (while sending it via FIFO to the NDS side). This creates 16-bit ARGB values with the alpha bit set on-the-fly for the 16-bit color values that the DSTwo SDK normally uses. But, when sending graphics data in some other format, you need to NOT set those bits on! In my case when sending the text mode data (which on the x86 is formatted as an 8-bit character followed by an 8-bit attribute byte), the attribute byte always had the highest bit set, which caused the background color to use the light colors. This caused the DOS background to be gray when it should have been black, so I first thought I had a problem with the palette. As I don't plan to send any 16-bit color values in DS2x86, I simply commented out those two bits, thus forcing the MIPS side to always send the graphics data as-is.
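The effect on text mode data can be reproduced in a few lines. The sketch below imitates, in software, what the FPGA appears to do according to the description above (setting the highest bit of every 16-bit halfword), and then extracts the background color from a text mode cell. The function names are made up for this example.

```c
#include <stdint.h>

/* Imitate the described FPGA behaviour: set bit 15 of every halfword. */
void force_high_bit(uint16_t *buf, uint32_t count)
{
    uint32_t i;

    for (i = 0; i < count; i++)
        buf[i] |= 0x8000u;
}

/* A text mode cell read as a little-endian halfword has the character
   in the low byte and the attribute in the high byte. */
uint8_t cell_attribute(uint16_t cell)
{
    return (uint8_t)(cell >> 8);
}

/* Background color index, treating the top attribute bit as intensity. */
uint8_t attr_background(uint8_t attr)
{
    return (uint8_t)(attr >> 4);
}
```

A blank cell 0x0720 (space, light gray on black) becomes 0x8720 after force_high_bit(), so its background color index changes from 0 (black) to 8 (dark gray), which is the gray DOS background described above.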

Currently I am working on the CGA graphics mode data transfer. This is rather similar to the text mode (on the MIPS side), it can simply send 32KB from the emulated x86 segment address 0xB800. The NDS side just needs to handle the received data differently. After I get this working, the next mode will be the MCGA graphics mode. In that mode I need to send 64KB from the x86 segment address 0xA000, plus additional 512 bytes (or 768 bytes if I move the 24-bit palette -> 16-bit palette conversion to the NDS side) for the palette.

The remaining graphics modes (EGA and Mode-X modes) are somewhat more difficult, as sending the whole graphics memory area would be 256KB, which is more data than I want to (or probably even can) send per frame. I think I need to limit the data to be sent to 256x192 (or 320x200) bytes, but as that won't be enough for the high-resolution modes like 640x480 16-color VGA mode, I still need to figure out a smarter way to handle this. Only after these changes can I start working on the audio stuff.

Nov 13th, 2011 - DS2x86 progress

For the past week I have mostly been just studying the DSTwo SDK sources. I have not made many changes to DS2x86 yet, but I have been thinking of ways to improve the data transfer, and I have also been cleaning up the interface code that transfers the data between the MIPS side and the ARM side. I think I now have a rudimentary understanding of how the system works, so I can finally start changing it. The problem with these changes is that they will make DS2x86 not run properly for quite a long time, as I need to change a lot of the core functionality.

I plan to make several major changes to the data transfer interface, with the most notable changes being the following:

The lower screen keyboard handling moved completely to the ARM side. This will make the resulting file smaller, as I only need to use a 4bpp (16-color) version of the keyboard image instead of the 16bpp version that DS2x86 currently uses. Also, changing any config texts required me to send the whole 256x192x2 byte image from MIPS to ARM. This is already mostly working, except the HDD led functionality is still missing, and I still need to completely redo the debug screen handling (and the BSOD system). I also plan to add the keyboard key color change when you press a key (as in DSx86), as I can now handle that similarly to DSx86.
I also plan to transfer the top screen data directly from the x86 emulated data (which can be in various formats depending on the graphics mode, anything between 1bpp and 8bpp). Until now I have had to convert and copy this data into the 256x192 16bpp buffer that the DSTwo SDK can then send to the ARM side. All of this converting and copying has taken time from the CPU emulation. The actual transfer uses DMA, so it should not take much time away from the CPU emulation any more. I have just today begun experimenting with this, by sending the 32KB text mode data (from x86 address B800:0000) to the ARM side, and having the ARM side then convert and write this data to the actual ARM VRAM using the 6x8 font. This seems to work in principle, I am just having some palette-related problems with that currently. Of course I would not absolutely need to copy more than 80x25 (or even only 42x25 if I really wanted to minimize the transfer size) characters in text mode, but even the 32KB is much less than the 256x192x2 = 96KB that the original method required. And, if I were to use code to select the amount of data to be sent on the MIPS side, I would lose some of the speed of simply letting the DMA system send the whole 32KB memory area.
The next big change I would like to make is to let the ARM side command the screen copying. The original DSTwo SDK method is such that the MIPS side commands everything, the ARM side just waits for commands and then executes them. I would prefer the ARM side to use the VBlank interrupt to make the MIPS side start sending the screen data, so that I could keep the data sending synced to the actual VBlank interval. Until now I have used a timer on the MIPS side that tries to emulate the VBlank interval, but using the actual VBlank interval would make more sense. This change would switch the MIPS/ARM master/slave relationship upside down, so I'm not sure how much work this would mean and if it would work in practice.
After the screen handling is done, I need to look into the audio handling. AdLib emulation should be simple as it is not so timing critical and sends very little data, but the SoundBlaster digitized DMA audio is more difficult. It needs two-way data transfer, as the emulated DMA registers on the MIPS side should be updated as the data buffer gets played on the ARM7, so this is something that I still need to spend some time thinking about. I have some ideas, but I need to first see about how I can get the video side working before I can then start working on the audio system.

All in all, a lot of work still remains before DS2x86 will take better advantage of the new SDK possibilities. No new versions for several weeks yet, possibly not even this year. But, I hope the next version will then have some noticeable improvements over the current version.

Nov 6th, 2011 - ds2_firmware enhancement work for DS2x86

Bug in original ds2_firmware sources

First off, if you have downloaded my ds2_firmware sources before yesterday, those still had a problem. I found and fixed the problem yesterday, so please download them again (or, if you have already made changes to them, read on for my description of what the problem was).

I was aware that my ported ds2_firmware sources did not produce a fully working firmware for DS2SDK 1.2 already when I released them. However, the firmware compiled from my ported sources misbehaved similarly to the firmware built from the original sources released by SuperCard. Thus, my port was OK, the problem seemed to be in the original sources, which is why I decided to release my port at that time. After that I then started to make my port work properly, by comparing the disassembly of the working ds2_firmware.dat (dated April 30th, 2010 with a size of 415 744 bytes, released in the /tools directory of the SDK 1.2 package) to a dump of the arm9.elf file built from the SuperCard sources.

The first difference I found was that the working ds2_firmware.dat had the while (CARD_CR2&CARD_BUSY); delay loops still in place, even though the sources released by SuperCard had all of those commented out. Those are in the beginning of practically all the routines in the iointerface.cpp source file. I uncommented those, but that still did not fix the problem. That did however point out that the source code released by SuperCard is actually not the same source code that they have used when building the ds2_firmware.dat themselves!

It took me considerably longer to find the next difference, which then turned out to be the actual problem in the sources. The original sources have a routine that waits until the MIPS side has sent a certain number of bytes to the ARM9 side using the card fifo:

int waitfifo_full_len(int len)
{
    u32 temp;
    delay_times_0 = fifo_over_time ;
    while(1)
    {
        temp=cardcommand_r4_nowait(nds_fifo_cmd_read_state,0,0);
        if (((temp>>nds_fifo_read_full_bit)&1) ==1)
        {
            break;
        }
        else if (((temp>>nds_fifo_len_bit)&nds_fifo_len_mask)  >=len)
        {
            break;
        }

        if (delay_times_0 == 0)
        {
            return 1;
        }
    }
    return 0;
}

I used iDeaS to debug and disassemble the original ds2_firmware.dat. A disassembly of this routine begins at logical address 0x02002590, and the disassembled routine looks like the following:

A dump from the ELF file of the ARM9 code created from the SuperCard sources instead looks like this:

020019c8 <_Z17waitfifo_full_leni>:
 20019c8:	e92d4038 	push	{r3, r4, r5, lr}
 20019cc:	e59f4058 	ldr	r4, [pc, #88]	; 2001a2c <_Z17waitfifo_full_leni+0x64>
 20019d0:	e3a03014 	mov	r3, #20
 20019d4:	e1a05000 	mov	r5, r0
 20019d8:	e5843000 	str	r3, [r4]
 20019dc:	ea000004 	b	20019f4 <_Z17waitfifo_full_leni+0x2c>
 20019e0:	e1530005 	cmp	r3, r5
 20019e4:	2a00000c 	bcs	2001a1c <_Z17waitfifo_full_leni+0x54>
 20019e8:	e5943000 	ldr	r3, [r4]
 20019ec:	e3530000 	cmp	r3, #0
 20019f0:	0a00000b 	beq	2001a24 <_Z17waitfifo_full_leni+0x5c>
 20019f4:	e3a01000 	mov	r1, #0
 20019f8:	e1a02001 	mov	r2, r1
 20019fc:	e3a000e0 	mov	r0, #224	; 0xe0
 2001a00:	ebffff63 	bl	2001794 <_Z21cardcommand_r4_nowaithjj>
 2001a04:	e59f3024 	ldr	r3, [pc, #36]	; 2001a30 <_Z17waitfifo_full_leni+0x68>
 2001a08:	e1a029a0 	lsr	r2, r0, #19
 2001a0c:	e2100002 	ands	r0, r0, #2
 2001a10:	e0023003 	and	r3, r2, r3
 2001a14:	0afffff1 	beq	20019e0 <_Z17waitfifo_full_leni+0x18>
 2001a18:	e3a00000 	mov	r0, #0
 2001a1c:	e8bd4038 	pop	{r3, r4, r5, lr}
 2001a20:	e12fff1e 	bx	lr
 2001a24:	e3a00001 	mov	r0, #1
 2001a28:	eafffffb 	b	2001a1c <_Z17waitfifo_full_leni+0x54>
 2001a2c:	02063b28 	.word	0x02063b28
 2001a30:	000003fe 	.word	0x000003fe

The sources have obviously been compiled with a different GCC version, but the most peculiar difference is in how the code tests the return value (in register r0) of the cardcommand_r4_nowait() function ("bl 0x0200242C" in the working version and "bl 2001794" in the compiled version): the working version tests for bit 1 (using the tst opcode), but the compiled version tests for bit 2 (using the ands opcode)!

I checked whether cardcommand_r4_nowait() works differently in the two versions, in case that would explain the different bit, but the two seemed to be similar. So, I next looked at where nds_fifo_read_full_bit comes from, and found out that it is defined in game_define.h like this:

#define cpu_write_Full_bit      1

#define nds_fifo_read_full_bit  cpu_write_Full_bit

So, the check for ((temp>>nds_fifo_read_full_bit)&1) ==1 results in ((temp>>1)&1) ==1 which is the same as (temp & 2). So the compiled version is correct as far as the sources are concerned, but it is very strange that the original ds2_firmware.dat tests a different bit! I decided to check what happens if I simply change the code to test for the same bit as the original ds2_firmware.dat (in this routine and also in the waitfifo_empty() routine, which had a similar but opposite difference). And, curiously, after this change the new ds2_firmware.dat began to work correctly! So, the ds2_firmware.zip source code package I have on my download page now has this change in the iointerface.cpp source file, and I also decided to get rid of the game_define.h file completely and define the few needed values at the top of iointerface.cpp itself. This made the source code package somewhat smaller and clearer.
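The equivalence between the shift-and-mask test in the sources and the single-mask test in the compiled code can be checked with a few lines of C. This is only a small sketch: the two helper function names are made up for illustration, and only the value of nds_fifo_read_full_bit comes from the game_define.h lines quoted above.

```c
#include <stdint.h>

/* Value quoted from game_define.h above. */
#define nds_fifo_read_full_bit 1

/* The test as it is written in the SuperCard sources. */
static int full_flag_shift(uint32_t temp)
{
    return ((temp >> nds_fifo_read_full_bit) & 1) == 1;
}

/* The test as the compiler emitted it (ands r0, r0, #2). */
static int full_flag_mask(uint32_t temp)
{
    return (temp & 2u) != 0;
}
```

Both helpers agree for every input, which is why the compiled code matches the sources even though it does not match the working ds2_firmware.dat.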

DS2x86 enhancement work

After I got the ds2_firmware to work correctly, I began looking into enhancing DS2x86 to take advantage of the new possibilities. The first thing I wanted to do was to have the lower screen (the virtual keyboard) updated on the ARM side, so that I did not need to send the whole screen image (256x192 pixels at 16-bit color!) every time a simple config text or HDD "led" changes. After some experimenting I managed to have the ARM side show the virtual keyboard. I commented out all the lower screen sending routines from my DS2x86 sources, so at first I could not see any config strings any more. I then looked into how the commands (like ds2_setSwap()) work, and noticed that they are actually quite simple. All commands are sent as a 512-byte block, with the first 60 bytes containing info of up to 20 different commands that can be sent simultaneously, and the remaining bytes being a free data area for the commands to use.

I quickly implemented a new IS_SHOW_CONFIG command, with the data containing all the strings that need to be shown on the lower screen. The MIPS side builds and sends the command, and when the ARM side receives this command it parses the strings from the command data and displays them on the lower screen config areas. This seems to work fine as long as I don't attempt to send another command immediately before or after this new additional command. So, I still need to look more closely into how the commands interact with the screen and audio sending and such. Perhaps I need to change the I/O interface to always send a combined command, and then have my routines simply append their commands to this master command structure.
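To make the idea more concrete, here is a rough sketch of how strings could be packed into the free data area of such a 512-byte block and parsed again on the receiving side. Only the 512-byte block size and the 60-byte command info area come from the description above. The struct, the function names and the choice of NUL-terminated strings are my own assumptions for illustration, not the actual SDK or DS2x86 code, and the internal format of the 60-byte info area is left untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMD_BLOCK_SIZE  512                       /* from the text */
#define CMD_INFO_SIZE   60                        /* from the text */
#define CMD_DATA_SIZE   (CMD_BLOCK_SIZE - CMD_INFO_SIZE)

/* Hypothetical view of one command block. */
typedef struct {
    uint8_t info[CMD_INFO_SIZE];   /* info for up to 20 commands (format not shown) */
    uint8_t data[CMD_DATA_SIZE];   /* free data area for the commands */
} cmd_block;

/* Sender side (assumed): append NUL-terminated strings to the data area.
   Returns the number of data bytes used, or -1 if they do not fit. */
static int pack_strings(cmd_block *blk, const char *const *strs, int count)
{
    size_t used = 0;
    for (int i = 0; i < count; i++) {
        size_t len = strlen(strs[i]) + 1;          /* include the NUL */
        if (used + len > CMD_DATA_SIZE)
            return -1;
        memcpy(blk->data + used, strs[i], len);
        used += len;
    }
    return (int)used;
}

/* Receiver side (assumed): collect pointers to the packed strings.
   Returns the number of strings found. */
static int unpack_strings(const cmd_block *blk, int used, const char **out, int max)
{
    int n = 0;
    int pos = 0;
    while (pos < used && n < max) {
        const char *s = (const char *)blk->data + pos;
        out[n++] = s;
        pos += (int)strlen(s) + 1;
    }
    return n;
}
```

Keeping the strings entirely inside the data area means a combined "master" command block could carry them alongside other commands, as long as each command knows its own offset into the data area.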

I haven't yet looked into how to send more data from the ARM side to the MIPS side. I would like to move the whole keyboard/touchpad handling to the ARM side, so that the MIPS side would just receive the x86-style key scancodes to put into the keyboard buffer. But, in any case, it looks like enhancing the I/O layer with additional commands is pretty simple, so I should not have any major problems moving the AdLib emulation to the ARM7, for example. Of course I might still run into some problems with that, but at the moment it looks quite doable.

Nov 2nd, 2011 - Port of ds2_firmware sources for a recent DevKitPro released!

Okay, the full source code for my port of the ds2_firmware sources to a recent DevKitPro is now available for download from my DSx86 download page. Note that this is not a complete implementation with full functionality, this is just a simple starting point for your own experiments.

Nov 1st, 2011 - DSTwo SDK 1.2 work continues

First off, thanks for letting me know that I had mistakenly dated the previous blog entry to Oct 23rd, although it was obviously meant to be Oct 30th. Yet another copy-paste error, I seem to do those a lot. :-)

Anyways, just a quick update about the ds2_firmware situation. After writing the previous blog post, I have managed to debug both the ARM7 and ARM9 code, and currently it looks like all the processors (ARM9, ARM7 and MIPS) start up fine, and successfully run their initialization code. ARM7 goes to the idle loop of the DevKitPro template, ARM9 starts running the main routine and gets the ds2_io_init messages from the MIPS side, but does not seem to get any messages after that. The MIPS side obviously also runs up to the main code where it calls the ds2_io_init() function. So, there is still some problem with the data transfer routine, either I broke something when I fixed the numerous warnings caused by the original code, or my skipping the audio initialization on the ARM side skipped too much stuff and the data transfer stops working because of that. But, all in all, it looks like I am reasonably close to getting the system to actually work.

Edit: Okay, I got it to work! Looks like the problem was caused by my commenting out too much of the audio code. Seems that the audio buffer handling on the NDS side is essential for the whole data transfer system to work. Next I'll clean up the code and remove all non-essential extra code I have added, so that I can then release a really bare-bones source package that can be built using the latest DevKitPro, for all of you who are interested in working on this to continue and enhance the system for your own needs.

Oct 30th, 2011 - DSTwo SDK 1.2 work continues

For the past week I have been porting the ds2_firmware.dat (the NDS side of the DSTwo SDK) to the latest devKitPro. I have now managed to compile and link the new ds2_firmware fine, but I have not yet managed to get it to actually work. At first I was a bit at a loss as to how to determine what goes wrong, until I noticed that the iDeaS emulator can actually run the firmware. Obviously it does not get further than where the communications with the MIPS processor side begin, though. However, when testing my own version I noticed that it does not even get that far.

I then debugged in iDeaS both my firmware version and the original working version, and immediately noticed that the entry points differ. In the original firmware the ARM9 side begins running at address 0x2000800, and the ARM7 side begins running at 0x2380000, while in my new firmware the entry points are 0x2000000 and 0x37F8000. I then spent quite a while trying to get the entry points in my ds2_firmware to match those of the original firmware. The original firmware Makefile gives these entry points as parameters to the ndstool command, but I noticed that when using .elf input files, the entry point parameters were ignored by ndstool. So, next I attempted to use a similar Makefile as in the original firmware, where the .elf files are first converted to binaries with suffixes .arm9 and .arm7, and these are then given to ndstool. This way I could change the entry points, but it also mixed up the function call addresses and such so that the code just crashed. I returned to using the .elf files, and attempted to change the entry points with the linker scripts ds_arm7.ld and ds_arm9.ld. This seems to work for ARM9, but for ARM7 the original firmware linker script still keeps the entry point at 0x37F8000, even though the code actually starts running from 0x2380000. The ds_arm7_mp4_crt0.s code contains special code that moves the data from 0x2380000 to 0x37F8000 when it starts executing, but I am currently not quite sure how to make the code load to 0x2380000 using the .elf file, or if that is even required.

When attempting to run the firmware using No$GBA, it gives the following error for the original firmware: "Secure Area below 4000h, cartridge won't work on real NDS", and after clicking OK, No$GBA crashes. However, I don't get this error message when attempting to run my firmware, No$GBA simply crashes immediately. So, there is still something different in my firmware version compared to the original, but I have not yet been able to determine what the difference is. I have also compared the files using a hex editor, but that has not shown any differences that would immediately point to the problem. In both firmware files the ARM9 code begins at offset 0xA00, and the header at the start of the file seems to contain proper data. So, I suspect that it is the ARM7 side that is currently the problem.

All in all, this has mostly been somewhat frustrating trial-and-error, as I have no idea how close to a solution I am. The next step I plan to do is add some debug stuff at the beginning of both ARM9 and ARM7 init codes, so that ARM9 would show something visible on the screen and ARM7 would start some audio, so that I can determine if the processors at least start running the firmware code.

Oct 23rd, 2011 - DSTwo SDK 1.2 work

During the last week I was rather busy with other things besides this project. We have a deadline approaching for the project I work on in my daytime job, so after a full workday of coding I have been a bit too tired to work on DS2x86. I have mainly done some experiments and studying of the recently released DSTwo SDK 1.2 sources, especially on the NDS side.

The only progress in DS2x86 I have managed to make with the DSTwo SDK is that I have switched DS2x86 to use the new version 1.2. At first I got a black screen, but pretty soon I realized that this was caused by the SDK 1.2 again using the same format of the st_buf structure as what was in use in the original SDK. The SDK version 0.13 had an extra member in this structure, and as I use the fields of this structure directly in my DS2x86 code, this caused the problem. After I aligned the fields properly in DS2x86, it began to work with SDK 1.2. It almost looks like the new 1.2 version is built on top of the original 0.12 (or 0.11) version of the SDK, instead of the previous 0.13 version. I haven't studied the differences very closely, so I'm not sure if this is really the case.
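The effect of an extra structure member on code that uses the fields directly can be illustrated with offsetof. The two structs below are purely hypothetical stand-ins: I am not reproducing the real st_buf definition here, and the member names and the position of the extra member are assumptions made only to show how the offsets shift.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout standing in for the original SDK and SDK 1.2. */
struct buf_plain {
    uint32_t flags;
    uint32_t offset;
    uint32_t len;
};

/* Hypothetical layout standing in for SDK 0.13, with one extra member. */
struct buf_extra {
    uint32_t flags;
    uint32_t extra;    /* the additional member shifts everything after it */
    uint32_t offset;
    uint32_t len;
};

/* How far a field moved between the two layouts, in bytes. */
#define FIELD_SHIFT(field) \
    ((long)offsetof(struct buf_extra, field) - (long)offsetof(struct buf_plain, field))
```

Code compiled against one layout but handed the other reads every field after the extra member from the wrong position, which is the kind of mismatch that can easily end in a black screen.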

The next step was to try and build the NDS side from the sources, and this is where I spent most of last week. I needed to search for the correct DevKitPro revision from the svn repository, but even after I found revision 1949 (from August 2007) with which I managed to compile the code (with quite a few warning messages, though), I was not able to link it due to some missing files. Luckily a GBATemp user called Normmatt was able to provide me with a more complete old DevKitPro version, which was able to link ds2_firmware.dat completely, though again with a lot of warnings. Anyways, this version of ds2_firmware.dat is still slightly different from the original included in the SDK 1.2, but it mostly works. I can start up DS2x86 with it, and type commands on the DOS prompt, but whenever there is disk activity (so the bottom screen should blink the "HDD" text), something goes wrong and eventually the system hangs.

I am not terribly concerned about that problem, though, as my current goal is to port the ds2_firmware.dat sources over to the latest DevKitPro, so that I can really start working on using the ARM7 for AdLib emulation and such. My primary goal is simply to have the touchscreen/key input transferred to the MIPS side, and the top screen data transferred back. After I get this done, if there is interest I might release the ported sources so that if any of you are interested in completing the lower screen and audio transfer for the SDK you can do so. I will continue with the data transfers specific to DS2x86, so after that the sources will not be of much use to other coders.

I have just managed to create the build environment for ds2_firmware.dat using the latest DevKitPro, but it does not have any DSTwo SDK-specific features in it yet. That's what I plan to do during the upcoming weeks. I'll keep you updated on my progress with my blog posts. This work means that no new versions of DSx86 or DS2x86 will be released for a few weeks now, sorry for that. I hope that I can eventually make some really worthwhile improvements to DS2x86 using the new SDK 1.2, though.

Oct 16th, 2011 - DS2x86 version 0.25 released!

Version 0.25 Release Notes

First off, sorry for the stupid bug that crept into the previous version. I copied some paging enhancements to several string opcodes, and forgot to change the jump address in one of the code clips I copied, so that when the game tries to move unaligned 16-bit values in memory, it actually jumps to code that moves 32-bit values. This means that too much data is copied, overwriting something in memory, which in turn causes either erratic behaviour or in many cases a blue screen crash. I use Doom as my sanity-check game, and always test that I haven't broken the code completely before release by running it for a while. However, Doom did not happen to use this unaligned memory copy, so I did not find this problem before the release.
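The size of the overrun is easy to see in a simplified C model. This is not the DS2x86 code (which is written in assembly); the function names are mine, and the model ignores segments, direction flag and paging. It only shows that the same repeat count moves twice as many bytes when the 32-bit handler runs in place of the 16-bit one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified REP MOVSW: copy `count` 16-bit items, return bytes written. */
static size_t rep_movsw(uint8_t *dst, const uint8_t *src, size_t count)
{
    memcpy(dst, src, count * 2);
    return count * 2;
}

/* Simplified REP MOVSD: copy `count` 32-bit items, return bytes written. */
static size_t rep_movsd(uint8_t *dst, const uint8_t *src, size_t count)
{
    memcpy(dst, src, count * 4);
    return count * 4;
}
```

With a count of 4, the word version writes 8 bytes while the doubleword version writes 16, so whatever lives in the 8 bytes after the intended destination gets overwritten.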

All in all, this version is mostly just a maintenance version and has the following improvements:

Fixed a copy-paste bug in REP MOVSW string opcode (as mentioned above). This fixes the BSOD in Heretic, Hexen, and various problems in many other games.
Implemented INT10 calls AX=1008 (READ OVERSCAN (BORDER COLOR) REGISTER), AH=12/BL=34 (ALTERNATE FUNCTION SELECT (VGA) - CURSOR EMULATION), AH=F1/DX=0020 (EGA Register Interface Library - WRITE ONE REGISTER - Miscellaneous Output register), AH=F1/DX=0028 (EGA Register Interface Library - WRITE ONE REGISTER - Feature Control register).
Implemented missing 66-prefix variations for LFS and LGS opcodes (NORM).
Implemented missing RCL and RCR opcodes using 32-bit registers (SWS).
Implemented read/write to/from CPU debug registers (RAYMAN).
Enabled directory access using the alias of a long directory name. I had to hack into the DSTwo SDK directory.c source to enable this. I compared the source with the current NDS libFAT sources (which it is based on), and found a difference. In the DSTwo SDK version the check if the directory alias (the short 8.3 version of the long name) matches the requested directory name was simply commented out. I have no idea why, and hopefully my re-enabling this feature does not cause any new problems in file and directory handling. It is always rather scary changing the low-level file access library functions.
Ignore writes to I/O ports 0x140..0x14F (DESCENTR).

DSTwo SDK v1.2 released, with full source!

I got a very interesting email early Saturday morning, letting me know that SuperCard has released DSTwo SDK v1.2! This turned out to be a major thing, as they have now finally released the full source code, including the Nintendo DS side! This opens a lot of new possibilities when working with the SDK, the most immediately interesting to me is the possibility of moving my AdLib emulation from the MIPS side to the ARM7 side. This would make the actual CPU emulation run faster, and would also decrease the amount of data that needs to be transmitted between the processors. In the long term it might also be possible to move the bottom screen (keyboard) handling to the ARM9 side, or perhaps even to move everything besides the CPU emulation away from the MIPS processor! It would be very cool if the ARM9 would handle the full VGA emulation and ARM7 the full SoundBlaster emulation, with the MIPS side simply feeding them data in the original format of the x86 hardware!

However, there is a problem in compiling this source code, as it looks to have been built with an ancient DevKitPro r17 (the current version is r36). I haven't been able to locate this old version, so if you happen to know where to get it please let me know! Of course it would be best to update the SDK parts so that it could be compiled with the latest DevKitPro, but even to do that it would be very important to have the r17 source code to be able to distinguish between differences in the DevKitPro versions and the actual code needed by the DSTwo SDK.

The source code is not very well commented either, so it will take a while to decipher what each of the functions does and how the data is actually transferred, but I am very much looking forward to changing the DS2x86 architecture to take advantage of the new possibilities. This is so interesting that I will probably put the paging features and such on hold and switch my focus to taking better advantage of the SDK. This might mean a few weeks' break until I release the next version, but hopefully this 0.25 version has no major new problems.

Oct 9th, 2011 - DS2x86 version 0.24 released!

Okay, finally by the end of last week I had managed to fix the most frequent bugs in my preliminary paging support, so that Descent 2 Demo ran the intro demo and also allowed me to start a new game fine (most of the time). It still occasionally crashes, but this happens very rarely. I even added some additional logging to try and figure out the reason for the occasional crash, but when I then tried again and again to make this happen, it ran fine for the whole two hours I spent with this. Quite annoying.

Anyways, this version now has the basic paging (virtual memory) framework in place, but there is a large number of opcodes that do not support paging properly yet. Also, only the MCGA graphics mode is supported while paging is on, using 16-color modes or Mode-X modes will display corrupt graphics (if anything). It is quite likely that any software using paging will crash at some point. If you can send me the debug logs from these situations, and let me know what software it was that crashed, it might help me focus on the correct opcodes to enhance for paging support in upcoming versions.

I also fixed a couple of bugs that had been in all prior versions, the most serious of these was a bug in the REPNE (repeat while not equal) string opcodes, which would occasionally run the REPE (repeat while equal) version instead. This obviously caused the exact opposite effect to what was supposed to happen, so this may have caused a variety of problems in various games.
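To show how different the two prefixes are, here is a simplified C model of the SCASB string opcode with each prefix. It is only a sketch with made-up function names, not the emulator code, and it returns the number of bytes examined instead of updating CX, DI and the flags.

```c
#include <stddef.h>
#include <stdint.h>

/* REPNE SCASB: keep scanning while the byte is NOT equal to al.
   Returns the number of bytes examined (including the matching one). */
static size_t repne_scasb(const uint8_t *buf, size_t cx, uint8_t al)
{
    size_t n = 0;
    while (n < cx) {
        if (buf[n++] == al)
            break;              /* match found: stop */
    }
    return n;
}

/* REPE SCASB: keep scanning while the byte IS equal to al.
   Returns the number of bytes examined (including the differing one). */
static size_t repe_scasb(const uint8_t *buf, size_t cx, uint8_t al)
{
    size_t n = 0;
    while (n < cx) {
        if (buf[n++] != al)
            break;              /* mismatch found: stop */
    }
    return n;
}
```

A typical use of REPNE SCASB is searching for the terminating zero of a string. Running the REPE version on the same data stops right at the first non-zero byte, so the caller ends up with a completely wrong length.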

Next I would like to work on adding JEMM support, so that I could perhaps eventually get Wing Commander Armada running. JEMM is a version of Expanded Memory Manager, and it looks like DOSBox has that built in, so I think DS2x86 could also include that. There is a small compatibility problem in that JEMM (like any other expanded memory manager) assumes that expanded memory and extended memory share the same memory space. In DS2x86 I have completely separate memory areas for expanded and extended memory, just to keep things simpler. I need to look into either switching to the same memory area and letting the JEMM emulation handle the EMS, or still keeping the areas separate and only implementing the JEMM features that are needed to make Wing Commander Armada (and perhaps other games that use an expanded memory manager for their protected mode needs) running.

Also, adding paging support has slowed down some parts of the code more than I would have liked, so I also plan to profile the performance of my emulation and attempt to figure out ways to again speed it up somewhat. These things will surely keep me busy for the coming weeks.

Oct 2nd, 2011 - DS2x86 progress

My work with the paging support for DS2x86 still continues. I did manage to find and fix one problem that caused the internal sanity check failures I mentioned in my previous blog post. This problem made the game every now and then drop into debugger with an error message stating that the NT (Nested Task) flag was not correct in IRET opcode. This turned out to be caused by a race condition between my TaskSwitch handling and the hardware IRQ handling. The TaskSwitch handler sets the NT flag on, which disables interrupts in DS2x86, but it was possible that the hardware interrupt happened after the TaskSwitch handler began executing but before it had turned on the NT flag. In this situation the hardware interrupt had already changed the opcode table to point to the IRQ handling code, and thus the interrupt got executed even when the NT flag was on after the TaskSwitch handler returned. I added code to reset the opcode table pointer after the NT flag gets turned on, which fixed that problem.

Another problem was much more difficult to track down, and it took me pretty much the whole of last week. After the previous fix, the Descent 2 Demo usually ran up to the menu fine (though it still crashed occasionally before reaching that far). When selecting New Game or View Demo from the main menu, it loaded for a little while but then always dropped down to DOS with an error message Error: Error reading ControlCenterTriggers in gamesave.c. Luckily, the source code for Descent 2 is freely available, so I was able to download the source code and look into the gamesave.c source file and see the C code that causes that error. The source code snippet in question looks like this:

    //================ READ CONTROL CENTER TRIGGER INFO ===============

    if (game_fileinfo.control_offset > -1)
    {
        if (!cfseek( LoadFile, game_fileinfo.control_offset,SEEK_SET ))
        {
            for (i=0;i<game_fileinfo.control_howmany;i++)
#ifndef MACINTOSH
                if (cfread(&ControlCenterTriggers, game_fileinfo.control_sizeof,1,LoadFile)!=1)
                    Error( "Error reading ControlCenterTriggers in gamesave.c", i);
#else
                ControlCenterTriggers.num_links = read_short(LoadFile);
                for (j=0; j<MAX_CONTROLCEN_LINKS; j++ )
                    ControlCenterTriggers.seg[j] = read_short( LoadFile );
                for (j=0; j<MAX_CONTROLCEN_LINKS; j++ )
                    ControlCenterTriggers.side[j] = read_short( LoadFile );
#endif
        }
    }

    //================ READ MATERIALOGRIFIZATIONATORS INFO ===============

The next step was to trace the ASM code within DS2x86 until I found something that allowed me to see what C routine was executing. This was the part that took most of my time. I had to break into the debugger at various points, and then trace upwards along the call stack until I found the routine that failed to return but instead exited back to DOS. When tracing that routine, I finally found a PUSH opcode that pushed a value corresponding to an address of an error string similar to the error message at the beginning of the gamesave.c routine. At that point I knew I had found the gamesave.c routine, and was able to start comparing the ASM code to the C source code and follow the progress until I got to the above code snippet. Here below is the ASM code corresponding to the original C source code:

    //================ READ CONTROL CENTER TRIGGER INFO ===============


    if (game_fileinfo.control_offset > -1)
    {

10088FA8    mov     esi,[101E7CF3]          ; esi = game_fileinfo.control_offset
            cmp     esi,-1
            jle     1008900A

        if (!cfseek( LoadFile, game_fileinfo.control_offset,SEEK_SET ))
        {

10088FB3    mov     eax,[esp+000000C0]      ; eax = LoadFile
            mov     edx,esi                 ; edx = game_fileinfo.control_offset
            xor     ebx,ebx                 ; ebx = SEEK_SET = 0
            call    100EF804                ; cfseek()
            test    eax,eax
            jne     1008900A

            for (i=0;i<game_fileinfo.control_howmany;i++)

10088FC7    mov     edi,[101E7CF7]          ; edi = game_fileinfo.control_howmany
            xor     esi,esi                 ; esi = i = 0;
            test    edi,edi
            jle     1008900A

                if (cfread(&ControlCenterTriggers, game_fileinfo.control_sizeof,1,LoadFile)!=1)

10088FD3    mov     eax,102B355C            ; eax = &ControlCenterTriggers
10088FD8    mov     ebx,00000001            ; ebx = 1
            mov     ecx,[esp+000000C0]      ; ecx = LoadFile
            mov     edx,[101E7CFB]          ; edx = game_fileinfo.control_sizeof
            call    100EF7CC                ; cfread()
10088FEF    cmp     eax,01
10088FF3    je      10088FFF

                    Error( "Error reading ControlCenterTriggers in gamesave.c", i);

10088FF5    push    esi                     ; Push esi = i
            push    10126DF4                ; Push "Error reading ControlCenterTriggers" address
            jmp     100E26F4                ; Error(), routine never returns

10088FFF    mov     ebp,[101E7CF7]          ; ebp = game_fileinfo.control_howmany
10089005    inc     esi                     ; i++;
10089006    cmp     esi,ebp                 ; Test for i < game_fileinfo.control_howmany
            jl      10088FD3

        }
    }

    //================ READ MATERIALOGRIFIZATIONATORS INFO ===============

    if (game_fileinfo.matcen_offset > -1)

1008900A    mov     eax,[101E7CFF]          ; eax = game_fileinfo.matcen_offset
            ...

After some tracing and studying of the code I finally figured out what went wrong. After executing the CMP opcode at address 10088FEF the cseip has the value 10088FF3. As this is very near the end of the physical page, it causes the code between 10088FF0..10089010 to get copied to the temporary area at F000:3FF0 as I described in my previous blog post. The code continues running the conditional jump je from this copied location. However, all conditional jump opcodes re-calculate the target physical address (since they can freely jump between pages). So, after executing the je opcode and jumping to 10088FFF the code is again running from the original location, and loads the last byte of this page, which is the first byte of the MOV opcode. At this point my special paging-enabled main opcode loop should have again copied the area around this page break to the temporary location; however, there was a bug in the code. After loading the opcode byte from 10088FFF the cseip variable was immediately incremented, and only after that did I test whether the current location was near the page end. Since at that point the cseip value was already 10089000, the code determined that it was not near the end and continued running from an invalid physical address, loading invalid opcode bytes. By chance the invalid code did not crash but instead jumped backwards so that it eventually reached the address 10088FF5 and then exited with the error message.
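The ordering problem can be reproduced in a few lines of C. This is a simplified model, not the actual (self-modifying assembly) main loop: the function names are mine, and the 0xFF0 threshold is the "near the end of the 4K page" test I use in the paging code. Only the low 12 bits of the address matter here, and paging does not change those.

```c
#include <stdint.h>

#define PAGE_OFFSET_MASK  0xFFFu
#define NEAR_END_START    0xFF0u   /* "near the page end" threshold */

/* True if the address is within the last 16 bytes of its 4K page. */
static int near_page_end(uint32_t cseip)
{
    return (cseip & PAGE_OFFSET_MASK) >= NEAR_END_START;
}

/* Buggy order: cseip is incremented past the opcode byte first,
   and only then tested. */
static int buggy_needs_copy(uint32_t cseip)
{
    cseip++;                        /* opcode byte already consumed */
    return near_page_end(cseip);
}

/* Fixed order: test the address the opcode byte is fetched from,
   then increment. */
static int fixed_needs_copy(uint32_t cseip)
{
    int near_end = near_page_end(cseip);
    cseip++;
    (void)cseip;
    return near_end;
}
```

For the jump target 10088FFF the buggy order tests 10089000, whose page offset is 0, so the copy to the temporary area is skipped. The fixed order tests 10088FFF itself and triggers the copy as intended.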

The fix was not especially difficult but meant that I had to reorder some code around in the main loop. This is the most performance-critical part of my emulator, so I had to be careful not to cause additional delays to the code. I believe the new code is at least as fast, and it even has the advantage that I now only need to change one opcode for the self-modified stuff, when earlier I had to change two opcodes in the loop to make this work. So, in the end fixing this bug brought some additional advantages.

Currently I am attempting to trace and fix an annoying occasional problem when returning from a hardware IRQ. Occasionally the stack does not contain proper values when the code returns from an interrupt and returns from the routine that was interrupted. This also seems to have something to do with the page discontinuity within the stack area, but I have not been able to determine the exact cause yet. So, no idea yet when I would be able to release a version of DS2x86 that supports paging properly.

Sep 25th, 2011 - DS2x86 progress

For the last week I have continued working on the paging features for DS2x86. This time I have managed to make some visible progress, as Descent 2 Demo goes to graphics mode properly, and at times I can even reach up to the main menu! However, this happens only about once every six attempts or so; in other cases it crashes with a failure in some internal sanity checks I have in my protected mode opcodes. I believe these crashes are mostly due to the (silly) game not keeping the stack pointer doubleword-aligned, which means that a stack access occasionally needs to read a doubleword from two separate 4K pages, which is not yet supported. The frequency of this happening depends on the timing of the hardware timer interrupts etc, so it rarely happens in the same place.
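Whether a given access straddles two 4K pages is a simple calculation on the low 12 bits of the address. The helper below is just an illustrative sketch with a name of my own choosing, not the emulator code.

```c
#include <stdint.h>

#define PAGE_SIZE 0x1000u

/* True if an access of `size` bytes starting at `addr` touches two
   different 4K pages. */
static int access_crosses_page(uint32_t addr, uint32_t size)
{
    return ((addr & (PAGE_SIZE - 1)) + size) > PAGE_SIZE;
}
```

A doubleword-aligned stack pointer can never produce a split doubleword access, since the offset within the page is then at most 0xFFC. With a misaligned stack pointer, offsets 0xFFD..0xFFF all split the doubleword across the page boundary.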

So, the next step is to improve the data accesses, first for the stack opcodes and then for all the other opcodes, to also handle the possible page discontinuity properly. I have already handled this discontinuity in some opcodes, like the REP MOVSD string handler, but it is still missing from a lot of other opcodes. The CS:EIP handling code that I described in the previous blog post seems to work properly at the moment, so I can focus on the data access from now on.

Since checking for page discontinuity is a rather slow operation, I have been toying with the idea of attempting to use the hardware (the MIPS processor memory management unit) in some way to help with the virtual memory implementation. I only thought about this yesterday, and tried to test what the current hardware TLB contains. It seems to not contain anything sensible, which probably means that the TLB feature of the hardware is not actually used by the DSTwo SDK. It seems that all the memory accesses are to the 0x80000000 memory area, which the MIPS32 documentation describes as the "Kernel Unmapped" area. Unmapped here means that the TLB hardware is not needed, the memory is mapped directly to the beginning of the physical memory. The virtual memory area between 0x00000000 and 0x7FFFFFFF is called "User Mapped", so that would use the TLB hardware to map virtual addresses to physical RAM (and accessing it currently always generates a "TLB miss" exception since the TLB table values are all invalid). It would be neat if I could use the TLB to map the first 16 megabytes (virtual addresses 0x00000000 - 0x00FFFFFF) directly to my emulated DOS RAM area, as in that case the hardware would do the memory translation and I could get rid of my EMSPages mapping table!
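For reference, the MIPS32 address space split can be written down as a small classification function. The "User Mapped" and "Kernel Unmapped" ranges are the ones discussed above; the uncached and kernel mapped ranges at 0xA0000000 and 0xC0000000 come from the general MIPS32 memory map rather than from anything I have tested on the DSTwo. The enum and function names are made up for this sketch.

```c
#include <stdint.h>

typedef enum {
    SEG_USER_MAPPED,       /* 0x00000000 - 0x7FFFFFFF, translated by the TLB */
    SEG_KERNEL_UNMAPPED,   /* 0x80000000 - 0x9FFFFFFF, cached, no TLB */
    SEG_KERNEL_UNCACHED,   /* 0xA0000000 - 0xBFFFFFFF, uncached, no TLB */
    SEG_KERNEL_MAPPED      /* 0xC0000000 - 0xFFFFFFFF, translated by the TLB */
} mips_segment;

/* Decide which segment a virtual address belongs to. */
static mips_segment classify_address(uint32_t va)
{
    if (va < 0x80000000u) return SEG_USER_MAPPED;
    if (va < 0xA0000000u) return SEG_KERNEL_UNMAPPED;
    if (va < 0xC0000000u) return SEG_KERNEL_UNCACHED;
    return SEG_KERNEL_MAPPED;
}

/* In the two unmapped segments the physical address is simply the
   low 29 bits of the virtual address. */
static uint32_t unmapped_to_physical(uint32_t va)
{
    return va & 0x1FFFFFFFu;
}
```

So an address like 0x80001000 goes straight to physical address 0x1000 without involving the TLB, while anything below 0x80000000 needs a valid TLB entry.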

This direct mapping would be quite simple if the game does not use virtual memory (or even EMS memory) and uses MCGA graphics, as then I could set up a single TLB entry mapping all 16MB of RAM at virtual addresses 0x00000000 - 0x00FFFFFF to my emulated RAM area. However, as soon as EMS memory or EGA or Mode-X graphics are needed, things get more complicated. In this situation I could divide the area into 16KB pages, but since the TLB table in MIPS32 only has 64 entries, that could map only the first 1MB of RAM. Nice for real mode programs, but for protected mode I would need to handle the hardware TLB misses with a fast exception handler (the documentation tells that the TLB miss exception uses a different vector to the common exception handling vector, and I have not been able to determine from the DSTwo SDK where this special exception handler should be). Even more difficult will be the situation with virtual memory, as only the addresses below 0x80000000 can be virtualized, the addresses above that would simply fail horribly when the game accesses my emulator code instead of the emulated data or code. So, I am not yet sure whether it is worth it to look into this possibility further, but it is an intriguing idea.
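The numbers above are easy to double-check. The sketch below uses the same simplification as the text, that is, 64 TLB entries each mapping one 16KB page; the macro and function names are only for illustration.

```c
#include <stdint.h>

#define TLB_ENTRIES   64u                      /* figure used in the text */
#define PAGE_16K      (16u * 1024u)
#define DOS_RAM_SIZE  (16u * 1024u * 1024u)    /* emulated 16MB of RAM */

/* How many bytes a fully populated TLB can map at once. */
static uint32_t tlb_coverage_bytes(uint32_t entries, uint32_t page_size)
{
    return entries * page_size;
}

/* How many pages are needed to cover the whole RAM area. */
static uint32_t pages_needed(uint32_t ram_size, uint32_t page_size)
{
    return (ram_size + page_size - 1) / page_size;
}
```

64 entries of 16KB cover exactly 1MB, while the full 16MB would need 1024 such pages, so most accesses above the first megabyte would have to go through the TLB miss handler.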

Sep 18th, 2011 - DSx86 version 0.39 released!

DSx86 release notes

This version has only one small fix: the smooth scaling changes I made in the previous version introduced a screen flickering problem when attempting to scroll the smooth-scaled screen. This also happened when the Touch Pad Mouse cursor got near the screen border in modes that do not need scrolling, like in 640x480 mode. This problem was caused by my changing the REG_BG3Y hardware scaling register after the screen blitting function, and as the screen blitting took longer than one VBlank period, this change occurred at a time when the screen was already active. This register should not be changed during active screen scanning time, or flickering will occur. In this version I moved the setting of this register to before the screen blitting code, so there should not be any flickering any more.

DS2x86 progress

During the last week I have continued working on the paging features. I finally figured out a way to handle the most serious problem I have been having with the paging system, the possible page discontinuity in the code segment that I described in the September 4th blog entry below. The proper method to handle this would be to look up every single byte from the EMSPages[] LUT, but that would really kill the performance (not to mention be a really big task to implement), so I did not want to do that. Instead, I figured out that since I already need to keep the physical and logical memory addresses separate, it would not be a big problem to actually move the problematic code into some completely separate memory area, and handle the discontinuity (that is, generate a continuous code snippet) when copying the code from two separate physical 4K pages!

So I added some (self-modifying) code to my main opcode loop to perform these operations:

If paging is on and the current physical CS:EIP address is near the end of this 4K page ((physical address AND 0xFFF) >= 0xFF0), then jump to a special code that does the following:
1. If we have already relocated the 32 bytes, jump back to handling the opcode normally.
2. Check if the next logical 4K page exists in physical memory, and cause a Page Fault if it does not (handling a Page Fault will eventually restart the opcode).
3. Relocate the 32 bytes around the page border (that is, 16 bytes from the end of the previous 4K page and 16 bytes from the beginning of the next 4K page) into emulated BIOS area at F000:3FF0 - F000:400F (actually it could be anywhere, but this was a nice free already existing space).
4. Adjust the physical CS:EIP pointer and the physical-to-logical adjustment value so that the code continues executing within this new relocated area.
5. Self-modify the opcode loop to first call a special code described below.
6. Continue with the normal handling code.
The opcode loop may be self-modified to jump to a special code which does the following:
1. Test if we are still at the end of the 4K page ((physical address AND 0xFFF) >= 0xFF0). If we are, jump to continue the opcode handling normally.
2. Else, re-adjust the physical CS:EIP and the physical-to-logical adjustment value to point to the actual physical RAM address of the 4K page instead of the relocated area.
3. Self-modify the opcode loop back to using the normal handling code.
4. Continue with the normal handling code.
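The relocation step (step 3 of the first list) can be sketched in C roughly as follows. This is only an illustration of the idea, not the actual self-modifying MIPS code: the function names, the `ram` pointer and the `next_page_phys` parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   0x1000u
#define NEAR_END    0xFF0u   /* threshold from the text: (phys AND 0xFFF) >= 0xFF0 */
#define HALF_WINDOW 16u

/* Nonzero when the physical CS:EIP is within the last 16 bytes of its 4K page. */
int near_page_end(uint32_t phys)
{
    return (phys & (PAGE_SIZE - 1)) >= NEAR_END;
}

/*
 * Build a contiguous 32-byte code snippet from the last 16 bytes of the
 * current physical page and the first 16 bytes of the physical page that
 * backs the next logical page. Returns the pointer inside `window` that
 * corresponds to `phys`, i.e. where execution would continue.
 */
uint8_t *relocate_window(uint8_t window[32], const uint8_t *ram,
                         uint32_t phys, uint32_t next_page_phys)
{
    assert(near_page_end(phys));
    uint32_t tail = (phys & ~(PAGE_SIZE - 1)) + NEAR_END;
    memcpy(window, ram + tail, HALF_WINDOW);
    memcpy(window + HALF_WINDOW, ram + next_page_phys, HALF_WINDOW);
    return window + (phys - tail);
}
```

Once the code pointer has moved past the first 16 bytes of the window and then leaves the "near the end" range, the second routine would switch back to the real physical page, which is not shown here.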

With these changes I can still use the quick linear memory access when loading the opcodes and their immediate bytes, and this special code (which is rather slow) is only executed when we are about to cross a page boundary. Here is an example of one such occurrence. This actually causes a Page Fault, as the page starting at logical address 0x100EA000 is not loaded into physical RAM when this code gets executed:

...
0268:100E9FF3	E8E39F0000		call 100EDFDB ($+9fe3)
0268:100E9FF8	C705713D131001000000	mov  dword [10133D71],00000001
0268:100EA002	B825A00E10		mov  eax,100EA025
0268:100EA007	E8358CFFFF		call 100E2C41 ($-73cb)
...

As you can see the opcode at address ...FF8 consists of 10 bytes, and it is actually only the last two bytes that are in the next page. In DOSBox the Page Fault happens when it loads the second-but-last byte of this opcode, in DS2x86 the page fault happens already when the code reaches the ...FF3 address (because at that point it needs to relocate the 32 bytes). When this code gets relocated into the F000:3FF0 area, it looks like the following (in the DS2x86 inbuilt debugger):

The call target addresses look different, as they are relative to the current logical IP. The handlers for all the opcodes using relative jumps will need to be aware of the possible relocation, but luckily this needs to be the case even without this relocation trick I use, since the paging itself can relocate the logical addresses within the physical RAM. Also note that only the opcodes at addresses ...FF3 and ...FF8 are actually executed from within this relocated area. When the emulator reaches the opcode at ...002, it uses the second part of my special handling code which goes back to executing the opcodes from their proper physical (and logical) addresses.
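The reason the targets depend on the logical IP is visible in how a relative call is decoded. A minimal sketch, using the second call from the listing above as input (the function name is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Target of a 32-bit near CALL (opcode E8 followed by a little-endian
 * rel32). The displacement is relative to the address of the next
 * instruction, so the caller must pass the LOGICAL address of the opcode,
 * not the address of wherever the bytes happen to be stored.
 */
uint32_t call_rel32_target(uint32_t logical_ip, const uint8_t *opcode)
{
    assert(opcode[0] == 0xE8);
    uint32_t rel = (uint32_t)opcode[1]
                 | (uint32_t)opcode[2] << 8
                 | (uint32_t)opcode[3] << 16
                 | (uint32_t)opcode[4] << 24;
    /* Unsigned wraparound takes care of negative displacements. */
    return logical_ip + 5u + rel;
}
```

For the bytes E8 35 8C FF FF at logical address 0x100EA007 this gives 0x100E2C41, matching the listing. Feeding in the address of a relocated copy instead would produce a different (wrong) target, which is why a disassembly of the relocated bytes shows different call targets.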

The current status with the paging features is that Descent 2 Demo begins to misbehave after Page Fault number 102, and Warcraft 2 runs up to Page Fault number 192. After that they both crash with an invalid Task Switch IRET. My current theory is that the timer interrupt that should play AdLib music interferes with the Page Fault task switches, and that causes the problem. I need to continue debugging this, but I am pretty happy with the progress I have finally been able to achieve with the paging system during the last week.

Sep 11th, 2011 - DSx86 version 0.38 released!

DSx86 release notes

Sverx has continued improving my Smooth Scaling algorithms. This version has faster smooth scaling in all the 640-pixel wide EGA and VGA 16-color modes (640x200, 640x350, 640x400 and 640x480). The 640x480 mode is now fast enough to be used also on DS Lite (it used to be a DSi-exclusive feature), though the screen refresh will be limited to 15fps. Big thanks to Sverx for his hard work on the smooth scaling algorithms!

Here is a table of the speedups he managed to achieve on each of the resolutions and on both DS Lite and DSi. The new code is more CPU-bound than my original code, which is why DSi shows somewhat bigger improvements in the speed.

Speedup	DS mode	DSi mode
640x200	1.15x - 1.34x	1.17x - 1.45x
640x400	1.00x - 1.58x	1.16x - 2.17x
640x480	1.33x - 1.42x	1.76x - 1.92x

The reason the speed improvement is not constant is because the complexity of the image to show had a big impact on the speed of the original code, due to cache misses when looking up the palette values. The new code has the palette values mostly in DTCM, so it runs (roughly) at a constant speed for all types of images.

DS2x86 progress

I am still working on the paging features for DS2x86. This seems to be somewhat more difficult than I had anticipated, I have run into various problems that have taken a long time to debug, and I am still debugging a couple of weird crashes. The code seems to run 35 Page Faults properly, but after (or inside) Page Fault number 36 things go wrong and DS2x86 hangs completely. Annoyingly, when I add some more thorough debug code into DS2x86, it runs further and only begins to misbehave after Page Fault number 47. At that point my EMSPages look-up table has a wrong value for page 0x102DD000, which contains the stack. It points to physical address 0x001D5000 (which is already in use by address 0x100F7000 containing executable code), so the game crashes when it loads invalid values from stack.

So, a lot of bug fixing and debugging still ahead before the paging system will work. I am not even sure yet if I can make my physical CS:IP architecture support paging code without a major rewrite, but I'll first need to figure out the cause of the current problems to continue testing that.

Sep 4th, 2011 - DS2x86 progress

For the past week I have been continuing my work on the paging support. This work has progressed reasonably well, but I have recently run into some rather difficult issues where I have had to stop and think for a while. Most of these issues have been performance-related, if I didn't need to worry about the performance when adding code to handle the new paging support, these issues would not have been nearly as difficult to solve. Currently it seems that it is not possible to have the paging-enabled code run as fast as the non-paging code, but I am trying to use self-modifying code and other tricks to make these slowdowns only affect the situation when paging is active.

At the beginning of my development of DSx86 I made an architectural decision to keep the current physical CS:IP pointer in one register, so that I can quickly load the next opcode byte (the most common operation in an emulator) simply by incrementing this pointer and loading the byte from the memory at this new location. At some point I realized that in theory this will fail in a situation where the real-mode IP register is about to wrap around to the start of the code segment. This should not happen in practice, though, as no DOS programmer would code something like that on purpose. This same system was carried over to DS2x86.

However, when paging is on, the next logical CS:EIP byte might be physically in a completely different part of the memory (or even still on disk in the swap file!). So, to make this work reliably I should use the paging tables (via my EMSPages[] direct look-up-table) to load every opcode byte. Since some opcodes may contain up to a dozen bytes that need to be loaded before the opcode can be executed, this would mean up to a dozen extra table look-ups for each opcode! This could drop the emulation speed far below half of the current speed. Such a slowdown would make DS2x86 pretty useless, so I have needed to figure out some alternative ways to support paging of the code segment.

Paging of data and stack is not that big of a problem, as there I need to use the EMSPages[] table look-up anyway. There is a potential problem, as I currently only check the start address from the EMSPages[] table, so loading a double word across a page boundary might also give a bad result. This should be a rather rare occurrence, but to make the code reliable I probably need to figure out a way to handle this situation as well.
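Detecting the problematic case itself is cheap. A small sketch of such a check (the name `crosses_page` is hypothetical; it is not a routine from DS2x86):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK 0xFFFu

/* Nonzero when an access of `size` bytes starting at `addr` spills into
 * the next 4K page, so that a single table look-up is not enough. */
int crosses_page(uint32_t addr, uint32_t size)
{
    assert(size >= 1 && size <= 0x1000u);
    return (addr & PAGE_MASK) + size > 0x1000u;
}
```

A double word at offset 0xFFC of a page still fits, while one at 0xFFD does not. The expensive part is what to do afterwards: the access would have to be split into two parts, each with its own page look-up.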

Here is an example of the code paging situation that happens in Descent 2 Demo. On the left is a part of the disassembly, and on the right is a dump of the current EMSPages[] table contents. In the EMSPages dump each row shows on the left the logical address that the game uses, and after the arrow the offset into the 16MB memory block that I use to emulate the RAM of the 386 machine (the occasional 0x0C in the lowest byte is just a flag and not part of the address). As you can see, the game runs the code in a 4K page beginning at logical address 0x100F6000, which actually is mapped to the 16MB RAM at offset 0x001D4000. The next page at 0x100F7000 has not been mapped at all yet (so it shows just zeros in the disassembly). When the CS:EIP pointer reaches that address, it should cause a Page Fault, which makes the page fault handler load that block of code from the swap file. However, there is also a jump at address 0x100F6FF2 which jumps directly inside the uninitialized page to address 0x100F7017! Both of these situations are quite difficult to handle with my current physical CS:EIP pointer (which points to somewhere between 0x001D4000 and 0x001D5000 when running code on that page). The physical CS:EIP pointer would simply roll over to the memory area at 0x001D5000 (which is actually at a logical address 0x102DD000, and actually contains stack segment instead of code)!
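The mismatch between "next logical byte" and "next physical byte" can be reproduced with a toy version of the mapping. The sketch below uses only the two mappings quoted above and a linear search for brevity; the real EMSPages[] is a direct look-up table, and the `UNMAPPED` sentinel and all names here are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNMAPPED 0xFFFFFFFFu   /* sentinel used by this sketch only */

typedef struct { uint32_t logical_page; uint32_t ram_offset; } page_map_entry;

/* The two mappings mentioned in the text. */
static const page_map_entry demo_map[] = {
    { 0x100F6000u, 0x001D4000u },   /* code page  */
    { 0x102DD000u, 0x001D5000u },   /* stack page */
};

/* Translate a logical address to an offset into the 16MB emulated RAM. */
uint32_t logical_to_offset(uint32_t logical)
{
    assert(logical != UNMAPPED);
    for (size_t i = 0; i < sizeof demo_map / sizeof demo_map[0]; i++)
        if (demo_map[i].logical_page == (logical & ~0xFFFu))
            return demo_map[i].ram_offset | (logical & 0xFFFu);
    return UNMAPPED;
}
```

Here the jump instruction at 0x100F6FF2 lives at offset 0x001D4FF2, its target 0x100F7017 is not mapped at all, and the byte physically following the code page (offset 0x001D5000) belongs to logical address 0x102DD000.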

So, what I think I need to do, is to handle potential jumps to uninitialized pages in all the opcodes that can cause a jump (like all the conditional jumps, loop opcodes, etc), and also specifically check for possible approaching page fault when the current CS:EIP is near the end of the current 4K page. It is possible that even with these changes I can not handle all the page faults properly, but I can not know that for sure until I have coded all the needed changes and tested that. So, I have no idea yet how much work is still ahead before DS2x86 supports paging properly. I don't yet have the proper checks for jumps into uninitialized page in the code, and this is probably the reason why Descent 2 Demo currently hangs DS2x86 completely. I'll continue working on this, but it currently looks like the next version will probably not yet run games that use paging.

Aug 28th, 2011 - DS2x86 version 0.23 released!

This version has the following fixes and changes, more info about some of these can be found in my previous blog post:

The DOS file rename functionality was fixed. This will help with Albion save game handling, and very likely with other games and software as well.
Fixed a bug in the 32-bit ADC opcode (add with carry) Carry flag handling. This caused the floor and ceiling graphics corruption in Albion. I found the cause for this problem by debugging the ceiling drawing (opcode by opcode) in both DS2x86 and DOSBox, and noticed a difference after executing this opcode. After I found the misbehaving opcode, I searched all my DS2x86 source code for "adc carry" to find other similar errors, and the search resulted in the following:
```
Find all "adc carry", Subfolders, Find Results 1, "Entire Solution"
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(396):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(464):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(489):	// adc Carry: ((unsigned)lf_val1 < (unsigned)lf_res) || ((flags&1) && (lf_val1 == lf_res));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(510):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(1808):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(1972):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(2339):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  C:\Projects\ds2sdk_r\DS2x86\src\cpu_386.S(2364):	// ADC Carry: ((unsigned)lf_res < (unsigned)lf_val1) || ((flags&1) && (lf_res == lf_val1));
  Matching lines: 8    Matching files: 1    Total files searched: 99
```
There are actually 8 different code branches in DS2x86 where the 32-bit ADC opcode is handled, depending on the memory/register variations. Curiously, the third match (which is for the version that adds two 32-bit registers) had a wrong comparison order even in the comment! It looks like I have copy-pasted that version from the corresponding subtract operation SBB, and had just forgotten the fact that the carry flag calculation should be reversed! If I had had proper unit tests for all the 32-bit core opcodes (as I have for the 16-bit opcodes) I would have found this problem a long time ago. However, I believe the core opcodes currently have so few bugs that spending weeks in creating full unit tests for all of them is not worth it any more. Anyways, after this fix the 3D areas in Albion began to look correct:
The screens are swapped back to the normal order when the Touch Pad Mouse gets turned off, for example when the executable program changes. Previously the screens might have stayed swapped, with no way to swap them back as the code thought the screens were already in normal order!
Major internal rewrite to enable virtual memory (paging) support. This mostly affects the graphics opcodes (EGA and Mode-X) for now, as I had to change the method the memory address table values are flagged to make a distinction between normal RAM and graphics VRAM emulation. This may cause some graphics errors if I forgot to change some of the graphics opcodes. Let me know if you notice graphics problems that did not happen in the previous versions.
I have also done a lot of work for virtual memory support itself, but this work is not yet finished and thus games that need virtual memory will not run yet. However, in this version I have not disabled the virtual memory completely, instead, the games will attempt to progress further. Since the code is not fully done yet, they will crash or drop into debugger, most likely pretty soon after they turn on paging. You can determine that paging was the cause for the drop into debugger if you see the "Paging=1" text in the debug report, like this:

The crash logs will be very useful for me when I continue my work on the virtual memory, so please send the debug logs to me again! You can attempt to continue after the drop into debugger with the B button, though this will also most likely fail. I have been using Descent 2 demo in my virtual memory tests, and it currently gets a Page Fault within the Page Fault handler, which makes it quit back to DOS with an error report. This does not happen in DOSBox, so there is something wrong with my paging code. This is what I am currently attempting to fix.
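Returning to the ADC fix described above: the carry rule quoted in the source comments, and its mirrored SBB counterpart, can be written in plain C as follows. This is a sketch of the flag logic only (the real handlers are MIPS assembly), and the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit ADC: returns the result and stores the new carry (0 or 1). */
uint32_t adc32(uint32_t val1, uint32_t val2, unsigned carry_in, unsigned *carry_out)
{
    assert(carry_in <= 1);
    uint32_t res = val1 + val2 + carry_in;
    /* Carry: the sum wrapped around, or came back to exactly val1
     * (which with carry-in set only happens when val2 is 0xFFFFFFFF). */
    *carry_out = (res < val1) || (carry_in && res == val1);
    return res;
}

/* 32-bit SBB: same idea with the comparison order reversed. */
uint32_t sbb32(uint32_t val1, uint32_t val2, unsigned carry_in, unsigned *carry_out)
{
    assert(carry_in <= 1);
    uint32_t res = val1 - val2 - carry_in;
    *carry_out = (val1 < res) || (carry_in && val1 == res);
    return res;
}
```

Using the SBB comparison order inside ADC, as in the third search hit, reports a carry for almost every addition that does not overflow, which fits the kind of corrupted fixed-point stepping seen in the Albion floor and ceiling drawing.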

Thanks again for your continued interest in DSx86 and DS2x86, I hope this version again fixes a few annoying problems in DS2x86! Sorry I did not have time to make more fixes, but I am still focusing on the virtual memory and paging support. That will still take most of my time, but I will try to add some other fixes as well into the next version.

Aug 21st, 2011 - DS2x86 progress

For the past week I have mostly been working on the virtual memory support in DS2x86. The biggest part of this work has been to convert all the memory access routines (in each of the opcodes) to work with the new 4GB addressable memory. This work has progressed well, and practically all the opcodes have now been converted. The next step was the new InitPage routine which will copy the memory address from the paging-related two-level tree to my linear memory address table. This has also been done, and I have even managed to reach up to the first actual Page Fault exception in the Descent 2 Demo (which I use for my virtual memory tests). I have tested the same game in my debug-version of DOSBox, and luckily DOSBox gets the same Page Fault at the exact same address, so it looks like my code is correct at least up to that point.

When handling the Page Fault, the Descent 2 Demo performs a Task Switch. I had not yet implemented any task switch functionality into DS2x86, so this is what I am currently working on. The task switch routine is somewhat complex, and I have been having some trouble getting it to run properly. I am currently debugging it, so hopefully it will start to behave properly soon.

I also realized that the method I have been using for mapping certain memory locations to EGA or Mode-X RAM is not quite sufficient for virtual memory. When the whole 4GB of RAM is mappable, I can not simply use the sign bit of the 32-bit memory address to tell whether the address is in RAM or in graphics memory. I have an idea for a new system that should be just as fast, but have not had time to implement it yet.

I have also made a couple of small fixes to DS2x86, which will be included in the next release:

The swapped screens are reset when a program changes (or you switch to a new configuration which does not have TouchPadMouse active). This feature was missing from DS2x86, as the system I had used in DSx86 for this was not compatible with the DSTwo SDK. I had just commented it out when I originally ported the code from DSx86, and had forgotten about it since.
The file rename functionality did not work. This was caused by one of those annoying duplicate symbols in the DSTwo SDK. The standard C library has a file rename function int rename (const char *from, const char *to) which I tried to call from my C routine. However, looking at the dump file, I noticed that it actually called a very different "rename" routine, which looked like this:
```
80201e88 <rename>:
80201e88:	24020fc6 	li	v0,4038
80201e8c:	0000000c 	syscall
80201e90:	14e0fffb 	bnez	a3,80201e80 <_IO_sscanf+0x30>
80201e94:	00000000 	nop
80201e98:	03e00008 	jr	ra
80201e9c:	00000000 	nop
```
That is, it first performed some system call with value 4038 (I have no idea what that does), and then jumped to some generic _IO_sscanf() error handling routine if the syscall failed. I went through some of the include headers, and found out that the correct file rename function to call is actually fat_rename, and after I changed my C code to call that the file rename functionality began to work.
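The first word of that dump can be decoded by hand to see where the 4038 comes from. The sketch below splits a MIPS I-type instruction word into its fields; the struct and function names are invented for this example. As a side note, the value looks like the Linux MIPS o32 syscall numbering, where rename would be 4000 + 38, but that is an assumption and not something confirmed by the SDK sources.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned opcode, rs, rt; uint16_t imm; } mips_itype;

/* Split a 32-bit MIPS I-type instruction word into its fields. */
mips_itype mips_decode_itype(uint32_t word)
{
    mips_itype f;
    f.opcode = word >> 26;
    f.rs     = (word >> 21) & 0x1Fu;
    f.rt     = (word >> 16) & 0x1Fu;
    f.imm    = (uint16_t)(word & 0xFFFFu);
    return f;
}

/* Value loaded by "li rt, imm" when it is encoded as addiu rt, $zero, imm. */
int32_t mips_li_value(uint32_t word)
{
    mips_itype f = mips_decode_itype(word);
    assert(f.opcode == 9 && f.rs == 0);   /* addiu with $zero as source */
    return (int16_t)f.imm;                /* addiu sign-extends the immediate */
}
```

For 0x24020fc6 this yields opcode 9 (addiu), target register 2 (v0) and the immediate 0x0fc6, which is 4038 in decimal.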

The future plans include working on the virtual memory, and the task switch routines. I also hope to make a few more small fixes and improvements to DS2x86 before releasing the next version.

Aug 14th, 2011 - DSx86 version 0.37 and DS2x86 version 0.22 released!

Changes in both DSx86 and DS2x86

I implemented a couple of fixes that affect both DSx86 and DS2x86, as the fixes were implemented into the C language code that is very similar in both. I will perhaps at some point attempt to combine the C codes for real (using ifdefs), but for now they are separate files with mostly similar content. Anyways, the changes that affect both programs are:

New TPMXScale and TPMYScale ini file parameters. These can be used to adjust the TouchPad Mouse scaling for each game. The default values (when the parameters have not been given in the INI file) are 1.0 for both. You need to experiment with different values to find the best scaling factor for each game. Note that changing these might still not make the TPM work in a certain game, as the game might not use the mouse in a way that is compatible with the touchpad mouse emulation. These new parameters should help in some games, though.
The key repeat function in the keyboard emulation was fixed so that it repeats the actual key that was pressed, not just the non-enhanced version of the key. This will help with the stuck cursor key problem in Frontier.
The graphics mode detection code has been enhanced, so that both methods of entering 240-row Mode-X graphics modes (either 320x240 or 360x240 as in Albion) will be detected, and the correct graphics mode initiated.

Changes specific to DSx86

DSx86 has been built with libNDS 1.5.3, which should allow write access to the SD card when using Sudokuhax. I have no way of testing this feature, so it might work, not work, or corrupt your SD card completely! Please use caution and back up your SD card before using this version with Sudokuhax!

This new libNDS version 1.5.3 had the same problem as the 1.5.0 version I had been using, where ARM7 (or at least the AdLib audio emulation) froze after a few seconds of playing AdLib audio. In 1.5.0 I got everything to work when I disabled all references to the i2C code, commented out the secondary ARM7 IRQ table handling, and removed the whole i2c.c source module from the libnds7.a sources. I did the same thing with the 1.5.3 version sources and built the libraries again, and that seemed to help again. I don't know why the i2C code is incompatible with my AdLib emulation, but as long as this hack works it is not a big problem for me.

DSx86 also has much faster Smooth scaling routines in the 320x200 256-color modes (MCGA and Mode-X). The new code is courtesy of "sverx", who kindly spent some time looking at my scaling code and inventing various speed tricks that hadn't occurred to me. For example, the new code uses DTCM for the palette lookup table, instead of the actual (and slow) BG_PALETTE VRAM memory I had been using. The 75%/25% weighted average calculation is also much faster. The new smooth scaling code is still a lot slower than Zoom, Scale or Jitter modes, but it is noticeably faster than before. Thanks again to "sverx"!

Changes specific to DS2x86

There should be no other noticeable changes (besides the common enhancements mentioned above) in DS2x86, even though the program size has increased somewhat. The size increase is due to the new paging / virtual memory -enhanced opcode handlers. This work is still in progress and thus games that use virtual memory will still not work. The updated opcodes should work like before, and the slowdown caused by the new code should be very marginal, but let me know if you find something that worked before but does not work any more. There is a chance that I have broken something when doing the virtual memory support changes.

Future work

I plan to continue working on the virtual memory support for DS2x86. This will take a while yet, so I might not be able to make any other major enhancements to the next version. I don't know yet how long it will take to get the virtual memory working, as I might still run into some major obstacles. Until now it has progressed reasonably well, and many opcodes already support virtual memory.

Aug 7th, 2011 - DS2x86 progress

My summer vacation ended by the previous weekend, so this week I have not had all that much time to work on DS2x86. I spent an hour or so every weekday debugging some problem games, though. I first decided to finally try and find the keyboard problem in Frontier, where the keys seem to get stuck when in the star map, so that you can't properly move around in the map. I debugged the keyboard interrupt routine that the game uses, and found out that it uses separate maps for each normal and each extended key that is either down or up. When the problem occurred, I noticed that the non-extended cursor key was down in the map, while the extended cursor key was up. After checking my keyboard emulation routines, I realized that my key repeat feature always repeats only the non-extended versions of the keys!

What this problem meant for Frontier was that my keyboard emulation routine first sent an enhanced cursor key down code, and if you kept the key pressed, it began sending the non-enhanced cursor key down codes. When you then lifted your finger from the key, my routine sent the enhanced cursor key up code, but never an up code for the non-enhanced key. Thus, the non-enhanced key stayed down, and Frontier did not actually make a difference between enhanced and non-enhanced cursor keys when looking for keys that were down. This was relatively simple to fix, by always repeating the same key that was pressed, be it enhanced or not. After this change the Frontier star map began to scroll properly when using the cursor keys, but there is still a problem when using the mouse. I still need to work on the mouse problem, and I'll see if I can fix that as well before releasing the next version.
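The stuck key can be reproduced with a tiny model of the two key maps. This is a simplified sketch for a single cursor key, not Frontier's or DS2x86's actual code; all type and function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Down/up state of one cursor key in the game's two separate maps. */
typedef struct { bool normal_down; bool extended_down; } key_state;

typedef struct { bool extended; bool release; } key_event;

/* What the game's keyboard interrupt handler does with each scan code. */
void apply_event(key_state *s, key_event e)
{
    assert(s);
    bool *slot = e.extended ? &s->extended_down : &s->normal_down;
    *slot = !e.release;
}

/* Press the extended cursor key, let it auto-repeat, then release it.
 * repeat_same_key == false models the old behaviour, where the repeats
 * were sent as the non-extended version of the key. */
key_state press_hold_release(unsigned repeats, bool repeat_same_key)
{
    key_state s = { false, false };
    apply_event(&s, (key_event){ true, false });             /* key down  */
    for (unsigned i = 0; i < repeats; i++)
        apply_event(&s, (key_event){ repeat_same_key, false });
    apply_event(&s, (key_event){ true, true });              /* key up    */
    return s;
}

bool any_down(key_state s) { return s.normal_down || s.extended_down; }
```

With the old repeat behaviour the key still reads as down after the release whenever at least one repeat was sent. With the fixed behaviour both maps end up cleared.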

Next I looked into the weird 360x480 graphics mode problem in Albion. This graphics mode turned out to actually be the 360x240 mode (same as in Settlers, for example). I found out that there are actually two ways to make the VGA card display 240 instead of 480 rows, and I had only checked one of those ways in DS2x86. So, I added a check for the other method as well, and Albion began to look correct.

I had heard reports that the Tap functionality in the TouchPad Mouse (TPM) emulation is broken in DS2x86. I used Albion to test this, but did not find anything wrong with the functionality. There are some difficulties in the key and touchpad reading in general in DS2x86, but that is mostly due to the DSTwo SDK. Besides that, the TPMTap=TRUE setting in the INI file seems to work fine.

I did however notice that the TPM handling did not work very well in Albion, it looked like the horizontal movement of the mouse pointer was about twice that of the distance I moved the stylus, and the vertical movement was about four times that. So, I decided to finally enhance the TPM functionality by adding X and Y scaling factors to the configuration. So, in the next version you can use TPMXScale and TPMYScale game-specific INI file keys to select the mouse movement scaling factor when using touchpad mouse. For example, in Albion you could use this:

[MAIN]		; For Albion!
TPMXScale=0.50
TPMYScale=0.25

The scaling factor should be a floating point number, where 1.00 is the normal default scaling. Note, though, that Albion is a bad example in the sense that it uses the mouse buttons to make the character walk, so using the D-Pad mouse might be easier.
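How such a scaling factor would be applied can be sketched in a few lines. The function name is hypothetical, and whether the real code truncates or rounds the result is not stated, so the truncation here is an assumption.

```c
#include <assert.h>

/* Scale a stylus movement (in touch screen pixels) into a mouse movement. */
int tpm_scale_delta(int stylus_delta, float scale)
{
    assert(scale > 0.0f);
    return (int)(stylus_delta * scale);   /* truncates toward zero */
}
```

With the Albion values above, a stylus movement of 8 pixels becomes 4 mouse units horizontally and 2 vertically, which cancels the roughly 2x and 4x exaggeration seen in the game.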

After those changes I started making some changes to the code that would allow me to support Paging (as in Virtual Memory) in the future. There are several games that want to turn paging on (including the Wing Commander Armada I tested last weekend), so I want to start working on this feature soon. The first change I made (which should not actually affect anything yet) was to change my memory mapping to use 4KB pages instead of 16KB pages. I had originally used 16KB pages, as that is the smallest granularity needed to support EMS memory. This was also enough to handle the special paging of the VGA memory (where the 64KB memory block at address 0xA0000 is actually accessing 256KB of VRAM). However, to support real paging each page should be 4KB in size. I had used a simple look-up table that had 1024 entries (to access all 16MB of emulated memory with 16KB pages), so I increased the table to 4096 entries so that each page is 4KB, and adjusted all the routines that access this table to handle the new page size correctly.

After I got the code running properly with the 4KB page size, I again spent some time thinking about how to add the actual paging functionality without sacrificing the performance of games that do not use paging. When paging is not in use, I can simply look up the physical memory address from my one look-up table (or Translation Lookaside Buffer, TLB), which is reasonably fast. However, when paging is enabled, the physical memory lookup is much more complex, as the following image (from Wikipedia) illustrates:

The linear address (which without paging I can simply shift right 12 bit positions to get the mapping table index, and then add the mapping table value to the remaining bits to get the correct physical RAM address) is split into three fields, where the top 10 bits index the Page Directory table, the next 10 bits the Page Table, and the remaining 12 bits the byte within the page. The CPU register CR3 gives the starting physical address of the Page Directory table, which in turn gives the starting physical address of the Page Table used for the middle 10 bits of the linear address. Each of the page entries contains the 20 highest bits of the physical address, while the low 12 bits contain all sorts of bookkeeping bits, telling whether the page has been accessed, whether it is present in memory, whether it is dirty (that is, it has been written into), etc. Quite a complex (read: slow) system.
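The two-level look-up described above can be sketched in C like this. It is a bare-bones illustration that only checks the Present bit and ignores the Accessed and Dirty bookkeeping; the `PAGE_FAULT` sentinel, the `ram` pointer and the function names are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x001u
#define PAGE_FAULT  0xFFFFFFFFu   /* sentinel used by this sketch only */

uint32_t dir_index(uint32_t linear)   { return linear >> 22; }
uint32_t table_index(uint32_t linear) { return (linear >> 12) & 0x3FFu; }
uint32_t page_offset(uint32_t linear) { return linear & 0xFFFu; }

/* Read a little-endian 32-bit value from emulated physical RAM. */
uint32_t read32(const uint8_t *ram, uint32_t phys)
{
    return (uint32_t)ram[phys]
         | (uint32_t)ram[phys + 1] << 8
         | (uint32_t)ram[phys + 2] << 16
         | (uint32_t)ram[phys + 3] << 24;
}

/* Walk Page Directory -> Page Table -> page, starting from CR3. */
uint32_t page_walk(const uint8_t *ram, uint32_t cr3, uint32_t linear)
{
    assert(ram);
    uint32_t pde = read32(ram, (cr3 & ~0xFFFu) + dir_index(linear) * 4u);
    if (!(pde & PTE_PRESENT)) return PAGE_FAULT;
    uint32_t pte = read32(ram, (pde & ~0xFFFu) + table_index(linear) * 4u);
    if (!(pte & PTE_PRESENT)) return PAGE_FAULT;
    return (pte & ~0xFFFu) | page_offset(linear);
}
```

Compared to the single shift-and-add of the non-paging case, every access here costs two extra memory reads and two tests, which is the slowdown that needs to be kept out of the non-paging code path.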

The biggest problem I have been having with this is to figure out a way to make the memory access behave differently when paging is in use. Until now all the linear page lookup code has been inlined into all the opcode handlers, so there is no subroutine that I could change to work differently. And, since my primary goal was to not make the non-paging code any slower, adding a subroutine was out of the question.

The current solution I am working with, is a bit of a brute force method: If I increase my linear page mapping table to have 1048576 entries (to map the whole 4GB of virtual RAM), I can then precalculate all the needed physical page addresses and use the exact same code I use without paging to look up the physical address also when paging is in use! This however means that the linear page mapping table will take a whopping 4MB of RAM! I checked how much memory I currently use, and noticed that my code takes about 4MB, and the emulated RAM takes 16MB. Then there is the allocated EMS memory of 4MB, plus some stack area etc. So, slightly over 24MB of the 32MB of RAM is in use. Increasing the page table to 4MB will still leave a few megabytes free, so it should actually fit. And of course only a small part of the 4MB mapping table is actually accessed, even though it needs to contain the whole 4GB address range. The game can use whatever virtual addresses it wants, as the paging tables then map these to the existing RAM.
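The table sizes mentioned here and earlier follow from simple arithmetic, sketched below under the assumption of 4-byte table entries (which is what makes 1048576 entries come out as 4MB). The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of look-up table entries needed for a flat page mapping table. */
uint64_t lut_entries(uint64_t address_space, uint32_t page_size)
{
    assert(page_size != 0 && address_space % page_size == 0);
    return address_space / page_size;
}

/* Size of that table in bytes, assuming one 32-bit value per page. */
uint64_t lut_bytes(uint64_t address_space, uint32_t page_size)
{
    return lut_entries(address_space, page_size) * sizeof(uint32_t);
}
```

This gives 1024 entries for 16MB with 16KB pages, 4096 entries for 16MB with 4KB pages, and 1048576 entries (4MB) for the full 4GB with 4KB pages. Added to the roughly 24MB already in use, that comes to about 28MB of the 32MB available.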

I am thinking of using a system similar to what DOSBox uses when paging is turned on: All the look-up table values are cleared, so that whenever a new page is accessed it is forced to go through an initialization routine. This initialization routine then checks and fills the bookkeeping bits of the page, and in my case it can also fill the linear look-up table with the proper physical address. I already have code in (almost) all the look-up table access routines to jump to a different handler when EGA or Mode-X RAM is accessed. I just need to add a new case into this opcode-specific jump table so that in the future it selects between RAM, EGA, Mode-X and uninitialized page handling routines. This should not make the code any slower, but it does make it somewhat bigger.
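
The lazy initialization idea can be sketched as follows. This is only an illustration of the general technique, not the DOSBox or DS2x86 implementation: the class, the resolve callback and the use of None as the "cleared" marker are all hypothetical, and the bookkeeping-bit updates are left out.

```python
UNINITIALIZED = None  # hypothetical marker for a cleared table entry

class LazyPageMap:
    def __init__(self, resolve):
        self.resolve = resolve   # hypothetical: linear page number -> physical page base
        self.table = {}          # sparse stand-in for the 1048576-entry table
        self.init_calls = 0      # counts how often the slow path ran

    def clear(self):
        """Forget all cached mappings, e.g. when paging is turned on."""
        self.table.clear()

    def physical(self, linear):
        page = linear >> 12
        base = self.table.get(page, UNINITIALIZED)
        if base is UNINITIALIZED:        # slow path, taken once per page
            self.init_calls += 1
            base = self.resolve(page)    # a page table walk would go here
            self.table[page] = base
        return base + (linear & 0xFFF)

# Made-up mapping: every linear page lands 1MB higher in physical memory.
pages = LazyPageMap(lambda page: (page << 12) + 0x100000)
first = pages.physical(0x2345)
second = pages.physical(0x2346)
```

Both accesses hit the same page, so the initialization routine runs only for the first one. The second access takes the same fast path as the non-paging code.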

There are currently around 500 different locations where this inline code is used, and I'll need to generate handling code for the uninitialized page situation for all of those locations. This will obviously not be done by the next version yet, but this is what I plan to work on from now on. I'll also try to fix some bugs in the code and release new versions, but there will probably not be any other new features until I get this paging system done.

July 31st, 2011 - DS2x86 version 0.21 released!

This is mainly just a quick fix version, to fix some issues in the 0.20 version with a few more FPU opcodes implemented. The new FPU opcodes are fsincos, fptan, fprem, fyl2x, f2xm1 and fscale. The first three I have not tested properly yet, so those might have some bugs. The last three I have tested so they at least should work properly.

At least two new games seem to be runnable (or at least startable) in this version, Abuse and Comanche. Abuse needed just a few more FPU opcodes implemented to start running. Comanche however had some more problems:

It went into a weird 640x240 graphics mode. This turned out to actually be a 320x240 tweaked Mode-X, but the game went into this mode by first going into the VGA 640x480 16-color mode, and then just setting the bit in the mode register that selects between 256-color and 16-color modes. My graphics mode detection code did not check for this bit after the original mode change had occurred, so it stayed in 16-color mode and thus the graphics were completely corrupt. I added a check for this bit into the 640x480 16-color mode handling, so now Comanche uses the correct graphics mode.
Comanche also took a very long time to start (while it started immediately in DOSBox) and ran very slowly. This was caused by a broken PC timer #2 handling in DS2x86. The normal interrupt timer #1 worked correctly, but the secondary timer (which is rarely used for other things besides the PC beeper sounds, which are not yet supported in DS2x86) only worked with the default 18.2Hz speed, all other speed options caused weird behaviour. I fixed that (for at least the most common timer modes), and now Comanche starts up properly. This might have caused weird slowdowns in some other games as well.
There were also some very rare opcodes missing and some VGA palette issues with Comanche, which I also fixed. The problem with the palette was caused by the game being in 16-color mode while it sets up the 256-color mode palette. Hopefully my change to this behaviour did not break any real 16-color games.
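
For background on the timer speeds mentioned above: the 18.2Hz default comes from dividing the PC timer chip's input clock by the largest possible reload value. A small Python sketch of that arithmetic, using the commonly cited nominal clock of 1193182Hz (the function name is made up, and this is not the DS2x86 timer code):

```python
PIT_CLOCK_HZ = 1193182  # nominal input clock of the PC timer chip

def timer_rate_hz(reload_value):
    """Rate for a 16-bit reload value; 0 is treated as 65536."""
    divisor = reload_value or 65536
    return PIT_CLOCK_HZ / divisor

default_rate = timer_rate_hz(0)      # the power-on default
fast_rate = timer_rate_hz(1193)      # a game asking for roughly 1kHz
```

The default works out to about 18.2Hz, while a reload value of 1193 gives roughly 1000Hz. A timer that only handles the default speed will make such a game appear to run far too slowly.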

I also found that the old Wing Commander Armada CD-version game I have on my bookshelf is actually the DOS version, so I installed that using DOSBox and copied it to my SD card and tested running it in DS2x86. I got an error message "EMS driver is not VCPI compliant", so I will probably next add the VCPI features to the emulated EMS driver, if that turns out not to be a big issue. Even though the game recommends using a 486/33 machine, it has the option to only play the turn-based strategy parts of the game, which should work fine even on a slower machine.

The bigger issue I need to work on pretty soon is the virtual memory support. I need to figure out a way to implement this in such a way that it does not slow down all existing memory access. This is the most difficult part of implementing the virtual memory, and this is why I haven't implemented it yet. I do have some ideas, so I'll start experimenting with this issue in the near future.

My summer vacation ends today, so I will get back to the normal two-week release cycle, as I don't have all that much time to work on DS2x86 (or DSx86) during workdays. Thanks again for your continued interest in this project, and for the debug logs and other bug reports you have been sending!

July 24th, 2011 - DS2x86 version 0.20 released!

Sorry it took me over a month to get a new version released. However, this version now has two noteworthy improvements, namely FPU support and the fact that this version is now built with version 0.13beta of the SDK. This version of the SDK was already released at the end of last year, but I only recently noticed that I still used the even older 0.12beta version to build DS2x86. This might have caused at least some of the audio problems, as the audio features were somewhat improved in SDK v0.13beta. I also decided to jump the version number to 0.20 with this version, to show that it has some major internal changes.

The biggest enhancement is the addition of FPU support. For now this only works in 32-bit protected mode (meaning those DOS4GW games), as those are the games that mostly expect the existence of an FPU. My two test games, X-COM UFO and Destruction Derby, now seem to run the FPU parts fine. The problems I mentioned in the previous blog post were actually both caused by the FPU opcodes, though in some rather interesting ways. It took me almost three days to track down and fix the problems. There is still a rather serious problem in the texture mapping in Destruction Derby, but I am not sure if that has anything to do with the FPU opcodes. In any case, Destruction Derby runs much too slow to be properly playable, so fixing the texture mapping is not a high priority. Fixing it might fix some other games as well, so I do plan to look into it at some point.

Anyways, the FPU problem in Destruction Derby was in the FCOMP (compare) opcode. When I first tested that opcode it seemed to return a correct result, so I spent a lot of time looking elsewhere for the problem. In the end it turned out that I had a small typo in the opcode handler: I had written "sll t4, t3" when I meant "sll t4, t3, 1", and the assembler had interpreted it as "sllv t4, t4, t3". In this case the assembler tried to be a bit too smart, when it replaced the sll opcode (which can only have an immediate shift value) with the sllv opcode (shift by a register value). Since I had forgotten to type the immediate value, it would have been nicer if the assembler had given an error instead. That problem meant that the comparison result was mostly random: it sometimes returned the correct result and sometimes not, somewhat depending on the lowest 5 bits in the registers I wanted to shift left by one bit before the comparison.
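
To make the difference between the two opcodes concrete, here is a small Python model of both shifts on 32-bit registers. The register values are made up for the example.

```python
MASK32 = 0xFFFFFFFF

def sll(rt, shamt):
    """sll rd, rt, shamt: shift left by an immediate amount."""
    return (rt << shamt) & MASK32

def sllv(rt, rs):
    """sllv rd, rt, rs: shift left by the low 5 bits of register rs."""
    return (rt << (rs & 31)) & MASK32

t3, t4 = 0x21, 5             # example register contents

intended = sll(t3, 1)        # "sll t4, t3, 1"    -> t4 = t3 << 1
assembled = sllv(t4, t3)     # "sllv t4, t4, t3"  -> t4 = t4 << (t3 & 31)
```

With these values the intended result is 0x42, while the assembled instruction produces 10: the old contents of t4 shifted by the low 5 bits of t3. The two only agree by coincidence, which fits the mostly random comparison results.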

A more interesting problem was the one that affected X-COM UFO. It behaved very erratically, dropping to a debugger with an unsupported opcode at random locations, where the opcode should have been supported. That pointed to the interrupt routine failing to execute properly and dropping back to where the interrupt should have happened. It took me a while to notice that when it dropped into the debugger, the Virtual 86 mode flag in the flags register was set! I had not coded support for interrupts in virtual 86 mode yet, so that was the reason for the weird unsupported opcodes. However, I was pretty sure that the game does not set this flag on purpose, so I added a check into DS2x86 to drop to the debugger immediately after this bit gets set in the CPU flags. After a few test runs, the results were quite interesting. Once the bit did not get set, but DS2x86 dropped into the debugger anyways. In other runs, twice the opcode was fsin, and once fcos, two of my new FPU opcodes.

The interesting part was that I did not touch the CPU flags in the fsin and fcos opcode handlers at all! I did store the register that contained the flags onto the stack, along with various other registers that calling the GCC math library sin() and cos() functions will clobber, and then restored the registers after the call. I could not immediately see anything wrong with my code, so just for fun I copied the flags before the GCC library call to the EAX emulated register and the same flags after restoring them from the stack to the ECX emulated register, and then added a forced drop to the debugger after the call. And, much to my surprise, the value restored from the stack was completely different from the value I pushed to the stack!

I looked into the dump file for what happens in the GCC library sin() and cos() functions, and noticed that indeed, they destroy the two topmost words of the caller's stack! I have a very hard time believing this behaviour is by design; this might actually be a bug in the GCC compiler itself. Here below is the dump from the start of the math library sin() function. The input is a 64-bit double value, in 32-bit registers a0 (low word) and a1 (high word).

80250fd0 <__sin>:
80250fd0:	3c027fff 	lui	v0,0x7fff
80250fd4:	3442ffff 	ori	v0,v0,0xffff
80250fd8:	00a23024 	and	a2,a1,v0
80250fdc:	3c033e50 	lui	v1,0x3e50
80250fe0:	27bdfda8 	addiu	sp,sp,-600		// sp -= 600, make room for local variables
80250fe4:	00c3182a 	slt	v1,a2,v1
80250fe8:	afbe0250 	sw	s8,592(sp)
80250fec:	afbf0254 	sw	ra,596(sp)
80250ff0:	afb7024c 	sw	s7,588(sp)
80250ff4:	afb60248 	sw	s6,584(sp)
80250ff8:	afb50244 	sw	s5,580(sp)
80250ffc:	afb40240 	sw	s4,576(sp)
80251000:	afb3023c 	sw	s3,572(sp)
80251004:	afb20238 	sw	s2,568(sp)
80251008:	afb10234 	sw	s1,564(sp)
8025100c:	afb00230 	sw	s0,560(sp)
80251010:	afa40258 	sw	a0,600(sp)		// Store the low word of input to sp+600
80251014:	afa5025c 	sw	a1,604(sp)		// Store the high word of input to sp+604

You can see that the routine first reserves 600 (0x258) bytes of stack space for local variables, but then stores the input value to offsets 600 and 604 from the start of the reserved space! This effectively destroys the two last words that the caller had pushed into the stack, which in DS2x86's case were the emulated CPU flags and a stack size mask register. So, after returning from fsin or fcos routines, the flags were randomly set, and also the stack-relative addressing might have addressed the wrong memory area.
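
The effect of the prologue above can be modelled with a toy stack in Python. All names and values here are made up; the model only tracks which addresses the two sw instructions at offsets 600 and 604 end up writing to. For what it is worth, the MIPS o32 calling convention expects the caller to reserve a 16-byte argument area at the top of its stack frame, which a callee may use to store a0..a3, so these stores look consistent with that convention rather than with a compiler fault. Treat that reading as an assumption about how the SDK toolchain is configured.

```python
def call_sin_prologue(stack, sp, a0, a1):
    """Mimic the prologue shown above: reserve 600 bytes, then store a0/a1."""
    sp -= 600
    stack[sp + 600] = a0   # lands on the caller's word at its own sp + 0
    stack[sp + 604] = a1   # lands on the caller's word at its own sp + 4

def caller(reserve_words):
    """Push two values, optionally leave spare words on top, then call."""
    stack = {}
    sp = 0x1000
    sp -= 8
    stack[sp] = "flags"            # last two words pushed by the caller
    stack[sp + 4] = "stack mask"
    saved_at = sp
    sp -= 4 * reserve_words        # leave spare words at the top of the stack
    call_sin_prologue(stack, sp, "a0", "a1")
    return stack[saved_at], stack[saved_at + 4]

clobbered = caller(0)   # no spare words: the saved values are overwritten
preserved = caller(2)   # two spare words: the saved values survive
```

With no spare words the caller reads back the two input words instead of its flags and stack mask. With two spare words the stores land in the spare area and the saved values come back intact.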

This problem was easy to overcome by leaving two extra words at the top of the stack when calling these functions, but if you are writing software for DSTwo yourself, beware of this problem! I spent some time googling whether this was a known problem and if a fix is available, but did not immediately find anything. Let me know if you know the correct place to look for this information!

I also made some other fixes, based on the debug logs you have sent. Still another change is that I changed the Makefile to compile the C modules with the -mno-long-calls option, which makes the code smaller and faster. I don't know why the SDK defaults to long-calls, as they are only needed if the code does not fit into a single 256MB block, and the DSTwo only has 32MB of RAM! Anyways, let me know if you run into any new problems with this version, and feel free to test the games that used to report unsupported FPU opcodes! Thanks again for your interest in DSx86 and DS2x86!

July 17th, 2011 - DS2x86 FPU progress

For the past week I have been implementing FPU opcodes into DS2x86. I have been using X-COM UFO and Destruction Derby for testing the FPU opcodes. The current status is that both of those games seem to run the FPU parts fine, but then crash because of some probably unrelated issues. So, before I can continue with the FPU opcodes I think I need to fix those other issues first. For example, X-COM UFO now allows the rotation of the globe in the GEOSCAPE part, but when you select the home base location it begins to misbehave.

The FPU opcodes consist of 8 actual opcodes, 0xD8..0xDF, each of which has the modrm byte, so there are actually 8*256 FPU opcode variations. I have implemented the first group of 256 variations completely, and then various other opcodes partially. The rarer opcode variations are still missing. I have been using the DOSBox implementation, along with a good FPU reference called Simply FPU by Raymond Filiatreault, to handle the communication between the CPU and FPU. For the FPU internals, I have used the SoftFloat library together with DOSBox sources as a reference. I decided to call the GCC library functions for most of the more complex operations (like the actual arithmetic operations ADD, SUB, MUL, DIV, SQRT, etc), but I coded all the conversion routines between the floating point formats and between integer and floating point values myself in MIPS ASM. That should help with the speed somewhat, as calling the C routines is pretty slow. Another drawback of using the C library routines was that the ds2x86.plg file size increased quite dramatically, first by 150KB when adding the atan() call, and by another 90KB when adding the sin() call. The basic routines did not increase the file size, as they are probably used by the SDK already.
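
The 8*256 figure can be illustrated with a small Python sketch that turns an FPU opcode byte and its modrm byte into a flat index. The flat handler table is a hypothetical layout chosen for the example, not necessarily how DS2x86 dispatches these opcodes.

```python
def fpu_index(opcode, modrm):
    """Flat index 0..2047 into a hypothetical per-variation handler table."""
    if not 0xD8 <= opcode <= 0xDF:
        raise ValueError("not an FPU escape opcode")
    return ((opcode - 0xD8) << 8) | modrm

def split_modrm(modrm):
    """Return the (mod, reg, rm) fields of a modrm byte."""
    return modrm >> 6, (modrm >> 3) & 7, modrm & 7

variations = 8 * 256                 # opcodes 0xD8..0xDF times 256 modrm values
last = fpu_index(0xDF, 0xFF)         # highest index in the table
fields = split_modrm(0xC1)           # D8 C1 is FADD ST,ST(1)
```

For the byte pair D8 C1 the modrm fields are mod=3 (register operand), reg=0 and rm=1. The first group of 256 variations mentioned above would then occupy indices 0 to 255 of such a table.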

Anyways, the next step is to fix the other problems in X-COM UFO and Destruction Derby, so that I can continue using them for my FPU tests, and then implementing the remaining FPU opcodes. I have only implemented the FPU handling from protected mode for now, as all the games I know that need FPU are protected mode games. I suppose I need to eventually handle FPU calls from real mode as well.

July 10th, 2011 - DSx86 version 0.36 released!

DSx86 0.36 release notes

This version has only small game-specific fixes, no major new features. I originally planned to have some new features as well, but then I got distracted by an idea to start implementing FPU opcodes to DS2x86. Anyways, here is a list of the games I worked on in this version:

Arachnophobia is reported on the Compatibility Wiki as giving an "Unsupported graphics mode 09" error, but I was not able to reproduce this. It does give a green screen at the start, as it changes the EGA palette so that the black color is replaced by a green color, but after some loading time it displays the logo screen fine. I can progress up to the copy protection screen, so it looks like the game will work in this version of DSx86.
Blake Stone 2 just quits back to DOS when launched. I spent some time debugging this problem, and then finally found the reason. The game checks the CPU type, and when it detects a 286 (or lower) processor, it simply quits to DOS with no error messages. So, this game will not run in DSx86, but it might run in DS2x86 at some point. It already starts up in DS2x86, but seems to lock up after a little while.
Fury of the Furries uses SB DSP command 0x24 (ADC DMA 8-bit) to detect the SoundBlaster. This command was not supported in DSx86, and as the game spinlooped waiting for the SB IRQ to happen, it did not progress further. I implemented this command (faked it, actually, as recording audio using DSx86 would not be all that useful), and so the game does not hang any more. The game uses some in-frame palette animation to display more than 16 colors in EGA mode within a single frame, and as the frame timings differ between DSx86 and real hardware, this will cause some annoying flickering at least in some parts of the game. This would be quite difficult to fix, and attempting to change the timings so that this trickery would work might also break many other games, so I do not plan to do that.
JumpJoe 2 uses the DOS 1.0 FCB file operation "Write random record to FCB file" which was not yet supported in DSx86. I implemented this function, but did not test the game itself so I don't know if this was enough to make the game run properly.
Risky Woods uses some opcodes with a "LOCK" prefix. This prefix is meant to synchronize bus access on a multi-CPU machine for atomic memory operations, and is very rarely used in application programs. I had not implemented this opcode mostly because the interrupt table at 0000:0000 happens to have a lot of bytes with 0xF0 (which is the hexadecimal value of the LOCK prefix), so if a game in DSx86 jumps to a zero pointer it will drop automatically to the debugger almost instantly. However, this game seems to actually use the LOCK prefix, so I implemented it (as a no-operation). This allows the game to run fine.
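
Treating the LOCK prefix as a no-operation, as described for Risky Woods above, amounts to skipping the 0xF0 byte before decoding the actual opcode. A minimal Python sketch with a hypothetical helper name (the real opcode handlers are of course not written like this):

```python
LOCK_PREFIX = 0xF0

def next_opcode(code, ip):
    """Skip any LOCK prefix bytes and return (opcode, ip of that opcode)."""
    while code[ip] == LOCK_PREFIX:
        ip += 1
    return code[ip], ip

# F0 87 07 = lock xchg [bx],ax
opcode, ip = next_opcode(bytes([0xF0, 0x87, 0x07]), 0)
```

On a single-CPU emulated machine there is no other bus master to synchronize with, so ignoring the prefix does not change what the instruction does.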

DS2x86 progress

I have not yet figured out the cause for the audio buffer problems in DS2x86. Since solving this problem still needs more time, I decided to instead work on enhancing DS2x86 in other ways, starting with floating point (FPU) support. It would be reasonably fast to add FPU opcodes simply by calling the GCC library floating point routines (which is what DOSBox does when not running on an x86 host), but that would mean that I would need to switch between the register convention of my emulator core and the C language register convention all the time. Those register usage conventions differ so much that I would need to push and pop around 10 registers to/from the stack for each FPU opcode I handle, and this would slow down the handling quite a bit. Also, it would be difficult to handle various floating point exceptions and rounding methods reliably. So, instead of that, I decided to code all the FPU operations in ASM, using a hybrid of the DOSBox fpu.cpp source code and the SoftFloat C library by John Hauser. I am basically using the FPU API from DOSBox, but the internal algorithms are based on SoftFloat (but everything optimized in MIPS ASM).

This will of course be a lot of work, but as one major goal of the whole DSx86 project is that I want to learn how the x86 architecture works, and as I have been rather unfamiliar with the whole IEEE floating point standard until now, I'll take this as a learning experience. I should have a pretty good understanding of how floating point values work after I have coded all the needed floating point math functions in Assembler. :-)

I also plan to redo my personal homepage at http://www.patrickaalto.com during my current summer vacation, as that page is really outdated and does not even have a link to my DSx86 pages at all! Rather embarrassing...

July 7th, 2011 - DSx86 wins Homebrew Bounty 2011!

DSx86 is the winner of GBATemp's Homebrew Bounty 2011 competition, in the Nintendo DSi category! In addition, DS2x86 got a second prize in the DSTwo category. Huge thanks to everyone who voted for DSx86 (and DS2x86), and to all of you who have tested DSx86 and DS2x86 and sent me the debug logs and other information! The logs have been a big help in improving DSx86, so without your help I doubt I would have won this prize.

July 3rd, 2011 - Slow DSx86 progress

During this past week I haven't worked much on DSx86 or DS2x86. The two main reasons for that were the broken NAS server machine, and the heat wave we have been having. The broken NAS server meant that I had to spend some time building and setting up the new machine, and the heat wave in turn makes me want to do other things besides programming. :-)

I got the parts for the new NAS server on Friday, and I got the machine running on Friday evening. Somewhat surprisingly, everything worked on the first try! I did not have to do much troubleshooting, which is usually needed when building a new PC. All of Saturday then went to configuring and setting up the machine, as I also added a new disk so that I now have two RAID arrays on it. Today I have been moving various files around on the disks, so that each partition has some free space for future needs.

The weather is supposed to be getting somewhat cooler for the next week, so I plan to continue working on DSx86 and DS2x86 then. My summer vacation begins tomorrow, so I should have more time to work on my projects in the next four weeks. I usually come up with some additional projects during my summer vacation, so I might not spend all my time working on my emulators.

Pretty much the only thing for DSx86 that I have done during the past week is downloading a few more misbehaving games, and doing some preliminary tests with them. I have not yet figured out how I could fix the audio problems in DS2x86, so I will probably work on the misbehaving games until I get some fresh ideas.

That's it for this short situation report, hopefully next weekend I will have something more relevant to report!

Previous blog entries

See here for blog entries from January-June 2011.
See here for blog entries from July-December 2010.
See here for blog entries from January-June 2010.
See here for blog entries from 2009.
