DSx86 - Blog

Dec 26th, 2010 - DS2x86 Alpha 0.01 release!

Well, here it finally is, the first alpha version of DS2x86! Please note that this is a very early alpha, and it has a lot of bugs and missing features. You should consider this just a test bench for my DSx86 port to the built-in 360MHz MIPS processor of the SuperCard DSTwo flash cart. But, since a few games seem to be somewhat playable already, and as I sort of promised an alpha version by the end of the year, here it is!

Installation

If you already have a working DSx86 installation on your flash cart, all you need to do is to copy the "DS2x86.plg", "DS2x86.bmp" and "DS2x86.ini" files into the "_dstwoplug" directory of your SD card. DS2x86 can use the same DSx86.ini file that you have used in the original DSx86. You can also use a separate DS2x86.ini if you wish to have different settings for DS2x86 and DSx86. NOTE! Do not confuse the two DS2x86.ini files! The one included in the zip file is the one required by the DSTwo plugin system, the one in the /data/dsx86 directory (and created by you) is the one used by the DS2x86 itself for game-specific configuration.

If you do not yet have the original DSx86 on your SD card, follow the instructions on my download page (or at DSx86 Compatibility Wiki). It is recommended you familiarize yourself with the original DSx86 first, before testing DS2x86. Or, you might want to wait for a more stable version of DS2x86 before installing it.

Limitations compared to DSx86

Enhancements compared to DSx86

Issues you might run into

Thanks for your interest in DS2x86! Have fun testing it, and again, please remember this is a very early alpha version! Be surprised if something actually works, not if it fails! :-)

Dec 19th, 2010 - EGA & Mode-X work, Team Cyclops iEvolution

EGA & Mode-X work

During the past week I have mainly been working on adding pretty much all the EGA opcodes, and then adding most of the Mode-X specific opcodes. Both are currently so far along that many of the games supported in the original DSx86 seem to at least start up also in DS2x86. I also tested a couple of game trainer intros that I happened to have on my SD card. Those use 386 opcodes in real mode, and after I added the 386 opcodes that those needed, they are also starting up. These obviously do not work at all in the original DSx86, so DS2x86 will bring a little bit of extra compatibility already in the first alpha version! Here are screen copies of those, "brun_it!" by Eternity and "HexxTrnr" by Qwerty.

 

It would be interesting trying to get some proper 386-game running, but I still have so much stuff missing from the hardware support. Also the features that exist still have bugs (for example, Wolfenstein 3D does not read the keyboard properly, and Supaplex has problems in the palette animation), so that I think I need to focus on those before working on the 386 features. But, it is starting to look that even the first DS2x86 alpha version, to be released in a week or two, will be able to run a few games!

Team Cyclops iEvolution

Team Cyclops last week announced that their new iEvolution flash cart will allow the use of DSi mode for homebrew development. That means that when running on Nintendo DSi, a homebrew software has access to a faster CPU (133MHz instead of 66MHz, I believe) and more RAM (16MB instead of 4MB). They also offered a free iEvolution cart for "legitimate" homebrew coders. I contacted them, and they accepted me as one such, and will send me an iEvo flash cart. So, after I have got DS2x86 working properly, I could start working on a DSix86 version. :-)

I am not sure yet what needs to be done in homebrew software to enable DSi mode with iEvolution, but I assume they have some kind of an SDK (or at least instructions on enabling it when using the normal devkitARM). With a two times faster processor and more RAM, I think it might be possible to add 386-opcode support into DSix86 as well. It would still not run all that fast, perhaps at a speed of a 20MHz 386 machine, but it would have some advantages over the DS2x86 version (namely the ability to use my existing ARM7 code with the AdLib emulation). I assume it would be much faster and easier to do the DSix86 port than it has been to do the DS2x86 port to a completely different CPU architecture. Some of the current DSx86 architecture (which I had to rethink for DS2x86) is not very well suited to adding support for 32-bit registers, so coding the DSix86 version will still take some time to do, but it should not take half a year! Anyways, I'll let you know when I receive the iEvolution cart and can see how it works.

Happy Xmas to every one of you celebrating it!

Dec 12th, 2010 - DS2x86 EGA scaling

Last week I worked on the EGA opcodes. Currently the great majority of the opcodes (that are supported in DSx86) have been ported to DS2x86, so a couple of EGA games are already running fine. Some string opcodes are still missing, all the BIOS-based character output functions have yet to be coded, and I currently only have a blitting routine for the 320x200 resolution mode, so many games still fail. I just this morning decided to experiment with the EGA mode linear interpolation screen scaling routine. I first coded it by handling the palette calculations while blitting, and after I got that working I switched to a precalculated look-up-table that has the interpolated output pixel colors precalculated for each two input pixels. Since the 16-color EGA mode uses 4 bits per pixel, it is simple to lookup the output pixel value based on an 8-bit input value containing two adjacent pixels. The routine that handles the LUT filling whenever the palette value changes is obviously rather slow now, but since that specific routine is only used in the 16-color modes, and those rarely perform fast palette animations, this will most likely not be much of a problem. The actual blitting routine is almost as fast as the non-scaling routine, so there will be practically no performance penalty in using the scaled screen mode!
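
As a rough illustration of the look-up-table idea described above, here is a minimal C sketch. It is not the actual DS2x86 routine (which is written in MIPS assembly). The function names, the 15-bit color layout, the nibble order inside the byte and the weight parameter are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Blend two 15-bit colors per channel: (wa*a + (4-wa)*b) / 4, with wa in 0..4.
   The middle channel is handled separately so the channel sums cannot collide. */
static uint16_t blend_quarters(uint16_t a, uint16_t b, uint32_t wa)
{
    uint32_t wb = 4u - wa;
    uint32_t outer  = (((uint32_t)a & 0x7C1Fu) * wa + ((uint32_t)b & 0x7C1Fu) * wb) >> 2;
    uint32_t middle = (((uint32_t)a & 0x03E0u) * wa + ((uint32_t)b & 0x03E0u) * wb) >> 2;
    return (uint16_t)((outer & 0x7C1Fu) | (middle & 0x03E0u));
}

/* Fill a 256-entry table indexed by one byte holding two adjacent 4-bit
   pixels (assumed here: first pixel in the high nibble). This is the slow
   part that has to be redone whenever a palette entry changes. */
void build_pair_lut(const uint16_t palette[16], uint32_t w_first, uint16_t lut[256])
{
    assert(w_first <= 4u);
    for (unsigned pair = 0; pair < 256; pair++)
        lut[pair] = blend_quarters(palette[pair >> 4], palette[pair & 15u], w_first);
}
```

With such a table the blitter only needs one table read per output pixel, which matches the observation that the scaled blit costs about the same as the unscaled one.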

I was mainly interested in seeing what the scaling quality would be when scaling screens with some small text, so here below are screen copies of some screens from Duke Nukem 1 and Duke Nukem 2. In my opinion the text is for the most part quite readable, much better than with the hardware scaling as used in the original DSx86.

  

I also figured out a faster way to handle the separate memory access methods between normal RAM, EGA VRAM and ModeX VRAM. Since the graphics mode memory is organized so that each input byte address maps to a word address in the emulated VRAM, I now precalculate the two-bit-shifted memory addresses into the main page mapping table, so that the result address into the graphics memory can be taken simply by shifting the address generated by the common memory address calculation macro. Previously I first had to subtract the logical memory start address, then shift the value, and then add the physical graphics memory start address to the value. This simple change increased the Trekmo framerate by 0.5 fps, which together with some earlier general speedups now gives a total framerate of 12.9 fps. Compared to the 11.9 fps value a few blog posts back, that is quite a nice increase. Now DS2x86 runs Trekmo at about the speed of a 40MHz 386 machine. There is most likely still room for improvement in various locations in the code, I just haven't yet figured out proper ways to improve them.
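
One possible reading of this address trick, sketched in C. The names, the 0xA0000 window start and the shape of the page table entry are assumptions for illustration only, the real code is MIPS assembly and may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define LOGICAL_VRAM_START 0xA0000u  /* assumed x86 address of the VGA window */

/* Older scheme: subtract the logical start, shift, then add the physical base. */
uint32_t vram_addr_old(uint32_t logical, uint32_t phys_base)
{
    return ((logical - LOGICAL_VRAM_START) << 2) + phys_base;
}

/* Value stored once in a (hypothetical) page mapping table entry. */
uint32_t vram_page_entry(uint32_t phys_base)
{
    assert((phys_base & 3u) == 0);   /* the trick needs a word-aligned base */
    return (phys_base >> 2) - LOGICAL_VRAM_START;
}

/* Newer scheme: the common "entry + logical address" sum, then a single shift. */
uint32_t vram_addr_new(uint32_t logical, uint32_t page_entry)
{
    return (logical + page_entry) << 2;
}
```

Both functions give the same result for a word-aligned base, but the second one needs only the shift on top of the address calculation that every memory access already performs.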

I have a two-week Xmas vacation starting on the 20th, so my current plan is to work on DS2x86 adding the most essential missing features, so that I can then release the first alpha version before the new year. Some of the biggest features that exist in DSx86 but are still missing from DS2x86 are:

I will certainly not be able to add all of these before the new year, so the first alpha version of DS2x86 will be quite limited. It will mostly be meant as a technical demonstration and as a test bench for you DSTwo card owners, so that you can help me in testing various games on it and report the still missing features.

Dec 5th, 2010 - DS2x86 EGA work

Last week was a very busy week at the office, we had a big customer delivery of our software, so that I did not even have time to work on DS2x86 in the evenings. Thus, nothing much has happened since the previous blog post. I started working on porting the EGA graphics code from the original DSx86 to DS2x86, though. That is somewhat boring work, as it just means translating the same algorithms from ARM assembly to MIPS assembly. However, I decided to change the memory organization of the emulated EGA graphics to better suit the DS2x86 16-bit color screen blitting, so that will at least cause some changes.

Since the last blog post I have also improved the SB digital audio handling a bit, so that it now sounds pretty good in Wing Commander II. It is still not very good when playing LineWars II, most likely because LW2 uses very short DMA buffers and my 60Hz DMA buffer scanning rate in DS2x86 might be too slow for that. Other minor improvements include better debug screen handling, so that I can now print debug strings also while simultaneously showing the touchpad keyboard. This helps me in debugging the new features, but it won't affect the release version.

In general I have not had any problems with the DS2 SDK any more, DS2x86 seems now to start properly every time. Also my screen and audio updating routines seem to be quite robust at the moment, so that I can focus on improving the actual x86 emulation. There are still a lot of things missing, and my current focus is to get many of the same games that run in DSx86 running in DS2x86, so that I can release the first alpha version by the end of the year. After that I will focus on the 386-specific and protected mode features. The DS2x86.plg file is already almost 2 megabytes, and it takes 6 minutes to FTP-transfer to my DS Lite, which is quite annoying. I always try to think of something else to do while it transfers, but especially if there is a minor bug in the latest code, it is quite frustrating to fix the problem in a few seconds, build a new version, and then again wait over 6 minutes to see if the problem got fixed.

Tomorrow is the independence day of Finland, so it is a holiday and I can continue working on the EGA features. I hope to get something showing on the screen in an EGA game by tomorrow evening. That's all for this short blog post, hopefully I have something more interesting to tell in my next blog post. :-)

Nov 28th, 2010 - DS2x86 screen scaling revisited, audio work

Screen scaling revisited

After posting my previous blog post, I got an email from Grégori Macário Harbs, who said that he is interested in screen scaling and interpolation algorithms, and has some ideas to share. He suggested a more efficient division of the pixel weights when interpolating over 5 input pixels to produce 4 output pixels. He had even gone so far as to provide an example of how the WC2 title screen would look with different interpolation methods, and he also provided some source examples. Many thanks for the tips Grégori!

Using the ideas that he provided, I was able to make the DS2x86 interpolation algorithm much faster, and also improve the quality. The new algorithm converts the first 4 input pixels with 75%/25%, 50%/50% and 25%/75% weighting to 3 output pixels, and the fifth input pixel is directly output as the fourth output pixel. Since the 25%, 50% and 75% values can be given as 1/4, 2/4 and 3/4, they are much simpler and faster to calculate than my original 80%, 60%, 40% and 20% weights.
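
A minimal C sketch of this quarter-weight scheme is shown here. It assumes 15-bit colors stored in 16-bit words, and it assumes the three weighted outputs are built from input pairs (0,1), (1,2) and (2,3). All names are made up for the example, and the real routine is MIPS assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPREAD_MASK 0x03E07C1Fu

/* Move the middle 5-bit channel into the upper half-word, so each channel has
   spare bits above it and all three can be weighted with one multiply. */
static uint32_t spread555(uint16_t c)
{
    return ((uint32_t)c | ((uint32_t)c << 16)) & SPREAD_MASK;
}

/* (wa*a + (4-wa)*b) / 4 for all three channels at once, wa in 0..4. */
static uint16_t mix_quarters(uint16_t a, uint16_t b, uint32_t wa)
{
    uint32_t x = ((spread555(a) * wa + spread555(b) * (4u - wa)) >> 2) & SPREAD_MASK;
    return (uint16_t)(x | (x >> 16));
}

/* Scale one row from in_width pixels down to in_width * 4 / 5 pixels. */
void scale_row_5_to_4(const uint16_t *in, uint16_t *out, size_t in_width)
{
    assert(in_width % 5 == 0);
    for (size_t i = 0; i < in_width; i += 5, in += 5, out += 4) {
        out[0] = mix_quarters(in[0], in[1], 3);  /* 75% / 25% */
        out[1] = mix_quarters(in[1], in[2], 2);  /* 50% / 50% */
        out[2] = mix_quarters(in[2], in[3], 1);  /* 25% / 75% */
        out[3] = in[4];                          /* copied as-is */
    }
}
```

Because the weights are multiples of 1/4, the division is a plain two-bit shift, which is the speed advantage over the 80/60/40/20 percent weights.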

Some other ideas that I learned from him for future use are the possibility to precalculate the weighted palette values beforehand, which would be especially useful when interpolating the 16-color modes, and the fact that bilinear interpolation of the Mode-X 320x240 screen is actually 5/4 interpolation on both axes. Neither of these I have implemented yet, but I am looking forward to when I have time to implement and experiment with these. Here below are the screen copies from my original interpolation algorithm (on the left) and the new improved algorithm (on the right). I think the new image looks noticeably sharper with fewer artifacts.

 

Audio work

By the end of last week I had gotten the new screen interpolation working, so I finally looked into DSTwo SDK audio support. I studied the example provided with the SDK, and then began to look into ways to implement that in DS2x86. Compared to the audio features of Nintendo DS itself, the audio support of the DSTwo SDK is very limited. It only supports one 16-bit stereo audio channel, which can run at 11025, 22050 or 44100 Hz. In DSx86 I had allocated one separate 8-bit audio channel to the SoundBlaster digital audio, one PSG audio channel to the PC speaker sounds, one 8-bit audio channel to the Covox or SB Direct DAC audio, and 9 separate 16-bit audio channels for the AdLib emulation. Packing all of these to the single stereo channel will be quite a challenge.

In DSx86 I coded the audio support so that it will be as close to the x86 method as possible. The game running in the main emulator will send the audio command bytes to an I/O port, which then get sent to the ARM7 side using the FIFO mechanism. The code running on ARM7 then interpreted these bytes (like it was the actual SoundBlaster card) and performed the proper operations. Now in DS2x86 I have to use the same CPU for both running the main emulation and handling the audio emulation.

I decided to start with the simple case of trying to get the SB digital audio channel working. I still used Wing Commander II as the test bench, as it plays speech during the game intro. In DSx86 I could run the separate audio channel at whatever frequency the x86 game requested, but now I needed to select one of the supported frequencies. I decided to go with the 22050 speed, as I thought that going to the full 44100 Hz will slow the emulation down unnecessarily. Since the input audio is 8-bit and I need to convert it on-the-fly to 16-bit and use interpolation to adjust the playing frequency, the quality will be pretty poor in any case, so 44100 Hz will certainly be overkill.

The SDK example had no timing features, it just used a spin loop to check when the audio buffer is free to handle the next block of the WAV file. It was not quite clear to me how the audio buffer interaction works, and whether handling the buffer filling in a timer interrupt (like in my screen handling) will work, so I spent quite some time experimenting with different methods of filling and swapping the buffers, with different buffer sizes. I need to have a timer interrupt running at 60Hz, so that I can use it to handle screen refresh and vertical retrace signalling to the x86 game (many of which sync to the screen retrace signal). I wanted to use this same interrupt for the audio buffer filling, mainly so that the screen contents sending and audio sending from the MIPS side to the ARM side do not conflict with each other. There are some difficulties syncing this 60Hz timer with the buffer filling, as there are three different buffer sizes that need to be taken into account. The ds2 audio buffer size needs to be divisible by 128, the number of samples played during each 60Hz period is 22050/60 = 367.5, and the third size is the DMA buffer length that the x86 game (in this case Wing Commander II) uses. In WC2 the buffer is 4000 bytes (not 4096 bytes), and the frequency it uses is 10752 Hz, so that each 4000-byte x86 DMA transfer will actually generate 8203 output samples.
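
The buffer arithmetic and the 8-bit to 16-bit conversion can be sketched in C as follows. This is only an illustration under simplifying assumptions: mono output instead of the SDK's stereo channel, a whole input buffer available at once, and invented function names. It is not the DS2x86 audio code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of output samples produced by in_len input samples. */
size_t resampled_length(size_t in_len, uint32_t in_rate, uint32_t out_rate)
{
    assert(in_rate > 0);
    return (size_t)(((uint64_t)in_len * out_rate) / in_rate);
}

/* Convert unsigned 8-bit PCM to signed 16-bit PCM while stretching it from
   in_rate to out_rate with linear interpolation. Returns samples written. */
size_t resample_u8_to_s16(const uint8_t *in, size_t in_len, uint32_t in_rate,
                          int16_t *out, size_t out_cap, uint32_t out_rate)
{
    assert(in_rate > 0 && out_rate > 0);
    size_t n = resampled_length(in_len, in_rate, out_rate);
    if (n > out_cap)
        n = out_cap;
    uint64_t step = ((uint64_t)in_rate << 16) / out_rate;  /* 16.16 fixed point */
    uint64_t pos = 0;
    for (size_t i = 0; i < n; i++, pos += step) {
        size_t idx = (size_t)(pos >> 16);
        int64_t frac = (int64_t)(pos & 0xFFFFu);
        int64_t s0 = ((int64_t)in[idx] - 128) * 256;
        int64_t s1 = (idx + 1 < in_len) ? ((int64_t)in[idx + 1] - 128) * 256 : s0;
        out[i] = (int16_t)(s0 + (s1 - s0) * frac / 65536);
    }
    return n;
}
```

For the Wing Commander II numbers above, `resampled_length(4000, 10752, 22050)` gives the 8203 output samples mentioned in the text.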

The current status of my audio tests is that the SB digitized audio works in WC2, but the sound has some cracks and pops. I do not yet properly handle continuing with the next DMA transfer after the previous one has finished (leaving a buffer only partially filled), so I believe that is the cause of the cracks. However, the audio buffer filling happens inside a timer interrupt (which is not recommended in the SDK documentation), same as my screen update, and the game does not see anything wrong with the SB DMA IRQs and such, so I think I will get this working fine. I am currently using a 512-sample buffer (as the buffer obviously needs to have more than one 60Hz interval's worth of samples), but I will test whether different buffer sizes will improve things. Too long a buffer will cause a lag to the audio, so that is not helpful either.

In any case, still a lot to do on the audio front, but the big step forward was that I now understand the DSTwo SDK audio buffering method and can focus on working with it to implement the audio support in DS2x86.

Nov 21st, 2010 - DS2x86 screen scaling

Since last weekend I have been working on the I/O port handling and other things that already exist in DSx86 but which I had not ported over to DS2x86 yet. When I got DS2x86 running, I wanted to immediately start working on the protected mode stuff, so I left a lot of code commented out. Now I thought it is a good time to port this code as well, so that DS2x86 could run most of the same games that DSx86 runs. I got most of the EGA/VGA port addresses done, and then also coded the MCGA graphics support, so that Wing Commander II will progress up to the part where it attempts to play digitized sounds.

After I had coded the straightforward MCGA mode blitting (where only a 256x192 window of the original 320x200 screen is converted to 16-bit color and copied to the DSTwo SDK internal buffers), I thought that this might be a good time to look into improved screen scaling methods. Many users of DSx86 have requested a smoother screen scaling method, but I have not yet added that as it needs going from the 256-color palette mode to 16-bit color mode. Now with DS2x86 I need to use the 16-bit color mode in any case, so this is a suitable test bench for the better scaling method.

The biggest problem with the more advanced screen scaling methods is that they are quite expensive computationally. The direct palette conversion code is reasonably straightforward, it just needs an extra table lookup to convert the palette index to a 16-bit color that can then be written to the output buffer, like this (shown for a single row):

    la      v0, BG_PALETTE               // v0 = address of the palette table
1:  lbu     t3, 0(t1)                    // Get a byte (palette index) from VGA VRAM
    addu    t1, 1                        // Increment input index
    sll     t3, 1                        // Palette table has 16-bit values
    addu    t3, v0                       // t3 = pointer to the palette table
    lhu     t3, 0(t3)                    // Get the 16-bit color from the palette table
    addu    t0, 2                        // Increment output index
    sh      t3, -2(t0)                   // Store the 16-bit color to output table
    bne     t0, t2, 1b                   // Loop until one row done
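
For readers more comfortable with C than with MIPS assembly, the loop above corresponds roughly to the following sketch. The function and parameter names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one row of 8-bit palette indices into 16-bit colors. */
void blit_row_paletted(const uint8_t *src, uint16_t *dst, size_t width,
                       const uint16_t palette[256])
{
    assert(src != NULL && dst != NULL && palette != NULL);
    for (size_t x = 0; x < width; x++)
        dst[x] = palette[src[x]];   /* one table lookup per pixel */
}
```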

However, even the simplest smooth scaling method, linear interpolation, needs a division and two multiplications per pixel. I am actually not sure how many CPU cycles multiplications and divisions take on the MIPS processor of DSTwo, but I assume they are more expensive than normal additions and subtractions. In any case, I used a calculator and some experimenting, and noticed that what I actually need is a way to smoothly draw 4 output pixels for every 5 input pixels (as 320/256 = 5/4). So, I have only four separate cases, and in fact the two rightmost output pixels are mirror images of the two leftmost pixels, so actually I only have two different cases. I was sure that I can come up with some shortcuts for these two situations, so I began to look into these more closely.

Linearly interpolating over 5 input pixels showed that the first output pixel should have a color of 80% of the first input pixel and 20% of the second input pixel. The second output pixel should have 60% of the second input pixel and 40% of the third input pixel. Mirroring these, the third output pixel should have 40% and 60%, and finally the last output pixel needs 20% and 80% weighting of the colors. To get rid of the divisions, I looked into multipliers that would let me divide the result by a power of two (so that I can use a shift instead of division). The first 80%/20% case is pretty close to 25/32 (78.125%) and 7/32 (21.875%), and the 60%/40% case is close to 5/8 (62.5%) and 3/8 (37.5%). I coded the first version of a linearly interpolating scaling code using simple shifts and additions to handle this pixel weighting, and I thought the result looked satisfactory, especially as the pixel weighting percentages are not quite correct. The code did not seem to cause any noticeable slowdown, even though it is still far from fully optimized. Here below are screen copies of the zoomed and scaled versions of the same Wing Commander II title screen.
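
The shift-and-add weighting can be illustrated in C like this. The sketch works on a single color channel per pixel to keep it short, whereas the real code handles 16-bit colors in MIPS assembly. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Roughly 80% / 20%: (25*a + 7*b) / 32, using shifts and adds only. */
static unsigned mix_25_7(unsigned a, unsigned b)
{
    return ((a << 4) + (a << 3) + a + (b << 3) - b) >> 5;
}

/* Roughly 60% / 40%: (5*a + 3*b) / 8. */
static unsigned mix_5_3(unsigned a, unsigned b)
{
    return ((a << 2) + a + (b << 1) + b) >> 3;
}

/* Scale one row of single-channel intensities from 5 pixels to 4. */
void scale_channel_row_5_to_4(const uint8_t *in, uint8_t *out, size_t in_width)
{
    assert(in_width % 5 == 0);
    for (size_t i = 0; i < in_width; i += 5, in += 5, out += 4) {
        out[0] = (uint8_t)mix_25_7(in[0], in[1]);
        out[1] = (uint8_t)mix_5_3(in[1], in[2]);
        out[2] = (uint8_t)mix_5_3(in[3], in[2]);   /* mirror of out[1] */
        out[3] = (uint8_t)mix_25_7(in[4], in[3]);  /* mirror of out[0] */
    }
}
```

Each pair of weights sums to exactly one (32/32 and 8/8), so a uniformly colored row keeps its color even though the individual percentages are only approximations.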

 

Next I plan to start looking into the audio support. The whole audio playing technique in the DSTwo SDK is still completely unclear to me, as I haven't looked into it at all yet. That is the biggest feature still completely unsupported, so it is time to start working on it.

Nov 14th, 2010 - Trekmo running in DS2x86!

During the last week I kept adding opcode after opcode, with seemingly no change in how far Trekmo progressed in DS2x86. Trekmo uses a lot of trigonometric calculations and other algorithms before it actually draws anything on the screen, so it was rather boring adding all of these, and not even being sure whether I left some bugs in them or not. Then this Friday, after I had added yet another opcode, Trekmo suddenly began drawing the actual space scenes! So, here are some pictures of Trekmo running in DS2x86, to make up for the lack of screen copies in the last couple of blog posts. :-)

   

The interesting (and pleasantly surprising) thing was that everything seemed to work properly! I had assumed that I must have coded at least some bugs in the arithmetic opcodes, so I expected to see broken polygons and possibly a complete crash after a few frames. There did not seem to be any problems, though. Trekmo progressed fine for several seconds, and then reached yet another unsupported opcode. I thought this was enough for Friday, though, so I left adding the remaining opcodes for later.

On Saturday morning I worked a while with an improved IRQ system (as the current system is far from optimized), but could not get it to work properly. I always have trouble with the IRQ system, it is by far the most difficult thing to get working in my emulator. In the end I had to roll back my changes, and decided to let the IRQ system be like it is for now, and work on getting Trekmo to run all the way thru to the end. Trekmo has a framerate calculator, so it can be used for benchmarking the combined CPU and graphics card system of a PC. In my Trekmo documentation I have a table of results from the systems I ran it on (back in 1994). The table has the following values:

Machine                  CPU         Graphics   Bus   Framerate
Compaq Prolinea 4/66     486DX2/66   ET4000     VLB   40.7 fps
No-name clone            486DX2/66   Cirrus     VLB   38.0 fps
Developed on this one:   486/33      ATI        ISA   20.5 fps
AST Bravo LC 4/25s       486SX/25    Cirrus     VLB   19.6 fps
Hyundai S386-C           386/20      S3         ISA   6.5 fps

Finally today I managed to get Trekmo to run all the way to the end, so I was able to get a benchmark reading. Trekmo runs in DS2x86 (with the DSTwo MIPS processor running at 360MHz) at an average speed of 11.9 fps, which makes it noticeably slower than a 486SX/25 machine, but nearly twice as fast as a 386/20 machine. So, my original estimate of "a bit faster than a 386/33" machine is still valid. And, if I get the IRQ handling improved, it might still get a bit faster. However, I still have no audio support, and that in turn will make the emulation run slower, as I can not offload the audio handling to another CPU as I did in the original DSx86.

So, now that Trekmo is running, I need to find a new program to test. It would be interesting to try to get Windows 95 running, but I believe that is a bit too big a project to tackle at this point. DS2x86 still lacks a lot of protected mode opcodes, so perhaps I should instead try to get some 386-specific games (Zone 66, Doom) running. I should also start working on audio emulation, but it just feels so boring. I have already coded an AdLib emulation once, so coding it again for a different assembler does not feel like an interesting task.

Oh, by the way, GBATemp recently published my interview. Go ahead and read it in case you are interested in some background information about me and DSx86. Much thanks to Another World for interviewing me, it was a fun experience! I actually had to stop and think about some of the questions, as I hadn't really thought about some of the things mentioned in the interview before that.

Nov 7th, 2010 - Division trouble

After my last blog post I continued working on the new macro system, and kept adding new 386-opcodes using the new macros. This progressed nicely, until I ran into a div ecx opcode. The problem with the x86 division opcodes is that they use double-precision dividends. For example the opcode "div cx" divides a 32-bit value in DX:AX registers with a 16-bit divisor in CX register. Similarly, "div ecx" divides a 64-bit value in EDX:EAX with a 32-bit divisor in ECX. However, the MIPS architecture only has 32-bit divide operations, so I was able to code a 1:1 mapping between "div cx" and the MIPS equivalent, but I could not do this easily with the "div ecx" opcode of a 386 architecture.

I spent some time googling for solutions to this problem. The first potential solution was to use the GCC 64-bit (long long) division code. I tested it by coding a simple 64-bit division into my C-language tester program, and then looking at the dump file to see what exactly happens there. The GCC compiler uses a helper function __divdi3 to handle the actual 64-bit division, and the assembler code dump of that looked to be very big and complex. In addition to the division result, the x86 div opcode calculates a remainder of the division. This has a similar __moddi3 helper function, which was just as complex. The register usage of my emulator differs quite a lot from the MIPS C-language standard, so if I called these functions I would need to save and restore many registers, which would slow the handling down even further. I decided to abandon this possibility and look for some more optimized versions of this 64bit/32bit division.

After some more googling I found an algorithm in a source code of some software by Jan Marthedal Rasmussen. The function is called double_div and it looked to be just what I was after. It calculates both the quotient and the remainder in one go, and it has special case handling for the simple situations, before reverting to the generic difficult scenario. I implemented this algorithm first in C language in my tester program, and compared the results it gives to the results of the GCC __divdi3 function. After I was certain the C language implementation works properly, I looked at the assembler dump and began converting that to a suitable format for my emulator. I could not directly use the assembler version that GCC had compiled, as it used many more registers than I can afford in my emulator code. I have four free registers that can be used in every opcode handler, and on rare occasions (luckily this "div" opcode is one of them) I can also use the four registers allocated for lazy flags handling. The C standard however leaves 16 registers free for every function to use, without having to save/restore them. The code that GCC created looked otherwise quite efficient, it had even managed to optimize away some repeated calculations of the original C code. Too bad I could not use it.
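
To make the problem concrete, here is a small C sketch of an unsigned 64-bit by 32-bit division that uses only 32-bit operations. Note that this is plain textbook shift-and-subtract long division, not the double_div algorithm mentioned above and not the DS2x86 handler, and it is slower than either since it always loops 32 times. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned hi:lo / divisor, like "div ecx" with EDX:EAX as the dividend.
   Returns 0 on success, -1 where a real x86 would raise a divide error. */
int div_64_by_32(uint32_t hi, uint32_t lo, uint32_t divisor,
                 uint32_t *quotient, uint32_t *remainder)
{
    if (divisor == 0 || hi >= divisor)
        return -1;                      /* quotient would not fit in 32 bits */
    uint32_t rem = hi, quo = 0;
    for (int bit = 31; bit >= 0; bit--) {
        uint32_t carry = rem >> 31;     /* bit about to be shifted out of rem */
        rem = (rem << 1) | ((lo >> bit) & 1u);
        quo <<= 1;
        if (carry || rem >= divisor) {  /* the 33-bit remainder is >= divisor */
            rem -= divisor;
            quo |= 1u;
        }
    }
    *quotient = quo;
    *remainder = rem;
    return 0;
}

/* Cross-check one case against the compiler's own 64-bit arithmetic. */
void check_div(uint32_t hi, uint32_t lo, uint32_t divisor)
{
    uint64_t dividend = ((uint64_t)hi << 32) | lo;
    uint32_t q = 0, r = 0;
    int rc = div_64_by_32(hi, lo, divisor, &q, &r);
    assert(rc == 0);
    assert(q == (uint32_t)(dividend / divisor));
    assert(r == (uint32_t)(dividend % divisor));
    (void)rc;
}
```

The check function mirrors the testing approach described above, where a C implementation is compared against the compiler's 64-bit division before anything is converted to assembly.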

After some careful register allocation I managed to create a division opcode handler that looks to be working, however I have not yet created a proper unit test for it. I am currently more interested in continuing with the opcode macro system, so I'll go back to making sure this division opcode works even in the more difficult situations later. That is also why I won't post my algorithm here yet, I'll do that after I have created the unit tests and made sure my implementation works correctly.

But, now I think I'll continue with the opcode macros. I don't have any new screen copies to show either, as the code I am currently at is the code that calculates the ship and polygon positions before drawing them, so nothing new has yet been drawn on the screen. Perhaps next week. :-)

Oct 31st, 2010 - First graphics from Trekmo

DS2x86 has now progressed so far that I have managed to show the first title screens from my Trekmo demo! That is good progress, especially as I fought with a weird problem for the whole Sunday evening and Monday of the past week. Observant readers of my blog might have noticed something strange in the screen copy of my previous blog post. I myself only noticed the weirdness several hours after I had posted the blog post, when I was just proofreading the blog post again. The amount of EMS memory that 4DOS uses for swapping displayed 8197K! Normally it shows 224K, and there was no reason why it would suddenly start taking that much EMS (especially since I still only provide 1.5MB of EMS in DS2x86)! I looked at my latest changes in the code but did not see anything that would cause such a change, so I ran my opcode tester program, but that did not find anything wrong in any of the opcodes. I usually try to solve such weird issues before going to bed, as otherwise I just keep thinking about the problem and can not easily fall asleep. This time it was already late Sunday evening so I had to leave the code like that.

On Monday after work I then began to really look into this problem. Luckily I had a working version in my backup directory, so I was able to run file compares between the source codes. The strange thing was that the real-mode opcode sources did not show any changes between the versions, while the problem was obviously in the real mode handling, as 4DOS stays in real mode all the time. After several hours of head scratching I finally figured out what the problem was. I had added a call to one of the real-mode opcode handlers from the protected-mode code, but had forgotten to make the required change into the real-mode handler that would allow it to be called from both real and protected mode. I had changed the main framework for this, though, which is why also the real-mode handling got broken. The file compare did not show this as the problem was that I had not changed the code where I should have, and the opcode tester did not find the problematic opcode because it was one of the simplest opcodes which I thought did not require a unit test! Argh! Well, several lessons learned again:

Anyways, after fixing that problem I managed to add the protected mode IRQ handling during the week. By Friday I had reached the location in Trekmo code that switches to graphics mode (Trekmo uses triple-buffered Mode-X 320x240 graphics mode). On Saturday morning I added the Mode-X blitting routine, which I ported directly from the DSx86 ARM ASM code into DS2x86 MIPS ASM, I just needed to add on-the-fly conversion from palettized 256-color graphics to 16-bit graphics (which is the only supported graphics mode in DSTwo SDK). Surprisingly, my new blitting code only had a single mistake. I had accidentally used sw (store word) where I meant sh (store halfword), which caused an unaligned crash, but after fixing that the blitting routine worked fine! I got the initial title pages (two simple PCX images) to display fine, and then I again kept running into various as of yet unsupported 32-bit opcodes and modrm variations.

Now I am working on the missing 32-bit opcodes. I finally figured out a way to use the GCC ASM macro syntax to embed register names into the branch target labels, so I am currently creating a new and improved macro system that would allow me to add several modrm bytes for several opcodes with just tiny additions to a single macro. This would speed up my opcode implementation quite a lot, and also remove one potential source of bugs caused by typos. I found out that if you add \() at the end of the parameter name, the parameter will work even within a word, so that a macro call like this :

#define  eax    s0

.macro modrm_3_help oper jump reg
\oper\()_\reg\()_SIB:
    b      \jump\()_\reg
.endm

modrm_3_help mov mov_r32_t0 eax

generates code like this (as the symbolic register name s0 means register number 16):

mov_$16_SIB:
    b      mov_r32_t0_$16

So, I think I'll get back to improving the macro system now instead of making this blog post longer. :-)

Oct 24th, 2010 - DS2x86 Protected Mode Interrupts

This past week was spent mostly just adding new 386-opcodes. Then by Friday evening I had reached a point in the Trekmo code where it tries to write a string to the screen. The easiest way to write to the screen in a DOS program is to call the DOS INT 21h AH=09h: WRITE STRING TO STANDARD OUTPUT routine. The DOS functions can not be called while in protected mode, so the PMODE header provides wrapper functions that first switch the CPU to real mode, then call the DOS function, and then switch back to protected mode. These functions are called via protected mode INT 33h (which curiously in real mode is the mouse interrupt), by giving the actual real mode interrupt number in the AL register. The register values that the real mode interrupt needs as parameters are stored into memory variables for the protected mode INT 33h handler. Here below is the debug output of the "putdosmsg" routine in Trekmo.

 
putdosmsg PROC
        push ax
        push edx
        add  edx,_code32a
        mov  al,dl
        and  ax,0fh
        shr  edx,4
        mov  v86r_ds,dx
        mov  v86r_dx,ax
        mov  v86r_ah,9
        mov  al,21h
        int  33h
        pop  edx
        pop  ax
        ret
putdosmsg ENDP
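
The address arithmetic in putdosmsg splits a linear address below one megabyte into a real-mode segment:offset pair, which is what the DOS call needs in DS:DX. A small Python sketch of the same calculation follows; the function names are hypothetical, and reading `_code32a` as the linear base address of the 32-bit code segment is an assumption based on how the routine uses it.

```python
def linear_to_seg_off(linear):
    """Split a linear address below 1 MB the way putdosmsg does."""
    offset = linear & 0x0F             # mov al,dl / and ax,0fh
    segment = (linear >> 4) & 0xFFFF   # shr edx,4 / mov v86r_ds,dx
    return segment, offset


def seg_off_to_linear(segment, offset):
    """Real-mode address translation: segment * 16 + offset."""
    return (segment << 4) + offset
```

Keeping the offset in the range 0..15 means the string can be almost 64 KB long without the offset wrapping around inside the real-mode segment.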

On Saturday morning I began working on the protected mode interrupt handling. The first step was to check the contents of the Interrupt Descriptor Table (IDT) that PMODE uses, to help me in understanding the behaviour of the code. Here below is the start of the IDT table. It shows for the first 19 interrupts the segment_selector:offset of the interrupt handler, the [Present, Descriptor Privilege Level, Gate Size] values of the descriptor, and finally in parenthesis the type of the gate. All of the IDT table entries in PMODE seem to be of the Interrupt Gate type, and they all point to a USE32 code segment.
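
For reference, each IDT entry on a 386 is an 8-byte gate descriptor, and the fields listed above (selector:offset, Present, Descriptor Privilege Level, gate size and gate type) can be pulled out of it as in the Python sketch below. This is only an illustration of the descriptor layout, not the DS2x86 debugger code, and the function and dictionary names are made up for the example.

```python
import struct

GATE_TYPES = {
    0x5: "Task Gate",
    0x6: "16-bit Interrupt Gate",
    0x7: "16-bit Trap Gate",
    0xE: "32-bit Interrupt Gate",
    0xF: "32-bit Trap Gate",
}


def decode_gate(raw):
    """Decode one 8-byte IDT gate descriptor into its fields."""
    # Layout: offset 15..0, selector, reserved byte, access byte, offset 31..16
    off_lo, selector, _reserved, access, off_hi = struct.unpack("<HHBBH", raw)
    return {
        "selector": selector,
        "offset": (off_hi << 16) | off_lo,
        "present": bool(access & 0x80),
        "dpl": (access >> 5) & 3,
        "gate_size": 32 if access & 0x08 else 16,
        "type": GATE_TYPES.get(access & 0x0F, "unknown"),
    }
```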

I followed the algorithm that DOSBox uses in its protected mode interrupt handling, but left out most of the tests for priority levels and different gate types. I need to refactor this code later, but for now I just want to get the code to work in this simplified situation in the PMODE header. After a while I managed to get the code to jump properly to the protected mode interrupt handler. The interrupt handler switched first down to 16-bit protected mode, and only then back to real mode. I did not have support in DS2x86 yet for either of these mode changes, so it took me a while to add that. By Saturday evening the code progressed all the way into the real mode DOS interrupt, and printed the string to the DS2x86 DOS text screen!

That was enough work for Saturday, so that was a good place to stop coding for a while. I was pretty happy to get that far with the protected mode interrupt handling (which I had feared to be a very complex issue) in just one day. Today I have been continuing with the code after the real mode interrupt handling, which is pretty much the reverse of the real mode interrupt call. After the real mode interrupt handler returns, the PMODE header first switches to 16-bit protected mode, then sets up the GDT and IDT tables again, switches to 32-bit protected mode, and then finally returns to my Trekmo code from the INT 33h handler. Just an hour ago I coded the "IRET" handling for the last step, and thus this is where I am currently at.

The next thing I need to do is to add protected mode IRQ handling. Trekmo waits for two seconds after displaying that string, before switching to graphics mode. This two second delay needs a working timer interrupt in protected mode, so I can not progress further in Trekmo until I have coded the IRQ handling. It should be relatively similar to the INT handling, though, so hopefully implementing it will not be very difficult.

I have no graphics mode support in DS2x86 yet, so I need to start working on that as soon as I get the IRQ handling done. I also soon need to start implementing the modrm bytes of the 386 opcodes properly. So far I have only implemented the few modrm bytes that I have encountered (just like I did when starting work on DSx86), so the code will not run anything besides Trekmo yet. I also need to enhance my tester program to test the 386-opcodes as well, to be sure that I have not coded a lot of bugs in them. But it is nice to see some progress all the time.

Oct 17th, 2010 - DS2x86 Protected Mode work

This past week has seen slow but steady progress with the protected mode support in DS2x86. After the previous blog post, I started thinking about ways to make the existing real-mode code more compatible with the needs of protected mode and 32-bit memory access. In the original DSx86 code I kept all the 16-bit registers shifted to the high 16 bits. The currently effective segment was kept in the high 16 bits of the r2 register, with the lowest byte of r2 telling whether a segment override is in effect. Most of the opcodes need to know this to calculate the correct memory address when using BP-based indexing: while memory access normally defaults to the Data Segment (DS), addressing memory with the BP register defaults to the Stack Segment (SS). The SS register value was kept in the high 16 bits of the r3 register, with the DS register value in the low 16 bits of the same register. Thus, in the main loop I could easily reset the segment override to be off and make r2 contain the default DS register value like this:

    ldrb    r1,[r12],#1			@ Load opcode byte to r1, increment r12 by 1
    mov     r2, r3, lsl #16		@ r2 high halfword = logical DS segment, clear segment override flags
    ldr     pc,[sp, r1, lsl #2]		@ Jump to the opcode handler
Can't get much more efficient than that, when trying to perform two logically different operations, making r2 contain the currently effective segment and clearing a segment override flag. The BP-based memory address handling in turn checked whether a segment override is in effect and if not, made the r2 register contain the current SS value with the following code:
.macro mem_handler_bp_destroy_SZflags
    tst     r2, #0xFF			@ Is a segment override in effect? Zero flag will be set if not
    biceq   r2, r3, #0x0000FF00		@ r2 = logical SS segment in high halfword, with garbage in low byte
.endm
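
The register packing used by the two ARM snippets above can be modelled in a few lines of Python. This is a simplified sketch with hypothetical function names, and it assumes that a segment override handler leaves a non-zero value in the low byte of r2 (the post does not show that part).

```python
MASK32 = 0xFFFFFFFF


def main_loop_reset(r3):
    """mov r2, r3, lsl #16: DS moves to the high halfword, low bits are zeroed."""
    return (r3 << 16) & MASK32


def mem_handler_bp(r2, r3):
    """tst r2, #0xFF / biceq r2, r3, #0xFF00: use SS unless an override is flagged."""
    if (r2 & 0xFF) == 0:
        # SS stays in the high halfword, the low byte holds leftover DS bits.
        return r3 & ~0x0000FF00 & MASK32
    return r2
```

The single shift does both jobs at once because shifting DS into the high halfword necessarily fills the low byte, where the override flag lives, with zeros.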
I had used a somewhat similar approach in DS2x86, as it was just copied and translated to MIPS assembly from the DSx86 method. To prepare for protected mode, I wanted to change this method so that the register that keeps the currently effective segment (#defined to be "eff_seg", in reality register ra) directly contains a linear memory address (which in real mode would be the segment value shifted left 4 bits). So, I could not use the same trick of storing the segment override flag in this register. I really did not want to make the code slower than it currently was, so I actually spent two days just thinking how I could change the segment override flag handling so that the main loop would not slow down (my first priority), I would not need to waste a new register for just this flag (second priority), and that the BP register memory access would also be as fast as possible (third priority).

After spending two days thinking about this problem, the solution finally occurred to me. In the end the main loop did not get any slower, I did not need to use a new register, and the BP addressing was just as fast as before! Here is the resulting code, with some explanation following.

    lw      t1, opcode_table(t1)		// Get the opcode handler address from the opcode table 
    move    eff_seg, eff_ds			// Set DS to be the effective segment
    ori     flags, FLAG_SEG_OVERRIDE		// Fix the CPU flags, telling we have no segment prefix
    jr      t1					// Jump to the opcode handler
After assembling, the generated code looks like this:
8006453c:	8d290000 	lw      t1,0(t1)
80064540:	01e0f821 	move    ra,t7
80064544:	01200008 	jr      t1
80064548:	37390002 	ori     t9,t9,0x2
The MIPS assembler does a lot of changes to the original ASM code behind the scenes, due to the peculiar features of the processor. For example, all jumps and branches have a "branch delay slot" following them, which is actually executed before the branch is taken. The assembler reorders the opcodes so that the jump is moved before the preceding opcode, if the preceding opcode (ori in my example) has no effect on the branch instruction itself (which it does not here). If the jump can not be moved higher, then a NOP operation is added into the branch delay slot, wasting one CPU cycle. Also, as loads from memory (the lw opcode) cause a pipeline stall if the register that is loaded is used in the next opcode, you also lose a CPU cycle if you don't have any useful operations (that do not use the loaded register) to put immediately after the load opcode. Thus, there is no way to make the main loop code faster than what it currently is, so my first priority was fulfilled. I need to have one operation after the branch address loading, and I need to have an operation in the branch delay slot.

I managed to fulfill my second priority by using the x86 CPU flags emulation register (#defined as "flags", being in reality register t9). The x86 flags register has a reserved bit 1 (with value 2) that should always be set. I set this bit in the main loop, and then reset the bit to zero in all segment override handlers. Since the code that would need to use the full flags register value (practically only the PUSHF opcode handler) will never have a segment override, this will cause no problems in any code that handles the flags register.

The macro to handle the BP-register based segment handling looks like the following. The .set commands allow me to use the Assembler Temporary (AT) register myself, while normally the assembler uses this for all sorts of behind-the-scenes tricks and macro expansions.

.macro mem_handler_bp
    .set noat
    andi    AT, flags, FLAG_SEG_OVERRIDE	// at == 0 if we have a segment override
    movn    eff_seg, eff_ss, AT			// If no segment override, put SS into effective segment
    .set at
.endm
This is just as efficient as the original DSx86 code, just two assembler opcodes. The andi opcode puts just the flags bit 1 into the AT register (so AT is zero if the flag is not on, meaning a segment override is in effect), and the movn opcode moves eff_ss register into eff_seg register if the AT register is not zero (no segment override in effect). This fulfilled my third priority.
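
The same logic expressed as a Python model may make the inverted meaning of the flag bit easier to follow: the bit being set means that no segment override is active. The class and method names below are hypothetical, and only the ES override is modelled.

```python
FLAG_SEG_OVERRIDE = 0x0002  # reserved bit 1 of the x86 FLAGS register


class Cpu:
    def __init__(self, ds_base, ss_base, es_base):
        self.eff_ds = ds_base
        self.eff_ss = ss_base
        self.eff_es = es_base
        self.eff_seg = ds_base
        self.flags = FLAG_SEG_OVERRIDE

    def begin_opcode(self):
        """Main loop: default to DS and mark that no override is active."""
        self.eff_seg = self.eff_ds
        self.flags |= FLAG_SEG_OVERRIDE

    def prefix_es(self):
        """Segment override handler: select ES and clear the marker bit."""
        self.eff_seg = self.eff_es
        self.flags &= ~FLAG_SEG_OVERRIDE

    def mem_handler_bp(self):
        """BP-based addressing: switch to SS only when no override is active."""
        at = self.flags & FLAG_SEG_OVERRIDE   # andi AT, flags, FLAG_SEG_OVERRIDE
        if at != 0:                           # movn eff_seg, eff_ss, AT
            self.eff_seg = self.eff_ss
```

Because the main loop sets the bit again before every opcode, any code that reads the full flags value outside a prefixed opcode always sees bit 1 set, as the real processor requires.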

In addition to this change I changed all my memory address routines to not use shifted memory offsets, which was a lot of work. There were 266 locations in the code where the shift was used, but only about 220 of these were related to this address calculation and needed changing. I first used a simple find-replace operation in the editor to comment all of these out, and then used my tester program to see which opcodes got broken, and then fixed these one by one. In the end the whole code got about 3% faster! Not a big change, but it was very nice that adding a new feature made the code faster, and not slower as normally happens!

After such extensive code refactoring I finally got back to debugging the PMODE header of Trekmo in DS2x86. The PMODE header first goes to 16-bit protected mode (it jumps to a USE16 segment using the jmp  0020:138E opcode as you saw in the debug output of the previous blog post). Then it sets up the Interrupt Descriptor Table (IDT) while in the USE16 segment, and then goes to 32-bit protected mode (USE32 segment) using a RETF opcode. It took me the rest of last week to add support for the operations PMODE does in the USE16 segment, so that finally today I got DS2x86 to run the RETF opcode properly and switch to the USE32 segment. This is where I am currently at. There is only a small amount of code remaining in the PMODE header until it jumps to my own Trekmo code (jmp  00014ED4 in the debug output, which is the jmp _main command in the following code snippet from the PMODE sources).

I also hacked my debugger memory dump routines so that by dumping address FFFF:30 I can get a formatted output of the Global Descriptor Table (GDT). The GDT that PMODE uses is shown below. You can see that for example selector 20 is a USE16 code segment, while selector 08 is the actual USE32 segment (where the RETF opcode returned to, and where I am currently at). In this case PMODE uses a GDT with a limit of 0x8F (so that all the items happen to fit nicely into the DS2x86 debug screen) and located at linear address 0x000042C4.
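
To show what such a formatted dump has to decode, here is a Python sketch of the 8-byte segment descriptor layout of the 386. It is an illustration only, not the DS2x86 debugger code, and the function names and the example descriptor bytes in the usage note are made up.

```python
def decode_descriptor(raw):
    """Decode one 8-byte GDT segment descriptor."""
    limit = raw[0] | (raw[1] << 8) | ((raw[6] & 0x0F) << 16)
    base = raw[2] | (raw[3] << 8) | (raw[4] << 16) | (raw[7] << 24)
    access = raw[5]
    flags = raw[6] >> 4
    if flags & 0x8:
        # Granularity bit: the limit is counted in 4 KB pages.
        limit = (limit << 12) | 0xFFF
    return {
        "base": base,
        "limit": limit,
        "present": bool(access & 0x80),
        "dpl": (access >> 5) & 3,
        "code": (access & 0x18) == 0x18,   # S bit and executable bit both set
        "use32": bool(flags & 0x4),        # D/B bit: USE32 when set, USE16 when clear
    }


def selector_index(selector):
    """The upper 13 bits of a selector are the index into the descriptor table."""
    return selector >> 3
```

With this layout, selector 20 is entry 4 and selector 08 is entry 1 of the table, and a GDT limit of 0x8F means (0x8F + 1) / 8 = 18 descriptors.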

 

p_start:                                ; common 32bit start
        mov eax,gs:[1bh*4]              ; neutralize crtl+break
        mov oint1bvect,eax
        db 65h,67h,0c7h,6               ; MOV DWORD PTR GS:[1bh*4],code16:nullint
        dw 1bh*4,nullint,code16         ;
        mov eax,gs:[32h*4]              ; set up for new real mode INT32
        mov oint32vect,eax
        db 65h,67h,0c7h,6               ; MOV DWORD PTR GS:[32h*4],code16:int32
        dw 32h*4,int32,code16           ;
        in al,21h                       ; save old PIC masks
        mov ah,al
        in al,0a1h
        mov oirqmask,ax
        jmp _main                       ; go to main code
The next big thing to do is to add proper protected mode interrupt handling using the IDT table, and I also need to improve my stack handling so that switching between 16-bit SP and 32-bit ESP stack pointer addressing works properly. Currently it is somewhat hardcoded to just work in the current situation in PMODE/Trekmo. Besides those features, I still have a lot of new opcodes to add, so these will again keep me busy for quite a while.

Oct 10th, 2010 - Preparing for Protected Mode

This week my work on DS2x86 has been much less frustrating. The DS2x86 framework seems to be mostly working. I still get an occasional black screen when starting DS2x86, but that happens only once in every 20 starts or so. I decided to work on the actual emulation core for now and worry about the hangs later. Early this week I tried to decide on whether to continue porting the existing DSx86 features (like graphics and audio features), or to start working on proper 386 opcodes and protected mode support. I decided to go for the 386 and protected mode direction, as that is much more interesting and lets me study and learn new things.

I decided to start by trying to make my old Trekmo demo from 1994 run. Pretty much like I started with DSx86 by attempting to run my LineWars II game on it, I am again using one of my old assembly language programs for testing. Having assembler source code available makes it much simpler to compare the debug output of DS2x86 with the original sources, to see where I am in the code and what is supposed to happen there. My Trekmo demo uses the PMODE DOS protected mode extender created by Thomas "Tran" Pytel. PMODE is small, simple and straightforward, but still supports various methods of entering protected mode (like DPMI, VCPI, and direct custom code). The good thing is that it also works without any memory managers present, so I can immediately concentrate on the low-level protected-mode support without having to worry about virtual memory and things like that. I still haven't coded proper support for the touchscreen keyboard in DS2x86, but that is not a problem as I can put the launch of Trekmo into the 4START.BTM file, so it starts running immediately when I launch DS2x86. It runs until it reaches a yet unsupported 386 opcode, and so I spent the last week iteratively adding new 386 opcodes and running DS2x86 again and again.

 

I also noticed that adding all the 386 opcodes means quite a lot of work. In addition to the new opcodes (that do not exist at all in an 80186 processor), all the existing opcodes have three new address or size variations. There are two new prefixes, 0x66 and 0x67, the first of which converts the size of the values handled from 16 to 32 bits, and the second converts memory addresses similarly from 16 to 32 bits. So, with all the combinations there are 4 different cases for almost every existing opcode, and only one of them (with neither prefix) currently exists. The worst case result is that the executable size will more than quadruple after I have added all the new opcodes! A huge amount of work, so I am looking into other ways to handle these. I do need to code some of them using the current method, to get a feel for how they should work and what I can do to optimize the amount of work, though.
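
The four cases come from the two prefixes toggling independently. A minimal Python sketch of that decoding step is shown below, with a hypothetical function name. It ignores all the other prefixes (segment overrides, REP, LOCK) that a real decoder also has to skip over, and it includes the fact that in a USE32 segment the same prefixes switch from 32 bits back to 16.

```python
OPERAND_SIZE_PREFIX = 0x66
ADDRESS_SIZE_PREFIX = 0x67


def decode_size_prefixes(code, default_bits=16):
    """Return (operand_bits, address_bits, prefix_count) for an instruction."""
    other_bits = 32 if default_bits == 16 else 16
    operand_bits = address_bits = default_bits
    pos = 0
    while pos < len(code) and code[pos] in (OPERAND_SIZE_PREFIX, ADDRESS_SIZE_PREFIX):
        if code[pos] == OPERAND_SIZE_PREFIX:
            operand_bits = other_bits
        else:
            address_bits = other_bits
        pos += 1
    return operand_bits, address_bits, pos
```

An emulator that dispatches on (operand size, address size, opcode) therefore needs up to four handlers per opcode, which is where the worst-case quadrupling of the code size comes from.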

Anyways, as you can see from the screen capture above, by this morning I have just reached the location where PMODE.ASM enters protected mode. Entering protected mode means doing four things:

  1. Enabling the A20 address line. This is done by setting bit 1 (with value 2) of I/O port at 0x92, called System Control Port A. This is where I was at yesterday, and I had to do some changes to the memory address calculation routines to be able to directly address memory above the first megabyte in DS2x86.
  2. Setting up the Global Descriptor Table. This is done using the lgdt ASM command.
  3. Setting bit 0 (with value 1) of the CR0 Control Register. This actually puts the processor in protected mode.
  4. Performing a far jump using a CS segment selector (looked up from the GDT we set up before) to clear the prefetch input queue of the processor.
The steps 2 to 4 are seen in the debug output above, and this is where I am currently at. I need to do some major changes to all code that uses segment registers (including all far jumps, calls and returns) to be able to distinguish between real-mode segments and protected-mode segment selectors. This will keep me busy for quite a while, but a good thing is that I can always look at how DOSBox handles these and then copy the features that seem to suit the DS2x86 architecture.
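
From the emulator's point of view the four steps boil down to a handful of state changes, after which the value loaded into CS has to be treated as a selector instead of a segment. The Python sketch below models only that; the class and method names are hypothetical and nothing here is taken from the DS2x86 or DOSBox sources.

```python
class ModeState:
    """Tracks only the state touched by the four steps listed above."""

    def __init__(self):
        self.a20_enabled = False
        self.gdt_base = 0
        self.gdt_limit = 0
        self.cr0 = 0

    def out_port(self, port, value):
        # Step 1: bit 1 (value 2) of System Control Port A enables A20.
        if port == 0x92:
            self.a20_enabled = bool(value & 0x02)

    def lgdt(self, limit, base):
        # Step 2: remember where the Global Descriptor Table lives.
        self.gdt_limit = limit
        self.gdt_base = base

    def write_cr0(self, value):
        # Step 3: bit 0 (value 1) of CR0 is the protection enable bit.
        self.cr0 = value

    @property
    def protected_mode(self):
        return bool(self.cr0 & 0x01)

    def far_jump(self, cs_value, offset):
        # Step 4: the meaning of the CS value depends on the current mode.
        if not self.protected_mode:
            return ("segment", (cs_value << 4) + offset)
        index = cs_value >> 3  # upper 13 bits select the descriptor
        if index * 8 + 7 > self.gdt_limit:
            raise ValueError("selector outside the GDT limit")
        return ("selector", index, offset)
```

Every far jump, call and return needs the same kind of mode check, which is why this change touches so much of the existing code.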

Oct 4th, 2010 - Proper DS2x86 speed test results

Just a quick update to the blog post of yesterday. After I had posted the message, I still tested the code further, and found out that the SDK timers do not run at the proper speed. This was again one cause for the weird hangs, and it also made all my speed test results invalid. The 60Hz timer actually ran at 264Hz, and the main PC timer which should run at 18.2Hz (and which is used by Norton Sysinfo for the speed tests) ran at almost 24Hz.

I just got the timers running at proper speeds, and ran the speed test again, using the fastest 396MHz MIPS CPU speed. Here is the result that I now believe to be correct.

The MIPS version of my x86 emulator runs about 3.5 times faster than the ARM version, although the MIPS CPU runs at 6 times the clock speed. The difference is due to the Lazy Flags handling, the lack of various ARM tricks in the MIPS architecture, and the IRQ handling not using self-modifying code. I hope to be able to add some MIPS-specific tricks to the code while I get more familiar with the MIPS assembly, but even the current speed should make it possible and worthwhile to add 386-opcode support.

Oct 3rd, 2010 - DS2x86 frustration

The last week, and especially this weekend, has been very frustrating when working on DS2x86. It behaves very erratically, mostly hanging randomly, and when I try to add some debug output to pinpoint the problem location, it suddenly runs fine! Then I remove the debug strings, and it hangs again, but in a different location! Extremely frustrating. There is probably something seriously wrong in my IRQ handling, or perhaps I have misunderstood something about the underlying hardware.

Everything seems to work fine as long as I stay in the 4DOS command prompt. My first goal is to get Norton Sysinfo running, but that has proven to be much more difficult than I had anticipated. Yesterday I got it to run up to the main screen (where it shows the overview of the system), but even that happened only about once in every 5 tries; the other four tries it hung before reaching that far.

This morning I then removed some of the debug printing I had used to try to pinpoint the location, and then it suddenly reached the CPU speed test part without hanging! So, here is the first test result of the DS2x86 speed.

After that I tried to test with the fastest CPU speed of 396MHz, but could not get into the CPU speed test page without the system hanging. I then reverted back to the default 120MHz speed, but then everything started running in slow motion, and the one time that I got up to the CPU speed test page the speed showed something less than 4 times original PC!

I made some more changes to the code but was not able to get back to the CPU speed test page. After a lot of tracing and debugging it finally began to look like the self-modifying code I use to handle IRQs might be the problem (or at least one reason for the hangs). I changed the code to use a normal variable to determine when to handle an IRQ, and finally got rid of the SysInfo hanging before the main screen. I made some more speed tests, the first one is with the CPU running at 396MHz and the second is with the 120MHz speed.

 

As you can see, there is not much sense in the results. The first result above with the 14.2 times original PC (when I had not specifically set the MIPS CPU speed), along with the results from 396MHz speed make sort of sense, but the result after setting the CPU speed to 120MHz is really weird. That would mean the DS2x86 would run at less than half the speed of the original DSx86. Anyways, I'll continue fighting with this and try to make some sense of the constant weird behaviour.

Sep 26th, 2010 - DS2x86 is starting up!

Okay, now I finally got some proper progress done with DS2x86! Yesterday I still fought with the screen update problem, and also had problems with the key input. Late yesterday I finally got the screen update to work without hanging, and also found the solution to the key input problem. I am using a couple of techniques that are unsupported by the current SDK, so I have had to come up with some hacks to work around the limitations of the SDK.

The first big issue was that the SDK only has one function to refresh the NDS screen, called ds2_flipScreen as it flips the two internal screen buffers. Using this function the current buffer gets sent to NDS and you can then start writing to the other buffer. I need to build the 16-bit 256x192 screen buffer from the emulated PC screen data, so I already have a sort of internal double-buffering. The screen contents that are being written to by the running software are separate from the buffer that gets sent to the NDS. I did not want to add another double-buffering layer and copy all the data for every single frame, but instead use only one buffer and (in text modes) only build the characters that have changed since the last frame.
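
One common way to find the characters that have changed since the last frame is to keep a shadow copy of the text screen and compare against it. The Python sketch below shows that idea only; whether DS2x86 keeps such a shadow copy is an assumption, and the function name is hypothetical.

```python
def changed_cells(current, previous):
    """Return the indexes of text cells that differ, updating `previous` in place."""
    dirty = []
    for index, cell in enumerate(current):
        if previous[index] != cell:
            previous[index] = cell
            dirty.append(index)
    return dirty
```

Each cell here stands for one 16-bit character and attribute pair of the emulated text screen. Only the cells in the returned list need their glyph pixels rebuilt in the single 16-bit output buffer.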

Looking into the SDK dump files I found out that ds2_flipScreen calls a function called update_buf, that simply sends the current buffer to the NDS side, and luckily that function was not static so I could call it directly from my screen update code. However, it gets a handle variable up_screen_handle as its parameter, and that is static to the SDK library, so I needed to add an ugly hack to find out the address of this variable from the original function, and then give this address as a parameter to the buffer update function. I added a call to this function to the timer interrupt handler, as the DSx86 architecture does not have a "main loop" that would refresh the screen contents once per frame. The underlying x86 code runs as fast as it can, and I just sample the screen contents using the timer interrupt several times a second.

I first tried to use this new method only for the upper screen and use the original SDK console functions for the lower screen, but that kept hanging the system constantly. I thought that perhaps the internal functions are not interrupt-safe, so that when the lower screen is in the process of updating the NDS side, and my timer interrupt happens in the middle of this and starts sending the upper screen data, this might be the cause for the hangs. So, I coded my own console functions for the lower debug screen and used the interrupt routine to update also the lower screen, and then finally I got the screen update system to work without hangs. This also had the advantage of increasing my debug screen from 32x24 characters to 42x24 characters. The bottom row commands got even harder to hit accurately with the stylus, though.

The next problem I had was that the SDK ds2_getrawInput function did not seem to report any key releases. Whenever a key was pressed, it stayed pressed until another key was pressed. This turned out to be caused by not calling the regist_escape_key SDK function to register a key to use for escaping to the console (or something, I am not quite clear about the use of the escape key). Why I need to register an escape key to get the keyboard input working properly is beyond me, and is probably a bug in the SDK, but in any case after calling that function I got the key input working for the debug screen. The default console in the SDK calls this function during InitConsole, so the key input works fine when using the default console.

After that it was just a matter of porting the DOS functions so that I could try actually starting up 4DOS. This morning I then finally found and fixed the most serious problems in my ported DOS functions, and got 4DOS to progress up to the command prompt! I haven't yet checked where the "Unknown command" error message comes from, so there is probably still something wrong in my routines, but I was quite pleasantly surprised that my actual x86 CPU core emulation (which was completely written from scratch for the MIPS assembler) is already robust enough to run the 4DOS kernel!

This is about as far as I can get at the moment, as there is still a lot of stuff missing before I can even give any commands to the 4DOS prompt. I haven't coded any IRQ handling stuff yet (which is needed for the x86 code to read the keyboard I/O ports), all I/O port handling is still missing (so even when I get the IRQ handling done the IRQ has nothing to read yet), and all the bottom screen keyboard graphics are still missing (as the DSx86 keyboard graphics are 16-color and the DS2x86 graphics need to be 16-bit coloured). These should all be reasonably straightforward, unless I run into some unforeseen obstacles when porting them from the current DSx86 code. But, in any case, I am pretty happy with the progress I have been able to achieve during the last week. Currently it looks like DS2x86 might actually be a reality one day. :-)

Sep 19th, 2010 - Slow DS2x86 progress

Last week was a very busy week at the office (first customer deployment of the software I have been working on), so I did not have much time (nor strength) to work on DS2x86 during the evenings. Yesterday I then worked on DS2x86 again, and in the end got it to compile and link. I had to comment out a lot of code, though, which I then need to write specifically for the DSTwo hardware environment.

I have run into various problems while trying to make the code build, so progress has been rather slow. Here is a list of various annoyances and other obstacles I have had to overcome.

Anyways, I have now managed to output the boot strings (using the 6x8 pixel text font) onto the screen, but the screen update system in DSTwo SDK is still somewhat unclear to me. It uses double-buffering, but I don't need (nor want) to use double-buffering in text modes, so I am currently looking into ways to bypass the double-buffering scheme and be able to only update one buffer. The SDK does not include sources for the more hardware-specific stuff, so I have to look into the dump file to see what the internal routines do. All in all, quite slow progress, but at least I am getting forward.

Sep 12th, 2010 - DS2x86 Opcodes Converted!

Just 5 minutes ago I got all the basic 80186 opcodes (including the REP-prefixed string opcodes) converted and tested! Hurrah! :-) Well, actually it is not much cause for celebration yet, as only now I can start coding the actual emulation framework around the CPU emulation. Many parts of the DSx86 framework are coded in C++, though, so porting those will be much faster than writing ASM code, which needs to be written pretty much from scratch for a new processor architecture.

Next, I think I will work on the interrupt architecture (as I need that to be able to input any keys and Norton Sysinfo needs a timer interrupt to be able to calculate the CPU speed). I am looking forward to getting Norton Sysinfo running, but before I can attempt that I need to add quite a bit of hardware-related stuff, including disk access support and the whole screen output stuff. Also, I need to start studying the DS2 SDK stuff more closely, as so far I have only used the console output and not much else. I still have a lot to learn about it as well. But, it is interesting to learn new things!

Oh, and I also finally added a PayPal "Donate" button onto the Main Page, so you can donate a small amount to show your support of my working on DSx86. As I state on the main page, don't feel bad about not donating anything, I am not doing DSx86 for money. But at least you can now stop emailing me about wanting to donate money but not being able to. :-)

Sep 5th, 2010 - DS2x86 Work Continues

During the past week I have continued porting the DSx86 ARM ASM CPU emulation code to DS2x86 MIPS ASM. This work has progressed well, and I am actually a little bit ahead of schedule. I have now worked on it for 3 weeks, and I have about 80% of the 80186 opcodes coded. I have skipped and left for later some of the more difficult opcodes, like opcodes 0xCC..0xCE (INT3, INT imm8, INTO). The reason for skipping these has been the fact that they interface with the hardware features, and I can not test them properly with the current plain opcode tester program. These I need to revisit later, but I plan to first code all the normal opcodes and then switch over to the proper DS2x86 implementation and start adding the hardware features. If all goes well I might get all the normal 80186 opcodes done by the next weekend.

The problem above is where I am currently at, and this was again a problem in my tester code. The opcode is one of the shift opcodes, rol byte ptr [BX+SI],CL to be exact. The problem was that the tester assumed the CPU registers to be in the high halfwords of the actual hardware registers (like they were in DSx86). In DS2x86 I keep the register values like they are in a 386 processor: AL and AX are the low byte and word values of the EAX register, etc. This is one of the changes that I need to make to the tester program for nearly every opcode it tests.

Aug 29th, 2010 - DS2x86 Progress, Lazy Flags implementation

DS2x86 Progress

Sorry, no new release of DSx86 today, as I have only been working on DS2x86 for the past two weeks. This porting work is progressing nicely, and over half of the opcodes have been ported over to MIPS ASM. I have to mention, though, that the opcodes so far have been the easy ones (except the BCD opcodes); the more difficult opcodes like the string operations, shifts, INT and IRET, and port I/O are still ahead. These will take more time, and some of them will need some interfacing to the underlying hardware, so I can not just simply port them over from the ARM ASM code.

I am currently at opcode 0x8C, which is the mov r/m16,Sreg opcode, that is, moving a value from the segment register to memory or register. The problem above was caused by my tester code not yet supporting the FS and GS segment registers, while the CPU emulation already does this. So, every now and then I need to fix my tester program instead of the emulation code. :-)

Lazy Flags

Practically at the same time I started porting the opcode handlers from ARM ASM to MIPS ASM, I started thinking of ways to handle the Lazy Flags with the least amount of slowdown possible. Yesterday I figured out a method that is a little bit faster than the way I had when I started, so I spent a couple of hours refactoring all the opcodes I had already coded to use the new method. Too bad this did not occur to me earlier, but it is to be expected that I need to recode some parts of the code several times as I am still only learning the tricks in MIPS ASM.

I again used the DOSBox sources, together with the nice description at a www.emulators.com blog post, to figure out how the lazy flags need to work. There are six flags that change after each arithmetic operation in the x86 architecture, some of which are simple and some more difficult to determine after the operation. The flags are:

The simple flags are Zero, Sign and Parity. Zero flag is set if the result was zero, Sign flag is set if the highest bit of the result was set, and Parity flag can be set by a 256-item lookup table based on the low byte of the result. These three flags behave the same way in all opcodes (that change flags), so they can be determined simply from the result of the last operation. The other three flags behave differently in different opcodes, so based on the calculation operations in the DOSBox sources I compiled a list of the different cases, to see how these need to be handled. DOSBox names the result and operands lf_resd, lf_var1d and lf_var2d (for doubleword operands), and I named them lf_res, lf_val1 and lf_val2 in my code.
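As a rough illustration of how the three simple flags can be derived from the result alone, here is a small C sketch. It is not the DS2x86 code (which is MIPS ASM); the function names and the width parameter are assumptions made for this example, and only the flag bit positions are the standard x86 FLAGS register ones.

```c
#include <stdint.h>

/* Standard x86 FLAGS bit positions for the three "simple" flags. */
#define FLAG_PF 0x0004u
#define FLAG_ZF 0x0040u
#define FLAG_SF 0x0080u

/* 256-entry table: 1 if the byte has an even number of set bits. */
static uint8_t parity_table[256];

void init_parity_table(void) {
    for (int i = 0; i < 256; i++) {
        int bits = 0;
        for (int b = 0; b < 8; b++)
            bits += (i >> b) & 1;
        parity_table[i] = (uint8_t)((bits & 1) ? 0 : 1);
    }
}

/* Compute Zero, Sign and Parity from the last result only.
   width_bits is 8, 16 or 32 (hypothetical parameter for this sketch). */
uint32_t szp_flags(uint32_t lf_res, unsigned width_bits) {
    uint32_t mask = (width_bits == 32) ? 0xFFFFFFFFu
                                       : ((1u << width_bits) - 1u);
    uint32_t res = lf_res & mask;
    uint32_t flags = 0;

    if (res == 0)
        flags |= FLAG_ZF;               /* result was zero */
    if (res >> (width_bits - 1))
        flags |= FLAG_SF;               /* highest bit of the result set */
    if (parity_table[res & 0xFF])
        flags |= FLAG_PF;               /* even parity in the low byte */
    return flags;
}
```

Note that Parity always looks at the low byte only, even for 16-bit and 32-bit results, which is why a single 256-entry table is enough.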

Carry

Adjust

Overflow

Based on these lists, it seemed to me that the Carry flag will be the most difficult and time-consuming to calculate. Besides the obvious conditional jump opcodes, there are many other opcodes (ADC, SBB, RCL, RCR, CMC) that need the current Carry flag value as their input. Also the shift opcodes change and use the Carry flag in various ways, so it seemed to me that using a switch statement -style code to calculate the Carry flag lazily whenever it is needed will really slow down those operations. So, I decided to see how much extra code I would need if I went for a direct Carry flag calculation in each of the opcodes. It turned out that most of the times it only takes one ASM operation to calculate the Carry flag after the operation, so this is how I currently handle the Carry flag.

I also noticed that if I calculate the Carry flag separately, I can fake the lf_val1 and lf_val2 values in opcodes like INC and DEC to give me the correct Adjust flag value when using the same calculation code as the normal ADD/SUB opcodes use. So I was able to simplify the Adjust flag calculation to the one case: ((lf_val1 ^ lf_val2) ^ lf_res) & 0x10. This just left the Overflow flag which needs separate cases for each opcode type. I use one of the MIPS general purpose registers to keep track of the last opcode type, along with registers for the last result and operands, so that the Overflow flag can be calculated lazily whenever needed. I hope to figure out some speedups for this as well, but for now it will have to do.
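The Adjust flag formula above, and the kind of per-operation Overflow calculation that remains, can be sketched in C as follows. This is only an illustration under my own assumptions: the function names are hypothetical, and the Overflow expressions are the usual textbook sign-bit tests for 8-bit ADD and SUB rather than anything taken from the DS2x86 sources.

```c
#include <stdint.h>

/* Adjust flag: carry out of bit 3, using the single formula from the text.
   Returns 0x10 (the AF bit position) or 0. */
uint32_t adjust_flag(uint32_t lf_val1, uint32_t lf_val2, uint32_t lf_res) {
    return ((lf_val1 ^ lf_val2) ^ lf_res) & 0x10;
}

/* Overflow for an 8-bit ADD: both operands have the same sign and the
   result has a different sign. */
int overflow_add8(uint32_t lf_val1, uint32_t lf_val2, uint32_t lf_res) {
    return (((lf_val1 ^ lf_res) & (lf_val2 ^ lf_res)) & 0x80) != 0;
}

/* Overflow for an 8-bit SUB: the operands have different signs and the
   result sign differs from the left operand. */
int overflow_sub8(uint32_t lf_val1, uint32_t lf_val2, uint32_t lf_res) {
    return (((lf_val1 ^ lf_val2) & (lf_val1 ^ lf_res)) & 0x80) != 0;
}
```

For INC and DEC the "faked" operands are simply the old value and 1, so INC of 0x0F is treated like 0x0F + 0x01 = 0x10 and the same adjust_flag() call reports a carry out of bit 3.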

To show an example of the actual opcode handling and what the Lazy Flag handling requires, here is the handler for ADC r/m8,r8 opcode when the left operand is a memory address. In DS2x86 I decided to have #defines for all the registers I use for emulation, so I don't need to remember which MIPS register was which. I did not do this in DSx86, and that caused some wrong register usage from time to time.

.macro adc_effseg_reg8l reg
	get_CF_into t3                      // t3 = Carry flag value
	li      lf_type, OF_CALC_ADD | 24   // Remember the operation type and shift value for Lazy Flags
	lbu     lf_val1, 0(eff_seg)         // Load the left operand from RAM
	andi    lf_val2, \reg, 0xFF         // Remember the right operand for Lazy Flags
	addu    t3, lf_val1                 // t3 = lf_val1 + Carry
	addu    lf_res, t3, lf_val2         // lf_res = lf_val1 + Carry + lf_val2
	srl     t0, lf_res, 8               // t0 = Carry value
	sb      lf_res, 0(eff_seg)          // Save the result to RAM
	andi    lf_res, 0xFF                // Remember only the low 8 bits for Lazy Flags
	j       set_carry_from_t0           // Back to loop
.endm	
The get_CF_into macro looks like the following. It is a macro so that I can later change how the Carry flag is calculated without having to change all the code that uses it (just in case I still need to revert back to lazy calculation of the Carry flag). The set_carry_from_t0 code is immediately before the opcode loop handler, as many opcodes jump there to store the t0 register value back into the flags register lowest bit. When calculating the Carry flag immediately, Carry is simply the 8th bit of the result, so I can just shift it to the lowest bit of t0 register and don't need to handle the complex ((unsigned)lf_res < (unsigned)lf_val1) || (lflags.oldcf && (lf_res == lf_val1)) algorithm at all!
.macro get_CF_into reg
	andi	\reg, flags, 1
.endm

As you can see from this code, even just remembering the result and operands for later calculation of Lazy Flags takes a lot of code, in this case 4 of the 10 ASM operations are there just to get the later flags calculation to give correct result. When coding for the ARM ASM I did not need any of these, as the ARM can keep track of the flags by itself. Thus, DS2x86 will not be as much faster than DSx86 as the difference in the CPU clock speeds would make you think.
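The claim that the immediate Carry is just bit 8 of the widened sum, and that it matches the more complex lazy expression quoted earlier, can be checked with a small C sketch. The struct and function names here are made up for the illustration; the lazy expression is the one quoted in the text, restricted to 8-bit operands.

```c
#include <stdint.h>

typedef struct {
    uint8_t result;   /* low 8 bits of the sum */
    int     carry;    /* Carry flag after the operation */
} adc8_result;

/* Immediate calculation: do the addition in a wider type and take bit 8. */
adc8_result adc8(uint8_t a, uint8_t b, int carry_in) {
    uint32_t wide = (uint32_t)a + (uint32_t)b + (uint32_t)(carry_in & 1);
    adc8_result r;
    r.result = (uint8_t)wide;
    r.carry  = (int)((wide >> 8) & 1);
    return r;
}

/* Lazy calculation for ADC, as quoted in the text. */
int adc8_carry_lazy(uint8_t lf_val1, uint8_t lf_res, int oldcf) {
    return (lf_res < lf_val1) || (oldcf && (lf_res == lf_val1));
}

/* Exhaustively compare both forms over all 8-bit inputs.
   Returns 1 if they always agree, 0 otherwise. */
int adc8_carry_forms_agree(void) {
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            for (int c = 0; c < 2; c++) {
                adc8_result r = adc8((uint8_t)a, (uint8_t)b, c);
                if (r.carry != adc8_carry_lazy((uint8_t)a, r.result, c))
                    return 0;
            }
    return 1;
}
```

The exhaustive loop covers all 131072 input combinations, so it is a cheap way to convince yourself the single shift is equivalent to the two-part comparison.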

Aug 22nd, 2010 - DS2x86 Progress

I got my free DSTwo flash cart last Monday, so I have been working on DS2x86 ever since. It took me all of Monday evening to get the DS2x86 framework (tester routine calling the MIPS ASM CPU emulation code, and it returning properly back to the tester routine) to run without crashing, so on Tuesday evening I was able to start working on the actual CPU opcode emulation with MIPS ASM language. I am actually using a strict TDD (Test Driven Development) coding technique when working on DS2x86. With DSx86 I usually coded something, then tested it with some test games, and only if that failed I coded more thorough tests. With DS2x86 I implement the test routines (or improve the old tests I used in DSx86) first, and only then start coding the actual opcode handlers. I do this because the MIPS ASM language is very unfamiliar to me, and also because I am now using the Lazy Flags approach, so I can no longer use the ARM CPU to calculate the correct x86 CPU flags for me.

By Saturday morning I had implemented the first four opcodes 0x00..0x03 (the various ADD opcode versions). Each of these actually has 256 different modrm bytes for all the different memory address modes, and as there are 6 different segment override possibilities, plus the case with no segment override, each of these opcodes actually has 7*256 different cases. My test routine runs each of these cases with random input values, and tests for correct results and correct emulated CPU flags state after the run. So in fact I had over 7000 different cases, including their unit tests, coded and tested by Saturday morning. Pretty good progress I think, considering I was only able to work on it for a few hours every evening.
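The case count above is simple arithmetic, spelled out here as a tiny C sketch (the constant and function names are just for this example):

```c
/* Numbers taken from the text: 256 modrm byte values, and 6 segment
   overrides plus the "no override" case. */
enum { MODRM_VALUES = 256, SEGMENT_OVERRIDES = 6 };

/* Cases per opcode: (6 overrides + no override) * 256 modrm bytes. */
unsigned cases_per_opcode(void) {
    return (SEGMENT_OVERRIDES + 1) * MODRM_VALUES;
}

/* Total cases for a number of implemented opcodes. */
unsigned total_cases(unsigned opcode_count) {
    return opcode_count * cases_per_opcode();
}
```

With four opcodes this gives 4 * 1792 = 7168 cases, which is the "over 7000" figure mentioned above.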

 

The images above are screen captures from the default SDK console, using screen capture code by BassAceGold, another SDK tester (thanks for the code!). The first one is from Saturday morning, with the first four opcodes working and opcode 0x04 stopping with an invalid result (0x59+0x1C=0x1C). The second one is the current (Sunday midday) situation. I have skipped opcode 0x0F, as it contains a lot of different 386 opcodes, and also the 386-versions of the already coded opcodes are mostly missing. The Lazy Flags handling does not yet calculate correct Overflow flag for SUB/SBB opcodes, so that too still needs work. I have the 386-specific FS and GS segment registers already supported, and also the immediate 32-bit versions like ADD EAX,0x1234567 are already coded and tested. But since my first priority is to get Norton Sysinfo running, I can leave supporting most of the actual 386-opcodes for later.

I think the font used by the SDK is not the best possible, especially the small letter g is pretty unclear, but this is certainly good enough for debug printing. The MIPS CPU is set to run at 120MHz by default (I believe that is the lowest speed it can run), which is fine while coding the tester program. There is an API call in the SDK to change the CPU speed, up to 396MHz I believe, so I am thinking of adding a configuration option to DS2x86 where the user can choose the CPU clock speed. Running at higher speeds will probably drain the battery pretty fast, so always running at the highest speed is not a good idea.

Anyways, I am already at opcode 0x27 DAA. This is one of the weird BCD (Binary Coded Decimal) opcodes that use the Adjust Flag (which is not properly supported in DSx86). Now with DS2x86 I can support the Adjust Flag properly, so I can finally code the DAA opcode so that it will always give correct results. Even though this opcode is very rarely used, it will be good to have it working correctly. :-)
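For readers unfamiliar with DAA, here is a minimal C sketch of what the opcode does and why it needs the Adjust flag. It follows the DAA pseudocode in Intel's instruction set reference as I understand it; the struct and function names are made up for this example, and it is not the DS2x86 handler.

```c
#include <stdint.h>

typedef struct {
    uint8_t al;   /* AL register */
    int     cf;   /* Carry flag */
    int     af;   /* Adjust (auxiliary carry) flag */
} daa_state;

/* Decimal Adjust AL after Addition: fixes up AL so that it holds a valid
   packed BCD result after adding two packed BCD bytes. */
daa_state daa(daa_state in) {
    daa_state out = in;
    uint8_t old_al = in.al;
    int old_cf = in.cf;

    /* Low digit: adjust if it is above 9 or if there was a carry out of
       bit 3 in the preceding addition (the Adjust flag). */
    if ((old_al & 0x0F) > 9 || in.af) {
        out.al = (uint8_t)(out.al + 0x06);
        out.af = 1;
    } else {
        out.af = 0;
    }

    /* High digit: adjust if the original value was above 0x99 or the
       preceding addition produced a carry. */
    if (old_al > 0x99 || old_cf) {
        out.al = (uint8_t)(out.al + 0x60);
        out.cf = 1;
    } else {
        out.cf = 0;
    }
    return out;
}
```

For example, BCD 15 + 27 is computed as 0x15 + 0x27 = 0x3C, and DAA turns that into 0x42. The case 0x08 + 0x09 = 0x11 shows why the Adjust flag matters: the low digit looks valid, and only the Adjust flag tells DAA that 6 must still be added to get 0x17.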

Aug 15th, 2010 - Version 0.23 released!

Version 0.23 info

This version has only minor fixes, as I have been busy with other things (work-related stuff and the SuperCard DSTwo version of DSx86). The changes in this version are the following:

I also spent several hours tracing the null pointer jump problem in the Superhero Legend of Hoboken game, but could not fix it yet. I found out that the problem is caused by a routine overwriting data in the stack, so that the routine then returns to address 0000:0000. This same routine is used without problems hundreds of times before it fails, so tracing the actual reason for the failure is pretty difficult and time-consuming.

  

Not a lot of changes, and as I am moving my focus from DSx86 to the new SuperCard DSTwo version, at least for the time being, it is possible that DSx86 itself will progress very slowly for a while. I will possibly increase the two-week release cycle, especially if I have not had time to do any worthwhile improvements.

DS2x86 progress

I have not yet received the free DSTwo card that SuperCard has sent me, so I have not been able to fully start coding for it yet. I have been learning MIPS assembly language and have started converting some of the ASM macros I have used in DSx86 to MIPS ASM for DS2x86, though. I am using mostly the same ideas that I have used in DSx86, but will include 386/486 opcodes from the start, and will switch from using the CPU flags directly to a Lazy Flags-type approach. MIPS has so many general purpose registers that I believe I can fit the lazy flags into registers, which should make the code run reasonably fast. Not as fast as the ARM version, obviously, but on the other hand the MIPS processor has quite a bit higher clock speed.

I hope the DSTwo card arrives next week!

Aug 8th, 2010 - DSx86 bug fixes, SCDS2 SDK betatesting

DSx86 fixes

Last week I did not have any time to work on DSx86, as I had various work-related things to do and also other "real life" issues that I had postponed during my summer vacation. Yesterday I finally began working on DSx86 again, and found the bug that caused the memory error in Maupiti Island. I had coded a shortcut for the memory allocation routines when launching an EXE program, and my shortcut did not handle the EXE header MaxAlloc field properly: it always allocated the maximum amount of memory. Normally all programs want the maximum amount of memory, but the TSR program in Maupiti Island was the first program that had the required amount of memory in the EXE header and did not adjust the memory allocation itself. After I fixed that routine the game loaded fine and seems to work.
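A simplified sketch of what honouring the MaxAlloc field means is shown below. This is my own reading of the usual DOS EXE loading rules, not the actual DSx86 routine: the function name and parameters are hypothetical, all sizes are in 16-byte paragraphs, and details such as the PSP and environment block are assumed to be included in image_paras.

```c
#include <stdint.h>

/* Decide how many paragraphs to give a program being loaded.
   image_paras        = size of the load image (assumed to include the PSP)
   min_alloc          = MinAlloc field from the EXE header
   max_alloc          = MaxAlloc field from the EXE header
   largest_free_paras = largest free memory block available
   Returns 0 if the program can not be loaded at all. */
uint32_t exe_alloc_paragraphs(uint32_t image_paras,
                              uint16_t min_alloc,
                              uint16_t max_alloc,
                              uint32_t largest_free_paras) {
    uint32_t needed = image_paras + min_alloc;   /* absolute minimum */
    uint32_t wanted = image_paras + max_alloc;   /* what the header asks for */

    if (wanted < needed)
        wanted = needed;                 /* MaxAlloc below MinAlloc */
    if (largest_free_paras < needed)
        return 0;                        /* not enough memory */

    /* Give the program what it asks for, but never more than is free. */
    return (wanted < largest_free_paras) ? wanted : largest_free_paras;
}
```

A typical program has MaxAlloc set to 0xFFFF and so ends up with the whole free block, which is why always allocating the maximum works most of the time. A program with a small MaxAlloc, like the TSR described above, gets only what it asked for and leaves the rest of the memory free for the program loaded after it.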

I was also requested via email to look into a screen resolution problem in a game called Mahjong Fantasia. I noticed that it went to 640x200 screen mode, but still wanted to display 400 lines. There is no screen mode preset for 640x400 resolution, so the game accessed the CRTC registers directly. I enhanced my VGA emulation so that the vertical resolution is no longer tied to the current screen mode, but instead it is determined by the CRTC register values (like in DOSBox). This might cause problems in some games, but most likely it will work better than the mode-based resolution detection. At least Mahjong Fantasia began to display a correct resolution, and also Gods (another game that accesses CRTC registers directly) still works.
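To illustrate the idea of deriving the vertical resolution from the CRTC registers instead of the screen mode, here is a hedged C sketch. It uses the standard VGA CRTC register indices as I know them (Vertical Display End at index 0x12, its two high bits in the Overflow register at index 0x07, and scan line repeat and doubling in the Maximum Scan Line register at index 0x09). The function names are made up, the formula is only meant for graphics modes, and this is not the DSx86 or DOSBox code.

```c
#include <stdint.h>

/* Number of scan lines displayed, from the 10-bit Vertical Display End
   value: low 8 bits in CRTC[0x12], bit 8 in CRTC[0x07] bit 1, and
   bit 9 in CRTC[0x07] bit 6. */
unsigned crtc_scanlines(const uint8_t *crtc) {
    unsigned vde = crtc[0x12];
    vde |= (unsigned)((crtc[0x07] >> 1) & 1) << 8;
    vde |= (unsigned)((crtc[0x07] >> 6) & 1) << 9;
    return vde + 1;
}

/* Visible pixel rows in a graphics mode: each row is repeated
   (CRTC[0x09] bits 0..4) + 1 times, and twice that if the double
   scanning bit (bit 7) is set. */
unsigned crtc_visible_rows(const uint8_t *crtc) {
    unsigned repeat = (unsigned)(crtc[0x09] & 0x1F) + 1;
    if (crtc[0x09] & 0x80)
        repeat *= 2;
    return crtc_scanlines(crtc) / repeat;
}
```

With 400 scan lines programmed and double scanning enabled this gives 200 rows. A game that clears the double scanning bit without changing the screen mode then gets 400 rows, which is the kind of situation described above.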

I have also added some simple missing EGA and Mode-X opcodes based on the debug logs I have received, again thanks for those! Other things on my TODO list I have not yet looked into, but at least the above mentioned improvements will be in the upcoming 0.23 version.

SuperCard DSTwo SDK betatesting

Several DSx86 users sent me email letting me know that SuperCard are giving a beta version of their DSTwo SDK to selected homebrew coders. So, I decided to contact SuperCard and request the beta version of the SDK. They accepted me to their beta test program (thanks SuperCard!) and sent me the SDK. I don't have the actual DSTwo card itself (which they also will send me) yet, so I can not properly test the SDK until I receive the card as well. However, I can start porting DSx86 to it immediately, and this is what I began doing yesterday.

The first problem was that the SDK (or rather the mipsel-linux toolchain that it uses) is meant to be run on Linux, and I have only Windows machines. This turned out to not be a major problem, though, as running VirtualBox on my Windows XP machine and installing Ubuntu Linux on it allowed me to install the SDK and compile the libraries and examples fine. I am familiar with the Linux command line tools, but I have actually never used the Linux graphical UI, which however seems to be reasonably similar to Windows so I am not totally lost with it. :-)

There do not seem to be any major problems compiling the C and C++ source codes, but obviously I need to write the ASM parts pretty much from scratch for the MIPS architecture. It will be interesting to study another new hardware architecture, and the MIPS architecture seems to be very different from ARM, even though they both are RISC processors. What little I have found out so far about the MIPS architecture compared to ARM is this:

My working on the MIPS port of DSx86 will of course take time away from improving the current DSx86, but I think DSx86 is already working reasonably well as it is. Many games still need fixing and there are many general improvements that could be made, but I can continue working on these even while I concentrate on the DSTwo SDK testing. SuperCard probably expects me to actually work with their SDK, or they would not have sent it to me. I think I owe it to them to test it properly and report the possible problems and enhancement ideas I find when trying to port DSx86 to run on DSTwo.

With the DSTwo version of DSx86 (currently named "DS2x86" :-) I plan to support 386 instructions, and probably also 486 instructions. I will first port my tester program (which simply tests each CPU opcode for correct results), as it is much simpler than the full DSx86 and with that I can concentrate on the CPU emulation and make sure I get it to work correctly. After that I will probably try to get Norton Sysinfo running in it, just to see what the emulation speed will be like.

It took me about half a year from when I started working on DSx86 to when the first alpha version was released, so it might take about the same amount of time with the DSTwo version. On the other hand I have learned quite a lot about emulation in general, and the C and C++ codes do not need major changes, so it might be that I get something working much sooner.

Aug 1st, 2010 - Version 0.22 released!

This version has the refactored internals, so it most likely runs some (if not all) games slower than the previous versions. It does however now support practically all real-mode 286 CPU opcodes (not including JPE and JPO which require game-specific hacks), and also unsupported graphics opcodes should now be quite rare. The graphics opcodes are now reported as Unsupported EGA opcode or Unsupported Mode-X opcode, and unlike in previous versions, you can continue after such an opcode using the B button. However, it is likely that you will get the same error again and again, so please send me the log file if you encounter unsupported graphics opcodes. If you get a plain Unsupported opcode error, it most likely means that the program is executing data instead of code, so something has gone wrong in the code before this happened, and thus it is not possible to continue running the program. Again, I am interested in the log files produced in these situations.

Besides the refactored internals, this version has various other fixes, based on many games and other programs I have been testing. Here is a list of the programs I tested, and the changes made into DSx86 or other information about why the program fails to run properly.

This was the last week of my summer vacation, so after today it is back to the normal slow progress with DSx86. I won't have much time to work on DSx86 during weekdays, so I can not get all that much done during each two-week period. I am glad I got the internal refactoring done during my summer vacation, though, as that was quite an extensive change. I had to change pretty much every single opcode that I have been spending the last year coding.

July 25th, 2010 - Refactoring and profiling

Internal refactoring done

This Wednesday I finally got all the internal refactoring done, and was able to do the remaining big architectural changes (that needed all the code to be refactored before I could do them). It took me a while to get the code working again, but by Wednesday evening the code finally seemed to run properly. I checked the CPU speed with Norton SysInfo to see how much the code has slowed down, and the result was pretty worrying.

I decided to see if I could include my old profiling tool (which I had last used in August last year, when the emulation core still was bundled with Wing Commander II files and could not even run 4DOS yet), and make it run with the current DSx86. I had long since stripped the code out from DSx86, but luckily I found it in an old source code backup directory. I added it back, and decided to start by profiling Norton Sysinfo.

Profiling Norton Sysinfo

Initial profiling results

Here is the first profiling result, while running the SysInfo CPU speed test. This was taken pretty much right after I finally got the new code to run properly, without any optimizing done yet. The first table shows the opcodes taking the least average number of ticks (ordered by that value), and the second table shows the opcodes taking the most total number of ticks (again ordered by that value):

| opcode | byte | count | min ticks | avg ticks | total ticks | % of total | command | in ITCM? |
|---|---|---|---|---|---|---|---|---|
| NOP | 90 | 1742 | 8 | 8.00 | 13936 | 0.0082% | No operation | Yes |
| CWD | 99 | 78 | 10 | 10.00 | 780 | 0.0005% | Convert word to doubleword | Yes |
| JA | 77 | 4394 | 10 | 10.01 | 43992 | 0.0260% | Jump if unsigned above | Yes |
| MOV DL,imm8 | B2 | 52 | 11 | 11.00 | 572 | 0.0003% | Move imm8 byte to DL register | Yes |
| JL | 7C | 518056 | 11 | 11.01 | 5704348 | 3.3762% | Jump if signed less | Yes |

| opcode | byte | count | min ticks | avg ticks | total ticks | % of total | command | in ITCM? |
|---|---|---|---|---|---|---|---|---|
| JL | 7C | 518056 | 11 | 11.01 | 5704348 | 3.3762% | Jump if signed less | Yes |
| ADD r16, r/m16 | 03 | 519175 | 14 | 28.11 | 14591630 | 8.6363% | Add 16-bit register or memory to 16-bit register | No |
| ADD/SUB/AND/OR/CMP r/m16,imm16 | 81 | 518383 | 32 | 32.14 | 16660587 | 9.8608% | Various 16-bit arithmetic/logical operations with imm16 value | No |
| TEST/NOT/NEG/MUL/DIV r/m16 | F7 | 266630 | 16 | 67.06 | 17880567 | 10.5829% | Various 16-bit memory operations | No |
| MOV r/m16,r16 | 89 | 525262 | 33 | 37.18 | 19526937 | 11.5573% | Store a 16-bit register into register or memory | No |
| MOV r16,r/m16 | 8B | 806241 | 14 | 24.63 | 19859369 | 11.7541% | Load a 16-bit register from register or memory | Yes |
| POP r/m16 | 8F | 518064 | 39 | 39.16 | 20285222 | 12.0061% | Pop a value from stack to register or memory | No |
| INC/DEC/CALL/JMP/PUSH r/m16 | FF | 1048576 | 34 | 39.20 | 41102557 | 24.3272% | Various 16-bit memory operations | No |

Not surprisingly, NOP (no operation) is the fastest opcode. The ticks run at 33MHz, so 8 ticks means that handling a NOP opcode takes 16 CPU cycles (as the NDS CPU runs at 66MHz). This includes some profiling overhead, so one or two ticks can in effect be subtracted from the ticks of all the opcodes to get the actual number of timer ticks that executing the opcode takes. The JL opcode is both one of the fastest opcodes and also one of the most frequently executed opcodes. It is interesting that the two most common opcodes are 0xFF and 0x8F, both of which should be rather uncommon in normal programs, especially in games. As opcodes 0x81, 0xF7 and 0xFF can perform several different operations, depending on the so-called "modrm" byte following the main opcode byte, I wanted to see what operations exactly SysInfo does with those opcodes:

| opcode 81 | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| CMP [disp16],imm16 | 3E | 577 | 32 | 95.10 | 54873 | 0.1678% | Compare global variable with 16-bit value | No |
| CMP [BP+disp8],imm16 | 7E | 1047999 | 32 | 32.10 | 33640850 | 99.8322% | Compare local variable with 16-bit value | No |

| opcode F7 | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| DIV [BP+disp8] | 76 | 204 | 78 | 148.66 | 30326 | 0.0431% | Divide DX:AX by local variable | No |
| IMUL [BP+disp8] | 6E | 714 | 29 | 77.13 | 55073 | 0.0783% | DX:AX = AX * local variable | No |
| DIV CX | F1 | 2244 | 67 | 83.05 | 186355 | 0.2651% | Divide DX:AX by CX | No |
| DIV BX | F3 | 1043068 | 67 | 67.01 | 69896106 | 99.4304% | Divide DX:AX by BX | No |

| opcode FF | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| PUSH [disp16] | 36 | 225 | 35 | 71.23 | 16027 | 0.0390% | Give a global variable as a parameter to a C function | No |
| INC WORD [disp16] | 06 | 8159 | 39 | 41.91 | 341920 | 0.8326% | Increment a global variable | No |
| PUSH [BP+disp8] | 76 | 520954 | 38 | 38.10 | 19847174 | 48.3317% | Give a local variable as a parameter to a C function | No |
| INC WORD [BP+disp8] | 46 | 518732 | 40 | 40.16 | 20831544 | 50.7288% | Increment a local variable of a C function | No |

Opcode 81 only used those two variations, with only the CMP [BP+disp8],imm16 operation actually relevant. Opcode F7 used several modrm variations, but again only DIV BX is called repeatedly in the CPU speed test loop. I believe this opcode is used to determine the CPU MHz number, as the DIV opcode is supposed to take exactly 22 CPU cycles on a 80286 processor. As the division seems to take 67 ticks (at 33MHz) in DSx86, that will nicely convert to 11MHz 80286 clock speed, just like Norton SysInfo reports. Opcodes FF, 81 and 8F are then probably the actual opcodes used to calculate the CPU speed, and all of these use the BP-register-indexed stack access.
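The tick arithmetic used in this section can be written out as a small C sketch. The function names are made up for this example, and the 33MHz, 66MHz and 22-cycle figures are the ones quoted in the text.

```c
/* Convert profiling timer ticks to host CPU cycles. */
double ticks_to_cpu_cycles(double ticks, double cpu_mhz, double timer_mhz) {
    return ticks * cpu_mhz / timer_mhz;
}

/* Estimate the emulated CPU clock in MHz: if an opcode that takes
   x86_cycles on the real CPU is emulated in `ticks` timer ticks, the
   emulated CPU appears to run at x86_cycles / (ticks / timer_mhz) MHz. */
double emulated_mhz(double x86_cycles, double ticks, double timer_mhz) {
    return x86_cycles * timer_mhz / ticks;
}
```

With these, 8 ticks for NOP is 8 * 66 / 33 = 16 host CPU cycles, and a 22-cycle DIV emulated in 67 ticks gives 22 * 33 / 67, or about 10.8MHz, which rounds to the 11MHz figure above.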

It is also interesting that the very rarely called operations take on the average two times the minimum timer ticks, while the common operations take around the minimum number of ticks all the time. This is probably due to NDS cache misses while the less frequent operations are performed.

Optimizing opcode 03 (ADD r16,r/m16)

Next I looked into the operation that I thought is most suitable for testing the possible optimization tricks, opcode 03 (ADD reg16, r/m16). This is a good opcode for testing, as it does not need the CPU flags to be saved (the addition will change all of them anyways), and pretty much all the arithmetic and logical opcodes are very similar. So if I can figure out ways to optimize it, I can use the same tricks for a lot of other opcodes as well. The refactored code for opcode 0346 (ADD AX,[BP+disp8]) looked like this (the actual code is full of parameterized macros, so this is what the code would look like with all the macros expanded), and it takes 28.11 ticks on the average to run:

add_ax_bpdisp8:	
    @-------
    @ macro r0high_from_idx_disp8
    @-------
    ldrsb   r0,[r12],#1             @ Load sign-extended byte to r0, increment r12 by 1
    add     r0, r9, r0, lsl #16     @ r0 = (idx register + signed offset) << 16
    b       add_r16_r0high_bp_r4    @ Jump to handler for AX (r4) register with BP (r9) based indexing
    ...
    @-------
    @ macro add_reg16_r0high
    @ On input:
    @   r0 = offset within the segment in high halfword
    @   r1 = free
    @   r2 = current effective segment in high halfword, segment override flag in lowest byte
    @   r3 = current SS segment in high halfword, current DS segment in low halfword
    @   r4..r11 = AX..DI registers in high halfwords
    @   r12 = current physical CS:IP
    @   lr = current physical SS:0000
    @-------
add_r16_r0high_bp_r4:               @ This is jumped to when the offset is based on BP register
    @-------
    @ macro mem_handler_bp_destroy_SZflags
    @ Indexing by BP register, so use SS unless a segment override is in effect.
    @-------
    tst     r2, #0xFF               @ Is a segment override in effect? Zero flag will be set if not
    moveq   r2, r3, lsr #16         @ r3 high halfword contains the SS segment, so put it into r2 ...
    lsleq   r2, #16                 @ ... and shift it to the high halfword.
    @-------
    @ macro mem_handler_jump_r0high
    @ Calculate the physical RAM address, and jump to correct handler
    @ depending on the type of the memory addressed.
    @ On input:
    @   r0 = offset within the segment in high halfword
    @   r2 = current effective segment in high halfword
    @   NOTE! Nothing may have been pushed into stack before this!
    @ Output:
    @   r2 = physical memory address (with EGA/MODEX flags if applicable)
    @ Destroys:
    @   r0
    @-------
add_r16_r0high_r4:                  @ This is jumped to when the offset is NOT based on BP register
    add     r2, r0, lsr #4          @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
    mov     r0, r2, lsr #(12+10+4)  @ r0 = 16K page number
    add     r0, #(SP_EMSPAGES>>2)   @ r0 = index into EMSPages table in stack
    ldr     r0,[sp, r0, lsl #2]     @ r0 = physical start address of the page, highest byte tells type
    lsl     r2, #(18-12)            @ r2 = offset within the 16K page in highest bits
    add     r2, r0, r2, lsr #18     @ r2 = physical linear address
    add     r0, pc, r2, lsr #24     @ r0 = PC + 0x02, 0x06, 0x0A, 0x0E, ...
    ldr     pc,[r0, #-2]            @ Jump to the handler, adjust index to 0, 4, 8, or 12
    .word   .op_03_RAM_r4           @ RAM (physical address like 0x02XXXXXX)
    .word   .unknown_back1          @ MCGA Direct (obsolete!)
    .word   op_03_EGA_r4            @ EGA (physical address like 0x0AXXXXXX)
    .word   .unknown_back1          @ Mode-X (unsupported opcode!)
.op_03_RAM_r4:
    @-------
    @ Actual code for handling opcode 03 when the target is AX and the address is in normal RAM.
    @ Get a halfword from (possibly) unaligned memory address, and add it to register.
    @-------
    ldrb    r0, [r2]                @ Load low byte from RAM
    ldrb    r1, [r2, #1]            @ Load high byte from RAM
    lsl     r0, #16
    orr     r0, r1, lsl #24         @ r0 = low byte | (high byte << 8) (in high halfword)
    adds    r4, r0                  @ Finally perform the addition
    b       loop
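
The address calculation in the listing above is easier to follow in C. The sketch below is my own simplified reading of the comments in the ASM code, not the actual implementation: it ignores the high-halfword register tricks, and the names and the 64-entry table size are assumptions for this example.

```c
#include <stdint.h>

#define PAGE_SHIFT 14u                      /* 16K pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Real-mode logical address to 20-bit linear address: segment * 16 + offset.
   The result wraps at 1MB, like a 20-bit address does. */
uint32_t linear_address(uint16_t seg, uint16_t off) {
    return ((((uint32_t)seg) << 4) + off) & 0xFFFFFu;
}

/* Look up the physical start address of the 16K page in a page table
   (standing in for the EMSPages table) and add the offset within the page. */
uint32_t physical_address(const uint32_t *page_table,
                          uint16_t seg, uint16_t off) {
    uint32_t linear = linear_address(seg, off);
    uint32_t page   = linear >> PAGE_SHIFT;        /* 0..63 */
    return page_table[page] + (linear & PAGE_MASK);
}

/* The highest byte of the physical address tells the memory type,
   for example 0x02 for normal RAM and 0x0A for EGA in the listing above. */
unsigned memory_type(uint32_t physical) {
    return physical >> 24;
}
```

The ASM version then uses the memory type to index the small jump table that follows the ldr pc instruction, so RAM, EGA and Mode-X accesses each end up in their own handler.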

I did a minor optimization immediately, I coded a shortcut for the situation where the memory operand is in normal RAM (which it always is in SysInfo), and then checked which operations exactly are performed:

| opcode 03 | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| ADD SI,AX | F0 | 312 | 14 | 39.47 | 12314 | 0.0468% | Add register AX to register SI | No |
| ADD AX,BX | C3 | 364 | 14 | 36.27 | 13201 | 0.0502% | Add register BX to register AX | No |
| ADD DI,[BP+disp8] | 7E | 208 | 25 | 67.85 | 14112 | 0.0537% | Add a local variable to register DI | No |
| ADD AX,[disp16] | 06 | 364 | 24 | 68.85 | 25061 | 0.0953% | Add a global variable to register AX | No |
| ADD AX,[BP+disp8] | 46 | 1045983 | 25 | 25.03 | 26180287 | 99.5774% | Add a local variable to register AX | No |

The things to note about this opcode are:

I wanted to make sure that the fluctuating ticks for the less common opcodes really are caused by cache misses, so I experimented by moving all the register modrm operations (modrm >= 0xC0) into ITCM, and jumping directly to their handler if the modrm byte is >= 0xC0 (else I jump to the original handler). This made the average time for the register operations exactly 14 ticks, which proved that the additional time is caused by cache misses. Sadly there is not enough ITCM to put all register operations there, so my best strategy is trying to optimize the extra jumps away from as many operations as possible.
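The modrm >= 0xC0 test works because of how the modrm byte is laid out: the top two bits are the "mod" field, and mod = 3 means both operands are registers. A minimal C sketch of the decoding (with made-up names, not taken from DSx86) looks like this:

```c
#include <stdint.h>

typedef struct {
    unsigned mod;   /* bits 7..6: addressing mode, 3 = register operand */
    unsigned reg;   /* bits 5..3: register (or opcode extension) */
    unsigned rm;    /* bits 2..0: register or memory addressing form */
} modrm_fields;

modrm_fields decode_modrm(uint8_t modrm) {
    modrm_fields f;
    f.mod = (modrm >> 6) & 3;
    f.reg = (modrm >> 3) & 7;
    f.rm  = modrm & 7;
    return f;
}

/* mod == 3 is the same thing as modrm >= 0xC0. */
int modrm_is_register_operand(uint8_t modrm) {
    return modrm >= 0xC0;
}
```

For example, modrm 0xC3 in the table above decodes to mod = 3, reg = 0 (AX) and rm = 3 (BX), while modrm 0x46 decodes to mod = 1, reg = 0 (AX) and rm = 6, which is the [BP+disp8] memory form.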

Final result

I made several iterations, adjusting and improving the code and then profiling again. Finally I ran out of new ideas to test, so this is what the current code looks like. See the list of optimizations below the code for a description of each change I did. The changes are also marked in red in the comments in this code snippet:

add_ax_bpdisp8:	
    @-------
    @ new macro r0high_r2_from_bpdisp8_destroy_SZflags
    @-------
    ldrsb   r0,[r12],#1             @ Load sign-extended byte to r0, increment r12 by 1
    tst     r2, #0xFF               @ Is a segment override in effect? Zero flag will be set if not
    add     r0, r9, r0, lsl #16     @ r0 = (idx register + signed offset) << 16
    biceq   r2, r3, #0x0000FF00     @ r2 = logical SS segment in high halfword, with garbage in low byte
    @-------
    @ macro calc_linear_address_r2_from_r0high
    @-------
    add     r2, r0, lsr #4          @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
    mov     r0, r2, lsr #(12+10+4)  @ r0 = 16K page number
    add     r0, #(SP_EMSPAGES>>2)   @ r0 = index into EMSPages table in stack
    ldr     r0,[sp, r0, lsl #2]     @ r0 = physical start address of the page minus logical page start
    add     r2, r0, r2, lsr #12     @ r2 = physical linear address
    @-------
    @ Code specific to [BP+disp8] handling
    @-------
    tst     r2, #0x7C000001         @ Is the target something else than halfword-aligned RAM?
    bne     .op_03_addr_r4          @ Yep, so jump there
    @-------
    @ Halfword-aligned RAM address accessed by BP-based indexing.
    @-------
    ldrh    r0, [r2]                @ Load halfword from RAM
    adds    r4, r0, lsl #16         @ Add it to register value
    b       loop                    @ Back to opcode loop
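
The fast path at the end of the listing above can be sketched in C as follows. This is only an illustration under stated assumptions: the function names are made up, the mask value is the one used in the listing, and the single-load version assumes a little-endian host (like the ARM in the NDS).

```c
#include <stdint.h>
#include <string.h>

/* The check from the listing: the fast path is taken only if none of the
   non-RAM type bits and not the lowest address bit are set, that is, the
   target is halfword-aligned normal RAM. */
int is_halfword_aligned_ram(uint32_t physical) {
    return (physical & 0x7C000001u) == 0;
}

/* Slow path: works for any alignment, two byte loads (like two ldrb). */
uint16_t load16_bytewise(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Fast path: one halfword load (like ldrh). Only valid for an aligned
   address, and gives the x86 value directly only on a little-endian host. */
uint16_t load16_aligned(const uint8_t *p) {
    uint16_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

Local variables accessed through BP are practically always at even addresses in compiled C code, which is presumably why the aligned path is hit nearly every time in the SysInfo speed test.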

The optimizations I made to the code are the following:

So, after I coded similar optimizations to all the [BP+disp8] based operations that Norton SysInfo uses during the CPU speed calculation, how did this affect the speed? Here first is the new profiling result, where we can see that handling opcode 03 now takes on the average only 22.13 timer ticks (while it originally took over 28 ticks):

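| opcode | byte | count | min ticks | avg ticks | total ticks | % of total | ITCM? | improvement |
|---|---|---|---|---|---|---|---|---|
| JL | 7C | 518073 | 11 | 11.01 | 5703131 | 4.2727% | Yes | 0% |
| ADD r16, r/m16 | 03 | 519191 | 14 | 22.13 | 11490927 | 8.6088% | No | 21.25% |
| MOV r/m16,r16 | 89 | 525284 | 24 | 24.25 | 12736016 | 9.5416% | No | 34.78% |
| CMP [BP+disp8],imm16 | 81 | 518356 | 25 | 25.12 | 13019844 | 9.7543% | No | 21.85% |
| POP r/m16 | 8F | 518085 | 28 | 28.06 | 14539076 | 10.8924% | No | 28.32% |
| MOV r16,r/m16 | 8B | 806280 | 14 | 20.04 | 16157616 | 12.1050% | Yes | 18.64% |
| DIV BX (etc) | F7 | 266636 | 16 | 67.04 | 17875969 | 13.3924% | No | 0% |
| INC/PUSH [BP+disp8] (etc) | FF | 1048576 | 27 | 30.25 | 31721006 | 23.7649% | No | 22.82% |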

The operations that read from RAM (now using ldrh instead of two ldrb operations) have improved about 20%. The operations that write to RAM (this time with strh instead of two strb operations) have improved by about 30%! (The real improvement is even a little bit higher, as these percentages have the profiling overhead included in the results.) It is interesting that the memory store benefits from halfword access more than the load. Perhaps this is due to my not being able to avoid using the register immediately after load, while storing a register does not have this slowdown. And finally, here is what Norton SysInfo now shows as the speed of DSx86. I was hoping I could get back to above 10x original PC speed, and I am quite happy to see that I succeeded. All in all, looks like my refactoring the code did not completely kill the performance of DSx86.

The next program I am going to profile is Wing Commander II, as it has been pretty choppy to begin with. The last time I profiled it the MCGA graphics mode only used Direct screen access, while nowadays only blitted screen update is used. Thus the results will not be fully comparable to results from last year, but even so it will give me information on what opcodes to optimize next.

July 18th, 2010 - Version 0.21 released!

Version 0.21 info

This is mostly a fix version after the somewhat buggy version 0.20. This version includes the finished AdLib emulation, and I fixed the problem introduced in 0.20 with the Direct SB mode where the start of the BIOS F000 segment was overwritten with corrupt data. This version also has a lot more of the opcodes refactored to use the new more robust memory handling, which also means that this version will run slower than the previous version. Norton SysInfo reports that this version runs at 9.9 times original PC speed, while the version before any of these internal changes ran at 11.5 times original PC speed. I am still not even halfway done with the internal refactoring, so the next version might still be slower, until I get all the refactoring done and can again start optimizing things.

I spent about half my time working on the internal refactoring, and the other half with debugging and testing programs that behaved badly in the previous version. Here is a list of the specific programs I tested and the changes they required, where applicable.

Future plans

The internal refactoring continues, and as you might have noticed, this version is quite a bit smaller than the previous version. That is due to refactored code no longer requiring separate graphics and normal RAM opcodes, but instead only the memory handlers are separate. So even though the code size gets smaller, more and more "graphics opcodes" get supported by every refactoring change I do. I am looking forward to a point where I can get rid of the separate graphics opcode framework completely, as that will free several kilobytes of ITCM for other more beneficial use.

I also hope to finally look into the mouse emulation improvements during the next couple of weeks. Adding smoother screen scaling could also help some games, but the problem with that is that it takes a lot of CPU cycles, during which time no interrupts are sent to the running x86 program, so especially Direct SB audio would become pretty much unusable. But, I'll see what I can do about that. There are also many games remaining in the Compatibility Wiki that I should look at, so I don't think I will run out of things to do in DSx86 for a while yet. :-)

Thanks again to all of you for your interest in DSx86!

July 13th, 2010 - AdLib emulation source code released

Just a quick post: I just made the source code for my AdLib emulation available for download. See my download page, or get it directly from here. Hope you find it useful or interesting, and let me know if you see some obvious bugs in it. :-)

July 11th, 2010 - Internal refactoring continues

AdLib emulation finished for now

Last Monday I added the last missing features of the AdLib emulation, frequency modulation (vibrato) and amplitude modulation. I figured out a way to handle the vibrato without slowing down the code all that much. The sound frequency should change (using the 32kHz sound output frequency) after every 674 samples, but as I fill the buffers 64 samples at a time, I decided to change the frequency after every 640 samples, so I can move the calculations outside of the sample building loop. So the vibrato is slightly faster than it should be, but I don't think that is much of a problem. I made a similar change to the amplitude modulation: it should change the sound volume every 168 samples, but I change it every 192 samples, again to move the calculations outside of the 64-sample loops.
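The block-based timing described above can be sketched in C roughly as follows. The struct, the function names and the step counters are hypothetical and only illustrate the idea of updating the vibrato and amplitude modulation once per 64-sample block instead of inside the per-sample loop; the actual emulation is ARM7 assembly.

```c
#include <stdint.h>

#define BLOCK_SAMPLES   64   /* samples generated per buffer fill            */
#define VIBRATO_BLOCKS  10   /* 10 * 64 = 640 samples (ideal would be 674)   */
#define TREMOLO_BLOCKS   3   /*  3 * 64 = 192 samples (ideal would be 168)   */

typedef struct {
    uint32_t vibrato_countdown;  /* blocks left until the next vibrato step  */
    uint32_t tremolo_countdown;  /* blocks left until the next AM step       */
    uint32_t vibrato_steps;      /* number of vibrato updates done so far    */
    uint32_t tremolo_steps;      /* number of AM updates done so far         */
} lfo_state;

static void lfo_init(lfo_state *s)
{
    s->vibrato_countdown = VIBRATO_BLOCKS;
    s->tremolo_countdown = TREMOLO_BLOCKS;
    s->vibrato_steps = 0;
    s->tremolo_steps = 0;
}

/* Generate one 64-sample block. The modulation bookkeeping runs once per
 * block, so the inner sample loop stays free of these checks. */
static void fill_block(lfo_state *s, int16_t *out)
{
    if (--s->vibrato_countdown == 0) {
        s->vibrato_countdown = VIBRATO_BLOCKS;
        s->vibrato_steps++;          /* a real version would adjust the frequency here */
    }
    if (--s->tremolo_countdown == 0) {
        s->tremolo_countdown = TREMOLO_BLOCKS;
        s->tremolo_steps++;          /* a real version would adjust the volume here */
    }
    for (int i = 0; i < BLOCK_SAMPLES; i++)
        out[i] = 0;                  /* placeholder for the operator calculations */
}
```

The trade-off is the one mentioned in the text: the modulation periods become multiples of 64 samples, so they only approximate the ideal 674 and 168 sample intervals.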

I just need to comment the code better and create a test project that would use my AdLib emulation code and then I could release the sources, in case people are interested in those.

Some compatibility improvements

I have so far tested Adventures of Robin Hood, Silpheed and SimAnt. Adventures of Robin Hood crashed with an unsupported opcode, and when I debugged it I noticed that it uses only a 32-byte(!) stack! That is so little space that when a timer interrupt happens while the code is calling a subroutine (after receiving a keypress) it runs out of stack space. Actually, the stack pointer should wrap to the end of the stack segment (which has unused space in the game launcher program), but my DSx86 stack emulation implementation did not handle this properly and thus crashed. I changed my stack emulation to handle stack pointer wrap-arounds properly, so Adventures of Robin Hood started up fine.
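To illustrate the wrap-around described above, here is a minimal C sketch of a 16-bit push where SP wraps from 0x0000 to 0xFFFE inside the stack segment. The struct and function names are hypothetical, the `ram` array is assumed to cover the full 1 MB real-mode address space, and the 1 MB address wrap is the 8086-style behaviour, used here mainly to keep the index in bounds.

```c
#include <stdint.h>

typedef struct {
    uint16_t ss;   /* stack segment */
    uint16_t sp;   /* stack pointer, a 16-bit offset within SS */
} stack_regs;

/* Physical address of seg:off in real mode (wraps at 1 MB). */
static uint32_t phys(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFF;
}

/* Push a 16-bit value. SP is decremented as a 16-bit quantity, so a push
 * with SP == 0 continues at offset 0xFFFE of the same segment. */
static void push16(uint8_t *ram, stack_regs *r, uint16_t value)
{
    r->sp = (uint16_t)(r->sp - 2);
    ram[phys(r->ss, r->sp)] = (uint8_t)(value & 0xFF);
    ram[phys(r->ss, (uint16_t)(r->sp + 1))] = (uint8_t)(value >> 8);
}
```

With SS = 0x1000 and SP = 0, a push lands at physical address 0x1FFFE rather than below the start of the segment.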

Silpheed progressed pretty far with the single-opcode internal refactoring I had already done, but then it ran into a problem with my string opcodes that still didn't handle writing to graphics memory with a segment address pointing to plain RAM properly. It did draw most of the enemy ships with the plain opcodes, but their removal from the screen was done using string opcodes, so after a while the screen was full of enemy ships. :-) It needed the refactored string opcodes before I could continue with it.

I then tested SimAnt, and noticed that it used the EMS memory in a way that also was incompatible with my segment-based memory access mode handling. It got the wrong data from the EMS memory using my old string opcodes, and thus filled the screen with garbage data instead of the Maxis logo at the start, for example. So, I decided to start working on the string opcode refactoring next.

  

Internal refactoring

I have now refactored many of the simple opcodes, like all OR operation variants and most of the MOV opcodes. What was interesting when I did this refactoring was that after every build the resulting DSx86.nds file got smaller. This was due to the new memory access handling using more common code for both the normal RAM and graphics opcodes, as the code branches to different handlers only after the effective memory address is calculated. Originally I had completely different opcode handlers, depending on where the effective segment points to. Thus, the new code has a lot more branches (which makes the code slower), but on the other hand there is much more common code which will help with the cache hit percentage. So perhaps the total slowdown is not quite as drastic. I also have a couple of speedup tricks I can still use, but those I can not do until all of the code is changed to use the new memory access strategy. So, it looks like the next version will be quite a bit slower than the current version, but after I have refactored all the opcodes I can make the code slightly faster again.

I am currently working on the string opcode refactoring, and now I am pretty happy with the way the code looks. I am splitting the memory moves, for example, so that they are done in blocks that fit within the same 16K memory page, both for the source and target address, and a possible SI and DI segment wrap is also taken into account. So all EMS memory and graphics memory access with the string opcodes should now always handle the correct data, so I can look elsewhere for erratic behaviour in games. I have already coded most of the string opcodes for main RAM and EGA graphics memory, but Mode-X handling is still completely missing. I still have a week before the next release, so I should have enough time to handle those as well.
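The block splitting described above can be sketched in C like this. It is only an illustration under stated assumptions: the function name and parameters are hypothetical, only the forward direction (direction flag clear) is handled, and the addresses are assumed to be already-translated linear addresses where a 16K page boundary may switch to a different memory handler.

```c
#include <stdint.h>

#define PAGE_SIZE 0x4000u   /* 16K memory page */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Largest number of bytes a forward string move can copy in one go without
 * crossing a 16K page at the source or the target, and without SI or DI
 * wrapping past 0xFFFF. The caller repeats this until nothing remains. */
static uint32_t movs_chunk(uint32_t src_addr, uint32_t dst_addr,
                           uint16_t si, uint16_t di, uint32_t remaining)
{
    uint32_t n = remaining;
    n = min_u32(n, PAGE_SIZE - (src_addr & (PAGE_SIZE - 1)));  /* source page end */
    n = min_u32(n, PAGE_SIZE - (dst_addr & (PAGE_SIZE - 1)));  /* target page end */
    n = min_u32(n, 0x10000u - si);                             /* SI segment wrap */
    n = min_u32(n, 0x10000u - di);                             /* DI segment wrap */
    return n;
}
```

After each chunk the caller would advance SI and DI (as 16-bit values), recompute both addresses through whatever page mapping is in effect, and ask for the next chunk.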

After I coded the main RAM and EGA string operations, both Silpheed and SimAnt looked to be playable. Silpheed still hangs after intro if Enter is not pressed, and it also hangs when going to the debugger and trying to continue, so there are still some issues. SimAnt I haven't tested any further than by going to the main menu. It uses 640x480 graphics mode so it is a bit awkward to play in any case.

The heat wave continues here in Finland (as it seems to do for most of Europe), so I shall see how much coding I can get done during the next week.

July 4th, 2010 - Version 0.20 released!

Version 0.20 info

Yes, it is version 0.20, not 0.16! I decided to jump the version number a bit again, as this version has some extensive internal architecture changes, as well as much improved audio features. I also began a new blog page as the previous one had grown quite long. See the end of this page for a link to the previous blog entries. Anyways, here is a list of the most important changes:

The new Covox and SB improvements still have some limitations. The Covox and SB Direct DAC output have distortion/warble in the sound. I believe this is mostly caused by the screen blitting code, which does not allow the PC timer interrupts to happen during the screen blitting. You can lessen the distortion by using screen update mode 15FPS, but you can not get completely distortion-free sound. The SB auto-init DMA only works when the played buffer is exactly divisible by 128, so some games might still not play all digitized sounds.

AdLib emulation improvements

As you can see from the change list, my focus last week was the audio features of DSx86. The last time I worked on the AdLib emulation was September last year, so I first had to go through the code and try to remember how it worked. Then I began by increasing the audio volume, which was the most often requested audio improvement. I remembered I had tried to increase the volume once earlier, but that resulted in bad distortion. Now I found the reason for the distortion and was able to fix that, so increasing the audio volume actually made the audio much cleaner. Also, for the first time since I had started working on DSx86 I now used Hi-Fi headphones with my DS Lite, and was surprised to find that my AdLib emulation actually produces quite convincing bass frequencies! I had only tested the audio with the inbuilt speakers and el-cheapo tiny headphones, neither of which seemed to produce any bass sounds.

After I increased the audio volume I began working on the missing rhythm instruments. AdLib has two modes of operation: it can either use all 9 channels (each with 2 FM operators) for melodic instruments, or it can use the last three channels for rhythm instruments (so that Bass Drum uses both operators of channel 6, and the other four rhythm instruments each use a single operator of the remaining two channels). Back in September the rhythm instrument code in my reference fmopl.c implementation looked extremely complex and slow, so I skipped implementing these at that time. Now that I am on my summer vacation I wanted to really look into this code and understand how it works, and I was able to optimize it and invent various shortcuts to make it run pretty much as fast as the normal channels.

Actually Bass Drum behaved pretty much like a normal channel, except that in the normal melodic channel operator 1 either works as a phase modulator for operator 2, or it produces sound directly (so that a single channel can actually produce two different sounds), but with Bass Drum it either works as a phase modulator or is ignored completely. Ignoring it was of course quite an easy change, so that took care of the Bass Drum. Tom Tom was quite easy as well, it just used a single operator to drive the output, so it was actually easier than the melodic channels.

The HiHat, Snare and Cymbal sounds were more difficult. They also each use only a single operator, but they need a noise generator in addition to the phase frequency counter, and also the frequency is not used as a simple 16.16 fixed point value being an index to a waveform table, but only a few bits of the frequency counter are used to create certain fixed indices to the waveform table. Both HiHat and Cymbal also use two different operators (channel 7 operator 1 and channel 8 operator 2) to produce their output. For example, this is what the HiHat phase generation looks like in the reference implementation:

    /* high hat phase generation:
       phase = d0 or 234 (based on frequency only)
       phase = 34 or 2d0 (based on noise)
    */

    /* base frequency derived from operator 1 in channel 7 */
    unsigned char bit7 = ((SLOT7_1->Cnt>>FREQ_SH)>>7)&1;
    unsigned char bit3 = ((SLOT7_1->Cnt>>FREQ_SH)>>3)&1;
    unsigned char bit2 = ((SLOT7_1->Cnt>>FREQ_SH)>>2)&1;

    unsigned char res1 = (bit2 ^ bit7) | bit3;

    /* when res1 = 0 phase = 0x000 | 0xd0; */
    /* when res1 = 1 phase = 0x200 | (0xd0>>2); */
    UINT32 phase = res1 ? (0x200|(0xd0>>2)) : 0xd0;

    /* enable gate based on frequency of operator 2 in channel 8 */
    unsigned char bit5e= ((SLOT8_2->Cnt>>FREQ_SH)>>5)&1;
    unsigned char bit3e= ((SLOT8_2->Cnt>>FREQ_SH)>>3)&1;

    unsigned char res2 = (bit3e ^ bit5e);

    /* when res2 = 0 pass the phase from calculation above (res1); */
    /* when res2 = 1 phase = 0x200 | (0xd0>>2); */
    if (res2)
        phase = (0x200|(0xd0>>2));

    /* when phase & 0x200 is set and noise=1 then phase = 0x200|0xd0 */
    /* when phase & 0x200 is set and noise=0 then phase = 0x200|(0xd0>>2), ie no change */
    if (phase&0x200)
    {
        if (noise)
            phase = 0x200|0xd0;
    }
    else
    /* when phase & 0x200 is clear and noise=1 then phase = 0xd0>>2 */
    /* when phase & 0x200 is clear and noise=0 then phase = 0xd0, ie no change */
    {
        if (noise)
            phase = 0xd0>>2;
    }

The noise value above is calculated in a noise-generator which gives a new value for each output sample, and uses the following algorithm in the reference implementation. The noise value in the above algorithm is the lowest bit of the OPL->noise_rng variable.

    OPL->noise_p += OPL->noise_f;
    i = OPL->noise_p >> FREQ_SH;    /* number of events (shifts of the shift register) */
    OPL->noise_p &= FREQ_MASK;
    while (i)
    {
        if (OPL->noise_rng & 1)
            OPL->noise_rng ^= 0x800302;
        OPL->noise_rng >>= 1;
        i--;
    }

My simplified and sped-up version of the phase generation algorithm is below. The problems in my algorithm are that it does not take into account the frequency of operator 2 in channel 8 at all, as I don't have enough free registers to handle two operators simultaneously, and my noise generation is completely different. The result is that the HiHat does not sound quite like it should; it has more of a ringing and less noise to its sound. It will have to do for now, though, until I figure out a better 1-CPU-cycle noise generator than my tst r7, r7, ror r7 opcode, or can figure out a way to calculate another operator while calculating the current operator as well.

    @-------
    @ On input:  r7 = SLOT7_1->Cnt (16.16 fixed point value, FREQ_SH = 16)
    @ On output: r1 = phase << 9
    @-------
    eor     r1, r7, r7, lsr #5                  @ r1 = (bit2 ^ bit7);   (<<16)
    orr     r1, r7, lsr #1                      @ r1 = (bit2 ^ bit7) | bit3;   (<<16)
    and     r1, #(1<<(16+2))                    @ r1 = res1 = (bit2 ^ bit7) | bit3; (== 0x200 shifted 9 bits left)
    tst     r7, r7, ror r7                      @ Carry flag = pseudo-random noise value
    orrcc   r1, #(0xD0<<(16+2-9))               @ phase = res1|0xd0;
    orrcs   r1, #(0xD0<<(16+2-9-2))             @ phase = res1|0xd0>>2;
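
For comparison with the reference C code, the simplified assembly above corresponds roughly to the following C sketch. The function name is hypothetical, the `noise` parameter stands in for the carry flag left by the tst opcode, and the result is returned unshifted (the assembly keeps it shifted 9 bits left).

```c
#include <stdint.h>

#define FREQ_SH 16   /* Cnt is a 16.16 fixed point value */

/* cnt:   SLOT7_1->Cnt
 * noise: 0 or 1, standing in for the pseudo-random carry flag */
static uint32_t hihat_phase_simplified(uint32_t cnt, int noise)
{
    uint32_t idx  = cnt >> FREQ_SH;
    uint32_t bit7 = (idx >> 7) & 1;
    uint32_t bit3 = (idx >> 3) & 1;
    uint32_t bit2 = (idx >> 2) & 1;
    uint32_t res1 = (bit2 ^ bit7) | bit3;

    uint32_t phase = res1 ? 0x200 : 0;      /* and r1, #(1<<(16+2))          */
    phase |= noise ? (0xd0 >> 2) : 0xd0;    /* orrcs / orrcc                 */
    return phase;
}
```

Unlike the reference code, the choice between 0xd0 and 0xd0>>2 here depends only on the noise bit and not on whether the 0x200 bit is set, and operator 2 of channel 8 is not consulted at all.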

I also managed to speed up some things in my AdLib emulation in general. For example, I reordered the operator-specific values in memory so that instead of using 7 separate ldr commands to load the r4-r10 registers needed in each operator calculation loop I load them with a single ldmia opcode, and I also improved the envelope calculations somewhat. The envelope generation in AdLib has the usual four phases: Attack, Decay, Sustain and Release. Plus silence of course. The code I use for the envelope generation is the following, and it is similar for both melodic and rhythm instruments:

    @-------
    @ Calculate envelope for SLOT 1.
    @ On input:
    @   r1 = scratch register
    @   r4 = sustain level (or silence level if in release phase) (in low 16 bits, 0..512)
    @   r5 = envelope increment/decrement value (16.16 fixed point)
    @   r8 = operator volume (16.16 fixed point), 0 = max volume, 512<<16 = silence
    @-------
    adds    r8, r5                              @ Adjust the volume by the envelope increment. Carry set if we are in attack phase.
    bmi     from_attack_to_decay_phase          @ Go to decay if we went over max volume
    rsbccs  r1, r8, r4, lsl #16                 @ Did we go under the SUSTAIN level (and we are not in attack phase)? Carry clear if we did.
    bcc     from_decay_to_sustain               @ Yep, go adjust the volume
env_adjust_done:	

The main idea of this code is that during normal envelope operations the program flow does not need to take any jumps; it will flow directly through these four opcodes. The algorithm above is an ASM version of the following C language code. This is not based on the reference AdLib implementation, as I have completely re-engineered the envelope generation code to be based on running 16.16 fixed point adders instead of (slow) table lookups.

    op->volume += op->env_incr;
    if ( op->volume < 0 )
        goto from_attack_to_decay;
    if ( op->env_incr >= 0 && op->volume > op->sustain )
        goto from_decay_to_sustain;
env_adjust_done:

The interesting part of my ASM implementation is that by using the reverse subtract RSB opcode instead of the CMP opcode I was able to get rid of the extra compare to see whether op->env_incr is greater than or equal to zero. The first adds always sets the carry flag if r5 (op->env_incr) is negative, and in that case I don't want to test the sustain level. So, by swapping the resulting Carry flag of the comparison between op->volume and op->sustain I can make sure the jump to from_decay_to_sustain is never taken if either op->env_incr is negative or op->volume is <= op->sustain. This is what I especially like about the ARM architecture, the conditionally executable opcodes make all sorts of neat tricks possible! I've been using the conditionally executed compare opcodes quite a bit recently, as that is a nice way to handle "comparison AND comparison" type of tests (with short-circuit evaluation).
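The carry-flag reasoning above can be checked with a small C model. This is only a sketch: the function names and the enum are hypothetical, `env_step_plain` mirrors the C fragment (with the sustain level already in 16.16 format), and `env_step_flags` imitates the four ARM opcodes by computing the carry by hand.

```c
#include <stdint.h>

typedef enum { ENV_CONTINUE, ENV_TO_DECAY, ENV_TO_SUSTAIN } env_action;

/* Straightforward version. volume and sustain are 16.16 fixed point,
 * 0 = max volume, larger values = quieter. */
static env_action env_step_plain(int32_t *volume, int32_t env_incr, int32_t sustain)
{
    *volume += env_incr;
    if (*volume < 0)
        return ENV_TO_DECAY;
    if (env_incr >= 0 && *volume > sustain)
        return ENV_TO_SUSTAIN;
    return ENV_CONTINUE;
}

/* Flag-level model. r4 holds the sustain level in its low 16 bits. */
static env_action env_step_flags(uint32_t *r8, uint32_t r5, uint32_t r4)
{
    uint64_t sum = (uint64_t)*r8 + r5;
    int carry = (int)((sum >> 32) & 1);   /* adds: carry out of bit 31        */
    *r8 = (uint32_t)sum;
    if (*r8 & 0x80000000u)                /* bmi: result went negative        */
        return ENV_TO_DECAY;
    if (!carry)                           /* rsbccs: only runs if carry clear */
        carry = (r4 << 16) >= *r8;        /* carry set when there is no borrow */
    if (!carry)                           /* bcc                               */
        return ENV_TO_SUSTAIN;
    return ENV_CONTINUE;
}
```

When the increment is negative and the result stays non-negative, the addition always produces a carry, so the reverse subtract is skipped and the final branch is not taken; that is the case the separate `env_incr >= 0` test covers in the plain version.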

The only bigger issues (in addition to the HiHat and Cymbal operators) in the AdLib emulation now are the amplitude and frequency modulation support. I actually have the amplitude modulation already coded in, but it does not seem to work properly, so I still need to debug that. Last September I did not have the iDeaS emulator to use for debugging the ARM7 code, so now debugging is much easier than it was back then. The frequency modulation I haven't implemented at all yet, as it is quite a heavy CPU burden and I'm not yet sure if I can spare the CPU cycles. I would like to test that, though. After those changes I would consider my AdLib emulation finished, so I could release the sources and/or create an ARM7 library for using it, so that it could be used in all sorts of PC game porting projects. I would like to hear the original music in the Doom and Wolfenstein ports, for example, and why not Quake too, if it used AdLib music?

Major internal refactoring starting

I have now run into several games that point the segment register outside of the graphics memory area (Gods, Silpheed) or the EMS memory area (Alone in the Dark), which breaks my speedup trick of precalculating the memory access area using only the segment register. I decided to change my memory access technique to be more compatible and robust, which will sadly also make it slower.
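The problem described above can be illustrated with a small C sketch. The names are hypothetical, and treating A0000h-BFFFFh as the video memory window is a simplification; the point is only that a decision made from the segment register alone can differ from one made from the full effective address.

```c
#include <stdint.h>

typedef enum { MEM_RAM, MEM_GRAPHICS } mem_area;

/* Fast shortcut: decide the memory area from the segment register alone. */
static mem_area area_by_segment(uint16_t seg)
{
    return (seg >= 0xA000 && seg < 0xC000) ? MEM_GRAPHICS : MEM_RAM;
}

/* More robust: decide from the effective address (segment * 16 + offset). */
static mem_area area_by_address(uint16_t seg, uint16_t off)
{
    uint32_t addr = (((uint32_t)seg << 4) + off) & 0xFFFFF;
    return (addr >= 0xA0000 && addr < 0xC0000) ? MEM_GRAPHICS : MEM_RAM;
}
```

For example, segment 0x9FFF looks like plain RAM to the shortcut, but 9FFF:0010 is physical address A0000h, the first byte of the video memory window.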

In this version I have started this internal reworking. Pretty much none of it should be visible yet, but a few opcodes are somewhat slower than before. I need to change every single opcode (which I have spent nearly the last year implementing) to use this new architecture, so it will take quite a long time before this work is finished. I plan to change a few opcodes (or opcode groups) with each version, and try to keep the two-week release window even while I am doing this change. I will also implement other changes and improvements, so this major rewrite is a sort of constant background process. Removing the Direct screen update method was also partly due to this internal reworking, as it would have gotten in the way of some required internal changes.

I haven't done much about the general compatibility this last week, as I have focused on the audio issues. I hope to look into a couple of new games again for the next version, and look into improving either the screen scaling methods or the touchpad mouse emulation. I'm not yet sure which.

Anyways, thanks again for your interest in DSx86, and sorry for this long blog post! Being on vacation gives me more time to work on DSx86, and also makes it possible to write longer blog posts. Oh, and happy 4th of July to all you celebrating it! :-)

Previous blog entries

