Well, somewhat surprisingly, I found the nastiest bugs in the code yesterday, and now DSx86 seems to be running reasonably well so that I dare release the first alpha version. Please note that it still is very much work-in-progress, so don't expect it to work properly all the time (or even most of the time).
You need to have 4DOS as a command shell, without that you can't do anything with DSx86. See the download page for download links. Naturally libFAT has to work on your setup as well, else all you get is this:
As you can see, the user interface also needs a lot of work still (if you are graphically inclined and feel like designing a user interface for DSx86, I would much appreciate your help!). You can click on the reddish items on the Screen configuration section to change the screen update behaviour on-the-fly. Other configuration changes need to be done via DSx86.ini file.
Phew, managed to get the first version released with a few days remaining on my vacation!
Well, this last week has been somewhat slow progress with DSx86, with Christmas and all. I have however managed to code some enhancements that are needed for the release, like configuration file handling (with the possibility to have game-specific key mapping and screen update methods) and crash logging to a file on the SD card. I have also added a lot of opcodes needed by the 4DOS help system and the example btm file. This still needs some work, though.
Solar Winds has been a real trouble maker recently. The first major problem was that it seemed to jump into a memory address that did not contain code (the area was full of zeroes) occasionally. When I later noticed that this happens every time I tried to fire the weapon, I began debugging this problem to see what was happening.
Pretty soon I discovered that the problem is that one of the SoundBlaster driver jump addresses point to non-existing code, but what was strange was that it pointed there even at the very start of the game, and no code ever changed this address! I ran Solar Winds in a debugger inside DOSBox to see what is going on, and in there the jump address is the same invalid address at the start of the game, but then when I fire a cannon the address has magically changed into a valid address!
It was pretty difficult (and took me two days) to find out what was really happening here, but I finally noticed (in DOSBox) that the jump address changes when a certain byte is written to the SoundBlaster DSP Command port! After some more studying I found out that Solar Winds uses a rather poorly documented SoundBlaster DSP command, which the DOSBox sources call "Weird DMA identification write routine". As good a name as any, I suppose. This command uses the SB DMA channel to write one byte to the PC RAM, which is calculated in a complex way from the byte that is written to the DSP. The Solar Winds SB driver uses this method to update the invalid jump address in its jump table to a valid value! Why it does this, I have no idea. Perhaps just to make life hard for emulator coders!
I managed to code a similar routine to DSx86, and now the jump table gets a correct address and firing the weapon works. But, it only took me a few minutes of testing further to find the next big problem. When firing at another ship, Solar Winds uses an ADPCM-packed sound file that it tries to send to SB for playing. The problem is, DSx86 does not support ADPCM sound files yet. This is yet another big new feature that I need to add for Solar Winds.
I'm not yet sure when I get the first DSx86 alpha version released, but I still have a week of vacation time left, so we shall see whether that is enough time for me to finish the most important enhancements still remaining. There are also quite a few bugs still in the code, but I doubt I will be able to fix all the bugs in the next couple of days, so the first release will be buggy. Isn't this actually the way Alpha releases should be? :-)
The last week I have been working on making the old Epic MegaGames game Solar Winds work in DSx86. I've had some problems with it, but currently it looks very promising, and I can most likely get it to work properly already in the first released version.
Since SW consists of a lot of files I could not run it in No$GBA emulator, and thus debugging it has been somewhat slow (as I need to build a new version, copy it to the SD card and put the SD card into my real DS Lite to test every little change). That also meant I could not get a screen copy from No$GBA to show it here. However, today it occurred to me that I could take a screen copy directly from the VRAM of DS Lite, save it as a BMP to the SD card, and then copy this file to my PC. And thus I quickly coded such a feature to DSx86, and here is the screen capture of Solar Winds running in DSx86. Note that this shows the whole 320x200 screen, as it is taken directly from the VRAM. The NDS 256x192 screen shows this either scaled or zoomed, depending on the configuration options (which aren't coded yet, though).
The problems I have had with Solar Winds were the following:
For some peculiar reason Solar Winds copies data from normal RAM to VRAM using the source and target indices and segments opposite to the standard convention. So, instead of the usual x86 way of copying data:
    mov cx,<number_of_bytes_to_copy>
    rep movsb

which uses CX register as a counter, copies data from DS:SI (Data Segment:Source Index) to ES:DI (Extra Segment:Destination Index), and increments the SI and DI registers automatically, Solar Winds does the copying like this:

    loop:
        mov al,es:[di]
        mov [si],al
        inc di
        inc si
        cmp di,4380
        jne loop

That is, Solar Winds copies data FROM destination index TO source index, and then explicitly increments the indices, until the destination index (which means the source position) is at a specified location.
This makes the code run an order of magnitude slower on DSx86 than the REP MOVSB code (which has been optimized to use the ARM block copies), and it also breaks my direct screen access checking (where I only check the ES segment address). Instead of trying to find a workaround to the latter problem, I just switched back to my original blit screen update method, where the real NDS VRAM is updated by copying the whole virtual PC 0xA000 segment data once every VSync (or every other, or every fourth). This will be a game-specific configuration option, as it depends on the game whether direct screen access or this blitting method works better.
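To make the contrast concrete, here is a rough C model of the two copy styles. It is only a sketch: the `Regs` struct, the `ram` array and the function names are made up for illustration, the direction flag is assumed clear, and the actual DSx86 handlers are written in ARM assembly and are not shown in this post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical emulated PC RAM (1 MB plus the area reachable above it). */
uint8_t ram[0x110000];

/* Hypothetical register set, only what the two copy loops need. */
typedef struct { uint16_t ds, es, si, di, cx; } Regs;

/* Real-mode linear address of seg:off (no 20-bit wrap in this sketch). */
uint32_t lin(uint16_t seg, uint16_t off) {
    return ((uint32_t)seg << 4) + off;
}

/* REP MOVSB with the direction flag clear: CX bytes from DS:SI to ES:DI. */
void rep_movsb(Regs *r) {
    uint32_t src = lin(r->ds, r->si), dst = lin(r->es, r->di), n = r->cx;
    int no_wrap = (r->si + n <= 0x10000u) && (r->di + n <= 0x10000u);
    int overlap = (src < dst + n) && (dst < src + n);
    if (no_wrap && !overlap) {
        memcpy(&ram[dst], &ram[src], n);      /* fast path: one block copy */
    } else {
        for (uint32_t i = 0; i < n; i++)      /* slow path: exact byte order */
            ram[lin(r->es, (uint16_t)(r->di + i))] =
                ram[lin(r->ds, (uint16_t)(r->si + i))];
    }
    r->si = (uint16_t)(r->si + n);
    r->di = (uint16_t)(r->di + n);
    r->cx = 0;
}

/* The Solar Winds style loop: read ES:[DI], write DS:[SI], until DI == end. */
void sw_copy(Regs *r, uint16_t end_di) {
    do {
        ram[lin(r->ds, r->si)] = ram[lin(r->es, r->di)];
        r->di++;
        r->si++;
    } while (r->di != end_di);
}
```

In the second form the screen segment sits in DS rather than ES, so a check that only looks at ES never notices the write, and every byte goes through the full opcode loop instead of one block copy.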
This weekend was spent mostly tracking down bugs, and fixing them. I started rewriting the string opcodes on Monday and continued that on Tuesday. I get only an hour or so of free time to work on DSx86 during workdays, and being tired after eight hours at the office I usually don't get much done, so this was slow progress. On Wednesday the new string code started to mostly work, so I decided to test whether Wing Commander 2 still works in DSx86.
The last time I ran WC2 in DSx86 was almost two months ago, before I started working on the technology demo with the bundled LineWars II. Back then I had gotten frustrated with the bug that hung the game after a while in space flight, and decided to forget about WC2 for a while. I have done a lot of fixes to the emulator since then, so I thought now might be a good time to test it again.
Well, of course WC2 did not work at all any more. It took me up to Saturday to get it to work again, after fixing the following problems:
The first problem was that WC2 just immediately quit back to DOS with the message "Abnormal program termination". After some debugging I found that my new memory allocation routines (which I thought were finished after last weekend) still had a "TODO!" code branch when a memory block is made larger. WC2 first minimizes its memory block and then tries to grow it a bit, and when this returned an error it just quit.
When I had fixed that, the next problem was that WC2 tried to initialize Sound Blaster at port 0x510! My BLASTER environment variable says that the SB port address is 0x220, so that made no sense. I was not able to easily find the problem, so I added memory watch code to DSx86 which makes it break into the debugger whenever the contents of a memory location change. I used this to track where the port address gets that wrong value, and after a few iterations I was in the code that parses the command line parameters, and dumping these showed that one of the DSx86 internal error message strings had been given as a command line parameter to WC2!
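A memory watch of this kind can be very small. The following is a minimal sketch of the idea, not the actual DSx86 code: the `MemWatch` type and both function names are hypothetical, and it assumes the check is called once after every emulated opcode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchpoint: a pointer into emulated RAM plus the last value seen. */
typedef struct {
    const uint8_t *addr;
    uint8_t        last;
    bool           armed;
} MemWatch;

/* Start watching one byte of emulated memory. */
void watch_arm(MemWatch *w, const uint8_t *addr) {
    w->addr  = addr;
    w->last  = *addr;
    w->armed = true;
}

/* Call after every emulated opcode; returns true when the debugger should break. */
bool watch_changed(MemWatch *w) {
    if (!w->armed)
        return false;
    uint8_t now = *w->addr;
    if (now == w->last)
        return false;
    w->last = now;          /* remember the new value for the next hit */
    return true;
}
```

Each time the check fires, the opcode that was just executed is the one that changed the byte, so the watch can then be moved to wherever that opcode read its value from, which is the kind of iteration described above.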
So, the problem looked to be in my new task launching code. When I debugged that, it looked like the command line segment and offset addresses were swapped, and after a little bit more digging I found that the struct I used for the EXEC_BLOCK DOS values did not have PACKED attributes in the inner structs. I think it is a little bit silly that the outer struct PACKED attribute is not inherited by the inner structs, but I guess that is how it's supposed to be. The fixed struct looks like this (the PACKED attributes on the two inner structs were missing, which caused cmd_line to get its segment from the offset of fcb_1, and its offset from the segment halfword):
    typedef struct PACKED {
        union {
            struct PACKED {
                u16 load_seg;
                u16 reloc;
            } _load;
            struct PACKED {
                u16 env_seg;
                u32 cmd_line;    // Far pointer in seg<<16|offs notation
                u32 fcb_1;       // Far pointer in seg<<16|offs notation
                u32 fcb_2;       // Far pointer in seg<<16|offs notation
                u32 stack;       // Far pointer in seg<<16|offs notation
                u32 start_addr;  // Far pointer in seg<<16|offs notation
            } _exec;
        } ldata;
    } EXEC_BLK;

After that change WC2 displayed the orchestra intro, and then hung.
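The effect is easy to reproduce in isolation. The sketch below assumes a GCC or Clang style `__attribute__((packed))` and a target where a 32-bit integer is normally aligned to four bytes; the struct names are invented for the demonstration and only the first three fields of the exec block are included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKED __attribute__((packed))   /* GCC/Clang syntax, assumed here */

/* Inner struct without its own PACKED: padding is inserted after env_seg. */
struct exec_unpacked {
    uint16_t env_seg;
    uint32_t cmd_line;   /* lands at offset 4 instead of 2 */
    uint32_t fcb_1;
};

/* Inner struct with PACKED: fields sit back to back, as DOS lays them out. */
struct PACKED exec_packed {
    uint16_t env_seg;
    uint32_t cmd_line;   /* offset 2 */
    uint32_t fcb_1;      /* offset 6 */
};

/* Packing the outer struct does not change the layout inside the inner one. */
struct PACKED outer_bad  { struct exec_unpacked e; };
struct PACKED outer_good { struct exec_packed   e; };
```

With these definitions `offsetof(struct outer_bad, e.cmd_line)` is 4 while `offsetof(struct outer_good, e.cmd_line)` is 2. Reading four bytes from offset 4 of the real DOS block picks up the segment half of cmd_line followed by the offset half of fcb_1, which matches the swapped values described above.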
The next problem was that I hadn't coded proper support for EMS memory into the new DOS file functions, so when WC2 loaded data directly into the EMS page frame at the 0xE000 segment, all the data went into the same physical memory area, and when after the orchestra intro WC2 tried to get the data from EMS back to normal RAM, it didn't find what it was looking for there and spun in a loop trying to find the data. This was reasonably easily fixed by including the EMS handling from my earlier code prior to the DOS function rewrite.
On Saturday I then returned to working on AlleyCat. I had a weird problem with it, where the building windows in the background kept vanishing every now and then. There were other strange things happening as well, but that was the most consistent obviously wrong behaviour.
I used my new memory watch code and made it watch the window location in the graphics memory, and found out that it was a REP MOVSB operation within the CGA memory that overwrote the windows. It took me some more digging to find out why this happens, but finally I noticed that occasionally the emulator code called the CLD (direction bit clear = increment indices) REP MOVSB code, even though the direction bit was set in the flags. It didn't do this all the time, only occasionally, and never when I traced through the code in the debugger. After some head scratching it finally occurred to me that this might happen when an interrupt handler has cleared the direction flag between the setting of the flag using the STD opcode, and the actual REP MOVSB opcode. But I did have code in the IRET opcode that was supposed to handle this situation!
Well, turned out that the IRET handler first set the new direction flag and then called the CLD or STD opcode, which in turn first checked whether the new direction flag differs from the existing one before they did anything. The flag did not differ since I had just set it in the IRET handler!
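The ordering problem can be modelled in a few lines of C. This is a sketch under assumptions: the `Cpu` struct, the `backward_ops` field (standing in for whatever the emulator switches when the direction changes) and the shape of the fix are all hypothetical, since the post does not show the real handlers or how they were corrected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_DF 0x0400u   /* x86 direction flag, bit 10 of FLAGS */

typedef struct {
    uint16_t flags;          /* emulated FLAGS register */
    bool     backward_ops;   /* hypothetical: which string handlers are active */
} Cpu;

/* STD: does nothing if the flag already looks set. */
void op_std(Cpu *c) {
    if (c->flags & FLAG_DF) return;
    c->flags |= FLAG_DF;
    c->backward_ops = true;
}

/* CLD: does nothing if the flag already looks clear. */
void op_cld(Cpu *c) {
    if (!(c->flags & FLAG_DF)) return;
    c->flags &= (uint16_t)~FLAG_DF;
    c->backward_ops = false;
}

/* Buggy order: the popped flags are stored first, so STD/CLD see no change. */
void iret_buggy(Cpu *c, uint16_t popped) {
    c->flags = popped;
    if (popped & FLAG_DF) op_std(c); else op_cld(c);
}

/* One possible fix: keep the old DF until the opcode handler has switched. */
void iret_fixed(Cpu *c, uint16_t popped) {
    uint16_t old_df = c->flags & FLAG_DF;
    c->flags = (uint16_t)((popped & ~FLAG_DF) | old_df);
    if (popped & FLAG_DF) op_std(c); else op_cld(c);
}
```

With the buggy version, an interrupt handler that executes CLD and then returns to code that had DF set leaves the flag bit set but the forward handlers active, which is exactly the mismatch seen in AlleyCat.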
This was a pretty serious bug that might have caused havoc in every game, so it was very good luck that AlleyCat used REP MOVSB with the direction flag set on the graphics screen, which made the problem easy to notice.
On Sunday I then did various minor fixes, like the hardware cursor emulation I mentioned in the previous post. I also tried to run the old CGA version of Elite on DSx86, but that immediately printed "Packed data corrupt" message, and quit to DOS, and also corrupted the memory control blocks while doing it, so 4DOS also exited (and rebooted).
After some debugging I found that the internal unpacker uses the segment:offset addresses a bit strangely, the offset is always in the range of 0xFFF0..0xFFFF, and the segment is adjusted so that it points to the correct location. Works fine as long as you are using a DOS version that consumes at least the first 64KB of RAM. However, with 4DOS swapping to EMS in DSx86, DOS takes less than 8KB of the low RAM, and the segment address wraps down to 0xFFFF and below. I suppose a real PC might wrap the physical RAM address back to low memory with addresses like 0xFFFE:0xFFF0, but in any case that is NOT a proper way to code software!
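The wrap is easy to see with a little arithmetic. The sketch below is not Elite's unpacker, just a hypothetical helper that rewrites a linear address so that the offset falls in 0xFFF0..0xFFFF, plus two ways of turning the result back into a physical address.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t seg, off; } FarPtr;

/* Re-express a linear address with the offset forced into 0xFFF0..0xFFFF. */
FarPtr high_offset_form(uint32_t linear) {
    FarPtr p;
    p.off = (uint16_t)(0xFFF0u | (linear & 0xFu));
    p.seg = (uint16_t)((linear - p.off) >> 4);   /* wraps for small addresses */
    return p;
}

/* Physical address on a machine that wraps at 1 MB (20 address lines). */
uint32_t phys_20bit(FarPtr p) {
    return (((uint32_t)p.seg << 4) + p.off) & 0xFFFFFu;
}

/* Physical address when nothing wraps the sum back to low memory. */
uint32_t phys_nowrap(FarPtr p) {
    return ((uint32_t)p.seg << 4) + p.off;
}
```

For a program loaded at linear address 0x2000 this gives F201:FFF0. With a 20-bit wrap that still points at 0x02000, but without the wrap it points at 0x102000, above the first megabyte. For an address of 0x20000 or higher the segment stays small and both functions agree.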
Anyways, I will not fix this in DSx86, instead there is an easy workaround for this problem: Start another 4DOS shell but tell it not to swap, so it takes 200KB of RAM, and then launch Elite from within this shell. This will work fine, but currently Elite runs only up to the program protection screen, where it needs a DOS Buffered Input call that I haven't yet coded. Oh, and the screen is in 40x25 text mode, which I also added. This is a nice mode as all the text (except the last line) fits without scrolling or scaling.
Instead of continuing with Elite, I decided to download some other old DOS games to test. The first in line is Solar Winds 1, which I am currently working on. Anyways, here above are some pictures (from AlleyCat in the actual game, the Elite protection screen, and 4DOS "memory" command result), to make this long post a bit less boring, hopefully.
This weekend I finally had more time to work on DSx86. The biggest new feature I added this weekend was a proper FAT directory handling (DIR and CD commands on the DOS prompt). I also fixed the new process launching and returning to the parent process when the child process exits. There were also some missing features in the memory allocation routines, which now should be pretty much complete.
I got a bit bored working on the DOS internals for the last three weeks, so I took some time off from that work and looked into supporting AlleyCat. I had to add quite a few new modrm bytes for the opcodes that AlleyCat uses, and also the simple CGA support I added for Paratrooper was far from good enough for AlleyCat. The CGA support is still missing some important features and has some bugs, but I managed to get AlleyCat working for a little while, until it hits an unsupported CGA string instruction.
The DOS kernel now supports the most important features for launching games from the SD card, so a first alpha-version release of DSx86 is now much closer. I have a two-week Xmas vacation starting on the 21st, and my current plan is to get the first working version of DSx86 released during my Xmas vacation. That might still change if I run into some major problem in the code, but at least that is the plan. The biggest missing features that I still must add before I can release it are the following:
Btw, December 6th is the independence day of Finland. Happy Independence Day!
Last weekend I didn't have quite as much free time to work on DSx86 as I normally have on the weekends. The little time I had was spent continuing work on the internal DOS emulation. I got some of the 4DOS internal commands to work (like "memory" and "beep" :-), and finally on Sunday I got "dir" command mostly working. That needed an interface and conversion between libFAT internal structures and DOS internal structures, but I managed to get this to work and show also file dates and sizes correctly. So far this works only on the top level directory (as I don't yet have current working directory handling). Next I'll try to get launching a program from the root directory to work.
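Part of that conversion is packing the file timestamp into the 16-bit date and time words that DOS uses in its directory search results. A minimal sketch of that packing follows; the function names are made up, and how the values are pulled out of the libFAT side is left out.

```c
#include <assert.h>
#include <stdint.h>

/* DOS date word: bits 15-9 = year - 1980, bits 8-5 = month, bits 4-0 = day. */
uint16_t dos_date(int year, int month, int day) {
    return (uint16_t)(((year - 1980) << 9) | (month << 5) | day);
}

/* DOS time word: bits 15-11 = hours, bits 10-5 = minutes, bits 4-0 = seconds / 2. */
uint16_t dos_time(int hour, int min, int sec) {
    return (uint16_t)((hour << 11) | (min << 5) | (sec / 2));
}
```

For example, December 6th 2009 packs to 0x3B86. Seconds are stored with two-second resolution, so odd seconds are rounded down.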
Last Saturday I started working on the internal BIOS and DOS emulation, which until now had mostly been just a collection of quick hacks to get certain programs running. I decided to rewrite much of it, reorganize the source files to logical collections, etc. A lot of maintenance work, in other words.
I also downloaded the source code for FreeDOS, to learn how a real DOS version handles stuff like memory allocation and task launching. Much of Saturday was spent just looking at the code and figuring out the best ways to do something similar in DSx86. Saturday evening I managed to get a proper memory allocation scheme working in DSx86.
Sunday morning I then rewrote the task launching code, which now should resemble the way real DOS works. It is still missing some features, but I managed to launch both COM and EXE files (still bundled with the core) using the new code, so in principle it seems to be working.
Perhaps the biggest reason why I wanted to make the internals work like in a proper DOS is that I *really* don't want to write a command interpreter (or shell) for DSx86. I've written about a dozen language parsers in the past, and writing yet another couldn't be less interesting. So, I try to make my DOS kernel look-a-like be able to run an existing DOS shell, like 4DOS (which is what I used instead of COMMAND.COM back in the early 90's when coding the original LineWars games).
I just got 4DOS.COM to run up to presenting the prompt inside DSx86, so I think I will by default support 4DOS as a command interpreter for DSx86. It is very feature-rich compared to the standard DOS shell, it is faster, takes less conventional memory, and best of all, it is nowadays freeware! So I think it will be a good way for me to avoid writing yet another parser.
However, 4DOS has one very annoying feature: It uses an unaligned stack pointer! It does this only in a very small snippet of its code, but that was enough to crash DSx86 (where I had assumed that no coder in their right mind would use a byte-aligned stack pointer, and thus used ldrh/strh opcodes for all stack emulation). Now I had to fall back to separate ldrb/strb opcodes, only because of a few lines of code in 4DOS.COM! Very annoying!
Well, I just hope I can figure out some new tricks to speed up the stack handling back to the previous level in the future.
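In C terms, the fallback amounts to building each 16-bit stack access out of two byte accesses, so that an odd SP value is harmless. The sketch below shows the idea with hypothetical helper names; the real code is ARM assembly using ldrb/strb and is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Push a 16-bit value with two byte stores; stack_seg is the 64 KB stack segment. */
void push16(uint8_t *stack_seg, uint16_t *sp, uint16_t value) {
    *sp = (uint16_t)(*sp - 2);
    stack_seg[*sp]                 = (uint8_t)(value & 0xFF);   /* low byte  */
    stack_seg[(uint16_t)(*sp + 1)] = (uint8_t)(value >> 8);     /* high byte */
}

/* Pop a 16-bit value with two byte loads. */
uint16_t pop16(const uint8_t *stack_seg, uint16_t *sp) {
    uint16_t lo = stack_seg[*sp];
    uint16_t hi = stack_seg[(uint16_t)(*sp + 1)];
    *sp = (uint16_t)(*sp + 2);
    return (uint16_t)(lo | (hi << 8));
}
```

This costs two memory accesses where an aligned halfword access needs one, which is the slowdown mentioned above.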
I spent some time looking into supporting Master of Orion in DSx86. I found that supporting it at this stage of DSx86 development is rather difficult. The issues I ran into when attempting to run MOO in DSx86 were the following:
Last weekend I worked on getting the old CGA game Paratrooper running inside DSx86. Even though the game is very old and very simple, it was still interesting to work with it. I wanted to support both CGA graphics and PC Speaker, so such a small program that does not contain much additional code was a very good testbench.
I managed to get everything else working during the weekend, but there was still a strange problem where the intro music played fine, but when the game itself started there were no sounds (even though playing the game in DOSBox played sounds also during the game). Also the melody after your death was missing.
Today I spent some time debugging my code to see what was wrong, and finally found a bug in my implementation of REP SCASB (which the game used to determine which sound, if any, was to be played). I fixed that, and now the game works fine, just like in DOSBox. Well, except that I still don't have screen scaling implemented.
I just got SYSINFO to run up to the CPU speed test page, so let's take a look at the result! (I photoshopped two No$GBA screen copies side by side so it shows the full 80x24 screen.)
I believe the CPU MHz estimate (10MHz) is based on the speed of division opcodes. Sysinfo runs several division opcodes (which are supposed to take exactly 25 cycles on a 286 processor) in a row, and then checks how many timer ticks have passed, and then determines the MHz value based on those. Since ARM9 does not have an inbuilt division, I do the division using the math coprocessor, which obviously adds somewhat to the cost of emulating the division opcode. Also the emulated timer does not run at exactly the PC timer speed (1,193,182 Hz), though close to it.
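If that guess about Sysinfo is right, the arithmetic would look roughly like the sketch below. Everything here is an assumption made for illustration: the function name, the parameters and the sample numbers in the note are invented, and Sysinfo's real method has not been checked.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_HZ 1193182.0   /* PC timer input clock in Hz, from the text above */

/* Estimated MHz = cycles the opcodes should have taken / elapsed seconds / 1e6. */
double estimate_mhz(uint32_t div_count, uint32_t cycles_per_div, uint32_t pit_ticks) {
    double cycles  = (double)div_count * (double)cycles_per_div;
    double seconds = (double)pit_ticks / PIT_HZ;
    return cycles / seconds / 1e6;
}
```

For instance, 1000 divisions at 25 cycles each that take 2983 timer ticks work out to very nearly 10 MHz. Anything that makes the emulated division slower, or the emulated timer faster, lowers the estimate.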
I haven't checked what Sysinfo does when calculating the actual CPU speed bar, but that value of 11.6 times original PC looks pretty good in my opinion! DOSBox on my main PC with a 2.0GHz Pentium M processor gets a value of 17.8, which I think is surprisingly low.
I had to code a couple of really horrible hacks to my emulator to get Sysinfo to run this far. Firstly, I don't quite understand how Sysinfo attempts to blindly run FPU opcodes, even though I tried to use all BIOS and DOS indicators that this machine does not have an FPU! Perhaps I still missed something. Anyways, to not have to emulate an FPU, I just hacked my emulator so that when an FPU opcode is encountered, the emulator jumps ahead to the end of the routine. Problem solved for Sysinfo, but I need to fix this properly before releasing my emulator. Another similar horrible hack was with the Device Drivers detection. I don't have any proper device drivers emulated, but when Sysinfo notices the machine has an EMS memory "driver", it tries to look up the device that controls it and then goes into a never-ending loop. I hacked the code to also skip this test.
There also seems to be some kind of a visual problem with the shadow on the text screen, which I need to fix at some point. But next I think I'll look into emulating a CGA display.
UPDATE: The values above are when running DSx86 in No$GBA. When running on real hardware (DS Lite), the CPU is detected as "Intel 80286, 11MHz" and the CPU Speed bar shows 11.3 times original PC. Pretty close to the values in No$GBA.
I found the old SYSINFO.EXE of Norton Utilities 5.0 on an old floppy disk. I think I'll add support for it next, as it has a CPU benchmark feature. It would be nice to see how fast it thinks my emulator is. The bad news is that I need to add a lot of BIOS and hardware emulation stuff that SYSINFO tries to find out about the machine. I would eventually need to add that anyways, so I guess now is as good a time as any for that.
Well, I think LW2 runs inside my emulator currently well enough for me to announce this project on the GBADev forum, and release this demo version to the public.
I am not quite satisfied with the code I use to map the screen coordinates from 320x200 resolution to 512x256 background on the NDS, so I decided to ask for optimization tips on the GBADev forum.
I finally got LW2 to progress up to the beginning of the demonstration game, and immediately there was a problem. The ships started to vanish one by one, until also the camera ship jumped ("Warp Factor 10, Mr. Sulu!") somewhere so far from the original location that the planets and asteroids were just tiny dots on the screen.
After some studying of LW2 source code I noticed that there was practically only one place in the code that adjusts the ship positions, and that code looked like the following:
    mov  ax,[DI.XMOVE]   ; Get 256 * X-movement value
    imul bx
    add  [DI.XLOW],ah    ; Add the fractional part
    mov  ax,dx
    cwd
    adc  [DI.X],ax       ; Add movement to X
    adc  [DI.XHIGH],dl
Obviously [DI.XHIGH] gets a wrong value when the ship jumps somewhere far away.
Luckily this was the only place in LW2 where an adc [di+disp8],dl opcode is used, so I could code some debugging directly into the emulator. I made the emulator break into the inbuilt x86 debugger immediately if the value to be stored is something else than 0 or 0xFF. And soon I was in the debugger, and noticed that the full X-coordinate was 0xFEFF826D. Obviously the high byte was wrong.
The ADC code in my emulator looked like this (for a 16-bit register, slightly simplified for clarity):
    ldrsb r0,[r12],#1             @ Load sign-extended disp8 byte to r0, increment r12 by 1
    add   r0, r11, r0, lsl #16    @ r0 = (DI + signed offset) << 16
    ldrb  r0, [r2, r0, lsr #16]!  @ Load low byte
    ldrb  r1, [r2, #1]            @ Load high byte
    orr   r0, r1, lsl #8          @ r0 = low byte | (high byte << 8)
    adc   r0, #0                  @ Add the carry to r0 ([DI+disp8]) value
    lsl   r0, #16
    adds  r0, \reg                @ r0 = [DI+disp8] + Carry + reg
    lsr   r0, #16
    strb  r0,[r2]                 @ Store low byte to [physical segment + DI + disp8]
    lsr   r0, #8
    strb  r0,[r2, #1]             @ Store high byte to [physical segment + DI + disp8 + 1]
I thought the code looked fine, but to make sure, I coded a check for various input values with input carry on/off into my tester program, and ran it. But it did not find any errors. I was a bit stumped, I knew the code was broken, I just couldn't see the problem and the tester program couldn't find the problem either!
Finally after a lot of study I noticed that the problem was not in fact with the last ADC command at all, but in the previous 16-bit ADC. When the value in memory is 0xFFFF and Carry is set, adding these two together produces 0x00010000, which when shifted left becomes a zero. This should mean that whatever else is added to this value, when we leave this ADC code Carry should be set, but since the next step just adds the \reg value to zero, it will never set the Carry flag, and thus when we leave this code Carry is always clear, which is not correct!
The fix I made was to handle this special case separately, and at the same time I replaced the separate "adc r0, #0" and "lsl r0, #16" with a single "addcss" like this:
    mov    r1, \reg               @ If input Carry is clear, the right operand is the plain register value.
    addcss r1, #0x00010000        @ If input Carry is set, the right operand = (register value + 1).
    bcs    adc_pass_carry_r0      @ If Carry is now set, it means the right operand was 0xFFFF and carry was set, so need special handling.
    adds   r0, r1                 @ Perform the actual addition, setting the resulting flags.
    ...
adc_pass_carry_r0:
    ands   r0, r0                 @ Set Sign and Zero flags, keep Carry set, Overflow flag is not changed
    mrs    r0,cpsr                @ Put the flags into r0
    bic    r0, #0x10000000        @ Clear the Overflow flag
    b      restore_flags_from_r0  @ Back to loop, setting the proper flags.
The result of the addition does not need to be stored when we handle the special case, as the result will not have changed. After this fix LW2 started to work fine inside the emulator! It is very possible that this was the problem that plagued WC2 as well, but I haven't checked that yet.
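The lost carry can be reproduced with a small C model of the original instruction sequence. This is a sketch, not emulator code: the function names are invented, and it assumes (as the `lsl #16` in the listing suggests) that the emulated 16-bit registers are kept in the upper halfword of the ARM registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t result; bool carry; } AdcOut;

/* Mirrors the original sequence: carry is folded into the memory operand first. */
AdcOut adc16_buggy(uint16_t mem, uint16_t reg, bool carry_in) {
    uint32_t r0 = (uint32_t)mem + (carry_in ? 1u : 0u);   /* adc r0, #0               */
    r0 <<= 16;                                            /* lsl r0, #16: 0x10000 -> 0 */
    uint64_t sum = (uint64_t)r0 + ((uint64_t)reg << 16);  /* adds r0, reg (reg << 16)  */
    AdcOut out = { (uint16_t)((uint32_t)sum >> 16), (sum >> 32) != 0 };
    return out;
}

/* Reference behaviour of a 16-bit x86 ADC. */
AdcOut adc16_ref(uint16_t mem, uint16_t reg, bool carry_in) {
    uint32_t sum = (uint32_t)mem + reg + (carry_in ? 1u : 0u);
    AdcOut out = { (uint16_t)sum, (sum >> 16) != 0 };
    return out;
}
```

With mem = 0xFFFF and the input carry set, both versions produce the same 16-bit result, but only the reference version reports a carry out. That explains why a tester that compares only memory and register contents finds nothing wrong, while the following `adc [DI.XHIGH],dl` still receives the wrong carry.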
Using an 8x8 pixel font on the text screen would only show 32 of the 80 character columns on the NDS screen, so I thought I'd use a narrower font, if possible. I was using Google to look for such a font, when I ran across a post that mentioned that all Windows users already have a font generator that can create various bitmapped fonts between 4x6 and 20x10 pixels! All you need to do is type a text file containing the ASCII characters you want in the Command Prompt, and then use the Properties menu to change the font size. Then take a screen copy and paste the image to your favourite image editing program. Pretty simple! See below for a 4x6 font in the command prompt. Not the most easily readable...
I decided to use a 6x8 font, as that will make the 40x25 text mode fit completely (horizontally) into the NDS screen, and will show a bit over half of 80x25 text screen.
For several weeks now I've had a bug in the emulator that makes WC2 hang. It usually happens after a few minutes of space flight. The screen just stops, and occasionally the screen gets garbled. My tester program finds nothing wrong in any of the opcodes (though I have to admit that the tester program still only tests for memory and register errors, not for actual results of the opcode). I don't feel like improving the tester program (as that would be a lot of work), so instead I thought I'd try to get my old LineWars 2 DOS game to run in the emulator. That might help me in tracking down the bug, as I can look at the source code of the DOS program to see what is supposed to happen in the code.
Making LW2 run means I have to code support for the 80x25 text mode. Until now I've only supported the MCGA 320x200 256-color mode, as WC2 does not much use the text mode. I have recently changed the way I draw stuff on the graphics screen. Instead of copying data from the virtual A000 segment to VRAM every other VBlank, I used the EMS page map to make the A000 segment point to 0x06000000 physical address (the start of VRAM), and then check in the code whether the effective segment has this address, and if it does, map the coordinates from 320x200 resolution to the hardware 512x256 background and write the data directly to screen. Since WC2 already uses an offscreen buffer and mostly just REP MOVS the data to A000 segment, this boosted the framerate noticeably. I don't have a method of measuring the difference, but if it is plainly noticeable it must be considerable.
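The coordinate mapping itself is simple when written naively. The sketch below assumes one byte per pixel on both sides and a background with 512 bytes per line; the function name and constants are made up, and the actual emulator code is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>

#define PC_WIDTH  320u   /* bytes per line in the 320x200 256-color mode */
#define BG_STRIDE 512u   /* bytes per line of the 512x256 background (assumed 8-bit) */

/* Map an offset inside the A000 segment to a byte offset inside the background. */
uint32_t a000_to_bg_offset(uint16_t pc_offset) {
    uint32_t y = pc_offset / PC_WIDTH;   /* screen row    */
    uint32_t x = pc_offset % PC_WIDTH;   /* screen column */
    return y * BG_STRIDE + x;
}
```

The result is added to the VRAM base address (0x06000000 in the text above). Since the ARM9 has no divide instruction, a real implementation would probably avoid the division and modulo, for example by handling a whole REP MOVS run one row at a time.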
I am thinking of a similar technique for the text screen, by mapping access to B800 segment (as LW2 writes directly to the screen buffer instead of using DOS or BIOS printing calls) to an easily identifiable value and then mapping the access from 160x25 coordinates to 512x256 coordinates.
I just finished rewriting the REP MOVSW and REP STOSW opcode handling. I had already some time ago coded them to use ldmia and stmia instead of copying byte-by-byte, but the alignment handling was not pretty, and in general the code was very hard to read. So, I decided to rewrite them completely.
However, pretty much immediately after I had rewritten that code, I started wondering whether using DMA would be faster than ldmia/stmia loops, especially if I could continue with other stuff while the DMA keeps transferring (as the main opcode loop is in ITCM which should not stop and wait for the DMA transfer to finish). So, I coded the moves with DMA and commented out the original code. It took a while to get the DMA transfer to work properly, as I had to also flush/invalidate cache. So the code did not turn out to be quite as simple as I originally thought it would be.
The result did not feel any faster, though, so I decided to profile both versions to properly see what the difference was. I didn't use my main profiler, instead I just printed the input count of bytes and the total ticks spent. The results are here:
Method | Using DMA (ticks/byte) | ldmia/stmia (ticks/byte) |
---|---|---|
REP MOVSW | 4.67 | 1.69 |
REP STOSW | 1.05 | 0.75 |
store to VRAM | 0.83 | 0.75 |
The conclusion is pretty clear, there is no advantage in using DMA when you need to flush cache as well, it is much faster to use ldmia/stmia. So, my complete rewrite of the ldmia/stmia was not wasted time.
I have been using Google to try to find sufficient information about the PC keyboard hardware, and especially about the full hierarchy of things that happen from the moment you press a key on the keyboard until it is used by a DOS program. Finally I found a resource that seems to describe every step in sufficient detail at The Art of Assembly Chapter 20.
Now I can finally start coding the keyboard support. I'll use the nice NDS touchpad keyboards by Headsoft as they seem to be just what I need.
The biggest problem with the AdLib emulation I have currently is that it skips notes and sometimes plays completely wrong notes. I haven't been able to track this down, so now I decided to log all AdLib commands that WC2 sends and write them into a file. After I had done this a couple of times, I noticed that WC2 sends completely different note events on different runs of the same stage in the game! So, either WC2 has some weird music randomizer (highly unlikely) or there is something wrong in the core emulator, and not in the AdLib emulator where I have been looking for the problem.
Anyways, I'm fed up with working on the AdLib emulation for now, I'll fix this later (or hope it will fix itself while I continue improving the core emulator itself).
ARGH! First off, I only noticed last weekend that all the problems I was having were caused by my playing the SAME buffer that I wrote into, not the other buffer! So much for that being the easy part in the emulation...
What threw me off was that for some peculiar reason No$GBA sounds better when writing to the wrong buffer! The real hardware warbled horribly while No$GBA sounded reasonably clean with only a few clicks now and then (which I then assumed was caused by the CPU lagging behind the buffer fill). Last weekend I finally noticed that something was badly wrong when even only 2 channels caused similar problems.
I then finally after many hours of head scratching found that I had the buffers wrong, and when I switched those, real hardware suddenly sounded completely clean, while No$GBA began warbling. Strange...
Anyways, I then added some checks for CPU load and noticed that at 16kHz my code only took 20% of ARM7 power to handle all 9 channels. So, I immediately upped the mixing speed to 32kHz, and now the code takes around 40% CPU, which I think is fine as I want the ARM7 to do other things besides the AdLib emulation as well.
I still have various problems in the code that I need to fix, but it looks like I have a reasonable speed margin now to add the missing features.
Okay, with the buffering issue fixed with the help from GBADev users, the next problem I have is that my code is not fast enough to play all 9 channels simultaneously. Again I posted a help request in the forum.
I had to admit defeat, I just can't seem to get rid of the clicks in my AdLib emulation, so I asked for help in the GBADev forum.
It has been harder than I had thought, getting some sounds coming from my AdLib emulation. The biggest problem is the lack of any kind of debugging ability, I just have to code something, and if I don't get any sound, I have no idea what is wrong as I can't even use iprintf! Very slow progress. Anyways, now I get something resembling music (though it is mixed with noise, clicks and other invalid sounds) playing.
I am starting to get fed up with my PSG version of AdLib music, so I am studying the possibility to code a real AdLib emulation on the ARM7. DOSBox sources contain several different AdLib emulation codes, of which fmopl.c by Jarek Burczynski looks to be the most suitable. It looks to be very performance-oriented, so it should be a good starting point. I managed to compile it, although I had to get rid of all the WiFi stuff on the ARM7 side to get it to fit in memory, but it hangs ARM7 pretty much immediately in the sample generation loop. I guess my best bet is just to use this source as an example when coding something similar in ASM.
Well, I ran into a post mentioning ITCM (Instruction Tightly Coupled Memory) on the GBADev forum, and after looking into that I decided to try and move the innermost opcode dispatcher loop of my emulator into ITCM. Looking at the map file I noticed that I have pretty much the whole ITCM unused, so I then decided to move also most of the simple opcodes, and the most often used 0x8B opcode into ITCM. Then I ran the profiler again, and the result was very encouraging!
opcode | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
8A | 202709 | 16 | 21.13 | 4283424 | 2.1% | MOV r8,r/m8 |
D1 | 335446 | 18 | 21.61 | 7247502 | 3.6% | ROL/SHR r/m16,1 |
26 | 694696 | 11 | 11.09 | 7704394 | 3.8% | ES: |
F3 | 221128 | 14 | 36.23 | 8011743 | 4.0% | REP |
83 | 319003 | 14 | 30.71 | 9795952 | 4.9% | ??? r/m16,+imm8 |
AE | 650595 | 15 | 15.20 | 9886529 | 4.9% | SCASB |
74 | 861436 | 12 | 12.40 | 10685264 | 5.3% | JE |
38 | 647976 | 20 | 20.19 | 13079593 | 6.5% | CMP r/m8,r8 |
75 | 1048576 | 12 | 13.11 | 13747667 | 6.8% | JNE |
8B | 831903 | 15 | 19.19 | 15965359 | 7.9% | MOV r16,r/m16 |
The minimum amount of ticks spent dropped by 2-3 per opcode, but the really big difference was in the average number of ticks, which has almost halved for some of the opcodes! It is very nice to every now and then run into a new trick that gets you a great boost in performance practically for free! These sudden jumps forward are pretty much what keeps me interested in this project, and also programming in general.
Coding the tester was a good idea! It is still in a very early stage, it only tests that the changes made by the opcode affect the correct register and/or the correct memory location, not yet even whether the result is correct. That was enough to find the bug I was having with the stack handling, and I also found another bug that would have caused problems eventually. I guess I'll need to keep working on the tester as a side project from now on.
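To illustrate the kind of check described above, here is a minimal C sketch of a "did the opcode touch only what it should" test. It is not taken from the actual tester: the `Regs` structure, the `check_opcode` helper and the two toy handlers are all hypothetical names made up for this example, and memory checks are left out for brevity.

```c
#include <stdint.h>

/* Hypothetical register file of the emulated CPU (8 general 16-bit registers). */
enum { R_AX, R_CX, R_DX, R_BX, R_SP, R_BP, R_SI, R_DI, R_COUNT };

typedef struct {
    uint16_t r[R_COUNT];
} Regs;

typedef void (*OpHandler)(Regs *);

/* Returns a bit mask with bit i set when register i differs between the snapshots. */
unsigned changed_regs(const Regs *before, const Regs *after)
{
    unsigned mask = 0;
    for (int i = 0; i < R_COUNT; i++) {
        if (before->r[i] != after->r[i])
            mask |= 1u << i;
    }
    return mask;
}

/* Runs the handler on a known register state and returns 1 when no register
   outside the `allowed` mask was modified. It does not check the result value. */
int check_opcode(OpHandler handler, unsigned allowed)
{
    Regs before, after;
    for (int i = 0; i < R_COUNT; i++)
        before.r[i] = (uint16_t)(0x1111 * (i + 1));  /* distinct fill values */
    after = before;
    handler(&after);
    return (changed_regs(&before, &after) & ~allowed) == 0;
}

/* Toy handler: MOV BP,SP, which may only change BP. */
void op_mov_bp_sp(Regs *regs)
{
    regs->r[R_BP] = regs->r[R_SP];
}

/* Toy handler with a deliberate bug: adjusts SP but also clobbers BP. */
void op_buggy_push(Regs *regs)
{
    regs->r[R_SP] = (uint16_t)(regs->r[R_SP] - 2);
    regs->r[R_BP] = 0;
}
```

With these definitions, `check_opcode(op_mov_bp_sp, 1u << R_BP)` passes, while `check_opcode(op_buggy_push, 1u << R_SP)` fails because BP changed unexpectedly.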
I did some changes to the emulated stack handling, and my emulator stopped working. :-( I don't want to change the stack handling back to what it was, as the new code would be much more efficient, if it would just work. It will make stack handling and using the BP register to access the stack faster, and as shown by the opcode 0x8B profiler a few weeks ago, stack is accessed a lot in programs coded in higher level languages. I have been hunting down the problem for hours with no results. Argh.
I have many times in the past thought that a separate tester program might be useful, that is, a program that would just test my emulator by running all the opcodes and making sure they work as they should. I guess now would be a good time to start working on such a tester. It's just that the time spent coding a tester feels like time away from writing productive code, but having a tester program might end up saving a lot of time in situations like this.
My emulator speaks! Or rather, WC2 outputs digitized speech using an emulated SoundBlaster. Pretty neat. I had to add support for EMS memory before I could start working on SB support, as WC2 does not want to play digitized speech without loading it first into EMS. I also had to recode the whole virtual IRQ support, as earlier I only supported the timer IRQ, and now I need to have both Timer IRQ and SB DMA IRQ. That was a good thing though, as I will eventually have to add Keyboard IRQ as well.
The EMS memory support meant that I had to add a layer of virtualization between the physical NDS memory and the emulated PC memory. Earlier I had just reserved a 1MB block of .bss memory area for the PC memory, but adding EMS support meant adding a map of 16K pages for segment E000 which is used as an EMS page frame. To simplify things I use the new map for all segment addresses, the first 640K are mapped simply 1:1 between the segment and the .bss memory area, while the E000 segment is mapped (as requested from the LIM EMS handler using INT 67h) to an additional 256KB memory area (which could be increased to 1MB or so, after I find out how much memory the emulator itself will use). This also allowed me to reclaim some unused memory by mapping C000 and D000 segments to F000 segment, as programs should not expect either of those to contain anything specific. Yet another possible feature this change enables is to map A000 and B800 segments directly to NDS physical VRAM, the problem is just that the screen memory layout differs. So far I have just copied the data from the PC memory at A000 to NDS VRAM every other VBlank.
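The page map idea can be sketched in C roughly like this. This is only an illustration under my own assumptions, not the emulator's real code: the array names, `mem_init`, `ems_map` and `mem_ptr` are hypothetical, the A000-BFFF area is given its own buffer here just to keep the sketch simple, and the E000 frame falls back to the F000 storage until an EMS page is mapped in.

```c
#include <stdint.h>

#define PC_PAGE_SHIFT 14                          /* 16 KB pages */
#define PC_PAGE_SIZE  (1u << PC_PAGE_SHIFT)
#define PC_PAGE_COUNT 64                          /* 1 MB / 16 KB */
#define EMS_FRAME     (0xE0000u >> PC_PAGE_SHIFT) /* first page of segment E000 */
#define EMS_PAGES     16                          /* 256 KB of EMS memory */

static uint8_t conv_ram[640 * 1024];              /* conventional memory, mapped 1:1 */
static uint8_t video_ram[128 * 1024];             /* A000-BFFF (assumption of this sketch) */
static uint8_t bios_area[64 * 1024];              /* F000, shared by C000 and D000 */
static uint8_t ems_ram[EMS_PAGES * PC_PAGE_SIZE];
static uint8_t *page_map[PC_PAGE_COUNT];

/* Builds the default mapping for the whole 1 MB address space. */
void mem_init(void)
{
    for (unsigned p = 0; p < PC_PAGE_COUNT; p++) {
        uint32_t addr = (uint32_t)p << PC_PAGE_SHIFT;
        if (addr < 0xA0000u)
            page_map[p] = conv_ram + addr;
        else if (addr < 0xC0000u)
            page_map[p] = video_ram + (addr - 0xA0000u);
        else
            page_map[p] = bios_area + (addr & 0xFFFFu); /* C000/D000/E000 alias F000 */
    }
}

/* Maps EMS logical page `logical` into physical page 0..3 of the E000 frame,
   roughly what an INT 67h map request would end up doing. Returns 0 on success. */
int ems_map(unsigned physical, unsigned logical)
{
    if (physical >= 4 || logical >= EMS_PAGES)
        return -1;
    page_map[EMS_FRAME + physical] = ems_ram + logical * PC_PAGE_SIZE;
    return 0;
}

/* Translates a segment:offset address into a host pointer via the page map. */
uint8_t *mem_ptr(uint16_t seg, uint16_t off)
{
    uint32_t linear = (((uint32_t)seg << 4) + off) & 0xFFFFFu;
    return page_map[linear >> PC_PAGE_SHIFT] + (linear & (PC_PAGE_SIZE - 1));
}
```

The cost of this design is one extra table lookup per memory access, in exchange for being able to remap any 16K page without copying data.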
The Sound Blaster support was reasonably simple, I just needed to send the PC DMA and SB port commands from ARM9 to ARM7, and as those port commands contain the address and length of the memory buffer to play, ARM7 can then do all the rest and then send a virtual IRQ back to ARM9 when the buffer is done. This was easy to do by starting a timer on the ARM7 at the same time the playing is started. The only slight problem was that NDS uses signed samples while SB uses unsigned, so every byte needs to be xored with 0x80 before playing.
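The sample conversion mentioned above is tiny, but here is a C sketch of it for completeness. The function name is made up for this example, and the cast relies on the usual two's complement conversion that GCC performs on ARM.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts unsigned 8-bit PCM (0..255, silence at 0x80) into signed 8-bit PCM
   (-128..127, silence at 0) by flipping the top bit of every sample. */
void sb_unsigned_to_signed(const uint8_t *src, int8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* XOR with 0x80 is the same as subtracting 128 modulo 256. */
        dst[i] = (int8_t)(uint8_t)(src[i] ^ 0x80u);
    }
}
```

So the unsigned extremes 0x00 and 0xFF become -128 and 127, and the unsigned midpoint 0x80 becomes 0.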
I used the info in PC Game Programmer's Encyclopedia to code the SB support. This reference site has a good page about the SB DSP and DMA transfers to it.
Of course, immediately after I got launching into spaceflight to work, I wanted to profile it. The earlier profiling was done during animation clips, which is not so time critical as the actual spaceflying and fighting. So, I again started the profiler and waited (and waited and waited... With the profiler running it took 17 minutes to get airborne, even by clicking past all the dialogue in the opening intro!) to get into the Ferret and then let it run a little while, and then looked at the profiler data:
opcode | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
83 | 211089 | 19 | 42.14 | 8895382 | 3.7% | ??? r/m16,+imm8 |
E4 | 239960 | 25 | 37.54 | 9009158 | 3.7% | IN AL,imm8 = IN AL,40 = timer value |
E8 | 106334 | 24 | 85.01 | 9039490 | 3.8% | CALL near |
03 | 271098 | 19 | 37.19 | 10081216 | 4.2% | ADD r16,r/m16 |
D1 | 483468 | 21 | 28.34 | 13699627 | 5.7% | ROL/SHR r/m16,1 |
8A | 407838 | 18 | 33.73 | 13756593 | 5.7% | MOV r8,r/m8 |
8B | 1048576 | 17 | 34.99 | 36693613 | 15.3% | MOV r16,r/m16 |
Okay, so the simple 16-bit register loading from memory (or another register) is the most common and also most time-consuming operation. Next I wanted to see what exactly the game does when moving data to a 16-bit register, so I made a small change to the profiler to make it show the time distribution only for opcode 0x8B, grouped by the second opcode byte, the "modrm" byte:
modrm | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
D8 | 47790 | 17 | 30.32 | 1449120 | 4.6% | MOV BX,AX |
DE | 81049 | 17 | 19.70 | 1596567 | 5.1% | MOV BX,SI |
EC | 72516 | 17 | 24.02 | 1741828 | 5.6% | MOV BP,SP |
56 | 49083 | 25 | 35.84 | 1758508 | 5.6% | MOV DX,[BP+disp8] |
87 | 62434 | 24 | 38.42 | 2398440 | 7.6% | MOV AX,[BX+disp16] |
46 | 71935 | 25 | 39.18 | 2818487 | 9.0% | MOV AX,[BP+disp8] |
Well, this clearly shows how WC2 has been coded in a higher level language (in this case Borland C++). There is a lot of access to stack-based local variables and parameters, using the BP register, and MOV BP,SP, which initializes the stack frame at the start of a function, is the 4th most time-consuming modrm byte. Optimizing BP indexing and stack access in general would be a useful change. I just have to figure out some ways to do that.
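For reference, this is how the modrm bytes in the table break down into their fields. The C sketch below is only an illustration (the type and function names are made up), but the bit layout itself is the standard x86 one: 2 bits of mod, 3 bits of reg, 3 bits of r/m.

```c
#include <stdint.h>

typedef struct {
    unsigned mod;   /* 0..2 = memory operand (no/8-bit/16-bit displacement), 3 = register */
    unsigned reg;   /* destination register for opcode 0x8B */
    unsigned rm;    /* source register or addressing mode */
} ModRM;

/* Splits a modrm byte into its three fields. */
ModRM decode_modrm(uint8_t byte)
{
    ModRM m;
    m.mod = byte >> 6;
    m.reg = (byte >> 3) & 7;
    m.rm  = byte & 7;
    return m;
}

static const char *const reg16_names[8] = {
    "AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"
};

/* Name of a 16-bit register by its 3-bit encoding. */
const char *reg16_name(unsigned reg)
{
    return reg16_names[reg & 7];
}

/* Returns 1 when the operand is a memory access indexed by BP,
   that is [BP+SI], [BP+DI] or [BP] with a displacement. */
int modrm_uses_bp(ModRM m)
{
    if (m.mod == 3)
        return 0;                   /* register operand, no memory access */
    if (m.mod == 0 && m.rm == 6)
        return 0;                   /* special case: direct [disp16] address */
    return m.rm == 2 || m.rm == 3 || m.rm == 6;
}
```

For example, 0x46 decodes to mod=1, reg=0 (AX), rm=6, which is MOV AX,[BP+disp8], and 0xEC decodes to mod=3, reg=5 (BP), rm=4 (SP), which is MOV BP,SP.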
Yeehaa, launching into the first mission succeeds in WC2! I have also a rudimentary AdLib emulation using the PSG sounds (so it sounds completely awful), but still no keyboard emulation. I entered the callsign using DOSBox on my PC and then copied the savegame so I didn't need to input any characters. The NDS keys emulate a mouse. The problem is that the game keeps dropping to the debugger whenever I do something new during the flight (like turn!) as there are still a lot of new opcodes (or rather new addressing modes for the already partially supported opcodes) that I haven't coded in yet.
Now WC2 progresses up to a point where I should input a call sign, which is rather difficult without a keyboard. I should either start working on keyboard emulation (but it still seems boring), or perhaps I could look into emulating soundcards instead. DOSBox has sources for several AdLib emulation codes, so I guess I'll look into that next. I can then again return to working on No$GBA, as I can get back to the beginning of WC2, since music is among the first files it loads.
An easy first step might be to simply use the PSG sounds of NDS, trying to send the data from the emulator core running on ARM9 to a sound playing code running on ARM7. I haven't yet done anything on the ARM7 side, perhaps this would be a good time to start learning that as well.
Okay, I got the first interesting results from profiling the emulator. First, here are the top 10 opcodes by total ticks (counted as 33MHz timer ticks spent handling the opcode, including the overhead caused by starting the timer before handling the opcode and stopping it afterwards). The profiling was stopped when the REP opcode reached about 1 million hits.
opcode | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
2B | 418487 | 22 | 26 | 10894717 | 2.7% | SUB r16,r/m16 |
AC | 613027 | 18 | 19 | 11768748 | 2.9% | LODSB |
D0 | 447684 | 26 | 27 | 12210447 | 3.0% | RCL/SHR r/m8,1 |
3B | 490243 | 22 | 29 | 14385058 | 3.5% | CMP r16,r/m16 |
75 | 896253 | 18 | 18 | 16192888 | 3.9% | JNE |
03 | 733803 | 20 | 25 | 18584227 | 4.5% | ADD r16,r/m16 |
2E | 1568167 | 18 | 18 | 28795863 | 7.0% | CS: |
D1 | 1081387 | 24 | 27 | 29404670 | 7.2% | ROL/SHR r/m16,1 |
8B | 1583198 | 20 | 25 | 39609529 | 9.7% | MOV r16,r/m16 |
F3 | 1048576 | 22 | 61 | 64246602 | 15.7% | REP |
Well, the fact that repeated string instructions take the most time is not surprising, considering that I haven't yet optimized them at all, even REP MOVSW is done one byte at a time, to avoid alignment problems. It is good to see that future optimization there will really boost the performance.
The profiler data during the phase when WC2 scrolls the intro starfield up looks a bit different:
opcode | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
F3 | 1048576 | 22 | 46 | 48625989 | 4.3% | REP |
D0 | 2070151 | 26 | 26 | 54433520 | 4.8% | RCL/SHR r/m8,1 |
26 | 3422702 | 18 | 18 | 62357528 | 5.5% | ES: |
74 | 3931171 | 18 | 18 | 70965940 | 6.2% | JE |
AE | 3228014 | 24 | 24 | 77863746 | 6.8% | SCASB |
38 | 3214596 | 27 | 27 | 87465174 | 7.7% | CMP r/m8,r8 |
75 | 5601918 | 18 | 18 | 101154087 | 8.9% | JNE |
In this case most of the time is spent in various comparisons, and the repeated string opcode is only the 7th slowest by total time. It is interesting that the count of JNE opcodes is over 5 times the count of all the REP commands together!
Next I made a quick hack to the profiling code so that I could see more data for the REP opcodes. I did not feel like changing the profiler very much, so I just simply made it skip all other opcodes besides F3, and when it sees an F3, it stores two events, one with the second byte of the opcode (to see which REP version was used), and another event where the opcode is CL register value instead of the opcode. That way I could see also if it would be worthwhile to code special handling for certain number of bytes to transfer. The results looked like this:
opcode | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
A4 | 113144 | 21 | 26.62 | 3011422 | 7.4% | REP MOVSB |
AB | 164129 | 31 | 44.56 | 7313254 | 18.0% | REP STOSW |
A5 | 114102 | 22 | 76.89 | 8773209 | 21.6% | REP MOVSW |
AA | 657201 | 21 | 32.66 | 21465240 | 53.0% | REP STOSB |

CL-value | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
4 | 11926 | 35 | 51.60 | 615394 | 1.6% | 4-byte/word transfer |
3 | 18326 | 35 | 49.91 | 914597 | 2.2% | 3-byte/word transfer |
60 | 1181 | 59 | 936.27 | 1105737 | 2.8% | 60-byte/word transfer |
2 | 36706 | 32 | 48.41 | 1777028 | 4.4% | 2-byte/word transfer |
160 | 5041 | 126 | 534.75 | 2695674 | 6.6% | 160-byte/word transfer |
0 | 192101 | 21 | 22.74 | 4368378 | 10.8% | 0-byte transfer |
1 | 738037 | 26 | 34.55 | 25496921 | 62.8% | 1-byte transfer |
What is interesting, is that by far the most time is spent in a 1-item transfer, and nearly all the time is spent transferring either 0, 1 or 160 items (160 items is most likely 320 bytes, one screen row)! Also, transferring 0 bytes, which is effectively a NOP, takes 10.8% of the total time! The high number of 0-byte transfers is explained by the fact that moving an unknown number of bytes is usually done like this:
    SHR CX,1
    REP MOVSW
    ADC CX,CX
    REP MOVSB
The huge number of 1-byte transfers, with a similarly high number of REP STOSB opcodes, must be some peculiarity of WC2. But in any case, this screams for a special handling for cases where CX=0 or CX=1. This change, along with minor overall optimizing of REP handling, resulted in the following table:
opcode | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
A4 | 113041 | 18 | 27.33 | 3088919 | 8.5% | REP MOVSB |
AB | 164220 | 26 | 49.33 | 8101263 | 22.6% | REP STOSW |
A5 | 114043 | 18 | 71.63 | 8168857 | 22.8% | REP MOVSW |
AA | 657272 | 18 | 25.19 | 16558805 | 46.1% | REP STOSB |

CL-value | count | min ticks | avg ticks | total ticks | % of total | command |
---|---|---|---|---|---|---|
0 | 192091 | 18 | 18.40 | 3534841 | 9.8% | 0-byte transfer |
1 | 737622 | 25 | 27.92 | 20592186 | 57.4% | 1-byte transfer |
Okay, it looks better already, with over 5 million ticks shaved off from the totals. I still need to optimize the REP handling in general so that it does not do it one byte at a time, but this shall be enough for now. It is very useful to have a profiling tool, though.
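As an illustration of the special-casing discussed above, here is a C sketch of a REP STOSB handler with fast paths for CX=0 and CX=1. It is not the emulator's actual code: the `Cpu` structure and function names are hypothetical, the direction flag is assumed to be clear, and `es_base` is assumed to point to a full 64K of host memory. The small `split_count` helper just shows why the SHR/ADC idiom always leaves 0 or 1 in CX for the final REP MOVSB.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical slice of the emulated CPU state needed for REP STOSB. */
typedef struct {
    uint16_t cx;        /* repeat count */
    uint16_t di;        /* destination offset */
    uint8_t  al;        /* byte to store */
    uint8_t *es_base;   /* host pointer to the start of the 64K ES segment */
} Cpu;

/* Mirrors SHR CX,1 / ADC CX,CX: n bytes become n/2 words plus 0 or 1 bytes. */
void split_count(uint16_t n, uint16_t *words, uint16_t *bytes)
{
    *words = n >> 1;
    *bytes = n & 1;
}

/* REP STOSB with the direction flag assumed clear. */
void rep_stosb(Cpu *cpu)
{
    if (cpu->cx == 0)               /* effectively a NOP, so leave immediately */
        return;
    if (cpu->cx == 1) {             /* the most common case in the profile */
        cpu->es_base[cpu->di] = cpu->al;
        cpu->di = (uint16_t)(cpu->di + 1);
        cpu->cx = 0;
        return;
    }
    /* General case: fill in at most two chunks, split where DI wraps at 64K. */
    uint32_t count = cpu->cx;
    uint32_t first = 0x10000u - cpu->di;
    if (first > count)
        first = count;
    memset(cpu->es_base + cpu->di, cpu->al, first);
    memset(cpu->es_base, cpu->al, count - first);
    cpu->di = (uint16_t)(cpu->di + count);
    cpu->cx = 0;
}
```

The early exits keep the two dominant cases away from the setup cost of the general loop, which is where most of the saved ticks would be expected to come from.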
Adding support for libfat turned out to be easier than I had thought, so I got WC2 to progress further, up to the intro conversation between the Kilrathi emperor and the prince. However, this conversation progresses reeeaallyyy slooooow, like 5 minutes for each row of dialogue. Something strange is happening here, so I am coding a profiler feature into the emulator, so that I can check how long each opcode takes and how many times they are called in WC2.
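The bookkeeping side of such a profiler is simple. Below is a minimal C sketch, assuming the tick count for each opcode has already been measured somehow (on the NDS that would come from a hardware timer, which is not shown here). The `OpStat` structure and the function names are made up for this example.

```c
#include <stdint.h>

/* Per-opcode statistics: how often it ran and how long it took. */
typedef struct {
    uint32_t count;
    uint32_t min_ticks;
    uint64_t total_ticks;
} OpStat;

static OpStat op_stats[256];

/* Records one execution of `opcode` that took `ticks` timer ticks. */
void profile_record(uint8_t opcode, uint32_t ticks)
{
    OpStat *s = &op_stats[opcode];
    if (s->count == 0 || ticks < s->min_ticks)
        s->min_ticks = ticks;
    s->count++;
    s->total_ticks += ticks;
}

/* Average ticks per execution, or 0 if the opcode was never seen. */
double profile_avg(uint8_t opcode)
{
    const OpStat *s = &op_stats[opcode];
    return s->count ? (double)s->total_ticks / (double)s->count : 0.0;
}

/* Share of the total measured time spent in `opcode`, in percent. */
double profile_percent(uint8_t opcode)
{
    uint64_t sum = 0;
    for (int i = 0; i < 256; i++)
        sum += op_stats[i].total_ticks;
    return sum ? 100.0 * (double)op_stats[opcode].total_ticks / (double)sum : 0.0;
}
```

Sorting the 256 entries by `total_ticks` then gives a list of the slowest opcodes, with count, minimum, average and percentage for each.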
Ever since I started working on emulating WC2, I have just been adding the files it reads as binary files directly into the project and mapped them into memory, so that I can emulate DOS INT 21h file I/O by just copying data from memory to the emulated PC RAM area. I have done this mainly so that I can test the emulator on No$GBA without having to copy the data to real hardware over and over, or doing other tricks to get No$GBA to load data from disk. However, now I already have over 2 megabytes of files included, and the next file it needs is about 500KB, which together with the emulator itself and 1MB of PC RAM (640K plus BIOS areas) would take the result over the maximum of 4MB.
I guess I have no alternative but to start using libfat next, and stop relying on No$GBA for everything.
Okay, now the emulator runs WC2 up to showing a "Start New Game" button. The only problem is, I have not emulated any input devices (keyboard or mouse), so I can not tell it to continue!
I guess I'll start by emulating a mouse, as it uses simply INT 33h calls and can be made quickly. Emulating a keyboard would be a much bigger issue, with all the hardware emulation, INT 9h, and INT 16h handling stuff. It sounds much more boring than continuing work with the CPU opcodes.
Finally I got Wing Commander 2 to show something on the graphics screen in my emulator! I have been adding opcode after opcode for the last 2 weeks, and many times I have felt that I must be close to the code where WC2 switches to graphics mode and draws something to the screen, but every time there was some new DOS call or timer interrupt stuff (or Borland C++ Overlay Manager or some such that loads code on-demand using INT 3F, which makes debugging my emulator a pain!) to do instead of the interesting graphics emulation that I have been looking forward to.
But now finally, the Origin intro with the orchestra gets displayed! This is a huge motivation boost, perhaps my emulator will actually work eventually. And after a little bit of more work, the whole animated intro gets displayed, until it attempts to locate a savegame or some such. Okay, back to coding!
Annoying.. Wing Commander 2 loads itself for a while, and then just prints "Sorry, there was a problem starting Wing Commander 2. Error code 020". Now I have to figure out what it is the game does not like in my virtual configuration. Perhaps something to do with the timer (or more specifically, setting the PC timer to run at a different speed, which I don't yet support) or some other hardware difference. I guess I just have to keep debugging it, at least I can run it simultaneously in debug.exe inside DOSBox and see if my emulated registers and flags differ.
UPDATE: Ah, found the problem! A bit silly, actually... I had just copy-pasted a BIOS configuration table contents at F000:E6F5 from my real PC to my emulator, without much thought about what data it contains, and one bit there states that "INT 15/AH=4Fh called upon INT 09h". WC2 did not like it when I had my INT 15 vector pointing to 0000:0000. Well, I just had to change that vector to point to my generic IRET opcode at F000:0000 and WC2 was happy. :-)
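In C terms the fix amounts to something like the sketch below. The `pc_mem` array and the function names are assumptions made for this example. The facts it relies on are standard for real-mode x86: the interrupt vector table starts at 0000:0000 with 4 bytes per vector (offset first, then segment, both little-endian), and IRET is opcode 0xCF.

```c
#include <stdint.h>

#define IRET_OPCODE 0xCF

static uint8_t pc_mem[0x100000];    /* 1 MB emulated address space */

/* Converts segment:offset into a linear address within the first megabyte. */
static uint32_t linear(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFu;
}

/* Writes the far pointer seg:off into the interrupt vector table entry. */
void set_int_vector(uint8_t int_no, uint16_t seg, uint16_t off)
{
    uint8_t *v = &pc_mem[(uint32_t)int_no * 4];
    v[0] = (uint8_t)(off & 0xFF);
    v[1] = (uint8_t)(off >> 8);
    v[2] = (uint8_t)(seg & 0xFF);
    v[3] = (uint8_t)(seg >> 8);
}

/* Places a generic IRET at F000:0000 and points INT 15h at it, so a program
   that calls INT 15h simply returns instead of jumping to 0000:0000. */
void install_default_handlers(void)
{
    pc_mem[linear(0xF000, 0x0000)] = IRET_OPCODE;
    set_int_vector(0x15, 0xF000, 0x0000);
}
```

Any other vector that has no real handler yet could be pointed at the same IRET in the same way.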
One of the few PC games I have purchased (along with the original PC version of Elite, which was the first ever game I purchased, back in 1988 I think, and I played it for months and months!) is Wing Commander II by Origin. I just found it in my bookshelf, and decided to see if it still works (in DOSBox on my PC). I installed it from the original 720K floppy disks (all 14 of them!), all the time being worried that there would be data errors, but surprisingly all the disks worked fine, even after almost 20 years!
Wing Commander 2 has intriguing system requirements:
    IBM: 100% compatible 386 or 486 PC system or 286 with VGA or EGA and no speech
    REQUIRED: 640K; 12+ MHz, hard drive with 12-21 MB free
    RECOMMENDED: Dos 5.0; expanded memory (2MB); 16+ MHz; joystick or mouse
    GRAPHICS: 256-color VGA/MCGA; EGA
    MUSIC/SOUND EFFECTS (optional): Sound Blaster, Ad Lib, or Roland
    DIGITIZED SPEECH (optional): Sound Blaster, or 100% compatible digitized sound board
Those requirements are pretty much at the very high end of what can reasonably be expected to run on an emulator, so if I get Wing Commander 2 to run (even if slowly) on my emulator, I shall consider it a success. Next step is to code an EXE file loader, as until now I have only had a COM file loader for LW2.COM.
Okay, my tiny LW2.COM is working inside my emulator! I had to code an emulated graphics mode and setting of VGA palette registers using direct out dx,al opcodes, but now finally the launch image is drawn! Then it drops back to debugger because of a yet unsupported opcode, but it is nice to see the output of a DOS program on an NDS screen!
Okay, it was time to use Google to see if someone else had run into a similar problem, and what do you know, ARM has a different meaning for the Carry flag with sub and cmp opcodes to the x86 convention I am familiar with! After some more studying I found the info also in the ARM ARM (ARM Architecture Reference Manual), I just hadn't noticed it as it hadn't occurred to me that there *could* be different conventions!
Anyways, problem solved by always complementing the carry flag after sub and cmp opcodes. A bit annoying because of the extra CPU cycles that takes, but at least I can then forget about this difference in all other opcodes.
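The difference is easy to model in C. The sketch below is a simplified 16-bit model, not emulator code, and the function names are made up. It computes the subtraction the way an adder does it (a plus the complement of b plus 1): the carry out of that addition is the ARM-style flag, and the x86 flag is its inverse.

```c
#include <stdint.h>

/* ARM-style carry after a - b: the carry out of a + ~b + 1.
   It is 1 when NO borrow occurred (a >= b as unsigned values). */
int arm_style_carry_sub(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint16_t)~b + 1u;
    return (int)((sum >> 16) & 1u);
}

/* x86-style carry after a - b: 1 when a borrow DID occur (a < b as unsigned). */
int x86_style_carry_sub(uint16_t a, uint16_t b)
{
    return a < b;
}

/* The fix described above: complement the ARM carry to get the x86 carry. */
int x86_carry_from_arm(int arm_carry)
{
    return arm_carry ^ 1;
}
```

For 5 - 3 the ARM-style carry is 1 and the x86 carry is 0, while for 3 - 5 it is the other way around.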
I got LineWarsDS released!
Next I would like to learn ARM Assembly programming, so I have been toying with various possible projects the last week or so. An emulator sounds interesting, and the emulators that do not yet exist for NDS and would interest me are a VIC-20 emulator (my first computer!) or a PC emulator (sounds like a big project). I am leaning towards the PC emulation, as that would probably be much more useful and interesting (and challenging!).
I think an x86 emulator for Nintendo DS should be doable, after all, NDS has a 66MHz 32bit processor, 4MB of RAM and sufficiently colorful screen to emulate a VGA 320x200 256-color mode. Since emulating a different processor architecture will make it run significantly slower than 66MHz, though, I'll be targeting an 80286 processor, at least initially.
It is curious that in the GBADev forum many have mentioned either interest for an x86 emulator, hope that someone will make such a thing, or announced that they are working on it, yet there does not seem to be any even half-finished versions released. Well, at least that means I won't be reinventing the wheel, if I happen to get my emulator released.
I'll start by trying to emulate my old LW2.COM launcher program for LineWars II, as it is only about 10KB in size, and does little else besides showing a PCX file on the 320x200 256-color graphics screen. That should be a nice, simple starting point for my emulator.