Sorry, no new release of DSx86 today, as I have only been working on DS2x86 for the past two weeks. This porting work is progressing nicely, and over half of the opcodes have been ported over to MIPS ASM. I have to mention, though, that the opcodes so far have been the easy ones (except the BCD opcodes); the more difficult opcodes like the string operations, shifts, INT and IRET, and port I/O are still ahead. These will take more time, and some of them will need some interfacing to the underlying hardware, so I cannot simply port them over from the ARM ASM code.
I am currently at opcode 0x8C, which is the mov r/m16,Sreg opcode, that is, moving a value from the segment register to memory or register. The problem above was caused by my tester code not yet supporting the FS and GS segment registers, while the CPU emulation already does this. So, every now and then I need to fix my tester program instead of the emulation code. :-)
Practically at the same time I started porting the opcode handlers from ARM ASM to MIPS ASM, I started thinking of ways to handle the Lazy Flags with the least amount of slowdown possible. Yesterday I figured out a method that is a little bit faster than the way I had when I started, so I spent a couple of hours refactoring all the opcodes I had already coded to use the new method. Too bad this did not occur to me earlier, but it is to be expected that I need to recode some parts of the code several times as I am still only learning the tricks in MIPS ASM.
I again used the DOSBox sources, together with the nice description at a www.emulators.com blog post, to figure out how the lazy flags need to work. There are six flags that change after each arithmetic operation in the x86 architecture, some of which are simple and some more difficult to determine after the operation. The flags are Carry, Parity, Adjust, Zero, Sign and Overflow.
The simple flags are Zero, Sign and Parity. The Zero flag is set if the result was zero, the Sign flag is set if the highest bit of the result was set, and the Parity flag can be set by a 256-item lookup table based on the low byte of the result. These three flags behave the same way in all opcodes (that change flags), so they can be determined simply from the result of the last operation. The other three flags behave differently in different opcodes, so based on the calculation operations in the DOSBox sources I compiled a list of the different cases, to see how these need to be handled. DOSBox names the result and operands lf_resd, lf_var1d and lf_var2d (for doubleword operands), and I named them lf_res, lf_val1 and lf_val2 in my code.
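The three simple flags described above can be sketched in C as follows. This is only an illustration of the idea, not the DS2x86 code (which is MIPS ASM); the names parity_table, init_parity_table, zero_flag, sign_flag and parity_flag are made up for this example, and lf_res is assumed to be already masked to the operand size.

```c
#include <stdint.h>

/* Hypothetical 256-entry table: 1 when the byte has an even number of set bits,
 * which is when the x86 Parity flag is set. */
static uint8_t parity_table[256];

static void init_parity_table(void)
{
    for (int value = 0; value < 256; value++) {
        int ones = 0;
        for (int bit = 0; bit < 8; bit++)
            ones += (value >> bit) & 1;
        parity_table[value] = (uint8_t)((ones & 1) == 0);
    }
}

/* Zero flag: the result was zero. */
static int zero_flag(uint32_t lf_res)
{
    return lf_res == 0;
}

/* Sign flag: the highest bit of a 'bits'-wide result. */
static int sign_flag(uint32_t lf_res, int bits)
{
    return (int)((lf_res >> (bits - 1)) & 1u);
}

/* Parity flag: table lookup on the low byte of the result only. */
static int parity_flag(uint32_t lf_res)
{
    return parity_table[lf_res & 0xFFu];
}
```

Since all three depend only on the result, remembering lf_res is enough to compute them whenever they are actually needed.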
Based on these lists, it seemed to me that the Carry flag will be the most difficult and time-consuming to calculate. Besides the obvious conditional jump opcodes, there are many other opcodes (ADC, SBB, RCL, RCR, CMC) that need the current Carry flag value as their input. Also the shift opcodes change and use the Carry flag in various ways, so it seemed to me that using a switch statement -style code to calculate the Carry flag lazily whenever it is needed will really slow down those operations. So, I decided to see how much extra code I would need if I went for a direct Carry flag calculation in each of the opcodes. It turned out that most of the times it only takes one ASM operation to calculate the Carry flag after the operation, so this is how I currently handle the Carry flag.
I also noticed that if I calculate the Carry flag separately, I can fake the lf_val1 and lf_val2 values in opcodes like INC and DEC to give me the correct Adjust flag value when using the same calculation code as the normal ADD/SUB opcodes use. So I was able to simplify the Adjust flag calculation to the one case: ((lf_val1 ^ lf_val2) ^ lf_res) & 0x10. This just left the Overflow flag which needs separate cases for each opcode type. I use one of the MIPS general purpose registers to keep track of the last opcode type, along with registers for the last result and operands, so that the Overflow flag can be calculated lazily whenever needed. I hope to figure out some speedups for this as well, but for now it will have to do.
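Here is a minimal C sketch of the Adjust and Overflow flag handling described above. The Adjust flag expression is the one quoted in the text; the Overflow expressions for ADD and SUB follow the usual sign-bit formulas (as used in DOSBox, as far as I know). The struct, the enum and the record_inc8/record_dec8 helpers are hypothetical names for illustration only.

```c
#include <stdint.h>

/* Hypothetical operation types; the real code keeps a type value in a register. */
enum of_type { OF_TYPE_ADD, OF_TYPE_SUB };

struct lazy_flags {
    uint32_t lf_res;   /* result of the last operation, masked to operand size */
    uint32_t lf_val1;  /* left operand */
    uint32_t lf_val2;  /* right operand */
    enum of_type type; /* which Overflow formula applies */
    int bits;          /* operand size: 8, 16 or 32 */
};

/* Adjust flag: the single case from the text, carry or borrow out of bit 3. */
static int adjust_flag(const struct lazy_flags *lf)
{
    return (((lf->lf_val1 ^ lf->lf_val2) ^ lf->lf_res) & 0x10) != 0;
}

/* Overflow flag: calculated lazily, with a separate case per operation type. */
static int overflow_flag(const struct lazy_flags *lf)
{
    uint32_t sign = 1u << (lf->bits - 1);
    uint32_t v1 = lf->lf_val1, v2 = lf->lf_val2, res = lf->lf_res;

    switch (lf->type) {
    case OF_TYPE_ADD: /* operands had the same sign, result has a different one */
        return (((v1 ^ v2 ^ sign) & (res ^ v2)) & sign) != 0;
    case OF_TYPE_SUB: /* operands had different signs, result sign differs from v1 */
        return (((v1 ^ v2) & (v1 ^ res)) & sign) != 0;
    }
    return 0;
}

/* INC recorded as "value + 1", so the ADD formulas can be reused. */
static struct lazy_flags record_inc8(uint8_t value)
{
    struct lazy_flags lf = { .lf_res = (uint8_t)(value + 1), .lf_val1 = value,
                             .lf_val2 = 1, .type = OF_TYPE_ADD, .bits = 8 };
    return lf;
}

/* DEC recorded as "value - 1", so the SUB formulas can be reused. */
static struct lazy_flags record_dec8(uint8_t value)
{
    struct lazy_flags lf = { .lf_res = (uint8_t)(value - 1), .lf_val1 = value,
                             .lf_val2 = 1, .type = OF_TYPE_SUB, .bits = 8 };
    return lf;
}
```

Faking the second operand as 1 only works because the Carry flag is handled separately; INC and DEC must leave Carry untouched, which a shared lazy Carry calculation would not do.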
To show an example of the actual opcode handling and what the Lazy Flag handling requires, here is the handler for ADC r/m8,r8 opcode when the left operand is a memory address. In DS2x86 I decided to have #defines for all the registers I use for emulation, so I don't need to remember which MIPS register was which. I did not do this in DSx86, and that caused some wrong register usage from time to time.
.macro adc_effseg_reg8l reg
    get_CF_into t3 // t3 = Carry flag value
    li lf_type, OF_CALC_ADD | 24 // Remember the operation type and shift value for Lazy Flags
    lbu lf_val1, 0(eff_seg) // Load the left operand from RAM
    andi lf_val2, \reg, 0xFF // Remember the right operand for Lazy Flags
    addu t3, lf_val1 // t3 = lf_val1 + Carry
    addu lf_res, t3, lf_val2 // lf_res = lf_val1 + Carry + lf_val2
    srl t0, lf_res, 8 // t0 = Carry value
    sb lf_res, 0(eff_seg) // Save the result to RAM
    andi lf_res, 0xFF // Remember only the low 8 bits for Lazy Flags
    j set_carry_from_t0 // Back to loop
.endm

The get_CF_into macro looks like the following. It is a macro so that I can later change how the Carry flag is calculated without having to change all the code that uses it (just in case I still need to revert back to lazy calculation of the Carry flag). The set_carry_from_t0 code is immediately before the opcode loop handler, as many opcodes jump there to store the t0 register value back into the lowest bit of the flags register. When calculating the Carry flag immediately, Carry is simply bit 8 of the result (the bit just above the low byte), so I can just shift it to the lowest bit of the t0 register and don't need to handle the complex ((unsigned)lf_res < (unsigned)lf_val1) || (lflags.oldcf && (lf_res == lf_val1)) algorithm at all!
.macro get_CF_into reg
    andi \reg, flags, 1
.endm
As you can see from this code, even just remembering the result and operands for later calculation of Lazy Flags takes a lot of code; in this case 4 of the 10 ASM operations are there just to get the later flags calculation to give the correct result. When coding in ARM ASM I did not need any of these, as the ARM can keep track of the flags by itself. Thus, DS2x86 will not be as much faster than DSx86 as the difference in the CPU clock speeds would make you think.
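To illustrate why the immediate Carry calculation is so cheap, here is a small C model of the 8-bit ADC described above. It is only a sketch: adc8, adc8_carry_lazy and adc8_models_agree are hypothetical names, and the lazy formula is the one quoted in the text.

```c
#include <stdint.h>

struct adc8_result {
    uint8_t value; /* low 8 bits, what gets stored back to memory */
    int carry;     /* new Carry flag */
};

/* Immediate calculation: add in a wider type, Carry is bit 8 of the sum
 * (this corresponds to "srl t0, lf_res, 8" in the MIPS handler). */
static struct adc8_result adc8(uint8_t left, uint8_t right, int carry_in)
{
    uint32_t sum = (uint32_t)left + (uint32_t)(carry_in & 1) + (uint32_t)right;
    struct adc8_result r;
    r.carry = (int)((sum >> 8) & 1u);
    r.value = (uint8_t)sum;
    return r;
}

/* Lazy calculation: recover Carry later from the operand, result and old Carry. */
static int adc8_carry_lazy(uint8_t lf_val1, uint8_t lf_res, int oldcf)
{
    return (lf_res < lf_val1) || (oldcf && (lf_res == lf_val1));
}

/* Exhaustive check that both methods agree for every input combination. */
static int adc8_models_agree(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            for (int c = 0; c < 2; c++) {
                struct adc8_result r = adc8((uint8_t)a, (uint8_t)b, c);
                if (r.carry != adc8_carry_lazy((uint8_t)a, r.value, c))
                    return 0;
            }
    return 1;
}
```

Both methods give the same Carry value, but the immediate version needs only a single shift, while the lazy version needs two comparisons plus the remembered old Carry.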
I got my free DSTwo flash cart last Monday, so I have been working on DS2x86 ever since. It took me all of Monday evening to get the DS2x86 framework (tester routine calling the MIPS ASM CPU emulation code, and it returning properly back to the tester routine) to run without crashing, so on Tuesday evening I was able to start working on the actual CPU opcode emulation with MIPS ASM language. I am actually using a strict TDD (Test Driven Development) coding technique when working on DS2x86. With DSx86 I usually coded something, then tested it with some test games, and only if that failed I coded more thorough tests. With DS2x86 I implement the test routines (or improve the old tests I used in DSx86) first, and only then start coding the actual opcode handlers. I do this because the MIPS ASM language is very unfamiliar to me, and also because I am now using the Lazy Flags approach, so I can no longer use the ARM CPU to calculate the correct x86 CPU flags for me.
By Saturday morning I had implemented the first four opcodes 0x00..0x03 (the various ADD opcode versions). Each of these actually has 256 different modrm bytes for all the different memory address modes, and as there are 6 different segment override possibilities, plus the case with no segment override, each of these opcodes actually has 7*256 different cases. My test routine runs each of these cases with random input values, and tests for correct results and correct emulated CPU flags state after the run. So in fact I had over 7000 different cases, including their unit tests, coded and tested by Saturday morning. Pretty good progress I think, considering I was only able to work on it for a few hours every evening.
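The case count above can be reproduced with a small C sketch that enumerates every prefix and modrm combination. This is not the actual tester code; build_case, count_cases and the NO_PREFIX marker are hypothetical, and the displacement bytes that some modrm values require are left out for brevity. The six prefix bytes are the standard x86 segment override prefixes (ES, CS, SS, DS, FS, GS).

```c
#include <stddef.h>
#include <stdint.h>

#define NO_PREFIX 0x00 /* hypothetical marker meaning "no segment override" */

/* No override, then the ES, CS, SS, DS, FS and GS override prefix bytes. */
static const uint8_t segment_prefixes[7] = {
    NO_PREFIX, 0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65
};

/* Write the code bytes of one test case into 'out' and return their count. */
static size_t build_case(uint8_t *out, uint8_t prefix, uint8_t opcode, uint8_t modrm)
{
    size_t length = 0;
    if (prefix != NO_PREFIX)
        out[length++] = prefix;
    out[length++] = opcode;
    out[length++] = modrm;
    return length;
}

/* Count the cases for the first 'opcode_count' opcodes (0x00, 0x01, ...). */
static unsigned count_cases(unsigned opcode_count)
{
    unsigned cases = 0;
    uint8_t code[3];

    for (unsigned op = 0; op < opcode_count; op++)
        for (size_t p = 0; p < sizeof segment_prefixes; p++)
            for (unsigned modrm = 0; modrm < 256; modrm++) {
                build_case(code, segment_prefixes[p], (uint8_t)op, (uint8_t)modrm);
                cases++; /* a real tester would run and verify the case here */
            }
    return cases;
}
```

For the four ADD opcodes this gives 4 * 7 * 256 = 7168 cases, which matches the "over 7000" figure.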
The images above are screen captures from the default SDK console, using screen capture code by BassAceGold, another SDK tester (thanks for the code!). The first one is from Saturday morning, with the first four opcodes working and opcode 0x04 stopping with an invalid result (0x59+0x1C=0x1C). The second one is the current (Sunday midday) situation. I have skipped opcode 0x0F, as it contains a lot of different 386 opcodes, and also the 386 versions of the already coded opcodes are mostly missing. The Lazy Flags handling does not yet calculate the correct Overflow flag for SUB/SBB opcodes, so that too still needs work. I have the 386-specific FS and GS segment registers already supported, and also the immediate 32-bit versions like ADD EAX,0x1234567 are already coded and tested. But since my first priority is to get Norton Sysinfo running, I can leave supporting most of the actual 386 opcodes for later.
I think the font used by the SDK is not the best possible, especially the small letter g is pretty unclear, but this is certainly good enough for debug printing. The MIPS CPU is set to run at 120MHz by default (I believe that is the lowest speed it can run), which is fine while coding the tester program. There is an API call in the SDK to change the CPU speed, up to 396MHz I believe, so I am thinking of adding a configuration option to DS2x86 where the user can choose the CPU clock speed. Running at higher speeds will probably drain the battery pretty fast, so always running at the highest speed is not a good idea.
Anyways, I am already at opcode 0x27 DAA. This is one of the weird BCD (Binary Coded Decimal) opcodes that use the Adjust Flag (which is not properly supported in DSx86). Now with DS2x86 I can support the Adjust Flag properly, so I can finally code the DAA opcode so that it will always give correct results. Even though this opcode is very rarely used, it will be good to have it working correctly. :-)
This version has only minor fixes, as I have been busy with other things (work-related stuff and the SuperCard DSTwo version of DSx86). The changes in this version are the following:
I also spent several hours tracing the null pointer jump problem in the Superhero Legend of Hoboken game, but could not fix it yet. I found out that the problem is caused by a routine overwriting data in the stack, so that the routine then returns to address 0000:0000. This same routine is used without problems hundreds of times before it fails, so tracing the actual reason for the failure is pretty difficult and time-consuming.
Not a lot of changes, and as I am moving my focus from DSx86 to the new SuperCard DSTwo version, at least for the time being, it is possible that DSx86 itself will progress very slowly for a while. I will possibly increase the two-week release cycle, especially if I have not had time to do any worthwhile improvements.
I have not yet received the free DSTwo card that SuperCard has sent me, so I have not been able to fully start coding for it yet. I have been learning MIPS assembly language and have started converting some of the ASM macros I have used in DSx86 to MIPS ASM for DS2x86, though. I am using mostly the same ideas that I have used in DSx86, but will include 386/486 opcodes from the start, and will switch from using the CPU flags directly to a Lazy Flags-type approach. MIPS has so many general purpose registers that I believe I can fit the lazy flags into registers, which should make the code run reasonably fast. Not as fast as the ARM version, obviously, but on the other hand the MIPS processor has quite a bit higher clock speed.
I hope the DSTwo card arrives next week!
Last week I did not have any time to work on DSx86, as I had various work-related things to do and also other "real life" issues that I had postponed during my summer vacation. Yesterday I finally began working on DSx86 again, and found the bug that caused the memory error in Maupiti Island. I had coded a shortcut for the memory allocation routines when launching an EXE program, and my shortcut did not handle the EXE header MaxAlloc field properly; it always allocated the maximum amount of memory. Normally all programs want the maximum amount of memory, but the TSR program in Maupiti Island was the first program that had the required amount of memory in the EXE header and did not adjust the memory allocation itself. After I fixed that routine the game loaded fine and seems to work.
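As a rough illustration of the MaxAlloc handling, here is a simplified C model of how a DOS-style loader might pick the size of the memory block for an EXE. It is a sketch under assumptions, not the DSx86 routine: the function name and parameters are hypothetical, all sizes are in 16-byte paragraphs, and details such as the PSP size and the special MaxAlloc = 0 case are ignored.

```c
#include <stdint.h>

/* Return the number of paragraphs to allocate for the program,
 * or 0 when even the minimum requirement does not fit.
 *   image_paras  = size of the load image
 *   min_alloc    = EXE header MinAlloc field (extra paragraphs required)
 *   max_alloc    = EXE header MaxAlloc field (extra paragraphs wanted)
 *   largest_free = size of the largest free memory block */
static uint32_t exe_alloc_paragraphs(uint32_t image_paras, uint16_t min_alloc,
                                     uint16_t max_alloc, uint32_t largest_free)
{
    uint32_t needed = image_paras + min_alloc;
    uint32_t wanted = image_paras + max_alloc;

    if (wanted < needed) /* treat MaxAlloc below MinAlloc as MinAlloc */
        wanted = needed;
    if (largest_free < needed)
        return 0;        /* not enough memory to load the program */

    /* Give the program what it asked for, but never more than is free. */
    return wanted < largest_free ? wanted : largest_free;
}
```

With the usual MaxAlloc value of 0xFFFF the program gets the whole free block, which is why a shortcut that always allocates everything works for most programs. A TSR with a small MaxAlloc value gets only what it asked for, leaving the rest free for the program loaded after it.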
I was also requested via email to look into a screen resolution problem in a game called Mahjong Fantasia. I noticed that it went to 640x200 screen mode, but still wanted to display 400 lines. There is no screen mode preset for 640x400 resolution, so the game accessed the CRTC registers directly. I enhanced my VGA emulation so that the vertical resolution is no longer tied to the current screen mode, but instead it is determined by the CRTC register values (like in DOSBox). This might cause problems in some games, but most likely it will work better than the mode-based resolution detection. At least Mahjong Fantasia began to display a correct resolution, and also Gods (another game that accesses CRTC registers directly) still works.
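Deriving the vertical resolution from the CRTC registers could look roughly like the C sketch below. The register indices are the standard VGA CRTC ones (0x07 Overflow, 0x09 Maximum Scan Line, 0x12 Vertical Display End), but the function name and the array layout are made up for this example, and it only makes sense for graphics modes. It is not the DSx86 or DOSBox implementation.

```c
#include <stdint.h>

/* 'crtc' holds the CRTC register values indexed by register number. */
static unsigned crtc_visible_lines(const uint8_t *crtc)
{
    /* Vertical Display End is a 10-bit value: low 8 bits in register 0x12,
     * bit 8 in Overflow register bit 1, bit 9 in Overflow register bit 6. */
    unsigned vde = crtc[0x12]
                 | ((crtc[0x07] & 0x02u) << 7)
                 | ((crtc[0x07] & 0x40u) << 3);
    unsigned scanlines = vde + 1;

    /* Each displayed line may be repeated: Maximum Scan Line (bits 0..4)
     * gives the repeat count minus one, and bit 7 doubles every scanline. */
    unsigned repeat = (crtc[0x09] & 0x1Fu) + 1;
    if (crtc[0x09] & 0x80u)
        repeat *= 2;

    return scanlines / repeat;
}
```

With a Vertical Display End value of 399 this gives 200 lines while scanline doubling is on, and 400 lines once a game turns the doubling off, regardless of which BIOS mode was set.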
I have also added some simple missing EGA and Mode-X opcodes based on the debug logs I have received, again thanks for those! Other things on my TODO list I have not yet looked into, but at least the above mentioned improvements will be in the upcoming 0.23 version.
Several DSx86 users sent me email letting me know that SuperCard are giving a beta version of their DSTwo SDK to selected homebrew coders. So, I decided to contact SuperCard and request the beta version of the SDK. They accepted me to their beta test program (thanks SuperCard!) and sent me the SDK. I don't have the actual DSTwo card itself (which they also will send me) yet, so I can not properly test the SDK until I receive the card as well. However, I can start porting DSx86 to it immediately, and this is what I began doing yesterday.
The first problem was that the SDK (or rather the mipsel-linux toolchain that it uses) is meant to be run on Linux, and I have only Windows machines. This turned out to not be a major problem, though, as running VirtualBox on my Windows XP machine and installing Ubuntu Linux on it allowed me to install the SDK and compile the libraries and examples fine. I am familiar with the Linux command line tools, but I have actually never used the Linux graphical UI, which however seems to be reasonably similar to Windows so I am not totally lost with it. :-)
There do not seem to be any major problems compiling the C and C++ source code, but obviously I need to write the ASM parts pretty much from scratch for the MIPS architecture. It will be interesting to study another new hardware architecture, and the MIPS architecture seems to be very different from ARM, even though they both are RISC processors. What little I have found out so far about the MIPS architecture compared to ARM is this:
My working on the MIPS port of DSx86 will of course take time away from improving the current DSx86, but I think DSx86 is already working reasonably well as it is. Many games still need fixing and there are many general improvements that could be made, but I can continue working on these even while I concentrate on the DSTwo SDK testing. SuperCard probably expects me to actually work with their SDK, or they would not have sent it to me. I think I owe it to them to test it properly and report the possible problems and enhancement ideas I find when trying to port DSx86 to run on DSTwo.
With the DSTwo version of DSx86 (currently named "DS2x86" :-) I plan to support 386 instructions, and probably also 486 instructions. I will first port my tester program (which simply tests each CPU opcode for correct results), as it is much simpler than the full DSx86 and with that I can concentrate on the CPU emulation and make sure I get it to work correctly. After that I will probably try to get Norton Sysinfo running in it, just to see what the emulation speed will be like.
It took me about half a year from when I started working on DSx86 to when the first alpha version was released, so it might take about the same amount of time with the DSTwo version. On the other hand I have learned quite a lot about emulation in general, and the C and C++ codes do not need major changes, so it might be that I get something working much sooner.
This version has the refactored internals, so it most likely runs some (if not all) games slower than the previous versions. It does however now support practically all real-mode 286 CPU opcodes (not including JPE and JPO which require game-specific hacks), and also unsupported graphics opcodes should now be quite rare. The graphics opcodes are now reported as Unsupported EGA opcode or Unsupported Mode-X opcode, and unlike in previous versions, you can continue after such an opcode using the B button. However, it is likely that you will get the same error again and again, so please send me the log file if you encounter unsupported graphics opcodes. If you get a plain Unsupported opcode error, it most likely means that the program is executing data instead of code, so something has gone wrong in the code before this happened, and thus it is not possible to continue running the program. Again, I am interested in the log files produced in these situations.
Besides the refactored internals, this version has various other fixes, based on many games and other programs I have been testing. Here is a list of the programs I tested, and the changes made into DSx86 or other information about why the program fails to run properly.
This was the last week of my summer vacation, so after today it is back to the normal slow progress with DSx86. I won't have much time to work on DSx86 during weekdays, so I can not get all that much done during each two-week period. I am glad I got the internal refactoring done during my summer vacation, though, as that was quite an extensive change. I had to change pretty much every single opcode that I have been spending the last year coding.
I decided to see if I could include my old profiling tool (which I had last used in August last year, when the emulation core still was bundled with Wing Commander II files and could not even run 4DOS yet), and make it run with the current DSx86. I had long since stripped the code out from DSx86, but luckily I found it in an old source code backup directory. I added it back, and decided to start by profiling Norton Sysinfo.
Here is the first profiling result, while running the SysInfo CPU speed test. This was taken pretty much right after I finally got the new code to run properly, without any optimizing done yet. The first table shows the opcodes taking the least average number of ticks (ordered by that value), and the second table shows the opcodes taking the most total number of ticks (again ordered by that value):
| opcode | byte | count | min ticks | avg ticks | total ticks | % of total | command | in ITCM? |
|---|---|---|---|---|---|---|---|---|
| NOP | 90 | 1742 | 8 | 8.00 | 13936 | 0.0082% | No operation | Yes |
| CWD | 99 | 78 | 10 | 10.00 | 780 | 0.0005% | Convert word to doubleword | Yes |
| JA | 77 | 4394 | 10 | 10.01 | 43992 | 0.0260% | Jump if unsigned above | Yes |
| MOV DL,imm8 | B2 | 52 | 11 | 11.00 | 572 | 0.0003% | Move imm8 byte to DL register | Yes |
| JL | 7C | 518056 | 11 | 11.01 | 5704348 | 3.3762% | Jump if signed less | Yes |

| opcode | byte | count | min ticks | avg ticks | total ticks | % of total | command | in ITCM? |
|---|---|---|---|---|---|---|---|---|
| JL | 7C | 518056 | 11 | 11.01 | 5704348 | 3.3762% | Jump if signed less | Yes |
| ADD r16, r/m16 | 03 | 519175 | 14 | 28.11 | 14591630 | 8.6363% | Add 16-bit register or memory to 16-bit register | No |
| ADD/SUB/AND/OR/CMP r/m16,imm16 | 81 | 518383 | 32 | 32.14 | 16660587 | 9.8608% | Various 16-bit arithmetic/logical operations with imm16 value | No |
| TEST/NOT/NEG/MUL/DIV r/m16 | F7 | 266630 | 16 | 67.06 | 17880567 | 10.5829% | Various 16-bit memory operations | No |
| MOV r/m16,r16 | 89 | 525262 | 33 | 37.18 | 19526937 | 11.5573% | Store a 16-bit register into register or memory | No |
| MOV r16,r/m16 | 8B | 806241 | 14 | 24.63 | 19859369 | 11.7541% | Load a 16-bit register from register or memory | Yes |
| POP r/m16 | 8F | 518064 | 39 | 39.16 | 20285222 | 12.0061% | Pop a value from stack to register or memory | No |
| INC/DEC/CALL/JMP/PUSH r/m16 | FF | 1048576 | 34 | 39.20 | 41102557 | 24.3272% | Various 16-bit memory operations | No |
Not surprisingly, NOP (no operation) is the fastest opcode. The ticks run at 33MHz, so 8 ticks means that handling a NOP opcode takes 16 CPU cycles (as the NDS CPU runs at 66MHz). This includes some profiling overhead, so one or two ticks can in effect be subtracted from the ticks of all the opcodes to get the actual number of timer ticks that executing the opcode takes. The JL opcode is both one of the fastest opcodes and also one of the most frequently executed opcodes. It is interesting that the two opcodes taking the most total ticks are 0xFF and 0x8F, both of which should be rather uncommon in normal programs, especially in games. As opcodes 0x81, 0xF7 and 0xFF can perform several different operations, depending on the so-called "modrm" byte following the main opcode byte, I wanted to see what operations exactly SysInfo does with those opcodes:
| opcode 81 | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| CMP [disp16],imm16 | 3E | 577 | 32 | 95.10 | 54873 | 0.1678% | Compare global variable with 16-bit value | No |
| CMP [BP+disp8],imm16 | 7E | 1047999 | 32 | 32.10 | 33640850 | 99.8322% | Compare local variable with 16-bit value | No |

| opcode F7 | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| DIV [BP+disp8] | 76 | 204 | 78 | 148.66 | 30326 | 0.0431% | Divide DX:AX by local variable | No |
| IMUL [BP+disp8] | 6E | 714 | 29 | 77.13 | 55073 | 0.0783% | DX:AX = AX * local variable | No |
| DIV CX | F1 | 2244 | 67 | 83.05 | 186355 | 0.2651% | Divide DX:AX by CX | No |
| DIV BX | F3 | 1043068 | 67 | 67.01 | 69896106 | 99.4304% | Divide DX:AX by BX | No |

| opcode FF | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| PUSH [disp16] | 36 | 225 | 35 | 71.23 | 16027 | 0.0390% | Give a global variable as a parameter to a C function | No |
| INC WORD [disp16] | 06 | 8159 | 39 | 41.91 | 341920 | 0.8326% | Increment a global variable | No |
| PUSH [BP+disp8] | 76 | 520954 | 38 | 38.10 | 19847174 | 48.3317% | Give a local variable as a parameter to a C function | No |
| INC WORD [BP+disp8] | 46 | 518732 | 40 | 40.16 | 20831544 | 50.7288% | Increment a local variable of a C function | No |
Opcode 81 only used those two variations, with only the CMP [BP+disp8],imm16 operation actually relevant. Opcode F7 used several modrm variations, but again only DIV BX is called repeatedly in the CPU speed test loop. I believe this opcode is used to determine the CPU MHz number, as the DIV opcode is supposed to take exactly 22 CPU cycles on an 80286 processor. As the division seems to take 67 ticks (at 33MHz) in DSx86, that converts nicely to an 11MHz 80286 clock speed, just like Norton SysInfo reports. Opcodes FF, 81 and 8F are then probably the actual opcodes used to calculate the CPU speed, and all of these use the BP-register-indexed stack access.
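The tick arithmetic above can be written out as a small C sketch. The function names are hypothetical, and the 33MHz timer and 66MHz CPU figures are the rounded values used in the text.

```c
/* Timer ticks run at half the CPU clock, so one tick is two ARM CPU cycles. */
static unsigned ticks_to_arm_cycles(unsigned ticks)
{
    return ticks * 2;
}

/* Clock speed (in MHz) of a CPU that would need 'reference_cycles' cycles
 * for an operation that takes 'ticks' timer ticks in the emulator. */
static double equivalent_mhz(double ticks, double timer_mhz, double reference_cycles)
{
    double microseconds = ticks / timer_mhz; /* time taken by the operation */
    return reference_cycles / microseconds;  /* cycles per microsecond = MHz */
}
```

For the DIV BX case, equivalent_mhz(67, 33, 22) gives 22 * 33 / 67, or roughly 10.8, which rounds to the 11MHz that Norton SysInfo reports.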
It is also interesting that the very rarely called operations take on the average two times the minimum timer ticks, while the common operations take around the minimum number of ticks all the time. This is probably due to NDS cache misses while the less frequent operations are performed.
Next I looked into the operation that I thought is most suitable for testing the possible optimization tricks, opcode 03 (ADD reg16, r/m16). This is a good opcode for testing, as it does not need the CPU flags to be saved (the addition will change all of them anyways), and pretty much all the arithmetic and logical opcodes are very similar. So if I can figure out ways to optimize it, I can use the same tricks for a lot of other opcodes as well. The refactored code for opcode 0346 (ADD AX,[BP+disp8]) looked like this (the actual code is full of parameterized macros, so this is what the code would look like with all the macros expanded), and it takes 28.11 ticks on the average to run:
add_ax_bpdisp8:
@-------
@ macro r0high_from_idx_disp8
@-------
ldrsb r0,[r12],#1 @ Load sign-extended byte to r0, increment r12 by 1
add r0, r9, r0, lsl #16 @ r0 = (idx register + signed offset) << 16
b add_r16_r0high_bp_r4 @ Jump to handler for AX (r4) register with BP (r9) based indexing
...
@-------
@ macro add_reg16_r0high
@ On input:
@ r0 = offset within the segment in high halfword
@ r1 = free
@ r2 = current effective segment in high halfword, segment override flag in lowest byte
@ r3 = current SS segment in high halfword, current DS segment in low halfword
@ r4..r11 = AX..DI registers in high halfwords
@ r12 = current physical CS:IP
@ lr = current physical SS:0000
@-------
add_r16_r0high_bp_r4: @ This is jumped to when the offset is based on BP register
@-------
@ macro mem_handler_bp_destroy_SZflags
@ Indexing by BP register, so use SS unless a segment override is in effect.
@-------
tst r2, #0xFF @ Is a segment override in effect? Zero flag will be set if not
moveq r2, r3, lsr #16 @ r3 high halfword contains the SS segment, so put it into r2 ...
lsleq r2, #16 @ ... and shift it to the high halfword.
@-------
@ macro mem_handler_jump_r0high
@ Calculate the physical RAM address, and jump to correct handler
@ depending on the type of the memory addressed.
@ On input:
@ r0 = offset within the segment in high halfword
@ r2 = current effective segment in high halfword
@ NOTE! Nothing may have been pushed into stack before this!
@ Output:
@ r2 = physical memory address (with EGA/MODEX flags if applicable)
@ Destroys:
@ r0
@-------
add_r16_r0high_r4: @ This is jumped to when the offset is NOT based on BP register
add r2, r0, lsr #4 @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
mov r0, r2, lsr #(12+10+4) @ r0 = 16K page number
add r0, #(SP_EMSPAGES>>2) @ r0 = index into EMSPages table in stack
ldr r0,[sp, r0, lsl #2] @ r0 = physical start address of the page, highest byte tells type
lsl r2, #(18-12) @ r2 = offset within the 16K page in highest bits
add r2, r0, r2, lsr #18 @ r2 = physical linear address
add r0, pc, r2, lsr #24 @ r0 = PC + 0x02, 0x06, 0x0A, 0x0E, ...
ldr pc,[r0, #-2] @ Jump to the handler, adjust index to 0, 4, 8, or 12
.word .op_03_RAM_r4 @ RAM (physical address like 0x02XXXXXX)
.word .unknown_back1 @ MCGA Direct (obsolete!)
.word op_03_EGA_r4 @ EGA (physical address like 0x0AXXXXXX)
.word .unknown_back1 @ Mode-X (unsupported opcode!)
.op_03_RAM_r4:
@-------
@ Actual code for handling opcode 03 when the target is AX and the address is in normal RAM.
@ Get a halfword from (possibly) unaligned memory address, and add it to register.
@-------
ldrb r0, [r2] @ Load low byte from RAM
ldrb r1, [r2, #1] @ Load high byte from RAM
lsl r0, #16
orr r0, r1, lsl #24 @ r0 = low byte | (high byte << 8) (in high halfword)
adds r4, r0 @ Finally perform the addition
b loop
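For readers not fluent in ARM ASM, the address calculation in the listing above corresponds roughly to the following C sketch. It models only the idea (segment:offset to linear address, 16K page lookup, byte-wise halfword load); the names and the plain pointer table are assumptions for illustration, and the memory type information that the real table carries in the highest byte is left out.

```c
#include <stdint.h>

#define PAGE_SHIFT 14  /* 16K pages */
#define PAGE_COUNT 64  /* 1MB of real-mode address space */

/* Real-mode linear address; the shifts in the ARM code keep only 20 bits,
 * so the address wraps around at 1MB. */
static uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}

/* Look up the host address: 'pages' holds the start of each 16K page. */
static uint8_t *physical_address(uint8_t *const pages[PAGE_COUNT],
                                 uint16_t segment, uint16_t offset)
{
    uint32_t linear = linear_address(segment, offset);
    return pages[linear >> PAGE_SHIFT] + (linear & ((1u << PAGE_SHIFT) - 1));
}

/* Load a little-endian halfword from a possibly unaligned address,
 * one byte at a time, like the two ldrb instructions do. */
static uint16_t load_halfword_unaligned(const uint8_t *address)
{
    return (uint16_t)(address[0] | (address[1] << 8));
}
```

The two separate byte loads are what make this version safe for odd addresses, and also what the later optimization tries to avoid in the common aligned case.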
I did a minor optimization immediately: I coded a shortcut for the situation where the memory operand is in normal RAM (which it always is in SysInfo), and then checked which operations exactly are performed:
| opcode 03 | modrm | count | min ticks | avg ticks | total ticks | % of total | command | ITCM? |
|---|---|---|---|---|---|---|---|---|
| ADD SI,AX | F0 | 312 | 14 | 39.47 | 12314 | 0.0468% | Add register AX to register SI | No |
| ADD AX,BX | C3 | 364 | 14 | 36.27 | 13201 | 0.0502% | Add register BX to register AX | No |
| ADD DI,[BP+disp8] | 7E | 208 | 25 | 67.85 | 14112 | 0.0537% | Add a local variable to register DI | No |
| ADD AX,[disp16] | 06 | 364 | 24 | 68.85 | 25061 | 0.0953% | Add a global variable to register AX | No |
| ADD AX,[BP+disp8] | 46 | 1045983 | 25 | 25.03 | 26180287 | 99.5774% | Add a local variable to register AX | No |
The things to note about this opcode are:
I made several iterations, adjusting and improving the code and then profiling again. Finally I ran out of new ideas to test, so this is what the current code looks like. See the list of optimizations below the code for a description of each change I did. The changes are also marked in red in the comments in this code snippet:
add_ax_bpdisp8:
@-------
@ new macro r0high_r2_from_bpdisp8_destroy_SZflags
@-------
ldrsb r0,[r12],#1 @ Load sign-extended byte to r0, increment r12 by 1
tst r2, #0xFF @ Is a segment override in effect? Zero flag will be set if not
add r0, r9, r0, lsl #16 @ r0 = (idx register + signed offset) << 16
biceq r2, r3, #0x0000FF00 @ r2 = logical SS segment in high halfword, with garbage in low byte
@-------
@ macro calc_linear_address_r2_from_r0high
@-------
add r2, r0, lsr #4 @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
mov r0, r2, lsr #(12+10+4) @ r0 = 16K page number
add r0, #(SP_EMSPAGES>>2) @ r0 = index into EMSPages table in stack
ldr r0,[sp, r0, lsl #2] @ r0 = physical start address of the page minus logical page start
add r2, r0, r2, lsr #12 @ r2 = physical linear address
@-------
@ Code specific to [BP+disp8] handling
@-------
tst r2, #0x7C000001 @ Is the target something else than halfword-aligned RAM?
bne .op_03_addr_r4 @ Yep, so jump there
@-------
@ Halfword-aligned RAM address accessed by BP-based indexing.
@-------
ldrh r0, [r2] @ Load halfword from RAM
adds r4, r0, lsl #16 @ Add it to register value
b loop @ Back to opcode loop
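The fast path in the listing above can be modelled in C roughly like this. The mask value and the 0x02XXXXXX / 0x0AXXXXXX address patterns come from the listings; the function names and the page_bias table (each entry holding the physical page start minus the logical page start) are hypothetical stand-ins for the table kept in the stack.

```c
#include <stdint.h>

/* Bit 0 set = odd address; any of bits 26..30 set = not normal RAM. */
#define NOT_ALIGNED_RAM_MASK 0x7C000001u

/* With the bias stored per page, the physical address is just
 * table entry + linear address, with no separate page offset step. */
static uint32_t physical_from_bias(const uint32_t page_bias[64],
                                   uint16_t segment, uint16_t offset)
{
    uint32_t linear = (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
    return page_bias[linear >> 14] + linear;
}

/* One test decides whether a single halfword load can be used. */
static int is_halfword_aligned_ram(uint32_t physical)
{
    return (physical & NOT_ALIGNED_RAM_MASK) == 0;
}
```

Folding the alignment check and the memory type check into one mask is what lets the common case fall straight through to a single ldrh, while everything else takes the slower general handler.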
The optimizations I made to the code are the following:
So, after I coded similar optimizations to all the [BP+disp8] based operations that Norton SysInfo uses during the CPU speed calculation, how did this affect the speed? Here first is the new profiling result, where we can see that handling opcode 03 now takes on the average only 22.13 timer ticks (while it originally took over 28 ticks):
| opcode | byte | count | min ticks | avg ticks | total ticks | % of total | ITCM? | improvement |
|---|---|---|---|---|---|---|---|---|
| JL | 7C | 518073 | 11 | 11.01 | 5703131 | 4.2727% | Yes | 0% |
| ADD r16, r/m16 | 03 | 519191 | 14 | 22.13 | 11490927 | 8.6088% | No | 21.25% |
| MOV r/m16,r16 | 89 | 525284 | 24 | 24.25 | 12736016 | 9.5416% | No | 34.78% |
| CMP [BP+disp8],imm16 | 81 | 518356 | 25 | 25.12 | 13019844 | 9.7543% | No | 21.85% |
| POP r/m16 | 8F | 518085 | 28 | 28.06 | 14539076 | 10.8924% | No | 28.32% |
| MOV r16,r/m16 | 8B | 806280 | 14 | 20.04 | 16157616 | 12.1050% | Yes | 18.64% |
| DIV BX (etc) | F7 | 266636 | 16 | 67.04 | 17875969 | 13.3924% | No | 0% |
| INC/PUSH [BP+disp8] (etc) | FF | 1048576 | 27 | 30.25 | 31721006 | 23.7649% | No | 22.82% |
The operations that read from RAM (now using ldrh instead of two ldrb operations) have improved about 20%. The operations that write to RAM (this time with strh instead of two strb operations) have improved by about 30%! (The real improvement is even a little bit higher, as these percentages have the profiling overhead included in the results.) It is interesting that the memory store benefits from halfword access more than the load. Perhaps this is due to my not being able to avoid using the register immediately after load, while storing a register does not have this slowdown. And finally, here is what Norton SysInfo now shows as the speed of DSx86. I was hoping I could get back to above 10x original PC speed, and I am quite happy to see that I succeeded. All in all, looks like my refactoring the code did not completely kill the performance of DSx86.
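The improvement percentages in the table can be checked with a few lines of C. This is just the arithmetic, with hypothetical function names; the averages are computed from the total ticks and counts of the two profiling runs.

```c
/* Average ticks per executed opcode. */
static double average_ticks(double total_ticks, double count)
{
    return total_ticks / count;
}

/* Relative speedup in percent: how much smaller the new average is. */
static double improvement_percent(double old_average, double new_average)
{
    return (old_average - new_average) / old_average * 100.0;
}
```

For opcode 03 the averages are 14591630 / 519175 (about 28.11) before and 11490927 / 519191 (about 22.13) after, which gives the 21.25% shown in the table.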
The next program I am going to profile is Wing Commander II, as it has been pretty choppy to begin with. The last time I profiled it the MCGA graphics mode only used Direct screen access, while nowadays only blitted screen update is used. Thus the results will not be fully comparable to results from last year, but even so it will give me information on what opcodes to optimize next.
This is mostly a fix version after the somewhat buggy version 0.20. This version includes the finished AdLib emulation, and I fixed the problem introduced in 0.20 with the Direct SB mode where the start of the BIOS F000 segment was overwritten with corrupt data. This version also has a lot more of the opcodes refactored to use the new more robust memory handling, which also means that this version will run slower than the previous version. Norton SysInfo reports that this version runs at 9.9 times the original PC speed, while the version before any of these internal changes ran at 11.5 times the original PC speed. I am still not even halfway done with the internal refactoring, so the next version might still be slower, until I get all the refactoring done and can again start optimizing things.
I spent about half my time working on the internal refactoring, and the other half with debugging and testing programs that behaved badly in the previous version. Here is a list of the specific programs I tested and the changes they required, where applicable.
The internal refactoring continues, and as you might have noticed, this version is quite a bit smaller than the previous version. That is due to refactored code no longer requiring separate graphics and normal RAM opcodes, but instead only the memory handlers are separate. So even though the code size gets smaller, more and more "graphics opcodes" get supported by every refactoring change I do. I am looking forward to a point where I can get rid of the separate graphics opcode framework completely, as that will free several kilobytes of ITCM for other more beneficial use.
I also hope to finally look into the mouse emulation improvements during the next couple of weeks. Adding smoother screen scaling could also help some games, but the problem with that is that it takes a lot of CPU cycles, during which time no interrupts are sent to the running x86 program, so especially Direct SB audio would become pretty much unusable. But, I'll see what I can do about that. There are also many games remaining in the Compatibility Wiki that I should look at, so I don't think I will run out of things to do in DSx86 for a while yet. :-)
Thanks again to all of you for your interest in DSx86!
Just a quick post: I just made the source code for my AdLib emulation available for download. See my download page, or get it directly from here. Hope you find it useful or interesting, and let me know if you see some obvious bugs in it. :-)
Last Monday I added the last missing features of the AdLib emulation, frequency modulation (vibrato) and amplitude modulation. I figured out a way to handle the vibrato without slowing down the code all that much. The sound frequency should change (using the 32kHz sound output frequency) after every 674 samples, but as I fill the buffers 64 samples at a time, I decided to change the frequency after every 640 samples, so I can move the calculations outside of the sample building loop. So the vibrato is slightly faster than it should be, but I don't think that is much of a problem. I did a similar change to the amplitude modulation, it should change the sound volume every 168 samples, but I change it every 192 samples, again to move the calculations outside of the 64-sample loops.
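The block-aligned timing described above could be sketched in C roughly as follows. This is only an illustration: the names `mod_state`, `fill_block`, `VIBRATO_BLOCKS` and `TREMOLO_BLOCKS` are hypothetical and not taken from the released sources; only the 64, 640 and 192 sample figures come from the text.

```c
#include <stdint.h>

#define BLOCK_SAMPLES   64   /* buffer fill granularity (from the text) */
#define VIBRATO_BLOCKS  10   /* 10 * 64 = 640 samples instead of the exact 674 */
#define TREMOLO_BLOCKS   3   /*  3 * 64 = 192 samples instead of the exact 168 */

/* Hypothetical modulation state; the field names are illustrative only. */
struct mod_state {
    uint32_t vib_blocks;   /* blocks since the last vibrato step */
    uint32_t trem_blocks;  /* blocks since the last tremolo step */
    uint32_t vib_steps;    /* vibrato steps taken so far */
    uint32_t trem_steps;   /* tremolo steps taken so far */
};

/* Called once per 64-sample block, before the inner sample loop runs. */
static void fill_block(struct mod_state *m)
{
    if (++m->vib_blocks == VIBRATO_BLOCKS) {
        m->vib_blocks = 0;
        m->vib_steps++;    /* a real emulator would retune the operators here */
    }
    if (++m->trem_blocks == TREMOLO_BLOCKS) {
        m->trem_blocks = 0;
        m->trem_steps++;   /* a real emulator would adjust the volume here */
    }
    /* ... the 64-sample inner loop then runs with no modulation checks ... */
}
```

Because both periods are whole multiples of the block size, the inner loop never has to test a modulation counter.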
I just need to comment the code better and create a test project that would use my AdLib emulation code and then I could release the sources, in case people are interested in those.
I have so far tested Adventures of Robin Hood, Silpheed and SimAnt. Adventures of Robin Hood crashed with an unsupported opcode, and when I debugged it I noticed that it uses only a 32-byte(!) stack! That is so little space that when a timer interrupt happens while the code is calling a subroutine (after receiving a keypress) it runs out of stack space. Actually, the stack pointer should wrap to the end of the stack segment (which has unused space in the game launcher program), but my DSx86 stack emulation implementation did not handle this properly and thus crashed. I changed my stack emulation to handle stack pointer wrap-arounds properly, so Adventures of Robin Hood started up fine.
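The wrap-around behaviour can be illustrated with a small C sketch. It is not the DSx86 implementation (which is ARM ASM); the `cpu` struct and the `push16`/`pop16` helpers are hypothetical names, and the point is only that SP is a 16-bit value that wraps inside the stack segment.

```c
#include <stdint.h>

/* Hypothetical emulated CPU state; names are for illustration only. */
struct cpu {
    uint8_t *ram;      /* emulated memory, must cover ss_base + 64K */
    uint32_t ss_base;  /* SS << 4 */
    uint16_t sp;       /* 16-bit stack pointer */
};

static void push16(struct cpu *c, uint16_t value)
{
    c->sp = (uint16_t)(c->sp - 2);   /* 0x0000 wraps to 0xFFFE */
    c->ram[c->ss_base + c->sp] = (uint8_t)(value & 0xFF);
    c->ram[c->ss_base + (uint16_t)(c->sp + 1)] = (uint8_t)(value >> 8);
}

static uint16_t pop16(struct cpu *c)
{
    uint16_t lo = c->ram[c->ss_base + c->sp];
    uint16_t hi = c->ram[c->ss_base + (uint16_t)(c->sp + 1)];
    c->sp = (uint16_t)(c->sp + 2);   /* 0xFFFE wraps back to 0x0000 */
    return (uint16_t)(lo | (hi << 8));
}
```

With SP at 0, a push stores the value at offset 0xFFFE of the stack segment instead of writing below the segment start.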
Silpheed progressed pretty far with the single-opcode internal refactoring I had already done, but then it ran into a problem with my string opcodes that still didn't handle writing to graphics memory with a segment address pointing to plain RAM properly. It did draw most of the enemy ships with the plain opcodes, but their removal from the screen was done using string opcodes, so after a while the screen was full of enemy ships. :-) It needed the refactored string opcodes before I could continue with it.
I then tested SimAnt, and noticed that it used the EMS memory in a way that also was incompatible with my segment-based memory access mode handling. It got the wrong data from the EMS memory using my old string opcodes, and thus filled the screen with garbage data instead of the Maxis logo at the start, for example. So, I decided to start working on the string opcode refactoring next.
I have now refactored many of the simple opcodes, like all OR operation variants and most of the MOV opcodes. What was interesting when I did this refactoring was that after every build the resulting DSx86.nds file got smaller. This was due to the new memory access handling using more common code for both the normal RAM and graphics opcodes, as the code only branches to different handlers after the effective memory address has been calculated. Originally I had completely different opcode handlers, depending on where the effective segment points to. Thus, the new code has a lot more branches (which makes the code slower), but on the other hand there is much more common code, which will help with the cache hit percentage. So perhaps the total slowdown is not quite as drastic. I also have a couple of speedup tricks I can still use, but those I can not do until all of the code is changed to use the new memory access strategy. So, it looks like the next version will be quite a bit slower than the current version, but after I have refactored all the opcodes I can make the code slightly faster again.
I am currently working on the string opcode refactoring, and now I am pretty happy with the way the code looks. I am splitting the memory moves, for example, so that they are done in blocks that fit within the same 16K memory page, both for the source and target address, and a possible SI or DI segment wrap is also taken into account. So all EMS memory and graphics memory access with the string opcodes should now always handle the correct data, and I can look elsewhere for erratic behaviour in games. I have already coded most of the string opcodes for main RAM and EGA graphics memory, but Mode-X handling is still completely missing. I still have a week before the next release, so I should have enough time to handle those as well.
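The block splitting can be sketched in C as below. This is a simplified illustration under a few assumptions: the copy runs forward (direction flag clear), addresses are byte-granular, and the function and parameter names (`chunk_len`, `src_linear`, `dst_linear`) are made up for this example rather than taken from DSx86.

```c
#include <stdint.h>

#define PAGE_SIZE 0x4000u   /* 16K pages, as in the text */

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/*
 * Number of bytes that can be moved in one go before either the source or
 * the target crosses a 16K page boundary, or SI or DI wraps at 64K.
 * Assumes a forward copy (direction flag clear).
 */
static uint32_t chunk_len(uint32_t src_linear, uint32_t dst_linear,
                          uint16_t si, uint16_t di, uint32_t remaining)
{
    uint32_t n = remaining;
    n = min_u32(n, PAGE_SIZE - (src_linear & (PAGE_SIZE - 1)));
    n = min_u32(n, PAGE_SIZE - (dst_linear & (PAGE_SIZE - 1)));
    n = min_u32(n, 0x10000u - si);   /* bytes until SI wraps */
    n = min_u32(n, 0x10000u - di);   /* bytes until DI wraps */
    return n;
}
```

A caller would copy `chunk_len(...)` bytes, advance SI, DI and the remaining count, recompute the linear addresses and repeat until nothing remains. Each chunk then stays within one page on both sides, so the page lookup is needed only once per chunk.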
After I coded the main RAM and EGA string operations, both Silpheed and SimAnt looked to be playable. Silpheed still hangs after intro if Enter is not pressed, and it also hangs when going to the debugger and trying to continue, so there are still some issues. SimAnt I haven't tested any further than by going to the main menu. It uses 640x480 graphics mode so it is a bit awkward to play in any case.
The heat wave continues here in Finland (as it seems to do for most of Europe), so I shall see how much coding I can get done during the next week.
Yes, it is version 0.20, not 0.16! I decided to jump the version number a bit again, as this version has some extensive internal architecture changes, as well as much improved audio features. I also began a new blog page as the previous one had grown quite long. See the end of this page for a link to the previous blog entries. Anyways, here is a list of the most important changes:
As you can see from the change list, my focus last week was the audio features of DSx86. The last time I worked on the AdLib emulation was September last year, so I first had to go through the code and try to remember how it worked. Then I began by increasing the audio volume, which was the most often requested audio improvement. I remembered I had tried to increase the volume once earlier, but that resulted in bad distortion. Now I found the reason for the distortion and was able to fix that, so increasing the audio volume actually made the audio much cleaner. Also, for the first time since I had started working on DSx86 I now used Hi-Fi headphones with my DS Lite, and was surprised to find that my AdLib emulation actually produces quite convincing bass frequencies! I had only tested the audio with the inbuilt speakers and el-cheapo tiny headphones, neither of which seemed to produce any bass sounds.
After I increased the audio volume I began working on the missing rhythm instruments. AdLib has two modes of operation: it can either use all 9 channels (each with 2 FM operators) for melodic instruments, or it can use the last three channels for rhythm instruments (so that Bass Drum uses both operators of channel 6, and the other four rhythm instruments each use a single operator of the remaining two channels). Back in September the rhythm instrument code in my reference fmopl.c implementation looked extremely complex and slow, so I skipped implementing these at that time. Now that I am on my summer vacation I wanted to really look into this code and understand how it works, and I was able to optimize it and invent various shortcuts to make it run pretty much as fast as the normal channels.
Actually Bass Drum behaved pretty much like a normal channel, except that in the normal melodic channel operator 1 either works as a phase modulator for operator 2, or it produces sound directly (so that a single channel can actually produce two different sounds), but with Bass Drum it either works as a phase modulator or is ignored completely. Ignoring it was of course quite an easy change, so that took care of the Bass Drum. Tom Tom was quite easy as well, it just used a single operator to drive the output, so it was actually easier than the melodic channels.
The HiHat, Snare and Cymbal sounds were more difficult. They also each use only a single operator, but they need a noise generator in addition to the phase frequency counter. In addition, the frequency counter is not used as a simple 16.16 fixed point index into a waveform table; instead, only a few bits of the frequency counter are used to select certain fixed indices into the waveform table. Both HiHat and Cymbal also use two different operators (channel 7 operator 1 and channel 8 operator 2) to produce their output. For example, this is what the HiHat phase generation looks like in the reference implementation:
/* high hat phase generation:
    phase = d0 or 234 (based on frequency only)
    phase = 34 or 2d0 (based on noise)
*/
/* base frequency derived from operator 1 in channel 7 */
unsigned char bit7 = ((SLOT7_1->Cnt>>FREQ_SH)>>7)&1;
unsigned char bit3 = ((SLOT7_1->Cnt>>FREQ_SH)>>3)&1;
unsigned char bit2 = ((SLOT7_1->Cnt>>FREQ_SH)>>2)&1;
unsigned char res1 = (bit2 ^ bit7) | bit3;
/* when res1 = 0 phase = 0x000 | 0xd0; */
/* when res1 = 1 phase = 0x200 | (0xd0>>2); */
UINT32 phase = res1 ? (0x200|(0xd0>>2)) : 0xd0;
/* enable gate based on frequency of operator 2 in channel 8 */
unsigned char bit5e= ((SLOT8_2->Cnt>>FREQ_SH)>>5)&1;
unsigned char bit3e= ((SLOT8_2->Cnt>>FREQ_SH)>>3)&1;
unsigned char res2 = (bit3e ^ bit5e);
/* when res2 = 0 pass the phase from calculation above (res1); */
/* when res2 = 1 phase = 0x200 | (0xd0>>2); */
if (res2)
    phase = (0x200|(0xd0>>2));
/* when phase & 0x200 is set and noise=1 then phase = 0x200|0xd0 */
/* when phase & 0x200 is set and noise=0 then phase = 0x200|(0xd0>>2), ie no change */
if (phase&0x200)
{
    if (noise)
        phase = 0x200|0xd0;
}
else
/* when phase & 0x200 is clear and noise=1 then phase = 0xd0>>2 */
/* when phase & 0x200 is clear and noise=0 then phase = 0xd0, ie no change */
{
    if (noise)
        phase = 0xd0>>2;
}
The noise value above is calculated in a noise-generator which gives a new value for each output sample, and uses the following algorithm in the reference implementation. The noise value in the above algorithm is the lowest bit of the OPL->noise_rng variable.
OPL->noise_p += OPL->noise_f;
i = OPL->noise_p >> FREQ_SH; /* number of events (shifts of the shift register) */
OPL->noise_p &= FREQ_MASK;
while (i)
{
    if (OPL->noise_rng & 1) OPL->noise_rng ^= 0x800302;
    OPL->noise_rng >>= 1;
    i--;
}
My simplified and sped-up version of the phase generation algorithm is below. The problems in my algorithm are that it does not take into account the frequency of operator 2 in channel 8 at all, as I don't have enough free registers to handle two operators simultaneously, and that my noise generation is completely different. The result is that the HiHat does not sound quite like it should; it has more of a ringing and less noise to its sound. It will have to do for now, though, until I figure out a better 1-CPU-cycle noise generator than my tst r7, r7, ror r7 opcode, or can figure out a way to calculate another operator while calculating the current operator as well.
@-------
@ On input: r7 = SLOT7_1->Cnt (16.16 fixed point value, FREQ_SH = 16)
@ On output: r1 = phase << 9
@-------
eor r1, r7, r7, lsr #5 @ r1 = (bit2 ^ bit7); (<<16)
orr r1, r7, lsr #1 @ r1 = (bit2 ^ bit7) | bit3; (<<16)
and r1, #(1<<(16+2)) @ r1 = res1 = (bit2 ^ bit7) | bit3; (== 0x200 shifted 9 bits left)
tst r7, r7, ror r7 @ Carry flag = pseudo-random noise value
orrcc r1, #(0xD0<<(16+2-9)) @ phase = res1|0xd0;
orrcs r1, #(0xD0<<(16+2-9-2)) @ phase = res1|0xd0>>2;
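For readers less used to ARM ASM, the six instructions above can be modelled in C roughly as follows. This is a sketch: the function name `hihat_phase_sh9` is made up, and the pseudo-random Carry flag produced by `tst r7, r7, ror r7` is passed in as a plain parameter rather than modelled.

```c
#include <stdint.h>

/*
 * cnt         = SLOT7_1->Cnt as a 16.16 fixed point value (FREQ_SH = 16)
 * noise_carry = stand-in for the Carry flag left by the TST instruction
 * returns       phase << 9, like r1 in the ASM version
 */
static uint32_t hihat_phase_sh9(uint32_t cnt, int noise_carry)
{
    uint32_t r1 = cnt ^ (cnt >> 5);   /* bit 18 = bit2 ^ bit7 of the integer part */
    r1 |= cnt >> 1;                   /* bit 18 |= bit3 of the integer part */
    r1 &= 1u << (16 + 2);             /* keep res1 only (0x200 << 9) */
    if (noise_carry)
        r1 |= 0xD0u << (16 + 2 - 9 - 2);   /* phase = res1 | (0xd0 >> 2) */
    else
        r1 |= 0xD0u << (16 + 2 - 9);       /* phase = res1 | 0xd0 */
    return r1;
}
```

Shifting the result right by 9 gives one of the four phase values 0x0D0, 0x034, 0x2D0 and 0x234 that the reference implementation also produces, although here the choice between them ignores operator 2 of channel 8.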
I also managed to speed up some things in my AdLib emulation in general. For example, I reordered the operator-specific values in memory so that instead of using 7 separate ldr commands to load the r4-r10 registers needed in each operator calculation loop I load them with a single ldmia opcode, and I also improved the envelope calculations somewhat. The envelope generation in AdLib has the usual four phases, Attack, Decay, Sustain and Release. Plus silence of course. The code I use for the envelope generation is the following; it is the same for both melodic and rhythm instruments:
@-------
@ Calculate envelope for SLOT 1.
@ On input:
@ r1 = scratch register
@ r4 = sustain level (or silence level if in release phase) (in low 16 bits, 0..512)
@ r5 = envelope increment/decrement value (16.16 fixed point)
@ r8 = operator volume (16.16 fixed point), 0 = max volume, 512<<16 = silence
@-------
adds r8, r5 @ Adjust the volume by the envelope increment. Carry set if we are in attack phase.
bmi from_attack_to_decay_phase @ Go to decay if we went over max volume
rsbccs r1, r8, r4, lsl #16 @ Did we go under the SUSTAIN level (and we are not in attack phase)? Carry clear if we did.
bcc from_decay_to_sustain @ Yep, go adjust the volume
env_adjust_done:
The main idea of this code is that during normal envelope operations the program flow does not need to take any jumps, it will flow directly thru these four opcodes. The algorithm above is an ASM version of the following C language code. This is not based on the reference AdLib implementation, as I have completely re-engineered the envelope generation code to be based on running 16.16 fixed point adders instead of (slow) table lookups.
op->volume += op->env_incr;
if ( op->volume < 0 )
    goto from_attack_to_decay;
if ( op->env_incr >= 0 && op->volume > op->sustain )
    goto from_decay_to_sustain;
env_adjust_done:
The interesting part of my ASM implementation is that by using the reverse subtract RSB opcode instead of the CMP opcode I was able to get rid of the extra compare to see whether op->env_incr is greater than or equal to zero. Once the bmi has filtered out a negative result, the Carry flag left by the first adds is set exactly when r5 (op->env_incr) is negative, and in that case I don't want to test the sustain level. So, by swapping the resulting Carry flag of the comparison between op->volume and op->sustain I can make sure the jump to from_decay_to_sustain is never taken if either op->env_incr is negative or op->volume is <= op->sustain. This is what I especially like about the ARM architecture: the conditionally executable opcodes make all sorts of neat tricks possible! I've been using the conditionally executed compare opcodes quite a bit recently, as that is a nice way to handle "comparison AND comparison" type of tests (with short-circuit evaluation).
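To check the flag reasoning, the four-instruction sequence can be modelled at the flag level in C and compared against the plain C version. This is only a sketch: the function names and the `ENV_*` result codes are invented for the example, and the sustain level is passed as a 0..512 value that is shifted left by 16 like r4 in the ASM.

```c
#include <stdint.h>

enum { ENV_CONTINUE, ENV_TO_DECAY, ENV_TO_SUSTAIN };

/* Flag-level model of: adds / bmi / rsbccs / bcc */
static int env_step_flags(uint32_t *vol, uint32_t incr, uint32_t sustain)
{
    uint64_t sum = (uint64_t)*vol + incr;
    int carry = (int)(sum >> 32);            /* ADDS: carry out of bit 31 */
    *vol = (uint32_t)sum;
    if (*vol & 0x80000000u)                  /* BMI: result is negative */
        return ENV_TO_DECAY;
    if (!carry)                              /* RSBCCS runs only if carry is clear */
        carry = (sustain << 16) >= *vol;     /* RSBS: carry set means no borrow */
    if (!carry)                              /* BCC */
        return ENV_TO_SUSTAIN;
    return ENV_CONTINUE;
}

/* The same step written as ordinary signed arithmetic. */
static int env_step_plain(int32_t *vol, int32_t incr, int32_t sustain)
{
    *vol += incr;
    if (*vol < 0)
        return ENV_TO_DECAY;
    if (incr >= 0 && *vol > (sustain << 16))
        return ENV_TO_SUSTAIN;
    return ENV_CONTINUE;
}
```

For volumes in the 0..512<<16 range and small increments, both functions take the same path, which is what lets the ASM version drop the explicit sign test of the increment.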
The only bigger issues (in addition to the HiHat and Cymbal operators) in the AdLib emulation now are the amplitude and frequency modulation support. I actually have the amplitude modulation already coded in, but it does not seem to work properly, so I still need to debug that. Last September I did not have the iDeaS emulator to use for debugging the ARM7 code, so debugging is now much easier than it was back then. The frequency modulation I haven't implemented at all yet, as it is quite a heavy CPU burden and I'm not yet sure if I can spare the CPU cycles. I would like to test that, though. After those changes I would consider my AdLib emulation finished, so I could release the sources and/or create an ARM7 library for using it, so that it could be used in all sorts of PC game porting projects. I would like to hear the original music in the Doom and Wolfenstein ports, for example, and why not Quake too, if it used AdLib music?
I have now run into several games that point the segment register outside of the graphics memory area (Gods, Silpheed) or the EMS memory area (Alone in the Dark), which breaks my speedup trick of precalculating the memory access area using only the segment register. I decided to change my memory access technique to be more compatible and robust, which will sadly also make it slower.
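The reason the segment-only shortcut breaks can be shown with a small C sketch. The names are hypothetical, and the graphics window is assumed here to be the 64K block at linear address 0xA0000; the real DSx86 handlers are ARM ASM and may draw the boundaries differently.

```c
#include <stdint.h>

enum mem_area { AREA_RAM, AREA_GFX };

/* Assumed graphics window for this example: 0xA0000..0xAFFFF. */
#define GFX_START 0xA0000u
#define GFX_END   0xB0000u

/* Fast but fragile: decide from the segment register alone. */
static enum mem_area classify_by_segment(uint16_t seg)
{
    return (seg >= 0xA000 && seg < 0xB000) ? AREA_GFX : AREA_RAM;
}

/* Slower but robust: decide from the full effective address. */
static enum mem_area classify_by_address(uint16_t seg, uint16_t off)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;
    return (linear >= GFX_START && linear < GFX_END) ? AREA_GFX : AREA_RAM;
}
```

With segment 0x9FF0 and offset 0x0200 the linear address is 0xA0100, which lies inside the graphics window even though the segment value itself looks like plain RAM. The segment-only test picks the wrong handler for that access, while the address-based test does not.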
In this version I have started this internal reworking. Pretty much none of it should be visible yet, but a few opcodes are somewhat slower than before. I need to change every single opcode (which I have spent nearly the last year implementing) to use this new architecture, so it will take quite a long time before this work is finished. I plan to change a few opcodes (or opcode groups) in each version, and try to keep the two-week release window even while I am doing this change. I will also implement other changes and improvements, so this major rewrite is a sort of constant background process. Removing the Direct screen update method was also partly due to this internal reworking, as it would have gotten in the way of some required internal changes.
I haven't done much about the general compatibility this last week, as I have focused on the audio issues. I hope to look into a couple of new games again for the next version, and look into improving either the screen scaling methods or the touchpad mouse emulation. I'm not yet sure which.
Anyways, thanks again for your interest in DSx86, and sorry for this long blog post! Being on vacation gives me more time to work on DSx86, and also makes it possible to write longer blog posts. Oh, and happy 4th of July to all you celebrating it! :-)