Lab 01 - 6502 Assembly Language Lab
This post marks the start of a series documenting my journey in learning assembly language and, along with it, software portability and optimization. This is part of the SPO600 course I'm taking at Seneca Polytechnic, where I will be working on a number of labs and contributing to open source projects.
In this post I go over my first lab, where we learn the basics of assembly by programming a 6502 microprocessor to color pixels on a screen. We also learn how to analyze an assembly program and find ways to improve its performance, i.e., to accomplish the task in as few CPU cycles as possible.
6502 Microprocessor
The 6502 is a simple 8-bit microprocessor developed by MOS Technology back in 1975. It was a relatively cheap and simple CPU, and so it found its way into various popular machines such as the Apple II computer and the Nintendo Entertainment System (NES). Because of this simplicity, it is the processor we'll be using to get acquainted with assembly language. To be specific, we'll be using a web-based emulator of the 6502 found at http://6502.cdot.systems/
Initial Program
To start, here is a simple assembly program that renders a solid color on the screen.
        lda #$00        ; set a pointer in memory location $40 to point to $0200
        sta $40         ; ... low byte ($00) goes in address $40
        lda #$02
        sta $41         ; ... high byte ($02) goes into address $41
        lda #$07        ; colour number
        ldy #$00        ; set index to 0
    loop:
        sta ($40),y     ; set pixel colour at the address (pointer)+Y
        iny             ; increment index
        bne loop        ; continue until done the page (256 pixels)
        inc $41         ; increment the page
        ldx $41         ; get the current page number
        cpx #$06        ; compare with 6
        bne loop        ; continue until done all pages
To analyze this program, we must determine the number of cycles each instruction takes and multiply that by the number of times the instruction is executed. I used the 6502 Family CPU Reference website to look up the cycle count of each instruction. A branch takes 3 cycles when it is taken and 2 when it falls through, so the fall-through case is counted separately in the alt_cycles and alt_executes columns.
Here is our analysis of the program's performance.
        lda #$00        ; cycles=2, executes=1,    total_cycles=2
        sta $40         ; cycles=3, executes=1,    total_cycles=3
        lda #$02        ; cycles=2, executes=1,    total_cycles=2
        sta $41         ; cycles=3, executes=1,    total_cycles=3
        lda #$07        ; cycles=2, executes=1,    total_cycles=2
        ldy #$00        ; cycles=2, executes=1,    total_cycles=2
    loop:
        sta ($40),y     ; cycles=6, executes=1024, total_cycles=6144
        iny             ; cycles=2, executes=1024, total_cycles=2048
        bne loop        ; cycles=3, executes=1020, alt_cycles=2, alt_executes=4, total_cycles=3068
        inc $41         ; cycles=5, executes=4,    total_cycles=20
        ldx $41         ; cycles=3, executes=4,    total_cycles=12
        cpx #$06        ; cycles=2, executes=4,    total_cycles=8
        bne loop        ; cycles=3, executes=3,    alt_cycles=2, alt_executes=1, total_cycles=11

    ; Total Cycles Overall: 11325 cycles
    ; CPU Speed:            1 MHz
    ; uS per clock:         1
    ; Time:                 11325 uS
    ;                     : 11.325 mS
    ;                     : 0.011325 S
Analyzing this code, we see that the program takes 11325 cycles, or about 0.011325 seconds at a clock speed of 1 MHz. We can certainly do better than that.
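To double-check the arithmetic, here is a small Python sketch that adds up the same table. The row layout and the names ROWS, total_cycles, and CLOCK_HZ are my own for illustration, and like the table it ignores any extra cycle for a branch that crosses a page boundary.

```python
# Each row: (instruction, cycles, executions, alt_cycles, alt_executions).
# The "alt" columns cover a branch that is not taken (2 cycles instead of 3).
ROWS = [
    ("lda #$00",         2, 1,    0, 0),
    ("sta $40",          3, 1,    0, 0),
    ("lda #$02",         2, 1,    0, 0),
    ("sta $41",          3, 1,    0, 0),
    ("lda #$07",         2, 1,    0, 0),
    ("ldy #$00",         2, 1,    0, 0),
    ("sta ($40),y",      6, 1024, 0, 0),
    ("iny",              2, 1024, 0, 0),
    ("bne loop (inner)", 3, 1020, 2, 4),
    ("inc $41",          5, 4,    0, 0),
    ("ldx $41",          3, 4,    0, 0),
    ("cpx #$06",         2, 4,    0, 0),
    ("bne loop (outer)", 3, 3,    2, 1),
]


def total_cycles(rows):
    """Sum cycles * executions plus the not-taken branch cases."""
    return sum(c * n + ac * an for _, c, n, ac, an in rows)


CLOCK_HZ = 1_000_000  # 1 MHz, as assumed in the analysis
cycles = total_cycles(ROWS)
seconds = cycles / CLOCK_HZ
print(cycles, "cycles,", seconds, "s")
```

Changing a single row (for example, the cycle count of the store) and re-running makes it easy to see how much each instruction contributes to the total.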
One thing we can do is figure out the most essential instruction in the program and see if we can exchange it for something faster. Given that the whole point of the program is to color pixels on a screen, the most essential instruction is the one used to color a pixel, which is sta ($40),y.
Looking through the 6502 reference site, I see that this instruction supports a number of addressing modes. We need to be able to address $0200 - $05ff, which lies outside the zero page, so that rules out the plain zero-page addressing modes. It also rules out the X-indexed indirect mode, ($nn,X), which adds the index to the zero-page address before dereferencing the pointer. What we need is to apply the index to the final address, so that leaves us with the Y-indexed modes. (The absolute X-indexed form, STA $nnnn,X, would work too and costs the same 5 cycles, but the program already uses Y as its index.) We have the following options for Y-indexing and their respective cycle counts.
        STA $nnnn,Y     ; 5 cycles
        STA ($nn),Y     ; 6 cycles
Based on this, it would be more efficient for us to use STA $nnnn,Y because it is 1 cycle faster than the STA ($nn),Y we are currently using.
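As a rough illustration of the difference between the two modes, here is a Python sketch of how each one arrives at the address it writes to. The function names and the dictionary standing in for memory are hypothetical; the point is only that the indirect form has to read its pointer out of the zero page first, which is plausibly where the extra cycle goes.

```python
def indirect_indexed(memory, zp, y):
    """STA ($nn),Y: read a 16-bit pointer from the zero page, then add Y."""
    low = memory[zp]
    high = memory[(zp + 1) & 0xFF]
    return (((high << 8) | low) + y) & 0xFFFF


def absolute_indexed(base, y):
    """STA $nnnn,Y: the 16-bit base is part of the instruction; just add Y."""
    return (base + y) & 0xFFFF


# The pointer the original program stores at $40/$41 (low byte first).
memory = {0x40: 0x00, 0x41: 0x02}

addr_a = indirect_indexed(memory, 0x40, 0x10)
addr_b = absolute_indexed(0x0200, 0x10)
print(hex(addr_a), hex(addr_b))
```

Both calls land on the same pixel address, so swapping the mode changes the cost of the store but not where it writes.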
Now, what changes do we have to make in order to use this alternative instruction? For one, we can remove everything related to the pointer at $40 and have the store start from the absolute address $0200 instead. That leaves us with this.
        lda #$07        ; colour number
        ldy #$00        ; set index to 0
    loop:
        sta $0200,y     ; set pixel colour at the address $0200+Y
        iny             ; increment index
        bne loop        ; continue until done the page (256 pixels)
However, this is still incomplete as it only fills up a quarter of the screen.
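A quick Python model of the loop shows why. This is a sketch of the control flow only, with BASE and SCREEN as names I made up; it treats Y as an 8-bit value that wraps from $ff back to $00, which is when bne stops branching.

```python
BASE = 0x0200
SCREEN = range(0x0200, 0x0600)  # the display memory, $0200 - $05ff

written = set()
y = 0
while True:
    written.add(BASE + y)   # sta $0200,y
    y = (y + 1) & 0xFF      # iny: the 8-bit register wraps from $ff to $00
    if y == 0:              # bne loop falls through once Y is zero
        break

print(len(written), "of", len(SCREEN), "pixels written")
print(hex(min(written)), "to", hex(max(written)))
```

Since Y can only hold 256 distinct values, a single base address can reach one page of the four that make up the screen.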
To fix this, we can copy the loop three more times, changing the starting address each time, like so.
        lda #$07        ; colour number
        ldy #$00        ; set index to 0
    loop1:
        sta $0200,y     ; set pixel colour at the address $0200+Y
        iny             ; increment index
        bne loop1       ; continue until done the page (256 pixels)
    loop2:
        sta $0300,y     ; set pixel colour at the address $0300+Y
        iny             ; increment index
        bne loop2       ; continue until done the page (256 pixels)
    loop3:
        sta $0400,y     ; set pixel colour at the address $0400+Y
        iny             ; increment index
        bne loop3       ; continue until done the page (256 pixels)
    loop4:
        sta $0500,y     ; set pixel colour at the address $0500+Y
        iny             ; increment index
        bne loop4       ; continue until done the page (256 pixels)
Now the program works as expected. Let us try to calculate this new program's performance.
        lda #$07        ; cycles=2, executes=1,   total_cycles=2
        ldy #$00        ; cycles=2, executes=1,   total_cycles=2
    loop1:
        sta $0200,y     ; cycles=5, executes=256, total_cycles=1280
        iny             ; cycles=2, executes=256, total_cycles=512
        bne loop1       ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
    loop2:
        sta $0300,y     ; cycles=5, executes=256, total_cycles=1280
        iny             ; cycles=2, executes=256, total_cycles=512
        bne loop2       ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
    loop3:
        sta $0400,y     ; cycles=5, executes=256, total_cycles=1280
        iny             ; cycles=2, executes=256, total_cycles=512
        bne loop3       ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
    loop4:
        sta $0500,y     ; cycles=5, executes=256, total_cycles=1280
        iny             ; cycles=2, executes=256, total_cycles=512
        bne loop4       ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767

    ; Total Cycles Overall: 10240 cycles
    ; CPU Speed:            1 MHz
    ; uS per clock:         1
    ; Time:                 10240 uS
    ;                     : 10.24 mS
    ;                     : 0.01024 S
As you can see, this program is a bit better: we have brought the overall cost down from 11325 cycles to 10240 cycles, shaving 1085 cycles off the original program. However, we can still optimize the program even more.
If you look closely, the new program is redundant, with four loops doing pretty much the same thing. Each takes a starting address and colors 256 consecutive addresses (basically until the Y register wraps back around to zero). They also all increment and use the Y register in the same way, i.e., as an addressing index. To eliminate this redundancy, we can handle all four starting addresses in a single loop, so that the Y register is incremented and tested only 256 times instead of 1024. This is what the refactored code looks like.
        lda #$07        ; colour number
        ldy #$00        ; set index to 0
    loop:
        sta $0200,y     ; set pixel colour at the address $0200+Y
        sta $0300,y     ; set pixel colour at the address $0300+Y
        sta $0400,y     ; set pixel colour at the address $0400+Y
        sta $0500,y     ; set pixel colour at the address $0500+Y
        iny             ; increment index
        bne loop        ; continue until done the page (256 pixels)
As you can see, the code is significantly more concise, and it works as expected too. Let's analyze this program and see how much performance we gained.
        lda #$07        ; cycles=2, executes=1,   total_cycles=2
        ldy #$00        ; cycles=2, executes=1,   total_cycles=2
    loop:
        sta $0200,y     ; cycles=5, executes=256, total_cycles=1280
        sta $0300,y     ; cycles=5, executes=256, total_cycles=1280
        sta $0400,y     ; cycles=5, executes=256, total_cycles=1280
        sta $0500,y     ; cycles=5, executes=256, total_cycles=1280
        iny             ; cycles=2, executes=256, total_cycles=512
        bne loop        ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767

    ; Total Cycles Overall: 6403 cycles
    ; CPU Speed:            1 MHz
    ; uS per clock:         1
    ; Time:                 6403 uS
    ;                     : 6.403 mS
    ;                     : 0.006403 S
As you can see, our program now takes 6403 cycles instead of 11325, which makes it about 1.77x as fast as the original program we started with! No doubt we have successfully optimized our coloring program.
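To see where the savings come from, the three versions can be written as one formula each. This is a Python sketch under the same assumptions as the tables (branch taken costs 3 cycles, not taken costs 2, no page-crossing penalty); loop_cycles and the variable names are my own.

```python
def loop_cycles(body, iterations=256):
    """One counted loop: body + iny on every pass, bne taken on all but the last."""
    return iterations * (body + 2) + (iterations - 1) * 3 + 2


# Original: 14 cycles of setup, four passes of the inner loop with a 6-cycle
# store, four runs of inc/ldx/cpx, and the outer bne (taken 3 times, then not).
original = 14 + 4 * loop_cycles(6) + 4 * (5 + 3 + 2) + (3 * 3 + 2)

# Four separate loops, each with a single 5-cycle store.
four_loops = 4 + 4 * loop_cycles(5)

# One loop whose body is four 5-cycle stores.
single_loop = 4 + loop_cycles(4 * 5)

print(original, four_loops, single_loop)
print(round(original / single_loop, 2))
```

The stores cost the same 5120 cycles in the last two versions; the difference is that iny and bne now run for one loop's worth of iterations rather than four.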
Modifying the program
We can modify the code to render a different color instead of yellow. All we have to do is change the $07, which corresponds to yellow, to $0e for light blue. This is what our modified program looks like.
        lda #$0e        ; colour number for light blue
        ldy #$00        ; set index to 0
    loop:
        sta $0200,y     ; set pixel colour at the address $0200+Y
        sta $0300,y     ; set pixel colour at the address $0300+Y
        sta $0400,y     ; set pixel colour at the address $0400+Y
        sta $0500,y     ; set pixel colour at the address $0500+Y
        iny             ; increment index
        bne loop        ; continue until done the page (256 pixels)
If we want to display a different color for each quarter of the screen we can load another color value into the accumulator before storing it into the next set of addresses like so.
        ldy #$00
    loop:
        lda #$07        ; colour number for yellow
        sta $0200,y
        lda #$0e        ; colour number for light blue
        sta $0300,y
        lda #$02        ; colour number for red
        sta $0400,y
        lda #$05        ; colour number for green
        sta $0500,y
        iny
        bne loop
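Each "quarter" here is a horizontal band of the screen rather than a corner. A small Python sketch of the mapping, assuming the emulator's display is 32 pixels wide and laid out row by row starting at $0200 (WIDTH, SCREEN_BASE, and pixel_position are names I chose for illustration):

```python
WIDTH = 32            # assumption: the emulator's screen is 32x32 pixels
SCREEN_BASE = 0x0200


def pixel_position(address):
    """Map a display address to (row, column), assuming row-major layout."""
    offset = address - SCREEN_BASE
    return divmod(offset, WIDTH)


for page_base in (0x0200, 0x0300, 0x0400, 0x0500):
    first_row, _ = pixel_position(page_base)
    last_row, _ = pixel_position(page_base + 0xFF)
    print(hex(page_base), "covers rows", first_row, "to", last_row)
```

Under that assumption each 256-byte page covers 8 full rows, so the four colors appear as stacked stripes.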
If we want to randomize the color used for each pixel, we can use the pseudo-random number generator built into the 6502 emulator we are using. The random value is exposed at address $fe, so we should load the accumulator with whatever value is stored there before coloring each pixel.
        ldy #$00
    loop:
        lda $fe         ; value at this address changes randomly
        sta $0200,y
        lda $fe         ; reload so each page gets its own random value
        sta $0300,y
        lda $fe
        sta $0400,y
        lda $fe
        sta $0500,y
        iny
        bne loop
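Here is a rough Python model of the intent, with a fresh random byte read for every pixel. The read_fe function is a hypothetical stand-in for reading $fe, and masking with $0f reflects my assumption that the emulator uses only the low four bits of the stored byte to pick one of its 16 colors.

```python
import random


def read_fe():
    """Stand-in for reading $fe: a fresh pseudo-random byte on every read."""
    return random.randrange(256)


screen = {}
for address in range(0x0200, 0x0600):
    value = read_fe()                # lda $fe
    screen[address] = value & 0x0F   # assumption: low 4 bits select the colour

print(len(screen), "pixels,", len(set(screen.values())), "distinct colours used")
```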
Thoughts
Before I started studying assembly and attempting this lab, I always found assembly to be a scary language because of how cryptic it looked. At the same time, that cryptic look made me really curious to learn more about it. Now that I have finished this lab, assembly doesn't seem as daunting as it used to. Going through this lab made me realize how powerful knowing assembly can be: you are in control of how memory is manipulated and where data is stored. It has also opened my eyes to how other programming languages work at a low level. I have definitely gained a deeper appreciation for assembly and am now more motivated to learn about it, the differences between architectures, and how I can disassemble software to optimize it.