Lab 01 - 6502 Assembly Language Lab

This post marks the start of a series documenting my journey in learning assembly language and, along with it, software portability and optimization. This is part of the SPO600 course I'm taking at Seneca Polytechnic, where I will be working on a number of labs and contributing to open source projects.

In this post I go over my first lab, where we learn the basics of assembly by programming a 6502 microprocessor to color pixels on a screen. We also learn how to analyze an assembly program and find ways to improve its performance, i.e., to accomplish the task in as few CPU cycles as possible.

6502 Microprocessor

The 6502 microprocessor is a simple 8-bit processor developed by MOS Technology back in 1975. It was a relatively cheap and simple CPU, and so found its way into various popular machines such as the Apple II computer and the Nintendo Entertainment System (NES). Because of this processor's simplicity, it is what we'll be using to get acquainted with assembly language. To be specific, we'll be using a web-based emulator of the 6502 found at http://6502.cdot.systems/

Initial Program

To start, here is a simple assembly program that renders a solid color on the screen.

      lda #$00       ; set a pointer in memory location $40 to point to $0200
      sta $40        ; ... low byte ($00) goes in address $40
      lda #$02
      sta $41        ; ... high byte ($02) goes into address $41
      lda #$07       ; colour number
      ldy #$00       ; set index to 0
loop: sta ($40),y    ; set pixel colour at the address (pointer)+Y
      iny            ; increment index
      bne loop       ; continue until done the page (256 pixels)
      inc $41        ; increment the page
      ldx $41        ; get the current page number
      cpx #$06       ; compare with 6
      bne loop       ; continue until done all pages

In order to analyze this program, we must determine the number of cycles each instruction takes and multiply that by the number of times the instruction is executed. I used the 6502 Family CPU Reference website as a reference when determining the number of cycles each instruction takes.

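Here is our analysis of the program's performance. For branches, cycles and executes cover the times the branch is taken, while alt_cycles and alt_executes cover the times it falls through.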

      lda #$00       ; cycles=2, executes=1   ,             ,               , total_cycles=2
      sta $40        ; cycles=3, executes=1   ,             ,               , total_cycles=3
      lda #$02       ; cycles=2, executes=1   ,             ,               , total_cycles=2
      sta $41        ; cycles=3, executes=1   ,             ,               , total_cycles=3
      lda #$07       ; cycles=2, executes=1   ,             ,               , total_cycles=2
      ldy #$00       ; cycles=2, executes=1   ,             ,               , total_cycles=2
loop: sta ($40),y    ; cycles=6, executes=1024,             ,               , total_cycles=6144
      iny            ; cycles=2, executes=1024,             ,               , total_cycles=2048
      bne loop       ; cycles=3, executes=1020, alt_cycles=2, alt_executes=4, total_cycles=3068
      inc $41        ; cycles=5, executes=4   ,             ,               , total_cycles=20
      ldx $41        ; cycles=3, executes=4   ,             ,               , total_cycles=12
      cpx #$06       ; cycles=2, executes=4   ,             ,               , total_cycles=8
      bne loop       ; cycles=3, executes=3   , alt_cycles=2, alt_executes=1, total_cycles=11
      
      ; Total Cycles Overall: 11325 cycles
      ; CPU Speed: 1 MHz
      ; uS per clock: 1
      ; Time: 11325 uS
      ;     : 11.325 mS
      ;     : 0.011325 S   

Analyzing this code, we see that the program takes 11325 cycles, or about 0.011325 seconds at a clock speed of 1 MHz. We can certainly do better than that.
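To double-check the arithmetic in the table above, here is a small Python sketch that multiplies cycles by executions for each row and adds everything up. The row layout and the names ROWS, total_cycles, and CLOCK_HZ are my own for illustration. The sketch also assumes, like the table, that no branch pays an extra cycle for crossing a page boundary.

```python
# Each row: (instruction, cycles, executes, alt_cycles, alt_executes).
# The alt_* columns cover the times a branch is NOT taken (2 cycles, not 3).
ROWS = [
    ("lda #$00",    2, 1,    0, 0),
    ("sta $40",     3, 1,    0, 0),
    ("lda #$02",    2, 1,    0, 0),
    ("sta $41",     3, 1,    0, 0),
    ("lda #$07",    2, 1,    0, 0),
    ("ldy #$00",    2, 1,    0, 0),
    ("sta ($40),y", 6, 1024, 0, 0),
    ("iny",         2, 1024, 0, 0),
    ("bne loop",    3, 1020, 2, 4),  # inner branch falls through once per page
    ("inc $41",     5, 4,    0, 0),
    ("ldx $41",     3, 4,    0, 0),
    ("cpx #$06",    2, 4,    0, 0),
    ("bne loop",    3, 3,    2, 1),  # outer branch falls through once, at the end
]

def total_cycles(rows):
    """Sum cycles * executes plus alt_cycles * alt_executes over all rows."""
    return sum(c * n + ac * an for _, c, n, ac, an in rows)

CLOCK_HZ = 1_000_000  # 1 MHz, as assumed in the table

cycles = total_cycles(ROWS)
print(cycles)             # 11325
print(cycles / CLOCK_HZ)  # run time in seconds
```

Keeping the table as data makes it easy to change one cycle count and see how the total moves.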

One thing we can do is figure out the most essential instruction in the program and see if we can exchange it for something faster. Given that the whole point of the program is to color pixels on a screen, the most essential instruction would be the one used to color a pixel which is sta ($40),y.

Looking through the 6502 reference site, I see that this instruction supports a number of addressing modes. We need to reach $0200 - $05ff, which lies outside the zero page, so that rules out the plain zero-page addressing modes. The indexed-indirect mode, ($nn,X), adds X to the zero-page address before dereferencing, so changing X selects a different pointer instead of stepping through the pixels that follow one base address. That rules it out as well. What we need is for the index to be applied after the base address has been determined, which leaves us with the Y-indexed modes that do just that. We have the following options for Y-indexing and their respective cycle counts.

STA $nnnn,Y ; 5 cycles
STA ($nn),Y ; 6 cycles

Based on this, it would be more efficient for us to use STA $nnnn,Y because it is 1 cycle faster than the STA ($nn),Y we are currently using.

Now, what changes do we have to make in order to incorporate this alternative instruction? For one, we can remove everything related to the pointer to $0200 and store directly to the absolute address $0200 instead. That leaves us with this.

      lda #$07       ; colour number
      ldy #$00       ; set index to 0
loop: sta $0200,y    ; set pixel colour at the address $0200+Y
      iny            ; increment index
      bne loop       ; continue until done the page (256 pixels)

However, this is still incomplete as it only fills up a quarter of the screen.
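To see why only a quarter of the screen is covered, here is a rough Python model of that single loop. It is a sketch, not an emulator: the function name fill_one_page and the dictionary standing in for memory are my own, and the screen range $0200 - $05ff is taken from earlier in this post.

```python
SCREEN_START, SCREEN_END = 0x0200, 0x05FF  # display memory range used in this post

def fill_one_page(base, colour):
    """Mimic `sta base,y / iny / bne loop` with an 8-bit Y register."""
    written = {}
    y = 0
    while True:
        written[base + y] = colour  # sta base,y
        y = (y + 1) & 0xFF          # iny wraps from $ff back to $00
        if y == 0:                  # bne falls through once Y is zero again
            break
    return written

pixels = fill_one_page(0x0200, 0x07)
screen_size = SCREEN_END - SCREEN_START + 1

print(len(pixels), "of", screen_size)            # 256 of 1024
print(hex(min(pixels)), "to", hex(max(pixels)))  # 0x200 to 0x2ff
```

Since Y is only 8 bits wide, a single absolute base address can reach at most 256 pixels, which is one of the four pages that make up the screen.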

What we can do to fix this is to copy the loop three more times but change the starting address each time like so.

      lda #$07       ; colour number
      ldy #$00       ; set index to 0
loop1: sta $0200,y   ; set pixel colour at the address $0200+Y
      iny            ; increment index
      bne loop1      ; continue until done the page (256 pixels)
loop2: sta $0300,y   ; set pixel colour at the address $0300+Y
      iny            ; increment index
      bne loop2      ; continue until done the page (256 pixels)
loop3: sta $0400,y   ; set pixel colour at the address $0400+Y
      iny            ; increment index
      bne loop3      ; continue until done the page (256 pixels)
loop4: sta $0500,y   ; set pixel colour at the address $0500+Y
      iny            ; increment index
      bne loop4      ; continue until done the page (256 pixels)

Now the program works as expected. Let us try to calculate this new program's performance.

      lda #$07       ; cycles=2, executes=1  ,             ,               , total_cycles=2
      ldy #$00       ; cycles=2, executes=1  ,             ,               , total_cycles=2
loop1: sta $0200,y   ; cycles=5, executes=256,             ,               , total_cycles=1280
      iny            ; cycles=2, executes=256,             ,               , total_cycles=512
      bne loop1      ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
loop2: sta $0300,y   ; cycles=5, executes=256,             ,               , total_cycles=1280
      iny            ; cycles=2, executes=256,             ,               , total_cycles=512
      bne loop2      ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
loop3: sta $0400,y   ; cycles=5, executes=256,             ,               , total_cycles=1280
      iny            ; cycles=2, executes=256,             ,               , total_cycles=512
      bne loop3      ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
loop4: sta $0500,y   ; cycles=5, executes=256,             ,               , total_cycles=1280
      iny            ; cycles=2, executes=256,             ,               , total_cycles=512
      bne loop4      ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
      
      ; Total Cycles Overall: 10240 cycles
      ; CPU Speed: 1 MHz
      ; uS per clock: 1
      ; Time: 10240 uS
      ;     : 10.24 mS
      ;     : 0.01024 S

As you can see, this program is a bit better now, as we have brought the overall cost down from 11325 cycles to 10240 cycles, shaving 1085 cycles off the original program. However, we can still optimize the program even more.
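The 10240 figure can also be reached with a short formula instead of a row-by-row table. The Python sketch below is my own way of writing it down (the constant and function names are illustrative), using only the cycle counts already quoted in this post.

```python
PIXELS_PER_PAGE = 256
PAGES = 4

# Cycle counts quoted in this post.
STA_IND_Y = 6       # sta ($nn),y
STA_ABS_Y = 5       # sta $nnnn,y
INY = 2
BNE_TAKEN = 3
BNE_NOT_TAKEN = 2
LOAD_IMMEDIATE = 2  # lda #$nn or ldy #$nn

def page_loop_cycles(store_cycles):
    """Cycles for one sta/iny/bne loop that steps Y through all 256 values."""
    body = PIXELS_PER_PAGE * (store_cycles + INY)
    branches = (PIXELS_PER_PAGE - 1) * BNE_TAKEN + BNE_NOT_TAKEN
    return body + branches

setup = 2 * LOAD_IMMEDIATE  # lda #$07 and ldy #$00
four_loops = setup + PAGES * page_loop_cycles(STA_ABS_Y)

ORIGINAL_TOTAL = 11325      # total from the first table
saved = ORIGINAL_TOTAL - four_loops
saved_by_store = PAGES * PIXELS_PER_PAGE * (STA_IND_Y - STA_ABS_Y)

print(four_loops)      # 10240
print(saved)           # 1085
print(saved_by_store)  # 1024
```

Of the 1085 cycles saved, 1024 come from the cheaper store instruction. The remaining 61 come from dropping the pointer setup (10 cycles) and the per-page bookkeeping of inc, ldx, cpx, and the outer bne (51 cycles).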

You may notice that the new program is a bit redundant, with four loops doing pretty much the same thing. Each takes a starting address and colors adjacent addresses 256 times (basically until the Y-register wraps back around to zero). They also all increment and use the Y-register in the same way, i.e., as an addressing index. To eliminate this redundancy, we can handle all four starting addresses in a single loop, so we only have to increment the Y-register and branch once for every four pixels. This is what the refactored code looks like.

      lda #$07       ; colour number
      ldy #$00       ; set index to 0
loop: sta $0200,y    ; set pixel colour at the address $0200+Y
      sta $0300,y    ; set pixel colour at the address $0300+Y
      sta $0400,y    ; set pixel colour at the address $0400+Y
      sta $0500,y    ; set pixel colour at the address $0500+Y
      iny            ; increment index
      bne loop       ; continue until done the page (256 pixels)

As you can see, the code looks significantly more concise, and it works as expected too. Let's analyze this program and see how much performance we gained.

      lda #$07       ; cycles=2, executes=1  ,             ,               , total_cycles=2
      ldy #$00       ; cycles=2, executes=1  ,             ,               , total_cycles=2
loop: sta $0200,y    ; cycles=5, executes=256,             ,               , total_cycles=1280
      sta $0300,y    ; cycles=5, executes=256,             ,               , total_cycles=1280
      sta $0400,y    ; cycles=5, executes=256,             ,               , total_cycles=1280
      sta $0500,y    ; cycles=5, executes=256,             ,               , total_cycles=1280
      iny            ; cycles=2, executes=256,             ,               , total_cycles=512
      bne loop       ; cycles=3, executes=255, alt_cycles=2, alt_executes=1, total_cycles=767
      
      ; Total Cycles Overall: 6403 cycles
      ; CPU Speed: 1 MHz
      ; uS per clock: 1
      ; Time: 6403 uS
      ;     : 6.403 mS
      ;     : 0.006403 S

As you can see, our program now takes 6403 cycles, which makes it about 1.77x as fast as the original program we started with! No doubt we have successfully optimized our coloring program.
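It is worth checking where the savings over the four-loop version come from. The Python sketch below compares the loop overhead (the iny and bne instructions) of the two versions, using the totals from the tables in this post. The variable names are my own.

```python
ORIGINAL, FOUR_LOOPS, MERGED = 11325, 10240, 6403  # totals from the tables above

# Overhead of one page-sized loop: 256 iny, 255 taken bne, 1 not-taken bne.
overhead_per_loop = 256 * 2 + 255 * 3 + 1 * 2

# The four-loop version pays this overhead four times, the merged loop once.
overhead_saved = 4 * overhead_per_loop - 1 * overhead_per_loop

speedup = ORIGINAL / MERGED

print(overhead_per_loop)  # 1279
print(overhead_saved)     # 3837
print(round(speedup, 2))  # 1.77
```

The 3837 cycles of avoided overhead match the difference between 10240 and 6403 exactly. Both versions still execute 1024 stores at 5 cycles each, so the whole gain comes from incrementing and branching a quarter as often.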

Modifying the Program

We can modify the code to render a different color instead of yellow. All we have to do is change the $07, which corresponds to yellow, to $0e for light blue. This is what our modified program looks like.

      lda #$0e       ; colour number for light blue
      ldy #$00       ; set index to 0
loop: sta $0200,y    ; set pixel colour at the address $0200+Y
      sta $0300,y    ; set pixel colour at the address $0300+Y
      sta $0400,y    ; set pixel colour at the address $0400+Y
      sta $0500,y    ; set pixel colour at the address $0500+Y
      iny            ; increment index
      bne loop       ; continue until done the page (256 pixels)

If we want to display a different color for each quarter of the screen we can load another color value into the accumulator before storing it into the next set of addresses like so.

      ldy #$00       
loop: lda #$07       ; colour number for yellow
      sta $0200,y
      lda #$0e       ; colour number for light blue
      sta $0300,y
      lda #$02       ; colour number for red
      sta $0400,y
      lda #$05       ; colour number for green
      sta $0500,y
      iny
      bne loop
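Reloading the accumulator four times on every pass is not free, so I was curious how this compares with going back to four separate loops that each load their color once. The Python sketch below estimates both using the cycle counts from this post. The four-loop variant is hypothetical (I did not write or run it in the emulator), and the names are my own.

```python
STA_ABS_Y, LDA_IMM, LDY_IMM, INY = 5, 2, 2, 2
BNE_TAKEN, BNE_NOT_TAKEN = 3, 2
PASSES = 256  # Y runs through 256 values before wrapping to zero

branch_cycles = (PASSES - 1) * BNE_TAKEN + BNE_NOT_TAKEN

# Variant from the post: one loop, colour reloaded before each of the 4 stores.
merged = LDY_IMM + PASSES * (4 * (LDA_IMM + STA_ABS_Y) + INY) + branch_cycles

# Hypothetical alternative: four loops, each colour loaded once before its loop.
one_loop = PASSES * (STA_ABS_Y + INY) + branch_cycles
separate = LDY_IMM + 4 * (LDA_IMM + one_loop)

print(merged)    # 8449
print(separate)  # 10246
```

By this estimate, the single loop still comes out ahead even with the extra loads, because the repeated iny and bne overhead of four loops costs more than the 2048 cycles spent reloading colors.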

If we want to randomize the color used for each pixel, we can use the pseudo-random number generator built into the 6502 emulator we are using. The random value is found at address $fe, so we should load the accumulator with whatever value is stored there before coloring a pixel.

      ldy #$00
loop: lda $fe        ; value at this address changes randomly
      sta $0200,y
      lda $fe        ; read again so this pixel gets its own value
      sta $0300,y
      lda $fe
      sta $0400,y
      lda $fe
      sta $0500,y
      iny
      bne loop

Thoughts

Before I started studying assembly and attempting this lab, I always found assembly to be a scary language because of how cryptic it looked. At the same time, that same cryptic quality made me want to learn more about it. Now that I have finished this lab, assembly doesn't seem as daunting as it used to. Going through this lab made me realize how powerful learning assembly can be. You are in control of how memory is manipulated and where values are stored. This has also opened my eyes to how other programming languages work at a low level. I have definitely gained a deeper appreciation for assembly and am now more motivated to learn about it, the differences between architectures, and how I can disassemble software to optimize it.
