The Pi Piper – Blinking a LED From an MVS Application Program


I had managed to blink a LED the other day, but it involved running in supervisor state to get Hercules to issue a shell command and it was just pretty icky. I tried to sanitize it by directing the output of one of MVS’s printers to a Raspberry Pi LED, but I kept getting I/O errors. The GPIO interface looks like a dumb file at /sys/class/gpio/gpio18/value, but while it happily responds to 1’s and 0’s it throws an error for anything else, and I always seemed to get a blank or some other garbage at the end of a run.

I got around it by writing a small program to scrub the output and send only valid commands to the LED. I got the GPIO access code from this page and an example of the piping code from this one.
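
A minimal sketch of what such a scrubber might look like. It is not the program described above: the function name is made up for illustration, and it assumes the value file tolerates repeated one-byte writes on a single open handle (reopening the file for every command would be the cautious alternative).

```c
#include <stdio.h>

/* A GPIO value file accepts only the characters '0' and '1'. */
static int is_gpio_command(int c)
{
    return c == '0' || c == '1';
}

/*
 * Copy valid commands from `in` to `out`, flushing after each one so every
 * command reaches the file as its own write. Blanks, newlines and any other
 * bytes are dropped. Returns the number of commands forwarded, or -1 on a
 * write error. (Hypothetical helper, not from the original program.)
 */
long forward_valid_commands(FILE *in, FILE *out)
{
    long forwarded = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (!is_gpio_command(c))
            continue;                       /* scrub the garbage */
        if (fputc(c, out) == EOF || fflush(out) == EOF)
            return -1;
        forwarded++;
    }
    return forwarded;
}
```

A small `main` would call this with `stdin` (fed from the pipe) as `in` and the opened /sys/class/gpio/gpio18/value file as `out`.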

This is still a lot of moving parts, it’s output only, and it’s probably unacceptably slow for any serious GPIO access, but it’s a good place to stop. I’m going to see about putting a hook into the Hercules emulator source to do either a more direct passthrough or direct access to the GPIOs.

So It Works, But the Software is NOT Happy! – Blinking a LED From MVS


The Hercules configuration file connects 370 unit addresses to Linux files, and the Raspberry Pi GPIO pins can be controlled through sysfs entries that can be written to like files. I tried defining one of the GPIO pins as a printer to get a connection from MVS to the Pi hardware. It sort of works, but Hercules is not happy about something and he makes MVS unhappy, so the whole thing abends. I still think there’s the germ of an idea here and my LED IS on so…

AS1802 and ASLINK

One of the things that the 1802 historically lacks is a linker to combine and relocate pre-compiled object modules.  LCC1802 combines everything at the source level and compiles/assembles it every time.  This has the advantage of clarity but it does make the output bulkier.

Through an odd happenstance I came across AS1802 and ASLINK which accompanies it.  AS1802 is part of the ASxxxx series of cross assemblers which are also used by SDCC.

I downloaded the assembler and linker and tried them out today. I created two assembly modules: fblink.asm, which calls an external routine _onems, and delay.asm, which contains that routine.  I assembled the two, generating the files fblink.rel and delay.rel, with commands like “as1802 fblink.asm”.

	.globl _onems
label1:	seq
	sep call
	.dw _onems
	req
	sep call
	.dw _onems
	br label1
	.BNDRY  8
	.globl	_onems
_onems:		;execute 1ms worth of instructions including call(15)/return(10) sequence. takes about 1 ms
;subroutine overhead soaks up 27 instruction time.
;each loop is 2 instruction times
;so the number of loops needed is 
;CPU speed/16000 less the 27 all divide by two
  .IFNDEF LCC1802CPUSPEED
LCC1802CPUSPEED .EQU 1600	;1.6MHZ default
  .ENDIF
  
  .IFLT LCC1802CPUSPEED-8000
	ldi	((LCC1802CPUSPEED/16)-15-10-2)/2
1$:	smi	1
	bnz	1$
  .ELSE
	ldi	((LCC1802CPUSPEED/16)-15-10-2)/4
2$:	smi	1
	sex	sp
	sex	sp
	bnz	2$
  .ENDIF
	sep	return
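
The loop-count arithmetic in the _onems routine above can be sketched in C as follows. This is only an illustration: the function name is invented, and it assumes the speed is given in kHz the way LCC1802CPUSPEED is (1600 for 1.6MHz), so dividing by 16 gives instruction times per millisecond. The header comment in the listing says "CPU speed/16000", which is the same thing with the speed in Hz.

```c
/*
 * Number of delay-loop iterations for a 1 ms delay, mirroring the
 * conditional assembly in the listing. cpu_khz is the clock in kHz.
 * (Hypothetical helper name, for illustration only.)
 */
unsigned onems_loop_count(unsigned cpu_khz)
{
    unsigned instr_per_ms = cpu_khz / 16;   /* 16 clocks per instruction */
    unsigned overhead = 15 + 10 + 2;        /* call + return + setup */
    unsigned budget;

    if (instr_per_ms <= overhead)
        return 0;                           /* clock too slow to need a loop */
    budget = instr_per_ms - overhead;

    /* Below 8 MHz the loop is 2 instructions, otherwise it is padded to 4
       so the count still fits in one byte. */
    return cpu_khz < 8000 ? budget / 2 : budget / 4;
}
```

For the 1600 default this gives 36, which is the 0x24 operand of the ldi in the linked output.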

I created a library file basic.lib which just contains one line with delay.rel on it.

The link command “aslink -i -u -l basic.lib fblink.rel” loads the fblink program, notices the reference to _onems and goes through the files pointed to by basic.lib finding it in delay.rel.

The result goes into fblink.ihx with fblink at 0000 and _onems at 0010.  This is actually pretty magnificent. It would be an awful lot of work to convert from my current assembler to AS1802, but if I ever get ambitious I might think about using it as a back end to the ASW macro-processor.

:0A0000007BD400107AD40010300009
:07001000F824FF013A12D5AC
:00000001FF
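
The three lines above are Intel HEX records: a colon, a byte count, a 16-bit address, a record type, the data bytes and a checksum chosen so that all the bytes sum to zero modulo 256. A small sketch of a checker, with a made-up function name, assuming the record is passed without a trailing newline:

```c
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_nibble(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Returns 1 if `record` is a well-formed Intel HEX record whose bytes
 * (count, address, type, data, checksum) sum to 0 modulo 256, else 0.
 * (Hypothetical helper, for illustration only.)
 */
int ihx_record_ok(const char *record)
{
    size_t len = strlen(record);
    size_t i;
    unsigned sum = 0;
    unsigned count;

    /* Shortest record is ":00000001FF": colon plus five bytes. */
    if (len < 11 || record[0] != ':' || (len - 1) % 2 != 0)
        return 0;

    for (i = 1; i < len; i += 2) {
        int hi = hex_nibble((unsigned char)record[i]);
        int lo = hex_nibble((unsigned char)record[i + 1]);
        if (hi < 0 || lo < 0)
            return 0;
        sum += (unsigned)(hi * 16 + lo);
    }

    /* The first byte says how many data bytes follow the 4-byte header. */
    count = (unsigned)(hex_nibble((unsigned char)record[1]) * 16
                       + hex_nibble((unsigned char)record[2]));
    if ((len - 1) / 2 != (size_t)count + 5)
        return 0;

    return (sum & 0xFFu) == 0;
}
```

All three records from fblink.ihx pass this check.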

It occurs to me that this is a fairly dumb example. The onems routine really needs to be compiled each time so that changes in cpu speed can be accommodated. I can see separately building it in a makefile but it wouldn’t be a good library candidate. Better examples would be the 16 bit math routines which are always compiled and included even if they’re not needed.

Odious Comparisons – RCA 1802 vs IBM 370

Now that I have a C compiler on the emulated /370 I of course had to benchmark it. I was faintly worried that, even though the Raspberry Pi host is fast, the emulated /370 would be too slow to be fun. That turns out not to be the case at all. I ran the Dhrystone benchmark which I’ve used for the 1802/1806. The emulated 370 ran 100,000 passes in 28 seconds for a score of 3570 Dhrystones per second – nominally about two MIPS. The best I’ve gotten out of an 1802/1806 is a bit over 200 Dhrystones/sec at 12MHz. (A Z80 at 4MHz scores about 300, by the way.)

I was curious about the underpinnings so I looked at the generated code for the whole of the Dhrystone suite and a couple of the procedures in it.

    1. For the whole 540 lines of C the 370 compiler generates 1360 lines of assembler vs about 3500 for the 1802. Not all of either of those counts is executed in a benchmark pass – there’s a lot of printing and labels and data definitions.
    2. The first three functions in the C code(PROC_6, PROC_7, and PROC_8) total about 70 lines of C and generate 205 lines of 370 assembly and 920 for the 1802.
    3. Because the 1802’s instruction set is so simple and regular I know it is executing right around 3600 instructions for each pass.  For the 370 I am reduced to saying that if 1757 Dhrystones/sec is one MIP then each pass is about 570 instructions.
    4. A corollary of 3. above is that the 17:1 speed advantage of the emulated 370 comes from about a 7:1 advantage in instruction power and a 2.5:1 advantage in execution rate.
    5. As a parting shot I compiled the Dhrystone benchmark native on the Raspberry Pi Zero.  The smallest number of passes that had a detectable pause was 10,000,000 and I had to run 100,000,000 to actually clock it at about 770 THOUSAND Dhrystones/sec. At 1757 Dhrystones/sec per MIPS that is a nominal 440 MIPS, or about 215 times the emulated 370. I had seen big numbers for the Pi but I didn’t really believe them until this.
    6. The 370 load module was about 17,000 bytes, the 1802 assembly about 14,000 and the Raspberry Pi just a bit smaller at 13,500 bytes.
    7. On the Pi, because it was easy, I re-ran the compile with the optimizer turned on.  This cut the time for 100,000,000 passes down to about 4 seconds (vs 13 seconds unoptimized) but you really can’t use that figure for MIPS. The -O3 sped the 370 version up but not nearly as much (100,000 passes in 20 seconds vs 28) and of course the 1802 compiler was already trying as hard as it could.
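
The arithmetic behind the 370 figures in items 3 and 4 can be sketched as below. The function names are made up for illustration; the only constant is the conventional 1757 Dhrystones/sec per MIPS (the VAX 11/780 score) already used above.

```c
/* Conventional reference: 1757 Dhrystones/sec counts as one MIPS. */
#define DHRYSTONES_PER_MIPS 1757.0

/* Benchmark score from a timed run. */
double dhrystones_per_sec(double passes, double seconds)
{
    return passes / seconds;
}

/* Nominal MIPS rating for a given score. */
double nominal_mips(double dhry_per_sec)
{
    return dhry_per_sec / DHRYSTONES_PER_MIPS;
}

/* Instructions per pass implied by the definition of one MIPS. */
double implied_instructions_per_pass(void)
{
    return 1.0e6 / DHRYSTONES_PER_MIPS;
}
```

With 100,000 passes in 28 seconds this gives a bit over 3570 Dhrystones/sec and roughly 2 MIPS, and the implied pass length comes out near 570 instructions, which is where the comparison with the 1802’s 3600 comes from.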

 

Below is the first procedure in the benchmark(PROC_6) rendered as 31 lines of C, 80 lines of 370 assembly, and 290 lines of 1802 code. In defense of the 1802 and my compiler I’ve also included the actual LCC1802 compiler output which uses a lot of macros to make the code manageable.

Probably the most striking thing about all this (well, besides the speed of the Pi) is that it’s possible at all.  The same C code compiles and runs on wildly different architectures.

 

Proc_6 (Enum_Val_Par, Enum_Ref_Par)
/*********************************/
    /* executed once */
    /* Enum_Val_Par == Ident_3, Enum_Ref_Par becomes Ident_2 */

Enumeration  Enum_Val_Par;
Enumeration *Enum_Ref_Par;
{
  *Enum_Ref_Par = Enum_Val_Par;
  if (! Func_3 (Enum_Val_Par))
    /* then, not executed */
    *Enum_Ref_Par = Ident_4;
  switch (Enum_Val_Par)
  {
    case Ident_1:
      *Enum_Ref_Par = Ident_1;
      break;
    case Ident_2:
      if (Int_Glob > 100)
        /* then */
      *Enum_Ref_Par = Ident_1;
      else *Enum_Ref_Par = Ident_4;
      break;
    case Ident_3: /* executed */
      *Enum_Ref_Par = Ident_2;
      break;
    case Ident_4: break;
    case Ident_5:
      *Enum_Ref_Par = Ident_3;
      break;
  } /* switch */
} /* Proc_6 */
* X-func Proc_6 prologue
PROC@6   PDPPRLG CINDEX=0,FRAME=96,BASER=12,ENTRY=YES
         B     @@FEN0
         LTORG
@@FEN0   EQU   *
         DROP  12
         BALR  12,0
         USING *,12
@@PG0    EQU   *
         LR    11,1
         L     10,=A(@@PGT0)
* Function Proc_6 code
         L     2,4(11)
         MVC   0(4,2),0(11)
         MVC   88(4,13),0(11)
         LA    1,88(,13)
         L     15,=V(FUNC@3)
         BALR  14,15
         LR    2,15
         LTR   2,2
         BNE   @@L2
         L     2,4(11)
         MVC   0(4,2),=F'3'
@@L2     EQU   *
         L     2,0(11)
         LA    3,4(0,0)
         CLR   2,3
         BH    @@L3
         L     3,=A(@@L11)
         L     2,0(11)
         MH    2,=H'4'
         L     2,0(2,3)
         BR    2
         DS    0F
         DS    0F
         DS    0F
         LTORG
         DS    0F
@@L11    EQU   *
         DC    A(@@L4)
         DC    A(@@L5)
         DC    A(@@L8)
         DC    A(@@L3)
         DC    A(@@L10)
@@L4     EQU   *
         L     2,4(11)
         MVC   0(4,2),=F'0'
         B     @@L3
@@L5     EQU   *
         L     2,=A(INT@GLOB)
         L     2,0(2)
         LA    3,100(0,0)
         CR    2,3
         BNH   @@L6
         L     2,4(11)
         MVC   0(4,2),=F'0'
         B     @@L3
@@L6     EQU   *
         L     2,4(11)
         MVC   0(4,2),=F'3'
         B     @@L3
@@L8     EQU   *
         L     2,4(11)
         MVC   0(4,2),=F'1'
         B     @@L3
@@L10    EQU   *
         L     2,4(11)
         MVC   0(4,2),=F'2'
@@L3     EQU   *
         LR    15,2
* Function Proc_6 epilogue
         PDPEPIL
* Function Proc_6 literal pool
         DS    0F
         LTORG
* Function Proc_6 page table
         DS    0F
@@PGT0   EQU   *
         DC    A(@@PG0)
         DS    0F
;$$function start$$ _Proc_6
_Proc_6:                ;framesize=10
        glo     R6
        stxd
        ghi     R6
        stxd
        glo     R7
        stxd
        ghi     R7
        stxd
        dec sp
        dec sp
        dec sp
        dec sp
        glo     SP
        adi     ((10+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((10+1))>>8; was/256
        phi     MEMADDR
        ghi     R12
        str     memAddr
        inc     memAddr
        glo     R12
        str     memAddr
        inc memaddr                             ;opt16.1
        ghi     R13
        str     MEMADDR
        glo     R13
        inc     MEMADDR
        str     MEMADDR
        dec     MEMADDR
        glo     SP
        adi     ((10+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((10+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R7
        ldn     memAddr
        plo     R7
;{
;  *Enum_Ref_Par = Enum_Val_Par;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ghi     R7
        str     R11
        glo     R7
        inc     R11
        str     R11
        dec     R11
;  if (! Func_3 (Enum_Val_Par))
        glo     R7
        plo     R12
        ghi     R7
        phi     R12
        sep     RCALL
        dw      _FUNC_3
        glo     R15
        bnz    L8
        ghi     R15
        bnz    L8
;    *Enum_Ref_Par = Ident_4;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ldi     (3)>>8; top byte
        str     R11
        inc     R11
        ldi     (3) & 255;low byte
        str     R11
        dec     R11
L8:
;  switch (Enum_Val_Par)
        glo     R7
        plo     R6
        ghi     R7
        phi     R6
        glo     R6
        smi     (0)#256
        ghi     R6
        smbi    (0)>>8; was/256      ;that's a standard signed subtraction
        ghi     R6 ;
        xri     (0)>>8; was/256      ;sets the top bit if the signs are different
        shlc          ;the original df is now in bit 0 and df=1 if signs were different
        lsnf    ;bypass the df flip if signs were the same
        xri     01     ;invert original df if signs were different
        shrc           ;put it back in df
        LBNF    L10  ;execute
        glo     R6
        sdi     (4)#256      ;subtract d FROM immediate value
        ghi     R6
        sdbi    (4)>>8; was/256      ;that's a standard signed subtraction (of register FROM immediate)
        ghi     R6 ;
        xri     (4)>>8; was/256      ;sets the top bit if the signs are different
        shlc          ;the original df is now in bit 0 and df=1 if signs were different
        lsnf    ;bypass the df flip if signs were the same
        xri     01     ;invert original df if signs were different
        shrc           ;put it back in df
        LBNF    L10  ;execute
        glo     R6
        shl
        plo     R11
        ghi     R6
        shlc
        phi     R11
        glo     R11
        adi     ((L20))#256
        plo     MEMADDR
        ghi     R11
        adci    ((L20))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        glo     R6
        stxd
        ghi     R6
        stxd
        glo     R11
        plo     R6
        ghi     R11
        phi     R6
        sep     RRET
L20:
        dw L13
        dw L14
        dw L17
        dw L11
        dw L19
;  {
L13:
;      *Enum_Ref_Par = Ident_1;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ldi     (0)>>8; top byte
        str     R11
        inc     R11
        ldi     (0) & 255;low byte
        str     R11
        dec     R11
;      break;
        br L11
L14:
;      if (Int_Glob > 100)
        ldi     ((_INT_GLOB))&255
        plo     MEMADDR
        ldi     ((_INT_GLOB))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        glo     R11
        sdi     (100)#256      ;subtract d FROM immediate value
        ghi     R11
        sdbi    (100)>>8; was/256      ;that's a standard signed subtraction (of register FROM immediate)
        ghi     R11 ;
        xri     (100)>>8; was/256      ;sets the top bit if the signs are different
        shlc          ;the original df is now in bit 0 and df=1 if signs were different
        lsnf    ;bypass the df flip if signs were the same
        xri     01     ;invert original df if signs were different
        shrc           ;put it back in df
        bdf    L15  ;execute
;      *Enum_Ref_Par = Ident_1;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ldi     (0)>>8; top byte
        str     R11
        inc     R11
        ldi     (0) & 255;low byte
        str     R11
        dec     R11
        br L11
L15:
;      else *Enum_Ref_Par = Ident_4;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ldi     (3)>>8; top byte
        str     R11
        inc     R11
        ldi     (3) & 255;low byte
        str     R11
        dec     R11
;      break;
        br L11
L17:
;      *Enum_Ref_Par = Ident_2;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ldi     (1)>>8; top byte
        str     R11
        inc     R11
        ldi     (1) & 255;low byte
        str     R11
        dec     R11
;      break;
        br L11
;    case Ident_4: break;
L19:
;      *Enum_Ref_Par = Ident_3;
        glo     SP
        adi     ((12+1))#256
        plo     MEMADDR
        ghi     SP
        adci    ((12+1))>>8; was/256
        phi     MEMADDR
        lda     memAddr
        phi     R11
        ldn     memAddr
        plo     R11
        ldi     (2)>>8; top byte
        str     R11
        inc     R11
        ldi     (2) & 255;low byte
        str     R11
        dec     R11
;      break;
L10:
L11:
        ldi     0
        plo     R15
        phi     R15
;} /* Proc_6 */
L7:
        inc sp
        inc sp
        inc sp
        inc sp
        inc     sp
        lda     sp
        phi     R7
        lda     sp
        plo     R7
        lda     sp
        phi     R6
        ldn     sp
        plo     R6
        sep     RRET
;$$function end$$ _Proc_6
;$$function start$$ _Proc_6
_Proc_6:		;framesize=10
	pushr R6
	pushr R7
	reserve 4; save room for outgoing arguments
	st2 R12,'O',sp,(10+1); flag1
	st2 R13,'O',sp,(12+1); flag1
	ld2 R7,'O',sp,(10+1) ;reg:INDIRI2(addr)
;{
;  *Enum_Ref_Par = Enum_Val_Par;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2 R7,'O',R11,0; ASGNI2(addr,reg)
;  if (! Func_3 (Enum_Val_Par))
	cpy2 R12,R7 ;LOADI2(reg)
	Ccall _Func_3; CALLI2(ar)
	jnzU2 R15,L8; NE 0
;    *Enum_Ref_Par = Ident_4;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2I 3,'O',R11,0; ASGNI2(addr,acon)
L8:
;  switch (Enum_Val_Par)
	cpy2 R6,R7 ;LOADI2(reg)
	jcI2I R6,0,lbnf,L10  ;LT=lbnf i.e. subtract immedB from A and jump if borrow
	jnI2I R6,4,lbnf,L10; GT reverse  the subtraction
	cpy2 R11,R6
	shl2I R11,1
	ld2 R11,'O',R11,(L20) ;reg:INDIRP2(addr)
	jumpv R11; JUMPV(reg)
L20:
	dw L13
	dw L14
	dw L17
	dw L11
	dw L19
;  {
L13:
;      *Enum_Ref_Par = Ident_1;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2I 0,'O',R11,0; ASGNI2(addr,acon)
;      break;
	lbr L11
L14:
;      if (Int_Glob > 100)
	ld2 R11,'D',(_Int_Glob),0 ;reg:INDIRI2(addr)
	jnI2I R11,100,lbdf,L15 ;LEI2 100 11 L15; LE is flipped test & subtraction
;      *Enum_Ref_Par = Ident_1;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2I 0,'O',R11,0; ASGNI2(addr,acon)
	lbr L11
L15:
;      else *Enum_Ref_Par = Ident_4;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2I 3,'O',R11,0; ASGNI2(addr,acon)
;      break;
	lbr L11
L17:
;      *Enum_Ref_Par = Ident_2;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2I 1,'O',R11,0; ASGNI2(addr,acon)
;      break;
	lbr L11
;    case Ident_4: break;
L19:
;      *Enum_Ref_Par = Ident_3;
	ld2 R11,'O',sp,(12+1) ;reg:INDIRP2(addr)
	st2I 2,'O',R11,0; ASGNI2(addr,acon)
;      break;
L10:
L11:
	ld2z R15
;} /* Proc_6 */
L7:
	release 4; release room for outgoing arguments
	popr R7
	popr R6
	Cretn

;$$function end$$ _Proc_6

And Hello World!

18-04-15 WTO

Another baby step: A C program with inline 370 Assembler. After a fair bit of self-inflicted fumbling around and some help from the Hercules-OS380 mailing list I can run a C program with embedded assembler code.

//GCCMVS2 JOB CLASS=A,MSGCLASS=A,REGION=4096K
//S1 EXEC GCCCLG,COS1='-S',PARM.ASM='DECK,LIST'
//SYSIN DD *
int main(){
	asm(" WTO 'OH HELLO' ");
	return 42;
}

A few notes:

    • The C is in upper and lower case as usual but the assembly part is strictly uppercase
    • The JCL and assembly use single quotes ( ' )
    • To get the asm() recognized I needed to override the compiler options with COS1='-S'; the default is COS1='-ansi -S -Wpedanticerrors'
    • To see the assembly output and any errors I had to specify PARM.ASM='LIST,DECK'; the default is 'DECK,NOLIST'
    • I’m submitting the jobs from Windows using netcat, which works wonderfully.
    • I’m retrieving the output from the MVS printer using a homebrew awk program that splits the output by job and pipes the last one to more (see below)
    • I’m running Hercules/MVS in the background on Linux (started with mvs&)

     

     

     

We Have Blinkenlights


This is a desperate travesty of my rusty system programming skills but it works.

I started out wanting to write a C program using GCCMVS and inline assembly language to toggle a LED on the Pi. I can get GCCMVS to run but it doesn’t recognize the asm() constructs. I ended up writing the whole thing in Assembly.

To get the LED to blink, the easiest approach was to get the Hercules emulator to issue a shell command “echo X > /sys/class/gpio/gpio18/value ” where X is 1 to turn it on or 0 to turn it off.

A program under MVS can get the emulator to issue a command by passing it via “DIAG 8”. Hercules wants you to be in supervisor state, so I needed to use MODESET, which wants you to be APF authorized, so the program had to go into SYS1.LINKLIB.

I stole the bulk of the code from a program called MDDIAG8 by Mark Dickinson.
The shell commands to manipulate a GPIO pin came from here

#   Exports pin to userspace
echo "18" > /sys/class/gpio/export                  

# Sets pin 18 as an output
echo "out" > /sys/class/gpio/gpio18/direction

# Sets pin 18 to high
echo "1" > /sys/class/gpio/gpio18/value

# Sets pin 18 to low
echo "0" > /sys/class/gpio/gpio18/value 
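
For comparison, the same sysfs writes can be done from C on the Pi itself. This is only a sketch with invented helper names; the `base` parameter (normally /sys/class/gpio) is an assumption added so the code can be pointed at a scratch directory for testing.

```c
#include <stdio.h>
#include <string.h>

/* Write a short string to a file, like `echo -n text > path`.
   Returns 0 on success, -1 on any error. */
int write_text(const char *path, const char *text)
{
    size_t len = strlen(text);
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fwrite(text, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Build "<base>/gpio<pin>/<leaf>" into buf. Returns 0, or -1 if it won't fit. */
int gpio_path(char *buf, size_t size, const char *base, int pin, const char *leaf)
{
    int n = snprintf(buf, size, "%s/gpio%d/%s", base, pin, leaf);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Drive an already exported output pin high (on != 0) or low (on == 0). */
int gpio_set(const char *base, int pin, int on)
{
    char path[128];

    if (gpio_path(path, sizeof path, base, pin, "value") != 0)
        return -1;
    return write_text(path, on ? "1" : "0");
}
```

Exporting the pin and setting its direction would be two more `write_text` calls, to `<base>/export` and `<base>/gpio18/direction`, matching the first two echo commands.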

We Have Ignition – GCCMVS Submitted From Windows

18-04-11 RC42

//GCCMVS JOB CLASS=A,MSGCLASS=A,REGION=4096K
//S1 EXEC GCCCLG
//SYSIN DD *
int main(){
	return 42;
}
//STEPLIB DD DSN=SYS2.LINKLIB,DISP=SHR
//

This is not a big deal but it’s a step on the way.  Much easier to edit code on Windows and submit via the socket reader and netcat.

The TextPad macro contains

c:\apps\netcat\nc -w1 192.168.0.104 3505 <$filename

and I’m reading the output on the Pi with

 tail -300 prt/prt00e.txt | more

Sadly, I forget where I got netcat – google it.

Using an AWS Tape Image

  • Enter MVS commands on the Hercules console with a leading /, e.g. /d u,dasd
  • Put the AWS dataset into the mvs/tapes/ directory on the Pi, e.g. mvs/tapes/herccmd.aws
  • Vary a tape drive online in MVS with: /v 480,online
  • Submit your job and wait for the mount request
  • “Mount the tape” with: devinit 480 tapes/herccmd.aws

 

Olduino/370


I love the 1802 but my native instruction set was the 360/370 and I worked as a system programmer for years, so I’d love to do an Olduino/370. I played with the MVS Tur(n)key system a long time ago but when it got to wanting JCL I shut it down. Now I realize the real fun is in blinking LEDs. I figure the credit-card size Raspberry Pi W might be the starting point. I’m sure the Pi-Arduino shield space is well covered, so I just need a 370 emulator with a boot-loader and a PC based C compiler for it.

I came across the Mainframe Pi group on Facebook where someone mentioned the Hercules Emulator running MVS on a Pi and I’m in!

Saleae Logic 8 – One of My Favourite Things


I got a new logic analyzer last week. I’ve had an original Logic for years but it didn’t work well with my new laptop, so when Saleae offered me a significant discount on the new model I jumped at it. Also, it’s RED!

I’ve unboxed it and used it to solve two problems already so I’m pretty pleased. It’s the same 8 channels as the older Logic but it can sample at up to 100M samples/sec and the same channels can sample analog signals at up to 10MS/s. The analog function immediately solved one of my problems, showing a weak 3V signal where I needed 5V.

The only thing I find awkward so far is that each channel has its own ground connection. The book tells you that you only need one for most uses but the rest of the leads are still there confusing the issue.

Well Heck, I Should Have Known It Wouldn’t Work – RobotDyn SD Card Shield is 3.3V Logic


I bought this a while ago and could never get consistent results with it. This morning in the shower I realized why. SD cards run on 3.3V, and the shield has a level converter that brings the olduino’s 5V signals down to 3.3V, but the card’s output on the olduino’s MISO is a bare 3.3V. I guess this is fine with Arduinos, and it might even be OK with the 74HC595 used as the olduino’s input shift register, except that there’s a 1K resistor between the MISO input on the headers and the shift register. I’m not sure, but the Seeed/Radio Shack SD card adapter that I used before must have a proper bi-directional level shifter (or the olduino I used it with had a 74LS595 chip, which would be fine). Anyway, it was an excuse to try out my new logic analyzer – so there’s that.
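
The threshold reasoning can be put in numbers. This sketch assumes the usual datasheet rules of thumb rather than any measurement on the shield: a 74HC input wants roughly 0.7 x Vcc for a guaranteed HIGH, while 74LS TTL wants about 2.0V. The names are invented for illustration.

```c
/* Guaranteed-HIGH input threshold for 74LS TTL, in volts (rule of thumb). */
#define LS_VIH_MIN 2.0

/* Guaranteed-HIGH input threshold for a 74HC part: about 0.7 x Vcc. */
double hc_vih_min(double vcc)
{
    return 0.7 * vcc;
}

/* Non-zero if a signal of signal_v volts is guaranteed to read as HIGH. */
int reads_as_high(double signal_v, double vih_min)
{
    return signal_v >= vih_min;
}
```

At a 5V supply the 74HC threshold works out to about 3.5V, so a bare 3.3V MISO is marginal at best even before the series resistor, while the same 3.3V clears the 74LS threshold comfortably.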

Somewhat ironically, by the way, the reason for the resistor that broke the camel’s back is that I had trouble with SD cards not releasing MISO when they’re deselected. I know now that I have to send a bunch of clocks for the card to realize it needs to release the line.

Below is the schematic of the data logger shield. I have not tried out the real time clock but I will.
18-03-12 robotdyn sd

++ Considered Harmful. Also, Be Careful What You Wish For – Compiler Debug Output

I probably had come across this and forgotten, but in addition to the -target=symbolic output showing the intermediate language, giving LCC the -Wf-d option causes it to send out copious debugging information to stderr, including the DAG internal representation of the code and the rules selected to produce the assembly language output. There’s a LOT of it and it’s divorced from the source code but it’s still valuable. Compiling a module with nothing but nstdlib.h generated 5000 lines of output. The 150 lines below correspond to the memset() procedure. Some of it’s interesting but it’s a slog to wade through and line up with the source.

One thing I notice is that my compiler doesn’t do a good job with the post-increment and post-decrement operators as in “*p++ = (unsigned char)c” below. The generated code is actually much better if I separate the *p=… from the p++. The silly copy back from R11 to R6 gets taken out by the peephole optimizer but it still copies into R11 so it can use the value after incrementing the variable.

;        *p++ = (unsigned char)c;
	cpy2 R11,R6 ;LOADP2(reg)
	cpy2 R6,R11
	incm R6,1
	str1 R13,R11; ASGNU1(indaddr,LOADU1(LOADU2(reg))) 18-03-21
VS
;        *p = (unsigned char)c;
	str1 R13,R6; ASGNU1(indaddr,LOADU1(LOADU2(reg))) 18-03-21
;        *p++;
	incm R6,1

void *memset(void *s, int c, unsigned int n) //sets memory at s to c for n bytes
{
    unsigned char* p=s;
    while(n--)
        *p++ = (unsigned char)c;
    return s;
}
(allocating 12 to symbol s)
(free[0]=ffffefff)
(free[1]=ffffffff)
(allocating 13 to symbol c)
(free[0]=ffffcfff)
(free[1]=ffffffff)
(allocating 7 to symbol n)
(free[0]=ffffcf7f)
(free[1]=ffffffff)
(allocating 6 to symbol p)
(free[0]=ffffcf3f)
(free[1]=ffffffff)
(targeting 618ce0->x.kids[1]=61dd00 to 7)
ASGNU2(2:VREGP(n), INDIRU2(2:ADDRFP2(2:n)))**
dumpcover(618ce0) = stmt: ASGNU2(VREGP,reg) / # write register
dumpcover(61dd00) =  reg: INDIRU2(addr) /       ld2 R%c,%0 ;reg:INDIRU2(addr)
dumpcover(61dcb8) =   addr: ADDRFP2 / 'O',sp,(%A+1)
(listing 61dd00)
(listing 618ce0)
(rallocing 61dd00)
(rallocing 618ce0)
(targeting 61c9c0->x.kids[1]=618dd0 to 6)
ASGNP2(2:VREGP(p), LOAD(INDIRP2(2:VREGP(s))))**
dumpcover(61c9c0) = stmt: ASGNP2(VREGP,reg) / # write register
dumpcover(618dd0) =  reg: LOADP2(reg) / ?       cpy2 R%c,R%0 ;LOADP2(reg)
dumpcover(61c978) =   reg: INDIRP2(VREGP) / # read register
(listing 61c978)
(listing 618dd0)
(listing 61c9c0)
(rallocing 61c978)
(rallocing 618dd0)
(rallocing 61c9c0)
JUMPV(0:ADDRGP2(2:107))**
dumpcover(61cc50) = stmt: JUMPV(acon) /         lbr %0
dumpcover(61cc08) =  acon: ADDRGP2 / %a
(listing 61cc50)
(rallocing 61cc50)
LABELV(0:106)**
dumpcover(61cd30) = stmt: LABELV / %a:
(listing 61cd30)
(rallocing 61cd30)
(targeting 61d190->x.kids[1]=618f18 to ?)
ASGNP2(2:VREGP(1), LOAD(INDIRP2(2:VREGP(p))))**
dumpcover(61d190) = stmt: ASGNP2(VREGP,reg) / # write register
dumpcover(618f18) =  reg: LOADP2(reg) / ?       cpy2 R%c,R%0 ;LOADP2(reg)
dumpcover(61ce10) =   reg: INDIRP2(VREGP) / # read register
(using 1)
(targeting 61cee8->x.kids[1]=61cea0 to 6)
ASGNP2(2:VREGP(p), ADDP2(2:INDIRP2(2:VREGP(1)), CNSTI2(2:1)))**
dumpcover(61cee8) = stmt: ASGNP2(VREGP,reg) / # write register
dumpcover(61cea0) =  reg: ADDP2(reg,consm) / ?  cpy2 R%c,R%0
        incm R%c,%1
dumpcover(61d290) =   reg: INDIRP2(VREGP) / # read register
dumpcover(61ce58) =   consm: CNSTI2 / %a
(using 1)
ASGNU1(1:INDIRP2(2:VREGP(1)), LOAD(LOAD(INDIRI2(2:VREGP(c)))))**
dumpcover(61d050) = stmt: ASGNU1(indaddr,LOADU1(LOADU2(reg))) /         str1 R%1,%0; ASGNU1(indaddr,LOADU1(LOADU2(reg))) 18-03-21
dumpcover(61d320) =  indaddr: reg / R%0
dumpcover(61d320) =   reg: INDIRP2(VREGP) / # read register
dumpcover(61cf78) =  reg: INDIRI2(VREGP) / # read register
(using 1)
(listing 61ce10)
(listing 618f18)
(listing 61d190)
(listing 61d290)
(listing 61cea0)
(listing 61cee8)
(listing 61d320)
(listing 61cf78)
(listing 61d050)
(rallocing 61ce10)
(rallocing 618f18)
(allocating 11 to node 618f18)
(free[0]=ffffc73f)
(free[1]=ffffffff)
(rallocing 61d190)
(rallocing 61d290)
(rallocing 61cea0)
(rallocing 61cee8)
(rallocing 61d320)
(rallocing 61cf78)
(rallocing 61d050)
(freeing 11)
(free[0]=ffffcf3f)
(free[1]=ffffffff)
LABELV(0:107)**
dumpcover(61d390) = stmt: LABELV / %a:
(listing 61d390)
(rallocing 61d390)
(targeting 61d718->x.kids[1]=6191c0 to ?)
ASGNU2(2:VREGP(2), LOAD(INDIRU2(2:VREGP(n))))**
dumpcover(61d718) = stmt: ASGNU2(VREGP,reg) / # write register
dumpcover(6191c0) =  reg: LOADU2(reg) / ?       cpy2 R%c,R%0 ;LOADU2*(reg)
dumpcover(61d470) =   reg: INDIRU2(VREGP) / # read register
(using 2)
(targeting 61d548->x.kids[1]=61d500 to 7)
ASGNU2(2:VREGP(n), SUBU2(2:INDIRU2(2:VREGP(2)), CNSTU2(2:1)))**
dumpcover(61d548) = stmt: ASGNU2(VREGP,reg) / # write register
dumpcover(61d500) =  reg: SUBU2(reg,consm) / ?  cpy2 R%c,R%0    ;SUBU2(reg,consm)
        decm R%c,%1     ;SUBU2(reg,consm)
dumpcover(61d818) =   reg: INDIRU2(VREGP) / # read register
dumpcover(61d4b8) =   consm: CNSTU2 / %a
(using 2)
NEU2(2:INDIRU2(2:VREGP(2)), CNSTU2(2:0))**
dumpcover(61d5d8) = stmt: NEU2(reg,con0) /      jnzU2 R%0,%a; NE 0
dumpcover(61d8a8) =  reg: INDIRU2(VREGP) / # read register
dumpcover(61d590) =  con0: CNSTU2 / 0
(using 2)
(listing 61d470)
(listing 6191c0)
(listing 61d718)
(listing 61d818)
(listing 61d500)
(listing 61d548)
(listing 61d8a8)
(listing 61d5d8)
(rallocing 61d470)
(rallocing 6191c0)
(allocating 11 to node 6191c0)
(free[0]=ffffc73f)
(free[1]=ffffffff)
(rallocing 61d718)
(rallocing 61d818)
(rallocing 61d500)
(rallocing 61d548)
(rallocing 61d8a8)
(rallocing 61d5d8)
(freeing 11)
(free[0]=ffffcf3f)
(free[1]=ffffffff)
(targeting 61da40->x.kids[0]=6193e8 to 15)
RETP2(2:LOAD(INDIRP2(2:VREGP(s))))**
dumpcover(61da40) = stmt: RETP2(reg) / # retn
dumpcover(6193e8) =  reg: LOADP2(reg) / ?       cpy2 R%c,R%0 ;LOADP2(reg)
dumpcover(61d9f8) =   reg: INDIRP2(VREGP) / # read register
(listing 61d9f8)
(listing 6193e8)
(listing 61da40)
(rallocing 61d9f8)
(rallocing 6193e8)
(allocating 15 to node 6193e8)
(free[0]=ffff4f3f)
(free[1]=ffffffff)
(rallocing 61da40)
(freeing 15)
(free[0]=ffffcf3f)
(free[1]=ffffffff)
LABELV(0:105)**
dumpcover(61dc00) = stmt: LABELV / %a:
(listing 61dc00)
(rallocing 61dc00)
;{
;    unsigned char* p=s;
;        *p++ = (unsigned char)c;
;    while(n--)
;    return s;

**UPDATE**
I think, by the way, that the DAG internal representation for the “*p++=c” line in the source code is:

ASGNP2(2:VREGP(1), LOAD(INDIRP2(2:VREGP(p))))**
ASGNP2(2:VREGP(p), ADDP2(2:INDIRP2(2:VREGP(1)), CNSTI2(2:1)))**
ASGNU1(1:INDIRP2(2:VREGP(1)), LOAD(LOAD(INDIRI2(2:VREGP(c)))))**

which looks to me like
-load p into a temporary
-add one to p
-load c and store it at *the temporary

I don’t see how any sort of combination rule could help with that, but if I were more sophisticated maybe I could rewrite the sequence to eliminate the temp.
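
Written back out as C, the three DAG nodes look something like the first function below, next to the hand-split form that compiles better. Both are illustrative sketches with made-up names, not compiler output.

```c
/* The sequence the DAG appears to describe: copy p to a temporary,
   bump p, then store through the temporary. */
void *memset_with_temp(void *s, int c, unsigned int n)
{
    unsigned char *p = s;

    while (n--) {
        unsigned char *t = p;       /* load p into a temporary */
        p = t + 1;                  /* add one to p */
        *t = (unsigned char)c;      /* load c and store it at *temporary */
    }
    return s;
}

/* The hand-separated form: store first, then increment, no temporary. */
void *memset_split(void *s, int c, unsigned int n)
{
    unsigned char *p = s;

    while (n--) {
        *p = (unsigned char)c;
        p++;
    }
    return s;
}
```

The two do the same job; the second simply never needs the old value of p after the increment, so there is nothing to park in R11.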

SD Adventures – Formatting a 16GB SDHC Card for the Olduino

18-03-22 sdhc card
I’ve had trouble finding small SD cards to use with the Olduino. There are two problems with cards bigger than 2GB:

  • They are usually formatted with NTFS or FAT32 rather than FAT16, and the tinyfat code that I cribbed from only supports FAT16
  • They are newer, and I thought my code wouldn’t support the newer protocols used on SDHC cards

I got a newish 16GB SanDisk microSDHC card at Target and used the Windows DiskPart utility to make a 4GB partition on it and format it as FAT. This wastes the bulk of the card but who cares – it was $7. The DiskPart session is listed at the bottom of this post.

It turned out that the SDHC protocol thing was a non-issue. I tried the Olduino code with the fresh NTFS-formatted disk: it read it and objected to the file system, but it was able to access the disk. With the disk re-partitioned and formatted FAT it was able to access the file system as usual. The only thing I had to do to the tinyfat code was to add 0x0E to the acceptable partition types. According to Wikipedia 0x0E is FAT16 with Logical Block Addressing, which I think I’ve been using all the time anyway.
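The partition-type check can be sketched as follows. This is not the tinyfat code; it is a minimal Python illustration, and the function names and the exact set of accepted values are assumptions. It relies only on the standard MBR layout: the partition table starts at byte 446, each entry is 16 bytes, and the type is the fifth byte of the entry.

```python
# Partition type bytes commonly used for FAT16 (the accepted set is an assumption):
# 0x04 = FAT16 under 32MB, 0x06 = FAT16, 0x0E = FAT16 with LBA
FAT16_TYPES = {0x04, 0x06, 0x0E}


def first_partition_type(mbr):
    """Return the type byte of the first partition entry in a 512-byte MBR sector."""
    if len(mbr) < 512 or mbr[510] != 0x55 or mbr[511] != 0xAA:
        raise ValueError("not a valid MBR sector")
    return mbr[446 + 4]


def is_supported(mbr):
    """True if the first partition looks like a FAT16 variant."""
    return first_partition_type(mbr) in FAT16_TYPES
```

Pass the sector as a bytearray so that indexing yields integers. A card fresh from the store with an NTFS partition (type 0x07) would be rejected, while the DiskPart-formatted FAT partition would be accepted once 0x0E is in the set.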

I need to dig in a bit more because there’s stuff in the MMC code about shifting sector addresses 9 bits left for SDHC (i.e. multiplying by 512), which doesn’t make a lot of sense to me, but it does seem to work OK.
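For what it is worth, the 9-bit shift is the usual way of turning a sector number into a byte address. In the SD protocol, standard-capacity cards take a byte address in the read/write block commands while SDHC cards take a block number, so the shift would normally belong on the non-SDHC path. The sketch below only illustrates that arithmetic; the function name and the flag are hypothetical and this is not the MMC code referred to above.

```python
SECTOR_SIZE = 512  # bytes per sector; 512 == 1 << 9


def block_command_argument(sector, high_capacity):
    """Address argument for a single-block read or write of the given sector."""
    if high_capacity:
        return sector       # SDHC: the card is addressed in 512-byte blocks
    return sector << 9      # standard capacity: byte address = sector * 512
```

So sector 3 would be sent as 1536 to a standard-capacity card and as 3 to an SDHC card.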


Microsoft DiskPart version 10.0.16299.15

Copyright (C) Microsoft Corporation.
On computer: DESKTOP-IM4I6JJ

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          476 GB  1024 KB        *
  Disk 1    Online           14 GB  3072 KB

DISKPART> select disk 1

Disk 1 is now the selected disk.

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          476 GB  1024 KB        *
* Disk 1    Online           14 GB  3072 KB

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary             14 GB  4096 KB

DISKPART> select partition 1

Partition 1 is now the selected partition.

DISKPART> delete partition

DiskPart successfully deleted the selected partition.

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          476 GB  1024 KB        *
* Disk 1    Online           14 GB    14 GB

DISKPART> create partition primary size=4000

DiskPart succeeded in creating the specified partition.

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          476 GB  1024 KB        *
* Disk 1    Online           14 GB    10 GB

DISKPART> filesystems

Current File System

  Type                 : RAW
  Allocation Unit Size : 512
  Flags : 00000000

File Systems Supported for Formatting

  Type                 : NTFS
  Allocation Unit Sizes: 512, 1024, 2048, 4096 (Default), 8192, 16K, 32K, 64K, 128K, 256K, 512K, 1024K, 2048K

  Type                 : FAT
  Allocation Unit Sizes: 64K (Default)

  Type                 : FAT32 (Default)
  Allocation Unit Sizes: 1024, 2048, 4096 (Default), 8192, 16K, 32K

  Type                 : exFAT
  Allocation Unit Sizes: 512, 1024, 2048, 4096, 8192, 16K, 32K (Default), 64K, 128K, 256K, 512K, 1024K, 2048K, 4096K, 8192K, 16384K, 32768K

DISKPART> format FS=FAT

  100 percent completed

DiskPart successfully formatted the volume.

DISKPART>

The images below are the olduino output before and after the repartitioning – there are lots of extra diagnostic prints in the access modules at the moment.

What A Fight – More Compiler Combo Rules

I spent a few pleasant hours today working on the Python liveness analysis program, but I kept coming back to the fact that what I wanted to fix was extra register shuffling and that there might be an easier fix.  For example, the following code generates an unnecessary register copy.

		*where=spiRec();
generates:
	Ccall _spiRec; CALLI2(ar)
	cpy2 R11,R15 ;LOADU2*(reg)
	str1 R11,R7; ASGNU1(indaddr,reg)		DH*

I looked at the symbolic output for it but i couldn’t really relate to the CVIU2 and CVUU1 which don’t seem to be used in my target.

;		*where++=spiRec();
spireceiven.c:3.2:
 2. ADDRFP2 count=2 where
1' INDIRP2 count=2 #2
5. CNSTI2 1
4. ADDP2 #1 #5
3' ASGNP2 #2 #4 2 1
7. ADDRGP2 spiRec
6' CALLI2 #7 {int function}
10. CVIU2 #6 2
9. CVUU1 #10 2
8' ASGNU1 #1 #9 1 1

I finally hacked into the function emitasm(Node p,int nt) in module gen.c and printed the op code and rules as they were used, with the following butchery (now commented out):

unsigned emitasm(Node p, int nt) {
	int rulenum;
	short *nts;
	char *fmt;
	Node kids[10];
	//printf("\nHERE WE GO %d\n",p->op);
	p = reuse(p, nt);
	rulenum = getrule(p, nt);
	nts = IR->x._nts[rulenum];
	fmt = IR->x._templates[rulenum];
	assert(fmt);
	//printf("\nHERE WE ARE %s\n",fmt);

I looked at the mess that came out and picked out the op codes 2278(LOADU2),2117(INDIRI2),1078(ASGNU1),1254(LOADU1),2119(INDIRP2). The only one that wasn’t already in my rules was LOADU1 so I finally tried

stmt: ASGNU1(indaddr,LOADU1(LOADU2(reg)))  "\tstr1 R%1,%0; ASGNU1(indaddr,LOADU1(LOADU2(reg))) 18-03-21\n"  1
which generated:
;		*where=spiRec();
	Ccall _spiRec; CALLI2(ar)
	str1 R15,R7; ASGNU1(indaddr,LOADU1(LOADU2(reg))) 18-03-21

So yay, but what a fight and i’m only a tiny bit smarter in the end. I think i’ve been down this road before trying to get the symbolic target to use the same terminals as mine but i’m obviously not there yet.

This, apropos of optimization, is a set of peephole optimization rules for the SDCC Z80 compiler.

SD Card Access – SPI Speed and Other Issues

I recently tried hooking up an SD card on the olduino (which I have done in the past) and found it non-responsive.  I hooked it up to an arduino (which worked after I hooked up the IOREF pin, which is not used on older arduinos) and trapped the SPI data.  The main thing I note is that the arduino had slowed its clock down to 250KHz while the olduino was trying to rip along at 4MHz.  Looking back at this blog, the last reference to the SD card on the olduino 1802 was the end of 2013.  At that time I think the SPI would have been clocked at 1-2MHz.  I’ll have to try slowing down the AVR clock routines or bit banging with the olduino.

18-03-08 sd init

**UPDATE** I got a schematic from the manufacturer of the shield i was trying.  It depends on getting 3.3V from the arduino header to power the SD card and level shifter.  I can try powering it from an arduino but it begins to seem like a lot of fuss.**

18-03-12 robotdyn sd

**UPDATE** tried it with 3.3V from the arduino but i found i was missing clocks.  even when i take off the SD card shield the SPI sequences are missing clocks when i have the 3.3v cross connected from the arduino.  no idea why.

18-03-12 missing clocks

**UPDATE** Finally got this working by bit-banging the SPI protocol from the olduino’s parallel port with input on EF3. Still don’t know why it didn’t work with the hardware SPI even after I slowed it down. I DO know that the card draws too much 3.3V power to be supplied from the arduino Diecimila but it’s not clear to me how that could affect things.
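Bit-banging SPI means driving the clock and data lines one bit at a time from software. The sketch below shows the shape of a mode 0, most-significant-bit-first transfer, which is what SD cards expect. It is written in Python purely as an illustration: the three pin callbacks are hypothetical stand-ins for the olduino’s parallel port output and EF3 input, not the actual olduino routines.

```python
def spi_transfer_byte(out_byte, set_mosi, set_sck, read_miso):
    """Shift one byte out MSB-first (SPI mode 0) and return the byte shifted in."""
    in_byte = 0
    for bit in range(7, -1, -1):
        set_mosi((out_byte >> bit) & 1)               # present data while the clock is low
        set_sck(1)                                    # rising edge: the card samples MOSI
        in_byte = (in_byte << 1) | (read_miso() & 1)  # sample MISO while the clock is high
        set_sck(0)                                    # falling edge: the card shifts out its next bit
    return in_byte
```

Because every clock edge is produced explicitly, the transfer rate is whatever the loop happens to run at, which makes it easy to stay well under the speed at which a card is willing to initialize.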

Also, here’s an excellent page where someone has captured the SD initialization sequence with a logic analyzer – very useful. I don’t see the author’s name anywhere but i greatly appreciated his work.

Whoah, This Statement is Huge – More Compiler Output

Testing the SD card programs I was struck by how big the resulting 1802 code was – over 18K. Perusing the assembler listing file I note that the tinyfat routines were particularly big. One module that reads the next line of a text file clocks in at almost 3K! Looking at that module, I found a particularly ugly C statement that generates over 700 bytes of code and makes multiple calls to 32 bit arithmetic routines. It would be much shorter in assembler but i’m more interested in whether I can learn anything that helps me improve code generation generally.

If I look at the first 9 lines of the output, which I think correspond to ((uint32_t)BS.fatCopies*(uint32_t)BS.sectorsPerFAT), I notice a bunch of extraneous copying of registers which might easily yield to combination rules. I have a number of combinations in the ruleset already, so I’ll have to try to track down why they don’t cover this.

**the code being generated is following the rule for reg:CVUI2(INDIRU1(addr)) followed by the rule for CVIU4(reg). I tried putting in a rule for reg:CVIU4(CVUI2(INDIRU1(addr))) which blew up the compiler and one for reg:CVIU4(INDIRU1(addr)) which just gets ignored. Like I say, printing the dags would be a good leg up.**

		sec=((uint32_t)BS.reservedSectors+
		((uint32_t)BS.fatCopies*(uint32_t)BS.sectorsPerFAT)+
		(((uint32_t)BS.rootDirectoryEntries*32)/512)+
		(((uint32_t)currFile.currentCluster-2)*(uint32_t)BS.sectorsPerCluster)+BS.hiddenSectors)+(((uint32_t)currFile.currentPos/512) % (uint32_t)BS.sectorsPerCluster);
;		sec=((uint32_t)BS.reservedSectors+
	ld1 R11,'D',(_BS+3),0
	zExt R11 ;CVUI2: widen unsigned char to signed int (zero extend)
	cpy2 RL8,R11
	sext4 RL8; CVIU4
	ld2 RL10,'D',(_BS+12),0
	zext4 RL10 ;CVUU4: widen unsigned int to unsigned long (zero extend)
	Ccall _mulu4
	cpy4 RL10,RL8; LOADU4(reg)
	st4 RL10,'O',sp,(40+1); ASGNU4
	ld1 R9,'D',(_BS),0
	zExt R9 ;CVUI2: widen unsigned char to signed int (zero extend)
	cpy2 RL8,R9
	sext4 RL8; CVIU4
	st4 RL8,'O',sp,(36+1); ASGNU4
	ld2 RL10,'D',(_currFile+13),0
	zext4 RL10 ;CVUU4: widen unsigned int to unsigned long (zero extend)
	ldI4 RL8,2 ;loading a long unsigned constant
	alu4 RL8,RL10,RL8,sm,smb
	ld4 RL10,'O',sp,(36+1);reg:  INDIRU4(addr)
	Ccall _mulu4
	cpy4 RL10,RL8; LOADU4(reg)
	st4 RL10,'O',sp,(32+1); ASGNU4
	ld4 RL8,'D',(_currFile+20),0;reg:  INDIRU4(addr)
	shrU4I RL8,9
	ld4 RL10,'O',sp,(36+1);reg:  INDIRU4(addr)
	Ccall _modu4
	cpy4 RL10,RL8; LOADU4(reg)
	st4 RL10,'O',sp,(28+1); ASGNU4
	ld2 RL8,'D',(_BS+1),0
	zext4 RL8 ;CVUU4: widen unsigned int to unsigned long (zero extend)
	ld4 RL10,'O',sp,(40+1);reg:  INDIRU4(addr)
	alu4 RL10,RL8,RL10,add,adc
	ld2 RL8,'D',(_BS+4),0
	zext4 RL8 ;CVUU4: widen unsigned int to unsigned long (zero extend)
	shl4I RL8,5; LSHU4(reg,con)
	shrU4I RL8,9
	alu4 RL10,RL10,RL8,add,adc
	ld4 RL8,'O',sp,(32+1);reg:  INDIRU4(addr)
	alu4 RL10,RL10,RL8,add,adc
	ld4 RL8,'D',(_BS+16),0;reg:  INDIRU4(addr)
	alu4 RL10,RL10,RL8,add,adc
	ld4 RL8,'O',sp,(28+1);reg:  INDIRU4(addr)
	alu4 RL10,RL10,RL8,add,adc
	st4 RL10,'O',sp,(44+1); ASGNU4

One thing that would be a big leg up on this stuff would be to write something that would dump the DAG structures the compiler uses so i could see what the darned thing is looking for rather than parsing backwards from the output – it would be the equivalent of using a debugger instead of print statements to debug a program. Googling around i note that LCC has a symbolic target which, among other things, prints the DAGs!

Shown below is a simplified version of that big statement with the symbolic output. It sure looks like CVIU4(CVUI2(INDIRU1(addr))) to me.

	unsigned char fatCopies;
	unsigned int sectorsPerFAT;
	unsigned long dummy;
	dummy=(unsigned long)fatCopies * (unsigned long)sectorsPerFAT;
***************TARGET=symbolic OUTPUT FOLLOWS****************
;	dummy=(unsigned long)fatCopies * (unsigned long)sectorsPerFAT;
testsize.c:5.1:
 2. ADDRLP2 dummy
7. ADDRLP2 fatCopies
6. INDIRU1 #7
5. CVUI2 #6 1
4. CVIU4 #5 2
10. ADDRLP2 sectorsPerFAT
9. INDIRU2 #10
8. CVUU4 #9 2
3. MULU4 #4 #8
1' ASGNU4 #2 #3 4 1

Yep. Trying a more careful version of the combination rule does work: the rule gives what you see below, which is still not great, but I have my hands on the controls and I’m steering in the right direction! The real answer here is to see what actually has to be done as long arithmetic and break up that monster equation. It’s served its purpose, though, in pointing up some bad patterns and how to go after them. Also, yay for
reg: CVIU4(CVUI2(INDIRU1(addr))) "\tld1 R%c,%0\n\tzExt R%c\n\tzExt4 R%c; CVIU4(INDIRU1(addr)):*HOORAY*widen unsigned char to long\n" 1

;	dummy=(unsigned long)fatCopies * (unsigned long)sectorsPerFAT;
	ld1 RL8,'O',sp,(13+1)
	zExt RL8
	zExt4 RL8; CVIU4(INDIRU1(addr)):*HOORAY*widen unsigned char to long

 

Easy Optimization Win With the LCC Rules

This is no big deal but i’m counting it as an easy win. I often see the compiler copying one register to another before doing something with it. I would hesitate to clean it up with the peephole optimizer because i couldn’t be sure there wasn’t some other usage of the second register further down the code – hence my interest in liveness analysis. I decided to have a hard look at one instance though and correcting it was trivial.

;void comparulator(unsigned long x,unsigned long y){
;	unsigned int ix=x;
	cpy2 R11,RL6 
	st2 R11,'O',sp,(16+1)
;void comparulator(unsigned long x,unsigned long y){
;	unsigned int ix=x; 
	st2 RL6,'O',sp,(16+1)

The instruction patterns in the machine description record (.MD file) tell the compiler what assembler instructions to emit for a given intermediate code sequence. In this case it was using a LOADU2 pattern to get a value into a register and an ASGNU2(addr,reg) to put it into storage. By giving it a combination of ASGNU2(addr,LOADU2(reg)) it happily skipped the intermediate move. I think I have tried some of these combinations in the past with less success but I’m encouraged this time.

reg: LOADU2(reg)  "?\tcpy2 R%c,R%0\n"  move(a)+10
stmt: ASGNU2(addr,reg)  "\tst2 R%1,%0\n"  10

stmt: ASGNU2(addr,LOADU2(reg))  "\tst2 R%1,%0\n"  10

(note that i’ve cleaned up the assembly code and rules a bit – the real thing is sprinkled with debugging and tracking comments.)
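The effect of adding the combined rule can be sketched with a toy cost-based matcher. This is not lburg’s algorithm (which labels the tree bottom-up with dynamic programming); it is a small top-down recursive search in Python, and the tuple representation, function names and flat cost of 10 per rule (ignoring the move(a) term) are all assumptions for illustration.

```python
# A tree is a tuple (op, child, ...); ("reg",) and ("addr",) are ready-made operands.
# A rule is (result nonterminal, pattern, cost); strings in a pattern are nonterminals.
BASE_RULES = [
    ("reg", ("LOADU2", "reg"), 10),
    ("stmt", ("ASGNU2", "addr", "reg"), 10),
]
COMBINED_RULE = ("stmt", ("ASGNU2", "addr", ("LOADU2", "reg")), 10)


def match(pattern, tree):
    """Return the (nonterminal, subtree) pairs left to derive, or None on mismatch."""
    if isinstance(pattern, str):
        return [(pattern, tree)]
    if pattern[0] != tree[0] or len(pattern) != len(tree):
        return None
    pending = []
    for sub_pattern, sub_tree in zip(pattern[1:], tree[1:]):
        sub = match(sub_pattern, sub_tree)
        if sub is None:
            return None
        pending.extend(sub)
    return pending


def best_cost(nonterminal, tree, rules):
    """Cheapest total cost of deriving tree as nonterminal, or None if impossible."""
    if tree == (nonterminal,):
        return 0                       # operand is already in the required form
    best = None
    for result, pattern, cost in rules:
        pending = match(pattern, tree) if result == nonterminal else None
        if pending is None:
            continue
        subcosts = [best_cost(nt, sub, rules) for nt, sub in pending]
        if None in subcosts:
            continue
        total = cost + sum(subcosts)
        if best is None or total < best:
            best = total
    return best
```

For the tree ("ASGNU2", ("addr",), ("LOADU2", ("reg",))), the two base rules give a cost of 20 (a copy followed by a store), while adding the combined rule gives 10, so the cheaper single-instruction cover wins.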

Here’s an instance where optimizing out a load would get me in trouble though:

;scomparulator(-11,-11);
	ldI4 RL10,-11 ;loading a long integer constant
	cpy4 Rp1p2,RL10; LOADI4*
	st4 RL10,'O',sp,(4+1); arg+f**
	Ccall _scomparulator

You could certainly do something with it, but it’s an instance where a temporary is being reused for its value and it’s not something that’s generated by the rules – the line with the arg+f** tag at the end is generated by my C code in the function() procedure.
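The question a liveness check has to answer here is whether a register’s value is still needed after a given instruction. A minimal sketch for straight-line code follows, assuming a hypothetical representation in which each instruction is a pair (registers written, registers read) and register names are treated as opaque strings (so any overlap between 16-bit and 32-bit register names is ignored).

```python
def is_dead_after(instructions, index, register):
    """True only if register is overwritten before it is read after instructions[index].

    Running off the end of the block counts as "possibly still live", which is
    the conservative answer when nothing is known about the code that follows.
    """
    for written, read in instructions[index + 1:]:
        if register in read:
            return False   # read before being rewritten: the value is still needed
        if register in written:
            return True    # rewritten first: the old value was dead
    return False
```

In the scomparulator call above, the sequence could be described as [({"RL10"}, set()), ({"Rp1p2"}, {"RL10"}), (set(), {"RL10"})]. Asking about RL10 after the cpy4 (index 1) returns False because the st4 still reads it, which is exactly why loading the constant straight into Rp1p2 would cause trouble.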

I guess Meh About Sums it Up – Branch Optimization IV

The linear version of the branch optimization works pretty well now. I ran it on the Dhrystone code with a Python profiler, where it took 0.1 sec to shorten 100 of 144 branches in 8000 lines of code. On the same code the copt peephole optimizer did 119 optimizations on the 3000 lines of macro code it read. It’s probably safe to say that the copt changes saved about five or ten times as much as the branch optimizer, since the peephole is usually taking code out at the macro level, so several machine instructions a go, while an optimized branch really only takes out the equivalent of 1/2 instruction. If I could do it in a single pass or as part of copt, fine, but on its own it’s maybe not worth it.

I ran a quick comparison of the linear version with the recursion version and there’s just no contest. The recursion version ran much much longer. I also accidentally overwrote the recursive version which i had not versioned so that’s that!

#18-02-10 linear two pass scan - no recursion
#18-02-13 trying to include aligns with jumps.
import sys
sys.path.append('/lcc42/examples/branchanal/')
from opdefsNW import *
from opfuncsNW import *
asminfile = open(sys.argv[1],'r')
asmdata = asminfile.read().expandtabs()
asmlines = asmdata.split('\n')
print len(asmlines), 'lines read'
jumpcount=0;aligncount=0
repcount=0
progsize=0
i=0
tln=0;pln=0;mln=0 #temp label numbers for $$,+ -
labeldefs={}
labelrefs=[]
def ireplace(old, new, text): #    Replace case insensitive
    index_l = text.lower().index(old.lower())
    return text[:index_l] + new + text[index_l + len(old):] 
def gotlabel(token):
	global tln,pln,mln
	label=token.split(':')[0]
	if label.startswith("$$"):
		label+=str(tln)
	elif label.startswith("+"): #check for temporary label
		label+=str(pln)
		pln+=1
	elif label.startswith("-"): #check for temporary label
		mln+=1
		label+=str(mln)
	else:
		tln+=1
	labeldefs[label]=progsize

def fmtlabel(label):
	global tln,pln,mln
	if label.startswith("$$"):
		label+=str(tln)
	elif label.startswith("+"): #check for temporary label
		label+=str(pln)
	elif label.startswith("-"): #check for temporary label
		label+=str(mln)
	return label

for line in asmlines:
	aline=line.split(";")[0] #get rid of any comment portion
	tokens=aline.split();
	if tokens:
		#print "%d %X %s" % (i+1,progsize,aline)
		if (not aline.startswith(' ')) or tokens[0].endswith(':'): #if it has a label
			gotlabel(tokens[0])
			tokens=tokens[1:]			#get rid of it
		if tokens:
			if tokens[0].lower() in jumpdefs:
				jumpcount+=1
				labelrefs.append([i,progsize,tokens[0].lower(),fmtlabel(tokens[1].split(",")[0])])
			elif tokens[0].lower()=="align":
				aligncount+=1
				labelrefs.append([i,progsize,tokens[0].lower(),tokens[1].split(",")[0],process(tokens,progsize)])
				#print "align ",i,progsize,tokens[0].lower(),tokens[1].split(",")[0],process(tokens,progsize)
			progsize+=process(tokens,progsize)

	i+=1;
print "pass 1 completed. ",len(labeldefs)," labels found. ",len(labelrefs)," jumps found."
#print labelrefs,labeldefs
adjb=0;adjl=0
repacount=0
i=0
for ref in labelrefs: # line index, location, branch op(or align), label referenced(or alignment)
	line=ref[0]; loc=ref[1]; brop=ref[2]; label=ref[3]
	adjamt=0
	if brop=="align":
		repacount+=1
		newsize=process([brop,label],loc)
		adjamt=newsize-ref[4]
		labelrefs[i][4]=newsize
		#print "*A*%4d %4x %s" %(line+1,loc,brop), label, ref[4],newsize, adjamt
	elif (loc+1)//256==labeldefs[label]//256:
		asmlines[line]=ireplace(brop,jumpreps[brop],asmlines[line])
		#print "%4d %4x %s %4x" %(line+1,loc,asmlines[line],labeldefs[label])
		repcount+=1
		adjamt=-1
	if not adjamt==0:
		for adjref in labelrefs: # line index, location, branch op, label referenced
			if adjref[0]>line: # for jumps further down the way
				if adjref[2]=="align":
					pass #print "A%d %x "%(adjref[0],adjref[1]),adjref[2],adjref[3],adjref[4]
				adjref[1]+=adjamt #get adjusted
				adjb+=1
		for k, labloc in labeldefs.iteritems():
			if labloc>loc:
				labeldefs[k] +=adjamt
				adjl+=1
	i+=1
#print labeldefs
#for k,labloc in labeldefs.iteritems():
#	print "%4x %s" % (labloc,k)
print repcount,"+",repacount," fixup cycles ",adjb," branch fixups ",adjl," label fixups"
asmoutfile=open(sys.argv[1].split('.')[0]+".basm",'w'); asmoutfile.truncate()
i=0
for line in asmlines:
	asmoutfile.write(line+'\n')
	i+=1
print i, 'lines written'
print "%d long jumps found, %d shortened" % (jumpcount,repcount)
##################opdefsNW follows
opsizes={
	'or':1,'ori':2,'xor':1,'xri':2,
	'and':1,'ani':2,
	'out':1,'inp':1,
	'sep':1,'ldi':2,'plo':1,'phi':1,'glo':1,'ghi':1,
	'sm':1,'smb':1,'smbi':2,
	'sd':1,'sdb':1,'sdi':2,'sdbi':2,
	'skp':1,'lskp':1,'lsnf':1,'lsdf':1,'lsz':1,'lsnz':1,
	'sex':1,'lda':1,'ret':1,'nop':1,'dec':1,'stxd':1,
	'shr':1,'shrc':1,'shl':1,'shlc':1,
	'str':1,'ldn':1,
	'add':1,'adi':2,'adc':1,'adci':2,'str':1,'smi':2,'inc':1,
	'dis':1,'sav':1,
	'bz':2,'bnz':2,'br':2,'b3':2,'bn3':2,'bnf':2,'bdf':2,
        'cpy2': 4, 'cpy1': 2, 'cpy4': 8,
        'zext': 3,'sext': 9,'zext4':4,'sext4':11,
        'negi2':9,'negi4':32,
        'alu2':12,'alu2i':8,
        'alu4':22,'alu4i':16,
        'ldad': 6,
        'lda2': -8,
        'shl2i': -6,'shri2i': -8,'shru2i': -6,
        'shl4':12,'shri4':14,
        'shl4i': -12,'shri4i': -14,
        'st1':-10,'st2':-13,'str1':2,
        'ld2':-12,'ld1':-10,
        'ldn1':2,'ldn2':4,
        'jzu2':8,
        'jnzu2':8,'jnzu1':4,'sjnzu2':6,'sjnzu1':3, #the sj codes are short branch variants
        'jeqi2':18,'jcu2':13,'jneu2': 18,
        'jci2': 20,'jci4': 28,'jcu4': 28,
        'jcu2i':9,
        'jni2i':17,'jnu2i':9,
        'jneu2i': 12,
        'jneu4':39,
        'jequ2i':12,'jci2i':17,
        'jcf4':69,
        'ld2z':4,
        'ldi4':12,'st4':-19,'ld4':-16, 
        'incm': -1,'decm': -1,
        'popr': 5, 'pushr': 4,
        'popf': 5, 'popl': 4,
        'lbr': 3,'lbnz':3,'lbz':3,'lbnf':3,'lbdf':3,
        'equ': 0, 'db': -1,  'dw':2,'dd':4,
        'align':-1, #force alignment
        'cretn': 1,'ccall': 3,
         'seq': 1,   'req': 1,
        'listing': 0,   'include': -3,
        'release': -1,'reserve': -1,
        'jumpv':10,
        'relaxed':0, 'macexp': 0,
        'ldx':1,'ldxa':1,'irx':1,
        'org':-1}
        
jumpdefs={
	'lbr': 3,'lbnz':3,'lbz':3,'lbdf':3,'lbnf':3
	}
jumpreps={
	'lbr': 'br','lbnz':'bnz','lbz':'bz','lbdf':'bdf','lbnf':'bnf'	
	}

#######################opfuncsNW follows
from opdefsNW import *
def opsizer(opsize,tokens,currentpc):
    op=tokens[0].lower()
    operands=tokens[1].split(",")
    #print "op=",op," operands=",operands
    if op=='reserve':
        n=int(operands[0])
        if (n<9):
              return n
        else:
              return 10
    elif op=='release':
        n=int(operands[0])
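        if (n<9): #NOTE: the source text is garbled from here to the db print below; this branch is assumed to mirror 'reserve'
              return n
        else:
              return 10
    elif op=='db':
        #print "db operands=",operands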
        if len(tokens)>=3 and tokens[2]=='dup':
              return int(tokens[1]) #tokens[1] because dup is separated from the count by a space
        else:
	      #print "db length is ",len(operands)
              return len(operands)
    elif op=='include':
        if operands[0].lower().startswith('lcc1802prolo'):
            return 3
        else:
            if operands[0].startswith('lcc1802epilo'):
                epiloglocation=currentpc
            return 0
    elif op in ['ld2','ld1','lda2','st2','st1','ld4','st4']: #ops that can be offset or direct storage refs
    	 #print operands
         if operands[1].lower()=="'o'": #offset/index
            return opsize*-1 #returns the whole value
         else:
            return (opsize*-1)-2 #direct op is 2 smaller
    elif op in ['shl2i','shri2i','shlu2i','shri2i','shru2i','shl4i','shri4i']:
         return int(operands[1])*-1*opsize
    elif op in ['incm','decm']:
         return int(operands[1])*-1*opsize
    elif op=='align':
        boundary=int(operands[0])
        if (currentpc%boundary)==0:
            return 0
        else:
            return boundary-(currentpc%boundary)
    elif op=='org':
    	print "I Hope this ORG is not harmful! ",operands[0]
    	return 0
    else:
        print '**************opsizer oops',tokens,opsize
        x = raw_input("Press Enter to continue")
       
            
        
    return 999
    
def process(tokens,currentpc):
    global opdefs
    try:
        #print 'processing ',tokens,"**",tokens[0],"**",opsizes[tokens[0]],currentpc
        opsize=opsizes[tokens[0].lower()]
    except:
        print 'process oops', tokens
        opsize=0
        print tokens[0],opsizes
        x = raw_input("Press Enter to continue")

    if opsize>=0:
        thisopsize=opsize
    else:
        thisopsize=opsizer(opsize,tokens,currentpc)
    #print tokens[0],opsize,thisopsize
    return thisopsize

Huh – The Awkwardness of Aligns – Branch Optimization III

I redid the branch optimization code to eliminate recursion. The new version does a single pass over the program to locate all the labels and long branches, then starts back down the list of branches looking at their location vs their target. When it finds an unnecessary long branch it shortens it in the program text and runs over all the branches and labels further down the program, subtracting one from their location. This gets rid of the recursion but does a LOT of looping through the lists – in one modest-sized C program it does 170 fixup scans making 18000 branch fixups and 24000 label fixups. The code is easier to understand though. One major problem is that it doesn’t work! Because I don’t look at the actual code when I’m doing the fixups – just the list of branches and labels – I’m not processing align statements, which will have a different effect on the program counter when they get moved. The only thing I can think of to do is to treat the aligns like jumps – include them in the jump list and, when I come to one, give it the “length” required to do the alignment. I’m frankly not clear as to whether that will work!
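The reason aligns are awkward is that their size depends on where they land. The sketch below uses the same padding arithmetic as the align case in opfuncsNW and adds a hypothetical helper, shift_after_align, to show how far the code after an align actually moves when something before it shrinks. The boundary values in the examples are made up.

```python
def align_padding(pc, boundary):
    """Filler bytes an align emits when it starts at address pc."""
    remainder = pc % boundary
    return 0 if remainder == 0 else boundary - remainder


def shift_after_align(pc, boundary, shrink):
    """Net movement of the code following an align when the code before it
    gets shorter by shrink bytes (negative means it moves down)."""
    old_end = pc + align_padding(pc, boundary)
    new_pc = pc - shrink
    new_end = new_pc + align_padding(new_pc, boundary)
    return new_end - old_end
```

With a 4-byte boundary, an align at 509 pads 3 bytes; shorten one branch ahead of it and it sits at 508 and pads nothing, so everything after it moves down 4 bytes rather than 1. An align that was sitting exactly on the boundary at 512 moves to 511 and now pads 1 byte, so the code after it does not move at all. Simply subtracting one from every later location gets both cases wrong.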


UPDATE:
It does work, although I don’t completely trust it. If it fails though, the worst that will happen is that I will get assembly errors – all I’m doing is changing long to short branches and if I’m wrong it will be obvious. The 50 line comparulator test program ran successfully with 175 of 210 long branches shortened. I looked at some of the others and they were genuinely not convertible – the code is fairly long and a switch statement generates a lot of branches. If it were important I could manipulate the C source to reduce branch spans – maybe more function calls.

The other interesting thing about this is that if this were routine I could get rid of at least some of the “align”s and use long branches knowing they would probably get optimized out.

The current version of the code is called branchalignear.py.

#18-02-10 linear two pass scan - no recursion
#18-02-13 trying to include aligns with jumps.
import sys
sys.path.append('/lcc42/examples/branchanal/')
from opdefsNW import *
from opfuncsNW import *
asminfile = open(sys.argv[1],'r')
asmdata = asminfile.read().expandtabs()
asmlines = asmdata.split('\n')
print len(asmlines), 'lines read'
jumpcount=0;aligncount=0
repcount=0
progsize=0
i=0
tln=0;pln=0;mln=0 #temp label numbers for $$,+ -
labeldefs={}
labelrefs=[]
def ireplace(old, new, text): #    Replace case insensitive
    index_l = text.lower().index(old.lower())
    return text[:index_l] + new + text[index_l + len(old):] 
def gotlabel(token):
	global tln,pln,mln
	label=token.split(':')[0]
	if label.startswith("$$"):
		label+=str(tln)
	elif label.startswith("+"): #check for temporary label
		label+=str(pln)
		pln+=1
	elif label.startswith("-"): #check for temporary label
		mln+=1
		label+=str(mln)
	else:
		tln+=1
	labeldefs[label]=progsize

def fmtlabel(label):
	global tln,pln,mln
	if label.startswith("$$"):
		label+=str(tln)
	elif label.startswith("+"): #check for temporary label
		label+=str(pln)
	elif label.startswith("-"): #check for temporary label
		label+=str(mln)
	return label

for line in asmlines:
	aline=line.split(";")[0] #get rid of any comment portion
	tokens=aline.split();
	if tokens:
		#print "%d %X %s" % (i+1,progsize,aline)
		if (not aline.startswith(' ')) or tokens[0].endswith(':'): #if it has a label
			gotlabel(tokens[0])
			tokens=tokens[1:]			#get rid of it
		if tokens:
			if tokens[0].lower() in jumpdefs:
				jumpcount+=1
				labelrefs.append([i,progsize,tokens[0].lower(),fmtlabel(tokens[1].split(",")[0])])
			elif tokens[0].lower()=="align":
				aligncount+=1
				labelrefs.append([i,progsize,tokens[0].lower(),tokens[1].split(",")[0],process(tokens,progsize)])
				#print "align ",i,progsize,tokens[0].lower(),tokens[1].split(",")[0],process(tokens,progsize)
			progsize+=process(tokens,progsize)

	i+=1;
print "pass 1 completed. ",len(labeldefs)," labels found. ",len(labelrefs)," jumps found."
#print labelrefs,labeldefs
adjb=0;adjl=0
repacount=0
i=0
for ref in labelrefs: # line index, location, branch op(or align), label referenced(or alignment)
	line=ref[0]; loc=ref[1]; brop=ref[2]; label=ref[3]
	adjamt=0
	if brop=="align":
		repacount+=1
		newsize=process([brop,label],loc)
		adjamt=newsize-ref[4]
		labelrefs[i][4]=newsize
		#print "*A*%4d %4x %s" %(line+1,loc,brop), label, ref[4],newsize, adjamt
	elif (loc+1)//256==labeldefs[label]//256:
		asmlines[line]=ireplace(brop,jumpreps[brop],asmlines[line])
		#print "%4d %4x %s %4x" %(line+1,loc,asmlines[line],labeldefs[label])
		repcount+=1
		adjamt=-1
	if not adjamt==0:
		for adjref in labelrefs: # line index, location, branch op, label referenced
			if adjref[0]>line: # for jumps further down the way
				if adjref[2]=="align":
					pass #print "A%d %x "%(adjref[0],adjref[1]),adjref[2],adjref[3],adjref[4]
				adjref[1]+=adjamt #get adjusted
				adjb+=1
		for k, labloc in labeldefs.iteritems():
			if labloc>loc:
				labeldefs[k] +=adjamt
				adjl+=1
	i+=1
#print labeldefs
#for k,labloc in labeldefs.iteritems():
#	print "%4x %s" % (labloc,k)
print repcount,"+",repacount," fixup cycles ",adjb," branch fixups ",adjl," label fixups"
asmoutfile=open(sys.argv[1].split('.')[0]+".basm",'w'); asmoutfile.truncate()
i=0
for line in asmlines:
	asmoutfile.write(line+'\n')
	i+=1
print i, 'lines written'
print "%d long jumps found, %d shortened" % (jumpcount,repcount)

Also for the record, when I run this on one of the selftest programs, 00comparulator.c, the 50 lines of C generate 6400 lines of assembly (it’s complicated code and it uses the floating point library) including 210 long jumps, of which 170 get optimized away. As noted above, the 170 branch optimizations cause 20100 branch/align fixups and 25222 label fixups.

Branch Optimization II

18-02-08 branch analysis blink
I figured out my overall theory and tried it out successfully. The code runs through the expanded assembly a line at a time, keeping track of the size of code generated so it always knows the program counter. When it encounters a label it stores it with its address. When it encounters a long branch it checks the target to see if it’s on the same page, and if so it converts it to a short branch. Sometimes the label hasn’t been defined yet, so it calls itself recursively starting at the next line looking for that label. The insight, if there is one, is that because I’m always moving forward, shortening a branch will never screw up a decision I’ve already made – it can only move the target closer.
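The same-page test can be sketched in a couple of lines of Python. It follows the comparison used in the linear script shown elsewhere on this page, which compares the page of the branch’s second byte with the page of the target; the function name is hypothetical.

```python
PAGE_SIZE = 256  # an 1802 short branch only replaces the low byte of the address


def can_shorten(branch_loc, target_loc):
    """True if a long branch at branch_loc can become a short branch to target_loc."""
    return (branch_loc + 1) // PAGE_SIZE == target_loc // PAGE_SIZE
```

A branch at 0x10FE to a label at 0x1080 passes the test, while the same branch one byte later at 0x10FF does not, because its second byte is already on the next page.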

This works, at least for Blink. There were six long branches encountered including the branch to lcc1802init generated by the prolog. I don’t bother with the branch to lcc1802init but the other five were successfully converted to short branches.

Now I need to try a bigger program sample and clean up the code.