• AVR Freaks

Helpful ReplyHot!Unsigned 32 by 32 divide routine in assembler

Page: << < ..67 > Showing page 6 of 7
Author
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Unsigned 32 by 32 divide routine in assembler 2020/09/08 20:58:09 (permalink)
2 (1)
NorthGuy
Impossible to see the future is :)

You must be a time traveler who's been to or from the future. ;)
 
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/08 23:21:06 (permalink)
0
1and0       

 
; Consequently, scale the dividend to ensure the quotient fits in its LSW
        lsr     w5,w1       ; dividend' = dividend >> 1
        rrc     w4,w0       ;     w1:w0 = w5:w4 >> 1
        
.....
 
 
 
; Undo the scaling of dividend by 2 and apply the same ratio as the divisor
        
        dec     w7,w7       ; new shift to account for quotient <<= 1
        lsr     w0,w7,w0    ; quotient >>= (n - 1)
        

 
OMG, this is brilliant; I still want to think about it when I'm fresh in the morning, but I had entirely missed this trick!  Great insight!
 
<Edit 1:>  Well, after more reflection, I was pretty sure I was going to be able to shave several cycles off this routine. But (man!), after trying for a couple of hours, I can't shave even one!  Excellent job! (I especially like how the algorithm responds when the shift-right quantity of the divisor is zero! It does the divide as usual, and then wipes out all bits of the quotient during the adjustment to the within-one section!)
post edited by Spinlectrix - 2020/09/09 14:46:04
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Unsigned 32 by 32 divide routine in assembler 2020/09/08 23:25:11 (permalink)
5 (1)
NorthGuy
Impossible to see the future is :)

I thought it was merely 'Difficult' ?
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/09 03:54:03 (permalink)
0
Spinlectrix
OMG, this is brilliant; I still want to think about it when I'm fresh in the morning, but I had entirely missed this trick!  Great insight!

Keeping more dividend bits for the calculation of the quotient minimizes the rounding error.
 
GlennP
Super Member
  • Total Posts : 833
  • Reward points : 0
  • Joined: 2009/03/29 15:04:55
  • Location: El Paso County, CO, USA
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/09 07:59:34 (permalink)
0
I'm unable to "Reply to post" on the Fast Integer Arithmetic thread I created (https://www.microchip.com.orums/FindPost/1152205), but I have been able to upload .zip.txt files to the original post.  It's current up-to-date with Harry's latest.  It includes a C main program for testing the cases that have tripped up older versions and can be used with the simulator and stopwatch to get cycle counts.
 
GP
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/10 08:08:06 (permalink)
0
MBedder
Give the assembler source file a capital .S extension and use all standard (#) C preprocessor features.

Well, it doesn't work because it seems those macros (predefined symbols) for the device families are defined for the uses of the assembler conditional directives, and the standard (#) C preprocessor does NOT see them. That is, those predefined symbols are undefined to the C preprocessor. :(
 
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/10 08:19:20 (permalink)
0
Spinlectrix
Well, after more reflection, I was pretty sure I was going to be able to shave several cycles off this routine. But (man!), after trying for a couple of hours, I can't shave even one!  Excellent job! (I especially like how the algorithm responds when the shift-right quantity of the divisor is zero! It does the divide as usual, and then wipes out all bits of the quotient during the adjustment to the within-one section!)

I'm not done yet. ;) For the PIC devices (PIC24E, dsPIC33E and dsPIC33C) that support the MULW instruction, one more instruction word and cycle can be shaved but these devices are slower at RETurn.
;
; Unsigned Long Integer Division of 32-bit by 32-bit
;
;   uint32_t
;   udivmod3232 (uint32_t dividend, uint32_t divisor, uint32_t *remainder)
;
;   uint32_t
;   udiv3232 (uint32_t dividend, uint32_t divisor)
;
;
; Dividend = Quotient * Divisor + Remainder
;
; Calling Parameters
;   w1:w0 = Dividend
;   w3:w2 = Divisor
;
; Return Values
;   w1:w0 = Quotient
;   w3:w2 = Remainder
;
; Used Registers
;   w7,w6, w5,w4
;
; Tcy(max) = 12 + DIV + 15+|1| + RETLW
;
; For PIC24F, PIC24H, dsPIC30F, dsPIC33F
;   Tcy(max) = 12 + 18 + 16 + 3 = 49 (quotient & remainder)
;            = 12 + 18 + 13 + 3 = 46 (quotient only)
;
; For PIC24E, dsPIC33E
;   Tcy(max) = 12 + 18 + 15 + 6 = 51 (quotient & remainder)
;            = 12 + 18 + 12 + 6 = 48 (quotient only)
;
; For dsPIC33C
;   Tcy(max) = 12 +  6 + 15 + 6 = 39 (quotient & remainder)
;            = 12 +  6 + 12 + 6 = 36 (quotient only)
;
; Ref: https://www.microchip.com/forums/FindPost/1151719
;
; Credits (All Microchip Forum Usernames):
; * Spinlectrix, 1and0, and GlennP17321.
;
 
; Conditional assembly:
; * Define NO_REMAINDER to return only the quotient and no remainder.
; * Adding the remainder takes 3 more program memory words and 3 more
;   instruction cycles maximum which are negligible, so it is the
;   default by having NO_REMAINDER not defined.
;
; NO_REMAINDER = 1
 
; Loop counter value for the REPEAT instruction of a DIV instruction.
   
.ifdef __dsPIC33C
DIV_RCOUNT = (6-1)          ; execute DIV 6 consecutive times
.else
DIV_RCOUNT = (18-1)         ; execute DIV 18 consecutive times
.endif
 
; Division Routine Names
 
.ifndef NO_REMAINDER
udivmod3232:
.else
udiv3232:
.endif
 
        mov     w0,w4       ; save dividend LSW
        ff1l    w3,w5       ; w5 = first one left of divisor MSW
        bra     c,0f        ; if (divisor <= 0xFFFF) branch
;-------
; Case of Divisor > 0xFFFF
;
; Here the high word of the divisor is non-zero, so the quotient will fit in
; a 16-bit word; i.e. the high word of the quotient is guaranteed to be zero.
;
;   w1:w0 = w1:w0 / w3:w2
;         = w5:w4 / w3:w2
;         = [(w5:w4) >> n] / [(w3:w2) >> n]
        
; Copy divisor for correction of the quotient later.
        
        exch    w1,w5       ; w1 = first one of divisor, w5 = dividend MSW
        
; Calculate shift counts for normalization.
        
        subr    w1,#17,w7   ; right shift >> (w7 = 17 - w1), 1 <= (n=w7) <= 16
        subr    w7,#16,w0   ; left  shift << (w0 = 16 - w7)
        
; Normalize the divisor so its left-most 1 is at the MSb of its LSW.
        
        sl      w3,w0,w0    ; divisor' = divisor >> n
        lsr     w2,w7,w6    ; "
        ior     w0,w6,w6    ; w6 = (w3:w2) >> n
        
; Consequently, scale the dividend to ensure the quotient fits in its LSW;
; also, keeping more dividend bits minimizes the quotient rounding error.
        
        lsr     w5,w1       ; dividend' = dividend >> 1
        rrc     w4,w0       ;     w1:w0 = w5:w4 >> 1
        
; Divide scaled dividend by the normalized divisor to get an estimated quotient
; where the quotient error Err(q) = estimated quotient - correct quotient.
        
        repeat  #DIV_RCOUNT ; calculate the 16-bit quotient (32/16 divide)
        div.ud  w0,w6       ; w0 = (w1:w0)/w6, no overflow here, Err(q)=[0,+1]
 
.ifdef  __PIC24E
PIC_24E_33C_33E = 1
.endif
 
.ifdef  __dsPIC33E
PIC_24E_33C_33E = 1
.endif
 
.ifdef  __dsPIC33C
PIC_24E_33C_33E = 1
.endif
        
; Undo the scaling of dividend by 2 and apply the same ratio as the divisor.
        
        dec     w7,w7       ; new shift to account for quotient <<= 1
        lsr     w0,w7,w1    ; w1 = quotient >>= (n - 1)
        
; Correct quotient for rounding error due to LSbs discarded by the shifts.
        
        btss    SR,#Z       ; if (quotient != 0)
        dec     w1,w1       ;   --quotient, Err(q) = [-1,0]
                            ;   so (quotient * divisor) won't overflow below
    .ifdef  PIC_24E_33C_33E
        mulw.uu w1,w3,w0    ;    w0    = w1 * w3
    .else
        mul.uu  w1,w3,w6    ; w7:w6    = w1 * w3
        mov     w6,w0       ;    w0    = w1 * w3
    .endif
        mul.uu  w1,w2,w6    ; +  w7:w6 = w1 * w2
        add     w0,w7,w7    ; =  w7:w6 = quotient * divisor, no overflow here
        
; Calculate the remainder to correct the final quotient.
        
        sub     w4,w6,w6    ; w7:w6 = dividend - quotient * divisor
        subb    w5,w7,w7    ;       = remainder #1, if it's < divisor
        
        sub     w6,w2,w2    ; if ((dividend - quotient * divisor) >= divisor)
        subb    w7,w3,w3    ;   w3:w2 = remainder #2, if Carry = 1
        
    .ifndef NO_REMAINDER
        btss    SR,#C       ; get the correct remainder
        mov.d   w6,w2       ; "
    .endif
        
        addc    w1,#0,w0    ; correct the quotient += Carry, Err(q) = 0
        retlw   #0,w1       ; return with quotient MSW = 0
;-------
; Case of Divisor <= 0xFFFF
;
; Here the high word of the divisor is zero, so the quotient may overflow a
; 16-bit word.
;
;   w1:w0 = w1:w0 / w2
;         = [(w1 / w2) << 16] + [(w1 % w2):w0 / w2]
        
; Divide high word of dividend by the divisor.
        
0:      repeat  #DIV_RCOUNT ; calculate quotient MSW (16/16 divide)
        div.u   w1,w2       ; w0 = w1 / w2, w1 = w1 % w2
                            ; remainder becomes dividend MSW of next divide
        exch    w0,w4       ; save quotient MSW, restore dividend LSW
        
; Divide remainder and low word of dividend by the divisor.
        
        repeat  #DIV_RCOUNT ; calculate quotient LSW (32/16 divide)
        div.ud  w0,w2       ; w0 = (w1:w0) / w2, w1 = (w1:w0) % w2
        
    .ifndef NO_REMAINDER
        mul.uu  w1,#1,w2    ; w3:w2 = remainder
    .endif
        mov     w4,w1       ; restore quotient MSW
        return              ; return

Maybe now it's finished! ;)



GlennP
Super Member
  • Total Posts : 833
  • Reward points : 0
  • Joined: 2009/03/29 15:04:55
  • Location: El Paso County, CO, USA
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/10 18:14:07 (permalink)
4 (1)
1and0  ...
I'm not done yet. ;) For the PIC devices (PIC24E, dsPIC33E and dsPIC33C) that support the MULW instruction, one more instruction word and cycle can be shaved but these devices are slower at RETurn.

 
Isn't "Harry's" the name of a razor?  Is that how you're shaving these cycles?
 
Not only is the RETurn slower, but the R/CALLs are too.  And branches taken.  Flushing the instruction pipeline.
 
Edit 1: The latest changes are in the zip.txt file attached and include both the MULW speedup and an improved test program.
GP
 
post edited by GlennP - 2020/09/10 19:12:31
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/13 14:41:12 (permalink)
0
Signed 32 by 32 version is  now up for review in another thread.
 
Glenn — As a guess, I think the reason you can't post in some of the threads you started is the inclusion of hyperlink(s), which the Forum software dodges, probably for security safety; Try posting the same material without any hyperlinks, even ones to these forums.
GlennP
Super Member
  • Total Posts : 833
  • Reward points : 0
  • Joined: 2009/03/29 15:04:55
  • Location: El Paso County, CO, USA
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/13 15:47:37 (permalink)
0
Spinlectrix
Signed 32 by 32 version is  now up for review in another thread.
 
Glenn — As a guess, I think the reason you can't post in some of the threads you started is the inclusion of hyperlink(s), which the Forum software dodges, probably for security safety; Try posting the same material without any hyperlinks, even ones to these forums.

Thanks.
 
I'll look at the Signed version tonight (MDT).
 
Some of the posts rejected were virtually empty (certainly no attachments or hyperlinks).  But there has been a recent change in the forum S/W or its configuration (or just server reboots), so I may be able to post on those threads again.
 
GP
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/18 15:16:18 (permalink)
0
Ok—
In optimizing the signed version, I also managed to trim off some cycles of the unsigned version.  What I think is my final code follows:
 
   .title   "Unsigned 32-Bit Integer Division Routine for 16-Bit PIC Processors."
   .sbttl   "Uses 16-Bit Divisor Instructions to Achieve Lower Cycle Counts."
   .psize   60, 132
 
 
 
; Unsigned Integer Division of 32-bit by 32-bit
;
; Calling Parameters
;   W1:W0 = Dividend.
;   W3:W2 = Divisor.
;
; Return Values
;   W1:W0 = Quotient  = Dividend / Divisor.
;   IFF a remainder is requested, then w3:w2 returns with the Remainder
;      W3:W2 = Remainder = Dividend % Divisor. [Optional - adds instructions and (about 2) cycles]
;   However, if a Remainder is NOT requested, W3:W2 will be preserved across the call, and
;   the same Divisor can be used to operate on any new Dividend placed in w1:w0
;
; Clobbered Registers:
;   W4-W7
;
; Ref: https://www.microchip.com/forums/FindPost/1152154 and Predecessors.
;
; Credits (All Microchip Forum Usernames):
;   Spinlectrix - Original Coding.  Trim Register Usage.  Speedups.  Remainder Option.
;   1and0       - Fix Off-by-One Bug, Several Speedups, Quotient Overflow Bug Fix.
;   GlennP17321 - Off-by-One Bug Found, One Speedup, Conditionalize Remainder Calculations.
;
;
; Instruction Count depends on the Build Parameter: "Remainder".
;   +Remainder: 40 Instructions.
;   -Remainder: 33 Instructions.
;
; Cycle Counts depend on the Execution Case, Build Parameters, and the Specific Microcontroller Model.
;   The 33C versions use a repeat #(6-1) for Divides.  Others use a repeat #(18-1).
;   R/CALL instructions take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4.
;   RETURN instructions take 3 cycles except for 24E, 33E, and MasterCores in 33C where they take 6.
;   Branch Taken instrs take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4.
;   Counts Here EXCLUDE Call / Return Cycles (see 'Overhead' Line).  Allows 'inline' Evaluation.
;   CAUTION: Cycle Counts Measured with the Simulator and Stopwatch. BUGS SUSPECTED! 1-2 Extra Cycles?
;
;
; When Divisor.HO == 0:
;   Perform Long Division using 16/16 and 32/16 Hardware Instructions.
;
;   +----------------+----------------+
;   |   DividendHO       DividendLO   |
;   +----------------+----------------+
;
;   1. QuotientHO = DividendHO / DivisorLO.  Use 16/16 Divide.
;   2. Replace DividendHO with Remainder from Step 1:
;
;   +----------------+----------------+
;   |    Remainder       DividendLO   |
;   +----------------+----------------+
;
;   3. QuotientLO = (Remainder | DividendLO) / DivisorLO.  Use 32/16 Divide.
;   4. RemainderHO = 0, RemainderLO = Remainder from Step 3.
;   5. Quotient = (QuotientHO | QuotientLO).
 
;
; When Divisor.HO != 0:
;
;   1. Form Approximate (Normalized) Divisor per Diagram Below.
;
;   Sample Divisor (FindFirstOne = 6, RightShift = 11, LeftShift = 5):
;   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
;   |0 0 0 0 0 1 a b c d e f g h i j k l m n o p q r s t u v w x y z|
;   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
;   |              H O              |              L O              |
;    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
;    1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
;             |                               |
;             +---------------+---------------+
;                             |
;                             |     Approximate 16-Bit Divisor:
;                             |     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
;                             +-->  |1 a b c d e f g h i j k l m n o|
;                                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
;                                           From HO       | From LO
;                                    1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
;                                    5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
;
;   2. Shift Dividend >> 1: Assures NO Overflow Computing (Shifted) Quotient but Retains Precision.
;   3. Divide using 32/16 Instruction.  Quotient MAY be One Greater due to Divisor Truncation.
;   4. Quotient >>= (RightShift - 1) to so Quotient Scale Becomes 1:1.
;   5. Calculate Trial Remainder = Dividend - (Divisor * Quotient) and
;   6. Return iff Remainder is Positive or Zero (Non-Negative) with Quotient (and Remainder)
;   7. Otherwise subtract one from quotient (it was too big) (and add Divisor to Remainder, making it Non-Negative).
 
;
; Include Processor-Specific Definitions.
.include "xc.inc"   ; Finds and Includes Processor-Specific .inc File.
;
; Specify Number of Repeats to Perform a Divide.
;   This Should be in the Microchip-Supplied Include Files, But Was NOT Found.
.ifdef __dsPIC33C
   DivideRepeatCount =  6   ;  6 for 33C Models.
.else
   DivideRepeatCount = 18   ; 18 for All Other Models.
.endif
;
; Code Directives.
.section .text, code      ; Normal Code Section.
.align   2         ; Instruction Alignment.
.global   _udiv3232      ; Make Visible Externally.
.type   _udiv3232, @function   ; Debugger Information in ELS.

 
But I can't seem to post the entire routine in one paste -- too big (?) so the code for the routine continues on next post...
post edited by Spinlectrix - 2020/09/18 16:35:15
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/18 15:31:32 (permalink)
0
Unsigned 32 divide routine code follows:
 
.ifndef Remainder
   Remainder = 0
.endif
 
_udiv3232:
_udivmod3232:
    mov w0, w4 ; Copy Dividend.LO to w4.

; Find First '1' Bit in Divisor.HO (if Any).
    ff1l w3, w5 ; w5 = First One Position; 1 = HO, 16 = LO.
    bra NC, 0f ; C(arry) Implies NO Ones in w3, so Divisor.HO == 0.

; Follows here the method used when the Hi word of the unsigned divisor is zero.
; Divisor < 0x10000. [Divisor.HO == 0] Perform Long Division because Quotient MAY Not Fit in Word.
; Divide Dividend.HO / Divisor.LO.
      repeat #(DivideRepeatCount-1) ; Division Repeat Count Depends on 33C (6) or 'Other' (18).
      div.uw w1, w2 ; Divide Step. At End: w0 = w1 / w2 and w1 = w1 % w2.

; Save Quotient.HO; Restore Dividend.LO.
      exch w0, w4 ; Save Quotient.HO; Restore Dividend.LO.

; Divide (Remainder:Dividend.LO) / Divisor.
      repeat #(DivideRepeatCount-1) ; Division Repeat Count Depends on 33C (6) or 'Other' (18).
      div.ud w0, w2 ; Divide Step. At End: w0 = (w1:w0) / w2 and w1 = (w1:w0) % w2.

; Set w3:w2 to Remainder if Requested.
.if Remainder
      mul.uu w1, #1, w2 ; W3:W2 is Remainder.
.endif

; Recover Quotient.HO.
      mov w4, w1 ; w1 = Quotient.HO (from w4, which was used as temporary storage).
      return ; Return to Caller.

; Follows here the method used when the Hi word of the unsigned divisor is NOT zero.
0:; Divisor > 0x0FFFF. [Divisor.HO > 0] Perform Simple Division and Possible Correction.
; Here Divisor.HO > 0, so the Result Will Fit into a Word (Quotient.HO must be zero).

; Preserve Dividend.HO in W5. [Dividend.LO Already in W4.]
    exch w1, w5 ; w1 = FirstOne Position; w5 = Dividend.HO. W5:W4 = Dividend.

; Calculate Shift Counts for Normalizing the Divisor.
    subr w1, #17, w7 ; w7 = Right Shift Count.
    subr w7, #16, w0 ; w0 = Left Shift Count.

; 'Normalized' 16-Bit Divisor Created by Shifting the Actual Divisor RIGHT by "RightShiftCount".
    sl w3, w0, w0 ; Divisor.HO <<= LeftShiftCount.
    lsr w2, w7, w6 ; Divisor.LO >>= RightShiftCount.
    ior w0, w6, w6 ; Divisor.LO = (Divisor >> RightShiftCount). Note: MSb = 1.

; Shift Dividend Right by 1: W1:W0 = W5:W4 >> 1.
; Retains Maximum Precision in Quotient (Next Step) while Assuring Quotient Fits in One Word.
    lsr w5, w1 ; Dividend.HO >>= 1.
    rrc w4, w0 ; Dividend.LO >>= 1 (and HO Bit from Dividend.HO Bit 0).

; Perform Division of 32/16 Unsigned using "div.ud" Instruction.
    repeat #(DivideRepeatCount-1) ; Division Repeat Count Depends on 33C (6) or 'Other' (18).
    div.ud w0, w6 ; Divide Step. At End: w0 = (w1:w0) / w6 and w1 = (w1:w0) % w6.

; Scale Quotient by RightShiftCount-1 to Compensate for Divisor Shifting.
    dec w7, w7 ; Reduce RightShiftCount (Dividend Already Shifted Right One Above).
    lsr w0, w7, w0 ; Quotient (w0) >> (RightShiftCount-1). Quotient Now Scaled Back to 1:1.

; Form Quotient * Divisor.
    mul.uu w3, w0, w6 ; w7:w6 = Divisor.HO * Quotient.
    mov w6, w1 ; Save (Divisor.HO * Quotient).LO in w1.
    mul.uu w2, w0, w6 ; w7:w6 = Divisor.LO * Quotient.
    add w7, w1, w7 ; w7:w6 = Divisor * Quotient.

.if Remainder
        exch w2, w6 ; Swap divisor with (divisor*quotient) product
        exch w3, w7 ; (Preserve the Divisor in case we have to recalculate remainder)
        clr w1 ; w1:w3:w2 will represent an unsigned 48-bit quantity here
        addc w1,#0,w1 ; Set bit iff add produced an overflow

; Calculate the default remainder for the probable quotient
        sub w4, w2, w2 ; w3:w2 = dividend - quotient * divisor
        subb w5, w3, w3 ; = Remainder (probably)
        subbr w1, #0, w1 ; Make sure entire 48-bit value did not go negative

        btsc SR, #0 ; Iff Carry is clear we have to adjust the quotient and remainder, so do not return
        retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.

; Correct quotient, Calculate the remainder for the corrected quotient
        dec w0, w0 ; Adjust quotient down by one
        add w2, w6, w2 ; We had subtracted one too many divisors:
        addc w3, w7, w3 ; Add one divisor back into Remainder
        retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.

.else ; Case for no Remainder necessary
; Note: If the NoRemainder or Remaider = 0 case is used, the divisor in w3:w2 should be preserved across the call!
    bra C, 1f ; Go decrease Quotient iff add caused Carry

; Determine if dividend - (quotient * divisor) is non-negative
    sub w4, w6, w6 ; w7:w6 = dividend - quotient * divisor
    subb w5, w7, w7 ; = Remainder (probably)

    btsc SR, #0 ; Iff Carry is clear, we have to adjust the quotient, so do not return
    retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.

1: dec w0, w0 ; Adjust quotient down by one
    retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.
.endif
.end

post edited by Spinlectrix - 2020/09/18 16:47:33
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/18 15:40:01 (permalink)
2 (1)
Spinlectrix
<Edit>  Well, I am confuzled... cannot seem to post code right now.

Testing 123 ...
;
; Unsigned Long Integer Division of 32-bit by 32-bit
;
;   uint32_t
;   udivmod3232 (uint32_t dividend, uint32_t divisor, uint32_t *remainder)
;
;   uint32_t
;   udiv3232 (uint32_t dividend, uint32_t divisor)
;
;
; Dividend = Quotient * Divisor + Remainder
;
; Calling Parameters
;   w1:w0 = Dividend
;   w3:w2 = Divisor
;
; Return Values
;   w1:w0 = Quotient  = Dividend / Divisor
;   w3:w2 = Remainder = Dividend % Divisor
;
; Clobbered Registers
;   w7,w6, w5,w4

1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/18 16:23:36 (permalink)
0
Spinlectrix
In optimizing the signed version, I also managed to trim off some cycles of the unsigned version.  What I think is my final code follows:

; Cycle Counts depend on the Execution Case, Build Parameters, and the Specific Microcontroller Model.
;   The 33C versions use a repeat #(6-1) for Divides.  Others use a repeat #(18-1).
;   R/CALL instructions take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4.
;   RETURN instructions take 3 cycles except for 24E, 33E, and MasterCores in 33C where they take 6.
;   Branch Taken instrs take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4.
;   Counts Here EXCLUDE Call / Return Cycles (see 'Overhead' Line).  Allows 'inline' Evaluation.
;   CAUTION: Cycle Counts Measured with the Simulator and Stopwatch. BUGS SUSPECTED! 1-2 Extra Cycles?


I may be wrong, but I count more cycles. What maximum cycles did you get?  For which PIC device?
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/18 17:55:29 (permalink)
0
1and0I may be wrong, but I count more cycles. What maximum cycles did you get?  For which PIC device?



Well, you're not wrong that when calculating remainders, the longest path (Worst case) through this code is longer in my routine than in your previously posted code. 
 
However, on any core of processor, the No-Remainder path, both the average path and the worst-case path are faster.  In addition, when using the No-Remainder variant, the Divisor value is preserved across the div calls in the new code, allowing tables or arrays of Dividends to be successively divided by a constant Divisor without reloading (and thus saving another 2 cyles, minimum).  Given how rare ULongword Modulo is useful as a operator on a microcontroller, the normal ULongword divide is the case I felt was the one worth concentrating on.
 
And even when calculating the Remainder, the _average_ cycle count is still a bit faster; Events triggering the worst-case path in the new code are rare, typically occurring in only about 1 in 2^16 divides.  That said, it's not like the new routine is hugely faster:  My current tester, doing 1M random divides with also calculating the Remainder yields an average of 57.61 cycles per divide with your code (Worst case 59 cycles) on a 'E'-core processor.  With my new code, (and again calculating the Remainder) yields 57.40 cyles/divide (Worst-case 63 cycles).  Not a lot to write home about, but still an improvement.
 
I'm in the process of doing 10M random divides to see if it changes things any, but I doubt it.  I will update the results regardless, and you're welcome to see my random Dividend and Divisor generator, or to write your own; It's not inconceivable that that could bias these results! <Edit update: no change at all for 10M divides>
 
The bigger part of the insight I gained in doing this refactoring is better reflected in managing a significant improvement (if a few cycles can be called significant) of the Remainder path of the signed divide.  Soon to post on the signed 32 by 32 divivison thread!
post edited by Spinlectrix - 2020/09/18 18:59:49
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/18 22:14:51 (permalink)
0
Spinlectrix
Well, you're not wrong that when calculating remainders, the longest path (Worst case) through this code is longer in my routine than in your previously posted code. 

The worst case cycle is what matter when trying to write the fastest routine. It is the weakest link. ;)
 

However, on any core of processor, the No-Remainder path, both the average path and the worst-case path are faster.  In addition, when using the No-Remainder variant, the Divisor value is preserved across the div calls in the new code, allowing tables or arrays of Dividends to be successively divided by a constant Divisor without reloading (and thus saving another 2 cyles, minimum).  Given how rare ULongword Modulo is useful as a operator on a microcontroller, the normal ULongword divide is the case I felt was the one worth concentrating on.

That is why I suggested earlier in this thread that there should be two separate routines, so they can be individually optimized without trade-off or compromise.
 

And even when calculating the Remainder, the _average_ cycle count is still a bit faster; Events triggering the worst-case path in the new code are rare, typically occurring in only about 1 in 2^16 divides.  That said, it's not like the new routine is hugely faster:  My current tester, doing 1M random divides with also calculating the Remainder yields an average of 57.61 cycles per divide with your code (Worst case 59 cycles) on a 'E'-core processor.  With my new code, (and again calculating the Remainder) yields 57.40 cyles/divide (Worst-case 63 cycles).  Not a lot to write home about, but still an improvement.

How did you get worse case 59 cycles for my code? I counted my code takes maximum of 53 cycles on an 'E'-core calculating the remainder too.
 
<edit> For 'E'-core, swap the two legs in my code to shave one cycle; i.e. 52 cycles max. My cycle counts include the RETURN but not the CALL -- only cycles consumed by the routine.
 

I'm in the process of doing 10M random divides to see if it changes things any, but I doubt it.  I will update the results regardless, and you're welcome to see my random Dividend and Divisor generator, or to write your own; It's not inconceivable that that could bias these results! <Edit update: no change at all for 10M divides>

Please post your random dividend and divisor generator.
 
post edited by 1and0 - 2020/09/18 23:23:19
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/19 09:58:41 (permalink)
0
However, you have managed to shame me into putting even more time into this:  The following u3232 code has no worst-case paths that are longer than your original algorithm:
 
I've now given up pasting easily visible source code; above some size, posting it just seems to stop working(?), so the udiv3232 routine is attached here as a .zip file.
 
                                               Original Code      Original Code          New Code           New Code       
'E' Core Performance Stats:  Average Cycles      Worst-Case       Average Cycles      Worst-Case

No Remainder Returned               54.12                      56                      52.06                     53           -2.06
With Remainder Returned            55.61                      57                      54.00                     57           -1.61
 
<Edit:  These numbers have now been updated, thanks to Harry (1and0) pointing out a uniform 2-cycle flaw in how I was collecting timing numbers!  I believe the numbers posted here are now correct. >
post edited by Spinlectrix - 2020/09/20 10:03:10
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/19 10:14:06 (permalink)
0
Spinlectrix
However, you have managed to shame me into putting even more time into this:  The following u3232 code has no worst-case paths that are longer than your original algorithm:

That was not my intention. Any code that will beat the fastest code is welcome! We all benefit and learn something at the same time. When push comes to shove, we can unroll the divide loops and get rid of the REPEAT instructions to shave a cycle or two ... hehe. ;)

For my signed version here the function call can be inlined to eliminate the Call and Return.
 
1and0
Access is Denied
  • Total Posts : 11506
  • Reward points : 0
  • Joined: 2007/05/06 12:03:20
  • Location: Harry's Gray Matter
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/19 16:09:47 (permalink)
0
1and0
Spinlectrix
However, on any core of processor, the No-Remainder path, both the average path and the worst-case path are faster.  In addition, when using the No-Remainder variant, the Divisor value is preserved across the div calls in the new code, allowing tables or arrays of Dividends to be successively divided by a constant Divisor without reloading (and thus saving another 2 cyles, minimum).  Given how rare ULongword Modulo is useful as a operator on a microcontroller, the normal ULongword divide is the case I felt was the one worth concentrating on.

That is why I suggested earlier in this thread that there should be two separate routines, so they can be individually optimized without trade-off or compromise.

I finally have some time to look closer at your algorithm for the no-remainder version. It is brilliant and shaves at least 2 cycles; so let's use your code for the no-remainder version and my code for the remainder version, then we'll have the best of both world! Keep going and shave more cycles. :)
 
Spinlectrix
Junior Member
  • Total Posts : 72
  • Reward points : 0
  • Joined: 2016/03/20 14:48:37
  • Location: 0
  • Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler 2020/09/19 16:44:19 (permalink)
0
Well, the most recent version saves four cycles, but on just one of the two divide methods, so as a result the average performance is only 2 cycles faster.
 
Man, it gets harder and harder to add any performance!  My work on eliminating Worst-case paths through the code took a few hours, shaved only 1 cycle for one path through the worst-case, and did not affect the average performance at all.
 
1and0Keep going and shave more cycles. :)

See thread post #117 above.
post edited by Spinlectrix - 2020/09/19 17:13:22
Page: << < ..67 > Showing page 6 of 7
Jump to:
© 2020 APG vNext Commercial Version 4.5