1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Unsigned 32 by 32 divide routine in assembler
2020/09/08 20:58:09
(permalink)
NorthGuy Impossible to see the future is :)
You must be a time traveler who's been to or from the future. ;)

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/08 23:21:06
(permalink)
1and0
; Consequently, scale the dividend to ensure the quotient fits in its LSW lsr w5,w1 ; dividend' = dividend >> 1 rrc w4,w0 ; w1:w0 = w5:w4 >> 1 ..... ; Undo the scaling of dividend by 2 and apply the same ratio as the divisor dec w7,w7 ; new shift to account for quotient <<= 1 lsr w0,w7,w0 ; quotient >>= (n  1)
OMG, this is brilliant; I still want to think about it when I'm fresh in the morning, but I had entirely missed this trick! Great insight! <Edit 1:> Well, after more reflection, I was pretty sure I was going to be able to shave several cycles off this routine. But (man!), after trying for a couple of hours, I can't shave even one! Excellent job! (I especially like how the algorithm responds when the shiftright quantity of the divisor is zero! It does the divide as usual, and then wipes out all bits of the quotient during the adjustment to the withinone section!)
post edited by Spinlectrix  2020/09/09 14:46:04

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Unsigned 32 by 32 divide routine in assembler
2020/09/08 23:25:11
(permalink)
NorthGuy Impossible to see the future is :)
I thought it was merely 'Difficult' ?

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/09 03:54:03
(permalink)
Spinlectrix OMG, this is brilliant; I still want to think about it when I'm fresh in the morning, but I had entirely missed this trick! Great insight!
Keeping more dividend bits for the calculation of the quotient minimizes the rounding error.

GlennP
Super Member
 Total Posts : 833
 Reward points : 0
 Joined: 2009/03/29 15:04:55
 Location: El Paso County, CO, USA
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/09 07:59:34
(permalink)
I'm unable to "Reply to post" on the Fast Integer Arithmetic thread I created ( https://www.microchip.com.orums/FindPost/1152205), but I have been able to upload .zip.txt files to the original post. It's current uptodate with Harry's latest. It includes a C main program for testing the cases that have tripped up older versions and can be used with the simulator and stopwatch to get cycle counts. GP

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/10 08:08:06
(permalink)
MBedder Give the assembler source file a capital .S extension and use all standard (#) C preprocessor features.
Well, it doesn't work because it seems those macros (predefined symbols) for the device families are defined for the uses of the assembler conditional directives, and the standard (#) C preprocessor does NOT see them. That is, those predefined symbols are undefined to the C preprocessor. :(

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/10 08:19:20
(permalink)
Spinlectrix Well, after more reflection, I was pretty sure I was going to be able to shave several cycles off this routine. But (man!), after trying for a couple of hours, I can't shave even one! Excellent job! (I especially like how the algorithm responds when the shiftright quantity of the divisor is zero! It does the divide as usual, and then wipes out all bits of the quotient during the adjustment to the withinone section!)
I'm not done yet. ;) For the PIC devices (PIC24E, dsPIC33E and dsPIC33C) that support the MULW instruction, one more instruction word and cycle can be shaved but these devices are slower at RETurn. ; ; Unsigned Long Integer Division of 32bit by 32bit ; ; uint32_t ; udivmod3232 (uint32_t dividend, uint32_t divisor, uint32_t *remainder) ; ; uint32_t ; udiv3232 (uint32_t dividend, uint32_t divisor) ; ; ; Dividend = Quotient * Divisor + Remainder ; ; Calling Parameters ; w1:w0 = Dividend ; w3:w2 = Divisor ; ; Return Values ; w1:w0 = Quotient ; w3:w2 = Remainder ; ; Used Registers ; w7,w6, w5,w4 ; ; Tcy(max) = 12 + DIV + 15+1 + RETLW ; ; For PIC24F, PIC24H, dsPIC30F, dsPIC33F ; Tcy(max) = 12 + 18 + 16 + 3 = 49 (quotient & remainder) ; = 12 + 18 + 13 + 3 = 46 (quotient only) ; ; For PIC24E, dsPIC33E ; Tcy(max) = 12 + 18 + 15 + 6 = 51 (quotient & remainder) ; = 12 + 18 + 12 + 6 = 48 (quotient only) ; ; For dsPIC33C ; Tcy(max) = 12 + 6 + 15 + 6 = 39 (quotient & remainder) ; = 12 + 6 + 12 + 6 = 36 (quotient only) ; ; Ref: https://www.microchip.com/forums/FindPost/1151719 ; ; Credits (All Microchip Forum Usernames): ; * Spinlectrix, 1and0, and GlennP17321. ; ; Conditional assembly: ; * Define NO_REMAINDER to return only the quotient and no remainder. ; * Adding the remainder takes 3 more program memory words and 3 more ; instruction cycles maximum which are negligible, so it is the ; default by having NO_REMAINDER not defined. ; ; NO_REMAINDER = 1 ; Loop counter value for the REPEAT instruction of a DIV instruction. .ifdef __dsPIC33C DIV_RCOUNT = (61) ; execute DIV 6 consecutive times .else DIV_RCOUNT = (181) ; execute DIV 18 consecutive times .endif ; Division Routine Names .ifndef NO_REMAINDER udivmod3232: .else udiv3232: .endif mov w0,w4 ; save dividend LSW ff1l w3,w5 ; w5 = first one left of divisor MSW bra c,0f ; if (divisor <= 0xFFFF) branch ; ; Case of Divisor > 0xFFFF ; ; Here the high word of the divisor is nonzero, so the quotient will fit in ; a 16bit word; i.e. the high word of the quotient is guaranteed to be zero. ; ; w1:w0 = w1:w0 / w3:w2 ; = w5:w4 / w3:w2 ; = [(w5:w4) >> n] / [(w3:w2) >> n] ; Copy divisor for correction of the quotient later. exch w1,w5 ; w1 = first one of divisor, w5 = dividend MSW ; Calculate shift counts for normalization. subr w1,#17,w7 ; right shift >> (w7 = 17  w1), 1 <= (n=w7) <= 16 subr w7,#16,w0 ; left shift << (w0 = 16  w7) ; Normalize the divisor so its leftmost 1 is at the MSb of its LSW. sl w3,w0,w0 ; divisor' = divisor >> n lsr w2,w7,w6 ; " ior w0,w6,w6 ; w6 = (w3:w2) >> n ; Consequently, scale the dividend to ensure the quotient fits in its LSW; ; also, keeping more dividend bits minimizes the quotient rounding error. lsr w5,w1 ; dividend' = dividend >> 1 rrc w4,w0 ; w1:w0 = w5:w4 >> 1 ; Divide scaled dividend by the normalized divisor to get an estimated quotient ; where the quotient error Err(q) = estimated quotient  correct quotient. repeat #DIV_RCOUNT ; calculate the 16bit quotient (32/16 divide) div.ud w0,w6 ; w0 = (w1:w0)/w6, no overflow here, Err(q)=[0,+1] .ifdef __PIC24E PIC_24E_33C_33E = 1 .endif .ifdef __dsPIC33E PIC_24E_33C_33E = 1 .endif .ifdef __dsPIC33C PIC_24E_33C_33E = 1 .endif ; Undo the scaling of dividend by 2 and apply the same ratio as the divisor. dec w7,w7 ; new shift to account for quotient <<= 1 lsr w0,w7,w1 ; w1 = quotient >>= (n  1) ; Correct quotient for rounding error due to LSbs discarded by the shifts. btss SR,#Z ; if (quotient != 0) dec w1,w1 ; quotient, Err(q) = [1,0] ; so (quotient * divisor) won't overflow below .ifdef PIC_24E_33C_33E mulw.uu w1,w3,w0 ; w0 = w1 * w3 .else mul.uu w1,w3,w6 ; w7:w6 = w1 * w3 mov w6,w0 ; w0 = w1 * w3 .endif mul.uu w1,w2,w6 ; + w7:w6 = w1 * w2 add w0,w7,w7 ; = w7:w6 = quotient * divisor, no overflow here ; Calculate the remainder to correct the final quotient. sub w4,w6,w6 ; w7:w6 = dividend  quotient * divisor subb w5,w7,w7 ; = remainder #1, if it's < divisor sub w6,w2,w2 ; if ((dividend  quotient * divisor) >= divisor) subb w7,w3,w3 ; w3:w2 = remainder #2, if Carry = 1 .ifndef NO_REMAINDER btss SR,#C ; get the correct remainder mov.d w6,w2 ; " .endif addc w1,#0,w0 ; correct the quotient += Carry, Err(q) = 0 retlw #0,w1 ; return with quotient MSW = 0 ; ; Case of Divisor <= 0xFFFF ; ; Here the high word of the divisor is zero, so the quotient may overflow a ; 16bit word. ; ; w1:w0 = w1:w0 / w2 ; = [(w1 / w2) << 16] + [(w1 % w2):w0 / w2] ; Divide high word of dividend by the divisor. 0: repeat #DIV_RCOUNT ; calculate quotient MSW (16/16 divide) div.u w1,w2 ; w0 = w1 / w2, w1 = w1 % w2 ; remainder becomes dividend MSW of next divide exch w0,w4 ; save quotient MSW, restore dividend LSW ; Divide remainder and low word of dividend by the divisor. repeat #DIV_RCOUNT ; calculate quotient LSW (32/16 divide) div.ud w0,w2 ; w0 = (w1:w0) / w2, w1 = (w1:w0) % w2 .ifndef NO_REMAINDER mul.uu w1,#1,w2 ; w3:w2 = remainder .endif mov w4,w1 ; restore quotient MSW return ; return
Maybe now it's finished! ;)

GlennP
Super Member
 Total Posts : 833
 Reward points : 0
 Joined: 2009/03/29 15:04:55
 Location: El Paso County, CO, USA
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/10 18:14:07
(permalink)
1and0 ... I'm not done yet. ;) For the PIC devices (PIC24E, dsPIC33E and dsPIC33C) that support the MULW instruction, one more instruction word and cycle can be shaved but these devices are slower at RETurn. Isn't "Harry's" the name of a razor? Is that how you're shaving these cycles? Not only is the RETurn slower, but the R/CALLs are too. And branches taken. Flushing the instruction pipeline. Edit 1: The latest changes are in the zip.txt file attached and include both the MULW speedup and an improved test program. GP
post edited by GlennP  2020/09/10 19:12:31

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/13 14:41:12
(permalink)
Signed 32 by 32 version is now up for review in another thread. Glenn — As a guess, I think the reason you can't post in some of the threads you started is the inclusion of hyperlink(s), which the Forum software dodges, probably for security safety; Try posting the same material without any hyperlinks, even ones to these forums.

GlennP
Super Member
 Total Posts : 833
 Reward points : 0
 Joined: 2009/03/29 15:04:55
 Location: El Paso County, CO, USA
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/13 15:47:37
(permalink)
Spinlectrix Signed 32 by 32 version is now up for review in another thread. Glenn — As a guess, I think the reason you can't post in some of the threads you started is the inclusion of hyperlink(s), which the Forum software dodges, probably for security safety; Try posting the same material without any hyperlinks, even ones to these forums. Thanks. I'll look at the Signed version tonight (MDT). Some of the posts rejected were virtually empty (certainly no attachments or hyperlinks). But there has been a recent change in the forum S/W or its configuration (or just server reboots), so I may be able to post on those threads again. GP

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/18 15:16:18
(permalink)
Ok— In optimizing the signed version, I also managed to trim off some cycles of the unsigned version. What I think is my final code follows: .title "Unsigned 32Bit Integer Division Routine for 16Bit PIC Processors." .sbttl "Uses 16Bit Divisor Instructions to Achieve Lower Cycle Counts." .psize 60, 132 ; Unsigned Integer Division of 32bit by 32bit ; ; Calling Parameters ; W1:W0 = Dividend. ; W3:W2 = Divisor. ; ; Return Values ; W1:W0 = Quotient = Dividend / Divisor. ; IFF a remainder is requested, then w3:w2 returns with the Remainder ; W3:W2 = Remainder = Dividend % Divisor. [Optional  adds instructions and (about 2) cycles] ; However, if a Remainder is NOT requested, W3:W2 will be preserved across the call, and ; the same Divisor can be used to operate on any new Dividend placed in w1:w0 ; ; Clobbered Registers: ; W4W7 ; ; Ref: https://www.microchip.com/forums/FindPost/1152154 and Predecessors. ; ; Credits (All Microchip Forum Usernames): ; Spinlectrix  Original Coding. Trim Register Usage. Speedups. Remainder Option. ; 1and0  Fix OffbyOne Bug, Several Speedups, Quotient Overflow Bug Fix. ; GlennP17321  OffbyOne Bug Found, One Speedup, Conditionalize Remainder Calculations. ; ; ; Instruction Count depends on the Build Parameter: "Remainder". ; +Remainder: 40 Instructions. ; Remainder: 33 Instructions. ; ; Cycle Counts depend on the Execution Case, Build Parameters, and the Specific Microcontroller Model. ; The 33C versions use a repeat #(61) for Divides. Others use a repeat #(181). ; R/CALL instructions take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4. ; RETURN instructions take 3 cycles except for 24E, 33E, and MasterCores in 33C where they take 6. ; Branch Taken instrs take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4. ; Counts Here EXCLUDE Call / Return Cycles (see 'Overhead' Line). Allows 'inline' Evaluation. ; CAUTION: Cycle Counts Measured with the Simulator and Stopwatch. BUGS SUSPECTED! 12 Extra Cycles? ; ; ; When Divisor.HO == 0: ; Perform Long Division using 16/16 and 32/16 Hardware Instructions. ; ; +++ ;  DividendHO DividendLO  ; +++ ; ; 1. QuotientHO = DividendHO / DivisorLO. Use 16/16 Divide. ; 2. Replace DividendHO with Remainder from Step 1: ; ; +++ ;  Remainder DividendLO  ; +++ ; ; 3. QuotientLO = (Remainder  DividendLO) / DivisorLO. Use 32/16 Divide. ; 4. RemainderHO = 0, RemainderLO = Remainder from Step 3. ; 5. Quotient = (QuotientHO  QuotientLO). ; ; When Divisor.HO != 0: ; ; 1. Form Approximate (Normalized) Divisor per Diagram Below. ; ; Sample Divisor (FindFirstOne = 6, RightShift = 11, LeftShift = 5): ; +++++++++++++++++++++++++++++++++ ; 0 0 0 0 0 1 a b c d e f g h i j k l m n o p q r s t u v w x y z ; +++++++++++++++++++++++++++++++++ ;  H O  L O  ; 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ; 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ;   ; +++ ;  ;  Approximate 16Bit Divisor: ;  +++++++++++++++++ ; +> 1 a b c d e f g h i j k l m n o ; +++++++++++++++++ ; From HO  From LO ; 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ; 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ; ; 2. Shift Dividend >> 1: Assures NO Overflow Computing (Shifted) Quotient but Retains Precision. ; 3. Divide using 32/16 Instruction. Quotient MAY be One Greater due to Divisor Truncation. ; 4. Quotient >>= (RightShift  1) to so Quotient Scale Becomes 1:1. ; 5. Calculate Trial Remainder = Dividend  (Divisor * Quotient) and ; 6. Return iff Remainder is Positive or Zero (NonNegative) with Quotient (and Remainder) ; 7. Otherwise subtract one from quotient (it was too big) (and add Divisor to Remainder, making it NonNegative). ; ; Include ProcessorSpecific Definitions. .include "xc.inc" ; Finds and Includes ProcessorSpecific .inc File. ; ; Specify Number of Repeats to Perform a Divide. ; This Should be in the MicrochipSupplied Include Files, But Was NOT Found. .ifdef __dsPIC33C DivideRepeatCount = 6 ; 6 for 33C Models. .else DivideRepeatCount = 18 ; 18 for All Other Models. .endif ; ; Code Directives. .section .text, code ; Normal Code Section. .align 2 ; Instruction Alignment. .global _udiv3232 ; Make Visible Externally. .type _udiv3232, @function ; Debugger Information in ELS. But I can't seem to post the entire routine in one paste  too big (?) so the code for the routine continues on next post...
post edited by Spinlectrix  2020/09/18 16:35:15

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/18 15:31:32
(permalink)
Unsigned 32 divide routine code follows: .ifndef Remainder Remainder = 0 .endif _udiv3232: _udivmod3232: mov w0, w4 ; Copy Dividend.LO to w4.
; Find First '1' Bit in Divisor.HO (if Any). ff1l w3, w5 ; w5 = First One Position; 1 = HO, 16 = LO. bra NC, 0f ; C(arry) Implies NO Ones in w3, so Divisor.HO == 0.
; Follows here the method used when the Hi word of the unsigned divisor is zero. ; Divisor < 0x10000. [Divisor.HO == 0] Perform Long Division because Quotient MAY Not Fit in Word. ; Divide Dividend.HO / Divisor.LO. repeat #(DivideRepeatCount1) ; Division Repeat Count Depends on 33C (6) or 'Other' (18). div.uw w1, w2 ; Divide Step. At End: w0 = w1 / w2 and w1 = w1 % w2.
; Save Quotient.HO; Restore Dividend.LO. exch w0, w4 ; Save Quotient.HO; Restore Dividend.LO.
; Divide (Remainder:Dividend.LO) / Divisor. repeat #(DivideRepeatCount1) ; Division Repeat Count Depends on 33C (6) or 'Other' (18). div.ud w0, w2 ; Divide Step. At End: w0 = (w1:w0) / w2 and w1 = (w1:w0) % w2.
; Set w3:w2 to Remainder if Requested. .if Remainder mul.uu w1, #1, w2 ; W3:W2 is Remainder. .endif
; Recover Quotient.HO. mov w4, w1 ; w1 = Quotient.HO (from w4, which was used as temporary storage). return ; Return to Caller.
; Follows here the method used when the Hi word of the unsigned divisor is NOT zero. 0:; Divisor > 0x0FFFF. [Divisor.HO > 0] Perform Simple Division and Possible Correction. ; Here Divisor.HO > 0, so the Result Will Fit into a Word (Quotient.HO must be zero).
; Preserve Dividend.HO in W5. [Dividend.LO Already in W4.] exch w1, w5 ; w1 = FirstOne Position; w5 = Dividend.HO. W5:W4 = Dividend.
; Calculate Shift Counts for Normalizing the Divisor. subr w1, #17, w7 ; w7 = Right Shift Count. subr w7, #16, w0 ; w0 = Left Shift Count.
; 'Normalized' 16Bit Divisor Created by Shifting the Actual Divisor RIGHT by "RightShiftCount". sl w3, w0, w0 ; Divisor.HO <<= LeftShiftCount. lsr w2, w7, w6 ; Divisor.LO >>= RightShiftCount. ior w0, w6, w6 ; Divisor.LO = (Divisor >> RightShiftCount). Note: MSb = 1.
; Shift Dividend Right by 1: W1:W0 = W5:W4 >> 1. ; Retains Maximum Precision in Quotient (Next Step) while Assuring Quotient Fits in One Word. lsr w5, w1 ; Dividend.HO >>= 1. rrc w4, w0 ; Dividend.LO >>= 1 (and HO Bit from Dividend.HO Bit 0).
; Perform Division of 32/16 Unsigned using "div.ud" Instruction. repeat #(DivideRepeatCount1) ; Division Repeat Count Depends on 33C (6) or 'Other' (18). div.ud w0, w6 ; Divide Step. At End: w0 = (w1:w0) / w6 and w1 = (w1:w0) % w6.
; Scale Quotient by RightShiftCount1 to Compensate for Divisor Shifting. dec w7, w7 ; Reduce RightShiftCount (Dividend Already Shifted Right One Above). lsr w0, w7, w0 ; Quotient (w0) >> (RightShiftCount1). Quotient Now Scaled Back to 1:1.
; Form Quotient * Divisor. mul.uu w3, w0, w6 ; w7:w6 = Divisor.HO * Quotient. mov w6, w1 ; Save (Divisor.HO * Quotient).LO in w1. mul.uu w2, w0, w6 ; w7:w6 = Divisor.LO * Quotient. add w7, w1, w7 ; w7:w6 = Divisor * Quotient.
.if Remainder exch w2, w6 ; Swap divisor with (divisor*quotient) product exch w3, w7 ; (Preserve the Divisor in case we have to recalculate remainder) clr w1 ; w1:w3:w2 will represent an unsigned 48bit quantity here addc w1,#0,w1 ; Set bit iff add produced an overflow
; Calculate the default remainder for the probable quotient sub w4, w2, w2 ; w3:w2 = dividend  quotient * divisor subb w5, w3, w3 ; = Remainder (probably) subbr w1, #0, w1 ; Make sure entire 48bit value did not go negative
btsc SR, #0 ; Iff Carry is clear we have to adjust the quotient and remainder, so do not return retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.
; Correct quotient, Calculate the remainder for the corrected quotient dec w0, w0 ; Adjust quotient down by one add w2, w6, w2 ; We had subtracted one too many divisors: addc w3, w7, w3 ; Add one divisor back into Remainder retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.
.else ; Case for no Remainder necessary ; Note: If the NoRemainder or Remaider = 0 case is used, the divisor in w3:w2 should be preserved across the call! bra C, 1f ; Go decrease Quotient iff add caused Carry
; Determine if dividend  (quotient * divisor) is nonnegative sub w4, w6, w6 ; w7:w6 = dividend  quotient * divisor subb w5, w7, w7 ; = Remainder (probably)
btsc SR, #0 ; Iff Carry is clear, we have to adjust the quotient, so do not return retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0.
1: dec w0, w0 ; Adjust quotient down by one retlw #0, w1 ; Return to Caller AFTER Setting Quotient.HO (W1) = 0. .endif .end
post edited by Spinlectrix  2020/09/18 16:47:33

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/18 15:40:01
(permalink)
Spinlectrix <Edit> Well, I am confuzled... cannot seem to post code right now.
Testing 123 ... ; ; Unsigned Long Integer Division of 32bit by 32bit ; ; uint32_t ; udivmod3232 (uint32_t dividend, uint32_t divisor, uint32_t *remainder) ; ; uint32_t ; udiv3232 (uint32_t dividend, uint32_t divisor) ; ; ; Dividend = Quotient * Divisor + Remainder ; ; Calling Parameters ; w1:w0 = Dividend ; w3:w2 = Divisor ; ; Return Values ; w1:w0 = Quotient = Dividend / Divisor ; w3:w2 = Remainder = Dividend % Divisor ; ; Clobbered Registers ; w7,w6, w5,w4

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/18 16:23:36
(permalink)
Spinlectrix In optimizing the signed version, I also managed to trim off some cycles of the unsigned version. What I think is my final code follows:
; Cycle Counts depend on the Execution Case, Build Parameters, and the Specific Microcontroller Model. ; The 33C versions use a repeat #(61) for Divides. Others use a repeat #(181). ; R/CALL instructions take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4. ; RETURN instructions take 3 cycles except for 24E, 33E, and MasterCores in 33C where they take 6. ; Branch Taken instrs take 2 cycles except for 24E, 33E, and MasterCores in 33C where they take 4. ; Counts Here EXCLUDE Call / Return Cycles (see 'Overhead' Line). Allows 'inline' Evaluation. ; CAUTION: Cycle Counts Measured with the Simulator and Stopwatch. BUGS SUSPECTED! 12 Extra Cycles?
I may be wrong, but I count more cycles. What maximum cycles did you get? For which PIC device?

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/18 17:55:29
(permalink)
1and0I may be wrong, but I count more cycles. What maximum cycles did you get? For which PIC device?
Well, you're not wrong that when calculating remainders, the longest path (Worst case) through this code is longer in my routine than in your previously posted code. However, on any core of processor, the NoRemainder path, both the average path and the worstcase path are faster. In addition, when using the NoRemainder variant, the Divisor value is preserved across the div calls in the new code, allowing tables or arrays of Dividends to be successively divided by a constant Divisor without reloading (and thus saving another 2 cyles, minimum). Given how rare ULongword Modulo is useful as a operator on a microcontroller, the normal ULongword divide is the case I felt was the one worth concentrating on. And even when calculating the Remainder, the _average_ cycle count is still a bit faster; Events triggering the worstcase path in the new code are rare, typically occurring in only about 1 in 2^16 divides. That said, it's not like the new routine is hugely faster: My current tester, doing 1M random divides with also calculating the Remainder yields an average of 57.61 cycles per divide with your code (Worst case 59 cycles) on a 'E'core processor. With my new code, (and again calculating the Remainder) yields 57.40 cyles/divide (Worstcase 63 cycles). Not a lot to write home about, but still an improvement. I'm in the process of doing 10M random divides to see if it changes things any, but I doubt it. I will update the results regardless, and you're welcome to see my random Dividend and Divisor generator, or to write your own; It's not inconceivable that that could bias these results! <Edit update: no change at all for 10M divides> The bigger part of the insight I gained in doing this refactoring is better reflected in managing a significant improvement (if a few cycles can be called significant) of the Remainder path of the signed divide. Soon to post on the signed 32 by 32 divivison thread!
post edited by Spinlectrix  2020/09/18 18:59:49

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/18 22:14:51
(permalink)
Spinlectrix Well, you're not wrong that when calculating remainders, the longest path (Worst case) through this code is longer in my routine than in your previously posted code.
The worst case cycle is what matter when trying to write the fastest routine. It is the weakest link. ;) However, on any core of processor, the NoRemainder path, both the average path and the worstcase path are faster. In addition, when using the NoRemainder variant, the Divisor value is preserved across the div calls in the new code, allowing tables or arrays of Dividends to be successively divided by a constant Divisor without reloading (and thus saving another 2 cyles, minimum). Given how rare ULongword Modulo is useful as a operator on a microcontroller, the normal ULongword divide is the case I felt was the one worth concentrating on.
That is why I suggested earlier in this thread that there should be two separate routines, so they can be individually optimized without tradeoff or compromise. And even when calculating the Remainder, the _average_ cycle count is still a bit faster; Events triggering the worstcase path in the new code are rare, typically occurring in only about 1 in 2^16 divides. That said, it's not like the new routine is hugely faster: My current tester, doing 1M random divides with also calculating the Remainder yields an average of 57.61 cycles per divide with your code (Worst case 59 cycles) on a 'E'core processor. With my new code, (and again calculating the Remainder) yields 57.40 cyles/divide (Worstcase 63 cycles). Not a lot to write home about, but still an improvement.
How did you get worse case 59 cycles for my code? I counted my code takes maximum of 53 cycles on an 'E'core calculating the remainder too. <edit> For 'E'core, swap the two legs in my code to shave one cycle; i.e. 52 cycles max. My cycle counts include the RETURN but not the CALL  only cycles consumed by the routine. I'm in the process of doing 10M random divides to see if it changes things any, but I doubt it. I will update the results regardless, and you're welcome to see my random Dividend and Divisor generator, or to write your own; It's not inconceivable that that could bias these results! <Edit update: no change at all for 10M divides>
Please post your random dividend and divisor generator.
post edited by 1and0  2020/09/18 23:23:19

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/19 09:58:41
(permalink)
However, you have managed to shame me into putting even more time into this: The following u3232 code has no worstcase paths that are longer than your original algorithm: I've now given up pasting easily visible source code; above some size, posting it just seems to stop working(?), so the udiv3232 routine is attached here as a .zip file. Original Code Original Code New Code New Code ' E' Core Performance Stats: Average Cycles WorstCase Average Cycles WorstCase No Remainder Returned 54.12 56 52.06 53 2.06 With Remainder Returned 55.61 57 54.00 57 1.61 <Edit: These numbers have now been updated, thanks to Harry (1and0) pointing out a uniform 2cycle flaw in how I was collecting timing numbers! I believe the numbers posted here are now correct. >
post edited by Spinlectrix  2020/09/20 10:03:10

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/19 10:14:06
(permalink)
Spinlectrix However, you have managed to shame me into putting even more time into this: The following u3232 code has no worstcase paths that are longer than your original algorithm:
That was not my intention. Any code that will beat the fastest code is welcome! We all benefit and learn something at the same time. When push comes to shove, we can unroll the divide loops and get rid of the REPEAT instructions to shave a cycle or two ... hehe. ;) For my signed version here the function call can be inlined to eliminate the Call and Return.

1and0
Access is Denied
 Total Posts : 11506
 Reward points : 0
 Joined: 2007/05/06 12:03:20
 Location: Harry's Gray Matter
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/19 16:09:47
(permalink)
1and0
Spinlectrix However, on any core of processor, the NoRemainder path, both the average path and the worstcase path are faster. In addition, when using the NoRemainder variant, the Divisor value is preserved across the div calls in the new code, allowing tables or arrays of Dividends to be successively divided by a constant Divisor without reloading (and thus saving another 2 cyles, minimum). Given how rare ULongword Modulo is useful as a operator on a microcontroller, the normal ULongword divide is the case I felt was the one worth concentrating on.
That is why I suggested earlier in this thread that there should be two separate routines, so they can be individually optimized without tradeoff or compromise.
I finally have some time to look closer at your algorithm for the noremainder version. It is brilliant and shaves at least 2 cycles; so let's use your code for the noremainder version and my code for the remainder version, then we'll have the best of both world! Keep going and shave more cycles. :)

Spinlectrix
Junior Member
 Total Posts : 72
 Reward points : 0
 Joined: 2016/03/20 14:48:37
 Location: 0
 Status: offline
Re: Final Drop for: Unsigned 32 by 32 divide routine in assembler
2020/09/19 16:44:19
(permalink)
Well, the most recent version saves four cycles, but on just one of the two divide methods, so as a result the average performance is only 2 cycles faster. Man, it gets harder and harder to add any performance! My work on eliminating Worstcase paths through the code took a few hours, shaved only 1 cycle for one path through the worstcase, and did not affect the average performance at all. 1and0Keep going and shave more cycles. :)
See thread post #117 above.
post edited by Spinlectrix  2020/09/19 17:13:22
