
pic32mk1024mcf100 floating point multiply speed

Author
MPaulHolmes
Junior Member
  • Total Posts : 76
  • Reward points : 0
  • Joined: 2009/10/31 10:52:40
  • Location: 0
  • Status: offline
2020/01/20 12:46:44 (permalink)
0

pic32mk1024mcf100 floating point multiply speed

I have my pic32mk1024mcf100 running at 120MHz, and I was doing some timing tests, since it's all totally new to me and I want to know how long each operation type takes in C.  I'm using the free XC32 compiler v2.30 with MPLAB X v5.25.  Doing this 100 times for floating point x, y, z:
z = x * y;
z = x * y;
z = x * y;
...
z = x * y;
 
seems to have a time of about 38 clock cycles for each line.  Addition is the same.  For integer multiply, it's about 28 clock cycles.  Is there something I'm doing wrong?  I remember the dspic30f had some __builtin_mulss(x,y) that was a single cycle.  Is there something similar available on this new microcontroller?  If not, is there a way to speed things up?  Does the paid version of the xc32 compiler do a better job?
#1

17 Replies

    andersm
    Super Member
    • Total Posts : 2741
    • Reward points : 0
    • Joined: 2012/10/07 14:57:44
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 13:21:56 (permalink)
    0
    Start by turning optimizations on.
    #2
    Mysil
    Super Member
    • Total Posts : 3642
    • Reward points : 0
    • Joined: 2012/07/01 04:19:50
    • Location: Norway
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 13:44:50 (permalink)
    0
    Hi,
    There are a lot of pitfalls about doing timing tests on small fragments of code.
    You may have to try many versions of your experiments,
    and study the assembler instructions generated by the compiler.
     
    Even with the free license for XC32 compiler, optimization levels 0 and 1 are both available.
    What differences have been observed between optimization levels?
     
    In a pipelined CPU with a load-store architecture, doing repeated operations on the same variables
    may put the processor at a disadvantage, since the compiler may have too few independent variables to keep the pipeline busy every clock cycle.
     
    With Pro-mode license, loop unrolling and other optimizations may be available with higher optimization levels.
     
    Then,  what wait-states value is used, and what prefetch cache settings are used? 
     
        Mysil
    #3
    jg_ee
    Super Member
    • Total Posts : 187
    • Reward points : 0
    • Joined: 2015/04/30 10:54:52
    • Location: Colorado
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 14:15:29 (permalink)
    0
    As another reference point for you:
     
    I set up the same test for my board I have on my desk with the same chip using 100 of those multiply lines as indicated above.
     
    3 flash wait states.
     
    With the instruction pre-fetch enabled I get about 12 cycles per multiply operation.
     
    With also the predictive cache enabled, I get about 8 cycles per multiply operation. (I leave this off because of errata for Silicon REV A1)
     
    The listing.disasm file shows for each operation:
     

    9D00FA20 8FC30014 LW V1, 20(FP)
    9D00FA24 8FC20018 LW V0, 24(FP)
    9D00FA28 00620018 MULT 0, V1, V0
    9D00FA2C 00001012 MFLO V0
    9D00FA30 AFC2001C SW V0, 28(FP)

    #4
    jg_ee
    Super Member
    • Total Posts : 187
    • Reward points : 0
    • Joined: 2015/04/30 10:54:52
    • Location: Colorado
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 14:24:46 (permalink)
    0
    I just noticed you wanted floating point.  For:
     
    3 Flash wait states.
    Instruction pre-fetch enabled.
    No predictive cache:
     
    About 19 cycles per multiply line. 
     
    With the following in the listing.disasm:
    9D00FA0C 8FC30014 LW V1, 20(FP)
    9D00FA10 8FC20018 LW V0, 24(FP)
    9D00FA14 44830000 MTC1 V1, F0
    9D00FA18 44820800 MTC1 V0, F1
    9D00FA1C 46010002 MUL.S F0, F0, F1
    9D00FA20 44020000 MFC1 V0, F0
    9D00FA24 AFC2001C SW V0, 28(FP)

     
    #5
    jg_ee
    Super Member
    • Total Posts : 187
    • Reward points : 0
    • Joined: 2015/04/30 10:54:52
    • Location: Colorado
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 14:25:40 (permalink)
    0
    The two tests above were with 0 optimization set in the compiler.
    #6
    simong123
    Lab Member No. 003
    • Total Posts : 1359
    • Reward points : 0
    • Joined: 2012/02/07 18:21:03
    • Location: Future Gadget Lab (UK Branch)
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 14:56:10 (permalink)
    0
    For some reason the compiler uses the following pattern to load data into the FPU at optimisation 0:
    mem->Integer GPR->FPU GPR->MUL->FPU GPR->Integer GPR->mem
    LW rt,offset[base]
    LW rt,offset[base]
    MTC1 rt,fs
    MTC1 rt,fs
    MUL.S fd,fs,ft
    MFC1 rt,fs
    SW rt,offset[base]

    At Optimisation >0 the loads are direct to the FPU
    mem->FPU GPR->MUL->FPU GPR->mem
    LWC1 ft,offset[base]
    LWC1 ft,offset[base]
    MUL.S fd,fs,ft
    SWC1 fs,offset[base]

     
    Also most of the loads should be eliminated.
    #7
    andersm
    Super Member
    • Total Posts : 2741
    • Reward points : 0
    • Joined: 2012/10/07 14:57:44
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 14:56:49 (permalink)
    0
    With optimizations enabled, you should get four instructions per operation (also the repeats should be eliminated).
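
A minimal sketch of how the repeated lines could be kept alive once optimization is on, assuming the variables are declared `volatile`. The names `vx`, `vy`, `vz` and `mul_block` are made up for illustration and are not from the thread. Without `volatile`, an optimizing compiler is free to collapse 100 identical `z = x * y;` lines into one, and the timing would then say nothing about the multiply.

```c
/* Hypothetical benchmark variables; the names are illustrative only. */
static volatile float vx, vy, vz;

/* Performs four multiplies that the optimizer may not merge or drop:
   every read of vx/vy and every write of vz is a volatile access. */
float mul_block(float x, float y)
{
    vx = x;
    vy = y;
    vz = vx * vy;
    vz = vx * vy;
    vz = vx * vy;
    vz = vx * vy;
    return vz;
}
```

Note that `volatile` also forces the two loads and the store on every line, so this times load + multiply + store (the four-instruction pattern), not the multiply on its own. The disassembly listing is still the way to confirm what was actually generated.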
    #8
    MPaulHolmes
    Junior Member
    • Total Posts : 76
    • Reward points : 0
    • Joined: 2009/10/31 10:52:40
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 17:51:24 (permalink)
    0
    I've never used MPLab X before.  It's always been MPLab 8.  With optimizations set to 1, this is what one floating point multiply (x = y * z) looks like now in the disassembly:
    9D000A78 C7A1003C LWC1 F1, 60(SP)
    9D000A7C C7A00040 LWC1 F0, 64(SP)
    9D000A80 46000802 MUL.S F0, F1, F0
    9D000A84 E7A00010 SWC1 F0, 16(SP)
     
    I may be misunderstanding what a "cycle" is.  Those 4 instructions above take about 25 "cycles", where I am defining a cycle as one tick of TMR3, where timer 3 is initialized as follows:
     
    T3CONbits.T32 = 1; // 32 bit mode enabled.
    T3CONbits.TCKPS = 0b0; // prescaler = 1. So, the timer runs at 120,000,000 Hz I think. 
    T3CONbits.ON = 1;
     
    As for 
    3 Flash wait states.
    Instruction pre-fetch enabled.
    No predictive cache
     
    How do I set instruction pre-fetch to "enabled"?  What does 3 flash wait states mean?  How do I set that?  Is 3 what it should be?
     
    Also, I'm running in debug mode, using the pickit 3 and the ISP with pgc/pgd.
    post edited by MPaulHolmes - 2020/01/20 18:11:12
    #9
    NKurzman
    A Guy on the Net
    • Total Posts : 18266
    • Reward points : 0
    • Joined: 2008/01/16 19:33:48
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 18:34:28 (permalink)
    0
    Are you configuring the clocks, wait states, and cache yourself?  Or are you letting Harmony do it?
    The Flash is slower than the bus, so it needs added wait states.  The cache and prefetch mitigate that.
    Note: interrupts must be disabled for valid results.
     
    Timer 3 runs off Peripheral Bus Clock 3, not the main oscillator.  How is it configured?
    If you want main Bus clocks use _CP0_GET_COUNT()
     
    #10
    MPaulHolmes
    Junior Member
    • Total Posts : 76
    • Reward points : 0
    • Joined: 2009/10/31 10:52:40
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 20:54:39 (permalink)
    0
    I didn't configure the clocks, wait states, or cache.  I don't know how to do that, and I don't know what to set them to.  I'm not using Harmony.  I just configured DEVCFGx (x = 0,1,2,3), DEVCP, and SEQ by reading the datasheet and then just created main(). 
     
    Hmm, TMR3 and _CP0_GET_COUNT() are both only running at 60MHz and not 120MHz like I thought.  I better go figure that out.
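
A rough sketch of turning raw counter readings into CPU cycles. It assumes the CP0 Count register advances once every two SYSCLK cycles, which would match the 60MHz observed with a 120MHz core. The function names and the `SYSCLK_CYCLES_PER_COUNT` macro are hypothetical. For TMR3 the ratio depends on how the peripheral bus clock divider and the timer prescaler are configured, so the constant would need to be changed to suit.

```c
#include <stdint.h>

/* Assumption: one counter tick equals two SYSCLK cycles. */
#define SYSCLK_CYCLES_PER_COUNT 2u

/* start and end are raw counter readings, e.g. taken with _CP0_GET_COUNT()
   before and after the code under test. Unsigned subtraction gives the
   right difference even if the counter wrapped once in between. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (end - start) * SYSCLK_CYCLES_PER_COUNT;
}

/* Average SYSCLK cycles per operation, in tenths of a cycle, so that no
   floating point is needed in the measurement code itself. */
uint32_t cycles_per_op_x10(uint32_t start, uint32_t end, uint32_t n_ops)
{
    if (n_ops == 0u) {
        return 0u;
    }
    return (elapsed_cycles(start, end) * 10u) / n_ops;
}
```

For example, 400 counter ticks across 100 multiply lines would come out as 80, i.e. 8.0 SYSCLK cycles per line under the stated assumption.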
     
    post edited by MPaulHolmes - 2020/01/20 21:18:31
    #11
    NorthGuy
    Super Member
    • Total Posts : 5917
    • Reward points : 0
    • Joined: 2014/02/23 14:23:23
    • Location: Northern Canada
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 21:19:06 (permalink)
    0
    Multiplication throughput is much better than this, but you need to remove the latency. Instead of reading back the result right away, you can load other registers. This way the CPU will multiply and load next registers simultaneously. You can read MIPS docs which list all the latencies and dependencies.
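
One way to picture this in C is to give the compiler independent work to place between a multiply and the use of its result. The sketch below is only an illustration (the `dot2` function is made up, not from the thread): it keeps two separate accumulators so that neighbouring multiplies do not depend on each other. The compiler will not reassociate floating-point sums on its own without relaxed math options, so the split is written out by hand.

```c
#include <stddef.h>

/* Dot product with two independent accumulators. The products for
   elements i and i+1 do not depend on each other, so the loads for one
   can overlap with the multiply latency of the other. */
float dot2(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < n) {            /* odd element left over */
        acc0 += a[i] * b[i];
    }
    return acc0 + acc1;
}
```

Whether the generated code really interleaves the loads and multiplies depends on the optimization level, so the disassembly is still worth checking.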
    #12
    NKurzman
    A Guy on the Net
    • Total Posts : 18266
    • Reward points : 0
    • Joined: 2008/01/16 19:33:48
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/20 21:55:21 (permalink)
    0
    Not configuring the wait states and cache can cause performance or stability issues.
    #13
    MPaulHolmes
    Junior Member
    • Total Posts : 76
    • Reward points : 0
    • Joined: 2009/10/31 10:52:40
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/21 07:18:22 (permalink)
    5 (1)
    Is there a reasonable way to configure the wait states and cache without using Harmony?   And also some guidelines for "use these wait states, and use this cache setting" for stable operation?  I just want to get to programming.
     
    edit:  I downloaded harmony and I think I can figure out some things from there.
    post edited by MPaulHolmes - 2020/01/21 09:22:38
    #14
    NKurzman
    A Guy on the Net
    • Total Posts : 18266
    • Reward points : 0
    • Joined: 2008/01/16 19:33:48
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/21 10:00:07 (permalink)
    0
    An easy thing to do if you do not want to mess with Harmony is to make an empty Harmony project.
    Use the Clock Configurator to set up the clock, and then do your thing.
     
    The datasheet has the information on determining the proper number of Flash wait states, and on how to turn on the cache.  Note that for your experiments the cache could affect the results, especially if the code is in a loop.
    If your goal is to figure out how many opcodes the C compiler generates for an expression, then you may want to look at the list file.  On MIPS, code timing is not as consistent as on the smaller PICs.
    For example, the DMA shares the internal bus and can steal cycles from the main loop. 
    #15
    simong123
    Lab Member No. 003
    • Total Posts : 1359
    • Reward points : 0
    • Joined: 2012/02/07 18:21:03
    • Location: Future Gadget Lab (UK Branch)
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/21 10:00:48 (permalink)
    5 (1)
    MPaulHolmes
    Is there a reasonable way to configure the wait states and cache without using Harmony?   And also some guidelines for "use these wait states, and use this cache setting" for stable operation?  I just want to get to programming.

    Read the datasheet?
    Search for CHECON. It even gives valid values for different clock speeds.
    #16
    MPaulHolmes
    Junior Member
    • Total Posts : 76
    • Reward points : 0
    • Joined: 2009/10/31 10:52:40
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/21 12:29:57 (permalink)
    0
    Here's what harmony made:
    /* Configure CP0.K0 for optimal performance (cached instruction pre-fetch) */
    __builtin_mtc0(16, 0,(__builtin_mfc0(16, 0) | 0x3));
    /* Configure Wait States and Prefetch */
    CHECONbits.PFMWS = 3;
    CHECONbits.PREFEN = 1;
     
    With optimizations set to 1, I get 8 cycles per multiply.  Then, I switched the PREFEN to zero since it's in the errata sheet to make sure it's zero.  Then I still get 8 cycles per multiply. 
     
     
    Then, with optimizations set to 0, I get 10 cycles per multiply.  
     
    But the PREFEN shouldn't be 1 since that's on the errata sheet (thank you jg_ee!!), so I set it to zero, and I STILL get 8 cycles per multiply.
    Is the "__builtin_mtc0" doing anything bad too?  Does anyone know where I can find info on all of the "__builtin_" functions and what they do?  I found a partial list here:
    http://ww1.microchip.com/downloads/en/DeviceDoc/MPLAB%20XC32%20Compiler%20UG%20PIC32M%20DS-50002799A.pdf
    I did a google search for "__builtin_mtc0" and there's 3 search results, and none of them are from any sort of manual or datasheet.  Where can I find out what the heck that thing is doing? 
     
    Could I get a link to a more complete source of documentation (I don't mind paying for it) for the xc32 compiler and the architecture of this microcontroller?  
     
    #17
    andersm
    Super Member
    • Total Posts : 2741
    • Reward points : 0
    • Joined: 2012/10/07 14:57:44
    • Location: 0
    • Status: offline
    Re: pic32mk1024mcf100 floating point multiply speed 2020/01/21 13:09:14 (permalink)
    0
    The documentation for XC32 is included with the install; MIPS architecture documentation can be found here. The mfc0 and mtc0 instructions are used to read and write the core's coprocessor 0 registers. Bits 2-0 of coprocessor 0 register 16, select 0 (aka CP0.Config) control the cacheability and coherence attributes of the Kseg0 memory space. Setting the value to 3 enables caching.
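
To make the bit field concrete, here is a small sketch of the read-modify-write as plain arithmetic. The helper `config_with_k0` and the macro names are hypothetical. The Harmony line ORs in 0x3, which leaves bit 2 of the field untouched; the version below clears all three bits first, so the result is exactly the requested value whatever was there before.

```c
#include <stdint.h>

#define CONFIG_K0_MASK 0x7u   /* bits 2-0 of CP0 Config */
#define K0_UNCACHED    2u     /* Kseg0 accesses bypass the cache */
#define K0_CACHEABLE   3u     /* Kseg0 accesses are cached */

/* Returns the Config value with the K0 field replaced by k0 and all
   other bits preserved. */
uint32_t config_with_k0(uint32_t config, uint32_t k0)
{
    return (config & ~CONFIG_K0_MASK) | (k0 & CONFIG_K0_MASK);
}
```

On the target this would presumably be used as `__builtin_mtc0(16, 0, config_with_k0(__builtin_mfc0(16, 0), K0_CACHEABLE));`, with the same builtins as the Harmony snippet.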
    #18