
horrible performance of XC8-generated code

Showing page 1 of 2
Author
uln
New Member
  • Total Posts : 18
  • Reward points : 0
  • Joined: 2019/03/20 10:02:31
  • Location: 0
  • Status: offline
2020/01/08 02:53:51 (permalink)
0

horrible performance of XC8-generated code

I'm trying to rewrite an old firmware using MPLAB X v5.15 / free XC8 v2.10, but I'm alarmed by
the horrible performance of XC8-generated code.
 
A simple right shift by 5 of an int16_t takes 10 microseconds!!!
 
#define PROCESS_ADVALUE(SEL,OFF,RES) {  \
  static int16_t acc = 0; \
  typedef union sAdVal { \
    uint16_t tmp16; \
    uint8_t t8[2]; \
  } tAdVal; \
  tAdVal adval; \
  uint8_t sel; \
  sel = SEL; \
  adval.t8[0] = ADRESL; \
  adval.t8[1] = ADRESH; \
  acc += (adval.tmp16 + OFF) - RES[sel]; \
  sel ^= 1; \
  RES[sel] = acc /* >> 5 */; \
  SEL = sel; \
}

void __interrupt(high_priority) IRQ5kHz(void) {
  static uint8_t slot = 3;

  if (TMR2IE && TMR2IF) {
    LATBbits.LATB0 = 1;
    ADCON0bits.CHS = slot; // select ad channel converted next

    {
      // if/elseif/else-chain faster than switch/case!
      if (slot < 3) {
        if (slot < 1) { // slot 0, last slot 4 => Ugen
          PROCESS_ADVALUE(adcDat.selUgen, 0, adcDat.Ugen);
        } else if (slot == 1) { // slot 1, last slot 0 => Uzwk
          PROCESS_ADVALUE(adcDat.selUzwk, 0, adcDat.Uzwk);
        } else { // slot 2, last slot 1 => Ilem
          PROCESS_ADVALUE(adcDat.selIlem, -512, adcDat.Ilem);
        }
      } else if (slot == 3) { // slot 3, last slot 2 => Tntc
          PROCESS_ADVALUE(adcDat.selTntc, 0, adcDat.Tntc);
      } else { // slot 4, last slot 3 => Ilup
          PROCESS_ADVALUE(adcDat.selIlup, -512, adcDat.Ilup);
      }
    }

    slot = (slot == 4) ? 0 : slot + 1;
    ADCON0bits.GO = 1; // start conversion
    LATBbits.LATB0 = 0;
    TMR2IF = 0; // processed
  }
}

 
Some measurements with an oscilloscope, with and without the shift by 5:
 
Attachments are not available: Download requirements not met
 
Using a switch/case instead of the if/else-if/else chain costs another 10 microseconds!!!
 
It seems the factor of 4 I gained by using the 4xPLL will be completely eaten up by using free XC8 v2.10
instead of the CCS compiler (PCWH 3.249)!?
 
 
 

#1

23 Replies

    vloki
    Jo, alla!
    • Total Posts : 6815
    • Reward points : 0
    • Joined: 2007/10/15 00:51:49
    • Location: Germany
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 03:17:37 (permalink)
    +2 (2)
    10us? May be very fast - (we do not even know the clock frequency ;-))
    Better you view the disassembly or post it...

    Uffbasse !
    #2
    JPortici
    Super Member
    • Total Posts : 894
    • Reward points : 0
    • Joined: 2012/11/17 06:27:45
    • Location: Grappaland
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 03:29:46 (permalink)
    +1 (1)
    (again. sigh.)
     
    meh, with -O2 (included in the free version) it's about as good as it gets without resorting to assembly.
    In any case, with any architecture and any compiler, operations on types bigger than the natural data size are going to be slow.
     
    Regarding if/else vs switch, there is always a number of cases in which the if/else is faster than the switch.
    #3
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 03:42:46 (permalink)
    0
    vloki
    10us? May be very fast - (we do not even know the clock frequency ;-))
    Better you view the disassembly or post it...



     
    sorry, I forgot  -  instruction cycle time 62,5ns (4MHz)  =>  10us equals 40 instructions
     
    117: PROCESS_ADVALUE(adcDat.selUgen, 0, adcDat.Ugen); 
    0044 C019 MOVFF 0x19, sel
    0046 F03A NOP
    0048 CFC3 MOVFF ADRES, adval
    004A F030 NOP
    004C CFC4 MOVFF ADRESH, 0x31
    004E F031 NOP
    0050 503A MOVF sel, W, ACCESS
    0052 0D02 MULLW 0x2
    0054 50F3 MOVF PROD, W, ACCESS
    0056 0F11 ADDLW 0x11
    0058 6ED9 MOVWF FSR2, ACCESS
    005A 6ADA CLRF FSR2H, ACCESS
    005C CFDE MOVFF POSTINC2, __pcstackCOMRAM
    005E F024 NOP
    0060 CFDD MOVFF POSTDEC2, 0x25
    0062 F025 NOP
    0064 C030 MOVFF adval, 0x26
    0066 F026 NOP
    0068 C031 MOVFF 0x31, 0x27
    006A F027 NOP
    006C 5024 MOVF __pcstackCOMRAM, W, ACCESS
    006E 5E26 SUBWF 0x26, F, ACCESS
    0070 5025 MOVF 0x25, W, ACCESS
    0072 5A27 SUBWFB 0x27, F, ACCESS
    0074 5026 MOVF 0x26, W, ACCESS
    0076 2622 ADDWF acc, F, ACCESS
    0078 5027 MOVF 0x27, W, ACCESS
    007A 2223 ADDWFC 0x23, F, ACCESS
    007C 0E01 MOVLW 0x1
    007E 1A3A XORWF sel, F, ACCESS
    0080 503A MOVF sel, W, ACCESS
    0082 0D02 MULLW 0x2
    0084 50F3 MOVF PROD, W, ACCESS
    0086 0F11 ADDLW 0x11
    0088 6ED9 MOVWF FSR2, ACCESS
    008A 6ADA CLRF FSR2H, ACCESS
    008C C022 MOVFF acc, POSTINC2
    008E FFDE NOP
    0090 C023 MOVFF 0x23, POSTDEC2
    0092 FFDD NOP
    0094 C03A MOVFF sel, 0x19
    0096 F019 NOP

     
    vs
     
    117: PROCESS_ADVALUE(adcDat.selUgen, 0, adcDat.Ugen); 
    0044 C019 MOVFF 0x19, sel
    0046 F03A NOP
    0048 CFC3 MOVFF ADRES, adval
    004A F030 NOP
    004C CFC4 MOVFF ADRESH, 0x31
    004E F031 NOP
    0050 503A MOVF sel, W, ACCESS
    0052 0D02 MULLW 0x2
    0054 50F3 MOVF PROD, W, ACCESS
    0056 0F11 ADDLW 0x11
    0058 6ED9 MOVWF FSR2, ACCESS
    005A 6ADA CLRF FSR2H, ACCESS
    005C CFDE MOVFF POSTINC2, __pcstackCOMRAM
    005E F024 NOP
    0060 CFDD MOVFF POSTDEC2, 0x25
    0062 F025 NOP
    0064 C030 MOVFF adval, 0x26
    0066 F026 NOP
    0068 C031 MOVFF 0x31, 0x27
    006A F027 NOP
    006C 5024 MOVF __pcstackCOMRAM, W, ACCESS
    006E 5E26 SUBWF 0x26, F, ACCESS
    0070 5025 MOVF 0x25, W, ACCESS
    0072 5A27 SUBWFB 0x27, F, ACCESS
    0074 5026 MOVF 0x26, W, ACCESS
    0076 2622 ADDWF acc, F, ACCESS
    0078 5027 MOVF 0x27, W, ACCESS
    007A 2223 ADDWFC 0x23, F, ACCESS
    007C 0E01 MOVLW 0x1
    007E 1A3A XORWF sel, F, ACCESS
    0080 C022 MOVFF acc, __pcstackCOMRAM
    0082 F024 NOP
    0084 C023 MOVFF 0x23, 0x25
    0086 F025 NOP
    0088 0E05 MOVLW 0x5
    008A 6E26 MOVWF 0x26, ACCESS
    008C FFFF NOP
    008E 3425 RLCF 0x25, W, ACCESS
    0090 3225 RRCF 0x25, F, ACCESS
    0092 3224 RRCF __pcstackCOMRAM, F, ACCESS
    0094 2E26 DECFSZ 0x26, F, ACCESS
    0096 D7FA BRA 0x8C
    0098 503A MOVF sel, W, ACCESS
    009A 0D02 MULLW 0x2
    009C 50F3 MOVF PROD, W, ACCESS
    009E 0F11 ADDLW 0x11
    00A0 6ED9 MOVWF FSR2, ACCESS
    00A2 6ADA CLRF FSR2H, ACCESS
    00A4 C024 MOVFF __pcstackCOMRAM, POSTINC2
    00A6 FFDE NOP
    00A8 C025 MOVFF 0x25, POSTDEC2
    00AA FFDD NOP
    00AC C03A MOVFF sel, 0x19
    00AE F019 NOP

     
    It seems, the compiler does, what I supposed => doing a "shift 1" five times in a loop!
    post edited by uln - 2020/01/08 04:14:18
    #4
    ric
    Super Member
    • Total Posts : 25592
    • Reward points : 0
    • Joined: 2003/11/07 12:41:26
    • Location: Australia, Melbourne
    • Status: online
    Re: horrible performance of XC8-generated code 2020/01/08 05:38:13 (permalink)
    +1 (1)
    And did you try setting optimisation O2, which IS available in the free mode?
     

    I also post at: PicForum
    Links to useful PIC information: http://picforum.ric323.co...opic.php?f=59&t=15
    NEW USERS: Posting images, links and code - workaround for restrictions.
    To get a useful answer, always state which PIC you are using!
    #5
    1and0
    Access is Denied
    • Total Posts : 10346
    • Reward points : 0
    • Joined: 2007/05/06 12:03:20
    • Location: Harry's Gray Matter
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 07:26:37 (permalink)
    +1 (1)
    uln
    It seems, the factor 4 I won by using the 4xPLL will be completely eaten by using free XC8 v2.10
    instead of CCS Compiler (PCWH 3.249)!?

    uln
    It seems, the compiler does, what I supposed => doing a "shift 1" five times in a loop!

    Yes, it does arithmetic right shift of one bit on a 16-bit signed integer five times in a loop. IMO, that disassembly code is pretty good for Free mode.
     
    You claim the CCS Compiler does better, show and compare the disassembly code generated from it.
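To make the loop in the XC8 listing easier to follow, here is a rough byte-level model of it in portable C. This is only a sketch: the function name `asr5_bytewise` and the `hi`/`lo` variables are made up for illustration, and the comments map each step to the instruction it is meant to imitate.

```c
#include <stdint.h>

/* Byte-level model of the XC8 loop: five 1-bit arithmetic right shifts
   of a 16-bit value held as two 8-bit registers (hi:lo). */
uint16_t asr5_bytewise(uint16_t value)
{
    uint8_t lo = (uint8_t)(value & 0xFFu);
    uint8_t hi = (uint8_t)(value >> 8);

    for (uint8_t count = 5; count != 0; --count) {
        uint8_t carry = (uint8_t)(hi >> 7);        /* RLCF hi,W: sign bit -> carry    */
        uint8_t next  = (uint8_t)(hi & 1u);        /* bit that RRCF hi,F shifts out   */
        hi = (uint8_t)((carry << 7) | (hi >> 1));  /* RRCF hi,F: sign re-enters bit 7 */
        lo = (uint8_t)((next << 7) | (lo >> 1));   /* RRCF lo,F: carry enters bit 7   */
    }
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```

For example, 0x8000 (-32768) comes out as 0xFC00 (-1024): the sign bit is copied back in on every pass, which is what makes the shift arithmetic rather than logical.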
    #6
    LdB_ECM
    Super Member
    • Total Posts : 236
    • Reward points : 0
    • Joined: 2019/04/16 22:01:25
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 07:35:57 (permalink)
    +3 (3)
    It's generated horrible code because you have written the worst C code I have seen in a while.
    Putting a typedef in a macro and call the macro in each if/then statement section is just the bomb.
     
    I congratulate you because I actually had to scratch my head as to why the compiler didn't complain the typedef already existed and I had to write this to check you were actually allowed local scope on typedefs.
    Yeah I could have looked it up but it was just so funny I had to try.
    unsigned char test = 0;
    if (test)
    {
        typedef union sAdVal {
            uint16_t tmp16;
            uint8_t t8[2];
        } tAdVal;
    }
    else
    {
        typedef union sAdVal {
            uint16_t tmp16;
            uint8_t t8[2];
        } tAdVal;
    }

     
    So you taught me something that I am sure I will never use and never wanted to know but did give me humor.
     
    There is a hint in that code as to why the optimizer is having a heart attack and you got what you deserved in my opinion. I am with the optimizer it's valid code but you are on your own :-)
    post edited by LdB_ECM - 2020/01/08 07:38:28
    #7
    1and0
    Access is Denied
    • Total Posts : 10346
    • Reward points : 0
    • Joined: 2007/05/06 12:03:20
    • Location: Harry's Gray Matter
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 07:42:43 (permalink)
    +1 (1)
    uln
     
    sorry, I forgot  -  instruction cycle time 62,5ns (4MHz)  =>  10us equals 40 instructions

    Your math is wrong here.  
     
    40 instructions that take 10 us is equivalent to an instruction cycle time Tcy = 250 ns; i.e. Tosc = Tcy/4 = 62.5 ns (Fosc = 16 MHz).
     
    By the way, that 
    RES[sel] = acc >> 5;

    takes 6 + [5*7 - 1] + 10 = 50 Tcy or 12.5 us.
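The arithmetic above can be checked with a few lines of C. This is a sketch only: the function names are invented, and the split into setup, loop and store is my reading of the XC8 listing in Post #4, assuming MOVFF and a taken BRA cost 2 Tcy each and everything else costs 1 Tcy.

```c
#include <stdint.h>

/* PIC18: one instruction cycle (Tcy) takes four oscillator periods. */
uint32_t tcy_ns(uint32_t fosc_hz)
{
    return (uint32_t)(4000000000ull / fosc_hz);
}

/* Cycle count for the `RES[sel] = acc >> 5;` statement. */
uint32_t shift_store_cycles(void)
{
    const uint32_t setup = 6;          /* 2x MOVFF (2 Tcy each) + MOVLW + MOVWF    */
    const uint32_t loop  = 5 * 7 - 1;  /* 7 Tcy per pass; last pass skips the BRA  */
    const uint32_t store = 10;         /* index calculation (6) + 2x MOVFF (4)     */
    return setup + loop + store;
}
```

With Fosc = 16 MHz, `tcy_ns(16000000)` gives 250 ns and `shift_store_cycles()` gives 50, so the statement takes 50 * 250 ns = 12.5 us. The version without the shift still needs the 10-cycle indexed store, which leaves 40 cycles (10 us) for the shift itself.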
    #8
    Gort2015
    Klaatu Barada Nikto
    • Total Posts : 3663
    • Reward points : 0
    • Joined: 2015/04/30 10:49:57
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/08 08:58:03 (permalink)
    0
    Didn't spot it at first, I had to look 3 times.
    I thought it was a separate define.
     
    OP should expand the macro to see how it will look.
    right click in source code -> navigate -> View Macro Expansion.
    Scroll through code in macro window.

    MPLab X playing up, bug in your code? Nevermind, Star Trek:Discovery will be with us soon.
    https://www.youtube.com/watch?v=Iu1qa8N2ID0
    + ST:Continues, "What Ships are Made for", Q's back.
    #9
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 02:39:59 (permalink)
    0
    ric
    And did you try setting optimisation O2, which IS available in the free mode?
     


    Yes, I've used optimisation level 2  -  and it's a PIC18F252
    post edited by uln - 2020/01/22 03:01:38
    #10
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 02:44:16 (permalink)
    0
    1and0
     
    You claim the CCS Compiler does better, show and compare the disassembly code generated from it.




     
    1A5C:  RRCF   x1F,W
    1A5E:  MOVWF  03
    1A60:  RRCF   x1E,W
    1A62:  MOVWF  02
    1A64:  RRCF   03,F
    1A66:  RRCF   02,F
    1A68:  RRCF   03,F
    1A6A:  RRCF   02,F
    1A6C:  RRCF   03,F
    1A6E:  RRCF   02,F
    1A70:  RRCF   03,F
    1A72:  RRCF   02,F
    1A74:  MOVLW  07
    1A76:  ANDWF  03,F
    1A78:  MOVFF  02,FEF
    1A7C:  MOVFF  03,FEC
    1A80:  MOVFF  222,11B
    #11
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 02:49:28 (permalink)
    0
    1and0
    uln
     
    sorry, I forgot  -  instruction cycle time 62,5ns (4MHz)  =>  10us equals 40 instructions

    Your math is wrong here.  
     
    40 instructions that take 10 us is equivalent to an instruction cycle time Tcy = 250 ns; i.e. Tosc = Tcy/4 = 62.5 ns (Fosc = 16 MHz)..


    Yes - I incorrectly combined the Fosc period with the instruction frequency - an instruction cycle time of 250 ns (4 MHz) is correct
    #12
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 04:04:04 (permalink)
    0
    LdB_ECM
     
    There is a hint in that code as to why the optimizer is having a heart attack and you got what you deserved in my opinion. I am with the optimizer it's valid code but you are on your own :-)




    I don't see any problem with hiding the typedef within the macro, and I don't see any reason why the typedef should have any impact on the generated code - in fact the compiler generates exactly the same code without that typedef!?
    #13
    pcbbc
    Super Member
    • Total Posts : 1507
    • Reward points : 0
    • Joined: 2014/03/27 07:04:41
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 06:27:46 (permalink)
    +1 (1)
    So it loop unrolls.  Big deal.
    All depends what you are optimising for.  Space or speed.
    And there will always be people unhappy either way.
    Give XC8 a hand and code 5 consecutive 1 bit shifts if you want a loop unrolled version.
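A minimal sketch of what "five consecutive 1-bit shifts" could look like in C. The function name is hypothetical, and whether a particular XC8 version really turns this into straight-line code is an assumption that would need checking in the disassembly.

```c
#include <stdint.h>

/* Five single-bit shifts written out by hand instead of `v >> 5`.
   Using an unsigned type keeps every shift well defined. */
uint16_t shr5_unrolled(uint16_t v)
{
    v >>= 1;
    v >>= 1;
    v >>= 1;
    v >>= 1;
    v >>= 1;
    return v;
}
```

The result is the same as `v >> 5`; only the shape of the source changes, in the hope that the code generator follows it.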
    #14
    1and0
    Access is Denied
    • Total Posts : 10346
    • Reward points : 0
    • Joined: 2007/05/06 12:03:20
    • Location: Harry's Gray Matter
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 08:26:06 (permalink)
    +2 (2)
    Comparing the CCS disassembly from your Post #11:
    1A5C:  RRCF   x1F,W                                rrcf    accH,w
    1A5E:  MOVWF  03                                   movwf   tempH
    1A60:  RRCF   x1E,W                                rrcf    accL,w
    1A62:  MOVWF  02                                   movwf   tempL
    1A64:  RRCF   03,F                                 rrcf    tempH
    1A66:  RRCF   02,F                                 rrcf    tempL
    1A68:  RRCF   03,F                                 rrcf    tempH
    1A6A:  RRCF   02,F                                 rrcf    tempL
    1A6C:  RRCF   03,F                                 rrcf    tempH
    1A6E:  RRCF   02,F                                 rrcf    tempL
    1A70:  RRCF   03,F                                 rrcf    tempH
    1A72:  RRCF   02,F                                 rrcf    tempL
    1A74:  MOVLW  07                                   movlw   0x07
    1A76:  ANDWF  03,F                                 andwf   tempH
    1A78:  MOVFF  02,FEF                               movff   tempL,INDF0
    1A7C:  MOVFF  03,FEC                               movff   tempH,PREINC0

    with this XC8 from your Post #4:
    0088 0E05 MOVLW 0x5                                movlw   5
    008A 6E26 MOVWF 0x26, ACCESS                       movwf   count
    008C FFFF NOP                              loop:   nop
    008E 3425 RLCF 0x25, W, ACCESS                     rlcf    tempH,w
    0090 3225 RRCF 0x25, F, ACCESS                     rrcf    tempH
    0092 3224 RRCF __pcstackCOMRAM, F, ACCESS          rrcf    tempL
    0094 2E26 DECFSZ 0x26, F, ACCESS                   decfsz  count
    0096 D7FA BRA 0x8C                                 bra     loop
    ...
    00A4 C024 MOVFF __pcstackCOMRAM, POSTINC2          movff   tempL,POSTINC2
    00A8 C025 MOVFF 0x25, POSTDEC2                     movff   tempH,POSTDEC2

    The CCS compiler performs logical right shifts of signed integer in an unrolled loop. The XC8 compiler performs arithmetic right shifts of signed integer in a loop, and uses an unnecessary NOP.
     
    That being said, right shift on a signed number has implementation-defined behavior. So either use an unsigned integer or cast the signed integer to unsigned for the shift, and XC8 should perform a logical shift, reducing the sign-extension snippet.
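The cast suggested above might look like the following. This is a sketch with a made-up function name, not code from the original firmware.

```c
#include <stdint.h>

/* Logical right shift by 5: casting to uint16_t first makes the result
   well defined by the C standard (zeros are shifted in from the top). */
int16_t shr5_logical(int16_t acc)
{
    return (int16_t)((uint16_t)acc >> 5);
}
```

Note that this changes the result for negative inputs: -32 (0xFFE0) becomes 2047 (0x07FF) instead of -1, which matches the masking with 0x07 in the CCS listing.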
     
    Anyway, if you are so concerned with speed the assembly equivalent is:
            swapf   POSTINC2
            swapf   INDF2
            movf    POSTDEC2,w
            xorwf   INDF2,w
            andlw   0xF0
            xorwf   POSTINC2
            rrcf    POSTDEC2
            rrcf    POSTINC2
            movlw   0x07
            andwf   INDF2

    which takes only 10 instruction cycles.
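For anyone who wants to convince themselves that the nibble-swap sequence really is a 16-bit logical shift by 5, here is a C model of it. It is a sketch only: the names are invented, `lo`/`hi` stand in for the two bytes addressed through FSR2, and each line is annotated with the instruction it imitates.

```c
#include <stdint.h>

static uint8_t swap_nibbles(uint8_t b)   /* models SWAPF */
{
    return (uint8_t)((b << 4) | (b >> 4));
}

/* Models the 10-instruction sequence: 16-bit logical right shift by 5. */
uint16_t shr5_swap_trick(uint16_t value)
{
    uint8_t lo = (uint8_t)(value & 0xFFu);
    uint8_t hi = (uint8_t)(value >> 8);
    uint8_t w, carry;

    lo = swap_nibbles(lo);                     /* swapf POSTINC2                        */
    hi = swap_nibbles(hi);                     /* swapf INDF2                           */
    w  = hi;                                   /* movf  POSTDEC2,w                      */
    w ^= lo;                                   /* xorwf INDF2,w                         */
    w &= 0xF0u;                                /* andlw 0xF0                            */
    lo ^= w;                                   /* xorwf POSTINC2: lo = low byte of >>4  */
    carry = (uint8_t)(hi & 1u);                /* rrcf  POSTDEC2: original bit 12 -> C  */
    hi >>= 1;                                  /*   (incoming carry is masked off below) */
    lo = (uint8_t)((carry << 7) | (lo >> 1));  /* rrcf  POSTINC2                        */
    hi &= 0x07u;                               /* movlw 0x07 / andwf INDF2              */
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```

The two swaps plus the XOR/AND/XOR merge implement a shift by 4 in one go, and the final pair of rotates adds the fifth bit. Like the CCS code, this is a logical shift, so the sign is not preserved.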
    #15
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 09:50:23 (permalink)
    0


    That being said, right shift on signed number has implementation defined behavior.




     
    Thx for that hint - CCS doesn't reproduce the intended arithmetic right shift!
     
    Doing an unsigned shift with XC8 changes only
    008E 3425 RLCF 0x25, W, ACCESS
    to
    008E 90D8 BCF STATUS, 0, ACCESS
     
    #16
    1and0
    Access is Denied
    • Total Posts : 10346
    • Reward points : 0
    • Joined: 2007/05/06 12:03:20
    • Location: Harry's Gray Matter
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 10:04:55 (permalink)
    +1 (1)
    uln
    Thx for that hint - CCS doesn't reproduce the intended arithmetic right shift!

    You can implement arithmetic right shift with logical shifts, but of course, it will take more code. ;)
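One way to do that in C, as a sketch with a hypothetical function name: shift the value as unsigned, then fill the vacated top bits with copies of the sign bit.

```c
#include <stdint.h>

/* Arithmetic right shift by 5 built only from well-defined unsigned
   operations: shift logically, then replicate the sign into bits 15..11. */
int16_t asr5_portable(int16_t acc)
{
    uint16_t u = (uint16_t)acc;
    uint16_t shifted = (uint16_t)(u >> 5);

    if (u & 0x8000u) {
        shifted |= 0xF800u;
        /* Convert back without relying on an out-of-range signed cast. */
        return (int16_t)((int32_t)shifted - 65536);
    }
    return (int16_t)shifted;
}
```

Here -33 gives -2 and -32768 gives -1024, i.e. the result rounds toward negative infinity like a hardware arithmetic shift.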
     
    uln
    Doing an unsigned shift with XC8 changes only
    008E 3425 RLCF 0x25, W, ACCESS

    to
    008E 90D8 BCF STATUS, 0, ACCESS


    Yup, clearing the carry bit yields a logical shift. It is what it is. Like I said, you want speed the only way is assembly. ;)
     
    #17
    1and0
    Access is Denied
    • Total Posts : 10346
    • Reward points : 0
    • Joined: 2007/05/06 12:03:20
    • Location: Harry's Gray Matter
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/22 10:18:02 (permalink)
    0
    uln
     
    Thx for that hint - CCS doesn't reproduce the intended arithmetic right shift!

    Is CCS smart enough to use arithmetic right shifts for this?
    RES[sel] = acc / 32;
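One caveat with this question, assuming C99 division semantics (truncation toward zero): `acc / 32` and an arithmetic `acc >> 5` are not the same operation for negative values, so a compiler cannot replace one with the other without a correction step. The sketch below uses made-up function names to show the difference.

```c
#include <stdint.h>

/* Signed division truncates toward zero (C99 and later). */
int16_t div32(int16_t acc)
{
    return (int16_t)(acc / 32);
}

/* Floor division, i.e. what an arithmetic right shift by 5 yields,
   written without shifting a negative value. */
int16_t floor_div32(int16_t acc)
{
    int16_t q = (int16_t)(acc / 32);
    if ((acc % 32) < 0) {
        q = (int16_t)(q - 1);   /* step down when a negative remainder was dropped */
    }
    return q;
}
```

For -33, `div32` returns -1 while `floor_div32` returns -2; for non-negative inputs and exact multiples of 32 the two agree.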

    #18
    uln
    New Member
    • Total Posts : 18
    • Reward points : 0
    • Joined: 2019/03/20 10:02:31
    • Location: 0
    • Status: offline
    Re: horrible performance of XC8-generated code 2020/01/23 01:25:04 (permalink)
    0
    1and0
    Is CCS smart enough to use arithmetic right shifts for this?
    RES[sel] = acc / 32;




    I've only a very old version of the CCS compiler (v3.29) - it generates:

    ....................             static int16_t acc = 0;
    ....................             acc += (((int16) outISO) - 128);
    1ADC:  MOVLB  1
    1ADE:  CLRF   xF0
    1AE0:  MOVLW  80
    1AE2:  SUBWF  x02,W
    1AE4:  MOVWF  00
    1AE6:  MOVLW  00
    1AE8:  SUBWFB xF0,W
    1AEA:  MOVWF  03
    1AEC:  MOVF   00,W
    1AEE:  ADDWF  x07,F
    1AF0:  MOVF   03,W
    1AF2:  ADDWFC x08,F
    ....................             res =  acc / 32;
    1AF4:  MOVFF  108,1F0
    1AF8:  MOVFF  107,1EF
    1AFC:  CLRF   xF2
    1AFE:  MOVLW  20
    1B00:  MOVWF  xF1
    1B02:  MOVLB  0
    1B04:  GOTO   0BA6
    1B08:  MOVFF  02,106
    1B0C:  MOVLB  1
    1B0E:  MOVFF  01,105

    Because the code located at 0BA6 isn't included in the assembler listing file, I can't answer the question.
     
    Although the CCS compiler generates fast code, there are many reasons to move to MPLAB X - annoying implementation details of the CCS compiler, missing debug possibility, poor source code quality (my predecessors preferred to add a dsPIC33 board to the PIC18F252 system when a CAN bus was needed to avoid touching that source code!)
    #19
    ric
    Super Member
    • Total Posts : 25592
    • Reward points : 0
    • Joined: 2003/11/07 12:41:26
    • Location: Australia, Melbourne
    • Status: online
    Re: horrible performance of XC8-generated code 2020/01/23 01:27:12 (permalink)
    +1 (1)
    uln
    ...
    Although the CCS compiler generates fast code, there are many reasons to move to MPLAB X

    CCS is a compiler. MPLABX is an IDE.
    I guess you mean "move to XC8".
    I think it's even possible to run the CCS compiler under MPLABX, rather than XC8.

    #20