occasional EEPROM corruption - possible causes?

saipan59
2009/03/25 13:27:12

I need suggestions for what could possibly be causing the EEPROM in an 18LF2420 to occasionally become corrupted.
 
The important facts:
The corruption problem first appeared when the code was ported from PICC18 to C18. Could there be an optimization difference that is making trouble?
There were a couple of "minor" functionality changes included with the C18 port (I am looking at those in detail, but they don't seem likely to be a problem).
The corruption seems to involve a 'bunch' of bytes, not just one or two.
The application is a Li-Ion battery manager, which controls the charging and testing of the cells.
The hardware and the PICC18-based code have many thousands of hours of runtime without this problem.
The code periodically writes a set of param values to the EEPROM (when there is a 'state change', or every 50 minutes).
The problem is relatively hard to reproduce (I can NOT reproduce it at will, so far).
There is evidence that the problem *may* be related to a power-system state change, where the PIC switches from running from external power (3.3V from an LDO), to running from the battery (nominal 4.1V), or vice-versa. I have made scope measurements of the PIC's Vdd during these transitions - it looks normal.
The PIC operates as an I2C Slave, and is periodically (once per minute) queried by the host.
 
Since the corruption involves a bunch of bytes (not just 1 or 2), perhaps the corruption is really in the RAM-based vars that are being written??
 
I plan to build a code image that updates the EE very frequently, and checks the EE integrity each time, to see if I can make the problem easy to reproduce.
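A minimal sketch of such a stress build, written so it can first be tried on a host PC. Everything named here is an assumption for illustration: `ee_sim`, `ee_write_byte`, `ee_read_byte`, `ee_write_verify` and `ee_stress` are hypothetical, and the 256-byte array merely stands in for the data EEPROM. On the target, the two accessors would wrap the real EECON1/EECON2 sequences.

```c
#include <stddef.h>
#include <stdint.h>

#define EE_SIZE 256u                  /* assumed size of the data EEPROM */

static uint8_t ee_sim[EE_SIZE];       /* host-side stand-in for the EEPROM */

/* Hypothetical accessors; on the PIC these would do the real write/read. */
void ee_write_byte(uint8_t addr, uint8_t value) { ee_sim[addr] = value; }
uint8_t ee_read_byte(uint8_t addr) { return ee_sim[addr]; }

/* Write len bytes starting at addr, then read them all back.
 * Returns the number of bytes that did not read back as written. */
unsigned ee_write_verify(uint8_t addr, const uint8_t *src, uint8_t len)
{
    unsigned bad = 0;
    uint8_t i;

    for (i = 0; i < len; i++)
        ee_write_byte((uint8_t)(addr + i), src[i]);
    for (i = 0; i < len; i++)
        if (ee_read_byte((uint8_t)(addr + i)) != src[i])
            bad++;
    return bad;
}

/* Repeat the write/verify with a pattern that changes on every pass,
 * so stale data from an earlier pass cannot pass for a good write.
 * Returns the total number of mismatching bytes seen. */
unsigned ee_stress(unsigned passes, uint8_t addr, uint8_t len)
{
    uint8_t pattern[32];
    unsigned bad = 0;
    unsigned p;
    uint8_t i;

    if (len > sizeof pattern)
        len = (uint8_t)sizeof pattern;
    for (p = 0; p < passes; p++) {
        for (i = 0; i < len; i++)
            pattern[i] = (uint8_t)(p + i);
        bad += ee_write_verify(addr, pattern, len);
    }
    return bad;
}
```

On real hardware a non-zero return from `ee_stress` would be the point to stop and dump state. A loop like this also uses up write cycles quickly, so it is worth keeping the part's rated EEPROM endurance in mind.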
 
Any other ideas?
 
Thanks,
Pete
 
#1

    danish.ali
    2009/03/25 15:13:56
    Is it an entire block of bytes that gets corrupted, or only part of a block?

    Can you tell if the corruption occurs more-often later in your block-of-bytes?
    By later, I mean in the later bytes to be written in write-order (rather than strictly memory-order, although these tend to go together).

    One possibility is that you get an unfortunate interrupt part-way through your eeprom-block-write routine, and that fails to fully save/restore the context. Perhaps the only routine sensitive to this failure is the eeprom-block-write. So I would look very carefully at context-save, resources used by all ISRs, and context-restore.

    Do any of your ISRs read EEPROM? Including the I2C handler? That could be a disaster by touching EEADR.

    Are any of the values that get written multi-byte? And can they get modified part-way through the eeprom-write? Or do you take a snapshot at the start of your eeprom-write, so the data cannot change during the long time it takes to write the bytes to eeprom?

    Is it possible that there is more corruption than you know about, and that you only spot it where absurd values turn up? Can you add a checksum to your block-of-bytes to improve the chances of detecting corruption?
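The snapshot and checksum ideas above could be combined roughly as follows. This is only a sketch under stated assumptions: `PARAM_LEN`, `crc8`, `ee_image_build` and `ee_image_valid` are hypothetical names that do not appear in the original code, and CRC-8 with polynomial 0x07 is just one reasonable choice of checksum.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PARAM_LEN 8u    /* hypothetical size of the parameter block */

/* CRC-8, polynomial 0x07, initial value 0x00, computed bit by bit. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 0x80u)
                crc = (uint8_t)((crc << 1) ^ 0x07u);
            else
                crc = (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Copy the live parameters into a private image and append the CRC.
 * image must hold PARAM_LEN + 1 bytes. The EEPROM write loop then
 * works from image, so later changes to the live values cannot
 * produce a half-old, half-new block. */
void ee_image_build(uint8_t *image, const uint8_t *live)
{
    memcpy(image, live, PARAM_LEN);
    image[PARAM_LEN] = crc8(image, PARAM_LEN);
}

/* Return 1 if a block read back from EEPROM matches its stored CRC. */
int ee_image_valid(const uint8_t *image)
{
    return crc8(image, PARAM_LEN) == image[PARAM_LEN];
}
```

Checking `ee_image_valid` at start-up and after every write would flag a damaged block even when the values in it still look plausible.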

    Hope these thoughts help you track down the problem,
    Danish
    #2
    NKurzman
    2009/03/25 17:29:18
    Look at the compiler source code.  Where is the wait for the Write to complete?
    #3
    saipan59
    2009/03/25 17:50:01
    Thanks much for your comments! Below are some comments to your comments.

    ORIGINAL: danish.ali
    Is it an entire block of bytes that gets corrupted, or only part of a block?
    Can you tell if the corruption occurs more-often later in your block-of-bytes?
    By later, I mean in the later bytes to be written in write-order (rather than strictly memory-order, although these tend to go together).

    Not sure. Some of the bytes later get re-written with 'good' data. I can't capture the failure immediately after it happens. And I have only been able to really study the aftermath of 4 examples.

    ORIGINAL: danish.ali
    One possibility is that you get an unfortunate interrupt part-way through your eeprom-block-write routine, and that fails to fully save/restore the context. Perhaps the only routine sensitive to this failure is the eeprom-block-write. So I would look very carefully at context-save, resources used by all ISRs, and context-restore.

    Yes, I have thought about that type of thing, from the point of view of "what could be different between PICC18 and C18?". During the EE-write, the code disables everything except INT0, which only happens when switching from "running from main power" to "running from battery power". The INT0 ISR does a "soft reset" by jumping to the Reset vector location, so it (should) never return to the EE-write function. Could there be a difference between PICC18 and C18 related to the soft-reset?

    ORIGINAL: danish.ali
    Do any of your ISRs read EEPROM? Including the I2C handler? That could be a disaster by touching EEADR.
    Are any of the values that get written multi-byte? And can they get modified part-way-through eeprom-write? Or do you take a snapshot at the start of your eeprom-write, which cannot change during the long time it takes to write bytes to eeprom?

    Everything but INT0 is (supposedly) disabled, so the EE-write should run non-stop without interruption.

    ORIGINAL: danish.ali
    Is it possible that there is more corruption than you know about, and that you only spot it where absurd values turn up? Can you add a checksum to your block-of-bytes to improve the chances of detecting corruption?

    Yes, that is possible. One of the 4 examples that I've seen had some clear evidence of "good data being written to the wrong location".
    So, maybe the expected interrupt disable is the best suspect?
    In case there are any hints, this is the relevant code:

    RCONbits.IPEN = 1;               // enable interrupt priorities
    IPR1 = 0;               // all peripherals low priority
    . . .
    // Disable non-essential interrupts (allow only INT0).
    INTCON = 0b10010000; // hi-priority (INT0) only
    INTCON3bits.INT1IE = 0;

    Pete
    post edited by saipan59 - 2009/03/25 18:08:12
    #4
    saipan59
    2009/03/25 18:03:55
    ORIGINAL: NKurzman
    Look at the compiler source code.  Where is the wait for the Write to complete?

    Thanks for your reply. Here is the EE-write (all except INT0 is disabled immediately before, and re-enabled immediately after):
     
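    EECON1bits.CFGS = 0;              // access EEPROM/flash, not config registers
    EECON1bits.EEPGD = 0;             // select data EEPROM, not program flash
    while (num)
        {
            EEADR = ee_update_addr;
            EEDATA = *value;
            EECON1bits.WREN = 1;      // allow writes
            EECON2 = 0x55;            // required unlock sequence...
            EECON2 = 0xaa;            // ...followed directly by setting WR
            EECON1bits.WR = 1;        // start the write cycle
            do
               {
               ClrWdt();              // keep the watchdog quiet while waiting
               } while(EECON1bits.WR); // hardware clears WR when the write completes
            EECON1bits.WREN = 0;      // block writes again
            num--;
            ee_update_addr++;
            value++;
        }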
     
    Is there anything here that could be different from PICC18 to C18?
    Also, I want to mention again that this works fine "nearly every time"...
     
    Pete
     
    #5
    saipan59
    2009/03/25 19:10:11
    I may have just found a problem(??). For C18, my low-priority ISR is declared like this (see below). Note that I used "#pragma interrupt" instead of "#pragma interruptlow".
    The hi-priority ISR is also declared with "#pragma interrupt".
    I'm unclear about what will happen, but I just read that param "interrupt" means hi-priority. Does it make sense that this could work most of the time??
    Also, there is an errata for rev A1 silicon that says to use 'interruptlow' for *both* ISRs (I have rev A1 silicon)...
    All of my IPR bits are 0 (low priority). INT0 is always high priority.
     
    #pragma code InterruptVectorLow = 0x318 // leave space for bootloader
    void InterruptVectorLow (void)
        {
        _asm  goto InterruptHandlerLow  _endasm
        }
    #pragma code
    #pragma interrupt InterruptHandlerLow
    void InterruptHandlerLow (void)
    {
    . . .
    }
     
    Pete
     
    #6
    danish.ali
    2009/03/26 02:00:47
    #pragma interrupt uses hardware context-save. This cannot be nested as there is only one shadow set of registers.

    If you get a high-priority interrupt during a low-priority interrupt, what will happen is that W, STATUS and BSR will be corrupted for the main program. This will cause crazy values to appear in current calculations and possibly in random memory locations. Maybe your system can recover except where those temporary crazy values get written to eeprom.

    This could explain the symptoms you observe.
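To make the failure concrete, here is a small host-side simulation of a one-level shadow register set. It is an illustrative model only, not PIC code: `cpu_ctx_t`, `irq_enter`, `irq_return_fast`, `isr_high`, `isr_low_fast` and `main_context_survives` are hypothetical names, and the register values are arbitrary. It assumes the shadow copy of W, STATUS and BSR is overwritten on every interrupt entry, and that both ISRs return by restoring from it.

```c
#include <stdint.h>

/* The three registers kept in the shadow set. */
typedef struct {
    uint8_t w;
    uint8_t status;
    uint8_t bsr;
} cpu_ctx_t;

static cpu_ctx_t cpu;      /* live registers (simulated) */
static cpu_ctx_t shadow;   /* the single shadow copy, one level deep */

/* Modelled hardware behaviour on interrupt entry: overwrite the shadow. */
void irq_enter(void) { shadow = cpu; }

/* Modelled fast return: restore the live registers from the shadow. */
void irq_return_fast(void) { cpu = shadow; }

/* High-priority ISR: uses the registers, returns via the shadow set. */
void isr_high(void)
{
    irq_enter();
    cpu.w = 0xAA; cpu.status = 0x01; cpu.bsr = 0x0F;
    irq_return_fast();
}

/* Low-priority ISR that (wrongly) also returns via the shadow set.
 * If nested_high is non-zero, a high-priority interrupt hits mid-ISR. */
void isr_low_fast(int nested_high)
{
    irq_enter();                    /* shadow = main program context */
    cpu.w = 0x11; cpu.status = 0x04; cpu.bsr = 0x02;
    if (nested_high)
        isr_high();                 /* shadow now = low ISR context */
    irq_return_fast();              /* restores whatever shadow holds */
}

/* Returns 1 if the main program's registers survive the low ISR. */
int main_context_survives(int nested_high)
{
    const cpu_ctx_t before = { 0x42, 0x00, 0x01 };

    cpu = before;
    isr_low_fast(nested_high);
    return cpu.w == before.w &&
           cpu.status == before.status &&
           cpu.bsr == before.bsr;
}
```

In this model `main_context_survives(0)` returns 1 and `main_context_survives(1)` returns 0: the failure needs the high-priority interrupt to land inside the low-priority ISR, which fits a fault that shows up only occasionally.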

    And since you have the A1 silicon you cannot rely on hardware context-save even for high-priority interrupts (assuming you are right about that errata item - I have not checked it).

    Switch everything to #pragma interruptlow

    Regards,
    Danish
    #7
    drh
    2009/03/26 07:22:28
    ORIGINAL: saipan59

    ORIGINAL: NKurzman
    Look at the compiler source code.  Where is the wait for the Write to complete?

    Thanks for your reply. Here is the EE-write (all except INT0 is disabled immediately before, and re-enabled immediately after):

    EECON1bits.CFGS = 0;
    EECON1bits.EEPGD = 0;
    while (num)
       {
           EEADR = ee_update_addr;
           EEDATA = *value;
           EECON1bits.WREN = 1;
           EECON2 = 0x55;
           EECON2 = 0xaa;
           EECON1bits.WR = 1;
           do
              {
              ClrWdt();
              } while(EECON1bits.WR);
           EECON1bits.WREN = 0;
           num--;
           ee_update_addr++;
           value++;
       }

    Is there anything here that could be different from PICC18 to C18?
    Also, I want to mention again that this works fine "nearly every time"...

    Pete



    From the data sheet:
    To write an EEPROM data location, the address must first be written to the EEADR register and the data written to the EEDATA register. The sequence in Example 7-2 must be followed to initiate the write cycle. The write will not begin if this sequence is not exactly followed (write 55h to EECON2, write 0AAh to EECON2, then set WR bit) for each byte. It is strongly recommended that interrupts be disabled during this code segment.

    Additionally, the WREN bit in EECON1 must be set to enable writes. This mechanism prevents accidental writes to data EEPROM due to unexpected code execution (i.e., runaway programs). The WREN bit should be kept clear at all times, except when updating the EEPROM. The WREN bit is not cleared by hardware. After a write sequence has been initiated, EECON1, EEADR and EEDATA cannot be modified. The WR bit will be inhibited from being set unless the WREN bit is set. Both WR and WREN cannot be set with the same instruction. At the completion of the write cycle, the WR bit is cleared in hardware and the EEPROM Interrupt Flag bit, EEIF, is set. The user may either enable this interrupt or poll this bit. EEIF must be cleared by software.


    It may just be the INT0 interrupt.


    David
    #8
    saipan59
    2009/03/26 17:19:09
    Thanks for the help folks.
    I will attempt to reproduce the problem, and then verify the fix, and report back.
     
    Pete
     
    #9