Data correction used today is far from being optimal

Discussion in 'Programming' started by Jarek Duda, Oct 3, 2008.

  1. #1
    Let's think about the theoretical limit on how many bits of redundancy we
    have to add per bit of information, for an assumed statistical error
    distribution, to be able to fully correct the file.
    To find this threshold, let's start with a simpler-looking question:
    how much information is stored in such an uncertain bit?
    Take the simplest error distribution model - each bit is flipped
    independently with probability e (near zero), so if we see '1' we know
    that with probability 1-e it's really '1', and with probability e it's '0'.
    If we additionally knew which of these cases holds - knowledge worth
    h(e) = -e lg(e) - (1-e) lg(1-e) bits - we would have a whole bit, so such an uncertain bit carries 1-h(e) bits.
    To transfer n real bits, we have to use at least n/(1-h(e)) such uncertain
    bits, so the theoretical limit to be able to read the message is
    (asymptotically)
    h(e)/(1-h(e)) additional bits of redundancy per bit of information.

    So a perfect data correction coder for error probability e = 1/100 should
    need only about 0.088 additional bits per bit to be able to restore the
    message.
    Hamming 4+3, despite using 0.75 additional bits per bit, still loses about
    16 bits per kilobyte with the same error distribution.
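    For a quick sanity check, here is a minimal Python sketch (my own, not
    from any of the linked materials) that reproduces both numbers under the
    assumptions above:

    from math import log2, comb

    e = 0.01                                  # bit-flip probability

    # Shannon limit: redundancy needed per information bit
    h = -e*log2(e) - (1-e)*log2(1-e)          # binary entropy h(e)
    print(h / (1 - h))                        # ~0.088 bits of redundancy per bit

    # Hamming 4+3 (7,4): a block is decoded wrongly whenever 2+ of its 7 bits flip
    p_block_fail = sum(comb(7, k) * e**k * (1-e)**(7-k) for k in range(2, 8))
    blocks_per_kB = 8 * 1024 // 4             # 2048 data blocks per kilobyte
    print(p_block_fail * blocks_per_kB)       # ~4 miscorrected blocks per kilobyte,
                                              # each leaving a few wrong bits,
                                              # i.e. on the order of 16 bits/kB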

    There are two main reasons such codes are far from optimal:
    - they place the data in independent blocks, making them sensitive to
    pessimistic cases (like 2 errors in one block),
    - they don't correspond to the error distribution. For example, Hamming
    4+3 assumes one of 8 error scenarios - no error, or exactly one of the 7
    bits changed. It encodes all of these cases equally, but usually the
    no-error scenario is much more probable.

    One approach to get closer to the optimal data correction method is to
    use a new coder: the Asymmetric Numeral System (ANS).
    http://www.c10n.info/archives/720
    It has huge freedom of choice; for example, we can get the Hamming code
    or bit tripling as degenerate cases.
    Here is a demonstration of it:
    http://demonstrations.wolfram.com/DataCompressionUsingAsymmetricNumeralSystems
    Here is more information about its use for data correction:
    http://www.thescienceforum.com/Data-correction-methods-resistant-to-pessimistic-cases-13416t.php
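    For a concrete picture of what ANS coding looks like, here is a minimal
    Python sketch of its binary variant (uABS), using the standard formulas;
    it is only an illustration, not the implementation behind the links:

    # uABS: encode/decode one binary symbol on an integer state x.
    # Exact rationals are used to avoid floating-point rounding issues.
    from fractions import Fraction
    from math import ceil, floor

    def uabs_encode(s, x, p):
        """Push one binary symbol s (with P(s=1) = p) onto state x."""
        if s == 1:
            return floor(x / p)
        return ceil((x + 1) / (1 - p)) - 1

    def uabs_decode(x, p):
        """Pop the most recently encoded symbol and restore the previous state."""
        s = ceil((x + 1) * p) - ceil(x * p)   # equals 1 for a ~p fraction of states
        if s == 1:
            return 1, ceil(x * p)
        return 0, x - ceil(x * p)

    # Round-trip check (symbols come back out in reverse, LIFO order):
    p, x, bits = Fraction(3, 10), 1, [1, 0, 0, 1, 1, 0]
    for b in bits:
        x = uabs_encode(b, x, p)
    decoded = []
    for _ in bits:
        b, x = uabs_decode(x, p)
        decoded.append(b)
    assert decoded == bits[::-1] and x == 1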
     
    Jarek Duda, Oct 3, 2008
  2. #2
    I've finished a large paper about ANS. It adds some deeper analysis and
    gathers reworked information I've posted on different forums...
    It also shows that the presented data correction approach can really
    reach the theoretical Shannon limit and appears to have expected linear
    time.

    http://arxiv.org/abs/0902.0271

    The only known 'practical' approach - Low Density Parity Check -
    distributes a huge number of checksum bits uniformly; it takes a long
    time (N^2 or more) to distribute them, finding the proper correction
    means solving a hard (NP-type) problem, and we still do not reach the
    theoretical limit.
    The presented method is more sophisticated - it uses an entropy coder and
    a hash value - and can be imagined as path tracking: we know the starting
    and ending positions and want to walk between them along the proper path.
    While we follow this path everything is fine, but once we lose it, in
    each step we have a chosen probability of becoming aware of that fact. We
    can then go back and try some correction. We pay for this parameter with
    capacity, but if this probability is chosen above a threshold
    corresponding exactly to Shannon's limit, the number of corrections we
    have to try no longer grows exponentially but polynomially, so we can
    easily verify whether a given correction was the proper one ... checking
    all of them in pessimistically polynomial time.
     
    Jarek Duda, Mar 18, 2009
  3. #3
    A simulator of the correction process has just been published on
    Wolfram's page:
    http://demonstrations.wolfram.com/CorrectionTrees/
    It shows that we finally have a near-Shannon-limit method working in
    nearly linear time for any noise level.

    For a given probability of bit damage (p_b), we choose the parameter p_d.
    The higher this parameter, the more redundancy we add and the easier it
    is to correct errors.
    We want to find the proper correction (the red path in the simulator).
    The main correction mechanism is that while we are expanding the proper
    correction everything is fine, but in each step of expanding a wrong
    correction we have probability p_d of realizing it. With p_d large
    enough, the number of corrections we have to check no longer grows
    exponentially.
    At each step the tree structure built so far is known, and we use it to
    choose the most probable leaf to expand, as in the sketch below.
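    Schematically (this is my own simplification of the search, not the
    simulator's code; the three callbacks are placeholders for the real
    coder's hooks), the expansion loop looks like:

    import heapq

    def correct(step_cost, is_consistent, is_complete):
        """Best-first search over candidate corrections.

        step_cost(prefix, choice) - cost (-log probability) of extending a prefix,
        is_consistent(prefix)     - redundancy check; a wrong branch fails it
                                    with probability ~p_d per step,
        is_complete(prefix)       - True once the whole message is decoded.
        """
        heap = [(0.0, ())]                      # (accumulated cost, corrected prefix)
        while heap:
            cost, prefix = heapq.heappop(heap)  # most probable leaf first
            if is_complete(prefix):
                return prefix                   # candidate passing all checks so far
            for choice in (0, 1):               # e.g. 'bit damaged' / 'bit intact'
                child = prefix + (choice,)
                if is_consistent(child):        # wrong branches die out here
                    heapq.heappush(heap, (cost + step_cost(prefix, choice), child))
        return None                             # no consistent correction found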

    I've realized that practical correction methods (those not requiring
    exponential correction time) need a bit more redundancy than the
    theoretical (Shannon) limit. Redundancy lets us reduce the number of
    corrections to consider. In practical methods we have to extend
    corrections step by step, so we have to assume that the expected number
    of corrections up to a given moment is finite, which requires more
    redundancy than Shannon's limit (observe that block codes fulfill this
    assumption).
    This limit is calculated in the latest version of the paper (0902.0271).
    The basic correction algorithm (as in the simulator) works at a slightly
    worse limit (the encoded file is larger by at most 13%), but it can
    probably be improved.
    Finally, this new family of random trees has two phase transitions: for
    small p_d < p_d^0 the tree immediately expands exponentially; for
    p_d^0 < p_d < p_d^2 the tree generally has small width, but rare high
    error concentrations make its expected width infinite (like a long tail
    of a probability distribution); for p_d > p_d^2 it has finite expected
    width.

    Error correction methods used today work practically only for very low
    noise (p_b < 0.01). The presented approach works well for any noise level
    (p_b < 0.5).
    For small noise it needs an encoded file size practically at Shannon's
    limit. The difference appears for large noise: it needs a file size at
    most twice the limit.
    A practical method for large noise gives a new way to increase the
    capacity of transmission lines and storage devices - for example, place
    two bits where we would normally place one; the cost is a large increase
    in noise, but now we can handle it.

    For extremely large noise we can no longer use ANS. Fig. 3 of the paper
    shows how to handle it. For example, if we have to increase the size of
    the file 100 times, we can encode each bit in 100 bits - encode 1 as
    11...111 XOR 'hash value of the already encoded message', and the same
    with 0. Now, while creating the tree, each split has a different number
    of corrected bits - a different weight.
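    As a toy illustration of this idea (my own sketch, not code from the
    paper; the hash choice and helper names are arbitrary):

    import hashlib

    N = 100  # repetition length: "increase the size of the file 100 times"

    def running_hash_bits(prefix_bits, n=N):
        """n deterministic pseudo-random bits derived from the bits encoded so far."""
        digest = hashlib.sha256(bytes(prefix_bits)).digest()
        stream = [int(c) for byte in digest for c in f'{byte:08b}']
        return (stream * (n // len(stream) + 1))[:n]

    def encode_bit(bit, prefix_bits):
        """Encode one message bit as N copies XORed with the hash of the prefix."""
        return [bit ^ m for m in running_hash_bits(prefix_bits)]

    def branch_weight(received_block, candidate_bit, prefix_bits):
        """How many of the N received bits disagree with this branch of the tree."""
        mask = running_hash_bits(prefix_bits)
        return sum((r ^ m) != candidate_bit for r, m in zip(received_block, mask))

    While decoding, the receiver recomputes the same hash from its current
    corrected prefix, so each branch of the tree gets a weight - the number
    of disagreeing bits - exactly as described above.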
     
    Jarek Duda, Jun 24, 2009
  4. #4
    I defended my first PhD, half of which was about this new approach to
    data correction - basically it extends the convolutional codes concept to
    use much larger states (which requires working not on the whole space of
    states as usual, but only on the tree of states actually used, and allows
    a practically complete repair in linear time) and uses an entropy coder
    to add redundancy (simultaneous data compression, and the rate can be
    changed fluently).
    The thesis is the arXiv paper with a few small improvements (it can be
    downloaded from
    http://tcs.uj.edu.pl/graduates.php?degree=1&lang=0 )
    Here is the presentation I used - with a few new pictures which could
    make the concept easier to understand and, e.g., some comparison to LDPC:
    https://docs.google.com/fileview?id=0B7ppK4IyMhisMTNkNmQ2NGUtYzBlNy00ODJjLTlhOGUtM2I4YmZhYmYxZjQ3&hl
     
    Jarek Duda, Dec 9, 2010
  5. #5
    I apologize for digging this thread up, but there is finally a practical
    implementation, and it beats modern state-of-the-art methods in many
    applications. It can be seen as a greatly improved convolutional-code-like
    concept - for example, it no longer uses convolution but a carefully
    designed, extremely fast operation that allows working on much larger
    states instead. The other main improvements are bidirectional decoding
    and a heap (logarithmic complexity) instead of the commonly used stack
    (linear complexity). For simplicity it will be called Correction Trees
    (CT).
    The most important improvement is that it can handle larger channel
    damage at the same rate. Adding redundancy to double (rate 1/2) or triple
    (rate 1/3) the message size should theoretically allow completely
    repairing up to 11% or 17.4% damaged bits, respectively, for the Binary
    Symmetric Channel (each bit is flipped independently with this
    probability). Unfortunately, this Shannon limit is practically
    unreachable - we can only reduce the Bit Error Rate (BER) if the noise is
    significantly below this limit. Turbo Codes (TC) and Low Density Parity
    Check (LDPC) are nowadays seen as the best methods - here is a comparison
    of some of their implementations with the CT approach - output BER versus
    input noise level:

    [Figure: output BER vs. input noise level for TC, LDPC and CT implementations]

    We can see that CT still repairs where the others have given up - making
    it well suited for extreme applications like deep-space or underwater
    communication. Unfortunately, repairing such extreme noise requires
    extreme resources - a software implementation on a modern PC decodes a
    few hundred bytes per second at extreme noise levels. Additionally, with
    more resources the correction capability can be further improved
    (lowering the line in the figure above).
    On the other hand, CT encoding is extremely fast and correction for low
    noise is practically free - up to about 5-6% damage for rate 1/2. In
    comparison, TC correction always requires a lot of computation, while
    LDPC additionally requires a lot of work even for encoding alone.
    So, in contrast to them, CT is well suited for e.g. hard disks - everyday
    operation involves low noise, so using CT would be extremely cheap. On
    the other hand, if a disk is really badly damaged, correction is still
    possible but becomes costly. Such correction could also be done
    externally, allowing further improvement of the correction capabilities.
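    For reference, the 11% and 17.4% thresholds quoted above follow from the
    Binary Symmetric Channel capacity 1 - h(p); a small Python sketch (mine,
    just a sanity check) recovers them by bisection:

    from math import log2

    def h(p):
        """Binary entropy in bits."""
        return -p*log2(p) - (1-p)*log2(1-p)

    def bsc_threshold(rate, tol=1e-12):
        """Largest flip probability p with 1 - h(p) >= rate, searched on [0, 1/2]."""
        lo, hi = 0.0, 0.5
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if 1 - h(mid) >= rate:
                lo = mid            # capacity still suffices, move right
            else:
                hi = mid
        return lo

    print(bsc_threshold(1/2))       # ~0.110 -> the 11% quoted for rate 1/2
    print(bsc_threshold(1/3))       # ~0.174 -> the 17.4% quoted for rate 1/3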

    Paper: http://arxiv.org/abs/1204.5317
    Implementation: https://indect-project.eu/correction-trees/
    Simulator: http://demonstrations.wolfram.com/CorrectionTrees/
     
    Jarek Duda, Apr 25, 2012