2018-03-18

Analysis of pifs "compression"

Issue #10 of pifs (a file system that encodes files as their index in pi) asks about the expected amount of space that the index would take, and I became curious about this and did some research:

The index in pi of a file with n bits is practically random. It’s geometrically distributed with probability 1/(2^n) (the probability that the file appears at a given index), which means that the average index is 2^n. The index requires an average of n bits to represent, which is the same amount of space that the original file takes.

Since the distribution of indexes is practically random, pifs isn’t really a compression algorithm (which usually targets redundant information).

Here’s a histogram of indexes for 2-digit files in random sequences of digits:

# NumSamples = 10000; Min = 0.00; Max = 500.00
# 67 values outside of min/max
# Mean = 98.814100; Variance = 9846.521741; SD = 99.229641; Median 68.000000
# each ∎ represents a count of 31
    0.0000 -    25.0000 [  2353]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
   25.0000 -    50.0000 [  1687]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
   50.0000 -    75.0000 [  1284]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
   75.0000 -   100.0000 [  1035]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  100.0000 -   125.0000 [   822]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  125.0000 -   150.0000 [   581]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  150.0000 -   175.0000 [   521]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  175.0000 -   200.0000 [   392]: ∎∎∎∎∎∎∎∎∎∎∎∎
  200.0000 -   225.0000 [   294]: ∎∎∎∎∎∎∎∎∎
  225.0000 -   250.0000 [   246]: ∎∎∎∎∎∎∎
  250.0000 -   275.0000 [   178]: ∎∎∎∎∎
  275.0000 -   300.0000 [   135]: ∎∎∎∎
  300.0000 -   325.0000 [   111]: ∎∎∎
  325.0000 -   350.0000 [    82]: ∎∎
  350.0000 -   375.0000 [    61]: ∎
  375.0000 -   400.0000 [    54]: ∎
  400.0000 -   425.0000 [    34]: ∎
  425.0000 -   450.0000 [    29]:
  450.0000 -   475.0000 [    23]:
  475.0000 -   500.0000 [    11]:

This was generated using histogram.py and this Haskell code:

import Data.List
import Data.Maybe
import System.Random
import Control.Monad
import qualified Data.Algorithms.KMP as KMP
import System.Environment
import Control.Error
import Control.Monad.State
import System.IO.Error

main = flip catchIOError (const (return ())) $ do

  args <- getArgs

  let
    fileLength = args `atMay` 0 >>= readMay ?: 2
    seed = args `atMay` 1 >>= readMay ?: 0
    mkFile = replicateM fileLength (state $ randomR ('0', '9'))
    digits = sequence . repeat $ (state $ randomR ('0', '9'))
    indexOf file = (head . KMP.match (KMP.build file)) <$> digits
    gens = map mkStdGen $ randoms (mkStdGen seed)

  mapM_ print $ map (evalState (mkFile >>= indexOf)) gens

  -- Usage: stack runghc pifs.hs | head -n 10000 | histogram.py -x 500 -b 20