2018-03-18

# Analysis of pifs "compression"

Issue #10 of pifs (a file system that encodes files as their index in pi) asks about the expected amount of space that the index would take, and I became curious about this and did some research:

The index in pi of a file with `n` bits is practically random. It’s geometrically distributed with probability `1/(2^n)` (the probability that the file appears at a given index), which means that the average index is `2^n`. The index requires an average of `n` bits to represent, which is the same amount of space that the original file takes.

Since the distribution of indexes is practically random, pifs isn’t really a compression algorithm (which usually targets redundant information).

Here’s a histogram of indexes for 2-digit files in random sequences of digits:

``````# NumSamples = 10000; Min = 0.00; Max = 500.00
# 67 values outside of min/max
# Mean = 98.814100; Variance = 9846.521741; SD = 99.229641; Median 68.000000
# each ∎ represents a count of 31
0.0000 -    25.0000 [  2353]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
25.0000 -    50.0000 [  1687]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
50.0000 -    75.0000 [  1284]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
75.0000 -   100.0000 [  1035]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
100.0000 -   125.0000 [   822]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
125.0000 -   150.0000 [   581]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
150.0000 -   175.0000 [   521]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
175.0000 -   200.0000 [   392]: ∎∎∎∎∎∎∎∎∎∎∎∎
200.0000 -   225.0000 [   294]: ∎∎∎∎∎∎∎∎∎
225.0000 -   250.0000 [   246]: ∎∎∎∎∎∎∎
250.0000 -   275.0000 [   178]: ∎∎∎∎∎
275.0000 -   300.0000 [   135]: ∎∎∎∎
300.0000 -   325.0000 [   111]: ∎∎∎
325.0000 -   350.0000 [    82]: ∎∎
350.0000 -   375.0000 [    61]: ∎
375.0000 -   400.0000 [    54]: ∎
400.0000 -   425.0000 [    34]: ∎
425.0000 -   450.0000 [    29]:
450.0000 -   475.0000 [    23]:
475.0000 -   500.0000 [    11]:
``````

This was generated using histogram.py and this Haskell code:

``````import Data.List
import Data.Maybe
import System.Random
import qualified Data.Algorithms.KMP as KMP
import System.Environment
import Control.Error
import System.IO.Error

main = flip catchIOError (const (return ())) \$ do

args <- getArgs

let
fileLength = args `atMay` 0 >>= readMay ?: 2
seed = args `atMay` 1 >>= readMay ?: 0
mkFile = replicateM fileLength (state \$ randomR ('0', '9'))
digits = sequence . repeat \$ (state \$ randomR ('0', '9'))
indexOf file = (head . KMP.match (KMP.build file)) <\$> digits
gens = map mkStdGen \$ randoms (mkStdGen seed)

mapM_ print \$ map (evalState (mkFile >>= indexOf)) gens

-- Usage: stack runghc pifs.hs | head -n 10000 | histogram.py -x 500 -b 20
``````