README.md - squiggle.c - Self-contained Monte Carlo estimation in C99

README.md (17469B)
      1 # squiggle.c 
      2 
      3 squiggle.c is a self-contained C99 library that provides functions for simple Monte Carlo estimation, inspired by [Squiggle](https://www.squiggle-language.com/).
      4 
      5 ## Motivation
      6 
      7 ### What am I trying to do here? 
      8 
      9 - I am trying to build a reliable alternative to the original squiggle, that works for me and addresses my frustrations with it.
     10 - Some adjectives: [grug brain](https://grugbrain.dev/), [lindy](https://en.wikipedia.org/wiki/Lindy_effect), [suckless](https://suckless.org/)
     11 - I am trying to make something that is simple enough that I and others can fully understand. squiggle.c is less than 700 lines of C, with a core of <230 lines. You, a somewhat technically sophisticated reader, could just read it and grasp its contents, and is encouraged to do so.
     12 
     13 ### Why C? 
     14 
     15 - Because it is fast
     16 - Because I enjoy it
     17 - Because C is honest
     18 - Because it will last long 
     19 - Because it can be made faster if need be, e.g., with a multi-threading library like OpenMP, by implementing faster but more complex algorithms, or more simply, by inlining the sampling functions (adding an `inline` directive before their function declaration)
     20 - Because there are few abstractions between it and machine code (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control.
     21 - Because it can fit in my head
     22 - Because if you can implement something in C, you can implement it anywhere else
     23 
     24 ### Design choices
     25 
     26 This code should aim to be correct, then simple, then fast.
     27 
     28 - It should be correct. The user should be able to rely on it and not think about whether errors come from the library.
     29   - Nonetheless, the user should understand the limitations of sampling-based methods. See the section on [Tests and the long tail of the lognormal](https://git.nunosempere.com/personal/squiggle.c/src/branch/master/FOLK_WISDOM.md#tests-and-the-long-tail-of-the-lognormal) for a discussion of how sampling is bad at capturing some aspects of distributions with long tails.
     30 - It should be clear, conceptually simple. Simple for me to implement, simple for others to understand.
     31 - It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. 
     32   - In any case, though, the code should still be *way faster* than, say, Python.
     33 
     34 Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section.
     35 
     36 ## Getting started 
     37 
     38 Download squiggle.c, for instance: 
     39 
     40 ```sh
     41 $ rm -r -f squiggle_c
     42 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle.c
     43 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle.h
     44 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle_more.c
     45 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle_more.h
     46 $ mkdir temp
     47 $ mv squiggle* temp
     48 $ mv temp squiggle_c
     49 ```
     50 
     51 Write your model. For instance, your could replicate [this paper](https://arxiv.org/abs/1806.02404) as follows:
     52 
     53 ```C
     54 #include "squiggle_c/squiggle.h"
     55 #include <math.h>
     56 #include <stdint.h>
     57 #include <stdio.h>
     58 #include <stdlib.h>
     59 
     60 double sample_fermi_logspace(uint64_t * seed)
     61 {
     62     // Replicate <https://arxiv.org/pdf/1806.02404.pdf>, and in particular the red line in page 11.
     63     // You can see a simple version of this function in naive.c in this same folder
     64     double log_rate_of_star_formation = sample_uniform(log(1), log(100), seed);
     65     double log_fraction_of_stars_with_planets = sample_uniform(log(0.1), log(1), seed);
     66     double log_number_of_habitable_planets_per_star_system = sample_uniform(log(0.1), log(1), seed);
     67 
     68     double log_rate_of_life_formation_in_habitable_planets = sample_normal(1, 50, seed);
     69     double log_fraction_of_habitable_planets_in_which_any_life_appears;
     70     /* 
     71         Consider:
     72         a = underlying normal 
     73         b = rate_of_life_formation_in_habitable_planets = exp(underlying normal) = exp(a)
     74         c = 1 - exp(-b) = fraction_of_habitable_planets_in_which_any_life_appears
     75         d = log(c)
     76 
     77         Looking at the Taylor expansion for c = 1 - exp(-b), it's 
     78         b - b^2/2 + b^3/6 - x^b/24, etc.
     79         <https://www.wolframalpha.com/input?i=1-exp%28-x%29>
     80         When b ~ 0 (as is often the case), this is close to b.
     81 
     82         But now, if b ~ 0, c ~ b
     83         and d = log(c) ~ log(b) = log(exp(a)) = a
     84 
     85         Now, we could play around with estimating errors,
     86         and indeed if we want b^2/2 = exp(a)^2/2 < 10^(-n), i.e., to have n decimal digits of precision,
     87         we could compute this as e.g., a < (nlog(10) + log(2))/2
     88         so for example if we want ten digits of precision, that's a < -11
     89         
     90         Empirically, the two numbers as calculated in C do become really close around 11 or so, 
     91         and at 38 that calculation results in a -inf (so probably a floating point error or similar.)
     92         So we should be using that formula for somewhere between -38 << a < -11
     93 
     94         I chose -16 as a happy medium after playing around with 
     95         double invert(double x){
     96             return log(1-exp(-exp(-x)));
     97         }
     98         for(int i=0; i<64; i++){
     99             double j = i;
    100             printf("for %lf, log(1-exp(-exp(-x))) is calculated as... %lf\n", j, invert(j));
    101         }
    102         and <https://www.wolframalpha.com/input?i=log%281-exp%28-exp%28-16%29%29%29>
    103     */
    104     if (log_rate_of_life_formation_in_habitable_planets < -16) {
    105         log_fraction_of_habitable_planets_in_which_any_life_appears = log_rate_of_life_formation_in_habitable_planets;
    106     } else {
    107         double rate_of_life_formation_in_habitable_planets = exp(log_rate_of_life_formation_in_habitable_planets);
    108         double fraction_of_habitable_planets_in_which_any_life_appears = -expm1(-rate_of_life_formation_in_habitable_planets);
    109         log_fraction_of_habitable_planets_in_which_any_life_appears = log(fraction_of_habitable_planets_in_which_any_life_appears);
    110     }
    111 
    112     double log_fraction_of_planets_with_life_in_which_intelligent_life_appears = sample_uniform(log(0.001), log(1), seed);
    113     double log_fraction_of_intelligent_planets_which_are_detectable_as_such = sample_uniform(log(0.01), log(1), seed);
    114     double log_longevity_of_detectable_civilizations = sample_uniform(log(100), log(10000000000), seed);
    115 
    116     double log_n = 
    117         log_rate_of_star_formation + 
    118         log_fraction_of_stars_with_planets + 
    119         log_number_of_habitable_planets_per_star_system + 
    120         log_fraction_of_habitable_planets_in_which_any_life_appears + 
    121         log_fraction_of_planets_with_life_in_which_intelligent_life_appears + 
    122         log_fraction_of_intelligent_planets_which_are_detectable_as_such + 
    123         log_longevity_of_detectable_civilizations;
    124     return log_n;
    125 }
    126 
    127 double sample_are_we_alone_logspace(uint64_t * seed)
    128 {
    129     double log_n = sample_fermi_logspace(seed);
    130     return ((log_n > 0) ? 1 : 0);
    131     // log_n > 0 => n > 1
    132 }
    133 
    134 int main()
    135 {
    136 
    137     // set randomness seed
    138     uint64_t* seed = malloc(sizeof(uint64_t));
    139     *seed = 1010101; // xorshift can't start with a seed of 0
    140     // Note: come back to choice of seed.
    141 
    142     double logspace_fermi_proportion = 0;
    143     int n_samples = 1000 * 1000;
    144     for (int i = 0; i < n_samples; i++) {
    145         double result = sample_are_we_alone_logspace(seed);
    146         logspace_fermi_proportion += result;
    147     }
    148     double p_not_alone = logspace_fermi_proportion / n_samples;
    149     printf("Probability that we are not alone: %lf (%.lf%%)\n", p_not_alone, p_not_alone * 100);
    150 
    151     free(seed);
    152 }
    153 ```
    154 
    155 Compile and run:
    156 
    157 ```
    158 $ gcc -O3 samples.c ./squiggle_c/squiggle.c  ./squiggle_c/squiggle_more.c -lm -fopenmp -o samples
    159 $ ./samples
    160 ```
    161 
    162 ### Core strategy
    163 
    164 The recommended strategy is to:
    165 
    166 1. Define sampler functions, which take a seed, and return 1 sample
    167 2. Compose those sampler functions to define your estimation model
    168 3. Produce an array of samples from a sampler function
    169 4. Get summary statistics for that array of samples.
    170 
    171 ### More examples
    172 
    173 You can follow some example usage in the [examples/](examples]) folder. In [examples/core](examples/core/), we build up some functionality, starting from drawing one sample and finishing with the replication of [Dissolving the Fermi paradox](https://arxiv.org/abs/1806.02404) above. In [examples/more](examples/more), we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism.
    174 
    175 ## Guarantees
    176 
    177 The bad:
    178 
    179 - I offer no guarantees about stability, correctness, performance, etc. I might, for instance, abandon the version in C and rewrite it in Zig, Nim, Rust, Go.
    180 - This project mostly exists for my own usage & for my own amusement.
    181 - Caution! Think carefully before using this project for anything important.
    182 - If you wanted to pay me to provide some stability or correctness, guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so [here](https://nunosempere.com/consulting).
    183 - I am conflicted about parallelism. It *does* add more complexity, complexity that you can be bitten by if you are not careful and don't understand it. And this conflicts with the initial grug-brain motivation. At the same time, it is clever, and it is nice, and I like it a lot.
    184 
    185 The good: 
    186 
    187 - You can vendor the code, i.e., save it as a dependency together with your other files. This way, this renders you immune to any changes I may make. 
    188 - I've been hacking at this project for a while now, and I think I have a good grasp of its correctness and limitations. I've tried Nim and Zig, and I prefer C so far.
    189 - I think the core interface is not likely to change much, though I've recently changed the interface for parallelism and for getting confidence intervals.
    190 - I am using this code for a few important consulting projects, and I trust myself to operate it correctly.
    191 
    192 ## Functions and their usage
    193 
    194 ### squiggle.c
    195 
    196 `squiggle.c` should be pretty tightly scoped. Available functions are:
    197 
    198 ```C
    199 // Underlying pseudo-randomness function
    200 uint64_t xorshift64(uint64_t* seed);
    201 
    202 // Sampling functions
    203 double sample_unit_uniform(uint64_t* seed);
    204 double sample_unit_normal(uint64_t* seed);
    205 double sample_uniform(double start, double end, uint64_t* seed);
    206 double sample_normal(double mean, double sigma, uint64_t* seed);
    207 double sample_lognormal(double logmean, double logsigma, uint64_t* seed);
    208 double sample_normal_from_90_ci(double low, double high, uint64_t* seed);
    209 double sample_to(double low, double high, uint64_t* seed);
    210 double sample_gamma(double alpha, uint64_t* seed);
    211 double sample_beta(double a, double b, uint64_t* seed);
    212 double sample_laplace(double successes, double failures, uint64_t* seed);
    213 
    214 // Array helpers
    215 double array_sum(double* array, int length);
    216 void array_cumsum(double* array_to_sum, double* array_cumsummed, int length);
    217 double array_mean(double* array, int length);
    218 double array_std(double* array, int length);
    219 
    220 // Mixture function
    221 double sample_mixture(double (*samplers[])(uint64_t*), double* weights, int n_dists, uint64_t* seed);
    222 ```
    223 
    224 The samplers syntax for the mixture functions denotes that it takes an array of functions. You can use it as follows:
    225 
    226 ```C
    227 #include "squiggle.h"
    228 
    229 double sample_0(uint64_t* seed) { UNUSED(seed); return 0; }
    230 double sample_1(uint64_t* seed) { UNUSED(seed); return 1; }
    231 double sample_few(uint64_t* seed) { return sample_to(1, 3, seed); }
    232 double sample_many(uint64_t* seed) { return sample_to(2, 10, seed); }
    233 
    234 double sample_model(uint64_t* seed){
    235 
    236     double p_a = 0.8;
    237     double p_b = 0.5;
    238     double p_c = p_a * p_b;
    239 
    240     int n_dists = 4;
    241     double weights[] = { 1 - p_c, p_c / 2, p_c / 4, p_c / 4 };
    242     double (*samplers[])(uint64_t*) = { sample_0, sample_1, sample_few, sample_many };
    243     double result = sample_mixture(samplers, weights, n_dists, seed);
    244 
    245     return result;
    246 }
    247 ```
    248 
    249 ### squiggle_more.h
    250 
    251 `squiggle_more.c` has expansions and convenience functions, which are more meandering. Expansions are in `squiggle_more.c` and `squiggle_more.h`. To use them, take care to link them:
    252 
    253 
    254 ```C
    255 #include "squiggle.h"
    256 #include "squiggle_more.h"
    257 ```
    258 
    259 ```
    260 # When compiling:
    261 $ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example
    262 
    263 ```
    264 
    265 Available definitions are as follows:
    266 
    267 ```C
    268 #define THOUSAND 1000
    269 #define MILLION 1000000
    270 
    271 /* Parallel sampling */
    272 void sampler_parallel(double (*sampler)(uint64_t* seed), double* results, int n_threads, int n_samples);
    273 
    274 /* Stats */
    275 double array_get_median(double xs[], int n);
    276 typedef struct ci_t {
    277     double low;
    278     double high;
    279 } ci;
    280 ci array_get_ci(ci interval, double* xs, int n);
    281 ci array_get_90_ci(double xs[], int n);
    282 
    283 void array_print_stats(double xs[], int n);
    284 void array_print_histogram(double* xs, int n_samples, int n_bins);
    285 void array_print_90_ci_histogram(double* xs, int n, int n_bins);
    286 
    287 /* Algebra manipulations */
    288 
    289 typedef struct normal_params_t {
    290     double mean;
    291     double std;
    292 } normal_params;
    293 normal_params algebra_sum_normals(normal_params a, normal_params b);
    294 
    295 typedef struct lognormal_params_t {
    296     double logmean;
    297     double logstd;
    298 } lognormal_params;
    299 lognormal_params algebra_product_lognormals(lognormal_params a, lognormal_params b);
    300 
    301 lognormal_params convert_ci_to_lognormal_params(ci x);
    302 ci convert_lognormal_params_to_ci(lognormal_params y);
    303 ```
    304 
    305 On parallelism in particular, see the warnings and caveats in the [FOLK_WISDOM.md](./FOLK_WISDOM.md) file. That file also has many other nuggets, warnings, trinkets, caveats, pointers I've collected over time. 
    306 
    307 Here is an example of using parallelism, and then printing some stats and a histogram:
    308 
    309 ```C
    310 #include "../../../squiggle.h"
    311 #include "../../../squiggle_more.h"
    312 #include <stdio.h>
    313 #include <stdlib.h>
    314 
    315 double sample_beta_3_2(uint64_t* seed) { return sample_beta(3.0, 2.0, seed); }
    316 
    317 int main()
    318 {
    319     // set randomness seed
    320     uint64_t* seed = malloc(sizeof(uint64_t));
    321     *seed = 1000; // xorshift can't start with 0
    322 
    323     int n_samples = 1 * MILLION;
    324     double* xs = malloc(sizeof(double) * (size_t)n_samples);
    325     sampler_parallel(sample_beta_3_2, xs, 16, n_samples);
    326 
    327     printf("\n# Stats\n");
    328     array_print_stats(xs, n_samples);
    329     printf("\n# Histogram\n");
    330     array_print_histogram(xs, n_samples, 23);
    331 
    332     free(seed);
    333 }
    334 ```
    335 
    336 This produces the following output:
    337 
    338 ```
    339 Avg: 0.600036
    340 Std: 0.199851
    341  5%: 0.249009
    342 10%: 0.320816
    343 25%: 0.456413
    344 50%: 0.614356
    345 75%: 0.757000
    346 90%: 0.857256
    347 95%: 0.902290
    348 
    349 # Histogram
    350 [  0.00,   0.05):  391
    351 [  0.05,   0.09): █ 2352
    352 [  0.09,   0.13): ███ 5766
    353 [  0.13,   0.18): ██████ 10517
    354 [  0.18,   0.22): ██████████ 16412
    355 [  0.22,   0.26): ██████████████ 22773
    356 [  0.26,   0.31): ███████████████████ 30120
    357 [  0.31,   0.35): ████████████████████████ 37890
    358 [  0.35,   0.39): █████████████████████████████ 45067
    359 [  0.39,   0.44): █████████████████████████████████ 52174
    360 [  0.44,   0.48): ██████████████████████████████████████ 59636
    361 [  0.48,   0.52): ██████████████████████████████████████████ 64924
    362 [  0.52,   0.57): █████████████████████████████████████████████ 69832
    363 [  0.57,   0.61): ████████████████████████████████████████████████ 74099
    364 [  0.61,   0.65): █████████████████████████████████████████████████ 76776
    365 [  0.65,   0.70): ██████████████████████████████████████████████████ 77001
    366 [  0.70,   0.74): ████████████████████████████████████████████████ 75290
    367 [  0.74,   0.78): ██████████████████████████████████████████████ 71711
    368 [  0.78,   0.83): ██████████████████████████████████████████ 65576
    369 [  0.83,   0.87): ████████████████████████████████████ 56839
    370 [  0.87,   0.91): ████████████████████████████ 44626
    371 [  0.91,   0.96): ███████████████████ 29464
    372 [  0.96,   1.00]: ██████ 10764
    373 ```
    374 
    375 ## Licensing
    376 
    377 This project is released under the MIT license, a permissive open-source license. You can see it in the LICENSE.txt file. 
    378
	squiggle.c Self-contained Monte Carlo estimation in C99
	Log \| Files \| Refs \| README