# Folk Wisdom

The README.md file was getting too long and messy, so I moved some of the grumpier comments here.

### Nested functions and compilation with tcc

GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and increasing compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.

| GCC | tcc |
| --- | --- |
| slower compilation | faster compilation |
| allows nested functions | doesn't allow nested functions |
| faster execution | slower execution |

~~My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.~~

My previous recommendation was to use tcc for marginally faster iteration, but nested functions are just really nice. So my current recommendation is to use gcc throughout; keep in mind that modifying code to not use nested functions is relatively easy, so you can do that if you run the code in other environments.
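
For readers who haven't seen the extension, here is a minimal sketch of what a nested function looks like under gcc; the model body is a stand-in, not code from this repo:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    // gcc extension: a function defined inside main, which can
    // read variables from the enclosing scope, such as seed
    double sample_model(void)
    {
        return 42; // a real model would call e.g. sample_to(1, 10, seed)
    }

    printf("%f\n", sample_model());

    free(seed);
}
```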

### Correlated samples

In the original [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means:

```js
a = 1 to 10
b = 2 * a
c = b/a
c
```

Likewise in [squigglepy](https://github.com/rethinkpriorities/squigglepy):

```python
import squigglepy as sq

a = sq.to(1, 3)
b = 2 * a
c = b / a

c_samples = sq.sample(c, 10)

print(c_samples)
```

Should `c` be equal to `2`? Or should it be equal to 2 times the expected distribution of the ratio of two independent draws from `a` (`2 * a/a`, as it were)? You don't know, because you are not operating on samples, you are operating on magical objects whose internals are hidden from you.

In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:

```c
// correlated samples
// gcc -O3 correlated.c squiggle.c -lm -o correlated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = sample_to(1, 10, seed);
    double b = 2 * a;
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.607162, b: 1.214325, c: 0.500000

    free(seed);
}
```

vs

```c
// uncorrelated samples
// gcc -O3 uncorrelated.c ../../squiggle.c -lm -o uncorrelated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

double draw_xyz(uint64_t* seed){
    // function could also be placed inside main with gcc's nested functions extension.
    return sample_to(1, 20, seed);
}

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = draw_xyz(seed);
    double b = 2 * draw_xyz(seed);
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.522484, b: 10.283501, c: 19.681936

    free(seed);
}
```

Exercise for the reader: What possible meanings could the following represent in [squiggle](https://www.squiggle-language.com/playground?v=0.8.6#code=eNqrVkpJTUsszSlxzk9JVbJSys3M08jLL8pNzNEw0FEw1NRUUKoFAOYsC1c%3D)? How would you implement each of those meanings in squiggle.c?

```
min(normal(0, 1))
```

Hint: See examples/more/13_parallelize_min
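
For instance, one possible reading is "the minimum of n independent draws from normal(0, 1)". A minimal sketch of that reading, using `sample_normal` from squiggle.h and picking n = 1000 arbitrarily:

```c
#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

// one interpretation: the minimum of n independent draws from normal(0, 1)
double sample_min_of_normals(int n, uint64_t* seed)
{
    double min = sample_normal(0, 1, seed);
    for (int i = 1; i < n; i++) {
        double x = sample_normal(0, 1, seed);
        if (x < min) min = x;
    }
    return min;
}

int main()
{
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    printf("min of 1000 draws: %f\n", sample_min_of_normals(1000, seed));

    free(seed);
}
```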

### Note on sampling strategies

Right now, I am drawing samples using a random number generator. It requires some finesse, particularly when using parallelism. But it works fine.

But... what if we could do something more elegant, more ingenious? In particular, what if instead of drawing samples, we had a mesh of equally spaced points in the range of floats? Then we could, for a given number of samples, better estimate the, say, mean of the distribution we are trying to model...

The problem with that is that if we have some code like:

```C
// schematic code: i_mesh stands in for the seed argument,
// indexing into the mesh rather than seeding a random number generator
double model(...){
  double a = sample_to(1, 10, i_mesh++);
  double b = sample_to(1, 2, i_mesh);
  return a * b;
}
```

Then this doesn't work, because the values of `a` and `b` will be correlated: when `a` is high, `b` will also be high. What might work would be something like this:

```C
double* model(int n_samples){
  int side = (int)sqrt(n_samples);
  double* xs = malloc((size_t)n_samples * sizeof(double));
  for(int i_mesh=0; i_mesh < side; i_mesh++){
      for(int j_mesh=0; j_mesh < side; j_mesh++){
          double a = sample_to(1, 10, i_mesh);
          double b = sample_to(1, 2, j_mesh);
          xs[i_mesh * side + j_mesh] = a * b;
      }
  }
  return xs;
}
```

But that requires us to encode the shape of the model into the sampling function. It leads to an ugly nesting of for loops. It is a more complex approach. It is not [grug-brained](https://grugbrain.dev/). So every now and then I have to remember that this is not the way.

### Tests and the long tail of the lognormal

Distribution functions can be tested with:

```sh
cd tests
make && make run
```

`make verify` is an alias that runs all the tests and just displays the ones that are failing.

These tests are somewhat rudimentary: they get between 1M and 10M samples from a given sampling function, and check that their mean and standard deviation correspond to what they theoretically should be.
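
The shape of such a test, as a minimal sketch; the 0.5% tolerance matches the failures discussed below, but the sample count and the inlined mean computation are illustrative choices, not copied from tests/:

```c
#include "squiggle.h"
#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    int n = 1000 * 1000; // 1M samples
    double expected_mean = 5, std = 2;

    double sum = 0;
    for (int i = 0; i < n; i++) {
        sum += sample_normal(expected_mean, std, seed);
    }
    double mean = sum / n;

    // pass iff the sample mean is within 0.5% of the analytical mean
    double relative_delta = (mean - expected_mean) / expected_mean;
    if (fabs(relative_delta) < 0.005) {
        printf("[x] Mean test for normal(%f, %f) PASSED\n", expected_mean, std);
    } else {
        printf("[-] Mean test for normal(%f, %f) NOT passed.\n", expected_mean, std);
    }

    free(seed);
}
```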

If you run `make run` (or `make verify`), you will see errors such as these:

```
[-] Mean test for normal(47211.047473, 682197.019012) NOT passed.
Mean of normal(47211.047473, 682197.019012): 46933.673278, vs expected mean: 47211.047473
delta: -277.374195, relative delta: -0.005910

[-] Std test for lognormal(4.584666, 2.180816) NOT passed.
Std of lognormal(4.584666, 2.180816): 11443.588861, vs expected std: 11342.434900
delta: 101.153961, relative delta: 0.008839

[-] Std test for to(13839.861856, 897828.354318) NOT passed.
Std of to(13839.861856, 897828.354318): 495123.630575, vs expected std: 498075.002499
delta: -2951.371925, relative delta: -0.005961
```

These tests I wouldn't worry about. Due to luck of the draw, their relative error is a bit over 0.005, or 0.5%, and so the tests fail. But it would surprise me if that had any meaningful practical implication.

The errors that should raise some worry are:

```
[-] Mean test for lognormal(1.210013, 4.766882) NOT passed.
Mean of lognormal(1.210013, 4.766882): 342337.257677, vs expected mean: 288253.061628
delta: 54084.196049, relative delta: 0.157985
[-] Std test for lognormal(1.210013, 4.766882) NOT passed.
Std of lognormal(1.210013, 4.766882): 208107782.972184, vs expected std: 24776840217.604111
delta: -24568732434.631927, relative delta: -118.057730

[-] Mean test for lognormal(-0.195240, 4.883106) NOT passed.
Mean of lognormal(-0.195240, 4.883106): 87151.733198, vs expected mean: 123886.818303
delta: -36735.085104, relative delta: -0.421507
[-] Std test for lognormal(-0.195240, 4.883106) NOT passed.
Std of lognormal(-0.195240, 4.883106): 33837426.331671, vs expected std: 18657000192.914921
delta: -18623162766.583248, relative delta: -550.371727

[-] Mean test for lognormal(0.644931, 4.795860) NOT passed.
Mean of lognormal(0.644931, 4.795860): 125053.904456, vs expected mean: 188163.894101
delta: -63109.989645, relative delta: -0.504662
[-] Std test for lognormal(0.644931, 4.795860) NOT passed.
Std of lognormal(0.644931, 4.795860): 39976300.711166, vs expected std: 18577298706.170452
delta: -18537322405.459286, relative delta: -463.707799
```

What is happening in this case is that you are taking a normal, like `normal(-0.195240, 4.883106)`, and you are exponentiating it to arrive at a lognormal. But `normal(-0.195240, 4.883106)` is going to have some non-negligible weight on, say, 18. And `exp(18) ≈ 65,659,969`, so points like it contribute a nontrivial amount to the analytical mean and standard deviation, even though they have so little probability mass that they are rarely sampled.
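
For reference, the expected values in those test failures follow from the standard lognormal formulas, mean = exp(μ + σ²/2) and std = mean · sqrt(exp(σ²) − 1). A quick check for `lognormal(-0.195240, 4.883106)`:

```c
#include <math.h>
#include <stdio.h>

int main()
{
    double mu = -0.195240, sigma = 4.883106;
    // analytical lognormal moments
    double mean = exp(mu + sigma * sigma / 2);        // ≈ 123886.8, the expected mean above
    double std = mean * sqrt(exp(sigma * sigma) - 1); // ≈ 1.8657e10, the expected std above
    printf("mean: %f, std: %f\n", mean, std);
}
```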

The reader can also check that for more plausible real-world values, like those from fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't egregious:

```
[x] Mean test for to(10.000000, 10000.000000) PASSED
[-] Std test for to(10.000000, 10000.000000) NOT passed.
Std of to(10.000000, 10000.000000): 23578.091775, vs expected std: 25836.381819
delta: -2258.290043, relative delta: -0.095779
```

Overall, I would caution that if you really care about the very far tails of distributions, you might want to instead use tools which can do some of the analytical manipulations for you, like the original Squiggle, Simple Squiggle (both linked below), or even doing lognormal multiplication by hand, relying on the fact that two lognormals multiplied together result in another lognormal with known shape: `lognormal(m1, s1) * lognormal(m2, s2) = lognormal(m1 + m2, sqrt(s1^2 + s2^2))`.

In fact, squiggle.c does have a few functions for algebraic manipulations of simple distributions at the end of squiggle.c. But these are pretty rudimentary, and I don't know whether I'll end up expanding or deleting them.
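
As a sketch of that by-hand multiplication (the struct and function names here are made up for illustration; squiggle.c's own helpers may be named differently):

```c
#include <math.h>

typedef struct { double logmean; double logstd; } ln_params;

// lognormal(m1, s1) * lognormal(m2, s2) = lognormal(m1 + m2, sqrt(s1^2 + s2^2)),
// because multiplying the lognormal variables adds their underlying normals
ln_params multiply_lognormals(ln_params a, ln_params b)
{
    ln_params result = {
        .logmean = a.logmean + b.logmean,
        .logstd = sqrt(a.logstd * a.logstd + b.logstd * b.logstd),
    };
    return result;
}
```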

### Compiler warnings

#### Harsh compilation

By default, I've enabled `-Wall -Wextra -Wdouble-promotion -Wconversion`. However, these produce some false positive warnings, which I've dealt with through:

- For conversion: explicit casts, particularly from int to size_t when calling malloc.
- For dealing with unused variables: using an UNUSED macro, as sketched after this list. If you don't like that approach, you could add `-Wno-unused-parameter` to your makefile and remove the macro and its usage.
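
A minimal sketch of that pattern; this is the common idiom, though squiggle.c's actual definition may differ:

```c
#include <stdint.h>

#define UNUSED(x) (void)(x)

// example: silence -Wextra's unused-parameter warning
// for a sampler that ignores its seed
double sample_constant(uint64_t* seed)
{
    UNUSED(seed);
    return 42;
}
```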

Some resources on compiler flags: [1](https://nullprogram.com/blog/2023/04/29/), [2](https://news.ycombinator.com/item?id=7371806)

#### Results of running clang-tidy

clang-tidy is a utility to detect common errors in C/C++. You can run it with:

```
make tidy
```

So far in the history of this program it has emitted:
- one false-positive warning about an issue I'd already taken care of (so I've suppressed the warning)
- a warning about an unused variable

I think this is good news in terms of making me more confident that this simple library is correct :).

### Boundaries between sampling functions and arrays of samples

In squiggle.c, the boundary between working with sampler functions and arrays of samples is clear. Not so in the original squiggle, which hides this distinction from the user in the interest of accessibility.
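
To make the distinction concrete, a small sketch of the two styles side by side; `sample_to` is from squiggle.h, but the helper names are mine:

```c
#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>

// a sampler function: consumes randomness, returns one draw
double sampler(uint64_t* seed)
{
    return sample_to(1, 10, seed);
}

// an array of samples: materialized explicitly, by calling the sampler n times
double* draw_n(int n, uint64_t* seed)
{
    double* xs = malloc((size_t)n * sizeof(double));
    for (int i = 0; i < n; i++) {
        xs[i] = sampler(seed);
    }
    return xs;
}
```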

### Parallelism

I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine.

But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:

- If you run the `sampler_parallel` function twice, you will get the same result (see the sketch after this list). Why?
- If you run the `sampler_parallel` function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
- For a small number of samples, if you run the `sampler_parallel` function, you will get better spread out random numbers than if you run things serially. Why?
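
A sketch of the first gotcha; I'm assuming `sampler_parallel` takes a sampler, a results array, a number of threads, and a number of samples, as in squiggle_more.h, so double-check the header if it has changed:

```c
// compile with something like:
// gcc -O3 gotcha.c squiggle.c squiggle_more.c -lm -fopenmp -o gotcha
#include "squiggle.h"
#include "squiggle_more.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

double sampler(uint64_t* seed)
{
    return sample_lognormal(0, 1, seed);
}

int main()
{
    int n_samples = 1000 * 1000, n_threads = 16;
    double* results = malloc((size_t)n_samples * sizeof(double));

    sampler_parallel(sampler, results, n_threads, n_samples);
    double first = results[0];

    sampler_parallel(sampler, results, n_threads, n_samples);
    // both runs produce identical samples, because sampler_parallel
    // sets its seeds internally and deterministically
    printf("first run: %f, second run: %f\n", first, results[0]);

    free(results);
}
```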

That said, I found adding parallelism to be an interesting and engaging task. Most recently, I even optimized the code to ensure that two threads weren't accessing the same cache line at the same time, and it was very satisfying to see a 30% improvement as a result.

### Using arbitrary cdfs

The last commit that has code to sample from arbitrary cdfs is `8f6919fa2...`. You can access it with `git checkout 8f6919fa2`. I removed that code because I wasn't using it, and it didn't really fit with the overall ethos of the project.

### Other gotchas

- Even though the C standard is ambiguous about this, this code assumes that doubles are 64-bit precision (otherwise the xorshift code should be different); a quick check for this is sketched below.
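
If you're porting to an exotic platform, one way to check that assumption at compile time; `_Static_assert` is C11 rather than C99, so treat this as an optional extra:

```c
#include <stdint.h>

// fail compilation if doubles aren't 64 bits wide
_Static_assert(sizeof(double) == sizeof(uint64_t),
               "the xorshift-based sampling code assumes 64-bit doubles");
```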

### Consider this code

Consider sampling from the unit uniform in this manner:

```c
#include "../squiggle.h"
#include "../squiggle_more.h"
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    // Replicate <https://arxiv.org/pdf/1806.02404.pdf>, and in particular the red line on page 11.
    // Could also be interesting to just produce and save many samples.

    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = UINT64_MAX/64; // xorshift can't start with a seed of 0

    int n_samples = 100*MILLION;
    int p_sixteenth = 0;
    int p_eighth = 0;
    int p_quarter = 0;
    int p_half = 0;
    double sample;
    for(int i=0; i<n_samples; i++){
        sample = sample_unit_uniform(seed);
        // printf("%lf\n", sample);
        if (sample < 1.0/16.0){
            p_sixteenth++;
            p_eighth++;
            p_quarter++;
            p_half++;
        } else if(sample < 0.125){
            p_eighth++;
            p_quarter++;
            p_half++;
        } else if(sample < 0.25){
            p_quarter++;
            p_half++;
        } else if(sample < 0.5){
            p_half++;
        } else {
            // printf("Sample > 0.5\n");
        }
    }
    printf("p_16th: %lf; p_eighth: %lf; p_quarter: %lf; p_half: %lf\n", ((double)p_sixteenth)/n_samples, (double)p_eighth/n_samples, (double)p_quarter/n_samples, (double)p_half/n_samples);

    free(seed);
}
```

What will be printed out? In particular, consider that floating-point numbers are not equally dense everywhere in [0,1].

<details style="border-style: solid; border-width: 2px; padding-top: 20px; padding-left: 20px; padding-right: 20px; margin-bottom: 20px;">
<summary>Click on the arrow to see the answer</summary>
p_sixteenth will be ~0.0625, p_eighth will be ~0.125, p_quarter will be ~0.25, and p_half will be ~0.5. This is because these random numbers are produced by generating 64-bit integers and then dividing by the maximum 64-bit integer. There may be additional gotchas here, but this at least ensures that intervals of the same length in [0,1] will receive the same expected number of samples. It's just that samples closer to 0 will be represented with more precision.
</details>

## Related projects

- [Squiggle](https://www.squiggle-language.com/)
- [SquigglePy](https://github.com/rethinkpriorities/squigglepy)
- [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/)
- [time to BOTEC](https://github.com/NunoSempere/time-to-botec)
- [Find a beta distribution that fits your desired confidence interval](https://nunosempere.com/blog/2023/03/15/fit-beta/)