README.md (17469B)
1 # squiggle.c 2 3 squiggle.c is a self-contained C99 library that provides functions for simple Monte Carlo estimation, inspired by [Squiggle](https://www.squiggle-language.com/). 4 5 ## Motivation 6 7 ### What am I trying to do here? 8 9 - I am trying to build a reliable alternative to the original squiggle, that works for me and addresses my frustrations with it. 10 - Some adjectives: [grug brain](https://grugbrain.dev/), [lindy](https://en.wikipedia.org/wiki/Lindy_effect), [suckless](https://suckless.org/) 11 - I am trying to make something that is simple enough that I and others can fully understand. squiggle.c is less than 700 lines of C, with a core of <230 lines. You, a somewhat technically sophisticated reader, could just read it and grasp its contents, and is encouraged to do so. 12 13 ### Why C? 14 15 - Because it is fast 16 - Because I enjoy it 17 - Because C is honest 18 - Because it will last long 19 - Because it can be made faster if need be, e.g., with a multi-threading library like OpenMP, by implementing faster but more complex algorithms, or more simply, by inlining the sampling functions (adding an `inline` directive before their function declaration) 20 - Because there are few abstractions between it and machine code (C => assembly => machine code with gcc, or C => machine code, with tcc), leading to fewer errors beyond the programmer's control. 21 - Because it can fit in my head 22 - Because if you can implement something in C, you can implement it anywhere else 23 24 ### Design choices 25 26 This code should aim to be correct, then simple, then fast. 27 28 - It should be correct. The user should be able to rely on it and not think about whether errors come from the library. 29 - Nonetheless, the user should understand the limitations of sampling-based methods. See the section on [Tests and the long tail of the lognormal](https://git.nunosempere.com/personal/squiggle.c/src/branch/master/FOLK_WISDOM.md#tests-and-the-long-tail-of-the-lognormal) for a discussion of how sampling is bad at capturing some aspects of distributions with long tails. 30 - It should be clear, conceptually simple. Simple for me to implement, simple for others to understand. 31 - It should be fast. But when speed conflicts with simplicity, choose simplicity. For example, there might be several possible algorithms to sample a distribution, each of which is faster over part of the domain. In that case, it's conceptually simpler to just pick one algorithm, and pay the—normally small—performance penalty. 32 - In any case, though, the code should still be *way faster* than, say, Python. 33 34 Note that being terse, or avoiding verbosity, is a non-goal. This is in part because of the constraints that C imposes. But it also aids with clarity and conceptual simplicity, as the issue of correlated samples illustrates in the next section. 35 36 ## Getting started 37 38 Download squiggle.c, for instance: 39 40 ```sh 41 $ rm -r -f squiggle_c 42 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle.c 43 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle.h 44 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle_more.c 45 $ wget https://git.nunosempere.com/personal/squiggle.c/raw/branch/master/squiggle_more.h 46 $ mkdir temp 47 $ mv squiggle* temp 48 $ mv temp squiggle_c 49 ``` 50 51 Write your model. For instance, your could replicate [this paper](https://arxiv.org/abs/1806.02404) as follows: 52 53 ```C 54 #include "squiggle_c/squiggle.h" 55 #include <math.h> 56 #include <stdint.h> 57 #include <stdio.h> 58 #include <stdlib.h> 59 60 double sample_fermi_logspace(uint64_t * seed) 61 { 62 // Replicate <https://arxiv.org/pdf/1806.02404.pdf>, and in particular the red line in page 11. 63 // You can see a simple version of this function in naive.c in this same folder 64 double log_rate_of_star_formation = sample_uniform(log(1), log(100), seed); 65 double log_fraction_of_stars_with_planets = sample_uniform(log(0.1), log(1), seed); 66 double log_number_of_habitable_planets_per_star_system = sample_uniform(log(0.1), log(1), seed); 67 68 double log_rate_of_life_formation_in_habitable_planets = sample_normal(1, 50, seed); 69 double log_fraction_of_habitable_planets_in_which_any_life_appears; 70 /* 71 Consider: 72 a = underlying normal 73 b = rate_of_life_formation_in_habitable_planets = exp(underlying normal) = exp(a) 74 c = 1 - exp(-b) = fraction_of_habitable_planets_in_which_any_life_appears 75 d = log(c) 76 77 Looking at the Taylor expansion for c = 1 - exp(-b), it's 78 b - b^2/2 + b^3/6 - x^b/24, etc. 79 <https://www.wolframalpha.com/input?i=1-exp%28-x%29> 80 When b ~ 0 (as is often the case), this is close to b. 81 82 But now, if b ~ 0, c ~ b 83 and d = log(c) ~ log(b) = log(exp(a)) = a 84 85 Now, we could play around with estimating errors, 86 and indeed if we want b^2/2 = exp(a)^2/2 < 10^(-n), i.e., to have n decimal digits of precision, 87 we could compute this as e.g., a < (nlog(10) + log(2))/2 88 so for example if we want ten digits of precision, that's a < -11 89 90 Empirically, the two numbers as calculated in C do become really close around 11 or so, 91 and at 38 that calculation results in a -inf (so probably a floating point error or similar.) 92 So we should be using that formula for somewhere between -38 << a < -11 93 94 I chose -16 as a happy medium after playing around with 95 double invert(double x){ 96 return log(1-exp(-exp(-x))); 97 } 98 for(int i=0; i<64; i++){ 99 double j = i; 100 printf("for %lf, log(1-exp(-exp(-x))) is calculated as... %lf\n", j, invert(j)); 101 } 102 and <https://www.wolframalpha.com/input?i=log%281-exp%28-exp%28-16%29%29%29> 103 */ 104 if (log_rate_of_life_formation_in_habitable_planets < -16) { 105 log_fraction_of_habitable_planets_in_which_any_life_appears = log_rate_of_life_formation_in_habitable_planets; 106 } else { 107 double rate_of_life_formation_in_habitable_planets = exp(log_rate_of_life_formation_in_habitable_planets); 108 double fraction_of_habitable_planets_in_which_any_life_appears = -expm1(-rate_of_life_formation_in_habitable_planets); 109 log_fraction_of_habitable_planets_in_which_any_life_appears = log(fraction_of_habitable_planets_in_which_any_life_appears); 110 } 111 112 double log_fraction_of_planets_with_life_in_which_intelligent_life_appears = sample_uniform(log(0.001), log(1), seed); 113 double log_fraction_of_intelligent_planets_which_are_detectable_as_such = sample_uniform(log(0.01), log(1), seed); 114 double log_longevity_of_detectable_civilizations = sample_uniform(log(100), log(10000000000), seed); 115 116 double log_n = 117 log_rate_of_star_formation + 118 log_fraction_of_stars_with_planets + 119 log_number_of_habitable_planets_per_star_system + 120 log_fraction_of_habitable_planets_in_which_any_life_appears + 121 log_fraction_of_planets_with_life_in_which_intelligent_life_appears + 122 log_fraction_of_intelligent_planets_which_are_detectable_as_such + 123 log_longevity_of_detectable_civilizations; 124 return log_n; 125 } 126 127 double sample_are_we_alone_logspace(uint64_t * seed) 128 { 129 double log_n = sample_fermi_logspace(seed); 130 return ((log_n > 0) ? 1 : 0); 131 // log_n > 0 => n > 1 132 } 133 134 int main() 135 { 136 137 // set randomness seed 138 uint64_t* seed = malloc(sizeof(uint64_t)); 139 *seed = 1010101; // xorshift can't start with a seed of 0 140 // Note: come back to choice of seed. 141 142 double logspace_fermi_proportion = 0; 143 int n_samples = 1000 * 1000; 144 for (int i = 0; i < n_samples; i++) { 145 double result = sample_are_we_alone_logspace(seed); 146 logspace_fermi_proportion += result; 147 } 148 double p_not_alone = logspace_fermi_proportion / n_samples; 149 printf("Probability that we are not alone: %lf (%.lf%%)\n", p_not_alone, p_not_alone * 100); 150 151 free(seed); 152 } 153 ``` 154 155 Compile and run: 156 157 ``` 158 $ gcc -O3 samples.c ./squiggle_c/squiggle.c ./squiggle_c/squiggle_more.c -lm -fopenmp -o samples 159 $ ./samples 160 ``` 161 162 ### Core strategy 163 164 The recommended strategy is to: 165 166 1. Define sampler functions, which take a seed, and return 1 sample 167 2. Compose those sampler functions to define your estimation model 168 3. Produce an array of samples from a sampler function 169 4. Get summary statistics for that array of samples. 170 171 ### More examples 172 173 You can follow some example usage in the [examples/](examples]) folder. In [examples/core](examples/core/), we build up some functionality, starting from drawing one sample and finishing with the replication of [Dissolving the Fermi paradox](https://arxiv.org/abs/1806.02404) above. In [examples/more](examples/more), we present a few more complicated examples, like finding confidence intervals, a model of nuclear war, an estimate of how much exercise to do to lose 10kg, or an example using parallelism. 174 175 ## Guarantees 176 177 The bad: 178 179 - I offer no guarantees about stability, correctness, performance, etc. I might, for instance, abandon the version in C and rewrite it in Zig, Nim, Rust, Go. 180 - This project mostly exists for my own usage & for my own amusement. 181 - Caution! Think carefully before using this project for anything important. 182 - If you wanted to pay me to provide some stability or correctness, guarantees, or to tweak this library for your own usage, or to teach you how to use it, you could do so [here](https://nunosempere.com/consulting). 183 - I am conflicted about parallelism. It *does* add more complexity, complexity that you can be bitten by if you are not careful and don't understand it. And this conflicts with the initial grug-brain motivation. At the same time, it is clever, and it is nice, and I like it a lot. 184 185 The good: 186 187 - You can vendor the code, i.e., save it as a dependency together with your other files. This way, this renders you immune to any changes I may make. 188 - I've been hacking at this project for a while now, and I think I have a good grasp of its correctness and limitations. I've tried Nim and Zig, and I prefer C so far. 189 - I think the core interface is not likely to change much, though I've recently changed the interface for parallelism and for getting confidence intervals. 190 - I am using this code for a few important consulting projects, and I trust myself to operate it correctly. 191 192 ## Functions and their usage 193 194 ### squiggle.c 195 196 `squiggle.c` should be pretty tightly scoped. Available functions are: 197 198 ```C 199 // Underlying pseudo-randomness function 200 uint64_t xorshift64(uint64_t* seed); 201 202 // Sampling functions 203 double sample_unit_uniform(uint64_t* seed); 204 double sample_unit_normal(uint64_t* seed); 205 double sample_uniform(double start, double end, uint64_t* seed); 206 double sample_normal(double mean, double sigma, uint64_t* seed); 207 double sample_lognormal(double logmean, double logsigma, uint64_t* seed); 208 double sample_normal_from_90_ci(double low, double high, uint64_t* seed); 209 double sample_to(double low, double high, uint64_t* seed); 210 double sample_gamma(double alpha, uint64_t* seed); 211 double sample_beta(double a, double b, uint64_t* seed); 212 double sample_laplace(double successes, double failures, uint64_t* seed); 213 214 // Array helpers 215 double array_sum(double* array, int length); 216 void array_cumsum(double* array_to_sum, double* array_cumsummed, int length); 217 double array_mean(double* array, int length); 218 double array_std(double* array, int length); 219 220 // Mixture function 221 double sample_mixture(double (*samplers[])(uint64_t*), double* weights, int n_dists, uint64_t* seed); 222 ``` 223 224 The samplers syntax for the mixture functions denotes that it takes an array of functions. You can use it as follows: 225 226 ```C 227 #include "squiggle.h" 228 229 double sample_0(uint64_t* seed) { UNUSED(seed); return 0; } 230 double sample_1(uint64_t* seed) { UNUSED(seed); return 1; } 231 double sample_few(uint64_t* seed) { return sample_to(1, 3, seed); } 232 double sample_many(uint64_t* seed) { return sample_to(2, 10, seed); } 233 234 double sample_model(uint64_t* seed){ 235 236 double p_a = 0.8; 237 double p_b = 0.5; 238 double p_c = p_a * p_b; 239 240 int n_dists = 4; 241 double weights[] = { 1 - p_c, p_c / 2, p_c / 4, p_c / 4 }; 242 double (*samplers[])(uint64_t*) = { sample_0, sample_1, sample_few, sample_many }; 243 double result = sample_mixture(samplers, weights, n_dists, seed); 244 245 return result; 246 } 247 ``` 248 249 ### squiggle_more.h 250 251 `squiggle_more.c` has expansions and convenience functions, which are more meandering. Expansions are in `squiggle_more.c` and `squiggle_more.h`. To use them, take care to link them: 252 253 254 ```C 255 #include "squiggle.h" 256 #include "squiggle_more.h" 257 ``` 258 259 ``` 260 # When compiling: 261 $ gcc -std=c99 -Wall -O3 example.c squiggle.c squiggle_more.c -lm -o ./example 262 263 ``` 264 265 Available definitions are as follows: 266 267 ```C 268 #define THOUSAND 1000 269 #define MILLION 1000000 270 271 /* Parallel sampling */ 272 void sampler_parallel(double (*sampler)(uint64_t* seed), double* results, int n_threads, int n_samples); 273 274 /* Stats */ 275 double array_get_median(double xs[], int n); 276 typedef struct ci_t { 277 double low; 278 double high; 279 } ci; 280 ci array_get_ci(ci interval, double* xs, int n); 281 ci array_get_90_ci(double xs[], int n); 282 283 void array_print_stats(double xs[], int n); 284 void array_print_histogram(double* xs, int n_samples, int n_bins); 285 void array_print_90_ci_histogram(double* xs, int n, int n_bins); 286 287 /* Algebra manipulations */ 288 289 typedef struct normal_params_t { 290 double mean; 291 double std; 292 } normal_params; 293 normal_params algebra_sum_normals(normal_params a, normal_params b); 294 295 typedef struct lognormal_params_t { 296 double logmean; 297 double logstd; 298 } lognormal_params; 299 lognormal_params algebra_product_lognormals(lognormal_params a, lognormal_params b); 300 301 lognormal_params convert_ci_to_lognormal_params(ci x); 302 ci convert_lognormal_params_to_ci(lognormal_params y); 303 ``` 304 305 On parallelism in particular, see the warnings and caveats in the [FOLK_WISDOM.md](./FOLK_WISDOM.md) file. That file also has many other nuggets, warnings, trinkets, caveats, pointers I've collected over time. 306 307 Here is an example of using parallelism, and then printing some stats and a histogram: 308 309 ```C 310 #include "../../../squiggle.h" 311 #include "../../../squiggle_more.h" 312 #include <stdio.h> 313 #include <stdlib.h> 314 315 double sample_beta_3_2(uint64_t* seed) { return sample_beta(3.0, 2.0, seed); } 316 317 int main() 318 { 319 // set randomness seed 320 uint64_t* seed = malloc(sizeof(uint64_t)); 321 *seed = 1000; // xorshift can't start with 0 322 323 int n_samples = 1 * MILLION; 324 double* xs = malloc(sizeof(double) * (size_t)n_samples); 325 sampler_parallel(sample_beta_3_2, xs, 16, n_samples); 326 327 printf("\n# Stats\n"); 328 array_print_stats(xs, n_samples); 329 printf("\n# Histogram\n"); 330 array_print_histogram(xs, n_samples, 23); 331 332 free(seed); 333 } 334 ``` 335 336 This produces the following output: 337 338 ``` 339 Avg: 0.600036 340 Std: 0.199851 341 5%: 0.249009 342 10%: 0.320816 343 25%: 0.456413 344 50%: 0.614356 345 75%: 0.757000 346 90%: 0.857256 347 95%: 0.902290 348 349 # Histogram 350 [ 0.00, 0.05): 391 351 [ 0.05, 0.09): █ 2352 352 [ 0.09, 0.13): ███ 5766 353 [ 0.13, 0.18): ██████ 10517 354 [ 0.18, 0.22): ██████████ 16412 355 [ 0.22, 0.26): ██████████████ 22773 356 [ 0.26, 0.31): ███████████████████ 30120 357 [ 0.31, 0.35): ████████████████████████ 37890 358 [ 0.35, 0.39): █████████████████████████████ 45067 359 [ 0.39, 0.44): █████████████████████████████████ 52174 360 [ 0.44, 0.48): ██████████████████████████████████████ 59636 361 [ 0.48, 0.52): ██████████████████████████████████████████ 64924 362 [ 0.52, 0.57): █████████████████████████████████████████████ 69832 363 [ 0.57, 0.61): ████████████████████████████████████████████████ 74099 364 [ 0.61, 0.65): █████████████████████████████████████████████████ 76776 365 [ 0.65, 0.70): ██████████████████████████████████████████████████ 77001 366 [ 0.70, 0.74): ████████████████████████████████████████████████ 75290 367 [ 0.74, 0.78): ██████████████████████████████████████████████ 71711 368 [ 0.78, 0.83): ██████████████████████████████████████████ 65576 369 [ 0.83, 0.87): ████████████████████████████████████ 56839 370 [ 0.87, 0.91): ████████████████████████████ 44626 371 [ 0.91, 0.96): ███████████████████ 29464 372 [ 0.96, 1.00]: ██████ 10764 373 ``` 374 375 ## Licensing 376 377 This project is released under the MIT license, a permissive open-source license. You can see it in the LICENSE.txt file. 378