# Folk Wisdom

The README.md file was getting too long and messy, so I moved some of the grumpier comments here.

### Nested functions and compilation with tcc

GCC has an extension which allows a program to define a function inside another function. This makes squiggle.c code more linear and nicer to read, at the cost of becoming dependent on GCC and hence sacrificing portability and increasing compilation times. Conversely, compiling with tcc (tiny c compiler) is almost instantaneous, but leads to longer execution times and doesn't allow for nested functions.

| GCC | tcc |
| --- | --- |
| slower compilation | faster compilation |
| allows nested functions | doesn't allow nested functions |
| faster execution | slower execution |

~~My recommendation would be to use tcc while drawing a small number of samples for fast iteration, and then using gcc for the final version with lots of samples, and possibly with nested functions for ease of reading by others.~~

My previous recommendation was to use tcc for marginally faster iteration, but nested functions are just really nice. So my current recommendation is to use gcc throughout. That said, modifying code to not use nested functions is relatively easy, so you can always do that if you need to run in other environments.

### Correlated samples

In the original [squiggle](https://www.squiggle-language.com/) language, there is some ambiguity about what this code means:

```js
a = 1 to 10
b = 2 * a
c = b/a
c
```

Likewise in [squigglepy](https://github.com/rethinkpriorities/squigglepy):

```python
import squigglepy as sq
import numpy as np

a = sq.to(1, 3)
b = 2 * a
c = b / a

c_samples = sq.sample(c, 10)

print(c_samples)
```

Should `c` be equal to `2`? Or should it be equal to 2 times the expected distribution of the ratio of two independent draws from a (`2 * a/a`, as it were)?
You don't know, because you are not operating on samples; you are operating on magical objects whose internals are hidden from you.

In squiggle.c, this ambiguity doesn't exist, at the cost of much greater overhead & verbosity:

```c
// correlated samples
// gcc -O3 correlated.c squiggle.c -lm -o correlated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = sample_to(1, 10, seed);
    double b = 2 * a;
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.607162, b: 1.214325, c: 2.000000

    free(seed);
}
```

vs

```c
// uncorrelated samples
// gcc -O3 uncorrelated.c ../../squiggle.c -lm -o uncorrelated

#include "squiggle.h"
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

double draw_xyz(uint64_t* seed){
    // function could also be placed inside main with gcc's nested functions extension.
    return sample_to(1, 20, seed);
}

int main(){
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = 1000; // xorshift can't start with a seed of 0

    double a = draw_xyz(seed);
    double b = 2 * draw_xyz(seed);
    double c = b / a;

    printf("a: %f, b: %f, c: %f\n", a, b, c);
    // a: 0.522484, b: 10.283501, c: 19.681936

    free(seed);
}
```

Exercise for the reader: What possible meanings could the following represent in [squiggle](https://www.squiggle-language.com/playground?v=0.8.6#code=eNqrVkpJTUsszSlxzk9JVbJSys3M08jLL8pNzNEw0FEw1NRUUKoFAOYsC1c%3D)? How would you implement each of those meanings in squiggle.c?
```
min(normal(0, 1))
```

Hint: See examples/more/13_parallelize_min

### Note on sampling strategies

Right now, I am drawing samples using a random number generator. It requires some finesse, particularly when using parallelism. But it works fine.

But... what if we could do something more elegant, more ingenious? In particular, what if instead of drawing samples, we had a mesh of equally spaced points in the range of floats? Then we could, for a given number of samples, better estimate, say, the mean of the distribution we are trying to model...

The problem with that is that if we have some code like:

```c
double model(...){
    double a = sample_to(1, 10, i_mesh++);
    double b = sample_to(1, 2, i_mesh);
    return a * b;
}
```

then this doesn't work, because the values of a and b will be correlated: when a is high, b will also be high. What might work would be something like this:

```c
double* model(int n_samples){
    double* xs = malloc((size_t)n_samples * sizeof(double));
    int side = (int)sqrt(n_samples);
    for(int i_mesh=0; i_mesh < side; i_mesh++){
        for(int j_mesh=0; j_mesh < side; j_mesh++){
            double a = sample_to(1, 10, i_mesh);
            double b = sample_to(1, 2, j_mesh);
            xs[i_mesh * side + j_mesh] = a * b;
        }
    }
    return xs;
}
```

But that requires us to encode the shape of the model into the sampling function. It leads to an ugly nesting of for loops. It is a more complex approach. It is not [grug-brained](https://grugbrain.dev/). So every now and then I have to remember that this is not the way.

### Tests and the long tail of the lognormal

Distribution functions can be tested with:

```sh
cd tests
make && make run
```

`make verify` is an alias that runs all the tests and just displays the ones that are failing.
These tests are somewhat rudimentary: they get between 1M and 10M samples from a given sampling function, and check that their means and standard deviations correspond to what they theoretically should be.

If you run `make run` (or `make verify`), you will see errors such as these:

```
[-] Mean test for normal(47211.047473, 682197.019012) NOT passed.
Mean of normal(47211.047473, 682197.019012): 46933.673278, vs expected mean: 47211.047473
delta: -277.374195, relative delta: -0.005910

[-] Std test for lognormal(4.584666, 2.180816) NOT passed.
Std of lognormal(4.584666, 2.180816): 11443.588861, vs expected std: 11342.434900
delta: 101.153961, relative delta: 0.008839

[-] Std test for to(13839.861856, 897828.354318) NOT passed.
Std of to(13839.861856, 897828.354318): 495123.630575, vs expected std: 498075.002499
delta: -2951.371925, relative delta: -0.005961
```

These tests I wouldn't worry about. Due to luck of the draw, their relative error is a bit over 0.005, or 0.5%, and so the test fails. But it would surprise me if that had some meaningful practical implication.

The errors that should raise some worry are:

```
[-] Mean test for lognormal(1.210013, 4.766882) NOT passed.
Mean of lognormal(1.210013, 4.766882): 342337.257677, vs expected mean: 288253.061628
delta: 54084.196049, relative delta: 0.157985
[-] Std test for lognormal(1.210013, 4.766882) NOT passed.
Std of lognormal(1.210013, 4.766882): 208107782.972184, vs expected std: 24776840217.604111
delta: -24568732434.631927, relative delta: -118.057730

[-] Mean test for lognormal(-0.195240, 4.883106) NOT passed.
Mean of lognormal(-0.195240, 4.883106): 87151.733198, vs expected mean: 123886.818303
delta: -36735.085104, relative delta: -0.421507
[-] Std test for lognormal(-0.195240, 4.883106) NOT passed.
Std of lognormal(-0.195240, 4.883106): 33837426.331671, vs expected std: 18657000192.914921
delta: -18623162766.583248, relative delta: -550.371727

[-] Mean test for lognormal(0.644931, 4.795860) NOT passed.
Mean of lognormal(0.644931, 4.795860): 125053.904456, vs expected mean: 188163.894101
delta: -63109.989645, relative delta: -0.504662
[-] Std test for lognormal(0.644931, 4.795860) NOT passed.
Std of lognormal(0.644931, 4.795860): 39976300.711166, vs expected std: 18577298706.170452
delta: -18537322405.459286, relative delta: -463.707799
```

What is happening in this case is that you are taking a normal, like `normal(-0.195240, 4.883106)`, and exponentiating it to arrive at a lognormal. But `normal(-0.195240, 4.883106)` is going to have some non-negligible weight on, say, 18. And `exp(18) ≈ 65659969`, so points like it are going to contribute a nontrivial amount to the analytical mean and standard deviation, even though they have little probability mass.

The reader can also check that for more plausible real-world values, like those fitting a lognormal to a really wide 90% confidence interval from 10 to 10k, errors aren't egregious:

```
[x] Mean test for to(10.000000, 10000.000000) PASSED
[-] Std test for to(10.000000, 10000.000000) NOT passed.
Std of to(10.000000, 10000.000000): 23578.091775, vs expected std: 25836.381819
delta: -2258.290043, relative delta: -0.095779
```

Overall, I would caution that if you really care about the very far tails of distributions, you might want to instead use tools which can do some of the analytical manipulations for you, like the original Squiggle, Simple Squiggle (both linked below), or even doing lognormal multiplication by hand, relying on the fact that two lognormals multiplied together result in another lognormal with known shape.
In fact, squiggle.c does have a few functions for algebraic manipulations of simple distributions at the end of squiggle.c. But these are pretty rudimentary, and I don't know whether I'll end up expanding or deleting them.

### Compiler warnings

#### Harsh compilation

By default, I've enabled `-Wall -Wextra -Wdouble-promotion -Wconversion`. However, these produce some false-positive warnings, which I've dealt with through:

- For conversion: explicit casts, particularly from int to size_t when calling malloc.
- For dealing with unused variables: an UNUSED macro. If you don't like that approach, you could add -Wno-unused-parameter to your makefile and remove the macro and its usage.

Some resources on compiler flags: [1](https://nullprogram.com/blog/2023/04/29/), [2](https://news.ycombinator.com/item?id=7371806)

#### Results of running clang-tidy

clang-tidy is a utility to detect common errors in C/C++. You can run it with:

```
make tidy
```

So far in the history of this program it has emitted:

- one false-positive warning about an issue I'd already taken care of (so I've suppressed the warning)
- a warning about an unused variable

I think this is good news in terms of making me more confident that this simple library is correct :).

### Boundaries between sampling functions and arrays of samples

In squiggle.c, the boundary between working with sampler functions and arrays of samples is clear. Not so in the original squiggle, which hides this distinction from the user in the interest of accessibility.

### Parallelism

I provide some functions to draw samples in parallel. For "normal" squiggle.c models, where you define one model and then draw samples from it once at the end, they should be fine.
But for more complicated use cases, my recommendation would be to not use parallelism unless you know what you are doing, because of intricacies around setting seeds. Some gotchas and exercises for the reader:

- If you run the `sampler_parallel` function twice, you will get the same result. Why?
- If you run the `sampler_parallel` function on two different inputs, their outputs will be correlated. E.g., if you run two lognormals, indices which have higher samples in one will tend to have higher samples in the other one. Why?
- For a small number of samples, if you run the `sampler_parallel` function, you will get better spread out random numbers than if you run things serially. Why?

That said, I found adding parallelism to be an interesting and engaging task. Most recently, I even optimized the code to ensure that two threads weren't accessing the same cache line at the same time, and it was very satisfying to see a 30% improvement as a result.

### Using arbitrary cdfs

The last commit that has code to sample from arbitrary cdfs is `8f6919fa2...`. You can access it with `git checkout 8f6919fa2`. I removed those functions because I wasn't using them, and they didn't really fit with the overall ethos of the project.

### Other gotchas

- Even though the C standard is ambiguous about this, this code assumes that doubles are 64-bit precision (otherwise the xorshift code should be different).

### Consider this code

Consider sampling from the unit uniform in this manner:

```
#include "../squiggle.h"
#include "../squiggle_more.h"
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    // Replicate <https://arxiv.org/pdf/1806.02404.pdf>, and in particular the red line in page 11.
    // Could also be interesting to just produce and save many samples.
    // set randomness seed
    uint64_t* seed = malloc(sizeof(uint64_t));
    *seed = UINT64_MAX/64; // xorshift can't start with a seed of 0

    int n_samples = 100*MILLION;
    int p_sixteenth = 0;
    int p_eighth = 0;
    int p_quarter = 0;
    int p_half = 0;
    double sample;
    for(int i=0; i<n_samples; i++){
        sample = sample_unit_uniform(seed);
        // printf("%lf\n", sample);
        if (sample < 1.0/16.0){
            p_sixteenth++;
            p_eighth++;
            p_quarter++;
            p_half++;
        } else if(sample < 0.125){
            p_eighth++;
            p_quarter++;
            p_half++;
        } else if(sample < 0.25){
            p_quarter++;
            p_half++;
        } else if(sample < 0.5){
            p_half++;
        } else {
            // printf("Sample > 0.5\n");
        }
    }
    printf("p_16th: %lf; p_eighth: %lf; p_quarter: %lf; p_half: %lf\n", ((double)p_sixteenth)/n_samples, (double)p_eighth/n_samples, (double)p_quarter/n_samples, (double)p_half/n_samples);
    free(seed);
}
```

What will be printed out? In particular, consider that not all floating points have the same density.

<details style="border-style: solid; border-width: 2px; padding-top: 20px; padding-left: 20px; padding-right: 20px; margin-bottom: 20px;">
<summary>Click on the arrow to see the answer</summary>
The p_sixteenth will be ~0.0625, p_eighth will be ~0.125, p_quarter will be ~0.25, p_half will be ~0.5. This is because these random numbers are produced by generating ints and then dividing by the maximum int. There may be additional gotchas here, but this at least ensures that intervals of the same length in [0,1] will have the same number of samples. It's just that those that are smaller will be represented with more precision.
</details>

## Related projects

- [Squiggle](https://www.squiggle-language.com/)
- [SquigglePy](https://github.com/rethinkpriorities/squigglepy)
- [Simple Squiggle](https://nunosempere.com/blog/2022/04/17/simple-squiggle/)
- [time to BOTEC](https://github.com/NunoSempere/time-to-botec)
- [Find a beta distribution that fits your desired confidence interval](https://nunosempere.com/blog/2023/03/15/fit-beta/)