\chapter{Probability, Inference, and Thermodynamics}\label{sec:variationalMethods}
\section{Introduction}
We need a systematic way to make inferences.
This entails using probabilities to model our knowledge of the world.
\section{Probability}
Probability theory studies the consistency of the various statements that can be made of the world.
The statements may or may not be true;
where there is uncertainty in some event,
each possible outcome gives rise to a valid statement.
%there are many valid statements that can be made of the world, one for each possible outcome.
Statements may be atomic, in that they reference no other event,
or they may be compound, in that they depend upon other, simpler, statements.
Given a set of outcomes to an uncertain event,
an individual may ascribe each statement with the degree to which it is to be believed.
Probability theory determines how to combine the beliefs of simpler statements
into the belief of their compound in a consistent manner,
and describes how beliefs should change in order to remain consistent when new information becomes available.
Whether the original assignments are based on experience, prejudice or mere caprice,
and whether they coincide with the assignments of any other individual does not matter.
Probability theory concerns itself only with maintaining a self-consistent set of beliefs.
%that depend upon an \apriori\ model that defines the set of possible outcomes.
%The formulation of a set of possible outcomes and the assignment of their beliefs constitutes a model of the uncertain event.
A model first defines the set of possible outcomes and second assigns the \apriori\ beliefs.
It defines how one interprets the world.
An experimental test can make the parameters of the model more precise,
or one may find that experimental data better supports an alternative model,
but the model is always the starting point.
In this thesis, therefore, an individual's view of the world is characterised by the totality of the models that they use to describe the world.
The subjectivity intrinsic to the possibly capricious formulation of a model may be disconcerting.
A consistent and agreed upon science can result, however,
when there exists a nature to which differing models may be compared.
Consensus can result by subjecting a model to experimental test and by sharing successful models.
%One last comment is in order, however,
%and that is the inherent lack of knowledge involved even when carrying out a measurement.
%By this I do not mean the banal problems of finite experimental precision,
%the {\em experimental noise} that introduces uncertainty into the quantity measured.
%Rather, I refer to the difficulties in determining how distances and times are to be measured at all.
%If a signal, such as light or sound, is used to locate a faraway entity
%then the true location of the entity is no longer measurable.
%The experimenter has no knowledge of what happens to that signal after it has been sent and before it returns:
%both the signal's route and speed are unknown.
%Assuming the signal's route to be direct and speed to be constant throughout is a common and necessary convention, for otherwise no definition of measurement is possible.
%It must, however, be emphasised that in reality the speed of the signal is no more attainable than its path.
%Much more will be said on this matter in \chapref{measurement}.
%For now we just note that our knowledge of the world, through our models and experiments,
%is, and is set to remain, remote from the true natural order of things.
Before moving on,
the world view adopted by probability theory should be located in its philosophical context.
In this regard the closeness of Wittgenstein's {\em Tractatus Logico-Philosophicus} to modern probability theory should be noted.
This thesis shall not attempt to unpick where Wittgenstein's views agree with, and where they differ from, modern probability theory
(for such an attempt see \cite{WittgensteinLattice}),
and a discussion on the relative merits of other world views is not considered to be within the scope of this thesis.
Nevertheless, %to aid the interested reader in locating the philosophy of probability theory,
the reader is invited to keep the following quotes in mind during this chapter.
Firstly, regarding the primacy of models when conceiving the world, Wittgenstein writes
\begin{quote}
The world is the totality of facts, not of things. For the totality of facts determines both what is the case, and also all that is not the case. Every thing is, as it were, in a space of possible atomic facts. I can think of this space as empty, but not of the thing without the space. (1.1, 1.12, 2.013)
\end{quote}
Indeed, the facts have a definite structure that is determined by how their compound is formed,
and this structure is assumed to be what is shared with nature:
\begin{quote}
Atomic facts are independent of one another. We make to ourselves pictures of facts. The picture is a model of reality.
That the elements of the picture are combined with one another in a definite way, represents that the things are so combined with one another.
This connexion of the elements of the picture is called its structure, and the possibility of this structure is called the form of representation of the picture. (2.061, 2.1, 2.12, 2.15)
\end{quote}
Finally, Wittgenstein expresses very directly the difficulty in articulating any innate truth that is external to a model of the world:
\begin{quote}
The world and life are one. I am my world. (The microcosm)... [A]t death the world does not alter, but comes to an end. (5.621, 5.63, 6.431)
\end{quote}
The natural world exists, but is mystical because it can only be guessed at from a constructed world view,
\begin{quote}
Not how the world is, is the mystical, but that it is.
The contemplation of the world sub specie aeterni is its contemplation as a limited whole.
The feeling that the world is a limited whole is the mystical feeling. (6.44, 6.45)
\end{quote}
% This thesis shall not attempt to unpick where Wittgenstein views agree and differ from modern probability theory
% (for such an attempt see \cite{WittgensteinLattice}).
% In the presentation that follows, however,
% propositions from the {\em Tractatus} will periodically be referenced where we believe the viewpoint is the same.
% I hope that this is of interest to the reader.
% and that is the difficulties inherent in carrying out measurements.
% To measure the world, one must decide how to measure the world,
% and in particular, how to measure distances and times.
% Einstein introduced the {\em convension} of using light to carry out measurements.
% However, as is argued in \chapref{measurement},
% other convensions are possible.
% In ultrasound physics, measurements are carried out with and indeed more natural for certain measurements.
% For example
% prior to any experiment,
% Probability theory enables the model to be updated consistently in the light of new experimental evidence,
% and informs us that an experiment agrees with one model better than another.
% An individuals view of the world therefore,
% In this thesis, therefore, an individual's view of the world is characterised by the totality of the models that they use to describe the world.
% The questions that different individuals puts to the world may be different,
% and so their chosen models will be different too.
% To test the quality of the model, the world must be put to experimental test.
% However, in order to do so, a model is also required for time and space.
% It is impossible to measure beyond these.
% the degree to which each statement should be believed will often be the subject of disagreement.
% Nevertheless, whether the degrees of belief are
% The outcome of an uncertain event
% Individuals may disagree with the likelihood of the various outcomes of an uncertain event.
% Everybody is able to ascribe a degree to which the various outcomes of an uncertain event should be believed.
% but whether based on experience or caprice,
% each can ascribed a degree of belief.
% %but are formed with greater or lesser certainty out of all the possible outcomes to their subject.
% %they are ascribed a degree of belief
% %A degree of belief, whether based on experience or prejudice, can therefore be ascribed to a statement.
\subsection{The Lattice of Statements}
Statements can be ordered in a lattice in a very natural way.
\subsection{Measure}
\subsection{Divergence}
\section{Thermodynamics}
Probability distributions can be used to represent our knowledge of the world.
For example, the value of an experimentally obtained variable will in general
fluctuate around its average.
If different runs of the experiment are independent
then the distribution of obtained values fully describes the experiment.
What is learned from a given experiment is then characterised by how the probability distributions
that represent our knowledge change.
If a hypothesis, $\H$, is that a set of experimental data, $\vx = \{x_n|n=1,\ldots,N\}$, should conform to a model with a set of parameters, $\vw = \{w_i| i=1,\ldots,I\}$,
then our full knowledge of the system is given by the joint probability distribution
\begin{align}
P\lr{\vx,\vw,\H}.
\label{eqn:fullJointDist}
\end{align}
Of greater importance than \eqnref{fullJointDist}, however,
is to determine how our knowledge of the model changes when we collect the experimental data.
This can be found from \eqnref{fullJointDist} by splitting the joint distribution into its conditional probabilities.
\begin{align}
P\lr{\vx,\vw,\H} = P\lr{\vw|\vx,\H}P\lr{\vx|\H}
= P\lr{\vx|\vw,\H} P\lr{\vw|\H}.
\end{align}
from which it follows that
\begin{align}
P\lr{\vw|\vx,\H} = \frac{P\lr{\vx|\vw,\H} P\lr{\vw|\H}}{P\lr{\vx|\H}}.
\label{eqn:BayesTheorem}
\end{align}
Equation \Eqnref{BayesTheorem} is Bayes' theorem.
It states that the probability of the model's parameters, {\em given the data},
can be determined from the probability of the data when the parameters are known, and the probability of the parameters {\em before the data were known}.
It describes exactly the process of inference.
The term $P\lr{\vx|\vw,\H}$ is the likelihood function.
It evaluates the degree to which the model with a given set of parameters agrees with the experimental data.
If it is assumed that every data point is independent, and that each datum should agree with the prediction of the model, $t_n$,
to within Gaussian noise
then the likelihood function would be,
\begin{align}
P\lr{\vx|\vw,\H} = \prod_{n=1}^N \sqrt{\frac{\gamma}{2\pi}}e^{-0.5\gamma\lr{x_n-t_n}^2}.
\end{align}
The variable $\gamma$ is the precision - the inverse of the variance - and is one of the set $\{w_i\}$.
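Purely for illustration, the log of this likelihood is simple to evaluate numerically. The following Python sketch does so for a handful of invented data points and arbitrary choices of the predictions $t_n$ and the precision $\gamma$; none of the numbers carry any significance beyond the example.
\begin{verbatim}
import numpy as np

x = np.array([1.02, 0.95, 1.10, 0.88])  # observed data (invented)
t = np.array([1.00, 1.00, 1.00, 1.00])  # model predictions t_n
gamma = 25.0                             # precision (inverse variance)

# log P(x|w,H) = sum_n [ 0.5 log(gamma/2pi) - 0.5 gamma (x_n - t_n)^2 ]
log_likelihood = np.sum(0.5*np.log(gamma/(2*np.pi))
                        - 0.5*gamma*(x - t)**2)
print(log_likelihood)
\end{verbatim}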
The term $P\lr{\vw|\H}$ in \eqnref{BayesTheorem} is independent of the experimental data $\{x_n\}$.
It represents our knowledge of the parameters before the experiment was carried out.
It could be that the parameters are already known to great precision -
in which case the probability distribution would tend towards a delta function.
Alternatively, it could be that the a priori knowledge of the precision, say,
does not extend beyond the requirement that the precision is positive.
In this case the prior would be represented by a scale invariant distribution over positive values.
One such example is the Gamma distribution,
\begin{align}
P(\gamma|s,c) = \frac{1}{\Gamma(c)s}\lr{\frac{\gamma}{s}}^{c-1}\exp\lr{-\frac{\gamma}{s}},
\label{eqn:Gamma}
\end{align}
in the limit such that $sc = 1$ and $c\rightarrow 0$ \cite{MacKay2003}.
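To see why this limit is scale invariant, substitute $s = 1/c$ and keep only the factors that depend on $\gamma$:
\begin{align}
P(\gamma|s,c) \propto \gamma^{c-1}\exp\lr{-c\gamma} \rightarrow \frac{1}{\gamma}\quad\text{as } c\rightarrow 0,
\end{align}
which is flat in $\ln\gamma$, so that no scale for the precision is preferred over any other (at the price of the prior being improper).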
The hypothesis, $\H$, encompasses all of the assumptions that go into the inference.
These include the choice of the model that is fitted to the data,
the prior probabilities assigned to the model variables and
the noise model described by the likelihood function.
These assumptions are inevitable - they reflect the uncertainty
that prompts the experiment in the first place.
However,
since many different hypotheses can be dreamed up,
it is important to be able to evaluate how each is supported by the experimental data.
For this, Bayes' theorem can be applied a second time:
the probability of the hypothesis, given the data, is
\begin{align}
P\lr{\H | \vx } = \frac{P\lr{\vx|\H}P\lr{\H}}{P\lr{\vx}}.
\label{eqn:BayesHyp}
\end{align}
Since the probability of the data, $P\lr{\vx}$,
is independent of the hypothesis
it can be eliminated when comparing two hypotheses, $\H_1$ and $\H_2$,
\begin{align}
\frac{P\lr{\H_1 | \vx }}{P\lr{\H_2 | \vx }} = \frac{P\lr{\vx|\H_1}}{P\lr{\vx|\H_2}}\frac{P\lr{\H_1}}{P\lr{\H_2}}.
\label{eqn:ModelCmp}
\end{align}
The second of the ratios on the right-hand-side of \eqnref{ModelCmp}
gives an opportunity, if desired, to prefer one model over another irrespective of any data collected.
The first quotient is determined from the experimental data.
The term $P\lr{\vx|\H}$ is called the evidence and it is the partition function of \eqnref{BayesTheorem}.
A model that is highly constrained will be inflexible in the range of predictions it can make,
whereas a model that has many free parameters will be able to predict a vast number of possible outcomes.
The more constrained model will therefore have a smaller set of likely outcomes,
but each of these will have a much greater probability than the many possible outcomes of the less constrained model.
The right-hand-side of \eqnref{ModelCmp} therefore directly and quantitatively embodies Occam's razor,
the rule of thumb that states that `simpler' models should be favoured over more complicated models.
For a more detailed discussion of model comparison and Occam's razor see \cite[Chapter 28]{MacKay2003}.
To evaluate the evidence, the numerator in equation \eqnref{BayesTheorem} must be integrated over the entire parameter space,
\begin{align}
P\lr{\vx|\H} = \int_\vw d\vw\, P\lr{\vx|\vw,\H} P\lr{\vw|\H}.
\end{align}
In general this cannot be done analytically.
However, it is often the case that the probability density is tightly peaked about its maximum.
In this case the evidence may be evaluated by approximating the peak with a Gaussian, which can be integrated.
This is the saddle point approximation.
Expanding the logarithm of the unnormalised posterior, $P^\ast\lr{\vw} = P\lr{\vx|\vw,\H} P\lr{\vw|\H}$,
around its maximum, $\vw_0$,
gives
\eq{
\ln P^\ast\lr{\vw} \approx \ln P^\ast(\vw_0) - \frac{1}{2}\lr{\vw-\vw_0}^T \vA\lr{\vw-\vw_0 }
}
where
\eq{
\vA = A_{ij} = -\left.\frac{\d^2}{\d w_i\d w_j} \ln P^\ast(\vw)\right|_{\vw=\vw_0}
}
is the negative of the Hessian matrix at the maximum.
The numerator of \eqnref{BayesTheorem} is therefore approximated by the multidimensional Gaussian
\begin{align}
P^\ast\lr{\vw} \approx P^\ast(\vw_0) \exp \lr{- \frac{1}{2}\lr{\vw-\vw_0}^T \vA\lr{\vw-\vw_0 }},
\end{align}
for which the normalisation constant, the evidence, is
\begin{align}
P\lr{\vx|\H} \approx P^\ast(\vw_0) \sqrt{\frac{\lr{2\pi}^K}{\det \vA}},
\end{align}
where $K$ is the number of parameters.
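To make the procedure concrete, the following Python sketch applies the saddle point approximation to a one-parameter model with a Gaussian likelihood and a broad Gaussian prior. The data, the prior width, and the use of a numerical optimiser and finite differences are arbitrary choices made for the illustration only.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.9, 1.1, 1.3, 0.8, 1.2])   # observed data (invented)
gamma = 4.0                                 # known noise precision

def neg_log_Pstar(w):
    """-log P*(w) = -log[ P(x|w,H) P(w|H) ] for a scalar mean w."""
    log_lik = np.sum(0.5*np.log(gamma/(2*np.pi)) - 0.5*gamma*(x - w)**2)
    log_prior = -0.5*np.log(2*np.pi*100.0) - 0.5*w**2/100.0  # N(0, 10^2)
    return -(log_lik + log_prior)

w0 = minimize_scalar(neg_log_Pstar).x       # maximum of P*(w)

# A = -d^2/dw^2 log P*(w) at w0, by a central finite difference.
eps = 1e-4
A = (neg_log_Pstar(w0 + eps) - 2*neg_log_Pstar(w0)
     + neg_log_Pstar(w0 - eps)) / eps**2

# Evidence: P(x|H) ~ P*(w0) * sqrt((2 pi)^K / det A), here with K = 1.
log_evidence = -neg_log_Pstar(w0) + 0.5*np.log(2*np.pi) - 0.5*np.log(A)
print(w0, log_evidence)
\end{verbatim}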
\subsection{Conjugate Exponential Variables}
If the conditional distribution of a variable $X$ given its parent $Y$ can be written in the exponential family form
\begin{align}
\ln P(X|Y) = \phi(Y) u(X) + f(X) + g(Y),
\end{align}
then conjugacy implies that the distribution of a child variable $W$, viewed as a function of $X$, takes the form
\begin{align}
\ln P(W|X) = \tilde\phi(W) u(X) + h(W),
\end{align}
so that the variable $X$ enters both distributions through the same sufficient statistic $u(X)$.
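As an illustrative example, take the Gaussian likelihood introduced earlier in this section with a single datum $x$ and prediction $t$, and treat the precision $\gamma$ as the variable $X$. Viewed as a function of $\gamma$,
\begin{align}
\ln P(x|\gamma) = -\half\lr{x-t}^2 \gamma + \half\ln\gamma - \half\ln 2\pi,
\end{align}
which is linear in $u(\gamma) = \lr{\gamma, \ln\gamma}$, as is the logarithm of the Gamma prior \eqnref{Gamma}. The two distributions are therefore conjugate: multiplying them simply adds their natural parameters, and the posterior over $\gamma$ remains a Gamma distribution.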
\section{Variational Approach}
\section{Kullback-Leibler divergence}
Variational methods can be used to approximate a probability distribution, $P$, that is impossible to evaluate exactly,
with a probability distribution that is more malleable.
The approximation is varied so that it matches the original distribution as closely as possible.
The amount of information that is lost when a distribution $Q$ is used in place of the distribution $P$ is measured by the relative entropy,
a quantity known as the Kullback-Leibler divergence.
%The Kullback Leibler divergence gives a measure of the similarity of two distributions,
It is defined,
\begin{align}
\KLD{Q}{P} &= \int_\vH Q(\vH|\H) \log\frac{Q(\vH|\H)}{P(\vH|\vD,\H)} d\vH,
\end{align}
where $P$ and $Q$ are probability distributions that model a hypothesis, $\H$.
$\vH$ is a set of unknown variables that form the model and $\vD$ is a set of known variables.
From Gibbs inequality it follows that
\begin{align}
\KLD{Q}{P} \ge 0
\end{align}
with equality if and only if $P=Q$.
That is, knowledge of the system is always lost when it is approximated.
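As a concrete example, for two univariate Gaussians $Q$ and $P$ with means $\mu_q$, $\mu_p$ and variances $\sigma_q^2$, $\sigma_p^2$, the divergence can be evaluated in closed form,
\begin{align}
\KLD{Q}{P} = \ln\frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + \lr{\mu_q - \mu_p}^2}{2\sigma_p^2} - \half,
\end{align}
which is zero when the two distributions coincide and positive otherwise.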
The Kullback-Leibler divergence will be minimised in two different ways in this thesis.
\section{Statistical Mechanics}
Let
\begin{align}
P(x) = \frac{1}{Z} e^{-\beta E(x) },
\end{align}
where $E(x)$ is the energy of the state $x$.
Then the variational free energy of an approximating distribution $Q$ is
\begin{align}
\beta \tilde{F} &= \int Q(x) \ln \frac{Q(x)}{\exp\lr{-\beta E(x) }} dx \\
&= \int Q(x) \ln \frac{Q(x)}{P(x)} dx - \ln Z \\
&= \KLD{Q}{P} - \ln Z \\
&= \KLD{Q}{P} + \beta F,
\end{align}
where $F \equiv - \beta^{-1}\ln Z$ is the free energy and
\begin{align}
Z = \int e^{-\beta E(x)} dx
\end{align}
is the partition function.
Since the divergence is non-negative, $\tilde{F} \ge F$, with equality only when $Q = P$:
minimising the variational free energy therefore bounds the true free energy from above.
\section{Variational Ensemble Learning}
The non-negativity of the Kullback-Leibler divergence makes it a useful function to minimise.
Indeed, using $P(\vH,\vD|\H) = P(\vH|\vD,\H) P(\vD|\H)$,
we may write,
\begin{align}
\KLD{Q}{P} % &=\int_\vH Q(\vH) \log\frac{Q(\vH)}{P(\vH|\vD)} d\vH \\
&= \int_\vH Q(\vH|\H) \log\frac{Q(\vH|\H)}{P(\vH,\vD|\H)} d\vH + \log P(\vD|\H) \\
&= -S_Q - \int_\vH Q(\vH|\H) \log P(\vH,\vD|\H) d\vH + \log P(\vD|\H)
\end{align}
where $S_Q = - \int_\vH Q(\vH|\H) \log Q(\vH|\H) d\vH$ is the entropy given the hypothesis%
\footnote{
\begin{quote}
Consider, for example, a crystal of Rochelle salt.
For one set of experiments on it, we work with temperature, pressure and volume.
The entropy can therefore be expressed as some function $S_e(T,P)$.
For another set of experiments on the same crystal,
we work with temperature, the component $e_{xy}$ of the strain tensor,
and the component $P_z$ of the electric polarisation;
the entropy as found in these experiments is a function $S_e(T, e_{xy}, P_z)$.
It is clearly meaningless to ask ``What is the entropy of the crystal?''
unless we first specify the set of parameters which define its thermodynamic state.%
One might reply that in each of the experiments cited,
we have used only part of the degrees of freedom of the system,
and there is a ``true'' entropy which is a function of all these parameters simultaneously.
However we can always introduce as many parameters as we please...
There is no end to this search for the ultimate ``true'' entropy until we have reached the point where we control
the location of each atom independently.
But just at that point the notion of entropy collapses, and we are no longer talking thermodynamics!
From this we see that entropy is an anthropomorphic concept,
not only in the well known statistical sense that it measures the extent of human ignorance as to the microstate.
{\em Even at the purely phenomenological level, entropy is an anthropomorphic concept.}
For it is a property, not of the physical system,
but of the particular experiments that you or I choose to perform on it.
\flushright Edwin T. Jaynes\cite{Jaynes1965}
\end{quote}
}.
Define the cost function
\begin{align}
\L = \int_\vH Q(\vH) \log P(\vH,\vD) \, d\vH + S_Q
\end{align}
From which it follows that
\begin{align}
\L &= \log P(\vD|\H) - \KLD{Q}{P} \\
&\le \log P(\vD|\H)
\end{align}
The probability of the model, given the data, is then bounded from below,
\begin{align}
P(\H | \vD) &= \frac{P(\vD| \H) P(\H)}{P(\vD)}\\
&\ge \frac{e^{\L(Q)}P(\H)}{P(\vD)},
\end{align}
since $\L \le \log P(\vD|\H)$.
Assuming that the variables are independent gives
\begin{align}
Q\lr{\vH} = \prod_n^N Q_n\lr{H_n}
\end{align}
where $Q_n$ is the independent distribution for the $n$th variable.
Then
\begin{align}
\L&= \int_\vH \prod_n^NQ_n(H_n) \log P(\vH,\vD) \, d\vH - \sum_n^N \int_{H_n} Q_n(H_n) \log Q_n(H_n) \, dH_n
\end{align}
Separating out the $j$th element gives
\begin{align}
\L &= \int_\vH Q_j(H_j)\prod_{n\ne j}^NQ_n(H_n) \log P(\vH,\vD) \, d\vH + S_{Q_j} + \sum_{n\ne j}^N S_{Q_{n}}
\\ &= \int_{H_j} Q_j(H_j) \multi{\log P(\vH,\vD)}{\prod_{i\ne j} Q_i\lr{H_i}} dH_j +S_{Q_j} + \sum_{n\ne j}^N S_{Q_{n}}
\end{align}
Introducing
\begin{align}
Q^\ast_j = \frac{1}{Z}e^{\multi{\log P(\vH,\vD)}{\prod_{i\ne j} Q_i\lr{H_i}}}
\end{align}
gives
\begin{align}
\L &= \int_{H_j} Q_j(H_j) \log Q^\ast_j \, dH_j + \log Z +S_{Q_j} + \sum_{n\ne j}^N S_{Q_{n}}
\\ &= -\KLD{Q_j}{Q^\ast_j} + \log Z + \sum_{n\ne j}^N S_{Q_{n}},
\end{align}
which is maximal with respect to $Q_j$ when $Q_j = Q^\ast_j$, and so the bound is maximised when
\begin{align}
\log Q^\ast_j = \multi{\log P(\vH,\vD)}{\prod_{i\ne j} Q_i\lr{H_i}} + \const.
\end{align}
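To make the update cycle concrete, the following Python sketch applies these coordinate updates to the textbook problem of inferring the mean $\mu$ and precision $\tau$ of a Gaussian from data, with $Q(\mu,\tau) = Q(\mu)Q(\tau)$, a Gaussian prior on $\mu$ with mean $\mu_0$ and precision $\lambda_0\tau$, and a Gamma prior on $\tau$ with shape $a_0$ and rate $b_0$. The hyperparameter values and the data are choices made for the illustration; the update formulae below are the standard ones for this particular conjugate model.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.5, size=200)           # invented data
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3     # assumed prior hyperparameters

# Factors Q(mu) = N(muN, 1/lamN) and Q(tau) = Gamma(aN, bN), initialised crudely.
muN, lamN, aN, bN = xbar, 1.0, a0, b0

for _ in range(50):
    # Q*(mu): depends on Q(tau) only through <tau> = aN/bN.
    E_tau = aN / bN
    muN = (lam0*mu0 + N*xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # Q*(tau): depends on Q(mu) through <mu> = muN and Var(mu) = 1/lamN.
    E_sq = np.sum((x - muN)**2) + N/lamN
    aN = a0 + 0.5*(N + 1)
    bN = b0 + 0.5*(E_sq + lam0*((muN - mu0)**2 + 1/lamN))

print(muN, aN/bN)        # approximate posterior means of mu and tau
\end{verbatim}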
Now, for a directed graphical model the joint distribution factorises over the parents of each variable,
\begin{align}
P(X_1, X_2, \ldots, X_N) = \prod_i^N P(X_i|\parents{i})
\end{align}
so
\begin{align}
\ln Q^\ast_j(H_j) = \multi{\ln P\lr{H_j|\parents{j}} + \sum_{i \in \children{j}} \ln P\lr{X_i| H_j, \coparents{j}}}{\sim Q(H_j)} + \const
\end{align}
If the models are conjugate exponential then
\begin{align}
\ln Q^\ast_j(H_j) &= \multi{\phi(\parents{j}) u(H_j) + f(H_j) + g(\parents{j}) }{\sim Q(H_j)}
\nonumber \\
&+\multi{\sum_{i \in \children{j}} \tilde\phi(X_i,\coparents{j}) u(H_j) + h(X_i,\coparents{j}) }{\sim Q(H_j)} + \const\\
&= \multi{\phi(\parents{j}) + \sum_{i \in \children{j}} \tilde\phi(X_i,\coparents{j}) }{\sim Q(H_j)} u(H_j)+ f(H_j)+\const
\end{align}
from which it follows that
\begin{align}
\phi^\ast_j = \scalar{\phi(\parents{j})} + \sum_{i \in \children{j}} \scalar{\tilde\phi(X_i,\coparents{j}) }
\end{align}
where expectations are with respect to $Q$.
The message from a variable node to a function node is
\begin{align}
m_{X_i\rightarrow f_j} = \Moments{Q_i}
\end{align}
The message from a function node to a variable node is
\begin{align}
m_{f_i\rightarrow X_j} = \Natural{\multi{f_i(\neighbour{i})}{Q_{{\neighbour{i} \bs X_j}}}}
\end{align}
The natural parameters of the updated variable node are then given by
\begin{align}
\Natural{Q^\ast(X_i)} = \sum_{j\in\neighbour{i}} m_{f_j \rightarrow X_i}
\end{align}
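As an illustration of these messages (assuming the usual exponential family parameterisation of the Gaussian, so that the sufficient statistics of a variable $x$ are $u(x) = \lr{x, x^2}^T$), consider a Gaussian factor with mean $m$ and precision $\gamma$, each carrying its own factor in $Q$. Its logarithm, $\gamma m x - \half\gamma x^2 + \ldots$, is linear in $u(x)$, and so the messages are
\begin{align}
m_{f\rightarrow x} =
\begin{pmatrix} \scalar{\gamma}\scalar{m} \\ -\half\scalar{\gamma} \end{pmatrix},
\qquad
m_{x\rightarrow f} =
\begin{pmatrix} \scalar{x} \\ \scalar{x^2} \end{pmatrix},
\end{align}
with the expectations taken under the current factors of $Q$. Summing the natural parameter messages arriving at $x$ then gives the updated Gaussian $Q^\ast(x)$.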
\section{Independent Components of the pulses}
Assume that the differing bubble sources contribute as independent components to the recorded pulses,
\begin{align}
\vx_t = \vA \vs_t
\end{align}
One model for the noise is a Gaussian,
\begin{align}
P(\vx_t| \vA, \vs_t, \Lambda) = \G(\vx_t; \vA\vs_t, \Lambda)
\end{align}
However,
from \figref{} it is seen that this time-domain Gaussian noise model is not good.
A better alternative is to use Fourier decomposition,
\begin{align}
\vx_\omega = \vA \vs_\omega
\end{align}
such that
\begin{align}
P(\vx_\omega| \vA, \vs_\omega, \Lambda_\omega) = \G( \vx_\omega ; \vA\vs_\omega, \Lambda_\omega)
\end{align}
\begin{align}
P(\vH|\vD) = G(\vH)
\end{align}
\begin{align}
\KLD{Q}{P} &= \int_\vH Q(\vH) \log\frac{Q(\vH)}{P(\vH|\vD)} d\vH \\
&= \int_\vH Q(\vH) \log\frac{Q(\vH)}{P(\vH,\vD)} d\vH + \int_\vH Q(\vH) \log P(\vD) d\vH\\
&= \int_\vH Q(\vH) \log Q(\vH) d\vH - \int_\vH Q(\vH) \log P(\vH,\vD) \, d\vH + \log P(\vD)
\end{align}
Define the cost function
\begin{align}
\L = \int_\vH Q(\vH) \log P(\vH,\vD) \, d\vH - \int_\vH Q(\vH) \log Q(\vH) \, d\vH.
\end{align}
From which it follows that
\begin{align}
\L &= \log P(\vD) - \KLD{Q}{P} \\
&\le \log P(\vD)
\end{align}
The probability of the model, given the data, is then bounded from below,
\begin{align}
P(\H| \vD) &= \frac{P(\vD| \H) P(\H)}{P(\vD)}\\
&\ge \frac{e^{\L(Q)}P(\H)}{P(\vD)}.
\end{align}
\subsubsection{The model}
\begin{align}
P(s_{m\omega}| \H) &= \sum_{c=1}^{N_c} \pi_{mc}\G(s_{m\omega};0,\beta_{\omega c})\\
P(\beta_{\omega c}|\H) &= \GammaDistr(\beta_{\omega c} ; b^{(\beta)}, c^{(\beta)})\\
P(\{\pi_{mc}\}_{c=1}^{N_c}|\H) &= \Dirichlet\lr{ \{\pi_{mc}\}_{c=1}^{N_c} | c^{(\pi)}}
\end{align}
The mixing matrix and its scale hyperparameters have the priors
\begin{align}
P(A_{nm}|\H) &= \G(A_{nm}; 0,\alpha_m)\\
P(\alpha_m| \H) &= \GammaDistr(\alpha_m; b^{(\alpha)},c^{(\alpha)})
\end{align}
and the noise precision has a Gamma prior,
\begin{align}
P(\Lambda_{\omega}|\H) = \GammaDistr(\Lambda_\omega;b^{(\Lambda)},c^{(\Lambda)} )
\end{align}
The approximating distribution is assumed to factorise,
\begin{align}
Q\lr{\vs, \vA, \pi, \beta, \alpha, \Lambda} = Q\lr{s_{\omega m}}Q\lr{A_{nm}}Q\lr{\pi}Q\lr{\beta}Q\lr{\alpha}Q\lr{\Lambda}
\end{align}
\begin{align}
Q\lr{s_{m\omega}} = \G(s_{m \omega};\hat{s}_{m\omega}, \tilde{s}_{m\omega})\\
Q\lr{A_{nm}} = \G(A_{nm};\hat{A}_{nm}, \tilde{A}_{nm})\\
Q\lr{\beta_{\omega c}} = \GammaDistr(\beta_{\omega c};\hat{\beta}_{\omega c}, \tilde{\beta}_{\omega c})
\end{align}
\section{Density Functional Theory}\label{app:DFT}
\subsection{Introduction}
\Dft\ relaxes the capillary approximation used in \cnt.
The density of the nucleated bubble is not assumed to be that of the bulk,
and the interface is not assumed to be macroscopic and planar\cite{Oxtoby1992, Oxtoby1998}.
\Dft\ therefore does a much better job at modelling the interface than \cnt.
Rather than it being a sudden boundary,
there is a finite interval over which the density varies from that of the fluid to that of the vapour
and the bubble is modelled for what it is - a fluctuation in density -
rather than a vapour entrapped within a flexible boundary.
If spherical symmetry is assumed then
the bubble boundary is defined by its radius.
The critical radius is such that\cite{Oxtoby1992,Oxtoby1998}
\begin{align}
\frac{d \Omega}{d a} =0,\quad\text{at $a = \astar$} \label{eqn:DFT:astarR}
\end{align}
where $\Omega$ is the {\em grand potential}.
The grand potential in \eqnref{DFT:astarR} is difficult to evaluate, however,
as it is a functional of the
phase space positions of all $N$ molecules in the system.
Specifically,
\begin{align}
\Omega = -\beta^{-1}\ln \Xi.
\end{align}
where $\Xi$ is the grand partition function
\begin{align}
\Xi = \Tr \exp\lr{-\beta \lr{H_N - \mu N}}. \label{eqn:nuc:GPF}
\end{align}
$\Tr$ denotes the trace operator
\begin{align}
\Tr \equiv \sum_{N=0}^\infty \frac{1}{h^{3N}N!} \iint d\cx_1 d\cp_1
\end{align}
and we have compacted the integral by writing
\begin{align}
d\cx_n &\equiv dr_n dr_{n+1}\ldots dr_N, &&\quad\text{and}&
d\cp_n &\equiv dp_n dp_{n+1}\ldots dp_N.
\label{eqn:dshorthand}
\end{align}
In \eqnref{nuc:GPF} $\mu$ denotes the chemical potential and $\H$ denotes the Hamiltonian of the molecules.
The difficulty in evaluating $\Omega$ comes from the interactions between the molecules.
In order to consider the couplings explicitly we split $\H$ into
\sub{
\begin{align}
% \begin{array}{ll}
\KE &= \sum_i^N \frac{p_i^2}{2m}, && \text{which is the kinetic energy,}\label{eqn:nuc:Kinetic}\\
\UE &= \UE(\cx_1), && \text{ the internal energy and}\\
\VE &= \sum_i^N V_\ext(r_i) && \text{the external potential.}
% \end{array}
\end{align}
}
so that the overall Hamiltonian can be written %in terms of the intrinsic potentials, $\H_\in$ and external potential, $\H_\ext$,
\begin{align}
\H = %\H_\in + \H_\ext =
\KE + \UE + \VE.
\end{align}
Here we have extended the shorthand employed in \eqnref{dshorthand} so that
\begin{align}
\cx_n \equiv r_n,r_{n+1},\ldots,r_N, \quad\text{and}\quad
\cp_n \equiv p_n,p_{n+1},\ldots, p_N.
\end{align}
%The coupled terms are therefore the internal energy, $\UE$.
Separating the Hamiltonian in this way lets us split the grand partition function
\begin{align}
\Xi = \Tr e^{-\beta (\KE -\mu N)}e^{-\beta(\UE +\VE)}= \frac{1}{N!}Z_\KE Z_{\UE+\VE}, \label{eqn:XiSeparate}
\end{align}
with the second equality following because $\KE$ is a function of only the particle momenta,
and $\UE$ and $\VE$ are functions of positions.
%At equilibrium the joint probability density of the distribution is the Boltzmann distribution
%\begin{align}
% p_0(\cx_1, \cp_1) = \Xi^{-1} \exp\lr{-\beta \lr{H_N - \mu N}}.
%\end{align}
%which can be derived with the Maximum entropy principle\cite{}, for example.
%Here were have extended our shorthand of \eqnref{dshorthand} so that
%\begin{align}
%\cx_n \equiv r_n,r_{n+1},\ldots,r_N, \quad\text{and}\quad
%\cp_n \equiv p_n,p_{n+1},\ldots, p_N.
%\end{align}
The two factors on the right of \eqnref{XiSeparate} can be considered separately:
\nlist{
\item The momentum integrals in \eqnref{XiSeparate} form the partition function of an ideal gas,
with
\begin{align}
Z_\KE = \int d \cp_1 e^{-\beta \lr{\sum_i^N \frac{p_i^2}{2m}-\mu N}} = \lr{\frac{m}{2\pi\hbar^2 \beta}}^{3N/2} \equiv n_Q^N.
\end{align}
The term $n_Q$ is sometimes known as the {\em quantum concentration} and is related to the {\em thermal de Broglie wavelength}, $\lambda_T$, by $n_Q = 1/\lambda_T^3$.
We demote the derivation of this standard result to \appref{DFT}.
\item
The remaining factor, $Z_{\UE+\VE}$, is the partition function of the joint probability distribution of the molecular positions,
\begin{align}
p_0(\cx) = \frac{1}{N!Z_{\UE+\VE} } e^{-\beta\lr{\UE+\VE}}
\end{align}
To make progress we must approximate the coupled interaction term, $\UE$.
%so that \eqnref{XiSeparate} can be solved.
Here we assume that only the two particle interactions are important
and write
\begin{align}
\UE(\cx) \approx \Phi(\cx) = \sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j),
\end{align}
where $\phi(\vr_i, \vr_j)$ is the two particle potential between a particle at $r_i$ and $r_j$.
%At equilibrium the true joint probability density is the Boltzmann distribution
%\begin{align}
% p_0(\cx_1, \cp_1) &= \Xi^{-1} \exp\lr{-\beta \lr{\H_N - \mu N}}
%\end{align}
%
The approximate Hamiltonian is then $H \equiv \KE + \Phi + \VE$,
and is described by the approximate probability density, $p$,
\begin{align}
p(\cx) = \frac{1}{N!Z_{\UE+\VE} } e^{-\beta\sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j) -\beta\sum_i^N V_\ext(\vr_i)} \label{eqn:pspatial}
\end{align}
Marginalising equation \eqnref{pspatial} for the 1-particle distribution gives
\begin{align}
p^{(1)}(\vr_1) = \frac{N}{N! Z_{\UE+\VE}} \int e^{-\beta\sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j)-\beta\sum_i^N V_\ext(\vr_i)} d\cx_2. \label{eqn:ponespatial}
\end{align}
The 2-particle density is
\begin{align}
p^{(2)}(\vr_1, \vr_2) = \frac{N(N-1)}{N!Z_{\UE+\VE}}\int e^{-\beta\sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j)-\beta\sum_i^N V_\ext(\vr_i)} d\cx_3.
\end{align}
The approximate number density, $\rho(\vr)$, is such that
\begin{align}
\int \rho(\vr) d\vr = N .
\end{align}
It follows that
\begin{align}
\rho(\vr) = N! p^{(1)}(\vr_1). \label{eqn:rhoone}
\end{align}
From \eqnref{rhoone} and \eqnref{ponespatial} we find that the {\em density is a functional of the external potential.}
}
The converse is also true:
{\em the external potential is uniquely determined by the density},
a result known as the Hohenberg-Kohn theorem.
The probability density is then determined by the external potential,
from which it follows that the probability density is a unique functional of the density.
We outline a proof of the Hohenberg-Kohn theorem in \appref{Hohenberg_Kohn}.
It is thereby permissible to work with the mass density rather than the probability density when considering the thermodynamics of the bubble.
Since the density is the quantity of interest in bubble nucleation, the density functional approach is much more direct.
%Re-expressing the grand potential as a functional of mass density rather than probability density
%does not get us any closer to being able to evaluate $\Omega$, however.
%So far the argument is standard from statistical physics.
%To make progress we must approximate the coupled interaction term, $\UE$.
%so that \eqnref{XiSeparate} can be solved.
%Here we assume that only the two particle interactions are important
%and write
%\begin{align}
%% \UE(\cx) \approx \Phi(\cx) = \sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j),
%\end{align}
%where $\phi(\vr_i, \vr_j)$ is the two particle potential between a particle at $r_i$ and $r_j$.
%At equilibrium the true joint probability density is the Boltzmann distribution
%\begin{align}
% p_0(\cx_1, \cp_1) &= \Xi^{-1} \exp\lr{-\beta \lr{\H_N - \mu N}}
%\end{align}
%
%The approximate Hamiltonian is then $H \equiv \KE + \Phi + \VE$,
%and is described by the approximate probability density, $p$.
The approximate density function is $\rho$, which defines an approximate grand potential $\Omega_V\lrs{\rho}$.
The task is then to find the distribution $\rho$ that comes closest to approximating $\rho_0$.
The {\em relative entropy} or {\em Kullback-Leibler divergence} gives the amount of information lost
when using the approximate distribution $p$ rather than the correct distribution $p_0$,
and is defined
\begin{align}
\KLD{p}{p_0} = \Tr p \log \frac{p}{p_0} \label{eqn:nuc:KLD}
\end{align}
$\KLD{p}{p_0} \ge 0$, which follows from Gibbs' inequality, with equality if and only if $p=p_0$.
We may therefore define
\begin{align}
\Omega_V\lrs{\rho} \equiv \beta^{-1}\KLD{p}{p_0}+ \Omega\lrs{\rho_0},
\end{align}
The approximate grand potential approaches the true value as the divergence vanishes,
and it is stationary with respect to $\rho$
at thermodynamic equilibrium,
which occurs at the critical radius.
Therefore,
condition \eqnref{DFT:astarR}
may be expressed\cite{Oxtoby1992}
\begin{align}
\frac{\delta \Omega_V}{\delta \rho} =0,\quad\text{at $\rho = \rhostar$.} \label{eqn:DFT:astar}
\end{align}
More generally, from this definition and \eqnref{nuc:KLD} we have
\begin{align}
\Omega_V\lrs{p} &= \beta^{-1} \Tr p \log \frac{p}{p_0} - \beta^{-1}\ln \Xi \\
&= \beta^{-1} \Tr p \log \frac{p}{e^{-\beta\lr{\H - \mu N}}}\\
% &= T S + \Tr p \lr{H_N - \mu N} \\
&= - T S_p + \H_p - \mu N_p \\
&= F_p - \mu N_p\\
&= \F + \int V_\ext d\rho - \int \mu d \rho.
\end{align}
where the subscript $p$ indicates an average with respect to the distribution $p$ so that $S_p$ is the entropy with respect to $p$,
\begin{align}
\F\lrs {\rho} = \beta^{-1}\Tr p \log\frac{p}{e^{-\beta \H_\in}} = \KE_p + \Phi_p - TS_p
\end{align}
and the labels `$\in$' and `$ext$' indicate the intrinsic and external parts of the Hamiltonian.
The energy $\Phi$ when $\rho = \rhostar$ (thermodynamic equilibrium)
may be evaluated through the functional derivative with respect to the pair potential,
\begin{align}
\frac{\delta \Phi}{\delta \phi(\vr_1, \vr_2) } &= - \beta^{-1} \frac{\delta \ln Z_{\Phi+\VE}}{\delta \phi(\vr_1, \vr_2)} \\
&= \frac{N(N-1)}{2 Z_{\Phi+\VE}}\int d\cx_1\, \phi(\vr_1,\vr_2)\, e^{-\beta\sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j)-\beta\sum_i^N V_\ext(\vr_i)} \\
&= \half \iint d\vr_1\, d\vr_2\, \phi(\vr_1,\vr_2)\, p^{(2)}(\vr_1,\vr_2)
\end{align}
%\begin{align}
% &= \frac{1}{N!}\frac{e^{-\beta (\KE -\mu N)}}{Z_\KE}\frac{e^{-\beta(\UE +\VE)}}{Z_{\UE+\VE}} \label{eqn:Jointpzero} %\equiv p_\KE p_{\lr{\UE + VE}}
%\end{align}
%and introduce radial distribution function
%\begin{align}
% g(r_{12}) = \frac{V^2}{N^2} P_2(r_1, r_2)
%\end{align}
%Since $\UE$ is a measured quantity, it is averaged equilibrium joint probability distribution, $p_0$,
%where $p_0$ is a Boltzmann distribution,
%\begin{align}
% p_0(\cx_1, \cp_1) = \Xi^{-1} \exp\lr{-\beta \lr{H_N - \mu N}}. \label{nuc:pzero}
%\end{align}
%Decoupling the interactions in $\H$ implies that the approximate Hamiltonian, $H$, is described by some likewise decoupled approximate probability distribution, $p$.
%The task is then to vary $p$ so that it matches $p_0$ as closely as possible. %, given its new structure, so that $H$ approaches $\H$.
%From \eqnref{nuc:pzero} $p_0$ is explicitly a function of $\V$.
% This is the
% When considering the thermodynamics of the bubble
% The density functional approach is entirely analogous to this next step but works directly with an approximate mass density, $\rho$, rather than the approximate probability density, $p$.
% The mass density is then varied directly to find the functional form that best matches the equilibrium density, $\rho_0$, and whence the equilibrium Hamiltonian, $\H$.
% Since the density is the term of interest bubble nucleation, the density functional approach is much more direct.
% \Dft\ works at all because
% \nlist{
% \item
% The mass density is a functional of the probability density, $\rho = \rho[p]$.
% This follows almost trivially from the fact that the equilibrium density is a measured quantity,
% and therefore an average over $p_0$.
% Denoting the average
% \begin{align}
% \rho\lrs{p} = \scalar{\rho(\cx_1, \cp_1) }_{p} \equiv \Tr p(\cx_1, \cp_1) \rho(\cx_1, \cp_1),
% \end{align}
% we find that $\rho\lrs{p_0} = \scalar{\rho(\cx_1, \cp_1) }_{p_0} = \rho_0$.
% Since the density is a functional of $p_0$, which is in turn a function of $\VE$,
% it follows that the {\em density is functional of the external potential.}
% \item
% The converse is also true:
% the external potential is uniquely determined by the density,
% a result known as the Hohenberg-Kohn theorem.
% The probability density is then determined by the external potential,
% from which it follows that the probability density is a unique functional of the density.
% We outline a proof of the Hohenberg-Kohn theorem in \appref{Hohenberg_Kohn}.
% %It is deferred to the appendix because the proof is not constructive.
% }
% While many approximations can be made to the internal energy,
% we here consider only
% A number of approximat
% The approximate Hamiltonian we there
%which we denote
%\begin{align}
% \UE = \scalar{U(\cx)}_{\rho_0} \equiv Tr p_0(\cx_1, \cp_1) \rho_0(\cx_1, \cp_1)
%\end{align}
%over the full joint distribution of
\subsection{Bubble Nucleation}
The density $\rho(r)$ should not be constrained other than to require that it approaches the bulk vapour density at large distances.
Then
\begin{align}
\frac{\delta \Omega_V}{\delta \rho(r)} = 0
\end{align}
at $\rho(r) = \rho^\ast(r)$.
The multidimensional free energy surface has a minimum at the uniform vapour density,
and a second, lower, minimum at the uniform liquid density. Between these lies a saddle point, found by setting the functional derivative to zero.
The matrix of second derivatives contains a negative eigenvalue corresponding to the direction of motion over the barrier.
The equilibrium gas-liquid interface is similar, except that the eigenvalue is zero rather than negative.
References on saddle points in functional space are given in \cite{Shen2003}.
Sufficiently far from coexistence the density in the bubble differs appreciably from that of the stable vapour,
by at least an order of magnitude (ref.~28 in \cite{Shen2003}).
The predicted energy barrier agrees well with \cnt\ in the vicinity of phase coexistence but vanishes at the spinodal.
For the nucleation theorem see ref.~74 in \cite{Shen2003}.
We have
\begin{align}
\Omega_V = F - \mu N = F - \mu \int dr \rho(r).
\end{align}
Then \begin{align}
\frac{\delta F}{\delta \rho(r)} = \mu
\end{align}
at $\rho(r) = \rho^\ast(r)$.
To make further progress we need to write the grand potential $\Omega$ as a functional of the density.
\subsection{Background}
To do so we consider Hamiltonians that are separated in terms of their intrinsic and external contributions
\begin{align}
\H = \H_\in + \H_\ext = \lr{\KE + \UE} + \VE
\end{align}
where
\sub{
\begin{align}
% \begin{array}{ll}
\KE &= \sum_i^N \frac{p_i^2}{2m} && \text{is the kinetic energy,}\label{eqn:nuc:Kinetic}\\
\UE &= U(\cx) && \text{is the internal energy, and}\\
\VE &= \sum_i^N V_\ext(r_i) && \text{is the external potential.}
% \end{array}
\end{align}
}
The internal energy depends upon the locations of the particles, which are denoted with
\sub{
\begin{align}
\cx_n &\equiv r_n,r_{n+1},\ldots,r_N, \quad\text{the set of $N-n+1$ particle spatial positions.}
\intertext{Similarly, the momentums of the particles are denoted}
\cp_n &\equiv p_n,p_{n+1},\ldots, p_N.
\end{align}
}
This notation is usefully extended by defining
\sub{
\begin{align}
d\cx_n &\equiv dr_n dr_{n+1}\ldots dr_N, &&\quad\text{and}&
d\cp_n &\equiv dp_n dp_{n+1}\ldots dp_N.
\end{align}
}
Both the energy, $\H$, and the equilibrium density, $\rho_0$, are measured quantities
and, as such, both are averages over the probability of the phase space locations of the $N$ particles, $p_0 = p_0(\cx,\cp)$.
The average with respect to $p_0$ is defined
\begin{align}
\rho_0 = \scalar{\rho_0(\cx_1, \cp_1) }_{p_0} \equiv \Tr p_0(\cx_1, \cp_1) \rho_0(\cx_1, \cp_1)
\end{align}
where $\Tr$ denotes the trace operator
\begin{align}
\Tr \equiv \sum_{N=0}^\infty \frac{1}{h^{3N}N!} \iint d\cx_1 d\cp_1.
\end{align}
The average energy is defined similarly.
The equilibrium joint probability density is the Boltzmann distribution
\begin{align}
p_0(\cx_1, \cp_1) = \Xi^{-1} \exp\lr{-\beta \lr{H_N - \mu N}}.
\end{align}
where
\begin{align}
\Xi = \Tr \exp\lr{-\beta \lr{H_N - \mu N}}. \label{eqn:nuc:GPF}
\end{align}
is called the {\em grand partition function}.
The grand potential then follows according to
\begin{align}
\Omega = -\beta^{-1}\ln \Xi.
\end{align}
We may eliminate the momentum terms from the grand partition function, equation \eqnref{nuc:GPF} immediately
\begin{align}
\Xi = \frac{n_Q^N}{N!} Z_U Z_V Z_N
\end{align}
where $n_Q = \lr{\frac{m}{2\pi\hbar^2 \beta}}^{3/2}$ is the {\em quantum concentration}
and
\begin{align}
Z_U Z_V Z_N = e^{\beta\mu N}\int d\cx_1\, e^{-\beta\lr{\UE+\VE}}
\end{align}
is the remaining configurational factor.
Since the density is a functional of $p_0$, which is in turn a function of $V_\ext$,
it follows that the density is a functional of the external potential.
The converse is also true:
the external potential is uniquely determined by the density,
a result known as the Hohenberg-Kohn theorem.
The probability density is then determined by the external potential,
from which it follows that the probability density is a unique functional of the density.
We outline a proof of the Hohenberg-Kohn theorem in \appref{Hohenberg_Kohn}.
It is deferred to the appendix because the proof is not constructive.
The approximate grand potential can therefore be expressed as a unique functional of the density,
\begin{align}
\Omega_V\lrs{\rho_0} &= F - \mu N \\
&= \lr{\KE + \UE - TS} + \int d\rho\lr{ V_\ext -\mu }\\
&= \F + \int d\rho\lr{ V_\ext -\mu }
\end{align}
where $\F$ is the intrinsic Helmholtz free energy.
The grand potential, through $U$, is a function of $p_0(\cx)$, the probability describing the locations of all $N$ particles.
The associated multi-particle interactions are complicated and difficult to model.
We decouple these interactions by introducing the approximate probability distribution $ p = p(\vr_i, \vr_j)$
- dependent now only on two particle interactions -
to evaluate our thermodynamic variables.
This in turn reduces $U$ to two particle interactions,
\begin{align}
\Phi = \sum_{j>i} \sum_i^N \phi(\vr_i, \vr_j).
\end{align}
Furthermore, we assume that the external potential $\VE$ influences each particle equally.
Therefore, our approximate Hamiltonian is
\begin{align}
\H = \KE + \Phi + NV_\ext.
\end{align}
The {\em relative entropy} or {\em Kullback-Leibler divergence} gives the amount of information lost
when using the approximate distribution $p$ rather than the correct distribution $p_0$,
and is defined
\begin{align}
\KLD{p}{p_0} = \Tr p \log \frac{p}{p_0} \label{eqn:nuc:KLD}
\end{align}
$\KLD{p}{p_0} \ge 0$, which follows from Gibbs inequality with equality if and only if $p=p_0$.
It is convenient for $p$ to be evaluated via a variational principle and so we define $\Omega_V\lrs{\rho}$ according to
\begin{align}
\Omega_V\lrs{\rho} = \beta^{-1}\KLD{p}{p_0}+ \Omega,
\end{align}
so that the approximate grand potential approaches the true value on application of a variational principle.
It then follows that
\begin{align}
\Omega_V\lrs{\rho} &= \beta^{-1} \Tr p \log \frac{p}{p_0} - \beta^{-1}\ln \Xi \\
&= \beta^{-1} \Tr p \log \frac{p}{e^{-\beta\lr{\H - \mu N}}}\\
% &= T S + \Tr p \lr{H_N - \mu N} \\
&= - T S_p + \H_p - \mu N_p \\
&= F_p - \mu N_p\\
&= \F + \int V_\ext d\rho - \int \mu d \rho.
\end{align}
where
\begin{align}
\F\lrs{\rho_0} = \beta^{-1}\Tr p \log\frac{p}{e^{-\beta\H_\in}} = \KE_p + \UE_p - TS_p
\end{align}
where the subscript $p$ indicates that the thermodynamic quantities are evaluated with the approximate $p$ rather than $p_0$.
We have
\begin{align}
F = \int dr \rho_0 V_\ext + \F\lrs{\rho_0}
\end{align}
and
\begin{align}
V_\ext + \mu_\in\lrs{\rho_0} = \mu
\end{align}
where
\begin{align}
\mu_\in \equiv \deltarho \F.
\end{align}
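As a standard special case for orientation (quoted here rather than derived), for a non-interacting system the intrinsic free energy is that of the ideal gas,
\begin{align}
\F_{\rm id}\lrs{\rho} = \beta^{-1}\int d\vr\, \rho(\vr)\lrs{\ln\lr{\rho(\vr)\lambda_T^3} - 1},
\end{align}
so that $\mu_\in = \beta^{-1}\ln\lr{\rho(\vr)\lambda_T^3}$, and the condition $V_\ext + \mu_\in = \mu$ recovers the barometric profile $\rho(\vr) = \lambda_T^{-3}\exp\lrs{\beta\lr{\mu - V_\ext(\vr)}}$.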
Integration of interaction potential.