Restricted Boltzmann machine parameters for a product distribution

A restricted Boltzmann machine (RBM) is a tool for modelling discrete probability distributions. Specifically, it's a collection of numbers (weights $W_{ij}$ and biases $a_i$, $b_j$), along with algorithms for training (setting the weights and biases to reflect the desired distribution) and sampling (obtaining samples that are distributed according to the desired distribution).

The joint probability distribution for an RBM is
$$
P(v, h) \propto e^{a^\mathrm{T} v + b^\mathrm{T} h + v^\mathrm{T} W h}.
$$
Here, the vector $v$ (resp. $h$) represents the visible (resp. hidden) units, and each element $v_i$ (resp. $h_i$) can take on the values $0$ and $1$. Pictorially, this looks like the following:

[Figure: Schematic of a general restricted Boltzmann machine. In this example, there are $N_v = 6$ visible units, $N_h = 3$ hidden units, and $N_v N_h = 18$ weights connecting them.]

The hidden units are an artifact of the model; the actual distribution being modelled is only for the visible units $v$. The effective distribution that we obtain from the RBM is the marginal distribution
$$
P(v) = \sum_h P(v, h),
$$
where the sum over $h$ is understood to run over all $2^{N_h}$ states of the hidden layer. The question that I have in mind is the following: What are the appropriate weights and biases so that $P(v)$ is a product of univariate distributions over $v_i$?
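
For a small RBM, the marginal can be computed directly by enumerating every hidden state. The following is a minimal sketch of that brute-force sum; the function names (`joint_weight`, `marginal`) and the parameter values are illustrative assumptions, not part of any particular RBM library.

```python
import itertools
import math

def joint_weight(v, h, a, b, W):
    """Unnormalized joint weight exp(a.v + b.h + v^T W h)."""
    exponent = sum(ai * vi for ai, vi in zip(a, v))
    exponent += sum(bj * hj for bj, hj in zip(b, h))
    exponent += sum(v[i] * W[i][j] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return math.exp(exponent)

def marginal(a, b, W):
    """Return {v: P(v)}, summing the joint over all 2**N_h hidden states."""
    n_v, n_h = len(a), len(b)
    unnormalized = {
        v: sum(joint_weight(v, h, a, b, W)
               for h in itertools.product((0, 1), repeat=n_h))
        for v in itertools.product((0, 1), repeat=n_v)
    }
    z = sum(unnormalized.values())  # normalization constant
    return {v: w / z for v, w in unnormalized.items()}

# Arbitrary illustrative parameters: N_v = 2 visible units, N_h = 3 hidden units.
a = [0.5, -1.0]
b = [0.2, 0.0, -0.3]
W = [[1.0, -0.5, 0.0],
     [0.3, 0.8, -1.2]]

P = marginal(a, b, W)
for v, p in sorted(P.items()):
    print(v, round(p, 4))
```

The cost grows as $2^{N_v + N_h}$, so this is only practical as a sanity check on toy models.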

We can express the marginal distribution as
$$
\begin{aligned}
P(v) &\propto \sum_h e^{a^\mathrm{T} v + b^\mathrm{T} h + v^\mathrm{T} W h} \\
&= e^{\sum_{i=1}^{N_v} a_i v_i} \sum_h e^{\sum_{j=1}^{N_h} (b^\mathrm{T} + v^\mathrm{T} W)_j h_j} \\
&= \left[ \prod_{i=1}^{N_v} e^{a_i v_i} \right] \left[ \prod_{j=1}^{N_h} \left( 1 + e^{b_j + (v^\mathrm{T} W)_j} \right) \right].
\end{aligned}
$$
The $v_i$ are in general coupled via the weights, so we will make a simplifying assumption: we require that the weight matrix $W$ have no more than one non-zero entry in each row and each column. Without loss of generality, we can place these entries on the diagonal and call them $W_i = W_{ii}$, since that only amounts to rearranging the visible and hidden units. If $N_v < N_h$, the weight matrix then has the form
$$
\begin{pmatrix}
W_1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & W_2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & W_3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & W_4 & 0 & 0 & 0
\end{pmatrix},
$$
if $N_v = N_h$, it has the form
$$
\begin{pmatrix}
W_1 & 0 & 0 & 0 \\
0 & W_2 & 0 & 0 \\
0 & 0 & W_3 & 0 \\
0 & 0 & 0 & W_4
\end{pmatrix},
$$
and if $N_v > N_h$, it has the form
$$
\begin{pmatrix}
W_1 & 0 & 0 & 0 \\
0 & W_2 & 0 & 0 \\
0 & 0 & W_3 & 0 \\
0 & 0 & 0 & W_4 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}.
$$
Thus, under this assumption only the first $\min(N_v, N_h)$ visible and hidden units are allowed to be connected, and only pairwise. For brevity, we will assume in the following that $N_v = N_h$, which looks like

[Figure: Schematic of a restricted Boltzmann machine with diagonal weights.]

This detangled layout already suggests to us that the visible units should act independently of one another.
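
The closed-form expression for the unnormalized marginal derived above (valid for a general weight matrix, before the diagonal assumption) can be checked against the explicit sum over hidden states. This is a sketch under assumed, arbitrary parameter values; `brute_force` and `closed_form` are hypothetical helper names.

```python
import itertools
import math

def brute_force(v, a, b, W):
    """Sum exp(a.v + b.h + v^T W h) over all hidden states h."""
    n_v, n_h = len(a), len(b)
    total = 0.0
    for h in itertools.product((0, 1), repeat=n_h):
        exponent = sum(a[i] * v[i] for i in range(n_v))
        exponent += sum(b[j] * h[j] for j in range(n_h))
        exponent += sum(v[i] * W[i][j] * h[j]
                        for i in range(n_v) for j in range(n_h))
        total += math.exp(exponent)
    return total

def closed_form(v, a, b, W):
    """prod_i exp(a_i v_i) * prod_j (1 + exp(b_j + (v^T W)_j))."""
    visible = math.exp(sum(ai * vi for ai, vi in zip(a, v)))
    hidden = 1.0
    for j, bj in enumerate(b):
        vW_j = sum(v[i] * W[i][j] for i in range(len(v)))
        hidden *= 1.0 + math.exp(bj + vW_j)
    return visible * hidden

# Arbitrary illustrative parameters with a dense (non-diagonal) weight matrix.
a = [0.5, -1.0]
b = [0.2, 0.0, -0.3]
W = [[1.0, -0.5, 0.0],
     [0.3, 0.8, -1.2]]

agree = all(
    math.isclose(brute_force(v, a, b, W), closed_form(v, a, b, W))
    for v in itertools.product((0, 1), repeat=len(a))
)
print("closed form matches brute force:", agree)
```

The closed form needs only $N_h$ terms per visible state instead of $2^{N_h}$, which is the practical payoff of carrying out the sum over $h$ analytically.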

That this is indeed the case is very easy to show rigorously. With a square and diagonal weight matrix, the marginal distribution simplifies to
$$
P(v) \propto \prod_{i=1}^{N_v} e^{a_i v_i} \left( 1 + e^{b_i + W_i v_i} \right),
$$
which is the product of univariate distributions, as desired:
$$
P(v) = \prod_{i=1}^{N_v} P_i(v_i).
$$
Thus, there is a straightforward answer to the question: It's sufficient (but possibly not strictly necessary) to have no off-diagonal weights.
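
The factorization can also be tested numerically. The sketch below computes $P(v)$ from the general closed-form marginal, extracts the single-unit marginals, and checks whether $P(v)$ equals their product. The helper names and parameter values are assumptions chosen for illustration; the second weight matrix adds one off-diagonal entry as a contrast case.

```python
import itertools
import math

def marginal(a, b, W):
    """Normalized P(v) from prod_i e^{a_i v_i} prod_j (1 + e^{b_j + (v^T W)_j})."""
    n_v, n_h = len(a), len(b)
    weights = {}
    for v in itertools.product((0, 1), repeat=n_v):
        w = math.exp(sum(a[i] * v[i] for i in range(n_v)))
        for j in range(n_h):
            w *= 1.0 + math.exp(b[j] + sum(v[i] * W[i][j] for i in range(n_v)))
        weights[v] = w
    z = sum(weights.values())
    return {v: w / z for v, w in weights.items()}

def is_product_distribution(P):
    """Check P(v) == prod_i p_i^{v_i} (1 - p_i)^{1 - v_i} for every state v."""
    n_v = len(next(iter(P)))
    # Single-unit marginals p_i = P_i(v_i = 1).
    p = [sum(prob for v, prob in P.items() if v[i] == 1) for i in range(n_v)]
    return all(
        math.isclose(prob, math.prod(p[i] if v[i] else 1.0 - p[i]
                                     for i in range(n_v)))
        for v, prob in P.items()
    )

a = [0.5, -1.0, 0.3]
b = [0.2, 0.0, -0.3]
W_diag = [[1.0, 0.0, 0.0],
          [0.0, -0.5, 0.0],
          [0.0, 0.0, 2.0]]

# Same weights plus one off-diagonal entry coupling visible unit 0 to hidden unit 1.
W_coupled = [row[:] for row in W_diag]
W_coupled[0][1] = 1.0

print("diagonal:", is_product_distribution(marginal(a, b, W_diag)))
print("coupled: ", is_product_distribution(marginal(a, b, W_coupled)))
```

With these particular values the diagonal case passes the check and the coupled case does not. That illustrates sufficiency only; it says nothing about whether some non-diagonal weight matrices might still give a product distribution.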

Because a univariate distribution over $0$ and $1$ is a Bernoulli distribution, we have decomposed the marginal distribution into a product of Bernoulli distributions, each specified by a single parameter
$$
p_i = P_i(v_i = 1) = \frac{e^{a_i} \left( 1 + e^{b_i + W_i} \right)}{e^{a_i} \left( 1 + e^{b_i + W_i} \right) + \left( 1 + e^{b_i} \right)} = \frac{1}{1 + \frac{1 + e^{b_i}}{e^{a_i} \left( 1 + e^{b_i + W_i} \right)}}.
$$
Hence, we see that we have three knobs to turn independently for each visible unit $v_i$:

| $x$ | $x \to -\infty$ | $x = 0$ | $x \to +\infty$ |
| --- | --- | --- | --- |
| $a_i$ | $p_i \to 0$ | | $p_i \to 1$ |
| $b_i$ | $p_i \to \frac{e^{a_i}}{1 + e^{a_i}}$ | | $p_i \to \frac{e^{a_i + W_i}}{1 + e^{a_i + W_i}}$ |
| $W_i$ | $p_i \to \frac{e^{a_i}}{1 + e^{a_i} + e^{b_i}}$ | $p_i = \frac{e^{a_i}}{1 + e^{a_i}}$ | $p_i \to 1$ |
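
The limiting values in the table can be spot-checked by evaluating $p_i$ at large positive and negative arguments. This is a rough sketch: `bernoulli_p` is a hypothetical helper implementing the expression for $p_i$, the base values of `a`, `b`, `w` are arbitrary, and `big = 40.0` is an assumed stand-in for infinity that stays within floating-point range.

```python
import math

def bernoulli_p(a, b, w):
    """p_i = 1 / (1 + (1 + e^b) / (e^a (1 + e^(b + w))))."""
    ratio = (1.0 + math.exp(b)) / (math.exp(a) * (1.0 + math.exp(b + w)))
    return 1.0 / (1.0 + ratio)

def sigmoid(x):
    """e^x / (1 + e^x), written in the equivalent form 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

a, b, w = 0.4, -0.7, 1.5   # arbitrary base values
big = 40.0                 # stand-in for "infinity"

# Each entry: (description, numerically evaluated p_i, limit from the table).
checks = [
    ("a -> -inf", bernoulli_p(-big, b, w), 0.0),
    ("a -> +inf", bernoulli_p(+big, b, w), 1.0),
    ("b -> -inf", bernoulli_p(a, -big, w), sigmoid(a)),
    ("b -> +inf", bernoulli_p(a, +big, w), sigmoid(a + w)),
    ("W -> -inf", bernoulli_p(a, b, -big),
     math.exp(a) / (1.0 + math.exp(a) + math.exp(b))),
    ("W  =  0  ", bernoulli_p(a, b, 0.0), sigmoid(a)),
    ("W -> +inf", bernoulli_p(a, b, +big), 1.0),
]

for label, value, limit in checks:
    print(f"{label}: p = {value:.6f}, table limit = {limit:.6f}")
```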

These results make intuitive sense:

The last of these observations corresponds to the trivial layout where we have removed the hidden units entirely.

[Figure: Schematic of a restricted Boltzmann machine with no hidden units.]