diff --git a/doc/pub/week9/html/._week9-bs012.html b/doc/pub/week9/html/._week9-bs012.html index 7610e62a..c945828b 100644 --- a/doc/pub/week9/html/._week9-bs012.html +++ b/doc/pub/week9/html/._week9-bs012.html @@ -657,6 +657,8 @@
The code here shows the evolution of \( \kappa_d \) as a function of \( d \) for a series of random numbers. We see that the function \( \kappa_d \) approaches \( 0 \) as \( d\rightarrow \infty \).
+Note: code will be inserted here later.
+
+ + +
+ +
+ +
+ + +
Note, these notes contain additional material on optimization and parallelization. Parts of this material will be discussed this week.
+ ++ +
+ +
+ + +
+ +
+ +
+ + +
+ +
+ +
+ + +
As you will see below, due to correlations between various measurements, we need to evaluate the so-called covariance in order to establish a proper evaluation of the total variance and thereby the standard deviation of a given expectation value.
The covariance, however, leads to an evaluation of a double sum over the various stochastic variables. This quickly becomes computationally too expensive to evaluate. Methods like the Bootstrap, the Jackknife and/or Blocking allow us to circumvent this problem.
+ ++ +
+ +
+ + +
Last week we derived the central limit theorem with the following assumptions:
We assumed that each individual measurement \( x_{ij} \) is represented by stochastic variables which are independent and identically distributed (iid). This defined the sample mean of experiment \( i \) with \( n \) samples as
+$$ +\overline{x}_i=\frac{1}{n}\sum_{j} x_{ij}. +$$ + +and the sample variance
+$$ +\sigma^2_i=\frac{1}{n}\sum_{j} \left(x_{ij}-\overline{x}_i\right)^2. +$$ ++ +
+ +
+ + +
Note that we use \( n \) instead of \( n-1 \) in the definition of +variance. The sample variance and the sample mean are not necessarily equal to +the exact values we would get if we knew the corresponding probability +distribution. +
+ ++ +
+ +
+ + +
With the assumption that the sample means \( \overline{x}_i \) are also iid stochastic variables with the same probability function \( p \), we defined the total average over \( m \) experiments as
+$$ +\overline{X}=\frac{1}{m}\sum_{i} \overline{x}_{i}. +$$ + +and the total variance
+$$ +\sigma^2_{m}=\frac{1}{m}\sum_{i} \left( \overline{x}_{i}-\overline{X}\right)^2. +$$ +These are the quantities we used in showing that if the individual mean values are iid stochastic variables, then in the limit \( m\rightarrow \infty \), the distribution for \( \overline{X} \) is given by a Gaussian distribution with variance \( \sigma^2_m \).
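As a small illustration (our own sketch, not part of the original notes; the synthetic data and variable names are arbitrary), the snippet below generates \( m \) experiments with \( n \) iid samples each and computes the quantities defined above.

import numpy as np

# A minimal sketch: m experiments, each with n iid samples
rng = np.random.default_rng(2024)
m, n = 100, 1000
x = rng.normal(size=(m, n))      # x[i, j] is measurement j of experiment i

xbar_i   = x.mean(axis=1)        # sample mean of each experiment
sigma2_i = x.var(axis=1)         # sample variance with the 1/n normalization
Xbar     = xbar_i.mean()         # total average over the m experiments
sigma2_m = xbar_i.var()          # total variance of the sample means

print(f"total average = {Xbar:.5f}")
print(f"sigma^2_m     = {sigma2_m:.5f}")
print(f"sigma^2/n     = {x.var()/n:.5f}   # what we expect for iid samples")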
+ ++ +
+ +
+ + +
The total sample variance over the \( mn \) measurements is defined as
$$
\sigma^2=\frac{1}{mn}\sum_{i=1}^{m} \sum_{j=1}^{n}\left(x_{ij}-\overline{X}\right)^2.
$$

From the definition of the sample mean \( \overline{x}_i \) we have
$$
\overline{x}_i-\overline{X}=\frac{1}{n}\sum_{j=1}^{n}\left(x_{ij}-\overline{X}\right),
$$

and introducing the centered value \( \tilde{x}_{ij}=x_{ij}-\overline{X} \), we can rewrite \( \sigma_m^2 \) as
$$
\sigma^2_{m}=\frac{1}{m}\sum_{i} \left( \overline{x}_{i}-\overline{X}\right)^2=\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{1}{n}\sum_{j=1}^{n}\tilde{x}_{ij}\right]^2.
$$
+ +
+ + +
We can rewrite the latter in terms of a sum over diagonal elements only and another sum which contains the non-diagonal elements
$$
\begin{align*}
\sigma^2_{m}& =\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{1}{n}\sum_{j=1}^{n}\tilde{x}_{ij}\right]^2 \\
 & = \frac{1}{mn^2}\sum_{i=1}^{m} \sum_{j=1}^{n}\tilde{x}_{ij}^2+\frac{2}{mn^2}\sum_{i=1}^{m} \sum_{j < k}^{n}\tilde{x}_{ij}\tilde{x}_{ik}.
\end{align*}
$$

The first term on the last rhs is nothing but the total sample variance \( \sigma^2 \) divided by \( n \). The second term represents the covariance.
+ ++ +
+ +
+ + +
Using the definition of the total sample variance we have
$$
\begin{align*}
\sigma^2_{m}& = \frac{\sigma^2}{n}+\frac{2}{mn^2}\sum_{i=1}^{m} \sum_{j < k}^{n}\tilde{x}_{ij}\tilde{x}_{ik}.
\end{align*}
$$

The first term is what we have used till now in order to estimate the standard deviation. However, the second term, which gives us a measure of the correlations between different stochastic events, can result in contributions which give rise to a larger standard deviation and variance \( \sigma_m^2 \). Note also that the evaluation of the second term leads to a double sum over all events. If we run a VMC calculation with say \( 10^9 \) Monte Carlo samples, the latter term would lead to \( 10^{18} \) function evaluations. For obvious reasons, we do not want to venture into that many evaluations.
Note also that if our stochastic events are iid, then the covariance term is zero.
+ ++ +
+ +
+ + +
We introduce now a variable \( d=\vert j-k\vert \) and rewrite
+$$ +\frac{2}{mn^2}\sum_{i=1}^{m} \sum_{j < k}^{n}\tilde{x}_{ij}\tilde{x}_{ik}, +$$ + +in terms of a function
$$
f_d=\frac{1}{mn}\sum_{i=1}^{m} \sum_{k=1}^{n-d}\tilde{x}_{ik}\tilde{x}_{i(k+d)}.
$$

We note that for \( d=0 \) we have
$$
f_0=\frac{1}{mn}\sum_{i=1}^{m} \sum_{k=1}^{n}\tilde{x}_{ik}\tilde{x}_{ik}=\sigma^2!
$$
+ +
+ + +
We introduce then a correlation function \( \kappa_d=f_d/\sigma^2 \). Note that \( \kappa_0 =1 \). We rewrite the variance \( \sigma_m^2 \) as
$$
\begin{align*}
\sigma^2_{m}& = \frac{\sigma^2}{n}\left[1+2\sum_{d=1}^{n-1} \kappa_d\right].
\end{align*}
$$

The code here shows the evolution of \( \kappa_d \) as a function of \( d \) for a series of random numbers. We see that the function \( \kappa_d \) approaches \( 0 \) as \( d\rightarrow \infty \).
+ +Note: code will be inserted here later.
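In the meantime, here is a minimal sketch of such a calculation (our own illustration, not the code referred to above), estimating \( \kappa_d \) for a single series (\( m=1 \)) of uncorrelated random numbers.

import numpy as np

# A minimal sketch: estimate kappa_d = f_d/sigma^2 for one series of random numbers
rng = np.random.default_rng(42)
n = 10000
x = rng.normal(size=n)            # uncorrelated samples, so kappa_d should die off quickly
xt = x - x.mean()                 # centered values
sigma2 = np.mean(xt**2)           # sample variance with the 1/n normalization

def kappa(d):
    # autocorrelation function at separation d
    return np.sum(xt[:n-d]*xt[d:])/(n*sigma2)

for d in (0, 1, 2, 5, 10, 100):
    print(f"kappa_{d} = {kappa(d): .4f}")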
+ ++ +
+ +
+ + +
The blocking method was made popular by Flyvbjerg and Petersen (1989) and has become one of the standard ways to estimate the variance \( \mathrm{var}(\widehat{\theta}) \) for exactly one estimator \( \widehat{\theta} \), namely \( \widehat{\theta} = \overline{X} \), the mean value.
+ +Assume \( n = 2^d \) for some integer \( d>1 \) and \( X_1,X_2,\cdots, X_n \) is a stationary time series to begin with. +Moreover, assume that the series is asymptotically uncorrelated. We switch to vector notation by arranging \( X_1,X_2,\cdots,X_n \) in an \( n \)-tuple. Define: +
+$$ +\begin{align*} +\hat{X} = (X_1,X_2,\cdots,X_n). +\end{align*} +$$ + + ++ +
+ +
+ + +
The strength of the blocking method appears when the number of observations \( n \) is large. For large \( n \), the complexity of dependent bootstrapping scales poorly, but the blocking method does not; moreover, it becomes more accurate the larger \( n \) is.
+ ++ +
+ +
+ + +
We now define the blocking transformations. The idea is to take the mean of subsequent pairs of elements from \( \boldsymbol{X} \) and form a new vector \( \boldsymbol{X}_1 \). Continuing in the same way by taking the mean of subsequent pairs of elements of \( \boldsymbol{X}_1 \) we obtain \( \boldsymbol{X}_2 \), and so on. Define \( \boldsymbol{X}_i \) recursively by:
+ +$$ +\begin{align} +(\boldsymbol{X}_0)_k &\equiv (\boldsymbol{X})_k \nonumber \\ +(\boldsymbol{X}_{i+1})_k &\equiv \frac{1}{2}\Big( (\boldsymbol{X}_i)_{2k-1} + +(\boldsymbol{X}_i)_{2k} \Big) \qquad \text{for all} \qquad 1 \leq i \leq d-1 +\tag{1} +\end{align} +$$ + + ++ +
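As a small illustration (our own sketch, assuming the observations are stored in a NumPy array of length \( 2^d \)), one blocking transformation is simply an average over subsequent pairs of elements.

import numpy as np

# One blocking transformation: average subsequent pairs of elements
def blocking_step(X):
    return 0.5*(X[0::2] + X[1::2])

X  = np.arange(8, dtype=float)    # toy data with n = 2^3
X1 = blocking_step(X)             # [0.5, 2.5, 4.5, 6.5]
X2 = blocking_step(X1)            # [1.5, 5.5]
print(X1, X2)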
+ +
+ + +
The quantity \( \boldsymbol{X}_k \) is subject to \( k \) blocking transformations. We now have \( d \) vectors \( \boldsymbol{X}_0, \boldsymbol{X}_1,\cdots,\boldsymbol{X}_{d-1} \) containing the subsequent averages of observations. It turns out that if the components of \( \boldsymbol{X} \) form a stationary time series, then the components of \( \boldsymbol{X}_i \) form a stationary time series for all \( 0 \leq i \leq d-1 \).
+ +We can then compute the autocovariance, the variance, sample mean, and +number of observations for each \( i \). +Let \( \gamma_i, \sigma_i^2, +\overline{X}_i \) denote the covariance, variance and average of the +elements of \( \boldsymbol{X}_i \) and let \( n_i \) be the number of elements of +\( \boldsymbol{X}_i \). It follows by induction that \( n_i = n/2^i \). +
+ ++ +
+ +
+ + +
Using the definition of the blocking transformation and the distributive property of the covariance, it is clear that, with \( h =|i-j| \), we can write
$$
\begin{align}
\gamma_{k+1}(h) &= \mathrm{cov}\left( ({X}_{k+1})_{i}, ({X}_{k+1})_{j} \right) \nonumber \\
&= \frac{1}{4}\mathrm{cov}\left( ({X}_{k})_{2i-1} + ({X}_{k})_{2i}, ({X}_{k})_{2j-1} + ({X}_{k})_{2j} \right) \nonumber \\
&= \frac{1}{2}\gamma_{k}(2h) + \frac{1}{2}\gamma_k(2h+1) \quad h = 0
\tag{2}\\
&=\frac{1}{4}\gamma_k(2h-1) + \frac{1}{2}\gamma_k(2h) + \frac{1}{4}\gamma_k(2h+1) \quad \text{else}
\tag{3}
\end{align}
$$

Since \( \hat{X} \) is asymptotically uncorrelated by assumption, \( \hat{X}_k \) is also asymptotically uncorrelated. Let's turn our attention to the variance of the sample mean \( \mathrm{var}(\overline{X}) \).
+ ++ +
+ +
+ + +
We have
+$$ +\begin{align} +\mathrm{var}(\overline{X}_k) = \frac{\sigma_k^2}{n_k} + \underbrace{\frac{2}{n_k} \sum_{h=1}^{n_k-1}\left( 1 - \frac{h}{n_k} \right)\gamma_k(h)}_{\equiv e_k} = \frac{\sigma^2_k}{n_k} + e_k \quad \text{if} \quad \gamma_k(0) = \sigma_k^2. +\tag{4} +\end{align} +$$ + +The term \( e_k \) is called the truncation error:
+$$ +\begin{equation} +e_k = \frac{2}{n_k} \sum_{h=1}^{n_k-1}\left( 1 - \frac{h}{n_k} \right)\gamma_k(h). +\tag{5} +\end{equation} +$$ + +We can show that \( \mathrm{var}(\overline{X}_i) = \mathrm{var}(\overline{X}_j) \) for all \( 0 \leq i \leq d-1 \) and \( 0 \leq j \leq d-1 \).
+ ++ +
+ +
+ + +
We can then wrap up
+$$ +\begin{align} +n_{j+1} \overline{X}_{j+1} &= \sum_{i=1}^{n_{j+1}} (\hat{X}_{j+1})_i = \frac{1}{2}\sum_{i=1}^{n_{j}/2} (\hat{X}_{j})_{2i-1} + (\hat{X}_{j})_{2i} \nonumber \\ +&= \frac{1}{2}\left[ (\hat{X}_j)_1 + (\hat{X}_j)_2 + \cdots + (\hat{X}_j)_{n_j} \right] = \underbrace{\frac{n_j}{2}}_{=n_{j+1}} \overline{X}_j = n_{j+1}\overline{X}_j. +\tag{6} +\end{align} +$$ + +By repeated use of this equation we get \( \mathrm{var}(\overline{X}_i) = \mathrm{var}(\overline{X}_0) = \mathrm{var}(\overline{X}) \) for all \( 0 \leq i \leq d-1 \). This has the consequence that
+$$ +\begin{align} +\mathrm{var}(\overline{X}) = \frac{\sigma_k^2}{n_k} + e_k \qquad \text{for all} \qquad 0 \leq k \leq d-1. \tag{7} +\end{align} +$$ + + ++ +
+ +
+ + +
Flyvbjerg and Petersen demonstrated that the sequence \( \{e_k\}_{k=0}^{d-1} \) is decreasing, and conjectured that the term \( e_k \) can be made as small as we would like by making \( k \) (and hence \( d \)) sufficiently large. This means we can apply blocking transformations until \( e_k \) is sufficiently small, and then estimate \( \mathrm{var}(\overline{X}) \) by \( \widehat{\sigma}^2_k/n_k \).
+ +For an elegant solution and proof of the blocking method, see the recent article of Marius Jonsson (former MSc student of the Computational Physics group).
+ ++ +
+ +
+ + +
# 2-electron VMC code for 2dim quantum dot with importance sampling
+# Using gaussian rng for new positions and Metropolis- Hastings
+# Added energy minimization
+from math import exp, sqrt
+from random import random, seed, normalvariate
+import numpy as np
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import Axes3D
+from matplotlib import cm
+from matplotlib.ticker import LinearLocator, FormatStrFormatter
+from scipy.optimize import minimize
+import sys
+import os
+
+# Where to save data files
+PROJECT_ROOT_DIR = "Results"
+DATA_ID = "Results/EnergyMin"
+
+if not os.path.exists(PROJECT_ROOT_DIR):
+ os.mkdir(PROJECT_ROOT_DIR)
+
+if not os.path.exists(DATA_ID):
+ os.makedirs(DATA_ID)
+
+def data_path(dat_id):
+ return os.path.join(DATA_ID, dat_id)
+
+outfile = open(data_path("Energies.dat"),'w')
+
+
+# Trial wave function for the 2-electron quantum dot in two dims
+def WaveFunction(r,alpha,beta):
+ r1 = r[0,0]**2 + r[0,1]**2
+ r2 = r[1,0]**2 + r[1,1]**2
+ r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
+ deno = r12/(1+beta*r12)
+ return exp(-0.5*alpha*(r1+r2)+deno)
+
+# Local energy for the 2-electron quantum dot in two dims, using analytical local energy
+def LocalEnergy(r,alpha,beta):
+
+ r1 = (r[0,0]**2 + r[0,1]**2)
+ r2 = (r[1,0]**2 + r[1,1]**2)
+ r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
+ deno = 1.0/(1+beta*r12)
+ deno2 = deno*deno
+ return 0.5*(1-alpha*alpha)*(r1 + r2) +2.0*alpha + 1.0/r12+deno2*(alpha*r12-deno2+2*beta*deno-1.0/r12)
+
+# Derivate of wave function ansatz as function of variational parameters
+def DerivativeWFansatz(r,alpha,beta):
+
+ WfDer = np.zeros((2), np.double)
+ r1 = (r[0,0]**2 + r[0,1]**2)
+ r2 = (r[1,0]**2 + r[1,1]**2)
+ r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
+ deno = 1.0/(1+beta*r12)
+ deno2 = deno*deno
+ WfDer[0] = -0.5*(r1+r2)
+ WfDer[1] = -r12*r12*deno2
+ return WfDer
+
+# Setting up the quantum force for the two-electron quantum dot, recall that it is a vector
+def QuantumForce(r,alpha,beta):
+
+ qforce = np.zeros((NumberParticles,Dimension), np.double)
+ r12 = sqrt((r[0,0]-r[1,0])**2 + (r[0,1]-r[1,1])**2)
+ deno = 1.0/(1+beta*r12)
+ qforce[0,:] = -2*r[0,:]*alpha*(r[0,:]-r[1,:])*deno*deno/r12
+ qforce[1,:] = -2*r[1,:]*alpha*(r[1,:]-r[0,:])*deno*deno/r12
+ return qforce
+
+
+# Computing the derivative of the energy and the energy
+def EnergyDerivative(x0):
+
+
+ # Parameters in the Fokker-Planck simulation of the quantum force
+ D = 0.5
+ TimeStep = 0.05
+ # positions
+ PositionOld = np.zeros((NumberParticles,Dimension), np.double)
+ PositionNew = np.zeros((NumberParticles,Dimension), np.double)
+ # Quantum force
+ QuantumForceOld = np.zeros((NumberParticles,Dimension), np.double)
+ QuantumForceNew = np.zeros((NumberParticles,Dimension), np.double)
+
+ energy = 0.0
+ DeltaE = 0.0
+ alpha = x0[0]
+ beta = x0[1]
+ EnergyDer = 0.0
+ DeltaPsi = 0.0
+ DerivativePsiE = 0.0
+ #Initial position
+ for i in range(NumberParticles):
+ for j in range(Dimension):
+ PositionOld[i,j] = normalvariate(0.0,1.0)*sqrt(TimeStep)
+ wfold = WaveFunction(PositionOld,alpha,beta)
+ QuantumForceOld = QuantumForce(PositionOld,alpha, beta)
+
+ #Loop over MC MCcycles
+ for MCcycle in range(NumberMCcycles):
+ #Trial position moving one particle at the time
+ for i in range(NumberParticles):
+ for j in range(Dimension):
+ PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
+ QuantumForceOld[i,j]*TimeStep*D
+ wfnew = WaveFunction(PositionNew,alpha,beta)
+ QuantumForceNew = QuantumForce(PositionNew,alpha, beta)
+ GreensFunction = 0.0
+ for j in range(Dimension):
+ GreensFunction += 0.5*(QuantumForceOld[i,j]+QuantumForceNew[i,j])*\
+ (D*TimeStep*0.5*(QuantumForceOld[i,j]-QuantumForceNew[i,j])-\
+ PositionNew[i,j]+PositionOld[i,j])
+
+ GreensFunction = exp(GreensFunction)
+ ProbabilityRatio = GreensFunction*wfnew**2/wfold**2
+ #Metropolis-Hastings test to see whether we accept the move
+ if random() <= ProbabilityRatio:
+ for j in range(Dimension):
+ PositionOld[i,j] = PositionNew[i,j]
+ QuantumForceOld[i,j] = QuantumForceNew[i,j]
+ wfold = wfnew
+ DeltaE = LocalEnergy(PositionOld,alpha,beta)
+ DerPsi = DerivativeWFansatz(PositionOld,alpha,beta)
+ DeltaPsi += DerPsi
+ energy += DeltaE
+ DerivativePsiE += DerPsi*DeltaE
+
+ # We calculate mean values
+ energy /= NumberMCcycles
+ DerivativePsiE /= NumberMCcycles
+ DeltaPsi /= NumberMCcycles
+ EnergyDer = 2*(DerivativePsiE-DeltaPsi*energy)
+ return EnergyDer
+
+
+# Computing the expectation value of the local energy
+def Energy(x0):
+ # Parameters in the Fokker-Planck simulation of the quantum force
+ D = 0.5
+ TimeStep = 0.05
+ # positions
+ PositionOld = np.zeros((NumberParticles,Dimension), np.double)
+ PositionNew = np.zeros((NumberParticles,Dimension), np.double)
+ # Quantum force
+ QuantumForceOld = np.zeros((NumberParticles,Dimension), np.double)
+ QuantumForceNew = np.zeros((NumberParticles,Dimension), np.double)
+
+ energy = 0.0
+ DeltaE = 0.0
+ alpha = x0[0]
+ beta = x0[1]
+ #Initial position
+ for i in range(NumberParticles):
+ for j in range(Dimension):
+ PositionOld[i,j] = normalvariate(0.0,1.0)*sqrt(TimeStep)
+ wfold = WaveFunction(PositionOld,alpha,beta)
+ QuantumForceOld = QuantumForce(PositionOld,alpha, beta)
+
+ #Loop over MC MCcycles
+ for MCcycle in range(NumberMCcycles):
+ #Trial position moving one particle at the time
+ for i in range(NumberParticles):
+ for j in range(Dimension):
+ PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
+ QuantumForceOld[i,j]*TimeStep*D
+ wfnew = WaveFunction(PositionNew,alpha,beta)
+ QuantumForceNew = QuantumForce(PositionNew,alpha, beta)
+ GreensFunction = 0.0
+ for j in range(Dimension):
+ GreensFunction += 0.5*(QuantumForceOld[i,j]+QuantumForceNew[i,j])*\
+ (D*TimeStep*0.5*(QuantumForceOld[i,j]-QuantumForceNew[i,j])-\
+ PositionNew[i,j]+PositionOld[i,j])
+
+ GreensFunction = exp(GreensFunction)
+ ProbabilityRatio = GreensFunction*wfnew**2/wfold**2
+ #Metropolis-Hastings test to see whether we accept the move
+ if random() <= ProbabilityRatio:
+ for j in range(Dimension):
+ PositionOld[i,j] = PositionNew[i,j]
+ QuantumForceOld[i,j] = QuantumForceNew[i,j]
+ wfold = wfnew
+ DeltaE = LocalEnergy(PositionOld,alpha,beta)
+ energy += DeltaE
+ if Printout:
+ outfile.write('%f\n' %(energy/(MCcycle+1.0)))
+ # We calculate mean values
+ energy /= NumberMCcycles
+ return energy
+
+#Here starts the main program with variable declarations
+NumberParticles = 2
+Dimension = 2
+# seed for rng generator
+seed()
+# Monte Carlo cycles for parameter optimization
+Printout = False
+NumberMCcycles= 10000
+# guess for variational parameters
+x0 = np.array([0.9,0.2])
+# Using Broydens method to find optimal parameters
+res = minimize(Energy, x0, method='BFGS', jac=EnergyDerivative, options={'gtol': 1e-4,'disp': True})
+x0 = res.x
+# Compute the energy again with the optimal parameters and increased number of Monte Cycles
+NumberMCcycles= 2**19
+Printout = True
+FinalEnergy = Energy(x0)
+EResult = np.array([FinalEnergy,FinalEnergy])
+outfile.close()
+#nice printout with Pandas
+import pandas as pd
+from pandas import DataFrame
+data ={'Optimal Parameters':x0, 'Final Energy':EResult}
+frame = pd.DataFrame(data)
+print(frame)
+
++ +
+ +
+ + +
The next step is then to use the above data sets and perform a resampling analysis using the blocking method. The blocking code, based on the article of Marius Jonsson, is given here.
+ + + +# Common imports
+import os
+
+# Where to save the figures and data files
+DATA_ID = "Results/EnergyMin"
+
+def data_path(dat_id):
+ return os.path.join(DATA_ID, dat_id)
+
+infile = open(data_path("Energies.dat"),'r')
+
+from numpy import log2, zeros, mean, var, sum, loadtxt, arange, array, cumsum, dot, transpose, diagonal, sqrt
+from numpy.linalg import inv
+
+def block(x):
+ # preliminaries
+ n = len(x)
+ d = int(log2(n))
+ s, gamma = zeros(d), zeros(d)
+ mu = mean(x)
+
+ # estimate the auto-covariance and variances
+ # for each blocking transformation
+ for i in arange(0,d):
+ n = len(x)
+ # estimate autocovariance of x
+ gamma[i] = (n)**(-1)*sum( (x[0:(n-1)]-mu)*(x[1:n]-mu) )
+ # estimate variance of x
+ s[i] = var(x)
+ # perform blocking transformation
+ x = 0.5*(x[0::2] + x[1::2])
+
+ # generate the test observator M_k from the theorem
+ M = (cumsum( ((gamma/s)**2*2**arange(1,d+1)[::-1])[::-1] ) )[::-1]
+
+ # we need a list of magic numbers
+ q =array([6.634897,9.210340, 11.344867, 13.276704, 15.086272, 16.811894, 18.475307, 20.090235, 21.665994, 23.209251, 24.724970, 26.216967, 27.688250, 29.141238, 30.577914, 31.999927, 33.408664, 34.805306, 36.190869, 37.566235, 38.932173, 40.289360, 41.638398, 42.979820, 44.314105, 45.641683, 46.962942, 48.278236, 49.587884, 50.892181])
+
+ # use magic to determine when we should have stopped blocking
+ for k in arange(0,d):
+ if(M[k] < q[k]):
+ break
+ if (k >= d-1):
+ print("Warning: Use more data")
+ return mu, s[k]/2**(d-k)
+
+
+x = loadtxt(infile)
+(mean, var) = block(x)
+std = sqrt(var)
+import pandas as pd
+from pandas import DataFrame
+data ={'Mean':[mean], 'STDev':[std]}
+frame = pd.DataFrame(data,index=['Values'])
+print(frame)
+
++ +
+ +
+ + +
+ +
+ +
+ + +
Till now we have not paid much attention to speed and the optimization possibilities inherent in the various compilers. We have compiled and linked as
+ + +c++ -c mycode.cpp
+c++ -o mycode.exe mycode.o
+
For Fortran, replace c++ with for example gfortran or ifort. This is what we call a flat compiler option and should be used when we develop the code. It normally produces a very large and slow code when translated to machine instructions. We use this option for debugging and for establishing the correct program output because every operation is done precisely as the user specified it.
+ +It is instructive to look up the compiler manual for further instructions by writing
+ + +man c++
+
++ +
+ +
+ + +
We have additional compiler options for optimization. These may include procedure inlining where performance may be improved, moving constants inside loops outside the loop, identifying potential parallelism, including automatic vectorization or replacing a division with a reciprocal and a multiplication if this speeds up the code.
+ + +c++ -O3 -c mycode.cpp
+c++ -O3 -o mycode.exe mycode.o
+
+This (other options are -O2 or -Ofast) is the recommended option.
++ +
+ +
+ + +
It is also useful to profile your program under the development stage. +You would then compile with +
+ + +c++ -pg -O3 -c mycode.cpp
+c++ -pg -O3 -o mycode.exe mycode.o
+
+After you have run the code you can obtain the profiling information via
+ + +gprof mycode.exe > ProfileOutput
+
When you have properly profiled your code, you should remove this option as it slows down performance. For memory tests use valgrind. An excellent environment for all these aspects, and much more, is Qt Creator.
++ +
+ +
+ + +
Adding debugging options is a very useful alternative under the development stage of a program. +You would then compile with +
+ + +c++ -g -O0 -c mycode.cpp
+c++ -g -O0 -o mycode.exe mycode.o
+
+This option generates debugging information allowing you to trace for example if an array is properly allocated. Some compilers work best with the no optimization option -O0.
+Depending on the compiler, one can add flags which generate code that catches integer overflow errors. +The flag -ftrapv does this for the CLANG compiler on OS X operating systems. +
++ +
+ +
+ + +
In general, irrespective of compiler options, it is useful to
+Here is an example of a part of a program where specific operations lead to a slower code
+ + +k = n-1;
+for (i = 0; i < n; i++){
+ a[i] = b[i] +c*d;
+ e = g[k];
+}
+
+A better code is
+ + +temp = c*d;
+for (i = 0; i < n; i++){
+ a[i] = b[i] + temp;
+}
+e = g[n-1];
+
+Here we avoid a repeated multiplication inside a loop. +Most compilers, depending on compiler flags, identify and optimize such bottlenecks on their own, without requiring any particular action by the programmer. However, it is always useful to single out and avoid code examples like the first one discussed here. +
++ +
+ +
+ + +
Present CPUs are highly parallel processors with varying levels of parallelism. The typical situation can be described via the following three statements.
+Before we proceed with a more detailed discussion of topics like vectorization and parallelization, we need to remind ourselves about some basic features of different hardware models.
++ +
+ +
+ + +
+ +
+ +
+ + +
One way of categorizing modern parallel computers is to look at the memory configuration.
+The CPUs are connected by some network and may exchange messages.
++ +
+ +
+ + +
+ +
+ +
+ + +
+ +
+ +
+ + +
Vectorization is a special case of Single Instruction Multiple Data (SIMD) processing, where a single instruction stream is capable of operating on multiple data elements in parallel. We can think of vectorization as the unrolling of loops accompanied with SIMD instructions.
+ +Vectorization is the process of converting an algorithm that performs scalar operations +(typically one operation at the time) to vector operations where a single operation can refer to many simultaneous operations. +Consider the following example +
+ + +for (i = 0; i < n; i++){
+ a[i] = b[i] + c[i];
+}
+
+If the code is not vectorized, the compiler will simply start with the first element and +then perform subsequent additions operating on one address in memory at the time. +
+ ++ +
+ +
+ + +
A SIMD instruction can operate on multiple data elements in one single instruction. +It uses the so-called 128-bit SIMD floating-point register. +In this sense, vectorization adds some form of parallelism since one instruction is applied +to many parts of say a vector. +
+ +The number of elements which can be operated on in parallel +range from four single-precision floating point data elements in so-called +Streaming SIMD Extensions and two double-precision floating-point data +elements in Streaming SIMD Extensions 2 to sixteen byte operations in +a 128-bit register in Streaming SIMD Extensions 2. Thus, vector-length +ranges from 2 to 16, depending on the instruction extensions used and +on the data type. +
In summary, our instructions operate on 128 bit (16 byte) operands.
++ +
+ +
+ + +
We start with the simple scalar operations given by
+ + +for (i = 0; i < n; i++){
+ a[i] = b[i] + c[i];
+}
+
If the code is not vectorized and we have a 128-bit register to store a 32-bit floating point number, it means that we have \( 3\times 32 \) bits that are not used.
+ +We have thus unused space in our SIMD registers. These registers could hold three additional integers.
+ ++ +
+ +
+ + +
The code
+ + +for (i = 0; i < n; i++){
+ a[i] = b[i] + c[i];
+}
+
+has for \( n \) repeats
++ +
+ +
+ + +
If we vectorize the code, we can perform, with a 128-bit register, four simultaneous operations; that is, we have
+ + +for (i = 0; i < n; i+=4){
+ a[i] = b[i] + c[i];
+ a[i+1] = b[i+1] + c[i+1];
+ a[i+2] = b[i+2] + c[i+2];
+ a[i+3] = b[i+3] + c[i+3];
+}
+
+Four additions are now done in a single step.
+ ++ +
+ +
+ + +
For \( n/4 \) repeats assuming floats or integers
++ +
+ +
+ + +
We implement these operations in a simple C++ program that at the end computes the norm of a vector.
+ + + +#include <cstdlib>
+#include <iostream>
+#include <cmath>
+#include <iomanip>
+#include "time.h"
+
+using namespace std; // note use of namespace
+int main (int argc, char* argv[])
+{
+ // read in dimension of square matrix
+ int n = atoi(argv[1]);
+ double s = 1.0/sqrt( (double) n);
+ double *a, *b, *c;
+ // Start timing
+ clock_t start, finish;
+ start = clock();
+// Allocate space for the vectors to be used
+ a = new double [n]; b = new double [n]; c = new double [n];
+ // Define parallel region
+ // Set up values for vectors a and b
+ for (int i = 0; i < n; i++){
+ double angle = 2.0*M_PI*i/ (( double ) n);
+ a[i] = s*(sin(angle) + cos(angle));
+ b[i] = s*sin(2.0*angle);
+ c[i] = 0.0;
+ }
+ // Then perform the vector addition
+ for (int i = 0; i < n; i++){
+ c[i] += a[i]+b[i];
+ }
+ // Compute now the norm-2
+ double Norm2 = 0.0;
+ for (int i = 0; i < n; i++){
+ Norm2 += c[i]*c[i];
+ }
+ finish = clock();
+ double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
+ cout << setiosflags(ios::showpoint | ios::uppercase);
+ cout << setprecision(10) << setw(20) << "Time used for norm computation=" << timeused << endl;
+ cout << " Norm-2 = " << Norm2 << endl;
+ // Free up space
+ delete[] a;
+ delete[] b;
+ delete[] c;
+ return 0;
+}
+
++ +
+ +
+ + +
We can compile and link without vectorization using the clang c++ compiler
clang++ -o novec.x vecexample.cpp
+
+and with vectorization (and additional optimizations)
+ + +clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp
+
+The speedup depends on the size of the vectors. In the example here we have run with \( 10^7 \) elements. +The example here was run on an IMac17.1 with OSX El Capitan (10.11.4) as operating system and an Intel i5 3.3 GHz CPU. +
+ + +Compphys:~ hjensen$ ./vec.x 10000000
+Time used for norm computation=0.04720500000
+Compphys:~ hjensen$ ./novec.x 10000000
+Time used for norm computation=0.03311700000
+
This particular C++ compiler speeds up the above loop operations with a factor of 1.5. Performing the same operations for \( 10^9 \) elements results in a smaller speedup, since reading from main memory is required. The non-vectorized code is seemingly faster.
+ + +Compphys:~ hjensen$ ./vec.x 1000000000
+Time used for norm computation=58.41391100
+Compphys:~ hjensen$ ./novec.x 1000000000
+Time used for norm computation=46.51295300
+
+We will discuss these issues further in the next slides.
+ ++ +
+ +
+ + +
We can compile and link without vectorization with clang compiler
clang++ -fno-vectorize -o novec.x vecexample.cpp
+
+and with vectorization
+ + +clang++ -O3 -Rpass=loop-vectorize -o vec.x vecexample.cpp
+
+We can also add vectorization analysis, see for example
+ + +clang++ -O3 -Rpass-analysis=loop-vectorize -o vec.x vecexample.cpp
+
+or figure out if vectorization was missed
+ + +clang++ -O3 -Rpass-missed=loop-vectorize -o vec.x vecexample.cpp
+
++ +
+ +
+ + +
Not all loops can be vectorized, as discussed in Intel's guide to vectorization
An important criterion is that the loop counter \( n \) is known at the entry of the loop.
+ + + for (int j = 0; j < n; j++) {
+ a[j] = cos(j*1.0);
+ }
+
The variable \( n \) does not need to be known at compile time. However, this variable must stay the same for the entire duration of the loop. It implies that an exit statement inside the loop cannot be data dependent.
+ ++ +
+ +
+ + +
An exit statement should in general be avoided. +If the exit statement contains data-dependent conditions, the loop cannot be vectorized. +The following is an example of a non-vectorizable loop +
+ + + for (int j = 0; j < n; j++) {
+ a[j] = cos(j*1.0);
+ if (a[j] < 0 ) break;
+ }
+
+Avoid loop termination conditions and opt for a single entry loop variable \( n \). The lower and upper bounds have to be kept fixed within the loop.
+ ++ +
+ +
+ + +
SIMD instructions perform the same type of operations multiple times. A switch statement thus leads to a non-vectorizable loop since different statements cannot be branched into. The following code can however be vectorized since the if statement is implemented as a masked assignment.
+ + + for (int j = 0; j < n; j++) {
+ double x = cos(j*1.0);
+ if (x > 0 ) {
+ a[j] = x*sin(j*2.0);
+ }
+ else {
+ a[j] = 0.0;
+ }
+ }
+
+These operations can be performed for all data elements but only those elements which the mask evaluates as true are stored. In general, one should avoid branches such as switch, go to, or return statements or if constructs that cannot be treated as masked assignments.
+ ++ +
+ +
+ + +
Only the innermost loop of the following example is vectorized
+ + + for (int i = 0; i < n; i++) {
+ for (int j = 0; j < n; j++) {
+ a[i][j] += b[i][j];
+ }
+ }
+
+The exception is if an original outer loop is transformed into an inner loop as the result of compiler optimizations.
+ ++ +
+ +
+ + +
Calls to programmer defined functions ruin vectorization. However, calls to intrinsic functions like +\( \sin{x} \), \( \cos{x} \), \( \exp{x} \) etc are allowed since they are normally efficiently vectorized. +The following example is fully vectorizable +
+ + + for (int i = 0; i < n; i++) {
+ a[i] = log10(i)*cos(i);
+ }
+
+Similarly, inline functions defined by the programmer, allow for vectorization since the function statements are glued into the actual place where the function is called.
+ ++ +
+ +
+ + +
One has to keep in mind that vectorization changes the order of operations inside a loop. A so-called +read-after-write statement with an explicit flow dependency cannot be vectorized. The following code +
+ + + double b = 15.;
+ for (int i = 1; i < n; i++) {
+ a[i] = a[i-1] + b;
+ }
+
is an example of flow dependency and results in wrong numerical results if vectorized. For a scalar operation, the value \( a[i-1] \) computed during the previous iteration is loaded into the right-hand side and the results are correct. In vector mode however, with a vector length of four, the values \( a[0] \), \( a[1] \), \( a[2] \) and \( a[3] \) from the previous loop will be loaded into the right-hand side and produce wrong results. That is, we have
+ + + a[1] = a[0] + b;
+ a[2] = a[1] + b;
+ a[3] = a[2] + b;
+ a[4] = a[3] + b;
+
and if the first two iterations are executed at the same time by the SIMD instruction, the value of say \( a[1] \) could be used by the second iteration before it has been calculated by the first iteration, thereby leading to wrong results.
+ ++ +
+ +
+ + +
On the other hand, a so-called +write-after-read statement can be vectorized. The following code +
+ + + double b = 15.;
+ for (int i = 1; i < n; i++) {
+ a[i-1] = a[i] + b;
+ }
+
is an example of a dependency that can be vectorized, since no iteration with a higher value of \( i \) can complete before an iteration with a lower value of \( i \). However, such code leads to problems with parallelization.
+ ++ +
+ +
+ + +
For C++ programmers it is also worth keeping in mind that an array notation is preferred to the more compact use of pointers to access array elements. The compiler can often not tell if it is safe to vectorize the code.
When dealing with arrays, you should also avoid memory stride, since this slows down vectorization considerably. When you access an array element, write for example the inner loop to vectorize using unit stride, that is, access successively the next array element in memory, as shown here
+ + + for (int i = 0; i < n; i++) {
+ for (int j = 0; j < n; j++) {
+ a[i][j] += b[i][j];
+ }
+ }
+
++ +
+ +
+ + +
The main memory contains the program data
+Loads and stores to memory can be as important as floating point operations when we measure performance.
+ ++ +
+ +
+ + +
Many of these performance features are not captured in most programming languages.
+ ++ +
+ +
+ + +
How do we measure performance? What is wrong with this code to time a loop?
+ + + clock_t start, finish;
+ start = clock();
+ for (int j = 0; j < i; j++) {
+ a[j] = b[j]+b[j]*c[j];
+ }
+ finish = clock();
+ double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
+
++ +
+ +
+ + +
+ +
+ +
+ + +
What happens when the code is executed? The assumption is that the code is ready to +execute. But +
++ +
+ +
+ + +
+ +
+ +
+ + +
+ +
+ +
+ + +
+ +
+ +
+ + +
+ +
+ +
+ + +
The first step is to multiply the first row by \( a_0/b_0 \) and subtract it from the second row. This is known as the forward substitution step. We obtain then
+$$ + a_i = 0, +$$ + + +$$ + b_i = b_i - \frac{a_{i-1}}{b_{i-1}}c_{i-1}, +$$ + +and
+$$ + f_i = f_i - \frac{a_{i-1}}{b_{i-1}}f_{i-1}. +$$ + +At this point the simplified equation, with only an upper triangular matrix takes the form
+$$ +\left( \begin{array}{ccccc} + b_0 & c_0 & & & \\ + & b_1 & c_1 & & \\ + & & \ddots & & \\ + & & & b_{m-2} & c_{m-2} \\ + & & & & b_{m-1} + \end{array} \right)\left( \begin{array}{c} + x_0 \\ + x_1 \\ + \vdots \\ + x_{m-2} \\ + x_{m-1} + \end{array} \right)=\left( \begin{array}{c} + f_0 \\ + f_1 \\ + \vdots \\ + f_{m-2} \\ + f_{m-1} \\ + \end{array} \right) +$$ ++ +
+ +
+ + +
The next step is the backward substitution step. The last row is multiplied by \( c_{N-3}/b_{N-2} \) and subtracted from the second to last row, thus eliminating \( c_{N-3} \) from the last row. The general backward substitution procedure is
+$$ + c_i = 0, +$$ + +and
+$$ + f_{i-1} = f_{i-1} - \frac{c_{i-1}}{b_i}f_i +$$ + +All that ramains to be computed is the solution, which is the very straight forward process of
+$$ +x_i = \frac{f_i}{b_i} +$$ ++ +
+ +
+ + +
We have in specific case the following operations with the floating operations
+ +// Forward substitution
+// Note that we can simplify by precalculating a[i-1]/b[i-1]
+ for (int i=1; i < n; i++) {
+ b[i] = b[i] - (a[i-1]*c[i-1])/b[i-1];
+ f[i] = g[i] - (a[i-1]*f[i-1])/b[i-1];
+ }
+ x[n-1] = f[n-1] / b[n-1];
+ // Backwards substitution
+ for (int i = n-2; i >= 0; i--) {
+ f[i] = f[i] - c[i]*f[i+1]/b[i+1];
+ x[i] = f[i]/b[i];
+ }
+
++ +
+ +
+ + +
#include <cstdlib>
+#include <iostream>
+#include <cmath>
+#include <iomanip>
+#include "time.h"
+
+using namespace std; // note use of namespace
+int main (int argc, char* argv[])
+{
+ // read in dimension of square matrix
+ int n = atoi(argv[1]);
+ double **A, **B;
+ // Allocate space for the two matrices
+ A = new double*[n]; B = new double*[n];
+ for (int i = 0; i < n; i++){
+ A[i] = new double[n];
+ B[i] = new double[n];
+ }
+ // Set up values for matrix A
+ for (int i = 0; i < n; i++){
+ for (int j = 0; j < n; j++) {
+ A[i][j] = cos(i*1.0)*sin(j*3.0);
+ }
+ }
+ clock_t start, finish;
+ start = clock();
+ // Then compute the transpose
+ for (int i = 0; i < n; i++){
+ for (int j = 0; j < n; j++) {
+ B[i][j]= A[j][i];
+ }
+ }
+
+ finish = clock();
+ double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
+ cout << setiosflags(ios::showpoint | ios::uppercase);
+ cout << setprecision(10) << setw(20) << "Time used for setting up transpose of matrix=" << timeused << endl;
+
+ // Free up space
+ for (int i = 0; i < n; i++){
+ delete[] A[i];
+ delete[] B[i];
+ }
+ delete[] A;
+ delete[] B;
+ return 0;
+}
+
++ +
+ +
+ + +
This the matrix-matrix multiplication code with plain c++ memory allocation. It computes at the end the Frobenius norm.
+ + + +#include <cstdlib>
+#include <iostream>
+#include <cmath>
+#include <iomanip>
+#include "time.h"
+
+using namespace std; // note use of namespace
+int main (int argc, char* argv[])
+{
+ // read in dimension of square matrix
+ int n = atoi(argv[1]);
+ double s = 1.0/sqrt( (double) n);
+ double **A, **B, **C;
+ // Start timing
+ clock_t start, finish;
+ start = clock();
+ // Allocate space for the two matrices
+ A = new double*[n]; B = new double*[n]; C = new double*[n];
+ for (int i = 0; i < n; i++){
+ A[i] = new double[n];
+ B[i] = new double[n];
+ C[i] = new double[n];
+ }
+ // Set up values for matrix A and B and zero matrix C
+ for (int i = 0; i < n; i++){
+ for (int j = 0; j < n; j++) {
+ double angle = 2.0*M_PI*i*j/ (( double ) n);
+ A[i][j] = s * ( sin ( angle ) + cos ( angle ) );
+ B[j][i] = A[i][j];
+ }
+ }
+ // Then perform the matrix-matrix multiplication
+ for (int i = 0; i < n; i++){
+ for (int j = 0; j < n; j++) {
+ double sum = 0.0;
+ for (int k = 0; k < n; k++) {
+ sum += B[i][k]*A[k][j];
+ }
+ C[i][j] = sum;
+ }
+ }
+ // Compute now the Frobenius norm
+ double Fsum = 0.0;
+ for (int i = 0; i < n; i++){
+ for (int j = 0; j < n; j++) {
+ Fsum += C[i][j]*C[i][j];
+ }
+ }
+ Fsum = sqrt(Fsum);
+ finish = clock();
+ double timeused = (double) (finish - start)/(CLOCKS_PER_SEC );
+ cout << setiosflags(ios::showpoint | ios::uppercase);
+ cout << setprecision(10) << setw(20) << "Time used for matrix-matrix multiplication=" << timeused << endl;
+ cout << " Frobenius norm = " << Fsum << endl;
+ // Free up space
+ for (int i = 0; i < n; i++){
+ delete[] A[i];
+ delete[] B[i];
+ delete[] C[i];
+ }
+ delete[] A;
+ delete[] B;
+ delete[] C;
+ return 0;
+}
+
++ +
+ +
+ + +
+ +
+ +
+ + +
The key is choosing the correct baseline for comparison
++ +
+ +
+ + +
For parallel applications, speedup is typically defined as the ratio \( T_1/T_p \).
+Here \( T_1 \) is the time on one processor and \( T_p \) is the time using \( p \) processors.
++ +
+ +
+ + +
The speedup on \( p \) processors can be greater than \( p \) if memory usage is optimal! Consider the case of a memory-bound computation with \( M \) words of memory
++ +
+ +
+ + +
Assume that almost all parts of a code are perfectly +parallelizable (fraction \( f \)). The remainder, +fraction \( (1-f) \) cannot be parallelized at all. +
+ +That is, there is work that takes time \( W \) on one process; a fraction \( f \) of that work will take +time \( Wf/p \) on \( p \) processors. +
++ +
+ +
+ + +
On one processor we have
+$$ +T_1 = (1-f)W + fW = W +$$ + +On \( p \) processors we have
+$$ +T_p = (1-f)W + \frac{fW}{p}, +$$ + +resulting in a speedup of
+$$ +\frac{T_1}{T_p} = \frac{W}{(1-f)W+fW/p} +$$ + +As \( p \) goes to infinity, \( fW/p \) goes to zero, and the maximum speedup is
$$
\frac{1}{1-f},
$$

meaning that if \( f = 0.99 \) (all but \( 1\% \) parallelizable), the maximum speedup is \( 1/(1-.99)=100 \)!
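As a quick numerical check (a sketch of ours, not part of the original notes), Amdahl's law can be evaluated directly for a few processor counts.

# A minimal sketch evaluating Amdahl's law S(p) = 1/((1-f) + f/p)
def amdahl_speedup(f, p):
    return 1.0/((1.0 - f) + f/p)

f = 0.99                          # fraction of perfectly parallelizable work
for p in (2, 8, 64, 1024, 10**6):
    print(f"p = {p:7d}   speedup = {amdahl_speedup(f, p):8.2f}")
print(f"maximum speedup = {1.0/(1.0 - f):.0f}")   # the p -> infinity limit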
++ +
+ +
+ + +
If any non-parallel code slips into the +application, the parallel +performance is limited. +
+ +In many simulations, however, the fraction of non-parallelizable work +is \( 10^{-6} \) or less due to large arrays or objects that are perfectly parallelizable. +
++ +
+ +
+ + +
Our lectures will focus on both MPI and OpenMP.
++ +
+ +
+ + +
Due to the above overhead, and since certain parts of a sequential algorithm cannot be parallelized, we may not achieve an optimal parallelization.
++ +
+ +
+ + +
+ +
+ +
+ + +
+ +
+ +
+ + +
Installing MPI is rather easy on hardware running Unix/Linux as operating system; simply follow the instructions from the OpenMPI website. See also subsequent slides. When you have made sure you have installed MPI on your PC/laptop,
+ # Compile and link
+ mpic++ -O3 -o nameofprog.x nameofprog.cpp
+ # run code with for example 8 processes using mpirun/mpiexec
+ mpiexec -n 8 ./nameofprog.x
+
++ +
+ +
+ + +
If you wish to install MPI and OpenMP +on your laptop/PC, we recommend the following: +
+ + brew install libomp
+
+and compile and link as
+ + +c++ -o <name executable> <name program.cpp> -lomp
+
++ +
+ +
+ + +
For linux/ubuntu users, you need to install two packages (alternatively use the synaptic package manager)
+ + + sudo apt-get install libopenmpi-dev
+ sudo apt-get install openmpi-bin
+
+For OS X users, install brew (after having installed xcode and gcc, needed for the +gfortran compiler of openmpi) and then install with brew +
+ + + brew install openmpi
+
+When running an executable (code.x), run as
+ + + mpirun -n 10 ./code.x
+
+where we indicate that we want the number of processes to be 10.
++ +
+ +
+ + +
With openmpi installed, when using Qt, add to your .pro file the instructions here
+ +You may need to tell Qt where openmpi is stored.
++ +
+ +
+ + +
MPI is a library, not a language. It specifies the names, calling sequences and results of functions +or subroutines to be called from C/C++ or Fortran programs, and the classes and methods that make up the MPI C++ +library. The programs that users write in Fortran, C or C++ are compiled with ordinary compilers and linked +with the MPI library. +
MPI programs should be able to run on all possible machines and run under all MPI implementations without change.
+ +An MPI computation is a collection of processes communicating with messages.
++ +
+ +
+ + +
Task parallelism: the work of a global problem can be divided +into a number of independent tasks, which rarely need to synchronize. +Monte Carlo simulations or numerical integration are examples of this. +
+ +MPI is a message-passing library where all the routines +have corresponding C/C++-binding +
+ + + MPI_Command_name
+
+and Fortran-binding (routine names are in uppercase, but can also be in lower case)
+ + + MPI_COMMAND_NAME
+
++ +
+ +
+ + +
MPI is a library specification for the message passing interface, +proposed as a standard. +
+ +A message passing standard for portability and ease-of-use. +Designed for high performance. +
+ +Insert communication and synchronization functions where necessary.
++ +
+ +
+ + +
MPI is a message-passing library where all the routines +have corresponding C/C++-binding +
+ + + MPI_Command_name
+
+and Fortran-binding (routine names are in uppercase, but can also be in lower case)
+ + + MPI_COMMAND_NAME
+
+The discussion in these slides focuses on the C++ binding.
++ +
+ +
+ + +
MPI_COMM_WORLD
+
++ +
+ +
+ + +
+ +
+ +
+ + +
Let every process write "Hello world" (oh not this program again!!) on the standard output.
+ + +using namespace std;
+#include <mpi.h>
+#include <iostream>
+int main (int nargs, char* args[])
+{
+int numprocs, my_rank;
+// MPI initializations
+MPI_Init (&nargs, &args);
+MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
+MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
+cout << "Hello world, I have rank " << my_rank << " out of "
+ << numprocs << endl;
+// End MPI
+MPI_Finalize ();
+return 0;
+}
+
++ +
+ +
+ + +
PROGRAM hello
+INCLUDE "mpif.h"
+INTEGER:: size, my_rank, ierr
+
+CALL MPI_INIT(ierr)
+CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
+CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
+WRITE(*,*)"Hello world, I've rank ",my_rank," out of ",size
+CALL MPI_FINALIZE(ierr)
+
+END PROGRAM hello
+
++ +
+ +
+ + +
+ +
+ +
+ + +
int main (int nargs, char* args[])
+{
+ int numprocs, my_rank, i;
+ MPI_Init (&nargs, &args);
+ MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
+ MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
+ for (i = 0; i < numprocs; i++) {
+ MPI_Barrier (MPI_COMM_WORLD);
+ if (i == my_rank) {
+ cout << "Hello world, I have rank " << my_rank <<
+ " out of " << numprocs << endl;
+ }
+ }
+ MPI_Finalize ();
+ return 0;
+}
+
++ +
+ +
+ + +
However, this is slightly more time-consuming since the processes synchronize between themselves as many times as there +are processes. In the next Hello world example we use the send and receive functions in order to a have a synchronized +action. +
++ +
+ +
+ + +
.....
+int numprocs, my_rank, flag;
+MPI_Status status;
+MPI_Init (&nargs, &args);
+MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
+MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
+if (my_rank > 0)
+MPI_Recv (&flag, 1, MPI_INT, my_rank-1, 100,
+ MPI_COMM_WORLD, &status);
+cout << "Hello world, I have rank " << my_rank << " out of "
+<< numprocs << endl;
+if (my_rank < numprocs-1)
+MPI_Send (&my_rank, 1, MPI_INT, my_rank+1,
+ 100, MPI_COMM_WORLD);
+MPI_Finalize ();
+
++ +
+ +
+ + +
The basic sending of messages is given by the function \( MPI\_SEND \), which in C/C++ +is defined as +
+ + +int MPI_Send(void *buf, int count,
+ MPI_Datatype datatype,
+ int dest, int tag, MPI_Comm comm)}
+
+This single command allows the passing of any kind of variable, even a large array, to any group of tasks. +The variable buf is the variable we wish to send while count +is the number of variables we are passing. If we are passing only a single value, this should be 1. +
+ +If we transfer an array, it is the overall size of the array. +For example, if we want to send a 10 by 10 array, count would be \( 10\times 10=100 \) +since we are actually passing 100 values. +
++ +
+ +
+ + +
Once you have sent a message, you must receive it on another task. The function \( MPI\_RECV \) +is similar to the send call. +
+ + +int MPI_Recv( void *buf, int count, MPI_Datatype datatype,
+ int source,
+ int tag, MPI_Comm comm, MPI_Status *status )
+
+The arguments that are different from those in MPI\_SEND are +buf which is the name of the variable where you will be storing the received data, +source which replaces the destination in the send command. This is the return ID of the sender. +
Finally, we have the variable \( status \), of type \( MPI\_Status \), where one can check if the receive was completed.
+ +The output of this code is the same as the previous example, but now +process 0 sends a message to process 1, which forwards it further +to process 2, and so forth. +
++ +
+ +
+ + +
Click on this link for the full program.
++ +
+ +
+ + +
// Trapezoidal rule and numerical integration using MPI
+using namespace std;
+#include <mpi.h>
+#include <iostream>
+
+// Here we define various functions called by the main program
+
+double int_function(double );
+double trapezoidal_rule(double , double , int , double (*)(double));
+
+// Main function begins here
+int main (int nargs, char* args[])
+{
+ int n, local_n, numprocs, my_rank;
+ double a, b, h, local_a, local_b, total_sum, local_sum;
+ double time_start, time_end, total_time;
+
++ +
+ +
+ + +
// MPI initializations
+ MPI_Init (&nargs, &args);
+ MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
+ MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
+ time_start = MPI_Wtime();
+ // Fixed values for a, b and n
+ a = 0.0 ; b = 1.0; n = 1000;
+ h = (b-a)/n; // h is the same for all processes
+ local_n = n/numprocs;
+ // make sure n > numprocs, else integer division gives zero
+ // Length of each process' interval of
+ // integration = local_n*h.
+ local_a = a + my_rank*local_n*h;
+ local_b = local_a + local_n*h;
+
++ +
+ +
+ + +
total_sum = 0.0;
+ local_sum = trapezoidal_rule(local_a, local_b, local_n,
+ &int_function);
+ MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE,
+ MPI_SUM, 0, MPI_COMM_WORLD);
+ time_end = MPI_Wtime();
+ total_time = time_end-time_start;
+ if ( my_rank == 0) {
+ cout << "Trapezoidal rule = " << total_sum << endl;
+ cout << "Time = " << total_time
+ << " on number of processors: " << numprocs << endl;
+ }
+ // End MPI
+ MPI_Finalize ();
+ return 0;
+} // end of main program
+
++ +
+ +
+ + +
Here we have used
+MPI_Reduce( void *senddata, void* resultdata, int count,
+ MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
+
+The two variables \( senddata \) and \( resultdata \) are obvious, besides the fact that one sends the address +of the variable or the first element of an array. If they are arrays they need to have the same size. +The variable \( count \) represents the total dimensionality, 1 in case of just one variable, +while \( MPI\_Datatype \) +defines the type of variable which is sent and received. +
+ +The new feature is \( MPI\_Op \). It defines the type +of operation we want to do. +
++ +
+ +
+ + +
In our case, since we are summing the rectangle contributions from every process we define \( MPI\_Op = MPI\_SUM \). If we have an array or matrix we can search for the largest or smallest element by sending either \( MPI\_MAX \) or \( MPI\_MIN \). If we want the location as well (which array element) we simply transfer \( MPI\_MAXLOC \) or \( MPI\_MINLOC \). If we want the product we write \( MPI\_PROD \).
+ +\( MPI\_Allreduce \) is defined as
+ + +MPI_Allreduce( void *senddata, void* resultdata, int count,
+ MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
+
++ +
+ +
+ + +
We use \( MPI\_Reduce \) to collect data from each process. Note also the use of the function \( MPI\_Wtime \).
+ + +// this function defines the function to integrate
+double int_function(double x)
+{
+ double value = 4./(1.+x*x);
+ return value;
+} // end of function to evaluate
+
++ +
+ +
+ + +
// this function defines the trapezoidal rule
+double trapezoidal_rule(double a, double b, int n,
+ double (*func)(double))
+{
+ double trapez_sum;
+ double fa, fb, x, step;
+ int j;
+ step=(b-a)/((double) n);
+ fa=(*func)(a)/2. ;
+ fb=(*func)(b)/2. ;
+ trapez_sum=0.;
+ for (j=1; j <= n-1; j++){
+ x=j*step+a;
+ trapez_sum+=(*func)(x);
+ }
+ trapez_sum=(trapez_sum+fb+fa)*step;
+ return trapez_sum;
+} // end trapezoidal_rule
+
++ +
+ +
+ + +
// Variational Monte Carlo for atoms with importance sampling, slater det
+// Test case for 2-electron quantum dot, no classes using Mersenne-Twister RNG
+#include "mpi.h"
+#include <cmath>
+#include <random>
+#include <string>
+#include <iostream>
+#include <fstream>
+#include <iomanip>
+#include "vectormatrixclass.h"
+
+using namespace std;
+// output file as global variable
+ofstream ofile;
+// the step length and its squared inverse for the second derivative
+// Here we define global variables used in various functions
+// These can be changed by using classes
+int Dimension = 2;
+int NumberParticles = 2; // we fix also the number of electrons to be 2
+
+// declaration of functions
+
+// The Mc sampling for the variational Monte Carlo
+void MonteCarloSampling(int, double &, double &, Vector &);
+
+// The variational wave function
+double WaveFunction(Matrix &, Vector &);
+
+// The local energy
+double LocalEnergy(Matrix &, Vector &);
+
+// The quantum force
+void QuantumForce(Matrix &, Matrix &, Vector &);
+
+
+// inline function for single-particle wave function
+inline double SPwavefunction(double r, double alpha) {
+ return exp(-alpha*r*0.5);
+}
+
+// inline function for derivative of single-particle wave function
+inline double DerivativeSPwavefunction(double r, double alpha) {
+ return -r*alpha;
+}
+
+// function for absolute value of relative distance
+double RelativeDistance(Matrix &r, int i, int j) {
+ double r_ij = 0;
+ for (int k = 0; k < Dimension; k++) {
+ r_ij += (r(i,k)-r(j,k))*(r(i,k)-r(j,k));
+ }
+ return sqrt(r_ij);
+}
+
+// inline function for derivative of Jastrow factor
+inline double JastrowDerivative(Matrix &r, double beta, int i, int j, int k){
+ return (r(i,k)-r(j,k))/(RelativeDistance(r, i, j)*pow(1.0+beta*RelativeDistance(r, i, j),2));
+}
+
+// function for square of position of single particle
+double singleparticle_pos2(Matrix &r, int i) {
+ double r_single_particle = 0;
+ for (int j = 0; j < Dimension; j++) {
+ r_single_particle += r(i,j)*r(i,j);
+ }
+ return r_single_particle;
+}
+
+void lnsrch(int n, Vector &xold, double fold, Vector &g, Vector &p, Vector &x,
+ double *f, double stpmax, int *check, double (*func)(Vector &p));
+
+void dfpmin(Vector &p, int n, double gtol, int *iter, double *fret,
+ double(*func)(Vector &p), void (*dfunc)(Vector &p, Vector &g));
+
+static double sqrarg;
+#define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 : sqrarg*sqrarg)
+
+
+static double maxarg1,maxarg2;
+#define FMAX(a,b) (maxarg1=(a),maxarg2=(b),(maxarg1) > (maxarg2) ?\
+ (maxarg1) : (maxarg2))
+
+
+// Begin of main program
+
+int main(int argc, char* argv[])
+{
+
+ // MPI initializations
+ int NumberProcesses, MyRank, NumberMCsamples;
+ MPI_Init (&argc, &argv);
+ MPI_Comm_size (MPI_COMM_WORLD, &NumberProcesses);
+ MPI_Comm_rank (MPI_COMM_WORLD, &MyRank);
+ double StartTime = MPI_Wtime();
+ if (MyRank == 0 && argc <= 1) {
+ cout << "Bad Usage: " << argv[0] <<
+ " Read also output file on same line and number of Monte Carlo cycles" << endl;
+ }
+ // Read filename and number of Monte Carlo cycles from the command line
+ if (MyRank == 0 && argc > 2) {
+ string filename = argv[1]; // first command line argument after name of program
+ NumberMCsamples = atoi(argv[2]);
+ string fileout = filename;
+ string argument = to_string(NumberMCsamples);
+ // Final filename as filename+NumberMCsamples
+ fileout.append(argument);
+ ofile.open(fileout);
+ }
+ // broadcast the number of Monte Carlo samples
+ MPI_Bcast (&NumberMCsamples, 1, MPI_INT, 0, MPI_COMM_WORLD);
+ // Two variational parameters only
+ Vector VariationalParameters(2);
+ int TotalNumberMCsamples = NumberMCsamples*NumberProcesses;
+ // Loop over variational parameters
+ for (double alpha = 0.5; alpha <= 1.5; alpha +=0.1){
+ for (double beta = 0.1; beta <= 0.5; beta +=0.05){
+ VariationalParameters(0) = alpha; // value of alpha
+ VariationalParameters(1) = beta; // value of beta
+ // Do the mc sampling and accumulate data with MPI_Reduce
+ double TotalEnergy, TotalEnergySquared, LocalProcessEnergy, LocalProcessEnergy2;
+ LocalProcessEnergy = LocalProcessEnergy2 = 0.0;
+ MonteCarloSampling(NumberMCsamples, LocalProcessEnergy, LocalProcessEnergy2, VariationalParameters);
+ // Collect data in total averages
+ MPI_Reduce(&LocalProcessEnergy, &TotalEnergy, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
+ MPI_Reduce(&LocalProcessEnergy2, &TotalEnergySquared, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
+ // Print out results in case of Master node, set to MyRank = 0
+ if ( MyRank == 0) {
+ double Energy = TotalEnergy/( (double)NumberProcesses);
+ double Variance = TotalEnergySquared/( (double)NumberProcesses)-Energy*Energy;
+ double StandardDeviation = sqrt(Variance/((double)TotalNumberMCsamples)); // over optimistic error
+ ofile << setiosflags(ios::showpoint | ios::uppercase);
+ ofile << setw(15) << setprecision(8) << VariationalParameters(0);
+ ofile << setw(15) << setprecision(8) << VariationalParameters(1);
+ ofile << setw(15) << setprecision(8) << Energy;
+ ofile << setw(15) << setprecision(8) << Variance;
+ ofile << setw(15) << setprecision(8) << StandardDeviation << endl;
+ }
+ }
+ }
+ double EndTime = MPI_Wtime();
+ double TotalTime = EndTime-StartTime;
+ if ( MyRank == 0 ) cout << "Time = " << TotalTime << " on number of processors: " << NumberProcesses << endl;
+ if (MyRank == 0) ofile.close(); // close output file
+ // End MPI
+ MPI_Finalize ();
+ return 0;
+} // end of main function
+
+
+// Monte Carlo sampling with the Metropolis algorithm
+
+void MonteCarloSampling(int NumberMCsamples, double &cumulative_e, double &cumulative_e2, Vector &VariationalParameters)
+{
+
+  // Initialize the seed for the Mersenne Twister generator
+ std::random_device rd;
+ std::mt19937_64 gen(rd());
+  // Set up the uniform distribution for x in [0, 1]
+ std::uniform_real_distribution<double> UniformNumberGenerator(0.0,1.0);
+ std::normal_distribution<double> Normaldistribution(0.0,1.0);
+ // diffusion constant from Schroedinger equation
+ double D = 0.5;
+ double timestep = 0.05; // we fix the time step for the gaussian deviate
+ // allocate matrices which contain the position of the particles
+ Matrix OldPosition( NumberParticles, Dimension), NewPosition( NumberParticles, Dimension);
+ Matrix OldQuantumForce(NumberParticles, Dimension), NewQuantumForce(NumberParticles, Dimension);
+ double Energy = 0.0; double EnergySquared = 0.0; double DeltaE = 0.0;
+ // initial trial positions
+ for (int i = 0; i < NumberParticles; i++) {
+ for (int j = 0; j < Dimension; j++) {
+ OldPosition(i,j) = Normaldistribution(gen)*sqrt(timestep);
+ }
+ }
+ double OldWaveFunction = WaveFunction(OldPosition, VariationalParameters);
+ QuantumForce(OldPosition, OldQuantumForce, VariationalParameters);
+ // loop over monte carlo cycles
+ for (int cycles = 1; cycles <= NumberMCsamples; cycles++){
+ // new position
+ for (int i = 0; i < NumberParticles; i++) {
+ for (int j = 0; j < Dimension; j++) {
+ // gaussian deviate to compute new positions using a given timestep
+ NewPosition(i,j) = OldPosition(i,j) + Normaldistribution(gen)*sqrt(timestep)+OldQuantumForce(i,j)*timestep*D;
+ }
+ // for the other particles we need to set the position to the old position since
+ // we move only one particle at the time
+ for (int k = 0; k < NumberParticles; k++) {
+ if ( k != i) {
+ for (int j = 0; j < Dimension; j++) {
+ NewPosition(k,j) = OldPosition(k,j);
+ }
+ }
+ }
+ double NewWaveFunction = WaveFunction(NewPosition, VariationalParameters);
+ QuantumForce(NewPosition, NewQuantumForce, VariationalParameters);
+ // we compute the log of the ratio of the greens functions to be used in the
+ // Metropolis-Hastings algorithm
+ double GreensFunction = 0.0;
+ for (int j = 0; j < Dimension; j++) {
+ GreensFunction += 0.5*(OldQuantumForce(i,j)+NewQuantumForce(i,j))*
+ (D*timestep*0.5*(OldQuantumForce(i,j)-NewQuantumForce(i,j))-NewPosition(i,j)+OldPosition(i,j));
+ }
+ GreensFunction = exp(GreensFunction);
+ // The Metropolis test is performed by moving one particle at the time
+ if(UniformNumberGenerator(gen) <= GreensFunction*NewWaveFunction*NewWaveFunction/OldWaveFunction/OldWaveFunction ) {
+ for (int j = 0; j < Dimension; j++) {
+ OldPosition(i,j) = NewPosition(i,j);
+ OldQuantumForce(i,j) = NewQuantumForce(i,j);
+ }
+ OldWaveFunction = NewWaveFunction;
+ }
+ } // end of loop over particles
+ // compute local energy
+    DeltaE = LocalEnergy(OldPosition, VariationalParameters);
+ // update energies
+ Energy += DeltaE;
+ EnergySquared += DeltaE*DeltaE;
+ } // end of loop over MC trials
+ // update the energy average and its squared
+ cumulative_e = Energy/NumberMCsamples;
+ cumulative_e2 = EnergySquared/NumberMCsamples;
+} // end MonteCarloSampling function
+
+
+// Function to compute the squared wave function and the quantum force
+
+double WaveFunction(Matrix &r, Vector &VariationalParameters)
+{
+ double wf = 0.0;
+ // full Slater determinant for two particles, replace with Slater det for more particles
+ wf = SPwavefunction(singleparticle_pos2(r, 0), VariationalParameters(0))*SPwavefunction(singleparticle_pos2(r, 1),VariationalParameters(0));
+ // contribution from Jastrow factor
+ for (int i = 0; i < NumberParticles-1; i++) {
+ for (int j = i+1; j < NumberParticles; j++) {
+ wf *= exp(RelativeDistance(r, i, j)/((1.0+VariationalParameters(1)*RelativeDistance(r, i, j))));
+ }
+ }
+ return wf;
+}
+
+// Function to calculate the local energy without numerical derivation of kinetic energy
+
+double LocalEnergy(Matrix &r, Vector &VariationalParameters)
+{
+
+ // compute the kinetic and potential energy from the single-particle part
+ // for a many-electron system this has to be replaced by a Slater determinant
+ // The absolute value of the interparticle length
+ Matrix length( NumberParticles, NumberParticles);
+ // Set up interparticle distance
+ for (int i = 0; i < NumberParticles-1; i++) {
+ for(int j = i+1; j < NumberParticles; j++){
+ length(i,j) = RelativeDistance(r, i, j);
+ length(j,i) = length(i,j);
+ }
+ }
+ double KineticEnergy = 0.0;
+ // Set up kinetic energy from Slater and Jastrow terms
+ for (int i = 0; i < NumberParticles; i++) {
+ for (int k = 0; k < Dimension; k++) {
+ double sum1 = 0.0;
+ for(int j = 0; j < NumberParticles; j++){
+ if ( j != i) {
+ sum1 += JastrowDerivative(r, VariationalParameters(1), i, j, k);
+ }
+ }
+ KineticEnergy += (sum1+DerivativeSPwavefunction(r(i,k),VariationalParameters(0)))*(sum1+DerivativeSPwavefunction(r(i,k),VariationalParameters(0)));
+ }
+ }
+ KineticEnergy += -2*VariationalParameters(0)*NumberParticles;
+ for (int i = 0; i < NumberParticles-1; i++) {
+ for (int j = i+1; j < NumberParticles; j++) {
+ KineticEnergy += 2.0/(pow(1.0 + VariationalParameters(1)*length(i,j),2))*(1.0/length(i,j)-2*VariationalParameters(1)/(1+VariationalParameters(1)*length(i,j)) );
+ }
+ }
+ KineticEnergy *= -0.5;
+ // Set up potential energy, external potential + eventual electron-electron repulsion
+ double PotentialEnergy = 0;
+ for (int i = 0; i < NumberParticles; i++) {
+ double DistanceSquared = singleparticle_pos2(r, i);
+ PotentialEnergy += 0.5*DistanceSquared; // sp energy HO part, note it has the oscillator frequency set to 1!
+ }
+ // Add the electron-electron repulsion
+ for (int i = 0; i < NumberParticles-1; i++) {
+ for (int j = i+1; j < NumberParticles; j++) {
+ PotentialEnergy += 1.0/length(i,j);
+ }
+ }
+ double LocalE = KineticEnergy+PotentialEnergy;
+ return LocalE;
+}
+
+// Compute the analytical expression for the quantum force
+void QuantumForce(Matrix &r, Matrix &qforce, Vector &VariationalParameters)
+{
+ // compute the first derivative
+ for (int i = 0; i < NumberParticles; i++) {
+ for (int k = 0; k < Dimension; k++) {
+ // single-particle part, replace with Slater det for larger systems
+ double sppart = DerivativeSPwavefunction(r(i,k),VariationalParameters(0));
+ // Jastrow factor contribution
+ double Jsum = 0.0;
+ for (int j = 0; j < NumberParticles; j++) {
+ if ( j != i) {
+ Jsum += JastrowDerivative(r, VariationalParameters(1), i, j, k);
+ }
+ }
+ qforce(i,k) = 2.0*(Jsum+sppart);
+ }
+ }
+} // end of QuantumForce function
+
+
+#define ITMAX 200
+#define EPS 3.0e-8
+#define TOLX (4*EPS)
+#define STPMX 100.0
+
+void dfpmin(Vector &p, int n, double gtol, int *iter, double *fret,
+ double(*func)(Vector &p), void (*dfunc)(Vector &p, Vector &g))
+{
+
+ int check,i,its,j;
+ double den,fac,fad,fae,fp,stpmax,sum=0.0,sumdg,sumxi,temp,test;
+ Vector dg(n), g(n), hdg(n), pnew(n), xi(n);
+ Matrix hessian(n,n);
+
+ fp=(*func)(p);
+ (*dfunc)(p,g);
+ for (i = 0;i < n;i++) {
+ for (j = 0; j< n;j++) hessian(i,j)=0.0;
+ hessian(i,i)=1.0;
+ xi(i) = -g(i);
+ sum += p(i)*p(i);
+ }
+ stpmax=STPMX*FMAX(sqrt(sum),(double)n);
+ for (its=1;its<=ITMAX;its++) {
+ *iter=its;
+ lnsrch(n,p,fp,g,xi,pnew,fret,stpmax,&check,func);
+ fp = *fret;
+ for (i = 0; i< n;i++) {
+ xi(i)=pnew(i)-p(i);
+ p(i)=pnew(i);
+ }
+ test=0.0;
+ for (i = 0;i< n;i++) {
+ temp=fabs(xi(i))/FMAX(fabs(p(i)),1.0);
+ if (temp > test) test=temp;
+ }
+ if (test < TOLX) {
+ return;
+ }
+ for (i=0;i<n;i++) dg(i)=g(i);
+ (*dfunc)(p,g);
+ test=0.0;
+ den=FMAX(*fret,1.0);
+ for (i=0;i<n;i++) {
+ temp=fabs(g(i))*FMAX(fabs(p(i)),1.0)/den;
+ if (temp > test) test=temp;
+ }
+ if (test < gtol) {
+ return;
+ }
+ for (i=0;i<n;i++) dg(i)=g(i)-dg(i);
+ for (i=0;i<n;i++) {
+ hdg(i)=0.0;
+ for (j=0;j<n;j++) hdg(i) += hessian(i,j)*dg(j);
+ }
+ fac=fae=sumdg=sumxi=0.0;
+ for (i=0;i<n;i++) {
+ fac += dg(i)*xi(i);
+ fae += dg(i)*hdg(i);
+ sumdg += SQR(dg(i));
+ sumxi += SQR(xi(i));
+ }
+ if (fac*fac > EPS*sumdg*sumxi) {
+ fac=1.0/fac;
+ fad=1.0/fae;
+ for (i=0;i<n;i++) dg(i)=fac*xi(i)-fad*hdg(i);
+ for (i=0;i<n;i++) {
+ for (j=0;j<n;j++) {
+ hessian(i,j) += fac*xi(i)*xi(j)
+ -fad*hdg(i)*hdg(j)+fae*dg(i)*dg(j);
+ }
+ }
+ }
+ for (i=0;i<n;i++) {
+ xi(i)=0.0;
+ for (j=0;j<n;j++) xi(i) -= hessian(i,j)*g(j);
+ }
+ }
+ cout << "too many iterations in dfpmin" << endl;
+}
+#undef ITMAX
+#undef EPS
+#undef TOLX
+#undef STPMX
+
+#define ALF 1.0e-4
+#define TOLX 1.0e-7
+
+void lnsrch(int n, Vector &xold, double fold, Vector &g, Vector &p, Vector &x,
+ double *f, double stpmax, int *check, double (*func)(Vector &p))
+{
+ int i;
+ double a,alam,alam2,alamin,b,disc,f2,fold2,rhs1,rhs2,slope,sum,temp,
+ test,tmplam;
+
+ *check=0;
+ for (sum=0.0,i=0;i<n;i++) sum += p(i)*p(i);
+ sum=sqrt(sum);
+ if (sum > stpmax)
+ for (i=0;i<n;i++) p(i) *= stpmax/sum;
+ for (slope=0.0,i=0;i<n;i++)
+ slope += g(i)*p(i);
+ test=0.0;
+ for (i=0;i<n;i++) {
+ temp=fabs(p(i))/FMAX(fabs(xold(i)),1.0);
+ if (temp > test) test=temp;
+ }
+ alamin=TOLX/test;
+ alam=1.0;
+ for (;;) {
+ for (i=0;i<n;i++) x(i)=xold(i)+alam*p(i);
+ *f=(*func)(x);
+ if (alam < alamin) {
+ for (i=0;i<n;i++) x(i)=xold(i);
+ *check=1;
+ return;
+ } else if (*f <= fold+ALF*alam*slope) return;
+ else {
+ if (alam == 1.0)
+ tmplam = -slope/(2.0*(*f-fold-slope));
+ else {
+ rhs1 = *f-fold-alam*slope;
+ rhs2=f2-fold2-alam2*slope;
+ a=(rhs1/(alam*alam)-rhs2/(alam2*alam2))/(alam-alam2);
+ b=(-alam2*rhs1/(alam*alam)+alam*rhs2/(alam2*alam2))/(alam-alam2);
+ if (a == 0.0) tmplam = -slope/(2.0*b);
+ else {
+ disc=b*b-3.0*a*slope;
+ if (disc<0.0) cout << "Roundoff problem in lnsrch." << endl;
+ else tmplam=(-b+sqrt(disc))/(3.0*a);
+ }
+ if (tmplam>0.5*alam)
+ tmplam=0.5*alam;
+ }
+ }
+ alam2=alam;
+ f2 = *f;
+ fold2=fold;
+ alam=FMAX(tmplam,0.1*alam);
+ }
+}
+#undef ALF
+#undef TOLX
+
++ +
+ +
+ + +
Many good tutorials are available online, along with excellent textbooks.
++ +
+ +
+ + +
#include <omp.h>
+
+#pragma omp...
+
++ +
+ +
+ + +
#pragma omp construct [ clause ...]
+
+#include <omp.h>
+
++ +
+ +
+ + +
OpenMP supports several different ways to specify thread parallelism: parallel regions, worksharing constructs such as for, sections and single, and explicit tasks, all of which are discussed below.
+ ++ +
+ +
+ + +
#include <omp.h>
+main ()
+{
+int var1, var2, var3;
+/* serial code */
+/* ... */
+/* start of a parallel region */
+#pragma omp parallel private(var1, var2) shared(var3)
+{
+/* ... */
+}
+/* more serial code */
+/* ... */
+/* another parallel region */
+#pragma omp parallel
+{
+/* ... */
+}
+}
+
++ +
+ +
+ + +
#pragma omp parallel { ... }
+
++ +
+ +
+ + +
#include <omp.h>
+#include <cstdio>
+int main (int argc, char *argv[])
+{
+int th_id, nthreads;
+#pragma omp parallel private(th_id) shared(nthreads)
+{
+th_id = omp_get_thread_num();
+printf("Hello World from thread %d\n", th_id);
+#pragma omp barrier
+if ( th_id == 0 ) {
+nthreads = omp_get_num_threads();
+printf("There are %d threads\n",nthreads);
+}
+}
+return 0;
+}
+
++ +
+ +
+ + +
#include <iostream>
+#include <omp.h>
+using namespace std;
+int main(int argc, char *argv[])
+{
+  omp_set_num_threads(4);
+#pragma omp parallel
+  {
+    int id = omp_get_thread_num();
+    int nproc = omp_get_num_threads();
+    cout << "Hello world with id number and processes " << id << " " << nproc << endl;
+  }
+  return 0;
+}
+
+Variables declared outside of the parallel region are shared by all threads. If a variable like id had been declared outside of the
+#pragma omp parallel
+
+directive, it would have been shared by the threads, possibly causing erroneous output.
++ +
+ +
+ + +
+ +
+ +
+ + +
The private clause can be used to make thread-private versions of such variables:
+int id;
+#pragma omp parallel private(id)
+{
+  id = omp_get_thread_num();
+  cout << "My thread num " << id << endl;
+}
+
++ +
+ +
+ + +
It is often useful to have only one thread execute some of the code in a parallel region; I/O statements are a common example:
+ + +#pragma omp parallel
+{
+ #pragma omp master
+ {
+ int id = omp_get_thread_num();
+ cout << "My thread num" << id << endl;
+ }
+}
+
++ +
+ +
+ + +
#pragma omp for
+
++ +
+ +
+ + +
OpenMP provides an easy way to parallelize a loop:
+#pragma omp parallel for
+  for (i=0; i<n; i++) c[i] = a[i];
+
+OpenMP makes the loop index variable private to each thread automatically (there is no need to declare it inside the loop or list it in a private clause).
+Which thread handles which iterations? There are several options.
++ +
+ +
+ + +
We can let the OpenMP runtime decide. The decision is about how the loop iterations are scheduled, and OpenMP defines three choices of loop scheduling: static, dynamic and guided.
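+The two program examples that follow show dynamic and guided scheduling. As a complement, here is a minimal added sketch (reusing the same skeleton as those examples) with static scheduling, where the iterations are divided into chunks of fixed size that are handed out to the threads in a round-robin fashion:
+#include <omp.h>
+#define CHUNKSIZE 100
+#define N 1000
+int main (int argc, char *argv[])
+{
+  int i, chunk;
+  float a[N], b[N], c[N];
+  for (i=0; i < N; i++) a[i] = b[i] = i * 1.0;
+  chunk = CHUNKSIZE;
+#pragma omp parallel shared(a,b,c,chunk) private(i)
+{
+  // each thread receives chunks of CHUNKSIZE iterations in a fixed, round-robin order
+#pragma omp for schedule(static,chunk)
+  for (i=0; i < N; i++) c[i] = a[i] + b[i];
+} /* end of parallel region */
+}
+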
++ +
+ +
+ + +
#include <omp.h>
+#define CHUNKSIZE 100
+#define N 1000
+int main (int argc, char *argv[])
+{
+int i, chunk;
+float a[N], b[N], c[N];
+for (i=0; i < N; i++) a[i] = b[i] = i * 1.0;
+chunk = CHUNKSIZE;
+#pragma omp parallel shared(a,b,c,chunk) private(i)
+{
+#pragma omp for schedule(dynamic,chunk)
+for (i=0; i < N; i++) c[i] = a[i] + b[i];
+} /* end of parallel region */
+}
+
++ +
+ +
+ + +
#include <omp.h>
+#define CHUNKSIZE 100
+#define N 1000
+int main (int argc, char *argv[])
+{
+int i, chunk;
+float a[N], b[N], c[N];
+for (i=0; i < N; i++) a[i] = b[i] = i * 1.0;
+chunk = CHUNKSIZE;
+#pragma omp parallel shared(a,b,c,chunk) private(i)
+{
+#pragma omp for schedule(guided,chunk)
+for (i=0; i < N; i++) c[i] = a[i] + b[i];
+} /* end of parallel region */
+}
+
++ +
+ +
+ + +
// #pragma omp parallel and #pragma omp for
+
+can be combined into
+ + +#pragma omp parallel for
+
++ +
+ +
+ + +
What happens with code like this
+ + +#pragma omp parallel for
+for (i=0; i<n; i++) sum += a[i]*a[i];
+
+All threads can access the sum variable, but the addition is not atomic! It is important to avoid race conditions between threads. So-called reductions in OpenMP are thus important both for performance and for obtaining correct results. OpenMP lets us indicate that a variable is used for a reduction with a particular operator. The above code becomes
+ + +sum = 0.0;
+#pragma omp parallel for reduction(+:sum)
+for (i=0; i<n; i++) sum += a[i]*a[i];
+
++ +
+ +
+ + +
int i;
+double sum = 0.;
+/* allocating and initializing arrays */
+/* ... */
+#pragma omp parallel for default(shared) private(i) reduction(+:sum)
+ for (i=0; i<N; i++) sum += a[i]*b[i];
+
+
++ +
+ +
+ + +
Different threads do different tasks independently; each section is executed by one thread.
+ + +#pragma omp parallel
+{
+#pragma omp sections
+{
+#pragma omp section
+funcA ();
+#pragma omp section
+funcB ();
+#pragma omp section
+funcC ();
+}
+}
+
++ +
+ +
+ + +
#pragma omp single { ... }
+
+The code is executed by one thread only; there is no guarantee which thread it is.
+An implicit barrier is introduced at the end of the construct (it can be removed with a nowait clause).
+#pragma omp master { ... }
+
+The code is executed by the master thread; this is guaranteed, and there is no implicit barrier at the end.
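+As a small added sketch of the difference, the following region initializes a shared variable nsteps (a hypothetical name) once with single; the implicit barrier guarantees that all threads see the value before they use it, whereas with master an explicit barrier would have been needed.
+#include <omp.h>
+#include <cstdio>
+int main ()
+{
+  int nsteps;                  // shared by all threads
+#pragma omp parallel shared(nsteps)
+  {
+    // executed by one (unspecified) thread; implicit barrier at the end
+#pragma omp single
+    {
+      nsteps = 100;
+      printf("Thread %d initialized nsteps\n", omp_get_thread_num());
+    }
+    // the implicit barrier after single guarantees that nsteps is set here
+    printf("Thread %d sees nsteps = %d\n", omp_get_thread_num(), nsteps);
+  }
+  return 0;
+}
+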
++ +
+ +
+ + +
#pragma omp barrier
+
+Synchronization: a barrier must be encountered by all threads in a team (or by none).
+#pragma omp ordered { a block of codes }
+
+is another form of synchronization; the block is executed in the sequential loop order.
+The forms
+#pragma omp critical { a block of codes }
+
+and
+#pragma omp atomic { single assignment statement }
+
+also provide synchronization; atomic applies to a single assignment statement only and is usually more efficient than a critical region.
+
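+As a small added sketch (with hypothetical variables), atomic is used below for a single simple update, while critical protects a pair of related updates that must not be interleaved. In a real code the two sums would of course be handled with a reduction clause instead; the critical region is shown only to illustrate the syntax.
+#include <omp.h>
+#include <cstdio>
+int main ()
+{
+  const int n = 1000;
+  int nupdates = 0;
+  double sum = 0.0, sum2 = 0.0;
+#pragma omp parallel for
+  for (int i = 0; i < n; i++) {
+    double value = 0.001*i;
+    // a single simple update: atomic is sufficient and usually cheaper
+#pragma omp atomic
+    nupdates += 1;
+    // several related updates that must not be interleaved: protect them with critical
+#pragma omp critical
+    {
+      sum  += value;
+      sum2 += value*value;
+    }
+  }
+  printf("updates = %d, sum = %g, sum2 = %g\n", nupdates, sum, sum2);
+  return 0;
+}
+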
++ +
+ +
+ + +
What are the purposes of the data scope attribute clauses, such as shared and private?
++ +
+ +
+ + +
+ +
+ +
+ + +
for (i=0; i<100; i++) {
+  for (j=0; j<100; j++) {
+    a[i][j] = b[i][j] + c[i][j];
+  }
+}
+
+#pragma omp parallel for private(j)
+for (i=0; i<100; i++) {
+  for (j=0; j<100; j++) {
+    a[i][j] = b[i][j] + c[i][j];
+  }
+}
+
++ +
+ +
+ + +
When a thread in a parallel region encounters another parallel construct, it +may create a new team of threads and become the master of the new +team. +
+ + +#pragma omp parallel num_threads(4)
+{
+/* .... */
+#pragma omp parallel num_threads(2)
+{
+//
+}
+}
+
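+Nested parallelism is often disabled by default, in which case the inner region above runs with a single thread. The following is a minimal added sketch of how one may enable and inspect two levels of nesting with the standard runtime routines omp_set_max_active_levels, omp_get_level and omp_get_num_threads:
+#include <omp.h>
+#include <cstdio>
+int main ()
+{
+  // allow up to two active levels of parallelism (nesting is often off by default)
+  omp_set_max_active_levels(2);
+#pragma omp parallel num_threads(4)
+  {
+#pragma omp parallel num_threads(2)
+    {
+      printf("nesting level %d, thread %d of %d\n",
+             omp_get_level(), omp_get_thread_num(), omp_get_num_threads());
+    }
+  }
+  return 0;
+}
+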
++ +
+ +
+ + +
#pragma omp task
+
+#pragma omp parallel shared(p_vec) private(i)
+{
+#pragma omp single
+{
+  for (i=0; i<N; i++) {
+    double r = random_number();
+    if (p_vec[i] > r) {
+#pragma omp task
+      do_work (p_vec[i]);
+    }
+  }
+}
+}
+
++ +
+ +
+ + +
Race condition
+ + +int nthreads;
+#pragma omp parallel shared(nthreads)
+{
+nthreads = omp_get_num_threads();
+}
+
+Deadlock
+ + +#pragma omp parallel
+{
+...
+#pragma omp critical
+{
+...
+#pragma omp barrier
+}
+}
+
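+One way (a sketch, not the only solution) to remove the race condition in the first example is to let a single thread perform the assignment; the implicit barrier at the end of the single construct then guarantees that all threads see the value. The deadlock in the second example is avoided simply by never placing a barrier inside a critical region.
+#include <omp.h>
+#include <cstdio>
+int main ()
+{
+  int nthreads;
+#pragma omp parallel shared(nthreads)
+  {
+    // only one thread writes; the implicit barrier at the end of single
+    // ensures that every thread sees the value afterwards
+#pragma omp single
+    nthreads = omp_get_num_threads();
+  }
+  printf("There are %d threads\n", nthreads);
+  return 0;
+}
+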
++ +
+ +
+ + +
Not all computations are simple loops where the data can be evenly +divided among threads without any dependencies between threads +
+ +An example is finding the location and value of the largest element in an array
+ + +for (i=0; i<n; i++) {
+ if (x[i] > maxval) {
+ maxval = x[i];
+ maxloc = i;
+ }
+}
+
++ +
+ +
+ + +
All threads are potentially accessing and changing the same values, maxloc and maxval. Two directives can protect such updates:
+#pragma omp atomic
+
+#pragma omp critical
+
+Atomic may be faster than critical, but this depends on the hardware; atomic is restricted to a single simple update, whereas critical protects an arbitrary block of code.
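+If only the maximum value is needed, and not its location, OpenMP 3.1 and later also offer a max reduction. This added sketch (not the approach followed in the notes below, which also track maxloc) avoids both atomic and critical:
+#include <omp.h>
+#include <cstdio>
+int main ()
+{
+  const int n = 1000;
+  double x[n];
+  for (int i = 0; i < n; i++) x[i] = (i*37) % 101;   // some test data
+  double maxval = -1.0e30;
+  // each thread keeps a private maximum which is combined at the end
+#pragma omp parallel for reduction(max:maxval)
+  for (int i = 0; i < n; i++) {
+    if (x[i] > maxval) maxval = x[i];
+  }
+  printf("maxval = %g\n", maxval);
+  return 0;
+}
+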
++ +
+ +
+ + +
Write down the simplest algorithm and look carefully for race conditions. How would you handle them? +The first step would be to parallelize as +
+ + +#pragma omp parallel for
+ for (i=0; i<n; i++) {
+ if (x[i] > maxval) {
+ maxval = x[i];
+ maxloc = i;
+ }
+}
+
++ +
+ +
+ + +
The race condition can then be handled by protecting the update of maxval and maxloc with a critical region:
+ + +#pragma omp parallel for
+ for (i=0; i<n; i++) {
+#pragma omp critical
+ {
+ if (x[i] > maxval) {
+ maxval = x[i];
+ maxloc = i;
+ }
+ }
+}
+
+Exercise: write a code which implements this and estimate its performance. Perform several runs with a serial code (with and without vectorization) and compare it with the version that uses OpenMP. Run on different architectures if you can.
++ +
+ +
+ + +
The performance is poor because we insisted on keeping track of maxval and its location during the execution of the loop.
+This is a common source of performance issues: the way a computation is formulated imposes additional, unnecessary requirements on its execution.
+Idea: have each thread find the maximum in its own part of the data and then combine the partial results; temporary arrays indexed by the thread number hold the values found by each thread.
+ +
+ + +
int maxloc[MAX_THREADS], mloc;
+double maxval[MAX_THREADS], mval;
+#pragma omp parallel shared(maxval,maxloc)
+{
+ int id = omp_get_thread_num();
+ maxval[id] = -1.0e30;
+#pragma omp for
+ for (int i=0; i<n; i++) {
+ if (x[i] > maxval[id]) {
+ maxloc[id] = i;
+ maxval[id] = x[i];
+ }
+ }
+}
+
++ +
+ +
+ + +
#pragma omp flush (maxloc,maxval)
+#pragma omp master
+ {
+ int nt = omp_get_num_threads();
+ mloc = maxloc[0];
+ mval = maxval[0];
+ for (int i=1; i<nt; i++) {
+ if (maxval[i] > mval) {
+ mval = maxval[i];
+ mloc = maxloc[i];
+ }
+ }
+ }
+
+Note that we let the master thread perform the final combination.
++ +
+ +
+ + +
This code computes the norm of a vector using OpenMP.
+ + +// OpenMP program to compute vector norm by adding two other vectors
+#include <cstdlib>
+#include <iostream>
+#include <cmath>
+#include <iomanip>
+#include <omp.h>
+# include <ctime>
+
+using namespace std; // note use of namespace
+int main (int argc, char* argv[])
+{
+ // read in dimension of vector
+ int n = atoi(argv[1]);
+ double *a, *b, *c;
+ int i;
+ int thread_num;
+ double wtime, Norm2, s, angle;
+ cout << " Perform addition of two vectors and compute the norm-2." << endl;
+ omp_set_num_threads(4);
+ thread_num = omp_get_max_threads ();
+ cout << " The number of processors available = " << omp_get_num_procs () << endl ;
+ cout << " The number of threads available = " << thread_num << endl;
+ cout << " The matrix order n = " << n << endl;
+
+ s = 1.0/sqrt( (double) n);
+ wtime = omp_get_wtime ( );
+ // Allocate space for the vectors to be used
+ a = new double [n]; b = new double [n]; c = new double [n];
+  // Set up values for vectors a and b in parallel
+# pragma omp parallel for default(shared) private (angle, i)
+  for (i = 0; i < n; i++){
+    angle = 2.0*M_PI*i/ (( double ) n);
+    a[i] = s*(sin(angle) + cos(angle));
+    b[i] = s*sin(2.0*angle);
+    c[i] = 0.0;
+  }
+  // Then perform the vector addition in parallel
+# pragma omp parallel for default(shared) private (i)
+  for (i = 0; i < n; i++){
+    c[i] += a[i]+b[i];
+  }
+  // Compute now the norm-2 with a reduction over the threads
+  Norm2 = 0.0;
+# pragma omp parallel for default(shared) private (i) reduction(+:Norm2)
+  for (i = 0; i < n; i++){
+    Norm2 += c[i]*c[i];
+  }
+  // end of parallel loops
+ wtime = omp_get_wtime ( ) - wtime;
+ cout << setiosflags(ios::showpoint | ios::uppercase);
+ cout << setprecision(10) << setw(20) << "Time used for norm-2 computation=" << wtime << endl;
+ cout << " Norm-2 = " << Norm2 << endl;
+ // Free up space
+ delete[] a;
+ delete[] b;
+ delete[] c;
+ return 0;
+}
+
++ +
+ +
+ + +
This is the matrix-matrix multiplication code with plain C++ memory allocation using OpenMP.
+ + + +// Matrix-matrix multiplication and Frobenius norm of a matrix with OpenMP
+#include <cstdlib>
+#include <iostream>
+#include <cmath>
+#include <iomanip>
+#include <omp.h>
+# include <ctime>
+
+using namespace std; // note use of namespace
+int main (int argc, char* argv[])
+{
+ // read in dimension of square matrix
+ int n = atoi(argv[1]);
+ double **A, **B, **C;
+ int i, j, k;
+ int thread_num;
+ double wtime, Fsum, s, angle;
+ cout << " Compute matrix product C = A * B and Frobenius norm." << endl;
+ omp_set_num_threads(4);
+ thread_num = omp_get_max_threads ();
+ cout << " The number of processors available = " << omp_get_num_procs () << endl ;
+ cout << " The number of threads available = " << thread_num << endl;
+ cout << " The matrix order n = " << n << endl;
+
+ s = 1.0/sqrt( (double) n);
+ wtime = omp_get_wtime ( );
+ // Allocate space for the two matrices
+ A = new double*[n]; B = new double*[n]; C = new double*[n];
+ for (i = 0; i < n; i++){
+ A[i] = new double[n];
+ B[i] = new double[n];
+ C[i] = new double[n];
+ }
+  // Set up values for matrices A and B in parallel
+# pragma omp parallel for default(shared) private (angle, i, j)
+  for (i = 0; i < n; i++){
+    for (j = 0; j < n; j++) {
+      angle = 2.0*M_PI*i*j/ (( double ) n);
+      A[i][j] = s * ( sin ( angle ) + cos ( angle ) );
+      B[j][i] = A[i][j];
+    }
+  }
+  // Then perform the matrix-matrix multiplication in parallel
+# pragma omp parallel for default(shared) private (i, j, k)
+  for (i = 0; i < n; i++){
+    for (j = 0; j < n; j++) {
+      C[i][j] = 0.0;
+      for (k = 0; k < n; k++) {
+	C[i][j] += A[i][k]*B[k][j];
+      }
+    }
+  }
+  // Compute now the Frobenius norm with a reduction over the threads
+  Fsum = 0.0;
+# pragma omp parallel for default(shared) private (i, j) reduction(+:Fsum)
+  for (i = 0; i < n; i++){
+    for (j = 0; j < n; j++) {
+      Fsum += C[i][j]*C[i][j];
+    }
+  }
+  Fsum = sqrt(Fsum);
+  // end of parallel loops
+ wtime = omp_get_wtime ( ) - wtime;
+ cout << setiosflags(ios::showpoint | ios::uppercase);
+ cout << setprecision(10) << setw(20) << "Time used for matrix-matrix multiplication=" << wtime << endl;
+ cout << " Frobenius norm = " << Fsum << endl;
+ // Free up space
+ for (int i = 0; i < n; i++){
+ delete[] A[i];
+ delete[] B[i];
+ delete[] C[i];
+ }
+ delete[] A;
+ delete[] B;
+ delete[] C;
+ return 0;
+}
+
++ +
+ +