Statool Interface, Algorithms, Maintenance, and Applications
By Jianzhong Zhang and Daniel Berleant, July 2003
1 Introduction to Uncertainty Intervals
2 Narrowing the Envelopes Around Results Using Correlation
2.3 Interval-Valued Correlations
2.4 Legal and Illegal Correlation Values
2.5 Additional Constraints Gotten from Correlation
2.6 Nonlinear Optimization to Remove Excess Width
2.7 Improving Results by Adding Constraints to LP
2.8.1 How to Find the Initial Feasible Solution
2.8.2 How to Decide the Termination Condition and Entering Variable
2.8.3 How to Determine the Leaving Variable
2.8.4 Decreasing the Computational Cost
2.9.1 Local and Global Optimums
2.9.2 Classical Theory of Unconstrained Optimization
2.9.3 Finding a Solution Iteratively
2.9.4 Search Methods: Direct and Gradient
2.9.5 Converting Constrained to Unconstrained Optimization
3.1.1 Background on the Transportation Simplex Method
3.1.2 Exceptions in Finding the Initial Solution
3.1.3 Adaptation to the Unknown Dependency Case
3.3.1 Relational Operations on Intervals
3.3.2 Relational Operations on Random Variables
3.4.2 Limitations on Evaluating Expressions
3.4.3 Excess Width in Expressions
4.1 Economic Dispatch: Applying the Interval-Based Distribution Envelope Determination Algorithm
4.2 Bounding the Composite Value at Risk for Energy Service Company Operation with DEnv
5 Appendix: Interval mathematics, a brief review
6 Appendix: Software Architecture
6.3.1 Main Primary Window: the Operation Window
6.3.2 Other Primary Windows: the Data Editor and the View Windows
7 Appendix: Component Dsign and Implementation
7.2 Operation and Properties Windows
7.2.2 Properties Windows of the Operation Window
7.4 Lower 2 Levels of Logical Layer
7.4.1 Utility Functions and Internal Data Structures
7.4.2 Access Points to the Computing Layer
7.5.2 General (Legacy) Simplex Method
7.5.3 Transportation Simplex Method
7.5.4 Incorporating Correlation as a Constraint
8 Appendix: Sample Computations
Figure 1‑1. Probability bounds for random variable Z.
Figure 3‑1. Converting CDF envelopes to a set of intervals and associated probabilities.
Figure 3‑2. Result for operation.
Figure 3‑5. Expression editor.
Figure 6‑7. Package view for the computing layer.
Figure 7‑1. Operation window design.
Figure 7‑8. Popup menu for operand X.
Figure 7‑9. Popup menu for operand Y.
Figure 7‑10. Popup menu for result Z.
Figure 7‑11. Correlation setting.
Figure 7‑13. Display mode window.
Figure 7‑14. Display color window.
Figure 7‑15. Correlation setting window.
Figure 7‑16. Expression editor window.
Figure 7‑17. Class relationship for transportation simplex method.
Figure 7‑18. Class CConfigure.
Figure 7‑19. Flow of control for transportation simplex implementation.
Figure 7‑20. Class: COptimalMin.
Figure 7‑21. Class: CConfigure.
Figure 7‑22. Class: CVariance.
Figure 7‑25. Class: CBoundCorrelation.
Figure 7‑26. Class: CMinCorrelation.
Figure 7‑27. Class: CMinCorrelationExy.
Figure 7‑29. Sequence diagram for Cor_Bound.
Figure 7‑30. Sequence diagram for Cor_Min( ).
Figure 7‑31. Sequence diagram for Cor_Min_Exy( ).
Figure 7‑32. Sequence diagram for Cor_Min_Exy_S( ).
Figure 8‑2. X+Y when X and Y are 32 intervals.
Figure 8‑3. X+Y when X and Y have 64 intervals.
Figure 8‑5. X*Y for correlation 0.98.
Figure 8‑6. X*Y for correlation 0.
Figure 8‑7. X*Y for correlation -0.98.
Figure 8‑8. Times for operations.
Figure 8‑9. Times for multiplication and division.
Uncertainty exists frequently in our knowledge of the real world. Probability is a common way to understand uncertainty, and is intrinsic to the concept of the random variable. Intuitively, a random variable is a function that takes as input an experiment and returns a value that expresses the result of the experiment. Thus if the experiment is to measure a voltage, the value returned is the number of volts. Typically, the result of the experiment cannot be determined in advance. A probability distribution function is often used to describe the way the experiment is more likely to return some values than others. Sometimes random variables are defined such that samples of their distributions are derived from arithmetic operations on samples of the distributions of other random variables. Determining the distribution of these derived random variables has been the goal of a considerable body of work. Generally, there are two classes of methods to reach this goal: analytical and numerical. Analytical methods are restricted to specific classes of input distributions. Numerical methods give numerical (rather than closed form) results and are widely used in real applications.
Monte Carlo simulation is one of the best-known numerical methods. However as traditionally done, Monte Carlo has some drawbacks. Major ones include undependable conclusions about unusual situations, however important those situations may be, because in a simulation run, random number generation may not generate such situations at the same rate that they would actually be expected to occur, or might even fail to generate any example of such a situation at all.
To alleviate the problems with Monte Carlo analysis, one method that has been described is Distribution Envelope Determination (DEnv), developed by Berleant and Goodman-Strauss. Another approach is the copula-based approach called Probabilistic Arithmetic. These two methods have been implemented in software.
DEnv is now supported by Statool, software that supports a variety of dependence relationships between random variables whose vectors of samples each result in a sample of a derived random variable. These include independence, unknown dependence, and specific correlation values. The software may be downloaded as used at no charge. Software implementing Probabilistic Arithmetic may be purchased commercially.
Uncertainty is frequently present in our knowledge of the real world. Handling uncertainty is therefore an important problem. Probability is a common approach, and probability density functions (PDFs) or their integrals, cumulative distribution functions (CDFs), are examples of the probabilistic approach. These are used to describe random variables. Often, such random variables are defined based on combinations of other random variables. These are called derived random variables, and their distributions are called derived distributions [Springer, 1979]. For example, samples of a derived random variable could be defined as the sum, the max, or some other function of samples of two other random variables. Derived random variables are recognized in fields such as decision analysis, risk analysis, and many others.
A variety of methods have been developed to address this topic. There are two classes of such methods: analytical and numerical. Analytical methods are restricted to specific classes of input distribution, and usually make simplifying assumptions, such as independence. For example, normal distributions are often used. If two random variables are normal and independent, the sum of these two random variables still is normal. It is also possible to obtain derived distributions for specified dependency relationships other than independence, such as perfect positive rank correlation. However, it is often not easy to obtain analytical results for arithmetic operations on random variables without assumptions and it is not always reasonable to make these assumptions though it may be convenient. Sometimes, we do not have information about the dependency. But an advantage of analytical methods is accuracy. Unlike analytical methods, numerical methods only give numerical results. However this is suitable for a wide class of distributions. Numerical methods are widely used in real applications if approximate results within specific tolerances are acceptable.
Monte Carlo simulation is one of the best-known numerical methods. However, the traditional approach of Monte Carlo has some limitations. It assumes that the distribution of the random variables is known, and that their dependency relationship is independent or known [Ferson 1996]. If either the probability distributions or the dependency relationship of the random variables is not available, some assumptions are usually necessary process it. If the assumptions do not, dependability of results can be degraded.
A discrete convolution approach may be used to calculate the result given independence of the input random variables [Ingram et al. 1968; Colombo and Jaarsma 1980; Kaplan 1981]. Interval analysis may be used to solve this problem while providing error bounds. (Note that intervals become closer to point values as the intervals get narrower.) Interval mathematics became an identifiable field in the 1960’s [Moore, 1966].
Intervals have the potential for bounding the result of an operation. Discretization error coming from discretizing distributions may be bounded by interval based discretization [Berleant 1993]. If the dependency is not specified, result bounds will include the entire range of possible dependencies. These bounds will normally be wider than if a particular dependency is specified. The Distribution Envelope Determination (DEnv, referred to in previous PSERC reports as “IBDA”) technique for computing bounds on derived distributions without specifying dependency was described by Berleant and Goodman-Strauss [1998]. This approach has fundamental similarities with the copula-based approach [Frank, et al., 1987], which was significantly extended by Williamson and Downs [1990]. These two methods have been implemented in software. The copula-based approach, termed probabilistic arithmetic, is implemented in the commercial software RiskCalc [Ferson et al. 1998]. DEnv is implemented as Statool [Berleant and Goodman-Strauss 1998], which extends the previous tool [Berleant and Cheng 1998] by eliminating the independence assumption. DEnv (and Statool) thus can handle the case where a dependency relationship is unknown or unspecified by not making assumptions about the dependency relationship between operands. Yet, partial dependence information might be available in some cases. If we can use this information in the calculation, we will get more accurate results than can be obtained without using this information.
DEnv and its implementation support a variety of dependency relationships, such as independence, unknown dependence, and correlation values. The algorithm extension to support correlation is a significant improvement. In the implementation, we have found that using the transportation simplex method speeds up computing significantly over the classical simplex method. Recent advances in the algorithm now allow cascaded operations and monotonic binary functions to be supported. These new functions, and use of correlation as a constraint, are the main recent advances in DEnv and its implementation. Among the other contributions reported here are addressing example application problems. Recently we have shown, using bounds on derived distributions in a GENCO competitive bidding scenario, that
[1] the unjustified assumption of independence between uncertain quantities has a partially quantifiable dollar cost,
[2] the information that uncertain quantities are independent, if true, has a quantifiable dollar value, and
[3] a decision tree (d-tree) approach can identify optimal bids even when the dependency relationship between uncertain quantities is unknown.
An interval can be used to bound the range for a value. Arithmetic operations on intervals have been defined in the literature. For example, if interval X is the interval [1,2], and interval Y is the interval [3,4], then interval Z=X+Y is the interval [1+3,2+4]=[4,6]. Some further explanation of properties of intervals is provided in Appendix 5. A use for intervals in computations on random variable distributions is to partition the domain of the distribution into a set of intervals, with a probability associated with each. This partitioning is the basis for discretizing distributions and extending binary operations from intervals to (discretized) distributions.
Given 2 random variables X and Y, to get the exact distribution for a function of samples of X and Y, we must know the joint distribution of X and Y. The joint distribution is constrained (though not determined) by the correlation for these two random variables. Consider an example. The following table gives the interval-based discretization for two distributions, one for random variable X and one forY.
Table 1.1. Discretized distributions for X and Y.
|
|
X |
Y |
||||
|
Range |
[1,2] |
[2,3] |
[3,4] |
[2,3] |
[3,4] |
[4,5] |
|
Probability |
0.25 |
0.5 |
0.25 |
0.5 |
0.3 |
0.2 |
Sharing of interval endpoints between adjacent intervals in a discretization is of no practical consequence unless distributions have impulses at those shared endpoints. In that case discretizations would need to contain partially open intervals, e.g., [1,2). Note that the discretizations contain no information about distributions of probabilities over their corresponding intervals, however, each interval does limit its probability to within its endpoints. We also do not know the dependency relationship between X and Y. In other words, we do not know the joint distribution for X and Y.
Consider addition: Z=X+Y. Because we do not have the joint distribution for X and Y, it is impossible to find the precise distribution for Z. However, we can put these two random variables into a “joint distribution tableau,” as shown in the following table.
Table 1.2. Joint distribution tableau showing marginal distributions for X and Y.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The last row in the table is the distribution for X and last column is the distribution for Y. We do not know the values of probabilities p11 through p33 because we do not know the joint distribution. For the simple case of X and Y independent, we can fill in the missing values as in the following table.
Table 1.3. Joint distribution for independence.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Because the joint distribution is affected by the dependency relationship between X and Y, if we do not know the dependency relationship between X and Y, we cannot determine the joint distribution in this tableau. But we can infer some things about the result variable Z=X+Y from this matrix. For example, consider z=5. This is possible only in the grey cells in the next table.
Table 1.4. Graying indicates cells in which z might equal 5.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
As previous stated, we do not know the exact
probability for z<=5. However we can ask what the possible
probabilities are for z<=5. As this matrix shows, only grey cells
contribute to the probability of z<=5. We would like to determine
the maximum probability and the minimum probability. To get the maximum value,
all cells in which z can be <= 5 will have their probabilities summed.
To obtain the minimum value, only cells in which Z must be <= 5 will have
their probabilities summed. For example, consider cell
. When we
calculate the maximum value for z<=5, this cell is counted because z
can be <= 5 in this cell. But for the minimum value, we do not count this
cell because z might not be <= 5 in this cell. This way, we can find
the possible range of cumulative probabilities for various values of z,
a sample of random variable Z. We can find the maximum possibility and
minimum possibility for every value of z, and connect all these points to
get 2 curves: a left curve and a right curve. All the CDFs that are possible
for Z must belong between these two curves.
In
this example, suppose Z’s range is from 3 to 9. It is clear that the
probability for z<3 is 0 and for z>9 is 1. For example, consider
the probability of
.
·
Maximum. First, find all the cells in which this situation may
occur. From the previous table, these cells are
,
, and
. So
the maximum value is the maximum value for the sum of
,
, and
.
·
Minimum. First, find all the cells in which z must be
<= 4. In this table, there are none. Although
,
, and
may
satisfy
, they also might not. For
example, the whole probability for the cell might be concentrated at the high
bound of its range. So there is no cell in which z must be <= 4.
Summarizing the above analysis, we can define a way to tell which cells contribute to the maximum and minimum probability values. Maximum: all the cells in which the low bound is not greater than the value of z contribute to the max value. Minimum: all the cells in which the high bound is not greater than the value of z contribute to the min value.
After finding all the cells satisfying the max (or
min) condition, we must calculate the sum of the probabilities of these cells.
Based on the previous table, there exist constraints for the probabilities pij.
It is clear that the sum of the pij’s in a row or column cannot
go over the marginal probability of that row or column. These constraints can
be described as follows.
Row Constraints:
for i=1 to 3.
Column Constraints:
for j=1 to 3.
Therefore, the question becomes: find the maximum and minimum value for the sum
of cells under these constraints. For the case
, we can describe these
questions as follows.
Maximum - make the sum of the specified cells’ value
big enough, that is, find
max (
+
+
) such
that:
for
i=1 to 3
and
for j=1 to 3.
Minimum - make the sum of specified cells’ values small enough, that is, find
min (
) such that:
for
i=1 to 3
and
for j=1 to 3
For these two optimization questions, linear programming is a suitable tool to. Thus, we can find the probability range for the specified value of z. The following table shows the probabilities for various values of z.
Table 1.5. Probabilities for result variable z.
|
z range |
Maximum probability |
Minimum probability |
|
z<3 |
0 |
0 |
|
z<=3 |
0.25 |
0 |
|
z<=4 |
0.75 |
0 |
|
z<=5 |
1 |
0 |
|
z<=6 |
1 |
0.25 |
|
z<=7 |
1 |
0.55 |
|
z<=8 |
1 |
0.8 |
|
z<=9 |
1 |
1 |
|
z<=10 |
1 |
1 |
From this table, we can draw two curves, a top curve and a bottom curve, using the maximum and minimum probabilities shown for various values of z. These two curves also can be called envelopes for the CDF of derived random variable Z because the CDF for Z must be between these 2 curves whatever the dependency relationship between X and Y is. The following figure shows the result.

Figure 1‑1. Probability bounds for random variable Z.
The previous chapter discussed the core DEnv algorithm. This algorithm uses linear programming based on the row and column constraints on the pij’s. The previous chapter also mentioned an important factor: correlation. If one knows something about correlation, it would be good to be able to use it to narrow the separation of envelopes. We have recently identified how to convert information about the correlation into constraints that can supplement the row and column constraints of the core algorithm, resulting in many cases in narrowing of the space between the left and right envelopes. We describe this new augmentation to the algorithm in this chapter. For the published account, see Berleant and Zhang (in press).
Correlation measures the degree of correspondence
between random variables. To describe this kind of relationship, there are a
number of methods. For example, we can consider the linear relationship
between two random variables, or the linear relationship of the squares of the
random variables, or many other possible relationships. By far the most
popular correlation coefficient is called Pearson correlation, or Pearson product-moment
correlation. It measures the strength of the linear relationship between two
random variables. It is defined as
![]()
![]()
were D(X) is the variance of X and D(Y) is the
variance of Y. E is the expectation function.
A joint distribution describes the detailed dependency between two R.V.’s. From the joint distribution, we can get the correlation. But correlation does not imply a specific joint distribution, so in general we cannot get the joint distribution from a value of correlation.
When the correlation is unknown, we use linear programming to find CDF envelopes, as described in the previous chapter. If we know the correlation between two operands, we would like to use that information to determine additional constraints for the linear programming problem. In another words, we wish to decrease the feasible solution space and get a better solution.
According to the definition of correlation, for samples x and y of two random variables X and Y, the correlation is
![]()
Where
and
are
the means for x and y.
Using the following formulas, we can simplify terms:
![]()
where Exy is the expectation of x*y. Also,

So previous formula becomes
![]()
Using the definition of mean, when variable x is discrete,
where
.
When x is continuously distributed, and has density function f, then
.
In the DEnv algorithm, we do not care if a random variable is discrete or continuous. We use bars to discretize the distribution. This method has the following characteristics:
We now extend the definition of mean to intervals.
, where
E(X) is the (interval-valued) mean of a collection of intervals X={X1,...Xm},
and
.
It is clear that the mean Ex is in E(X).
When x is continuous, we also can get the mean based on the following argument.
Consider some bar in the discretization of variable x whose distribution function is f(x). The probability that it is in [a,b] is the area of f(x) between a and b.

We can partition the domain of variable x
into many intervals such as this one, denoting them Xi. The
mean of variable x now becomes
![]()
Consider one item in the previous formula, assuming Xi is [a,b]
as in the previous figure.
.
Similarly, we also get
.
So,
must belong to Xi*pi.
Thus E(x) is in the mean of interval variable X.
If the intervals overlap, the width of the mean is wider than for the non-overlapped case. Thus, the mean of the non-overlapped intervals is a subset of that of the overlapped intervals. Hence E(x) belongs to mean of interval variable X in this case too.
In the current software, the user can input any value of correlation from –1 to 1. But in fact, for some marginal distributions, there are correlation values which will not be exhibited by any joint distribution. In fact, the constraints derived from setting the correlation to an impossible value should conflict with the constraints derived from the marginals of the joint distribution matrix (the row and column constraints).
From the definition of correlation, and algebraic
rearrangement, we obtain this formula for E(xy):
. Let
f(x,y)=E(xy), so f(x,y)
is a real function of x and y. We can rewrite E(xy)
with intervals X and Y.
.
This is the interval extension of the corresponding real function
![]()
where
and
. If
is
an interval, it becomes another variable in function f.
The software should provide a way to help the user to set a reasonable correlation. To do this, first, the software must figure out the range of possible correlations for the current random variables. Then the software can display this information. It then only accepts values intersecting with this range.
As mentioned before, there are 2 kinds of constraints, one coming from the marginals of the joint distributions matrix and another coming from the correlation setting. The joint distribution matrix marginals are assumed correct. So, the constraints arising from them are a given. If constraints coming from a correlation setting conflict with them, they must be in error. Constraints coming from the matrix are primary and constraints coming from correlation should be considered secondary.
Consider a joint distribution matrix for an
operation
.
Table 2.1. Joint distribution matrix.
|
|
Y1 |
… |
Ym |
|
|
X1 |
|
… |
|
|