Description
Background
As described in #455, it is possible for the beta to fail the fitting process. In this case, the expected behavior is:
- The BetaUnivariate will throw a FitError
- When using this univariate, we implement a try/catch logic and switch to a fallback distribution instead
This works when scipy throws an error. However, there are cases where scipy does not throw an error and returns a bad fit. We should catch such cases and throw an error for them.
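The gap can be illustrated with a minimal sketch of the try/catch fallback pattern. Everything here is a stand-in written for illustration: the `FitError` class is defined locally rather than imported from copulas, and the three fitter functions and `fit_with_fallback` are hypothetical names, not the library's actual code.

```python
class FitError(Exception):
    """Local stand-in for the library's fit error (not imported from copulas)."""


def fit_beta_raises(data):
    # Hypothetical fitter that fails loudly.
    raise FitError("beta fit did not converge")


def fit_beta_silently_bad(data):
    # Hypothetical fitter that returns degenerate parameters without raising.
    return {"name": "beta", "loc": 2.1833, "scale": 9.5e-29}


def fit_fallback(data):
    # Hypothetical fallback: a uniform distribution over the observed range.
    return {"name": "uniform", "loc": min(data), "scale": max(data) - min(data)}


def fit_with_fallback(data, primary, fallback):
    """Try the primary fitter; switch to the fallback only if FitError is raised."""
    try:
        return primary(data)
    except FitError:
        return fallback(data)


data = [1.0, 2.5, 4.0, 7.0]
loud = fit_with_fallback(data, fit_beta_raises, fit_fallback)
silent = fit_with_fallback(data, fit_beta_silently_bad, fit_fallback)
print(loud["name"])    # fallback was used
print(silent["name"])  # bad beta fit slipped through
```

The fallback only triggers on an exception, so a fit that returns nonsensical parameters without raising is accepted as-is.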
Replication
The data below came from SDV #2542. I've pulled out the particular column into this new CSV below.
The column contains data roughly in the range [1, 7]. When fitting this data, there is no error, but the fitted parameters clearly don’t make sense:
import pandas as pd
from copulas.univariate import BetaUnivariate
data = pd.read_csv('example_data.csv')
beta = BetaUnivariate()
beta.fit(data)
print(beta._params)

{
'loc': 2.1833080938648672,
'scale': 9.555402199071863e-29,
'a': 26439403.24259588,
'b': 28.274298502064756
}
Most notably: The scale parameter is so small, it is essentially 0. This means that the distribution is essentially only capable of creating a constant value of ~2.1833. There have been many such examples of this for the Beta distribution specifically.
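A quick check of why this collapses to a constant, assuming the usual loc/scale convention in which a sampled value is `loc + scale * x` with `x` in [0, 1]. The variable names are illustrative; only the two parameter values come from the output above.

```python
import math

loc = 2.1833080938648672
scale = 9.555402199071863e-29

# Under the loc/scale convention, samples lie in [loc, loc + scale].
lowest = loc + scale * 0.0
highest = loc + scale * 1.0

# The scale is far below the spacing between adjacent floats near loc,
# so adding it to loc does not change the value at all.
print(scale < math.ulp(loc))
print(highest == lowest)
```

In double precision the upper and lower ends of the support are the same number, so every sample equals `loc`.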
Expected Behavior
The BetaUnivariate distribution should check itself after fitting to make sure the parameters make sense. For now, these are two obvious checks we should add:
- The scale value should not be close to zero, as in the example above. (Perhaps we can check that it is greater than EPSILON?)
- The range [loc, loc+scale] should have at least some, non-zero overlap with the actual [data_min, data_max] range. That is to say: loc + scale should be > data_min and loc should be < data_max
If any one of these checks fails, BetaUnivariate should throw a descriptive error explaining that the fitting failed.
Error: Converged parameters for beta distribution have a near-zero range.
Error: Converged parameters for beta distribution are outside the min/max range of the data.
The rest of the code will then adapt to using a fallback.
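The two checks described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: `check_beta_params` and `BetaFitError` are hypothetical names, the `EPSILON` value of 1e-10 is an arbitrary placeholder (the issue only suggests "greater than EPSILON"), and the parameters are passed as a plain dict with `loc` and `scale` keys.

```python
EPSILON = 1e-10  # assumed threshold, chosen only for this sketch


class BetaFitError(Exception):
    """Hypothetical stand-in for the error raised on a bad fit."""


def check_beta_params(params, data_min, data_max, epsilon=EPSILON):
    """Raise BetaFitError if the fitted loc/scale are clearly degenerate."""
    loc = params["loc"]
    scale = params["scale"]

    # Check 1: the support [loc, loc + scale] must have a non-trivial width.
    if scale <= epsilon:
        raise BetaFitError(
            "Converged parameters for beta distribution have a near-zero range."
        )

    # Check 2: the support must overlap the observed data range, i.e.
    # loc + scale > data_min and loc < data_max.
    if loc + scale <= data_min or loc >= data_max:
        raise BetaFitError(
            "Converged parameters for beta distribution are outside "
            "the min/max range of the data."
        )


# Parameters from the replication above, with data roughly in [1, 7].
bad_params = {"loc": 2.1833080938648672, "scale": 9.555402199071863e-29}
try:
    check_beta_params(bad_params, data_min=1.0, data_max=7.0)
except BetaFitError as error:
    print(f"Error: {error}")
```

The near-zero check runs first, so a degenerate scale is reported as such even when `loc` happens to fall inside the data range, as it does in the replication example.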
Future Work
(Not in scope for now) In the future, we can keep adding/refining the checks for the failure modes that we find.
In the future, we can also start to investigate how to “fix” the Beta’s fit rather than error out immediately. (For example: Does supplying floc and fscale help? We’d need to investigate).