BetaUnivariate should reject fitted parameters that obviously don’t match the data #472

@npatki

Background

As described in #455, it is possible for the Beta distribution to fail during fitting. In this case, the expected behavior is:

  1. The BetaUnivariate raises a FitError
  2. The code that uses this univariate catches the error (try/except) and switches to a fallback distribution instead

This works when scipy throws an error. However, there are cases where scipy does not throw an error and returns a bad fit. We should catch such cases and throw an error for them.
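The intended flow (raise a FitError, then fall back) can be sketched roughly as follows. This is only an illustration, not the library's actual code: `FitError` is redefined locally as a stand-in, and `fit_with_fallback`, `AlwaysFails`, and `MeanOnly` are hypothetical names invented for the example.

```python
class FitError(Exception):
    """Local stand-in for the FitError mentioned above."""


def fit_with_fallback(primary, fallback, data):
    """Fit `primary`; if it raises FitError, fit and return `fallback` instead."""
    try:
        primary.fit(data)
        return primary
    except FitError:
        fallback.fit(data)
        return fallback


class AlwaysFails:
    """Hypothetical distribution whose fit always fails."""

    def fit(self, data):
        raise FitError('fitting failed')


class MeanOnly:
    """Hypothetical fallback that only records the mean of the data."""

    def fit(self, data):
        self.mean = sum(data) / len(data)


chosen = fit_with_fallback(AlwaysFails(), MeanOnly(), [1.0, 2.0, 3.0])
print(type(chosen).__name__)
```

The fallback only engages if the primary distribution actually raises, which is why a silently bad fit slips through.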

Replication

The data came from SDV #2542. I've pulled the relevant column out into the new CSV below.

example_data.csv

The column contains data roughly in the range [1, 7]. When fitting this data, there is no error, but the fitted parameters clearly don’t make sense:

import pandas as pd
from copulas.univariate import BetaUnivariate

data = pd.read_csv('example_data.csv')
beta = BetaUnivariate()
beta.fit(data)
print(beta._params)

Output:

{
 'loc': 2.1833080938648672,
 'scale': 9.555402199071863e-29,
 'a': 26439403.24259588,
 'b': 28.274298502064756
}

Most notably, the scale parameter is so small that it is essentially 0. This means the distribution can effectively only produce a constant value of ~2.1833. There have been many similar examples for the Beta distribution specifically.
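To illustrate how degenerate this fit is: a beta distribution with loc and scale has support [loc, loc + scale]. A quick check with the values above (plain Python floats, no library calls into copulas or scipy) suggests the upper end of the support is not even distinguishable from loc in double precision:

```python
import math

# Fitted values copied from the output above
loc = 2.1833080938648672
scale = 9.555402199071863e-29

upper = loc + scale

# Spacing between loc and the next representable float (Python 3.9+)
spacing = math.ulp(loc)

print(upper == loc)     # True: the support collapses to a single point
print(scale < spacing)  # True: scale is far below float resolution at loc
```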

Expected Behavior

The BetaUnivariate distribution should check itself after fitting to make sure the parameters make sense. For now, these are two obvious checks we should add:

  • The scale value should not be close to zero, as in the example above. (Perhaps we can check that it is greater than EPSILON?)
  • The range [loc, loc+scale] should have some non-zero overlap with the actual [data_min, data_max] range. That is to say:
    • loc + scale should be > data_min and
    • loc should be < data_max

If any one of these checks fails, BetaUnivariate should throw a descriptive error explaining that the fitting failed.

Error: Converged parameters for beta distribution have a near-zero range.
Error: Converged parameters for beta distribution are outside the min/max range of the data.
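A minimal sketch of what the two checks might look like, assuming the fitted parameters are available as a dict with 'loc' and 'scale' keys as in the output above. The function name `check_beta_params` is hypothetical, `FitError` is redefined locally as a stand-in, and the `EPSILON` value is an assumed placeholder rather than the library's constant.

```python
class FitError(Exception):
    """Local stand-in for the FitError mentioned in this issue."""


# Assumed threshold for illustration; the issue only suggests "greater than EPSILON"
EPSILON = 1e-7


def check_beta_params(params, data_min, data_max, epsilon=EPSILON):
    """Raise FitError if the fitted loc/scale obviously don't match the data."""
    loc = params['loc']
    scale = params['scale']

    # Check 1: scale must not be (close to) zero
    if scale <= epsilon:
        raise FitError(
            'Converged parameters for beta distribution have a near-zero range.'
        )

    # Check 2: [loc, loc + scale] must overlap [data_min, data_max]
    if loc + scale <= data_min or loc >= data_max:
        raise FitError(
            'Converged parameters for beta distribution are outside the '
            'min/max range of the data.'
        )


# Parameters from the replication above, with data roughly in [1, 7]
bad_params = {'loc': 2.1833080938648672, 'scale': 9.555402199071863e-29}
try:
    check_beta_params(bad_params, data_min=1, data_max=7)
except FitError as error:
    print(f'Error: {error}')
```

The near-zero check runs first, so a collapsed fit like the one above is reported as a range problem even when loc happens to fall inside the data range.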

The rest of the code will then adapt to using a fallback.

Future Work

(Not in scope for now) In the future, we can keep adding/refining the checks for the failure modes that we find.

In the future, we can also start to investigate how to “fix” the Beta’s fit rather than error out immediately. (For example: Does supplying floc and fscale help? We’d need to investigate).
