Beginner’s Tutorial

Reading Data from a String

Data can be read from a string using the ImportString class of explann.dataio. The default delimiter is \s, the special character for whitespace.

[1]:
from explann.dataio import ImportString

data_string = """
Observação  Dureza  Temperatura
1   137     220
2   137     220
3   137     220
4   136     220
5   135     220
6   135     225
7   133     225
8   132     225
9   133     225
10  133     225
11  128     230
12  124     230
13  126     230
14  129     230
15  126     230
16  122     235
17  122     235
18  122     235
19  119     235
20  122     235
"""

data_reader_string = ImportString(data=data_string, delimiter=r"\s")  # raw string avoids an invalid-escape warning

The data_reader_string object stores the provided data in its .data attribute.

[2]:
data_reader_string.data
[2]:
Observação Dureza Temperatura
0 1 137 220
1 2 137 220
2 3 137 220
3 4 136 220
4 5 135 220
5 6 135 225
6 7 133 225
7 8 132 225
8 9 133 225
9 10 133 225
10 11 128 230
11 12 124 230
12 13 126 230
13 14 129 230
14 15 126 230
15 16 122 235
16 17 122 235
17 18 122 235
18 19 119 235
19 20 122 235
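
As an aside on the whitespace delimiter, the sketch below uses only the standard-library re module to show how a pattern for one whitespace character differs from a pattern for a run of whitespace. It is not how ImportString is implemented, and the sample rows with their spacing are assumptions made for illustration; how ImportString itself treats runs of spaces is not shown in this tutorial.

```python
import re

# Two rows in the style of the table above; columns are separated by runs of spaces.
rows = ["Observação  Dureza  Temperatura", "1   137     220"]

# r"\s" matches exactly one whitespace character, so a run of spaces
# leaves empty fields between the real values.
single = [re.split(r"\s", row) for row in rows]

# r"\s+" matches a whole run of whitespace and yields one field per column.
runs = [re.split(r"\s+", row.strip()) for row in rows]

print(single[1])
print(runs[1])
```

If columns ever come out shifted or padded with empty values, a pattern such as r"\s+" or an explicit delimiter like ";" is one thing to try.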

Reading Data from an Excel File

Data can also be read from an .xlsx (Excel) file using the ImportXLSX class of explann.dataio. By default the first sheet is read; to read a different one, pass the additional argument sheet_name.

[3]:
from explann.dataio import ImportXLSX

data_reader_xlsx = ImportXLSX(path="../../data/paper_data_24.xlsx")

data_reader_xlsx.data
[3]:
U A P Y F C B
0 -1 -1 -1 -1 39 1.328 170
1 1 -1 -1 -1 87 1.699 122
2 -1 1 -1 -1 48 1.332 473
3 1 1 -1 -1 71 1.979 511
4 -1 -1 1 -1 43 1.458 156
5 1 -1 1 -1 84 2.189 204
6 -1 1 1 -1 45 1.343 385
7 1 1 1 -1 112 1.707 288
8 -1 -1 -1 1 19 1.257 114
9 1 -1 -1 1 146 2.148 116
10 -1 1 -1 1 50 1.592 244
11 1 1 -1 1 92 1.726 126
12 -1 -1 1 1 107 1.203 72
13 1 -1 1 1 172 2.261 210
14 -1 1 1 1 62 1.434 234
15 1 1 1 1 82 1.848 154
16 0 0 0 0 75 1.726 223
17 0 0 0 0 70 1.782 219
18 0 0 0 0 89 1.753 226

Dealing With Leveled Data

Every importer can convert the coded levels of a factorial design to their corresponding values. To do this, pass a table containing the value associated with each level, either as a string or as an .xlsx file, in the same way as the main data table.

[4]:
levels_string = """
Levels;U;A;P;Y
-1;0.15;0.7; 0.40;0.13
0; 0.30;1.4; 0.75;0.26
1; 0.45;2.1; 1.10;0.38
"""

levels_reader = ImportString(
    data = levels_string,
    delimiter = ";",
    index_col = 0  # name or position of the column that holds the levels
)
levels_reader.data
[4]:
U A P Y
Levels
-1 0.15 0.7 0.40 0.13
0 0.30 1.4 0.75 0.26
1 0.45 2.1 1.10 0.38

The same table can also be imported from an .xlsx file.

[5]:
levels_reader_xlsx = ImportXLSX(
    path="../../data/paper_data_24.xlsx",
    sheet_name="Levels",
    index_col=0,
)
levels_reader_xlsx.data
[5]:
U A P Y
Levels
-1 0.15 0.7 0.40 0.13
0 0.30 1.4 0.75 0.26
1 0.45 2.1 1.10 0.38

The data reader's parse_levels method accepts a pd.DataFrame built by either of the methods above. Alternatively, pass a string or an .xlsx file directly to the methods parse_levels_from_string and parse_levels_from_xlsx.

[6]:
# passing a pd.DataFrame
data_reader_xlsx.parse_levels(
    data = levels_reader.data
)
[7]:
data_reader_xlsx.data
[7]:
U A P Y F C B
0 0.15 0.7 0.40 0.13 39 1.328 170
1 0.45 0.7 0.40 0.13 87 1.699 122
2 0.15 2.1 0.40 0.13 48 1.332 473
3 0.45 2.1 0.40 0.13 71 1.979 511
4 0.15 0.7 1.10 0.13 43 1.458 156
5 0.45 0.7 1.10 0.13 84 2.189 204
6 0.15 2.1 1.10 0.13 45 1.343 385
7 0.45 2.1 1.10 0.13 112 1.707 288
8 0.15 0.7 0.40 0.38 19 1.257 114
9 0.45 0.7 0.40 0.38 146 2.148 116
10 0.15 2.1 0.40 0.38 50 1.592 244
11 0.45 2.1 0.40 0.38 92 1.726 126
12 0.15 0.7 1.10 0.38 107 1.203 72
13 0.45 0.7 1.10 0.38 172 2.261 210
14 0.15 2.1 1.10 0.38 62 1.434 234
15 0.45 2.1 1.10 0.38 82 1.848 154
16 0.30 1.4 0.75 0.26 75 1.726 223
17 0.30 1.4 0.75 0.26 70 1.782 219
18 0.30 1.4 0.75 0.26 89 1.753 226
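
The conversion shown above amounts to a per-factor lookup from coded level to real value. The sketch below reproduces that idea with the standard library only; it is an illustration under that assumption, not explann's implementation, and the names level_map, coded_rows and decoded_rows are hypothetical.

```python
import csv
import io

levels_text = """Levels;U;A;P;Y
-1;0.15;0.7; 0.40;0.13
0; 0.30;1.4; 0.75;0.26
1; 0.45;2.1; 1.10;0.38
"""

# Parse the semicolon-delimited table into {factor: {coded level: real value}}.
reader = csv.reader(io.StringIO(levels_text), delimiter=";")
header = [name.strip() for name in next(reader)]
factors = header[1:]
level_map = {factor: {} for factor in factors}
for row in reader:
    if not row:  # skip blank lines, if any
        continue
    level = int(row[0])
    for factor, cell in zip(factors, row[1:]):
        level_map[factor][level] = float(cell)  # float() ignores surrounding spaces

# Coded design rows (U, A, P, Y): the first two runs and one centre point.
coded_rows = [(-1, -1, -1, -1), (1, -1, -1, -1), (0, 0, 0, 0)]

# Replace each coded level by the value registered for its factor.
decoded_rows = [
    tuple(level_map[factor][code] for factor, code in zip(factors, row))
    for row in coded_rows
]
print(decoded_rows)
```

The three decoded rows correspond to rows 0, 1 and 16 of the parsed table.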
[8]:
# passing a string
data_reader_xlsx.parse_levels_from_string(
    data = levels_string,
    delimiter=";"
)

# passing a path
data_reader_xlsx.parse_levels_from_xlsx(
    data = "../../data/paper_data_24.xlsx",
    sheet_name = "Levels",
    index_col=0,
)

[9]:
data_reader_xlsx.data
[9]:
U A P Y F C B
0 0.15 0.7 0.40 0.13 39 1.328 170
1 0.45 0.7 0.40 0.13 87 1.699 122
2 0.15 2.1 0.40 0.13 48 1.332 473
3 0.45 2.1 0.40 0.13 71 1.979 511
4 0.15 0.7 1.10 0.13 43 1.458 156
5 0.45 0.7 1.10 0.13 84 2.189 204
6 0.15 2.1 1.10 0.13 45 1.343 385
7 0.45 2.1 1.10 0.13 112 1.707 288
8 0.15 0.7 0.40 0.38 19 1.257 114
9 0.45 0.7 0.40 0.38 146 2.148 116
10 0.15 2.1 0.40 0.38 50 1.592 244
11 0.45 2.1 0.40 0.38 92 1.726 126
12 0.15 0.7 1.10 0.38 107 1.203 72
13 0.45 0.7 1.10 0.38 172 2.261 210
14 0.15 2.1 1.10 0.38 62 1.434 234
15 0.45 2.1 1.10 0.38 82 1.848 154
16 0.30 1.4 0.75 0.26 75 1.726 223
17 0.30 1.4 0.75 0.26 70 1.782 219
18 0.30 1.4 0.75 0.26 89 1.753 226

The results are the same: the data attribute has its coded levels replaced by the corresponding values for each variable, as described in the levels_reader.data (or levels_reader_xlsx.data) table.

[10]:
data_reader_xlsx.data
[10]:
U A P Y F C B
0 0.15 0.7 0.40 0.13 39 1.328 170
1 0.45 0.7 0.40 0.13 87 1.699 122
2 0.15 2.1 0.40 0.13 48 1.332 473
3 0.45 2.1 0.40 0.13 71 1.979 511
4 0.15 0.7 1.10 0.13 43 1.458 156
5 0.45 0.7 1.10 0.13 84 2.189 204
6 0.15 2.1 1.10 0.13 45 1.343 385
7 0.45 2.1 1.10 0.13 112 1.707 288
8 0.15 0.7 0.40 0.38 19 1.257 114
9 0.45 0.7 0.40 0.38 146 2.148 116
10 0.15 2.1 0.40 0.38 50 1.592 244
11 0.45 2.1 0.40 0.38 92 1.726 126
12 0.15 0.7 1.10 0.38 107 1.203 72
13 0.45 0.7 1.10 0.38 172 2.261 210
14 0.15 2.1 1.10 0.38 62 1.434 234
15 0.45 2.1 1.10 0.38 82 1.848 154
16 0.30 1.4 0.75 0.26 75 1.726 223
17 0.30 1.4 0.75 0.26 70 1.782 219
18 0.30 1.4 0.75 0.26 89 1.753 226

The original data is retained in the raw_data attribute.

[11]:
data_reader_xlsx.raw_data
[11]:
U A P Y F C B
0 -1 -1 -1 -1 39 1.328 170
1 1 -1 -1 -1 87 1.699 122
2 -1 1 -1 -1 48 1.332 473
3 1 1 -1 -1 71 1.979 511
4 -1 -1 1 -1 43 1.458 156
5 1 -1 1 -1 84 2.189 204
6 -1 1 1 -1 45 1.343 385
7 1 1 1 -1 112 1.707 288
8 -1 -1 -1 1 19 1.257 114
9 1 -1 -1 1 146 2.148 116
10 -1 1 -1 1 50 1.592 244
11 1 1 -1 1 92 1.726 126
12 -1 -1 1 1 107 1.203 72
13 1 -1 1 1 172 2.261 210
14 -1 1 1 1 62 1.434 234
15 1 1 1 1 82 1.848 154
16 0 0 0 0 75 1.726 223
17 0 0 0 0 70 1.782 219
18 0 0 0 0 89 1.753 226

Build a Factorial Model

To build a factorial model for the data, explann implements the class FactorialModel. Its arguments are the data and the functions. Functions are passed as a Python dictionary whose keys are the function names, in this example "F", "CM" and "B". These names can be any string, which makes it possible to create any number of models under different names.

The dictionary values are strings containing the model equations, whose syntax follows patsy conventions. The left-hand side (lhs) contains the dependent variable and the right-hand side (rhs) the independent terms. The lhs and rhs are separated by the “~” character. Variable names used in a formula must match column names in the data.
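
To make the formula syntax concrete: in patsy, a * b is shorthand for a + b + a:b, so a product of four factors expands to every main effect and every interaction among them, plus the intercept that patsy adds by default. The sketch below enumerates those terms with the standard library. It only handles a plain product of bare names, which is an assumption made for illustration; it does not call patsy.

```python
from itertools import combinations

formula = "F ~ U * A * P * Y"

# Split the formula into its left- and right-hand sides at the "~" character.
lhs, rhs = (side.strip() for side in formula.split("~"))

# Factors joined by "*" on the right-hand side.
factors = [name.strip() for name in rhs.split("*")]

# "*" expands to every main effect plus every interaction among the factors;
# interactions are written with ":" in patsy notation.
terms = [
    ":".join(combo)
    for order in range(1, len(factors) + 1)
    for combo in combinations(factors, order)
]

print(lhs)
print(len(terms))
print(terms)
```

For four factors this gives 4 main effects, 6 two-way, 4 three-way and 1 four-way interaction, 15 terms in total.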

[13]:
from explann.models import FactorialModel

fm = FactorialModel(
    data=data_reader_xlsx.data,
    functions={
        # Keys are free-form model names; the left-hand side of each formula
        # must match a column of the data (F, C and B here). Writing "CM ~ ..."
        # fails with "NameError: name 'CM' is not defined".
        "F"  : "F ~ U * A * P * Y",
        "CM" : "C ~ U * A * P * Y",
        "B"  : "B ~ U * A * P * Y",
    },
)
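
Because each formula is evaluated against the column names of the data, a name that is not a column (for example CM when the column is called C) makes patsy raise a NameError wrapped in a PatsyError. A small pre-check can catch this before fitting. The helper missing_names below is hypothetical and not part of explann; it uses a simple regular expression, so it would also flag function names in formulas such as np.log(F).

```python
import re


def missing_names(formula, columns):
    """Return the names used in a formula that are not columns of the data."""
    names = re.findall(r"[A-Za-z_]\w*", formula)
    missing = []
    for name in names:
        if name not in columns and name not in missing:
            missing.append(name)
    return missing


# Column names of the example data set.
columns = ["U", "A", "P", "Y", "F", "C", "B"]

functions = {
    "F": "F ~ U * A * P * Y",
    "CM": "CM ~ U * A * P * Y",  # response misspelled on purpose: the column is "C"
    "B": "B ~ U * A * P * Y",
}

# Map each model name to the formula names that have no matching column.
problems = {key: missing_names(f, columns) for key, f in functions.items()}
print(problems)
```

An empty list means every name in the formula was found among the columns.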