Beginner’s Tutorial

Reading Data from a String

Data can be read from a string using the ImportString class of explann.dataio. The default delimiter is \s, the special character for whitespace.

[1]:

from explann.dataio import ImportString

data_string = """
Observação  Dureza  Temperatura
1   137     220
2   137     220
3   137     220
4   136     220
5   135     220
6   135     225
7   133     225
8   132     225
9   133     225
10  133     225
11  128     230
12  124     230
13  126     230
14  129     230
15  126     230
16  122     235
17  122     235
18  122     235
19  119     235
20  122     235
"""

data_reader_string = ImportString(data=data_string, delimiter=r"\s")

The data_reader_string object stores the provided data in its .data attribute.

[2]:

data_reader_string.data

[2]:

	Observação	Dureza	Temperatura
0	1	137	220
1	2	137	220
2	3	137	220
3	4	136	220
4	5	135	220
5	6	135	225
6	7	133	225
7	8	132	225
8	9	133	225
9	10	133	225
10	11	128	230
11	12	124	230
12	13	126	230
13	14	129	230
14	15	126	230
15	16	122	235
16	17	122	235
17	18	122	235
18	19	119	235
19	20	122	235
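How ImportString parses the text internally is not shown in this tutorial. As a rough point of comparison only, the sketch below reads the first rows of the same table with plain pandas, assuming that any run of whitespace separates two columns. It is not explann's implementation.

```python
import io

import pandas as pd

# First three rows of the hardness/temperature table shown above.
text = """
Observação  Dureza  Temperatura
1   137     220
2   137     220
3   137     220
"""

# sep=r"\s+" treats any run of spaces or tabs as one delimiter.
df = pd.read_csv(io.StringIO(text.strip()), sep=r"\s+")
print(df)
```

The raw string r"\s+" avoids Python's invalid-escape warning, and the trailing + makes the column alignment in the text irrelevant.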

Reading Data from an Excel File

Data can also be read from a .xlsx file (Excel extension). To do so, use the ImportXLSX class of explann.dataio. By default the first sheet is read; to read a different one, pass the additional argument sheet_name.

[3]:

from explann.dataio import ImportXLSX

data_reader_xlsx = ImportXLSX(path="../../data/paper_data_24.xlsx")

data_reader_xlsx.data

[3]:

	U	A	P	Y	F	C	B
0	-1	-1	-1	-1	39	1.328	170
1	1	-1	-1	-1	87	1.699	122
2	-1	1	-1	-1	48	1.332	473
3	1	1	-1	-1	71	1.979	511
4	-1	-1	1	-1	43	1.458	156
5	1	-1	1	-1	84	2.189	204
6	-1	1	1	-1	45	1.343	385
7	1	1	1	-1	112	1.707	288
8	-1	-1	-1	1	19	1.257	114
9	1	-1	-1	1	146	2.148	116
10	-1	1	-1	1	50	1.592	244
11	1	1	-1	1	92	1.726	126
12	-1	-1	1	1	107	1.203	72
13	1	-1	1	1	172	2.261	210
14	-1	1	1	1	62	1.434	234
15	1	1	1	1	82	1.848	154
16	0	0	0	0	75	1.726	223
17	0	0	0	0	70	1.782	219
18	0	0	0	0	89	1.753	226

Dealing With Leveled Data

Every importer can convert the coded levels of a factorial design to their corresponding values. For this task, a table containing the values associated with each level must be passed, either as a string or as a .xlsx file, in the same way as the main data table.

[4]:

levels_string = """
Levels;U;A;P;Y
-1;0.15;0.7; 0.40;0.13
0; 0.30;1.4; 0.75;0.26
1; 0.45;2.1; 1.10;0.38
"""

levels_reader = ImportString(
    data = levels_string,
    delimiter = ";",
    index_col = 0  # name or position of the column that holds the levels
)
levels_reader.data

[4]:

	U	A	P	Y
Levels
-1	0.15	0.7	0.40	0.13
0	0.30	1.4	0.75	0.26
1	0.45	2.1	1.10	0.38
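For comparison, the same semicolon-delimited table can be read with plain pandas. This is only a sketch of the idea and assumes nothing about how ImportString is implemented. The skipinitialspace option is an addition here, used to discard the blanks that follow some of the semicolons.

```python
import io

import pandas as pd

levels_text = """
Levels;U;A;P;Y
-1;0.15;0.7; 0.40;0.13
0; 0.30;1.4; 0.75;0.26
1; 0.45;2.1; 1.10;0.38
"""

levels = pd.read_csv(
    io.StringIO(levels_text.strip()),
    sep=";",
    index_col=0,            # the "Levels" column becomes the index
    skipinitialspace=True,  # ignore blanks right after a delimiter
)
print(levels)
```

With the levels as the index, a lookup such as levels.loc[-1, "U"] returns the real value of factor U at its low level.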

The same table can be imported from a .xlsx file.

[5]:

levels_reader_xlsx = ImportXLSX(
    path="../../data/paper_data_24.xlsx",
    sheet_name="Levels",
    index_col=0,
)
levels_reader_xlsx.data

[5]:

	U	A	P	Y
Levels
-1	0.15	0.7	0.40	0.13
0	0.30	1.4	0.75	0.26
1	0.45	2.1	1.10	0.38

The data reader's parse_levels method accepts a pd.DataFrame built by either of the methods above. Strings and .xlsx files can be passed directly to the methods parse_levels_from_string and parse_levels_from_xlsx.

[6]:

# passing a pd.DataFrame
data_reader_xlsx.parse_levels(
    data = levels_reader.data
)

[7]:

data_reader_xlsx.data

[7]:

	U	A	P	Y	F	C	B
0	0.15	0.7	0.40	0.13	39	1.328	170
1	0.45	0.7	0.40	0.13	87	1.699	122
2	0.15	2.1	0.40	0.13	48	1.332	473
3	0.45	2.1	0.40	0.13	71	1.979	511
4	0.15	0.7	1.10	0.13	43	1.458	156
5	0.45	0.7	1.10	0.13	84	2.189	204
6	0.15	2.1	1.10	0.13	45	1.343	385
7	0.45	2.1	1.10	0.13	112	1.707	288
8	0.15	0.7	0.40	0.38	19	1.257	114
9	0.45	0.7	0.40	0.38	146	2.148	116
10	0.15	2.1	0.40	0.38	50	1.592	244
11	0.45	2.1	0.40	0.38	92	1.726	126
12	0.15	0.7	1.10	0.38	107	1.203	72
13	0.45	0.7	1.10	0.38	172	2.261	210
14	0.15	2.1	1.10	0.38	62	1.434	234
15	0.45	2.1	1.10	0.38	82	1.848	154
16	0.30	1.4	0.75	0.26	75	1.726	223
17	0.30	1.4	0.75	0.26	70	1.782	219
18	0.30	1.4	0.75	0.26	89	1.753	226
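The replacement seen in this output can be pictured as a per-column lookup: each coded value in a factor column is used as a key into the matching column of the levels table. The sketch below does this with pandas on three runs of the design. decode_levels is a hypothetical helper written for illustration, not a function of explann.

```python
import pandas as pd

# Three runs of the coded design above (rows 0, 3 and 16) with response F.
coded = pd.DataFrame(
    {
        "U": [-1, 1, 0],
        "A": [-1, 1, 0],
        "P": [-1, -1, 0],
        "Y": [-1, -1, 0],
        "F": [39, 71, 75],
    }
)

# Real value of every factor at each level.
levels = pd.DataFrame(
    {
        "U": [0.15, 0.30, 0.45],
        "A": [0.7, 1.4, 2.1],
        "P": [0.40, 0.75, 1.10],
        "Y": [0.13, 0.26, 0.38],
    },
    index=pd.Index([-1, 0, 1], name="Levels"),
)


def decode_levels(coded, levels):
    """Return a copy of `coded` with every factor column mapped to real values."""
    decoded = coded.copy()
    for factor in levels.columns.intersection(coded.columns):
        # Series.map looks each coded value up in the index of levels[factor].
        decoded[factor] = coded[factor].map(levels[factor])
    return decoded


decoded = decode_levels(coded, levels)
print(decoded)
```

Working on a copy leaves the coded table untouched, and columns that do not appear in the levels table, such as the response F, pass through unchanged.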

[8]:

# passing a string
data_reader_xlsx.parse_levels_from_string(
    data = levels_string,
    delimiter=";"
)

# passing a path
data_reader_xlsx.parse_levels_from_xlsx(
    data = "../../data/paper_data_24.xlsx",
    sheet_name = "Levels",
    index_col=0,
)

[9]:

data_reader_xlsx.data

[9]:

	U	A	P	Y	F	C	B
0	0.15	0.7	0.40	0.13	39	1.328	170
1	0.45	0.7	0.40	0.13	87	1.699	122
2	0.15	2.1	0.40	0.13	48	1.332	473
3	0.45	2.1	0.40	0.13	71	1.979	511
4	0.15	0.7	1.10	0.13	43	1.458	156
5	0.45	0.7	1.10	0.13	84	2.189	204
6	0.15	2.1	1.10	0.13	45	1.343	385
7	0.45	2.1	1.10	0.13	112	1.707	288
8	0.15	0.7	0.40	0.38	19	1.257	114
9	0.45	0.7	0.40	0.38	146	2.148	116
10	0.15	2.1	0.40	0.38	50	1.592	244
11	0.45	2.1	0.40	0.38	92	1.726	126
12	0.15	0.7	1.10	0.38	107	1.203	72
13	0.45	0.7	1.10	0.38	172	2.261	210
14	0.15	2.1	1.10	0.38	62	1.434	234
15	0.45	2.1	1.10	0.38	82	1.848	154
16	0.30	1.4	0.75	0.26	75	1.726	223
17	0.30	1.4	0.75	0.26	70	1.782	219
18	0.30	1.4	0.75	0.26	89	1.753	226

The results are the same: the values in the data attribute are replaced by the value listed for each variable and level in the levels_reader.data (or levels_reader_xlsx.data) table.

[10]:

data_reader_xlsx.data

[10]:

	U	A	P	Y	F	C	B
0	0.15	0.7	0.40	0.13	39	1.328	170
1	0.45	0.7	0.40	0.13	87	1.699	122
2	0.15	2.1	0.40	0.13	48	1.332	473
3	0.45	2.1	0.40	0.13	71	1.979	511
4	0.15	0.7	1.10	0.13	43	1.458	156
5	0.45	0.7	1.10	0.13	84	2.189	204
6	0.15	2.1	1.10	0.13	45	1.343	385
7	0.45	2.1	1.10	0.13	112	1.707	288
8	0.15	0.7	0.40	0.38	19	1.257	114
9	0.45	0.7	0.40	0.38	146	2.148	116
10	0.15	2.1	0.40	0.38	50	1.592	244
11	0.45	2.1	0.40	0.38	92	1.726	126
12	0.15	0.7	1.10	0.38	107	1.203	72
13	0.45	0.7	1.10	0.38	172	2.261	210
14	0.15	2.1	1.10	0.38	62	1.434	234
15	0.45	2.1	1.10	0.38	82	1.848	154
16	0.30	1.4	0.75	0.26	75	1.726	223
17	0.30	1.4	0.75	0.26	70	1.782	219
18	0.30	1.4	0.75	0.26	89	1.753	226

The original data is retained in the raw_data attribute.

[11]:

data_reader_xlsx.raw_data

[11]:

	U	A	P	Y	F	C	B
0	-1	-1	-1	-1	39	1.328	170
1	1	-1	-1	-1	87	1.699	122
2	-1	1	-1	-1	48	1.332	473
3	1	1	-1	-1	71	1.979	511
4	-1	-1	1	-1	43	1.458	156
5	1	-1	1	-1	84	2.189	204
6	-1	1	1	-1	45	1.343	385
7	1	1	1	-1	112	1.707	288
8	-1	-1	-1	1	19	1.257	114
9	1	-1	-1	1	146	2.148	116
10	-1	1	-1	1	50	1.592	244
11	1	1	-1	1	92	1.726	126
12	-1	-1	1	1	107	1.203	72
13	1	-1	1	1	172	2.261	210
14	-1	1	1	1	62	1.434	234
15	1	1	1	1	82	1.848	154
16	0	0	0	0	75	1.726	223
17	0	0	0	0	70	1.782	219
18	0	0	0	0	89	1.753	226

Building a Factorial Model

To build a factorial model for the data, explann provides the class FactorialModel. Its arguments are the data and the functions. Functions are passed as a Python dictionary whose keys are the function names, in this example "F", "CM" and "B". These names can be any string, which allows any number of models to be created under different names.

The dictionary values are strings containing the model equations, whose syntax follows patsy conventions. The left-hand side (lhs) contains the dependent variable and the right-hand side (rhs) the independent terms. The lhs and rhs are separated by the “~” character.
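In patsy syntax, a * b is shorthand for a + b + a:b, so a right-hand side such as U * A * P * Y stands for every main effect and every interaction of the four factors. The pure-Python sketch below lists those terms. expand_star is a hypothetical helper written for illustration, and the order in which patsy itself reports the terms may differ.

```python
from itertools import combinations


def expand_star(factors):
    """List the main effects and interactions implied by 'a * b * ...'."""
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            # patsy writes an interaction as its factors joined by ":".
            terms.append(":".join(combo))
    return terms


terms = expand_star(["U", "A", "P", "Y"])
print(len(terms))
print(terms)
```

Four factors give 15 terms (4 main effects, 6 two-way, 4 three-way and 1 four-way interaction), so each model has 16 coefficients once the intercept is included.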

[13]:

from explann.models import FactorialModel

fm = FactorialModel(
    data=data_reader_xlsx.data,
    functions={
        "F"  : "F ~ U * A * P * Y",
        "CM" : "C ~ U * A * P * Y",  # the lhs must name a column of the data: "C", not "CM"
        "B"  : "B ~ U * A * P * Y"
    }
)

The left-hand side of each equation must be a column of the data (here F, C or B). Writing "CM ~ U * A * P * Y" fails with PatsyError: Error evaluating factor: NameError: name 'CM' is not defined.
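Because the left-hand side of each equation is looked up in the data by name, a response such as CM that is not among the columns U, A, P, Y, F, C and B only fails deep inside patsy. A small check run before fitting can report the problem more directly. The sketch below is one way to do it. missing_responses is a hypothetical helper, not part of explann, and it assumes each left-hand side is a plain column name rather than an expression.

```python
import pandas as pd


def missing_responses(data, functions):
    """Return {name: lhs} for equations whose lhs is not a column of `data`."""
    missing = {}
    for name, formula in functions.items():
        # Everything before the first "~" is the left-hand side.
        lhs = formula.split("~", 1)[0].strip()
        if lhs not in data.columns:
            missing[name] = lhs
    return missing


# Only the column names matter for this check, so an empty frame is enough.
data = pd.DataFrame(columns=["U", "A", "P", "Y", "F", "C", "B"])

functions = {
    "F": "F ~ U * A * P * Y",
    "CM": "CM ~ U * A * P * Y",  # deliberately wrong: the column is "C"
    "B": "B ~ U * A * P * Y",
}

print(missing_responses(data, functions))  # {'CM': 'CM'}
```

An empty result means every response was found, and the dictionary keys can still be chosen freely because only the equation text is checked.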
