Beginner’s Tutorial
Reading Data from a String
Data can be read from a string by using the ImportString class of explann.dataio. The default delimiter is \s, the special character for whitespace.
[1]:
from explann.dataio import ImportString
data_string = """
Observação Dureza Temperatura
1 137 220
2 137 220
3 137 220
4 136 220
5 135 220
6 135 225
7 133 225
8 132 225
9 133 225
10 133 225
11 128 230
12 124 230
13 126 230
14 129 230
15 126 230
16 122 235
17 122 235
18 122 235
19 119 235
20 122 235
"""
# a raw string keeps the backslash in "\s" from being read as a string escape
data_reader_string = ImportString(data=data_string, delimiter=r"\s")
The data_reader_string object stores the provided data in its .data attribute.
[2]:
data_reader_string.data
[2]:
 | Observação | Dureza | Temperatura |
---|---|---|---|
0 | 1 | 137 | 220 |
1 | 2 | 137 | 220 |
2 | 3 | 137 | 220 |
3 | 4 | 136 | 220 |
4 | 5 | 135 | 220 |
5 | 6 | 135 | 225 |
6 | 7 | 133 | 225 |
7 | 8 | 132 | 225 |
8 | 9 | 133 | 225 |
9 | 10 | 133 | 225 |
10 | 11 | 128 | 230 |
11 | 12 | 124 | 230 |
12 | 13 | 126 | 230 |
13 | 14 | 129 | 230 |
14 | 15 | 126 | 230 |
15 | 16 | 122 | 235 |
16 | 17 | 122 | 235 |
17 | 18 | 122 | 235 |
18 | 19 | 119 | 235 |
19 | 20 | 122 | 235 |
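To make the role of the whitespace delimiter concrete, here is a minimal plain-Python sketch that splits a short excerpt of the same table on runs of whitespace. It does not use explann and is not how ImportString is implemented; the helper name parse_whitespace_table is hypothetical, and the pattern \s+ (one or more whitespace characters) is an assumption chosen so that repeated spaces do not produce empty fields.

```python
import re

# A short excerpt in the same layout as data_string above.
sample = """
Observação Dureza Temperatura
1 137 220
2 137 220
6 135 225
"""

def parse_whitespace_table(text):
    """Split each non-empty line on runs of whitespace; the first line is the header."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s+", lines[0].strip())
    rows = []
    for line in lines[1:]:
        values = re.split(r"\s+", line.strip())
        # every column in this excerpt holds integers
        rows.append({name: int(value) for name, value in zip(header, values)})
    return header, rows

header, rows = parse_whitespace_table(sample)
print(header)
print(rows[0])
```

The pattern is written as a raw string (r"\s+") so that Python passes the backslash through to the regular expression engine unchanged.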
Reading Data from an Excel File
Data can also be read from a .xlsx file (Excel extension). To do so, use the ImportXLSX class of explann.dataio. By default the first sheet is read; to read a different one, pass the additional argument sheet_name.
[3]:
from explann.dataio import ImportXLSX
data_reader_xlsx = ImportXLSX(path="../../data/paper_data_24.xlsx")
data_reader_xlsx.data
[3]:
 | U | A | P | Y | F | C | B |
---|---|---|---|---|---|---|---|
0 | -1 | -1 | -1 | -1 | 39 | 1.328 | 170 |
1 | 1 | -1 | -1 | -1 | 87 | 1.699 | 122 |
2 | -1 | 1 | -1 | -1 | 48 | 1.332 | 473 |
3 | 1 | 1 | -1 | -1 | 71 | 1.979 | 511 |
4 | -1 | -1 | 1 | -1 | 43 | 1.458 | 156 |
5 | 1 | -1 | 1 | -1 | 84 | 2.189 | 204 |
6 | -1 | 1 | 1 | -1 | 45 | 1.343 | 385 |
7 | 1 | 1 | 1 | -1 | 112 | 1.707 | 288 |
8 | -1 | -1 | -1 | 1 | 19 | 1.257 | 114 |
9 | 1 | -1 | -1 | 1 | 146 | 2.148 | 116 |
10 | -1 | 1 | -1 | 1 | 50 | 1.592 | 244 |
11 | 1 | 1 | -1 | 1 | 92 | 1.726 | 126 |
12 | -1 | -1 | 1 | 1 | 107 | 1.203 | 72 |
13 | 1 | -1 | 1 | 1 | 172 | 2.261 | 210 |
14 | -1 | 1 | 1 | 1 | 62 | 1.434 | 234 |
15 | 1 | 1 | 1 | 1 | 82 | 1.848 | 154 |
16 | 0 | 0 | 0 | 0 | 75 | 1.726 | 223 |
17 | 0 | 0 | 0 | 0 | 70 | 1.782 | 219 |
18 | 0 | 0 | 0 | 0 | 89 | 1.753 | 226 |
Dealing With Leveled Data
Any importer can convert the coded levels of a factorial design to their respective values. For this task, a table containing the values associated with each level should be passed, either as a string or as a .xlsx file, in the same way as the main data table.
[4]:
levels_string = """
Levels;U;A;P;Y
-1;0.15;0.7; 0.40;0.13
0; 0.30;1.4; 0.75;0.26
1; 0.45;2.1; 1.10;0.38
"""
levels_reader = ImportString(
    data=levels_string,
    delimiter=";",
    index_col=0,  # name or position of the column that contains the levels
)
levels_reader.data
[4]:
Levels | U | A | P | Y |
---|---|---|---|---|
-1 | 0.15 | 0.7 | 0.40 | 0.13 |
0 | 0.30 | 1.4 | 0.75 | 0.26 |
1 | 0.45 | 2.1 | 1.10 | 0.38 |
The same data can also be imported from a .xlsx file.
[5]:
levels_reader_xlsx = ImportXLSX(
    path="../../data/paper_data_24.xlsx",
    sheet_name="Levels",
    index_col=0,
)
levels_reader_xlsx.data
[5]:
Levels | U | A | P | Y |
---|---|---|---|---|
-1 | 0.15 | 0.7 | 0.40 | 0.13 |
0 | 0.30 | 1.4 | 0.75 | 0.26 |
1 | 0.45 | 2.1 | 1.10 | 0.38 |
The data reader's parse_levels method accepts a pd.DataFrame constructed by either of the methods above. Alternatively, you can pass a string or a .xlsx file to the methods parse_levels_from_string and parse_levels_from_xlsx.
[6]:
# passing a pd.DataFrame
data_reader_xlsx.parse_levels(
    data=levels_reader.data
)
[7]:
data_reader_xlsx.data
[7]:
 | U | A | P | Y | F | C | B |
---|---|---|---|---|---|---|---|
0 | 0.15 | 0.7 | 0.40 | 0.13 | 39 | 1.328 | 170 |
1 | 0.45 | 0.7 | 0.40 | 0.13 | 87 | 1.699 | 122 |
2 | 0.15 | 2.1 | 0.40 | 0.13 | 48 | 1.332 | 473 |
3 | 0.45 | 2.1 | 0.40 | 0.13 | 71 | 1.979 | 511 |
4 | 0.15 | 0.7 | 1.10 | 0.13 | 43 | 1.458 | 156 |
5 | 0.45 | 0.7 | 1.10 | 0.13 | 84 | 2.189 | 204 |
6 | 0.15 | 2.1 | 1.10 | 0.13 | 45 | 1.343 | 385 |
7 | 0.45 | 2.1 | 1.10 | 0.13 | 112 | 1.707 | 288 |
8 | 0.15 | 0.7 | 0.40 | 0.38 | 19 | 1.257 | 114 |
9 | 0.45 | 0.7 | 0.40 | 0.38 | 146 | 2.148 | 116 |
10 | 0.15 | 2.1 | 0.40 | 0.38 | 50 | 1.592 | 244 |
11 | 0.45 | 2.1 | 0.40 | 0.38 | 92 | 1.726 | 126 |
12 | 0.15 | 0.7 | 1.10 | 0.38 | 107 | 1.203 | 72 |
13 | 0.45 | 0.7 | 1.10 | 0.38 | 172 | 2.261 | 210 |
14 | 0.15 | 2.1 | 1.10 | 0.38 | 62 | 1.434 | 234 |
15 | 0.45 | 2.1 | 1.10 | 0.38 | 82 | 1.848 | 154 |
16 | 0.30 | 1.4 | 0.75 | 0.26 | 75 | 1.726 | 223 |
17 | 0.30 | 1.4 | 0.75 | 0.26 | 70 | 1.782 | 219 |
18 | 0.30 | 1.4 | 0.75 | 0.26 | 89 | 1.753 | 226 |
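The conversion from coded levels to real values can be illustrated without explann. The sketch below is a plain-Python approximation under stated assumptions: the helper names read_levels and decode_row are hypothetical, the levels table is the same semicolon-delimited text as levels_string, and the row being decoded is row 1 of the coded table above. It is not the library's implementation of parse_levels.

```python
import csv
import io

levels_text = """Levels;U;A;P;Y
-1;0.15;0.7; 0.40;0.13
0; 0.30;1.4; 0.75;0.26
1; 0.45;2.1; 1.10;0.38
"""

def read_levels(text, delimiter=";"):
    """Return {factor: {coded_level: real_value}}; the first column holds the coded level."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = [cell.strip() for cell in next(reader)]
    factors = header[1:]
    table = {factor: {} for factor in factors}
    for row in reader:
        if not row:
            continue  # skip blank lines
        level = int(row[0])
        for factor, cell in zip(factors, row[1:]):
            table[factor][level] = float(cell)  # float() tolerates the padding spaces
    return table

def decode_row(row, levels):
    """Replace coded levels by real values; columns without a levels entry are kept as they are."""
    return {
        name: (levels[name][value] if name in levels else value)
        for name, value in row.items()
    }

levels = read_levels(levels_text)
coded = {"U": 1, "A": -1, "P": -1, "Y": -1, "F": 87, "C": 1.699, "B": 122}
decoded = decode_row(coded, levels)
print(decoded)
```

Only the factor columns listed in the levels table (U, A, P, Y) are translated; the response columns F, C and B pass through unchanged, which matches the parsed table shown above.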
[8]:
# passing a string
data_reader_xlsx.parse_levels_from_string(
    data=levels_string,
    delimiter=";"
)
# passing a path
data_reader_xlsx.parse_levels_from_xlsx(
    data="../../data/paper_data_24.xlsx",
    sheet_name="Levels",
    index_col=0,
)
[9]:
data_reader_xlsx.data
[9]:
 | U | A | P | Y | F | C | B |
---|---|---|---|---|---|---|---|
0 | 0.15 | 0.7 | 0.40 | 0.13 | 39 | 1.328 | 170 |
1 | 0.45 | 0.7 | 0.40 | 0.13 | 87 | 1.699 | 122 |
2 | 0.15 | 2.1 | 0.40 | 0.13 | 48 | 1.332 | 473 |
3 | 0.45 | 2.1 | 0.40 | 0.13 | 71 | 1.979 | 511 |
4 | 0.15 | 0.7 | 1.10 | 0.13 | 43 | 1.458 | 156 |
5 | 0.45 | 0.7 | 1.10 | 0.13 | 84 | 2.189 | 204 |
6 | 0.15 | 2.1 | 1.10 | 0.13 | 45 | 1.343 | 385 |
7 | 0.45 | 2.1 | 1.10 | 0.13 | 112 | 1.707 | 288 |
8 | 0.15 | 0.7 | 0.40 | 0.38 | 19 | 1.257 | 114 |
9 | 0.45 | 0.7 | 0.40 | 0.38 | 146 | 2.148 | 116 |
10 | 0.15 | 2.1 | 0.40 | 0.38 | 50 | 1.592 | 244 |
11 | 0.45 | 2.1 | 0.40 | 0.38 | 92 | 1.726 | 126 |
12 | 0.15 | 0.7 | 1.10 | 0.38 | 107 | 1.203 | 72 |
13 | 0.45 | 0.7 | 1.10 | 0.38 | 172 | 2.261 | 210 |
14 | 0.15 | 2.1 | 1.10 | 0.38 | 62 | 1.434 | 234 |
15 | 0.45 | 2.1 | 1.10 | 0.38 | 82 | 1.848 | 154 |
16 | 0.30 | 1.4 | 0.75 | 0.26 | 75 | 1.726 | 223 |
17 | 0.30 | 1.4 | 0.75 | 0.26 | 70 | 1.782 | 219 |
18 | 0.30 | 1.4 | 0.75 | 0.26 | 89 | 1.753 | 226 |
The results are the same: the data attribute has its coded values replaced by the corresponding value of each variable, as described in the levels table (levels_reader.data or levels_reader_xlsx.data).
[10]:
data_reader_xlsx.data
[10]:
 | U | A | P | Y | F | C | B |
---|---|---|---|---|---|---|---|
0 | 0.15 | 0.7 | 0.40 | 0.13 | 39 | 1.328 | 170 |
1 | 0.45 | 0.7 | 0.40 | 0.13 | 87 | 1.699 | 122 |
2 | 0.15 | 2.1 | 0.40 | 0.13 | 48 | 1.332 | 473 |
3 | 0.45 | 2.1 | 0.40 | 0.13 | 71 | 1.979 | 511 |
4 | 0.15 | 0.7 | 1.10 | 0.13 | 43 | 1.458 | 156 |
5 | 0.45 | 0.7 | 1.10 | 0.13 | 84 | 2.189 | 204 |
6 | 0.15 | 2.1 | 1.10 | 0.13 | 45 | 1.343 | 385 |
7 | 0.45 | 2.1 | 1.10 | 0.13 | 112 | 1.707 | 288 |
8 | 0.15 | 0.7 | 0.40 | 0.38 | 19 | 1.257 | 114 |
9 | 0.45 | 0.7 | 0.40 | 0.38 | 146 | 2.148 | 116 |
10 | 0.15 | 2.1 | 0.40 | 0.38 | 50 | 1.592 | 244 |
11 | 0.45 | 2.1 | 0.40 | 0.38 | 92 | 1.726 | 126 |
12 | 0.15 | 0.7 | 1.10 | 0.38 | 107 | 1.203 | 72 |
13 | 0.45 | 0.7 | 1.10 | 0.38 | 172 | 2.261 | 210 |
14 | 0.15 | 2.1 | 1.10 | 0.38 | 62 | 1.434 | 234 |
15 | 0.45 | 2.1 | 1.10 | 0.38 | 82 | 1.848 | 154 |
16 | 0.30 | 1.4 | 0.75 | 0.26 | 75 | 1.726 | 223 |
17 | 0.30 | 1.4 | 0.75 | 0.26 | 70 | 1.782 | 219 |
18 | 0.30 | 1.4 | 0.75 | 0.26 | 89 | 1.753 | 226 |
The original data is retained in the raw_data attribute.
[11]:
data_reader_xlsx.raw_data
[11]:
 | U | A | P | Y | F | C | B |
---|---|---|---|---|---|---|---|
0 | -1 | -1 | -1 | -1 | 39 | 1.328 | 170 |
1 | 1 | -1 | -1 | -1 | 87 | 1.699 | 122 |
2 | -1 | 1 | -1 | -1 | 48 | 1.332 | 473 |
3 | 1 | 1 | -1 | -1 | 71 | 1.979 | 511 |
4 | -1 | -1 | 1 | -1 | 43 | 1.458 | 156 |
5 | 1 | -1 | 1 | -1 | 84 | 2.189 | 204 |
6 | -1 | 1 | 1 | -1 | 45 | 1.343 | 385 |
7 | 1 | 1 | 1 | -1 | 112 | 1.707 | 288 |
8 | -1 | -1 | -1 | 1 | 19 | 1.257 | 114 |
9 | 1 | -1 | -1 | 1 | 146 | 2.148 | 116 |
10 | -1 | 1 | -1 | 1 | 50 | 1.592 | 244 |
11 | 1 | 1 | -1 | 1 | 92 | 1.726 | 126 |
12 | -1 | -1 | 1 | 1 | 107 | 1.203 | 72 |
13 | 1 | -1 | 1 | 1 | 172 | 2.261 | 210 |
14 | -1 | 1 | 1 | 1 | 62 | 1.434 | 234 |
15 | 1 | 1 | 1 | 1 | 82 | 1.848 | 154 |
16 | 0 | 0 | 0 | 0 | 75 | 1.726 | 223 |
17 | 0 | 0 | 0 | 0 | 70 | 1.782 | 219 |
18 | 0 | 0 | 0 | 0 | 89 | 1.753 | 226 |
Build a Factorial Model
To build a factorial model for the data, explann implements the class FactorialModel. Its arguments are the data and the functions. Functions are passed as a Python dictionary whose keys are the function names, in this example "F", "CM" and "B". These names can be any strings, which allows any number of models to be created under different names.
The dictionary values are strings containing the model equations, whose syntax follows patsy conventions. The left-hand side (lhs) contains the dependent variable and the right-hand side (rhs) the independent terms. The lhs and rhs are separated by the “~” character. Every variable named in an equation must be a column of the data.
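In patsy formulas, a * b is shorthand for a + b + a:b, so a right-hand side such as U * A * P * Y stands for every main effect plus every interaction among the four factors (patsy also adds an intercept by default). The following plain-Python sketch lists those terms; the helper name expand_full_factorial is hypothetical and the code only enumerates term names, it does not call patsy or explann.

```python
from itertools import combinations

def expand_full_factorial(factors):
    """List the terms that `a * b * ...` stands for: every non-empty subset of factors, joined with ':'."""
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms

terms = expand_full_factorial(["U", "A", "P", "Y"])
print(len(terms))
print(terms)
```

For four factors this gives 4 main effects, 6 two-way, 4 three-way and 1 four-way interaction, 15 terms in total.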
[13]:
from explann.models import FactorialModel
fm = FactorialModel(
    data=data_reader_xlsx.data,
    functions={
        "F": "F ~ U * A * P * Y",
        # "CM" is only the model name; the formula must use the column name "C"
        "CM": "C ~ U * A * P * Y",
        "B": "B ~ U * A * P * Y",
    },
)
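Because every variable in a formula has to match a column of the data table (here U, A, P, Y, F, C and B), a name that is not a column makes patsy raise an error while it evaluates that factor. A quick pre-check can catch this before fitting. The sketch below is a hypothetical helper, not part of explann or patsy, and it makes a simplifying assumption: every identifier in the formula is treated as a column name, so formulas that call functions would need a real parser.

```python
import re

def missing_names(formula, columns):
    """Return the identifiers used in the formula that are not column names.

    Simplification: every identifier is treated as a column, so formulas
    that call functions would be reported incorrectly.
    """
    names = re.findall(r"[A-Za-z_]\w*", formula)
    missing = []
    for name in names:
        if name not in columns and name not in missing:
            missing.append(name)
    return missing

columns = ["U", "A", "P", "Y", "F", "C", "B"]
ok = missing_names("C ~ U * A * P * Y", columns)
bad = missing_names("CM ~ U * A * P * Y", columns)
print(ok)
print(bad)
```

An empty list means every name in the formula was found among the columns.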