Some functionalities

- Apparent resistivity `rhoa` in ohm.meter.
- Standard fracture index `sfi`, no unit (n.u).
- Anomaly ratio `anr`, in %.
- Anomaly power Pa or `power` in meter (m).
- Anomaly magnitude Ma or `magnitude` in ohm.m.
- Anomaly shape: can be `V`, `M`, `K`, `L`, `H`, `C` and `W` (n.u).
- Anomaly type: can be `EC`, `NC`, `CB2P` and `CP` (n.u).
    - `EC`: Extensive conductive
    - `NC`: Narrow conductive
    - `CP`: Conductive plane
    - `CB2P`: Conductive between two planes
- Layer thickness `thick` in m.
- Station (site) or position, given as `pk` in m.
- Ohmic surface `ohmS` in ohm.m2, got from the vertical electrical sounding (VES).
- Level of water inflow `lwi` in m, got from the existing boreholes.
- Geology `geol` of the survey area, got during the drilling or from previous geology works.
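To make the `power` and `magnitude` features concrete, here is a minimal sketch of how they can be computed from an ERP profile. The formulas below are plausible readings of the definitions above (anomaly width, and resistivity contrast inside the anomaly boundaries); the exact expressions implemented in WATex may differ, and the profile data are made up for illustration:

```python
def anomaly_power(pk_lower, pk_upper):
    """Anomaly power Pa: width of the selected anomaly, in metres."""
    return abs(pk_upper - pk_lower)

def anomaly_magnitude(positions, rhoa, pk_lower, pk_upper):
    """Anomaly magnitude Ma: contrast between the highest and lowest
    apparent-resistivity values inside the anomaly boundaries (ohm.m)."""
    inside = [r for p, r in zip(positions, rhoa) if pk_lower <= p <= pk_upper]
    return max(inside) - min(inside)

# Stations every 10 m along the profile, with a conductive low around 110 m.
positions = list(range(0, 200, 10))
rhoa = [130, 125, 120, 110, 95, 100, 115, 120, 118, 105,
        90, 80, 85, 100, 112, 120, 125, 128, 130, 132]

print(anomaly_power(90, 130))                       # -> 40
print(anomaly_magnitude(positions, rhoa, 90, 130))  # -> 25
```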
Before taking advantage of WATex algorithms, especially when dealing with Electrical Resistivity Profiling (ERP) as well as Vertical Electrical Sounding (VES) data, a few data-preparation steps are needed. ERP and VES data collected straightforwardly from the field MUST be referenced. An example of how to prepare ERP and VES data can be found in the `data/geof_data` directory. If the ERP and VES data are in the same Excel workbook on separate sheets, use the tools `read_from_excelsheets` and `write_excel` from `watex.utils.ml_utils` to separate each ERP and VES, keeping the same location coordinates where the VES was done. A new directory `_anEX_` should be created with the newly built data. Once the build is successfully done, the geoelectrical features are computed automatically. To have full control of your selected anomaly, the `lower` and `upper` anomaly boundaries and the `se` or `ves|*|0` of the selected anomaly should be specified on each ERP survey line in the Excel sheet (see `data/geof_data/XXXXXXX.csv`); then a new Excel workbook `main.<name of survey area>.csv` should be created. Once the features file is generated, enjoy your end-to-end Machine Learning (ML) project with the implemented algorithms.
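The separation step can be sketched in plain Python: take each sheet (mocked below as in-memory rows) and write it to its own file inside a new `_anEX_` directory, keeping the shared location coordinates. The sheet names and column layout here are hypothetical; `read_from_excelsheets` and `write_excel` handle the real Excel I/O:

```python
import csv
import os
import tempfile

# Hypothetical workbook: ERP and VES sheets share the same station coordinates.
workbook = {
    "l10_gbalo_erp": [("pk", "east", "north", "rhoa"),
                      (0, 790210, 1093065, 130.0),
                      (10, 790214, 1093070, 120.0)],
    "l10_gbalo_ves": [("AB/2", "east", "north", "rhoa"),
                      (1, 790214, 1093070, 110.0),
                      (2, 790214, 1093070, 98.0)],
}

def split_workbook(sheets, out_dir):
    """Write every sheet to its own CSV file inside `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name, rows in sheets.items():
        path = os.path.join(out_dir, name + ".csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
    return paths

out = split_workbook(workbook, os.path.join(tempfile.mkdtemp(), "_anEX_"))
```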
- Code snippet for fetching the raw data:
>>> from watex.datasets import fetch_data
>>> data = fetch_data('Bagoue original')['data']
>>> attributes_infos = fetch_data('Bagoue original')['attrs-infos']
Geo-electrical features are mainly used for FR prediction purposes. For this demonstration, we refer to the data directory `data/erp`. The electrical resistivity profiling (ERP) data of the survey line are found in `l10_gbalo.csv`. There are two ways to get the geo-electrical features: the first option is to provide the selected anomaly boundaries via the `posMinMax` argument; the second is to let the program automatically find the best anomaly point. The first option is strongly recommended.
First of all, we import the `ERP` class from `watex.methods.erp` to build the `erp_obj` as follows:
>>> from watex.methods.erp import ERP
>>> erp_obj = ERP(erp_fn='data/erp/l10_gbalo.csv',  # ERP data file
...               auto=False,           # automatic computation option
...               dipole_length=10.,    # distance between measurements
...               posMinMax=(90, 130),  # select anomaly boundaries
...               turn_on=True)         # display infos
- To automatically get the best anomaly point from the ERP line of the survey area, enable the `auto` option and try:
>>> erp_obj.select_best_point_
Out[1]: 170 # --|> The best point is found at position (pk) = 170.0 m. ----> Station 18
>>> erp_obj.select_best_value_
Out[1]: 80.0 # --|> Best conductive value selected is = 80.0 Ω.m
- To get the other geo-electrical features, consider the prefix `best_` + `{feature_name}`. For instance:
>>> erp_obj.best_type # Type of the best selected anomaly on erp line
>>> erp_obj.best_shape # Best selected anomaly shape is "V"
>>> erp_obj.best_magnitude # Best anomaly magnitude is 45 Ω.m.
>>> erp_obj.best_power # Best anomaly power is 40.0 m.
>>> erp_obj.best_sfi # best anomaly standard fracturation index.
>>> erp_obj.best_anr # best anomaly ratio over the whole ERP line
- If `auto` is enabled, the program can additionally find the three (03) best conductive points from the whole ERP line, as:
>>> erp_obj.best_points
-----------------------------------------------------------------------------
--|> 3 best points were found :
01 : position = 170.0 m ----> rhoa = 80 Ω.m
02 : position = 80.0 m ----> rhoa = 95 Ω.m
03 : position = 40.0 m ----> rhoa = 110 Ω.m
-----------------------------------------------------------------------------
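A naive version of this selection simply ranks the stations by apparent resistivity, since the most conductive (lowest `rhoa`) points are the drilling candidates. The sketch below reproduces the output format above on made-up profile data; the real `best_points` routine presumably applies additional shape and spacing criteria on top of this:

```python
def best_conductive_points(positions, rhoa, n=3):
    """Return the n stations with the lowest apparent resistivity."""
    ranked = sorted(zip(positions, rhoa), key=lambda pr: pr[1])
    return ranked[:n]

# Made-up ERP line: station positions (m) and apparent resistivities (ohm.m).
positions = [0, 40, 80, 120, 170, 200]
rhoa = [130.0, 110.0, 95.0, 125.0, 80.0, 140.0]

for i, (pk, r) in enumerate(best_conductive_points(positions, rhoa), 1):
    print(f"{i:02d} : position = {pk} m ----> rhoa = {r} ohm.m")
```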
Generate multiple features from different locations of the ERP survey lines by computing all geo-electrical features from every ERP survey line, using the `ERP_collection` module from `watex.methods.erp` as below:
>>> from watex.methods.erp import ERP_collection
>>> erpColObj= ERP_collection(listOferpfn= 'data/erp')
>>> erpColObj.erpdf
Get all features for data analysis and prediction purposes by calling `GeoFeatures` from the `watex.bases` module, as:
>>> from watex.bases import GeoFeatures
>>> featurefn ='data/geo_fdata/BagoueDataset2.xlsx'
>>> featObj =GeoFeatures(features_fn= featurefn)
>>> featObj.site_ids
>>> featObj.site_names
>>> featObj.df
Click here to see the features' dataset.
To solve the classification problem in supervised learning, we need to categorize the targeted numerical values into categorical values using the module `watex.analysis`. It is possible to export the data using the decorated `~writedf` function:
>>> from watex.analysis import FeatureInspection
>>> slObj =FeatureInspection(
... data_fn='data/geo_fdata/BagoueDataset2.xlsx',
... set_index =True)
>>> slObj.writedf()
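The categorization itself amounts to binning the numerical flow-rate target into ordered classes. A minimal sketch of the idea; the bin edges and class labels below are illustrative assumptions, not the ones used by `watex.analysis` for the Bagoue dataset:

```python
from bisect import bisect_right

def categorize_flow(values, edges=(1.0, 3.0), labels=("FR1", "FR2", "FR3")):
    """Map numerical flow rates (m3/h) onto ordered class labels.

    Values below edges[0] get labels[0], values in [edges[0], edges[1])
    get labels[1], and so on.
    """
    return [labels[bisect_right(edges, v)] for v in values]

print(categorize_flow([0.0, 0.5, 2.1, 4.8]))  # -> ['FR1', 'FR1', 'FR2', 'FR3']
```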
To quickly see what the data look like, call the `~view` package:
>>> from watex.view.plot import QuickPlot
>>> qplotObj = QuickPlot( df = slObj.df , lc='b')
>>> qplotObj.hist_cat_distribution(target_name='flow')
It is also easy to quickly visualize the data by setting the argument `data_fn` when `df` is not given, e.g. `data_fn='data/geo_fdata/BagoueDataset2.xlsx'`. Both give the same result.
To draw a plot of two features with bivariate and univariate graphs, use the `~.QuickPlot.joint2features` method as below:
>>> from watex.view.plot import QuickPlot
>>> qkObj = QuickPlot(
... data_fn ='data/geo_fdata/BagoueDataset2.xlsx', lc='b',
... target_name = 'flow', set_theme ='darkgrid',
... fig_title='`ohmS` and `lwi` features linked'
... )
>>> sns_pkws={
... 'kind':'reg' , #'kde', 'hex'
... # "hue": 'flow',
... }
>>> joinpl_kws={"color": "r",
... 'zorder':0, 'levels':6}
>>> plmarg_kws={'color':"r", 'height':-.15, 'clip_on':False}
>>> qkObj.joint2features(features=['ohmS', 'lwi'],
... join_kws=joinpl_kws, marginals_kws=plmarg_kws,
... **sns_pkws,
... )
To draw a scatter plot with the possibility of several semantic feature groupings, use the `scatteringFeatures` method. Indeed, this method analyzes how the features in a dataset relate to each other and how those relationships depend on other features. It is easy to customize the plot if the user has experience with `seaborn` plot styles. For instance, we can visualize the relationship between `flow` and the geology (`geol`) as:
>>> from watex.view.plot import QuickPlot
>>> qkObj = QuickPlot(
... data_fn ='data/geo_fdata/BagoueDataset2.xlsx' ,
... fig_title='Relationship between geology and level of water inflow',
... xlabel='Level of water inflow (lwi)',
... ylabel='Flow rate in m3/h'
... )
>>> marker_list= ['o','s','P', 'H']
>>> markers_dict = {key:mv
... for key, mv in zip( list (
... dict(qkObj.df ['geol'].value_counts(
... normalize=True)).keys()),
... marker_list)}
>>> sns_pkws={'markers':markers_dict,
... 'sizes':(20, 200),
... "hue":'geol',
... 'style':'geol',
... "palette":'deep',
... 'legend':'full',
... # "hue_norm":(0,7)
... }
>>> regpl_kws = {'col':'flow',
... 'hue':'lwi',
... 'style':'geol',
... 'kind':'scatter'
... }
>>> qkObj.scatteringFeatures(features=['lwi', 'flow'],
... relplot_kws=regpl_kws,
... **sns_pkws,
... )
WATex also offers a piece of mileage discussion. Indeed, discussing mileages seems to be a good approach to comprehending the relationships of the features, their correlation, as well as their influence on each other. For instance, to discuss the mileages `ohmS`, `sfi`, `geol` and `flow`, we merely need to import the `discussingFeatures` method from the `QuickPlot` class as below:
>>> from watex.view.plot import QuickPlot
>>> qkObj = QuickPlot( fig_legend_kws={'loc':'upper right'},
...               fig_title='`sfi` vs `ohmS` | `geol`',
... )
>>> sns_pkws={'aspect':2 ,
... "height": 2,
... }
>>> map_kws={'edgecolor':"w"}
>>> qkObj.discussingFeatures(
... data_fn ='data/geo_fdata/BagoueDataset2.xlsx' ,
... features =['ohmS', 'sfi','geol', 'flow'],
... map_kws=map_kws, **sns_pkws)
Processing is useful before the modeling step. To process data, a default implementation is provided for data preprocessing after data sanitizing. It consists of creating a model pipeline using different supervised learning methods. A default pipeline is created through the `preprocessor` design. Indeed, a `preprocessor` is a set of `transformers + estimators`, plus multiple other functions to boost the prediction. WATex includes nine (09) inner default estimators in the `neighbors`, `trees`, `SVM`, and `~.ensemble` estimator categories.
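The `transformers + estimators` idea can be sketched directly with scikit-learn primitives: numeric and categorical columns each get their own transformer chain, and the combined preprocessor is piped into an estimator. The column names and toy data below are hypothetical, not the Bagoue dataset's actual schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Transformer chains: impute then scale numerics, impute then encode categoricals.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(handle_unknown="ignore"))

preprocessor = ColumnTransformer([
    ("num", num_pipe, ["power", "magnitude", "sfi"]),
    ("cat", cat_pipe, ["shape", "type", "geol"]),
])

# preprocessor + estimator = the composite model.
model = make_pipeline(preprocessor, SVC(C=1.0, gamma="scale"))

X = pd.DataFrame({
    "power": [40.0, 25.0, 60.0, 10.0],
    "magnitude": [45.0, 30.0, 70.0, 12.0],
    "sfi": [1.2, 0.8, 1.6, 0.5],
    "shape": ["V", "W", "V", "M"],
    "type": ["NC", "EC", "CP", "NC"],
    "geol": ["granite", "schist", "granite", "schist"],
})
y = [1, 0, 1, 0]
model.fit(X, y)
```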
An example of the `Preprocessing` class implementation is given below:
>>> from watex.bases import Preprocessing
>>> prepObj = Preprocessing(drop_features = ['lwi', 'x_m', 'y_m'],
... data_fn ='data/geo_fdata/BagoueDataset2.xlsx')
>>> prepObj.X_train, prepObj.X_test, prepObj.y_train, prepObj.y_test
>>> prepObj.categorial_features, prepObj.numerical_features
>>> prepObj.random_state = 25
>>> prepObj.test_size = 0.25
>>> prepObj.make_preprocessor() # use default preprocessing
>>> prepObj.make_preprocessing_model( default_estimator='SVM')
>>> prepObj.preprocessing_model_score
>>> prepObj.preprocess_model_prediction
>>> prepObj.confusion_matrix
>>> prepObj.classification_report
It is also interesting to evaluate a quick model score, without any preprocessing beforehand, by calling the `Processing` superclass as:
>>> from watex.bases import Processing
>>> from sklearn.tree import DecisionTreeClassifier
>>> processObj = Processing(
... data_fn = 'data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.quick_estimation(estimator=DecisionTreeClassifier(
... max_depth=100, random_state=13))
>>> processObj.model_score
0.5769230769230769 # model score ~ 57.692 %
>>> processObj.model_prediction
Now let's evaluate, on the same dataset, the `model_score` obtained by re-injecting the default composite estimator using the `preprocessor` pipelines. We trigger the composite estimator by switching the `auto` option to `True`:
>>> processObj = Processing(data_fn = 'data/geo_fdata/BagoueDataset2.xlsx',
... auto=True)
>>> processObj.preprocessor
>>> processObj.model_score
0.65385896523648201 # new composite estimator ~ 65.385 %
>>> processObj.model_prediction
We clearly see a difference of about 7.69% between the two options (65.386% vs 57.692%). Furthermore, we can get the validation curve by calling the `get_validation_curve` function using the same default composite estimator, like:
>>> processObj.get_validation_curve(switch_plot='on', preprocess_step=True)
The most interesting and challenging part of modeling is tuning the hyperparameters after designing a composite estimator. Finding the best parameters is a way to reorganize the created pipeline `{transformers + estimators}` so as to improve the model's ability to generalize. In the following example, we create a simple pipeline and tune its hyperparameters; the best parameters obtained are then re-injected into the designed estimator for the next prediction. This is only an example: the user is free to create their own, more powerful pipelines. We consider an `SVC` estimator as the default estimator. The process is described below:
>>> import numpy as np
>>> from watex.bases import BaseModel
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.svm import SVC
>>> from sklearn.compose import make_column_selector
>>> my_own_pipelines= {
'num_column_selector_': make_column_selector(
dtype_include=np.number),
'cat_column_selector_': make_column_selector(
dtype_exclude=np.number),
'features_engineering_':PolynomialFeatures(
3, include_bias=False),
'selectors_': SelectKBest(f_classif, k=3),
'encodages_': RobustScaler()
}
>>> my_estimator = SVC(C=1, gamma=1e-4, random_state=7) # random estimator
>>> modelObj = BaseModel(data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
pipelines =my_own_pipelines ,
estimator = my_estimator)
>>> hyperparams ={
'columntransformer__pipeline-1__polynomialfeatures__degree': np.arange(2,10),
'columntransformer__pipeline-1__selectkbest__k': np.arange(2,7),
'svc__C': [1, 10, 100],
'svc__gamma':[1e-1, 1e-2, 1e-3]}
>>> my_compose_estimator_ = modelObj.model_
>>> modelObj.tuning_hyperparameters(
estimator= my_compose_estimator_ ,
hyper_params= hyperparams,
search='rand')
>>> modelObj.best_params_
Out[7]:
{'columntransformer__pipeline-1__polynomialfeatures__degree': 2, 'columntransformer__pipeline-1__selectkbest__k': 2, 'svc__C': 1, 'svc__gamma': 0.1}
>>> modelObj.best_score_
Out[8]:
-----------------------------------------------------------------------------
> SupportVectorClassifier : Score = 73.092 %
-----------------------------------------------------------------------------
We can now rebuild and rearrange the pipeline by specifying the best parameter values, and run it again to get the new `model_score` and model prediction:
>>> modelObj.model_score
Out[9]:
-----------------------------------------------------------------------------
> SupportVectorClassifier : Score = 76.923 % ~ >3/4
-----------------------------------------------------------------------------
- Note: this is an illustrative example; you can tune the hyperparameters of other supervised-learning estimators by adjusting the parameters of the `watex.bases.modeling.BaseModel.tuning_hyperparameters` method.
We can quickly visualize the learning curve by calling the decorated method `get_learning_curve` as below:
>>> processObj.get_learning_curve (estimator= my_compose_estimator_,
switch_plot='on')
In the test area (Bagoue), we can get a sample of the model prediction after tuning the model hyperparameters, by calling the decorated method `get_model_prediction` as below:
>>> from watex.bases import BaseModel
>>> modelObj = BaseModel(data_fn ='data/geo_fdata/BagoueDataset2.xlsx',
... pipelines ={
... 'num_column_selector_': make_column_selector(dtype_include=np.number),
... 'cat_column_selector_': make_column_selector(dtype_exclude=np.number),
... 'features_engineering_':PolynomialFeatures(2, include_bias=False),
... 'selectors_': SelectKBest(f_classif, k=2),
... 'encodages_': RobustScaler()},
... estimator = SVC(C=1, gamma=0.1))
>>> modelObj.get_model_prediction(switch ='on')
- Implementation in the test area: see the model prediction of the test area performed in the Bagoue region, in the northern part of Cote d'Ivoire (West Africa), by clicking on the reference output.
It is also possible to visualize the `permutation_importance` of mileages using a `tree` or `ensemble` method, before and after shuffling. Indeed, permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. It is especially useful for non-linear or opaque estimators. More details can be found on the scikit-learn website.
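The technique is simple enough to illustrate from scratch: score the fitted model, then re-score it with one feature column shuffled; the average drop in score is that feature's importance. A pure-Python sketch of the idea, using a hypothetical toy "model" rather than a real estimator:

```python
import random

def permutation_importance(score_fn, X, y, n_features, n_repeats=30, seed=7):
    """Mean score drop when each feature column is shuffled independently."""
    rng = random.Random(seed)
    base = score_fn(X, y)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the target
            Xp = [row[:j] + [c] + row[j + 1:] for row, c in zip(X, col)]
            drops.append(base - score_fn(Xp, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy "model": predicts class 1 when feature 0 exceeds 0.5; feature 1 is noise.
def score_fn(X, y):
    preds = [1 if row[0] > 0.5 else 0 for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]] * 5
y = [1, 1, 0, 0] * 5
imp = permutation_importance(score_fn, X, y, n_features=2)
# Feature 0 carries the signal (positive importance); feature 1 is ignored
# by the model, so shuffling it changes nothing (importance 0).
```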
Let's do a quick example using the `RandomForestClassifier` ensemble estimator. We call the decorated method `permutation_feature_importance` as:
>>> from watex.bases import BaseModel
>>> from sklearn.ensemble import RandomForestClassifier
>>> modelObj.permutation_feature_importance(
...     estimator=RandomForestClassifier(random_state=7),
...     n_repeats=100,
...     data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...     switch='on', pfi_style='pfi')  # plot style can be a dendrogram with argument `dendro`
Click here to see the pfi diagram reference output.