Some functionalities

Features and units used

  1. Apparent resistivity rhoa in ohm.m.
  2. Standard fracture index sfi, no unit (n.u.).
  3. Anomaly ratio anr, in %.
  4. Anomaly power Pa or power in meters (m).
  5. Anomaly magnitude Ma or magnitude in ohm.m.
  6. Anomaly shape - can be V, M, K, L, H, C, and W (n.u.).
  7. Anomaly type - can be EC, NC, CB2P, and CP (n.u.):
    • EC: Extensive conductive
    • NC: Narrow conductive
    • CP: Conductive plane
    • CB2P: Conductive between two planes
  8. Layer thickness thick in m.
  9. Station (site) or position, given as pk in m.
  10. Ohmic surface ohmS in ohm.m², obtained from the vertical electrical sounding (VES).
  11. Level of water inflow lwi in m, obtained from existing boreholes.
  12. Geology geol of the survey area, obtained during drilling or from previous geological works.

Data preparation steps

Before taking advantage of the WATex algorithms, especially when dealing with Electrical Resistivity Profiling (ERP) and Vertical Electrical Sounding (VES) data, a few data preparation steps are needed. ERP and VES data straightforwardly collected from the field MUST be referenced. An example of how to prepare ERP and VES data can be found in the data/geof_data directory. If the ERP and VES data are in the same Excel workbook in separate sheets, use the read_from_excelsheets and write_excel tools from watex.utils.ml_utils to separate each ERP and VES sheet while keeping the location coordinates where each VES was done. A new directory _anEX_ should be created with the newly built data. Once the build succeeds, the geoelectrical features are computed automatically. To have full control over your selected anomaly, the lower and upper anomaly boundaries and the se (or ves) marker of the selected anomaly should be specified on each ERP survey line in the Excel sheet (see data/geof_data/XXXXXXX.csv); a new Excel workbook main.<name of survey area>.csv should then be created. Once the features file is generated, enjoy your end-to-end Machine Learning (ML) project with the implemented algorithms.
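
As a rough sketch of the separation step (the exact signatures of read_from_excelsheets and write_excel are assumptions here, not taken from the API reference), the split could look like:

>>> from watex.utils.ml_utils import read_from_excelsheets, write_excel
>>> # hypothetical usage: read every ERP/VES sheet from one workbook ...
>>> sheets = read_from_excelsheets('data/geof_data/<survey_area>.xlsx')
>>> write_excel(sheets)        # ... and write each sheet to its own file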

  • Code snippet for fetching the raw data:
>>> from watex.datasets import fetch_data 
>>> data = fetch_data('Bagoue original')['data']                  # raw dataset
>>> attributes_infos = fetch_data('Bagoue original')['attrs-infos']

Get the geo-electrical features from the selected anomaly

Geo-electrical features are mainly used for flow rate (FR) prediction purposes. For this demonstration, we refer to the data directory data/erp. The electrical resistivity profiling (ERP) data of the survey line is found in l10_gbalo.csv. There are two ways to get the geo-electrical features: the first option is to provide the selected anomaly boundaries through the posMinMax argument, and the second is to let the program automatically find the best anomaly point. The first option is strongly recommended.

First of all, we import the ERP class from watex.methods.erp and build the erp_obj as follows:

>>> from watex.methods.erp import ERP 
>>> erp_obj = ERP(erp_fn='data/erp/l10_gbalo.csv',   # erp data 
...               auto=False,                        # automatic computation option 
...               dipole_length=10.,                 # distance between measurements 
...               posMinMax=(90, 130),               # selected anomaly boundaries 
...               turn_on=True                       # display infos
...               )
  • To automatically get the best anomaly point from the ERP line of the survey area, enable the auto option and try:
>>> erp_obj.select_best_point_ 
Out[1]: 170       # --|> The best point is found at position (pk) = 170.0 m. ----> Station 18              
>>> erp_obj.select_best_value_ 
Out[2]: 80.0      # --|> The best conductive value selected is 80.0 Ω.m                    
  • To get the other geo-electrical features, consider the prefix best_ + {feature_name}. For instance:
>>> erp_obj.best_type        # type of the best selected anomaly on the ERP line
>>> erp_obj.best_shape       # best selected anomaly shape is "V"
>>> erp_obj.best_magnitude   # best anomaly magnitude is 45 Ω.m
>>> erp_obj.best_power       # best anomaly power is 40.0 m
>>> erp_obj.best_sfi         # best anomaly standard fracture index
>>> erp_obj.best_anr         # best anomaly ratio over the whole ERP line
  • If auto is enabled, the program can additionally find the three (03) best conductive points from the whole ERP line:
>>> erp_obj.best_points 
-----------------------------------------------------------------------------
--|> 3 best points were found :
 01 : position = 170.0 m ----> rhoa = 80 Ω.m
 02 : position = 80.0 m ----> rhoa = 95 Ω.m
 03 : position = 40.0 m ----> rhoa = 110 Ω.m               
-----------------------------------------------------------------------------

To generate multiple features from different locations of the ERP survey lines, compute all geo-electrical features from every ERP survey line using the ERP_collection class from watex.methods.erp as below:

>>> from watex.methods.erp import ERP_collection
>>> erpColObj = ERP_collection(listOferpfn='data/erp')
>>> erpColObj.erpdf 

Get all the features for data analysis and prediction purposes by calling GeoFeatures from the watex.bases module as below:

>>> from watex.bases import GeoFeatures
>>> featurefn = 'data/geo_fdata/BagoueDataset2.xlsx'
>>> featObj = GeoFeatures(features_fn=featurefn)
>>> featObj.site_ids         # site identifiers
>>> featObj.site_names       # site names
>>> featObj.df               # features dataframe

Click here to see the features' dataset.

Data analysis and quick plot hints

To solve a classification problem in supervised learning, we need to categorize the targeted numerical values into categorical values using the watex.analysis module. It's possible to export the data using the decorated writedf function:

>>> from watex.analysis import FeatureInspection
>>> slObj = FeatureInspection(
...   data_fn='data/geo_fdata/BagoueDataset2.xlsx',
...   set_index=True)
>>> slObj.writedf()
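
The categorization itself can be illustrated with plain pandas (a generic sketch, independent of the watex API; the bin edges and FR labels below are hypothetical):

>>> import pandas as pd
>>> # hypothetical bin edges and labels: map the numerical flow (m3/h)
>>> # target into discrete flow-rate classes for classification
>>> flow_cat = pd.cut(slObj.df['flow'], bins=[0., 1., 3., 10., 1e3],
...                   labels=['FR0', 'FR1', 'FR2', 'FR3'],
...                   include_lowest=True)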

To quickly see what the data look like, call the plotting tools from the watex.view package:

>>> from watex.view.plot import QuickPlot 
>>> qplotObj = QuickPlot(df=slObj.df, lc='b') 
>>> qplotObj.hist_cat_distribution(target_name='flow')

If df is not given, it's just as easy to visualize the data by setting the data_fn argument instead, e.g. QuickPlot(data_fn='data/geo_fdata/BagoueDataset2.xlsx', lc='b'); both give the same result. To draw a plot of two features with bivariate and univariate graphs, use the QuickPlot.joint2features method as below:

>>> from watex.view.plot import QuickPlot
>>> qkObj = QuickPlot(
...             data_fn ='data/geo_fdata/BagoueDataset2.xlsx', lc='b', 
...             target_name = 'flow', set_theme ='darkgrid', 
...             fig_title='`ohmS` and `lwi` features linked'
...             )  
>>> sns_pkws={
...            'kind':'reg' , #'kde', 'hex'
...            # "hue": 'flow', 
...               }
>>> joinpl_kws={"color": "r", 
...                'zorder':0, 'levels':6}
>>> plmarg_kws={'color':"r", 'height':-.15, 'clip_on':False}           
>>> qkObj.joint2features(features=['ohmS', 'lwi'], 
...            join_kws=joinpl_kws, marginals_kws=plmarg_kws, 
...            **sns_pkws, 
...            ) 

To draw a scatter plot with the possibility of several semantic feature groupings, use the scatteringFeatures method. Indeed, this analysis is a process of understanding how the features in a dataset relate to each other and how those relationships depend on other features. It is easy to customize the plot if the user has experience with seaborn plot styles. For instance, we can visualize the relationship between the flow and the geology (geol) as:

>>> from watex.view.plot import QuickPlot
>>> qkObj = QuickPlot(
...    data_fn ='data/geo_fdata/BagoueDataset2.xlsx' , 
...             fig_title='Relationship between geology and level of water inflow',
...             xlabel='Level of water inflow (lwi)', 
...             ylabel='Flow rate in m3/h'
...            )  
>>> marker_list= ['o','s','P', 'H']
>>> markers_dict = {key:mv 
...               for key, mv in zip( list (
...                       dict(qkObj.df ['geol'].value_counts(
...                           normalize=True)).keys()), 
...                            marker_list)}
>>> sns_pkws={'markers':markers_dict, 
...          'sizes':(20, 200),
...          "hue":'geol', 
...          'style':'geol',
...         "palette":'deep',
...          'legend':'full',
...          # "hue_norm":(0,7)
...            }
>>> regpl_kws = {'col':'flow', 
...             'hue':'lwi', 
...             'style':'geol',
...             'kind':'scatter'
...            }
>>> qkObj.scatteringFeatures(features=['lwi', 'flow'],
...                         relplot_kws=regpl_kws,
...                         **sns_pkws, 
...                    ) 

WATex also gives a piece of mileage discussion. Indeed, discussing mileages seems to be a good approach to comprehending the relationships between the features, their correlations, as well as their influence on each other. For instance, to discuss the mileages 'ohmS', 'sfi', 'geol' and 'flow', we merely need to call the discussingFeatures method of the QuickPlot class as below:

>>> from watex.view.plot import QuickPlot 
>>> qkObj = QuickPlot(fig_legend_kws={'loc':'upper right'},
...          fig_title='`sfi` vs `ohmS`|`geol`',
...            )  
>>> sns_pkws={'aspect':2 , 
...          "height": 2, 
...                  }
>>> map_kws={'edgecolor':"w"}   
>>> qkObj.discussingFeatures(
...    data_fn ='data/geo_fdata/BagoueDataset2.xlsx' , 
...                         features =['ohmS', 'sfi','geol', 'flow'],
...                           map_kws=map_kws,  **sns_pkws)                          

Data processing

Processing is useful before the modeling step. To process data, a default implementation is given for data preprocessing after data sanitizing. It consists of creating a model pipeline using different supervised learning methods. A default pipeline is created through the preprocessor design: a preprocessor is a set of transformers + estimators plus multiple other functions to boost the prediction. WATex includes nine (09) built-in default estimators from the neighbors, trees, SVM, and ensemble categories. An example of the preprocessing class implementation is given below:

>>> from watex.bases import Preprocessing
>>> prepObj = Preprocessing(drop_features = ['lwi', 'x_m', 'y_m'],
...    data_fn ='data/geo_fdata/BagoueDataset2.xlsx')
>>> prepObj.X_train, prepObj.X_test, prepObj.y_train, prepObj.y_test
>>> prepObj.categorial_features, prepObj.numerical_features 
>>> prepObj.random_state = 25 
>>> prepObj.test_size = 0.25
>>> prepObj.make_preprocessor()         # use default preprocessing
>>> prepObj.make_preprocessing_model( default_estimator='SVM')
>>> prepObj.preprocessing_model_score
>>> prepObj.preprocess_model_prediction
>>> prepObj.confusion_matrix
>>> prepObj.classification_report

It's also interesting to evaluate a quick model score without any preprocessing beforehand by calling the Processing superclass as below:

>>> from watex.bases import Processing 
>>> from sklearn.tree import DecisionTreeClassifier
>>> processObj = Processing(
...   data_fn='data/geo_fdata/BagoueDataset2.xlsx')
>>> processObj.quick_estimation(estimator=DecisionTreeClassifier(
...    max_depth=100, random_state=13))
>>> processObj.model_score
0.5769230769230769                  # model score ~ 57.692   %
>>> processObj.model_prediction

Now let's evaluate the model_score on the same dataset by reinjecting the default composite estimator built from the preprocessor pipelines. We trigger the composite estimator by switching the auto option to True.

>>> processObj = Processing(data_fn = 'data/geo_fdata/BagoueDataset2.xlsx', 
...                        auto=True)
>>> processObj.preprocessor
>>> processObj.model_score
0.65385896523648201                 # new composite estimator ~ 65.385    %
>>> processObj.model_prediction

We clearly see a difference of about 7.7% between the two options (65.386% vs. 57.692%). Furthermore, we can get the validation curve by calling the get_validation_curve function using the same default composite estimator, like:

>>> processObj.get_validation_curve(switch_plot='on', preprocess_step=True)

Modeling

The most interesting and challenging part of modeling is tuning the hyperparameters after designing a composite estimator. Getting the best parameters is a better way to reorganize the created pipeline {transformers + estimators} so as to generalize well to new data. In the following example, we create a simple pipeline and tune its hyperparameters. The best parameters obtained are then reinjected into the designed estimator for the next prediction. This is only an example; the user is able to create their own, more powerful pipelines. We consider an SVC estimator as the default estimator. The process is described below:

>>> import numpy as np
>>> from watex.bases import BaseModel
>>> from sklearn.preprocessing import RobustScaler, PolynomialFeatures 
>>> from sklearn.feature_selection import SelectKBest, f_classif 
>>> from sklearn.svm import SVC 
>>> from sklearn.compose import make_column_selector 
>>> my_own_pipelines = {
...     'num_column_selector_': make_column_selector(
...         dtype_include=np.number),
...     'cat_column_selector_': make_column_selector(
...         dtype_exclude=np.number),
...     'features_engineering_': PolynomialFeatures(
...         3, include_bias=False),
...     'selectors_': SelectKBest(f_classif, k=3), 
...     'encodages_': RobustScaler()
...     }
>>> my_estimator = SVC(C=1, gamma=1e-4, random_state=7)    # arbitrary starting estimator 
>>> modelObj = BaseModel(data_fn='data/geo_fdata/BagoueDataset2.xlsx', 
...                      pipelines=my_own_pipelines, 
...                      estimator=my_estimator)
>>> hyperparams = {
...     'columntransformer__pipeline-1__polynomialfeatures__degree': np.arange(2, 10), 
...     'columntransformer__pipeline-1__selectkbest__k': np.arange(2, 7), 
...     'svc__C': [1, 10, 100],
...     'svc__gamma': [1e-1, 1e-2, 1e-3]}
>>> my_compose_estimator_ = modelObj.model_ 
>>> modelObj.tuning_hyperparameters(
...     estimator=my_compose_estimator_, 
...     hyper_params=hyperparams, 
...     search='rand') 
>>> modelObj.best_params_
Out[7]:
{'columntransformer__pipeline-1__polynomialfeatures__degree': 2, 'columntransformer__pipeline-1__selectkbest__k': 2, 'svc__C': 1, 'svc__gamma': 0.1}
>>> modelObj.best_score_
Out[8]:
-----------------------------------------------------------------------------
> SupportVectorClassifier       :   Score  =   73.092   %
-----------------------------------------------------------------------------

We can now rebuild and rearrange the pipeline by reinjecting the best parameter values (a sketch is given below) and run it again to get the new model_score and model prediction:
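
A minimal sketch of that reinjection, assuming the composite estimator follows the scikit-learn estimator API (set_params is standard scikit-learn; the names come from the snippet above):

>>> # re-inject the tuned values into the composite estimator in place
>>> my_compose_estimator_.set_params(**modelObj.best_params_)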

>>> modelObj.model_score
Out[9]:
-----------------------------------------------------------------------------
> SupportVectorClassifier       :   Score  =   76.923   % ~ >3/4 
----------------------------------------------------------------------------- 
  • Note: This is an illustrative example; you can tune the hyperparameters of other supervised learning estimators by adjusting the parameters of the watex.bases.modeling.BaseModel.tuning_hyperparameters method.

We can quickly visualize the learning curve by calling the decorated method get_learning_curve as below:

>>> processObj.get_learning_curve(estimator=my_compose_estimator_,
...     switch_plot='on')

In the Bagoue test area, we can get a sample of the model prediction after tuning the model hyperparameters by calling the decorated method get_model_prediction as below:

>>> from watex.bases import BaseModel 
>>> modelObj = BaseModel(data_fn ='data/geo_fdata/BagoueDataset2.xlsx', 
...                     pipelines ={
...                             'num_column_selector_': make_column_selector(dtype_include=np.number),
...                             'cat_column_selector_': make_column_selector(dtype_exclude=np.number),
...                             'features_engineering_':PolynomialFeatures(2, include_bias=False),
...                             'selectors_': SelectKBest(f_classif, k=2), 
...                             'encodages_': RobustScaler()},
...                     estimator = SVC(C=1, gamma=0.1))
>>> modelObj.get_model_prediction(switch ='on')
  • Implementation in the test area: see the model prediction of the test area performed in the Bagoue region in the northern part of Côte d'Ivoire (West Africa) by clicking on the reference output.

It's also possible to visualize the permutation_importance of the mileages using a tree or ensemble method, before and after shuffling. Indeed, permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators; more details can be found on the scikit-learn website. Let's do a quick example using the RandomForestClassifier ensemble estimator. We call the decorated method permutation_feature_importance as below:

>>> from watex.bases import BaseModel
>>> from sklearn.ensemble import RandomForestClassifier
>>> modelObj.permutation_feature_importance(
...    estimator=RandomForestClassifier(random_state=7),
...    n_repeats=100, 
...    data_fn='data/geo_fdata/BagoueDataset2.xlsx',  
...    switch='on', pfi_style='pfi')    # plot style can be a dendrogram with argument `dendro`
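
For reference, the underlying computation in plain scikit-learn (a generic sketch, independent of the WATex wrapper, reusing the train/test attributes from the Preprocessing example above) looks like:

>>> from sklearn.inspection import permutation_importance
>>> # fit on the training split, then measure how much the score drops
>>> # when each feature is shuffled on the held-out split
>>> rf = RandomForestClassifier(random_state=7).fit(prepObj.X_train, prepObj.y_train)
>>> pfi = permutation_importance(rf, prepObj.X_test, prepObj.y_test,
...                              n_repeats=100, random_state=7)
>>> pfi.importances_mean         # mean importance of each feature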

Click here to see the pfi diagram reference output.