This work is the implementation of the framework proposed in: HAMLET: a framework for Human-centered AutoML via Structured Argumentation.
If you are interested in reproducing the experiments performed in it, please refer to the dedicated GitHub repository.
In the last decades, we have witnessed an exponential growth in the field of AutoML. Unfortunately, this has turned AutoML into just another black box that can barely be opened. We claim that the user / data scientist has the right and the duty to revise and supervise the Machine Learning (ML) / Data Mining (DM) process.
HAMLET is a Human-centered AutoML framework that leverages Logic and Argumentation to:
- inject knowledge and constraints into a Logical Knowledge Base (LogicalKB) through an intuitive logical language;
- represent (user) inputs and (AutoML) outputs in a uniform medium that is both human- and machine-readable (Problem Graph);
- discover insights through a recommendation mechanism atop the AutoML outputs;
- deal with inconsistencies that may arise in the knowledge.
HAMLET is inspired by the well-known standard process model CRISP-DM (CRoss-Industry Standard Process for Data Mining). Iteration after iteration, the knowledge is augmented, and data scientists and domain experts can work in close cooperation towards the final solution.
HAMLET leverages:
- Arg-tuProlog, a Kotlin implementation of the Argumentation framework;
- Microsoft FLAML, a Python implementation of the state-of-the-art Blend Search (a Bayesian Optimization variant that takes into consideration the cost of the possible solutions);
- Scikit-learn, the well-known Python framework that offers plenty of ML algorithm implementations.
To run HAMLET you need:
- Docker
- Java >= 11.0

HAMLET can then be launched with:

```
java -jar hamlet-1.0.0-all.jar [workspace_path] [dataset_id] [optimization_metric] [optimization_mode] [n_configurations] [time_budget] [optimization_seed] [debug_mode] [knowledge_base_path]
```
- [workspace_path]: absolute path to the file system folder containing the workspace (i.e., where to save the results); if it does not exist, a new workspace is created, otherwise, the previous run is resumed.
- [dataset_id]: OpenML id of the dataset to analyze (due to some OpenML API issues, we pre-downloaded the datasets; thus we support only a specific suite: the OpenML CC-18 suite).
- [optimization_metric]: the name of the metric to optimize (choose among the scikit-learn metrics, e.g., `balanced-accuracy`).
- [optimization_mode]: a string in ['min', 'max'] to specify whether the objective is minimized or maximized.
- [n_configurations]: an integer of the number of configurations to try in the optimization of each iteration.
- [time_budget]: the time budget in seconds given to the optimization of each iteration.
- [optimization_seed]: seed for reproducibility.
- [debug_mode]: a string in ['true', 'false'] to specify HAMLET execution in debug or release mode. In debug mode, the Docker container is built from the local sources; otherwise the released Docker image is downloaded.
- [knowledge_base_path] (OPTIONAL): file system path to a HAMLET knowledge base. If provided, HAMLET is run in console mode (with no GUI) and the given theory is leveraged; otherwise, the HAMLET GUI is launched.
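As an illustration, a console-mode run might look like the following; all argument values below (workspace path, dataset id, budgets, seed, knowledge base path) are placeholders of ours, not defaults shipped with HAMLET:

```
java -jar hamlet-1.0.0-all.jar /home/user/hamlet_workspace 40983 balanced-accuracy max 100 300 42 false /home/user/kb.txt
```

Omitting the last argument would instead launch the HAMLET GUI.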
Once you run HAMLET with a specific dataset and a specific metric to optimize, at the top of the window HAMLET allows you to encode both the AutoML search space and the user-defined constraints into the LogicalKB (see the next section for the accepted syntax).
Two ready-to-use LogicalKBs can be found in the resources folder of this repository:
- `kb.txt` is a knowledge base containing the search space leveraged in our experiments;
- `pkb.txt` (PreliminaryKB) is a knowledge base containing the search space along with some suggested constraints (discovered in the paper Data pre-processing pipeline generation for AutoETL).
For the sake of brevity, we illustrate a simpler LogicalKB (a sketch is given after the description below).
Intuitively, we specify the scheme of the ML pipeline that we want to build, step by step. In the example at hand, we have:
- a Data Pre-processing step for Discretization;
- a Data Pre-processing step for Normalization;
- a Modeling step for Classification (the task we want to address).
Then, we have the implementations and the hyper-parameter domains of each step:
- KBins for Discretization, with an integer parameter k_bins that ranges from 3 to 8;
- StandardScaler for Normalization, with no parameter;
- Decision Tree and K-Nearest Neighbors for Classification, with -- respectively -- an integer parameter max_depth that ranges from 1 to 5 and an integer parameter n_neighbors that ranges from 3 to 20.
Finally, we have a user-defined constraint (`c1`): forbid Normalization for the Decision Tree.
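The sketch below expresses this example in the syntax described in the next section. It is only an illustrative reconstruction: the classification operator names (`dt`, `knn`) and the `%` comments (used here as Prolog-style annotations for readability) are assumptions of ours, not identifiers taken from the shipped knowledge bases.

```
% Pipeline steps
step(discretization).
step(normalization).
step(classification).

% Operator implementations for each step
operator(discretization, kbins).
operator(normalization, standard).
operator(classification, dt).
operator(classification, knn).

% Hyper-parameters and their domains
hyperparameter(kbins, k_bins, randint).
domain(kbins, k_bins, [3, 8]).
hyperparameter(dt, max_depth, randint).
domain(dt, max_depth, [1, 5]).
hyperparameter(knn, n_neighbors, randint).
domain(knn, n_neighbors, [3, 20]).

% User-defined constraint c1: forbid Normalization for the Decision Tree
c1 :=> forbidden([normalization], dt).
```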
By hitting the `Compute Graph` button, the Argumentation framework is called to process the encoded LogicalKB.
The Problem Graph is visualized at the bottom-right corner.
Each node of this Argumentation graph (nodes are called arguments) represents a specific portion of the search space; the legend is visualized at the bottom-left corner.
For instance:
- A1, A3, A5, A7, and A9 represent all the possible pipelines for the Decision Tree algorithm;
- A2, A4, A6, A8, and A10 represent all the possible pipelines for the K-Nearest Neighbor algorithm.
Besides, each constraint is represented as an argument as well: the node (argument) A0 represents the user-defined constraint `c1`. Edges are attacks from one argument to another (`c1` attacks exactly the pipelines in which Normalization occurs along with the Decision Tree).
By hitting the `Run AutoML` button, HAMLET triggers FLAML to explore the encoded search space, also taking into consideration the specified constraints (discouraging the exploration of those particular sub-spaces).
At the end of the optimization, the user can switch to the `Data` tab to go through all the explored configurations.
As for the last tab, `AutoML arguments`, we can see recommendations of constraints mined from the AutoML output.
We think of this process as an argument between the data scientist and the AutoML tool: the data scientist can consider the arguments at hand and encode them into the LogicalKB.
At this point, the next iteration can be performed.
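For instance (a purely illustrative rule using the hypothetical classification operator name `knn`, not an actual HAMLET recommendation), a mined suggestion discouraging Discretization for K-Nearest Neighbors could be appended to the LogicalKB before the next iteration as:

```
c2 :=> forbidden([discretization], knn).
```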
We strove to develop a logical language that is as intuitive as possible:
- `step(S).` specifies a step `S` of the pipeline, with `S` in [`discretization`, `normalization`, `rebalancing`, `imputation`, `features`, `classification`];
- `operator(S, O).` specifies an operator `O` for the step `S`, with `O` in [`kbins`, `binarizer`, `power_transformer`, `robust_scaler`, `standard`, `minmax`, `select_k_best`, `pca`, `simple_imputer`, `iterative_imputer`, `near_miss`, `smote`];
- `hyperparameter(O, H, T).` specifies a hyper-parameter `H` for the operator `O` with type `T`; `H` can be any hyper-parameter name of the chosen Scikit-learn operator `O`, and `T` is chosen accordingly and has to be in [`randint`, `choice`, `uniform`];
- `domain(O, H, D).` specifies the domain `D` of the hyper-parameter `H` of the operator `O`; `D` is an array in `[ ... ]` brackets containing the values that the hyper-parameter `H` can assume (in case of `randint` and `uniform`, the array has to contain just two elements: the boundaries of the range);
- `id :=> mandatory_order([S1, S2], O1).` specifies a `mandatory_order` constraint: the step `S1` has to appear before the step `S2` when the operator `O1` of the task step occurs (in this implementation we support only the `classification` task); it is possible to put `classification` instead of `O1`, which applies the constraint to every `classification` operator;
- `id :=> mandatory([S1, S2, ...], O1).` specifies a `mandatory` constraint: the steps `[S1, S2, ...]` are mandatory when the operator `O1` of the task step occurs (in this implementation we support only the `classification` task); if the array of steps is empty, the constraint only specifies that `O1` is mandatory (with or without Data Pre-processing steps);
- `id :=> forbidden([S1, S2, ...], O1).` specifies a `forbidden` constraint: the steps `[S1, S2, ...]` are forbidden when the operator `O1` of the task step occurs (in this implementation we support only the `classification` task); if the array of steps is empty, the constraint only specifies that `O1` is forbidden (with or without Data Pre-processing steps).
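For illustration, constraints following this syntax could look like the sketch below; the constraint ids and the classification operator name `dt` are placeholders of ours, not identifiers taken from the shipped knowledge bases:

```
o1 :=> mandatory_order([discretization, normalization], classification).
m1 :=> mandatory([imputation], classification).
f1 :=> forbidden([normalization, features], dt).
```

The first rule states that, for every classification operator, Discretization (if present) must precede Normalization; the second makes Imputation mandatory for every classification operator; the third forbids the Normalization and features steps when the (hypothetical) `dt` operator is used.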