From 9344722aa04f35a8127ce2592a8db96dfdf16f40 Mon Sep 17 00:00:00 2001 From: guempelp0 Date: Tue, 25 Apr 2023 14:33:07 +0200 Subject: [PATCH 1/9] Added glossary.md and filled it with some first entries --- docs/glossary.md | 35 +++++++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 36 insertions(+) create mode 100644 docs/glossary.md diff --git a/docs/glossary.md b/docs/glossary.md new file mode 100644 index 000000000..33aceb8a5 --- /dev/null +++ b/docs/glossary.md @@ -0,0 +1,35 @@ +# Glossary for important terms in Safe-DS + +## API +**A**pplication **P**rogramming **I**nterface
+An API allows independent applications to communicate with each other and exchange data. + +## Decision Tree +A Decision Tree represents the process of conditional evaluation in a tree diagram. + +## One Hot Encoder +If a column's entries consist of a non-numerical data type, using a One Hot Encoder will create +a new column for each different entry, filling it with a '1' in the respective places, '0' otherwise. + +## Machine Learning (ML) +Machine Learning is a generic term for artificially generating knowledge through experience. +To achieve this, one can choose between a variety of model options. + +## Random Forest +Random Forest is an ML model that works by generating decision trees at random. + +## Supervised Learning +Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data. +Those Algorithms might be able to find hidden meaning in data - without being told where to look. + +## Tagged Table +In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that +an applied algorithm will train to predict its entries. + +## Regression +Regression refers to the estimation of continuous dependent variables. 
+ + + + + diff --git a/mkdocs.yml b/mkdocs.yml index 3b63c79e9..1c09f41d5 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,6 +11,7 @@ nav: - Data Visualization: tutorials/data_visualization.ipynb - Machine Learning: tutorials/machine_learning.ipynb - API Reference: reference/ + - Glossary: docs/glossary.md - Development: - Environment: development/environment.md - Guidelines: development/guidelines.md From 5d8ec7b122b87bbf92252a9c541bd45268678f5e Mon Sep 17 00:00:00 2001 From: patrikguempel Date: Fri, 5 May 2023 22:39:28 +0200 Subject: [PATCH 2/9] configures Snippets extension, further added to and reformatted the glossary.md --- {docs => includes}/glossary.md | 25 +++++++++++++++++-------- mkdocs.yml | 5 ++++- 2 files changed, 21 insertions(+), 9 deletions(-) rename {docs => includes}/glossary.md (69%) diff --git a/docs/glossary.md b/includes/glossary.md similarity index 69% rename from docs/glossary.md rename to includes/glossary.md index 33aceb8a5..90525c6aa 100644 --- a/docs/glossary.md +++ b/includes/glossary.md @@ -1,35 +1,44 @@ # Glossary for important terms in Safe-DS -## API +*[API]: **A**pplication **P**rogramming **I**nterface
An API allows independent applications to communicate with each other and exchange data. -## Decision Tree +*[Decision Tree]: A Decision Tree represents the process of conditional evaluation in a tree diagram. -## One Hot Encoder +*[One Hot Encoder]: If a column's entries consist of a non-numerical data type, using a One Hot Encoder will create a new column for each different entry, filling it with a '1' in the respective places, '0' otherwise. -## Machine Learning (ML) +*[Machine Learning (ML)]: Machine Learning is a generic term for artificially generating knowledge through experience. To achieve this, one can choose between a variety of model options. -## Random Forest +*[Random Forest]: Random Forest is an ML model that works by generating decision trees at random. -## Supervised Learning +*[Supervised Learning]: Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data. Those Algorithms might be able to find hidden meaning in data - without being told where to look. -## Tagged Table +*[Tagged Table]: In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that an applied algorithm will train to predict its entries. -## Regression +*[Regression]: Regression refers to the estimation of continuous dependent variables. +*[Precision]: +The ability of a classification model to identify only the relevant data points. +*[Recall]: +The ability of a classification model to identify all the relevant data points. +*[Accuracy]: +The fraction of predictions a classification model has correctly identified. + +*[F1-Score]: +Combines 'precision' and 'recall' using the harmonic mean. 
diff --git a/mkdocs.yml b/mkdocs.yml index 1c09f41d5..ca8c29a6a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -86,7 +86,10 @@ markdown_extensions: - pymdownx.highlight: anchor_linenums: true - pymdownx.inlinehilite - - pymdownx.snippets + - pymdownx.snippets: + auto_append: + - includes/glossary.md + # Diagrams - pymdownx.superfences: From 77294a9055fadeca6c535838ebae1c1caa91c83c Mon Sep 17 00:00:00 2001 From: patrikguempel Date: Tue, 9 May 2023 02:25:23 +0200 Subject: [PATCH 3/9] fixed link in mkdocs.yml and included more terms in the glossary.md --- includes/glossary.md | 66 +++++++++++++++++++++++++++++++++++++++++++- mkdocs.yml | 2 +- 2 files changed, 66 insertions(+), 2 deletions(-) diff --git a/includes/glossary.md b/includes/glossary.md index 90525c6aa..962ddbfab 100644 --- a/includes/glossary.md +++ b/includes/glossary.md @@ -30,15 +30,79 @@ an applied algorithm will train to predict its entries. Regression refers to the estimation of continuous dependent variables. *[Precision]: -The ability of a classification model to identify only the relevant data points. +The ability of a classification model to identify only the relevant data points.
+Formula: $\frac{\text{True Positives}}{\text{True Positives + False Positives}}$ *[Recall]: The ability of a classification model to identify all the relevant data points. +Formula: $\frac{\text{True Positives}}{\text{True Positives + False Negatives}}$ *[Accuracy]: The fraction of predictions a classification model has correctly identified. +Formula: $\frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}}$ *[F1-Score]: Combines 'precision' and 'recall' using the harmonic mean. +*[Metric]: +A data metric is an aggregated calculation within a raw dataset. + +*[Underfitting]: +Underfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, +due to generalizing too much. + +*[Overfitting]: +Overfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, +due to not generalizing enough. + +*[Classification]: +Classification refers to dividing a data set into multiple chunks, which are then considered "classes". + +*[Regularization]: +Regularization refers to techniques that are used to calibrate machine learning models +in order to minimize the adjusted loss function and prevent overfitting or underfitting. + +*[Linear Regression]: +Linear Regression is the supervised Machine Learning model in which the model finds the best fit linear line between the independent and dependent variable +i.e it finds the linear relationship between the dependent and independent variable. + +*[Feature]: +Each feature represents a measurable piece of data that can be used for analysis. +It is analogous to a column within a table. + +*[Target]: +The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. + +*[Sample]: +A sample is a subset of the whole data set. +It is analyzed to uncover the meaningful information in the larger data set. 
+ +*[Training Set]: +A set of examples used for learning, that is to fit the parameters of the classifier. + +*[Validation Set]: +A set of examples used to the parameters of a classifier. + +*[Test Set]: +A set of examples used only to assess the performance of a fully-specified classifier. + +*[Positive Class]: +The "Positive Class" consists of all attributes to be considered positive. Consequently, every attribute to not be in this class is considered to be negative class. + +*[True Positive (TP)]: +An outcome is considered to be a true positive, if the data model has correctly predicted a value of positive class. + +*[True Negative (TN)]: +An outcome is considered to be a true negative, if the data model has correctly predicted a value of negative class. + +*[False Positive (FP)]: +An outcome is considered to be a false positive, if the data model has mistakenly predicted a value of positive class. + +*[False Negative (FN)]: +An outcome is considered to be a false negative, if the data model has mistakenly predicted a value of negative class. + +*[Confusion Matrix] +A confusion matrix is a table that is used to define the performance of a classification algorithm. +It classifies the predictions to be either be true positive, true negative, false positive or false negative. 
+ diff --git a/mkdocs.yml b/mkdocs.yml index ca8c29a6a..08454deef 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,7 +11,7 @@ nav: - Data Visualization: tutorials/data_visualization.ipynb - Machine Learning: tutorials/machine_learning.ipynb - API Reference: reference/ - - Glossary: docs/glossary.md + - Glossary: includes/glossary.md - Development: - Environment: development/environment.md - Guidelines: development/guidelines.md From bb9aa108ce59aa0024ba418ee23f8d117b6a78a5 Mon Sep 17 00:00:00 2001 From: Lars Reimann Date: Tue, 9 May 2023 21:31:08 +0200 Subject: [PATCH 4/9] docs: link to glossary instead of including it --- {includes => docs}/glossary.md | 0 mkdocs.yml | 6 ++---- 2 files changed, 2 insertions(+), 4 deletions(-) rename {includes => docs}/glossary.md (100%) diff --git a/includes/glossary.md b/docs/glossary.md similarity index 100% rename from includes/glossary.md rename to docs/glossary.md diff --git a/mkdocs.yml b/mkdocs.yml index a09b6368e..b1c1dd224 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,7 +11,7 @@ nav: - Data Visualization: tutorials/data_visualization.ipynb - Machine Learning: tutorials/machine_learning.ipynb - API Reference: reference/ - - Glossary: includes/glossary.md + - Glossary: glossary.md - Development: - Environment: development/environment.md - Guidelines: development/guidelines.md @@ -87,9 +87,7 @@ markdown_extensions: - pymdownx.highlight: anchor_linenums: true - pymdownx.inlinehilite - - pymdownx.snippets: - auto_append: - - includes/glossary.md + - pymdownx.snippets # Diagrams From 81c076507db4455d614ea006d669eba1abaabc78 Mon Sep 17 00:00:00 2001 From: Lars Reimann Date: Tue, 9 May 2023 21:41:57 +0200 Subject: [PATCH 5/9] docs: LaTeX formatting --- docs/glossary.md | 93 +++++++++++++++++++++++++++--------------------- 1 file changed, 52 insertions(+), 41 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index 962ddbfab..a7ce8b1ae 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,108 +1,119 @@ -# Glossary 
for important terms in Safe-DS +# Glossary -*[API]: +## API **A**pplication **P**rogramming **I**nterface
An API allows independent applications to communicate with each other and exchange data. -*[Decision Tree]: +## Decision Tree A Decision Tree represents the process of conditional evaluation in a tree diagram. -*[One Hot Encoder]: +## One Hot Encoder If a column's entries consist of a non-numerical data type, using a One Hot Encoder will create a new column for each different entry, filling it with a '1' in the respective places, '0' otherwise. -*[Machine Learning (ML)]: +## Machine Learning (ML) Machine Learning is a generic term for artificially generating knowledge through experience. To achieve this, one can choose between a variety of model options. -*[Random Forest]: +## Random Forest Random Forest is an ML model that works by generating decision trees at random. -*[Supervised Learning]: +## Supervised Learning Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data. Those Algorithms might be able to find hidden meaning in data - without being told where to look. -*[Tagged Table]: +## Tagged Table In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that an applied algorithm will train to predict its entries. -*[Regression]: +## Regression Regression refers to the estimation of continuous dependent variables. -*[Precision]: -The ability of a classification model to identify only the relevant data points.
-Formula: $\frac{\text{True Positives}}{\text{True Positives + False Positives}}$ +## Precision +The ability of a classification model to identify only the relevant data points. Formula: -*[Recall]: -The ability of a classification model to identify all the relevant data points. -Formula: $\frac{\text{True Positives}}{\text{True Positives + False Negatives}}$ +$$ +\text{precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}} +$$ -*[Accuracy]: -The fraction of predictions a classification model has correctly identified. -Formula: $\frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}}$ +## Recall +The ability of a classification model to identify all the relevant data points. Formula: -*[F1-Score]: -Combines 'precision' and 'recall' using the harmonic mean. +$$ +\text{recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} +$$ -*[Metric]: +## Accuracy +The fraction of predictions a classification model has correctly identified. Formula: + +$$ +\text{accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}} +$$ + +## F1-Score +The harmonic mean of [precision](#precision) and [recall](#recall). Formula: + +$$ +f_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} +$$ + +## Metric A data metric is an aggregated calculation within a raw dataset. -*[Underfitting]: +## Underfitting Underfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, due to generalizing too much. -*[Overfitting]: +## Overfitting Overfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, due to not generalizing enough. -*[Classification]: +## Classification Classification refers to dividing a data set into multiple chunks, which are then considered "classes". 
-*[Regularization]: +## Regularization Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting. -*[Linear Regression]: +## Linear Regression Linear Regression is the supervised Machine Learning model in which the model finds the best fit linear line between the independent and dependent variable -i.e it finds the linear relationship between the dependent and independent variable. +i.e. it finds the linear relationship between the dependent and independent variable. -*[Feature]: +## Feature Each feature represents a measurable piece of data that can be used for analysis. It is analogous to a column within a table. -*[Target]: +## Target The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. -*[Sample]: +## Sample A sample is a subset of the whole data set. It is analyzed to uncover the meaningful information in the larger data set. -*[Training Set]: +## Training Set A set of examples used for learning, that is to fit the parameters of the classifier. -*[Validation Set]: +## Validation Set A set of examples used to the parameters of a classifier. -*[Test Set]: +## Test Set A set of examples used only to assess the performance of a fully-specified classifier. -*[Positive Class]: +## Positive Class The "Positive Class" consists of all attributes to be considered positive. Consequently, every attribute to not be in this class is considered to be negative class. -*[True Positive (TP)]: +## True Positive (TP) An outcome is considered to be a true positive, if the data model has correctly predicted a value of positive class. -*[True Negative (TN)]: +## True Negative (TN) An outcome is considered to be a true negative, if the data model has correctly predicted a value of negative class. 
-*[False Positive (FP)]: +## False Positive (FP) An outcome is considered to be a false positive, if the data model has mistakenly predicted a value of positive class. -*[False Negative (FN)]: +## False Negative (FN) An outcome is considered to be a false negative, if the data model has mistakenly predicted a value of negative class. -*[Confusion Matrix] +## Confusion Matrix A confusion matrix is a table that is used to define the performance of a classification algorithm. It classifies the predictions to be either be true positive, true negative, false positive or false negative. - - From 730a00c8207dcdf3c1d8937c45983f85286d2a14 Mon Sep 17 00:00:00 2001 From: Lars Reimann Date: Tue, 9 May 2023 21:46:16 +0200 Subject: [PATCH 6/9] docs: sort alphabetically --- docs/glossary.md | 134 +++++++++++++++++++++++------------------------ 1 file changed, 67 insertions(+), 67 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index a7ce8b1ae..49996d5a0 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,33 +1,64 @@ # Glossary +## Accuracy +The fraction of predictions a classification model has correctly identified. Formula: + +$$ +\text{accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}} +$$ + ## API **A**pplication **P**rogramming **I**nterface
An API allows independent applications to communicate with each other and exchange data. +## Classification +Classification refers to dividing a data set into multiple chunks, which are then considered "classes". + +## Confusion Matrix +A confusion matrix is a table that is used to define the performance of a classification algorithm. +It classifies the predictions to be either be true positive, true negative, false positive or false negative. + ## Decision Tree A Decision Tree represents the process of conditional evaluation in a tree diagram. -## One Hot Encoder -If a column's entries consist of a non-numerical data type, using a One Hot Encoder will create -a new column for each different entry, filling it with a '1' in the respective places, '0' otherwise. +## F1-Score +The harmonic mean of [precision](#precision) and [recall](#recall). Formula: + +$$ +f_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} +$$ + +## False Negative (FN) +An outcome is considered to be a false negative, if the data model has mistakenly predicted a value of negative class. + +## False Positive (FP) +An outcome is considered to be a false positive, if the data model has mistakenly predicted a value of positive class. + +## Feature +Each feature represents a measurable piece of data that can be used for analysis. +It is analogous to a column within a table. + +## Linear Regression +Linear Regression is the supervised Machine Learning model in which the model finds the best fit linear line between the independent and dependent variable +i.e. it finds the linear relationship between the dependent and independent variable. ## Machine Learning (ML) Machine Learning is a generic term for artificially generating knowledge through experience. To achieve this, one can choose between a variety of model options. -## Random Forest -Random Forest is an ML model that works by generating decision trees at random. 
+## Metric +A data metric is an aggregated calculation within a raw dataset. -## Supervised Learning -Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data. -Those Algorithms might be able to find hidden meaning in data - without being told where to look. +## One Hot Encoder +If a column's entries consist of a non-numerical data type, using a One Hot Encoder will create +a new column for each different entry, filling it with a '1' in the respective places, '0' otherwise. -## Tagged Table -In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that -an applied algorithm will train to predict its entries. +## Overfitting +Overfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, +due to not generalizing enough. -## Regression -Regression refers to the estimation of continuous dependent variables. +## Positive Class +The "Positive Class" consists of all attributes to be considered positive. Consequently, every attribute to not be in this class is considered to be negative class. ## Precision The ability of a classification model to identify only the relevant data points. Formula: @@ -36,6 +67,9 @@ $$ \text{precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}} $$ +## Random Forest +Random Forest is an ML model that works by generating decision trees at random. + ## Recall The ability of a classification model to identify all the relevant data points. Formula: @@ -43,77 +77,43 @@ $$ \text{recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} $$ -## Accuracy -The fraction of predictions a classification model has correctly identified. Formula: - -$$ -\text{accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}} -$$ - -## F1-Score -The harmonic mean of [precision](#precision) and [recall](#recall). 
Formula: - -$$ -f_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} -$$ - -## Metric -A data metric is an aggregated calculation within a raw dataset. - -## Underfitting -Underfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, -due to generalizing too much. - -## Overfitting -Overfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, -due to not generalizing enough. - -## Classification -Classification refers to dividing a data set into multiple chunks, which are then considered "classes". +## Regression +Regression refers to the estimation of continuous dependent variables. ## Regularization Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting. -## Linear Regression -Linear Regression is the supervised Machine Learning model in which the model finds the best fit linear line between the independent and dependent variable -i.e. it finds the linear relationship between the dependent and independent variable. - -## Feature -Each feature represents a measurable piece of data that can be used for analysis. -It is analogous to a column within a table. - -## Target -The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. - ## Sample A sample is a subset of the whole data set. It is analyzed to uncover the meaningful information in the larger data set. -## Training Set -A set of examples used for learning, that is to fit the parameters of the classifier. +## Supervised Learning +Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data. +Those Algorithms might be able to find hidden meaning in data - without being told where to look. 
-## Validation Set -A set of examples used to the parameters of a classifier. +## Tagged Table +In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that +an applied algorithm will train to predict its entries. + +## Target +The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. ## Test Set A set of examples used only to assess the performance of a fully-specified classifier. -## Positive Class -The "Positive Class" consists of all attributes to be considered positive. Consequently, every attribute to not be in this class is considered to be negative class. - -## True Positive (TP) -An outcome is considered to be a true positive, if the data model has correctly predicted a value of positive class. +## Training Set +A set of examples used for learning, that is to fit the parameters of the classifier. ## True Negative (TN) An outcome is considered to be a true negative, if the data model has correctly predicted a value of negative class. -## False Positive (FP) -An outcome is considered to be a false positive, if the data model has mistakenly predicted a value of positive class. +## True Positive (TP) +An outcome is considered to be a true positive, if the data model has correctly predicted a value of positive class. -## False Negative (FN) -An outcome is considered to be a false negative, if the data model has mistakenly predicted a value of negative class. +## Underfitting +Underfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, +due to generalizing too much. -## Confusion Matrix -A confusion matrix is a table that is used to define the performance of a classification algorithm. -It classifies the predictions to be either be true positive, true negative, false positive or false negative. +## Validation Set +A set of examples used to the parameters of a classifier. 
From 70dd262dec8682914e0779a4d68399a70f0677a7 Mon Sep 17 00:00:00 2001 From: Lars Reimann Date: Tue, 9 May 2023 21:49:56 +0200 Subject: [PATCH 7/9] docs: link to linear regression model --- docs/glossary.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index 49996d5a0..2516a75a7 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -7,8 +7,7 @@ $$ \text{accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}} $$ -## API -**A**pplication **P**rogramming **I**nterface
+## Application Programming Interface (API) An API allows independent applications to communicate with each other and exchange data. ## Classification @@ -42,6 +41,8 @@ It is analogous to a column within a table. Linear Regression is the supervised Machine Learning model in which the model finds the best fit linear line between the independent and dependent variable i.e. it finds the linear relationship between the dependent and independent variable. +Implemented in Safe-DS as [LinearRegression][safeds.ml.classical.regression.LinearRegression]. + ## Machine Learning (ML) Machine Learning is a generic term for artificially generating knowledge through experience. To achieve this, one can choose between a variety of model options. From 9e7cc08df27ece6837a3367a53f13abb446bf2ab Mon Sep 17 00:00:00 2001 From: patrikguempel Date: Fri, 2 Jun 2023 21:04:18 +0200 Subject: [PATCH 8/9] added cross-references as well as code-references --- docs/glossary.md | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index 2516a75a7..72b2c93d4 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,10 +1,10 @@ # Glossary ## Accuracy -The fraction of predictions a classification model has correctly identified. Formula: +The fraction of predictions a [classification](#classification) model has correctly identified. Formula: $$ -\text{accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}} +\text{accuracy} = \frac{\text{[True Positives](#true-positive-tp) + [True Negatives](#true-negative-tn)}}{\text{Total amount of data points}} $$ ## Application Programming Interface (API) @@ -14,12 +14,15 @@ An API allows independent applications to communicate with each other and exchan Classification refers to dividing a data set into multiple chunks, which are then considered "classes". 
## Confusion Matrix -A confusion matrix is a table that is used to define the performance of a classification algorithm. -It classifies the predictions to be either be true positive, true negative, false positive or false negative. +A confusion matrix is a table that is used to define the performance of a [classification](#classification) algorithm. +It classifies each prediction as either [true positive](#true-positive-tp), [true negative](#true-negative-tn), +[false positive](#false-positive-fp) or [false negative](#false-negative-fn). ## Decision Tree A Decision Tree represents the process of conditional evaluation in a tree diagram. +Implemented in Safe-DS as [Decision Tree][safeds.ml.classical.classification.DecisionTree]. + ## F1-Score The harmonic mean of [precision](#precision) and [recall](#recall). Formula: @@ -54,6 +57,8 @@ A data metric is an aggregated calculation within a raw dataset. If a column's entries consist of a non-numerical data type, using a One Hot Encoder will create a new column for each different entry, filling it with a '1' in the respective places, '0' otherwise. +Implemented in Safe-DS as [OneHotEncoder][safeds.data.tabular.transformation.OneHotEncoder]. + ## Overfitting Overfitting is a scenario in which a data model is unable to capture the relationship between the input and output variables accurately, due to not generalizing enough. @@ -62,20 +67,22 @@ due to not generalizing enough. The "Positive Class" consists of all attributes to be considered positive. Consequently, every attribute to not be in this class is considered to be negative class. ## Precision -The ability of a classification model to identify only the relevant data points. Formula: +The ability of a [classification](#classification) model to identify only the relevant data points.
Formula: $$ -\text{precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}} +\text{precision} = \frac{\text{[True Positives](#true-positive-tp)}}{\text{[True Positives](#true-positive-tp) + [False Positives](#false-positive-fp)}} $$ ## Random Forest Random Forest is an ML model that works by generating decision trees at random. +Implemented in Safe-DS as [RandomForest][safeds.ml.classical.regression.RandomForest]. + ## Recall -The ability of a classification model to identify all the relevant data points. Formula: +The ability of a [classification](#classification) model to identify all the relevant data points. Formula: $$ -\text{recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} +\text{recall} = \frac{\text{[True Positives](#true-positive-tp)}}{\text{[True Positives](#true-positive-tp) + [False Negatives](#false-negative-fn)}} $$ ## Regression @@ -83,7 +90,7 @@ Regression refers to the estimation of continuous dependent variables. ## Regularization Regularization refers to techniques that are used to calibrate machine learning models -in order to minimize the adjusted loss function and prevent overfitting or underfitting. +in order to minimize the adjusted loss function and prevent [overfitting](#overfitting) or [underfitting](#underfitting). ## Sample A sample is a subset of the whole data set. @@ -95,16 +102,16 @@ Those Algorithms might be able to find hidden meaning in data - without being to ## Tagged Table In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that -an applied algorithm will train to predict its entries. +an applied algorithm will train to predict its entries. The marked column is referred to as ["target"](#target). ## Target The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. ## Test Set -A set of examples used only to assess the performance of a fully-specified classifier. 
+A set of examples used only to assess the performance of a fully-specified [classifier](#classification). ## Training Set -A set of examples used for learning, that is to fit the parameters of the classifier. +A set of examples used for learning, that is to fit the parameters of the [classifier](#classification). ## True Negative (TN) An outcome is considered to be a true negative, if the data model has correctly predicted a value of negative class. @@ -117,4 +124,4 @@ Underfitting is a scenario in which a data model is unable to capture the relati due to generalizing too much. ## Validation Set -A set of examples used to the parameters of a classifier. +A set of examples used to tune the parameters of a [classifier](#classification). From 11ba806908f75734cb7b42e16336b75221229c61 Mon Sep 17 00:00:00 2001 From: patrikguempel Date: Fri, 23 Jun 2023 16:39:06 +0200 Subject: [PATCH 9/9] pulled references out of latex code --- docs/glossary.md | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index 72b2c93d4..f0afeb80e 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -4,9 +4,13 @@ The fraction of predictions a [classification](#classification) model has correctly identified. Formula: $$ -\text{accuracy} = \frac{\text{[True Positives](#true-positive-tp) + [True Negatives](#true-negative-tn)}}{\text{Total amount of data points}} +\text{accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total amount of data points}} $$ +See here for respective definitions: +[True Positives](#true-positive-tp) +[True Negatives](#true-negative-tn) + ## Application Programming Interface (API) An API allows independent applications to communicate with each other and exchange data. @@ -70,9 +74,13 @@ The "Positive Class" consists of all attributes to be considered positive. Conse The ability of a [classification](#classification) model to identify only the relevant data points.
Formula: $$ -\text{precision} = \frac{\text{[True Positives](#true-positive-tp)}}{\text{[True Positives](#true-positive-tp) + [False Positives](#false-positive-fp)}} +\text{precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}} $$ +See here for respective references: +[True Positives](#true-positive-tp) +[False Positives](#false-positive-fp) + ## Random Forest Random Forest is an ML model that works by generating decision trees at random. @@ -82,9 +90,13 @@ Implemented in Safe-DS as [RandomForest][safeds.ml.classical.regression.RandomFo The ability of a [classification](#classification) model to identify all the relevant data points. Formula: $$ -\text{recall} = \frac{\text{[True Positives](#true-positive-tp)}}{\text{[True Positives](#true-positive-tp) + [False Negatives](#false-negative-fn)}} +\text{recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} $$ +See here for respective references: +[True Positives](#true-positive-tp) +[False Negatives](#false-negative-fn) + ## Regression Regression refers to the estimation of continuous dependent variables.
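
The accuracy, precision, recall and F1 formulas in the glossary can be checked with a small worked example. The following is a minimal Python sketch, not part of the patch or of Safe-DS. The function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration only.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute the glossary's metrics from confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives and false negatives. Hypothetical helper, not a Safe-DS API.
    """

    def safe_div(numerator: float, denominator: float) -> float:
        # Assumed convention: a metric with a zero denominator is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    total = tp + tn + fp + fn
    accuracy = safe_div(tp + tn, total)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Example counts: 40 TP, 45 TN, 5 FP, 10 FN (100 data points in total).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and F1 simplifies to 80/95, which matches the harmonic-mean formula given under F1-Score.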