Add train scores for ludwig in the create function handler. (#1342)
In the previous commit, we added changes to display the train scores and train times for XGBoost. This commit adds similar changes to the Ludwig integration.

---------

Co-authored-by: Jineet Desai <[email protected]>
Co-authored-by: Andy Xu <[email protected]>
3 people authored Nov 10, 2023
1 parent 7a49bec commit 1c78b22
Showing 4 changed files with 23 additions and 7 deletions.
4 changes: 2 additions & 2 deletions docs/source/overview/concepts.rst
@@ -92,6 +92,6 @@ After registering ``MnistImageClassifier`` function, you can call the function i
 AI-Centric Query Optimization
 -----------------------------
 
-EvaDB optimizes the AI queries to save money spent on running models and reduce query execution time. It contains a novel `Cascades-style query optimizer <https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Cascades-graefe.pdf>`__ tailored for AI queries.
+EvaDB optimizes the AI queries to save money spent on running models and reduce query execution time. It contains a novel `Cascades-style query optimizer <https://faculty.cc.gatech.edu/~jarulraj/courses/8803-s21/slides/22-cascades.pdf>`__ tailored for AI queries.
 
-Query optimization has powered SQL database systems for several decades. It is the bridge that connects the declarative query language to efficient query execution on hardware. EvaDB accelerates AI queries using a collection of optimizations detailed in the :ref:`optimizations<optimizations>` page.
\ No newline at end of file
+Query optimization has powered SQL database systems for several decades. It is the bridge that connects the declarative query language to efficient query execution on hardware. EvaDB accelerates AI queries using a collection of optimizations detailed in the :ref:`optimizations<optimizations>` page.
4 changes: 2 additions & 2 deletions docs/source/reference/ai/model-forecasting.rst
@@ -58,7 +58,7 @@ EvaDB's default forecast framework is `statsforecast <https://nixtla.github.io/s
   * - LIBRARY (str, default: 'statsforecast')
     - We can select one of `statsforecast` (default) or `neuralforecast`. `statsforecast` provides access to statistical forecasting methods, while `neuralforecast` gives access to deep-learning based forecasting methods.
   * - MODEL (str, default: 'ARIMA')
-    - If LIBRARY is `statsforecast`, we can select one of ARIMA, ting, ETS, Theta. The default is ARIMA. Check `Automatic Forecasting <https://nixtla.github.io/statsforecast/src/core/models_intro.html#automatic-forecasting>`_ to learn details about these models. If LIBRARY is `neuralforecast`, we can select one of NHITS or NBEATS. The default is NBEATS. Check `NBEATS docs <https://nixtla.github.io/neuralforecast/models.nbeats.html>`_ for details.
+    - If LIBRARY is `statsforecast`, we can select one of ARIMA, ting, ETS, Theta. The default is ARIMA. Check `Automatic Forecasting <https://nixtla.mintlify.app/statsforecast/index.html#automatic-forecasting>`_ to learn details about these models. If LIBRARY is `neuralforecast`, we can select one of NHITS or NBEATS. The default is NBEATS. Check `NBEATS docs <https://nixtla.github.io/neuralforecast/models.nbeats.html>`_ for details.
   * - AUTO (str, default: 'T')
     - If set to 'T', it enables automatic hyperparameter optimization. Must be set to 'T' for `statsforecast` library. One may set this parameter to `false` if LIBRARY is `neuralforecast` for faster (but less reliable) results.
   * - Frequency (str, default: 'auto')
@@ -90,4 +90,4 @@ Below is an example query with `neuralforecast` with `trend` column as exogenous
    PREDICT 'y'
    LIBRARY 'neuralforecast'
    AUTO 'f'
-   FREQUENCY 'M';
\ No newline at end of file
+   FREQUENCY 'M';
2 changes: 1 addition & 1 deletion docs/source/reference/databases/github.rst
@@ -19,7 +19,7 @@ Required:
 
 Optional:
 
-* ``github_token`` is not required for public repositories. However, the rate limit is lower without a valid github_token. Check the `Rate limits page <https://docs.github.com/en/rest/overview/resources-in-the-rest-api>`_ to learn more about how to check your rate limit status. Check `Managing your personal access tokens page <https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens>`_ to learn how to create personal access tokens.
+* ``github_token`` is not required for public repositories. However, the rate limit is lower without a valid github_token. Check the `Rate limits page <https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api?apiVersion=2022-11-28>`_ to learn more about how to check your rate limit status. Check `Managing your personal access tokens page <https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens>`_ to learn how to create personal access tokens.
 
 Create Connection
 -----------------
20 changes: 18 additions & 2 deletions evadb/executor/create_function_executor.py
@@ -18,6 +18,7 @@
 import os
 import pickle
 import re
+import time
 from pathlib import Path
 from typing import Dict, List

@@ -125,6 +126,7 @@ def handle_ludwig_function(self):
         aggregated_batch.drop_column_alias()
 
         arg_map = {arg.key: arg.value for arg in self.node.metadata}
+        start_time = int(time.time())
         auto_train_results = auto_train(
             dataset=aggregated_batch.frames,
             target=arg_map["predict"],
@@ -134,11 +136,13 @@ def handle_ludwig_function(self):
                 "tmp_dir"
             ),
         )
+        train_time = int(time.time()) - start_time
         model_path = os.path.join(
             self.db.catalog().get_configuration_catalog_value("model_dir"),
             self.node.name,
         )
         auto_train_results.best_model.save(model_path)
+        best_score = auto_train_results.experiment_analysis.best_result["metric_score"]
         self.node.metadata.append(
             FunctionMetadataCatalogEntry("model_path", model_path)
         )
@@ -151,6 +155,8 @@ def handle_ludwig_function(self):
             self.node.function_type,
             io_list,
             self.node.metadata,
+            best_score,
+            train_time,
         )
 
     def handle_sklearn_function(self):
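The hunks above bracket the Ludwig `auto_train` call with wall-clock timestamps. That pattern, integer-second timing around an arbitrary training callable, can be sketched on its own; the `train_with_timing` helper and the lambda trainer below are illustrative stand-ins, not EvaDB or Ludwig APIs:

```python
import time

def train_with_timing(train_fn, *args, **kwargs):
    """Run a training callable and return (result, elapsed_seconds).

    Mirrors the commit's pattern: int(time.time()) is sampled before and
    after the call, so sub-second durations round down to 0.
    """
    start_time = int(time.time())
    result = train_fn(*args, **kwargs)
    train_time = int(time.time()) - start_time
    return result, train_time

# Hypothetical stand-in for auto_train(...) / model.fit(...).
model, elapsed = train_with_timing(lambda: "trained-model")
print(model, elapsed)
```

Note that integer truncation means a fast run can report 0 secs; `time.monotonic()` is the usual choice for elapsed-time measurement, but the sketch keeps the commit's convention.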
@@ -178,7 +184,10 @@ def handle_sklearn_function(self):
         model = LinearRegression()
         Y = aggregated_batch.frames[arg_map["predict"]]
         aggregated_batch.frames.drop([arg_map["predict"]], axis=1, inplace=True)
+        start_time = int(time.time())
         model.fit(X=aggregated_batch.frames, y=Y)
+        train_time = int(time.time()) - start_time
+        score = model.score(X=aggregated_batch.frames, y=Y)
         model_path = os.path.join(
             self.db.catalog().get_configuration_catalog_value("model_dir"),
             self.node.name,
@@ -200,6 +209,8 @@ def handle_sklearn_function(self):
             self.node.function_type,
             io_list,
             self.node.metadata,
+            score,
+            train_time,
         )
 
     def convert_to_numeric(self, x):
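In the sklearn hunk, `model.score` is called on the same frames the model was just fit on, so the reported value is a training-set R² (coefficient of determination), not a held-out estimate. A dependency-free sketch of what that score computes — the `r2_score` helper below is illustrative, not sklearn's implementation:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, as a regressor's .score returns:
    1 - SS_res / SS_tot, where SS_tot is variance around the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

y = [2.0, 4.0, 6.0, 8.0]
print(r2_score(y, y))                     # perfect fit -> 1.0
print(r2_score(y, [5.0, 5.0, 5.0, 5.0]))  # predicting the mean -> 0.0
```

Because the score is computed on training data, a near-1.0 value here signals fit quality, not generalization.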
@@ -241,9 +252,11 @@ def handle_xgboost_function(self):
             "estimator_list": ["xgboost"],
             "task": arg_map.get("task", DEFAULT_XGBOOST_TASK),
         }
+        start_time = int(time.time())
         model.fit(
             dataframe=aggregated_batch.frames, label=arg_map["predict"], **settings
         )
+        train_time = int(time.time()) - start_time
         model_path = os.path.join(
             self.db.catalog().get_configuration_catalog_value("model_dir"),
             self.node.name,
@@ -260,7 +273,6 @@ def handle_xgboost_function(self):
         impl_path = Path(f"{self.function_dir}/xgboost.py").absolute().as_posix()
         io_list = self._resolve_function_io(None)
         best_score = model.best_loss
-        train_time = model.best_config_train_time
         return (
             self.node.name,
             impl_path,
@@ -638,6 +650,8 @@ def exec(self, *args, **kwargs):
                 function_type,
                 io_list,
                 metadata,
+                best_score,
+                train_time,
             ) = self.handle_ludwig_function()
         elif string_comparison_case_insensitive(self.node.function_type, "Sklearn"):
             (
@@ -646,6 +660,8 @@ def exec(self, *args, **kwargs):
                 function_type,
                 io_list,
                 metadata,
+                best_score,
+                train_time,
             ) = self.handle_sklearn_function()
         elif string_comparison_case_insensitive(self.node.function_type, "XGBoost"):
             (
@@ -688,7 +704,7 @@ def exec(self, *args, **kwargs):
                     [
                         msg,
                         "Validation Score: " + str(best_score),
-                        "Training time: " + str(train_time),
+                        "Training time: " + str(train_time) + " secs.",
                     ]
                 )
             )
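With the handlers now returning `best_score` and `train_time`, the hunk above appends both to the user-facing status message. A minimal reproduction of that assembly, where the `msg` text and the numeric values are made-up placeholders (only the two appended lines mirror the diff):

```python
# Hypothetical values standing in for what a handler returns.
msg = "Function created successfully."
best_score = 0.92
train_time = 37

status = "\n".join(
    [
        msg,
        "Validation Score: " + str(best_score),
        "Training time: " + str(train_time) + " secs.",
    ]
)
print(status)
```

The " secs." suffix added by this commit makes the unit of the newly measured wall-clock time explicit in the output.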