
Adding Recipe Management #32

Merged: 42 commits, Jun 1, 2024

Commits
9ff181d
Protecting checked out folder
JanPeterDatakind May 14, 2024
f54c28c
Added recipe management
JanPeterDatakind May 16, 2024
f0220d8
Fixes to avoid linting check - fail
JanPeterDatakind May 16, 2024
9a38f73
More linting fixes
JanPeterDatakind May 16, 2024
c33fa8e
Merge pull request #31 from datakind/main
JanPeterDatakind May 16, 2024
cc0623f
Including recipe management instructions in ReadMe
JanPeterDatakind May 16, 2024
aa1ffda
Fixed recipe db seeding
JanPeterDatakind May 17, 2024
a629b1e
Added recipe creation skills & workflow in AutoGen
JanPeterDatakind May 21, 2024
d4c9d59
Recipes records now have embeddings
JanPeterDatakind May 21, 2024
95208ee
Update README.md
JanPeterDatakind May 22, 2024
0a71312
Tweaks to README
dividor May 22, 2024
94b6d1b
Merge branch 'feat/recipe-manager-new' of github.com:datakind/data-re…
dividor May 22, 2024
9537439
Updated AutoGen workflow and skills
JanPeterDatakind May 22, 2024
1b1f476
update demo data
JanPeterDatakind May 22, 2024
43d522d
Clean up recipes
dividor May 22, 2024
8e66c0d
Added manual recipe creation
JanPeterDatakind May 24, 2024
f030986
Simplified recipe check out instructions
dividor May 28, 2024
e20de79
Refactored recipes create into recipes manager
dividor May 28, 2024
d7b1b3d
Refactored recipes create into recipes manager
dividor May 28, 2024
53c8a82
Refactored recipes create into recipes manager
dividor May 28, 2024
7b59705
Refactored recipes create into recipes manager
dividor May 28, 2024
78cb7a7
Refactored recipes create into recipes manager
dividor May 28, 2024
bb5de11
Refactored recipes create into recipes manager
dividor May 28, 2024
56d6d48
Refactored recipes create into recipes manager
dividor May 28, 2024
deb301c
Added a flag to skip downloading if file exists
dividor May 28, 2024
1de50b4
Housekeeping
dividor May 28, 2024
9da6270
Some linting (pre commit hooks) & relative imports
JanPeterDatakind May 29, 2024
00a0ddd
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
a293e85
Merge branch 'feat/recipe-manager-new' of github.com:datakind/data-re…
dividor May 29, 2024
7a72c92
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
5f4a032
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
38e0260
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
2ff665c
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
aa10409
Fixes to ingestion so I can test AI team
dividor May 29, 2024
457f7ee
Added hapi folder to avoid ingestion error
JanPeterDatakind May 29, 2024
b72962c
Adjusted AutoGen assets - works now
JanPeterDatakind May 29, 2024
911c396
Fixes 4 SQLalchemy 1.4+ dataattribution in AutoGen
JanPeterDatakind May 29, 2024
4978e57
Fix for metadata
dividor May 31, 2024
7fe660d
Help for running recipes
dividor May 31, 2024
212ee0e
Fixed check out and -in
JanPeterDatakind Jun 1, 2024
5d3e2d5
Code quality fix
dividor Jun 1, 2024
630fcc9
Merge branch 'main' into feat/recipe-manager-new
dividor Jun 1, 2024
6 changes: 5 additions & 1 deletion .gitignore
@@ -18,5 +18,9 @@ ui/recipes-chat/recipesdb
instructions.txt
dataset_details.json
.DS_Store
recipes-management/checked_out
recipes-management/checked_out/*
!recipes-management/checked_out/.gitkeep
!recipes-management/checked_out/skills.py
recipes-management/files/
database.sqlite

75 changes: 73 additions & 2 deletions README.md
@@ -18,7 +18,7 @@ Given the rapidly changing landscape of LLMs, we have tried as much as possible

Data recipes supports datasources accessed via API, but in some cases it is preferable to ingest data in order to leverage LLM SQL capabilities. We include an initial set of data sources specific to Humanitarian Response in the ingestion module, which can be extended to include additional sources as required.

Finally, for reviewing/updating/creating new recipes, though we provide some experimental assistants that can generate and run recipes to semi-automate the process, in talking with developers and datascientists, most would prefer to use their existing environment for development, such as VS Code + GitHub Copilot. For this reason we are not developing a dedicated user interface for this, and intead provide a sync process that will allow recipe managers to check out and work on recipes locally, then publish them back into the recipes database for wider consumption. We do include however a autogen studio setup to be able to use agent teams to create recipes.
Finally, for reviewing/updating/creating new recipes, though we provide some experimental assistants that can generate and run recipes to semi-automate the process, in talking with developers and data scientists, most would prefer to use their existing environment for development, such as VS Code + GitHub Copilot. For this reason we are not developing a dedicated user interface for this, and instead provide a [sync process that will allow recipe managers to check out and work on recipes locally](#managing-recipes), then publish them back into the recipes database for wider consumption. We do, however, include an AutoGen Studio setup so that agent teams can be used to create recipes.

Some more discussion on design decisions can also be found [here](https://www.loom.com/share/500e960fd91c44c282076be4b0126461?sid=83af2d6c-622c-4bda-b21b-8f528d6eafba).

@@ -159,6 +159,77 @@ To create an additional plugin, perform the following steps:

As the robocorp actions might differ slightly, this can lead to differing requirements in the openapi spec and manifest files. The [LibreChat documentation](https://docs.librechat.ai/features/plugins/chatgpt_plugins_openapi.html) provides tips and examples to form the files correctly.

## Managing recipes

The management of recipes is part of the human-in-the-loop approach of this repo. New recipes are created with status 'pending' and are only marked as 'approved' once they have been verified by a recipe manager. Recipe managers can 'check out' recipes from the database into their local development environment, such as VS Code, to run, debug, and edit the recipes before checking them back in. To make this process platform-independent, recipes are checked out into a Docker container, which can be used as the runtime environment for running the recipes via VS Code.
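
Under the hood, check-out locking is recorded on the recipe rows themselves, using the `locked_by` and `locked_at` columns added to `langchain_pg_embedding` in this PR. As a rough, assumption-laden sketch only (this is not the actual `recipe_sync.py` implementation, and the connection string is hypothetical), the locking step amounts to something like:

```python
# Illustrative sketch only; recipe_sync.py may differ. Assumes SQLAlchemy and the
# locked_by / locked_at columns added to langchain_pg_embedding in this PR.
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@recipedb:5432/recipes")  # hypothetical connection string

def lock_recipes(checked_out_by: str) -> None:
    """Mark currently unlocked recipes as checked out so others cannot edit them."""
    with engine.begin() as conn:
        conn.execute(
            text(
                "UPDATE langchain_pg_embedding "
                "SET locked_by = :who, locked_at = :when "
                "WHERE locked_by IS NULL"
            ),
            {"who": checked_out_by, "when": datetime.now(timezone.utc).isoformat()},
        )
```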

1. To check out recipes:

`cd recipes-management`
`docker exec haa-recipe-manager python recipe_sync.py --check_out <YOUR NAME>`

Note: This will lock the recipes in the database so others cannot edit them.

2. The `checked_out` folder in the `recipes-management` directory now contains all the recipes that were checked out from the database, including the recipe code as a `.py` file. Note that after this step, the recipes in the database are marked as locked with your name and the timestamp of when you checked them out. If someone else tries to check them out, they are notified accordingly and cannot proceed until you've unlocked the records (more on that below).

This step checks out three files:

- recipe.py - Contains the recipe code (re-assembled from the corresponding sections in the metadata JSON file)
- metadata.json - Contains the content of the cmetadata column in the recipe database.
- record_info.json - Contains additional information about the record, such as its custom_id and output.

You can edit all files according to your needs. Please make sure not to change the custom_id anywhere, because it is needed for the check-in process.

3. Run the scripts and edit them as you see fit. Please note: do not delete the #Functions Code and #Calling Code comments, as they are mandatory for reassembling the metadata JSON during the check-in process (a sketch of such a file follows these steps).
4. Once you've reviewed and edited the recipes, run

`docker exec haa-recipe-manager python recipe_sync.py --check_in <YOUR NAME>`

to check the records back into the database and unlock them. All recipes that you've checked in this way are automatically set to status 'approved', with your name as the approver and the timestamp of when you checked them back in.
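
For orientation, a checked-out recipe script might look roughly like the sketch below. This is purely illustrative and assumes hypothetical table, column, and connection details; only the #Functions Code and #Calling Code markers are taken from the instructions above.

```python
# Hypothetical checked-out recipe.py; the actual generated files will differ.
# The #Functions Code / #Calling Code markers are used to reassemble the
# metadata JSON at check-in, so they must not be removed.

#Functions Code
import os
from sqlalchemy import create_engine, text

def get_total_population(country_code: str) -> int:
    """Return the total population for a country from the data database (illustrative query)."""
    # Hypothetical environment variable and connection string
    engine = create_engine(os.getenv("DATA_DB_CONNECTION", "postgresql://user:password@datadb:5432/data"))
    query = text(
        "SELECT SUM(population) FROM hapi_population WHERE location_code = :code"  # hypothetical table/columns
    )
    with engine.connect() as conn:
        return int(conn.execute(query, {"code": country_code}).scalar())

#Calling Code
if __name__ == "__main__":
    print(get_total_population("MLI"))
```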

## Testing a Recipe

You can run a specific recipe like this ...

`docker exec haa-recipe-manager python checked_out/retrieve_the_total_population_of_a_specified_country/recipe.py`

You can also exec into the container to do it ...

1. `docker exec -it haa-recipe-manager /bin/bash`
2. `cd ./checked_out`, then `cd <RECIPE_DIR>`
3. `python recipe.py`

You can also configure VS Code to connect to the recipe-manager container for running recipes ...

1. Install the DevContainers VSCode extension
2. Build data recipes using the `docker compose` command mentioned above
3. Open the command palette in VSCode (CMD + Shift + P on Mac; CTRL + Shift + P on Windows) and select

`Dev Containers: Attach to remote container`.

Select the recipe-manager container. This opens a new VSCode window - use it for the next steps.
4. Open folder `/app`
5. Navigate to your recipe in sub-folder `checked_out`
6. Run `recipe.py` in a terminal or set up the Docker interpreter

# Autogen Studio and autogen agent teams for creating data recipes

![AutoGen Studio with the data recipes workflow](../assets/autogen-studio-recipes.png)

Data recipes AI contains an AutoGen Studio instance in the Docker build, as well as a sample skill, agent, and workflow for using a team of AutoGen agents to create data recipes.

You can find information on AutoGen Studio [here](https://github.com/microsoft/autogen/tree/main/samples/apps/autogen-studio). This folder includes a skill to query the data recipes database, an agent that uses that skill (with some prompts to help it), and a workflow that uses the agent; a rough sketch of such a skill is shown after the activation steps below.

To activate:

1. Go to [http://localhost:8091/](http://localhost:8091/)
2. Click on 'Build'
3. Click 'Skills' on left, top right click '...' and import the skill in `./assets`
4. Click 'Agents' on left, top right click '...' and import the agent in `./assets`
5. Click 'Workflows' on left, top right click '...' and import the workflow in `./assets`
6. Go to the playground, start a new session, and select the 'Recipes data Analysis' workflow
7. Ask 'What is the total population of Mali?'
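
For orientation, a skill for querying the recipes database could look something like the minimal sketch below. This is not the bundled skill in `./assets`, just an illustration: the table and column names come from the schema changes in this PR, while the host environment variable name is an assumption.

```python
# Minimal sketch of a "query the recipes database" skill; the skill shipped in
# ./assets may differ. Column names are from 1-schema.sql in this PR.
import os
import psycopg2

def list_approved_recipes(limit: int = 10) -> list:
    """Return approved recipe records (custom_id and cmetadata) from the recipes database."""
    conn = psycopg2.connect(
        host=os.getenv("POSTGRES_RECIPE_HOST", "recipedb"),  # assumed variable name
        dbname=os.getenv("POSTGRES_RECIPE_DB"),
        user=os.getenv("POSTGRES_RECIPE_USER"),
        password=os.getenv("POSTGRES_RECIPE_PASSWORD"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT custom_id, cmetadata FROM langchain_pg_embedding "
                "WHERE approval_status = 'approved' LIMIT %s",
                (limit,),
            )
            rows = cur.fetchall()
    finally:
        conn.close()
    return [{"custom_id": r[0], "cmetadata": r[1]} for r in rows]
```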

# Deployment

We will add more details here soon; for now, here are some notes on Azure ...
@@ -185,4 +256,4 @@ You will need to set key environment variables, see your local `.env` for exampl

## Databases

When running in Azure it is useful to use remote databases, at least for the MongoDB instance, so that user logins are retained with each release. For example, a database can be configured by following [these instructions](https://docs.librechat.ai/install/configuration/mongodb.html). If doing this, then docker-compose-azure.yml in Azure can have the Mongo DB section removed, and any instance of the Mongo URL used by other containers updated with the cloud connection string accordingly.
When running in Azure it is useful to use remote databases, at least for the MongoDB instance, so that user logins are retained with each release. For example, a database can be configured by following [these instructions](https://docs.librechat.ai/install/configuration/mongodb.html). If doing this, then docker-compose-azure.yml in Azure can have the Mongo DB section removed, and any instance of the Mongo URL used by other containers updated with the cloud connection string accordingly.
6 changes: 5 additions & 1 deletion actions/actions_plugins/recipe-server/db/1-schema.sql
@@ -23,7 +23,11 @@ CREATE TABLE public.langchain_pg_embedding (
cmetadata json NULL,
custom_id varchar NULL,
uuid uuid NOT NULL,
"checksum" varchar NULL,
approval_status varchar NULL,
approver varchar NULL,
approval_latest_update varchar NULL,
locked_by varchar NULL,
locked_at varchar NULL,
CONSTRAINT langchain_pg_embedding_pkey PRIMARY KEY (uuid)
);

62 changes: 32 additions & 30 deletions actions/actions_plugins/recipe-server/db/2-demo-data.sql

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion cleanup.sh
@@ -4,4 +4,5 @@ rm -rf ./ui/recipes-chat/images/*
rm -rf ./ui/recipes-chat/logs/*
rm -rf ./ui/recipes-chat/data-node/*
rm -rf ./ui/recipes-chat/meili_data_v1.7/*
rm -rf ./ui/recipes-chat/datadb/*
rm -rf ./ui/recipes-chat/datadb/*
rm -rf ./ui/recipes-chat/recipesdb/*
29 changes: 8 additions & 21 deletions docker-compose.yml
@@ -179,36 +179,23 @@ services:
dockerfile: ./recipes-management/Dockerfile
depends_on:
- recipedb
- datadb
restart: always
ports:
- 8091:8081
environment:
POSTGRES_DB: ${POSTGRES_RECIPE_DB}
POSTGRES_USER: ${POSTGRES_RECIPE_USER}
POSTGRES_PASSWORD: ${POSTGRES_RECIPE_PASSWORD}
POSTGRES_DATA_HOST: ${POSTGRES_DATA_HOST}
POSTGRES_DATA_DB: ${POSTGRES_DATA_DB}
POSTGRES_DATA_USER: ${POSTGRES_DATA_USER}
POSTGRES_DATA_PASSWORD: ${POSTGRES_DATA_PASSWORD}
AZURE_OPENAI_API_KEY: ${AZURE_API_KEY_ENV}
env_file:
- .env
volumes:
- ./recipes-management:/app
# recipe-creator:
# container_name: haa-recipe-creator
# build:
# context: .
# dockerfile: ./recipes-creation/Dockerfile
# depends_on:
# - datadb
# restart: always
# environment:
# POSTGRES_DATA_HOST: ${POSTGRES_DATA_HOST}
# POSTGRES_DATA_DB: ${POSTGRES_DATA_DB}
# POSTGRES_DATA_USER: ${POSTGRES_DATA_USER}
# POSTGRES_DATA_PASSWORD: ${POSTGRES_DATA_PASSWORD}
# AZURE_OPENAI_API_KEY: ${AZURE_API_KEY_ENV}
# ports:
# - 8091:8081
# env_file:
# - .env
# volumes:
# - ./recipes-creation:/app


volumes:
pgdata2:
4 changes: 2 additions & 2 deletions ingestion/api/hapi_utils.py
@@ -53,10 +53,10 @@ def post_process_data(df, standard_names):
df = filter_hapi_df(df, standard_names["admin0_code_field"])

# Add a flag to indicate latest dataset by HDX ID, useful for LLM queries
if "dataset_hdx_stub" in df.columns and "reference_period_start" in df.columns:
if "resource_hdx_id" in df.columns and "reference_period_start" in df.columns:
df["latest"] = 0
df["reference_period_start"] = pd.to_datetime(df["reference_period_start"])
df["latest"] = df.groupby("dataset_hdx_stub")[
df["latest"] = df.groupby("resource_hdx_id")[
"reference_period_start"
].transform(lambda x: x == x.max())

77 changes: 59 additions & 18 deletions ingestion/ingest.py
@@ -79,7 +79,13 @@ def get_api_data(endpoint, params, data_node=None):


def download_openapi_data(
api_host, openapi_def, excluded_endpoints, data_node, save_path, query_extra=""
api_host,
openapi_def,
excluded_endpoints,
data_node,
save_path,
query_extra="",
skip_downloaded=False,
):
"""
Downloads data based on the functions specified in the openapi.json definition file.
@@ -94,17 +100,19 @@ excluded_endpoints (list): List of endpoints to exclude
excluded_endpoints (list): List of endpoints to exclude
data_node (str): The node in the openapi JSON file where the data is stored
query_extra (str): Extra query parameters to add to the request
skip_downloaded (bool): If True, skip downloading data that already exists

"""

limit = 1000
offset = 0

files = os.listdir(save_path)
for f in files:
if "openapi.json" not in f:
filename = f"{save_path}/{f}"
os.remove(filename)
if skip_downloaded is False:
for f in files:
if "openapi.json" not in f:
filename = f"{save_path}/{f}"
os.remove(filename)

for endpoint in openapi_def["paths"]:
if endpoint in excluded_endpoints:
@@ -118,6 +126,15 @@

print(url)

endpoint_clean = endpoint.replace("/", "_")
if endpoint_clean[0] == "_":
endpoint_clean = endpoint_clean[1:]
file_name = f"{save_path}/{endpoint_clean}.csv"

if skip_downloaded and os.path.exists(file_name):
print(f"Skipping {endpoint} as {file_name} already exists")
continue

data = []
offset = 0
output = []
@@ -135,14 +152,9 @@
time.sleep(1)

if len(data) > 0:
endpoint_clean = endpoint.replace("/", "_")
if endpoint_clean[0] == "_":
endpoint_clean = endpoint_clean[1:]

print(len(data), "Before DF")
df = pd.DataFrame(data)
print(df.shape[0], "After DF")
file_name = f"{save_path}/{endpoint_clean}.csv"
df.to_csv(file_name, index=False)
with open(f"{save_path}/{endpoint_clean}_meta.json", "w") as f:
full_meta = openapi_def["paths"][endpoint]
@@ -241,6 +253,9 @@ def process_openapi_data(api_name, files_dir, field_map, standard_names):
None
"""
datafiles = os.listdir(files_dir)
processed_dir = f"{files_dir}/processed"
if not os.path.exists(processed_dir):
os.makedirs(processed_dir)
for f in datafiles:
if f.endswith(".csv"):
filename = f"{files_dir}/{f}"
@@ -257,7 +272,8 @@ df = eval(post_process_str)
df = eval(post_process_str)
print(" After shape", df.shape)

df.to_csv(filename, index=False)
processed_filename = f"{processed_dir}/{f}"
df.to_csv(processed_filename, index=False)


def save_openapi_data(files_dir, conn, api_name):
@@ -283,7 +299,7 @@ df.to_sql(table, conn, if_exists="replace", index=False)
df.to_sql(table, conn, if_exists="replace", index=False)

# Collate metadata
meta_file = f"{files_dir}/{f.replace('.csv', '_meta.json')}"
meta_file = f"{files_dir.replace('/processed', '')}/{f.replace('.csv', '_meta.json')}"
if os.path.exists(meta_file):
with open(meta_file) as mf:
meta = json.load(mf)
@@ -297,10 +313,16 @@ r["api_description"] += f' : {meta["get"]["description"]}'
r["api_description"] += f' : {meta["get"]["description"]}'
r["api_definition"] = str(meta)
r["file_name"] = f
for field in ["location_name", "origin_location_name"]:
if field in df.columns:
r["countries"] = sorted(df[field].unique())

table_metadata.append(r)

# We could also use Postgres comments, but this is simpler for LLM agents for now
print("Saving metadata")
table_metadata = pd.DataFrame(table_metadata)
print(table_metadata.shape)
table_metadata.to_sql("table_metadata", conn, if_exists="replace", index=False)


@@ -335,6 +357,12 @@

shape_files_table = "hdx_shape_files"

with conn.connect() as connection:
print(f"Creating metadata for {shape_files_table}")
statement = text("CREATE EXTENSION IF NOT EXISTS postgis;")
connection.execute(statement)
connection.commit()

df_list = []
for f in os.listdir(files_dir):
if f.endswith(".shp"):
@@ -375,9 +403,16 @@ def map_field_names(df, field_map):
return df


def main():
def main(skip_downloaded=False):
"""
Main function for data ingestion.

Args:
skip_downloaded (bool, optional): Flag to skip downloaded data. Defaults to False.
"""
apis, field_map, standard_names = read_integration_config(INTEGRATION_CONFIG)
conn = connect_to_db()

for api in apis:

openapi_def = get_api_def(api)
@@ -406,27 +441,33 @@

# Extract data from remote APIs which are defined in apis.config
download_openapi_data(
api_host, openapi_def, excluded_endpoints, data_node, save_path, query_extra
api_host,
openapi_def,
excluded_endpoints,
data_node,
save_path,
query_extra,
skip_downloaded,
)

# Standardize column names
process_openapi_data(api_name, save_path, field_map, standard_names)

# Upload CSV files to the database, with supporting metadata
save_openapi_data(save_path, conn, api_name)
save_openapi_data(f"{save_path}/processed", conn, api_name)

# Download shapefiles from HDX. Note, this also standardizes column names
download_hdx_boundaries(
datafile="./api/hapi/api_v1_themes_population.csv",
datafile="./api/hapi/processed/api_v1_population-social_population.csv",
datafile_country_col=standard_names["country_code_field"],
target_dir="./api/hdx/",
field_map=field_map,
map_field_names=map_field_names,
)

# Upload shapefiles to the database
upload_hdx_shape_files("./api/hdx", conn)
upload_hdx_shape_files("./api/hdx/", conn)


if __name__ == "__main__":
main()
main(skip_downloaded=True)
7 changes: 5 additions & 2 deletions ingestion/shapefiles.py
@@ -110,7 +110,7 @@ def normalize_hdx_boundaries(
None
"""

output_dir = "./tmp/normalized/"
output_dir = "./tmp/processed/"
if not os.path.exists(output_dir):
os.makedirs(output_dir)

@@ -146,7 +146,7 @@


def download_hdx_boundaries(
datafile="./api/hapi/hapi_population.csv",
datafile="./api/hapi/api_v1_population-social_population.csv",
datafile_country_col="location_code",
target_dir="./api/hdx/",
field_map={},
@@ -175,6 +175,9 @@ get_hdx_config()
get_hdx_config()

df = pd.read_csv(datafile)

print(df.columns)

countries = df[datafile_country_col].unique()
countries = [c.lower() for c in countries]
