
Adding Recipe Management #32

Merged: 42 commits, Jun 1, 2024

Commits
9ff181d
Protecting checked out folder
JanPeterDatakind May 14, 2024
f54c28c
Added recipe management
JanPeterDatakind May 16, 2024
f0220d8
Fixes to avoid linting check - fail
JanPeterDatakind May 16, 2024
9a38f73
More linting fixes
JanPeterDatakind May 16, 2024
c33fa8e
Merge pull request #31 from datakind/main
JanPeterDatakind May 16, 2024
cc0623f
Including recipe management instructions in ReadMe
JanPeterDatakind May 16, 2024
aa1ffda
Fixed recipe db seeding
JanPeterDatakind May 17, 2024
a629b1e
Added recipe creation skills & workflow in AutoGen
JanPeterDatakind May 21, 2024
d4c9d59
Recipes records now have embeddings
JanPeterDatakind May 21, 2024
95208ee
Update README.md
JanPeterDatakind May 22, 2024
0a71312
Tweaks to README
dividor May 22, 2024
94b6d1b
Merge branch 'feat/recipe-manager-new' of github.com:datakind/data-re…
dividor May 22, 2024
9537439
Updated AutoGen workflow and skills
JanPeterDatakind May 22, 2024
1b1f476
update demo data
JanPeterDatakind May 22, 2024
43d522d
Clean up recipes
dividor May 22, 2024
8e66c0d
Added manual recipe creation
JanPeterDatakind May 24, 2024
f030986
Simplified recipe check out instructions
dividor May 28, 2024
e20de79
Refactored recipes create into recipes manager
dividor May 28, 2024
d7b1b3d
Refactored recipes create into recipes manager
dividor May 28, 2024
53c8a82
Refactored recipes create into recipes manager
dividor May 28, 2024
7b59705
Refactored recipes create into recipes manager
dividor May 28, 2024
78cb7a7
Refactored recipes create into recipes manager
dividor May 28, 2024
bb5de11
Refactored recipes create into recipes manager
dividor May 28, 2024
56d6d48
Refactored recipes create into recipes manager
dividor May 28, 2024
deb301c
Added a flag to skip downloading if file exists
dividor May 28, 2024
1de50b4
Housekeeping
dividor May 28, 2024
9da6270
Some linting (pre commit hooks) & relative imports
JanPeterDatakind May 29, 2024
00a0ddd
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
a293e85
Merge branch 'feat/recipe-manager-new' of github.com:datakind/data-re…
dividor May 29, 2024
7a72c92
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
5f4a032
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
38e0260
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
2ff665c
Fixing bug for some of the new HAPI datasets
dividor May 29, 2024
aa10409
Fixes to ingestion so I can test AI team
dividor May 29, 2024
457f7ee
Added hapi folder to avoid ingestion error
JanPeterDatakind May 29, 2024
b72962c
Adjusted AutoGen assets - works now
JanPeterDatakind May 29, 2024
911c396
Fixes 4 SQLalchemy 1.4+ dataattribution in AutoGen
JanPeterDatakind May 29, 2024
4978e57
Fix for metadata
dividor May 31, 2024
7fe660d
Help for running recipes
dividor May 31, 2024
212ee0e
Fixed check out and -in
JanPeterDatakind Jun 1, 2024
5d3e2d5
Code quality fix
dividor Jun 1, 2024
630fcc9
Merge branch 'main' into feat/recipe-manager-new
dividor Jun 1, 2024
6 changes: 5 additions & 1 deletion .gitignore
@@ -18,5 +18,9 @@ ui/recipes-chat/recipesdb
instructions.txt
dataset_details.json
.DS_Store
recipes-management/checked_out
recipes-management/checked_out/*
!recipes-management/checked_out/.gitkeep
!recipes-management/checked_out/skills.py
recipes-management/files/
database.sqlite

75 changes: 73 additions & 2 deletions README.md
@@ -18,7 +18,7 @@ Given the rapidly changing landscape of LLMs, we have tried as much as possible

Data recipes supports datasources accessed via API, but in some cases it is preferable to ingest data in order to leverage LLM SQL capabilities. We include an initial set of data sources specific to Humanitarian Response in the ingestion module, which can be extended to include additional sources as required.

Finally, for reviewing/updating/creating new recipes, though we provide some experimental assistants that can generate and run recipes to semi-automate the process, in talking with developers and datascientists, most would prefer to use their existing environment for development, such as VS Code + GitHub Copilot. For this reason we are not developing a dedicated user interface for this, and intead provide a sync process that will allow recipe managers to check out and work on recipes locally, then publish them back into the recipes database for wider consumption. We do include however a autogen studio setup to be able to use agent teams to create recipes.
Finally, for reviewing/updating/creating new recipes, though we provide some experimental assistants that can generate and run recipes to semi-automate the process, in talking with developers and data scientists, most would prefer to use their existing environment for development, such as VS Code + GitHub Copilot. For this reason we are not developing a dedicated user interface for this, and instead provide a [sync process that will allow recipe managers to check out and work on recipes locally](#managing-recipes), then publish them back into the recipes database for wider consumption. We do, however, include an AutoGen Studio setup so that agent teams can be used to create recipes.

Some more discussion on design decisions can also be found [here](https://www.loom.com/share/500e960fd91c44c282076be4b0126461?sid=83af2d6c-622c-4bda-b21b-8f528d6eafba).

@@ -159,6 +159,77 @@ To create an additional plugin, perform the following steps:

As the robocorp actions might differ slightly, this can lead to differing requirements in the openapi spec and manifest files. The [LibreChat documentation](https://docs.librechat.ai/features/plugins/chatgpt_plugins_openapi.html) provides tips and examples to form the files correctly.

## Managing recipes

The management of recipes is part of the human-in-the-loop approach of this repo. New recipes are created with status 'pending' and are only marked as 'approved' once they have been verified by a recipe manager. Recipe managers can 'check out' recipes from the database into their local development environment, such as VS Code, to run, debug, and edit the recipes before checking them back in. To make this process platform-independent, recipes are checked out into a Docker container, which can be used as the runtime environment for running the recipes via VS Code.
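
Under the hood, check-out locking is recorded on the recipe rows themselves, using the `locked_by` and `locked_at` columns added to `langchain_pg_embedding` in this PR. As a rough, assumption-laden sketch only (this is not the actual `recipe_sync.py` implementation, and the connection string is hypothetical), the locking step amounts to something like:

```python
# Illustrative sketch only; recipe_sync.py may differ. Assumes SQLAlchemy and the
# locked_by / locked_at columns added to langchain_pg_embedding in this PR.
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@recipedb:5432/recipes")  # hypothetical connection string

def lock_recipes(checked_out_by: str) -> None:
    """Mark currently unlocked recipes as checked out so others cannot edit them."""
    with engine.begin() as conn:
        conn.execute(
            text(
                "UPDATE langchain_pg_embedding "
                "SET locked_by = :who, locked_at = :when "
                "WHERE locked_by IS NULL"
            ),
            {"who": checked_out_by, "when": datetime.now(timezone.utc).isoformat()},
        )
```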

1. To check out recipes:

`cd recipes-management`
`docker exec haa-recipe-manager python recipe_sync.py --check_out <YOUR NAME>`

Note: This will lock the recipes in the database so others cannot edit them.

2. The `checked_out` folder in the `recipes-management` directory now contains all the recipes that were checked out from the database, including the recipe code as a `.py` file. Note that after this step, the recipes in the database are marked as locked with your name and the timestamp of when you checked them out. If someone else tries to check them out, they are notified accordingly and cannot proceed until you've unlocked the records (more on that below).

This step checks out three files:

- recipe.py - Contains the recipe code (re-assembled from the corresponding sections in the metadata JSON file)
- metadata.json - Contains the content of the cmetadata column in the recipe database.
- record_info.json - Contains additional information about the record, such as its custom_id and output.

You can edit all files according to your needs. Please make sure not to change the custom_id anywhere, because it is needed for the check-in process.

3. Run the scripts and edit them as you see fit. Please note: do not delete the #Functions Code and #Calling Code comments, as they are mandatory for reassembling the metadata JSON during the check-in process (a sketch of such a file follows these steps).
4. Once you've reviewed and edited the recipes, run

`docker exec haa-recipe-manager python recipe_sync.py --check_in <YOUR NAME>`

to check the records back into the database and unlock them. All recipes that you've checked in this way are automatically set to status 'approved', with your name as the approver and the timestamp of when you checked them back in.
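
For orientation, a checked-out recipe script might look roughly like the sketch below. This is purely illustrative and assumes hypothetical table, column, and connection details; only the #Functions Code and #Calling Code markers are taken from the instructions above.

```python
# Hypothetical checked-out recipe.py; the actual generated files will differ.
# The #Functions Code / #Calling Code markers are used to reassemble the
# metadata JSON at check-in, so they must not be removed.

#Functions Code
import os
from sqlalchemy import create_engine, text

def get_total_population(country_code: str) -> int:
    """Return the total population for a country from the data database (illustrative query)."""
    # Hypothetical environment variable and connection string
    engine = create_engine(os.getenv("DATA_DB_CONNECTION", "postgresql://user:password@datadb:5432/data"))
    query = text(
        "SELECT SUM(population) FROM hapi_population WHERE location_code = :code"  # hypothetical table/columns
    )
    with engine.connect() as conn:
        return int(conn.execute(query, {"code": country_code}).scalar())

#Calling Code
if __name__ == "__main__":
    print(get_total_population("MLI"))
```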

## Testing a Recipe

You can run a specific recipe like this ...

`docker exec haa-recipe-manager python checked_out/retrieve_the_total_population_of_a_specified_country/recipe.py`

You can also exec into the container to do it ...

1. `docker exec -it haa-recipe-manager /bin/bash`
2. `cd ./checked_out`, then `cd <RECIPE_DIR>`
3. `python recipe.py`

You can also configure VS Code to connect to the recipe-manager container for running recipes ...

1. Install the DevContainers VSCode extension
2. Build data recipes using the `docker compose` command mentioned above
3. Open the command palette in VSCode (CMD + Shift + P on Mac; CTRL + Shift + P on Windows) and select

`Dev Containers: Attach to remote container`.

Select the recipe-manager container. This opens a new VSCode window - use it for the next steps.
4. Open folder `/app`
5. Navigate to your recipe in sub-folder `checked_out`
6. Run `recipe.py` in a terminal or set up the Docker interpreter

# Autogen Studio and autogen agent teams for creating data recipes

![AutoGen Studio with the data recipes workflow](../assets/autogen-studio-recipes.png)

Data recipes AI contains an AutoGen Studio instance in the Docker build, as well as a sample skill, agent, and workflow for using a team of AutoGen agents to create data recipes.

You can find information on AutoGen Studio [here](https://github.com/microsoft/autogen/tree/main/samples/apps/autogen-studio). This folder includes a skill to query the data recipes database, an agent that uses that skill (with some prompts to help it), and a workflow that uses the agent; a rough sketch of such a skill is shown after the activation steps below.

To activate:

1. Go to [http://localhost:8091/](http://localhost:8091/)
2. Click on 'Build'
3. Click 'Skills' on left, top right click '...' and import the skill in `./assets`
4. Click 'Agents' on left, top right click '...' and import the agent in `./assets`
5. Click 'Workflows' on left, top right click '...' and import the workflow in `./assets`
6. Go to the playground, start a new session, and select the 'Recipes data Analysis' workflow
7. Ask 'What is the total population of Mali?'
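
For orientation, a skill for querying the recipes database could look something like the minimal sketch below. This is not the bundled skill in `./assets`, just an illustration: the table and column names come from the schema changes in this PR, while the host environment variable name is an assumption.

```python
# Minimal sketch of a "query the recipes database" skill; the skill shipped in
# ./assets may differ. Column names are from 1-schema.sql in this PR.
import os
import psycopg2

def list_approved_recipes(limit: int = 10) -> list:
    """Return approved recipe records (custom_id and cmetadata) from the recipes database."""
    conn = psycopg2.connect(
        host=os.getenv("POSTGRES_RECIPE_HOST", "recipedb"),  # assumed variable name
        dbname=os.getenv("POSTGRES_RECIPE_DB"),
        user=os.getenv("POSTGRES_RECIPE_USER"),
        password=os.getenv("POSTGRES_RECIPE_PASSWORD"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT custom_id, cmetadata FROM langchain_pg_embedding "
                "WHERE approval_status = 'approved' LIMIT %s",
                (limit,),
            )
            rows = cur.fetchall()
    finally:
        conn.close()
    return [{"custom_id": r[0], "cmetadata": r[1]} for r in rows]
```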

# Deployment

We will add more details here soon; for now, here are some notes on Azure ...
@@ -185,4 +256,4 @@ You will need to set key environment variables, see your local `.env` for exampl

## Databases

When running in Azure it is useful to use remote databases, at least for the MongoDB instance, so that user logins are retained with each release. For example, a database can be configured by following [these instructions](https://docs.librechat.ai/install/configuration/mongodb.html). If doing this, then docker-compose-azure.yml in Azure can have the Mongo DB section removed, and any instance of the Mongo URL used by other containers updated with the cloud connection string accordingly.
When running in Azure it is useful to use remote databases, at least for the MongoDB instance, so that user logins are retained with each release. For example, a database can be configured by following [these instructions](https://docs.librechat.ai/install/configuration/mongodb.html). If doing this, then docker-compose-azure.yml in Azure can have the Mongo DB section removed, and any instance of the Mongo URL used by other containers updated with the cloud connection string accordingly.
6 changes: 5 additions & 1 deletion actions/actions_plugins/recipe-server/db/1-schema.sql
@@ -23,7 +23,11 @@ CREATE TABLE public.langchain_pg_embedding (
cmetadata json NULL,
custom_id varchar NULL,
uuid uuid NOT NULL,
"checksum" varchar NULL,
approval_status varchar NULL,
approver varchar NULL,
approval_latest_update varchar NULL,
locked_by varchar NULL,
locked_at varchar NULL,
CONSTRAINT langchain_pg_embedding_pkey PRIMARY KEY (uuid)
);

62 changes: 32 additions & 30 deletions actions/actions_plugins/recipe-server/db/2-demo-data.sql

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion cleanup.sh
@@ -4,4 +4,5 @@ rm -rf ./ui/recipes-chat/images/*
rm -rf ./ui/recipes-chat/logs/*
rm -rf ./ui/recipes-chat/data-node/*
rm -rf ./ui/recipes-chat/meili_data_v1.7/*
rm -rf ./ui/recipes-chat/datadb/*
rm -rf ./ui/recipes-chat/datadb/*
rm -rf ./ui/recipes-chat/recipesdb/*
29 changes: 8 additions & 21 deletions docker-compose.yml
@@ -179,36 +179,23 @@ services:
dockerfile: ./recipes-management/Dockerfile
depends_on:
- recipedb
- datadb
restart: always
ports:
- 8091:8081
environment:
POSTGRES_DB: ${POSTGRES_RECIPE_DB}
POSTGRES_USER: ${POSTGRES_RECIPE_USER}
POSTGRES_PASSWORD: ${POSTGRES_RECIPE_PASSWORD}
POSTGRES_DATA_HOST: ${POSTGRES_DATA_HOST}
POSTGRES_DATA_DB: ${POSTGRES_DATA_DB}
POSTGRES_DATA_USER: ${POSTGRES_DATA_USER}
POSTGRES_DATA_PASSWORD: ${POSTGRES_DATA_PASSWORD}
AZURE_OPENAI_API_KEY: ${AZURE_API_KEY_ENV}
env_file:
- .env
volumes:
- ./recipes-management:/app
# recipe-creator:
# container_name: haa-recipe-creator
# build:
# context: .
# dockerfile: ./recipes-creation/Dockerfile
# depends_on:
# - datadb
# restart: always
# environment:
# POSTGRES_DATA_HOST: ${POSTGRES_DATA_HOST}
# POSTGRES_DATA_DB: ${POSTGRES_DATA_DB}
# POSTGRES_DATA_USER: ${POSTGRES_DATA_USER}
# POSTGRES_DATA_PASSWORD: ${POSTGRES_DATA_PASSWORD}
# AZURE_OPENAI_API_KEY: ${AZURE_API_KEY_ENV}
# ports:
# - 8091:8081
# env_file:
# - .env
# volumes:
# - ./recipes-creation:/app


volumes:
pgdata2:
4 changes: 2 additions & 2 deletions ingestion/api/hapi_utils.py
@@ -53,10 +53,10 @@ def post_process_data(df, standard_names):
df = filter_hapi_df(df, standard_names["admin0_code_field"])

# Add a flag to indicate latest dataset by HDX ID, useful for LLM queries
if "dataset_hdx_stub" in df.columns and "reference_period_start" in df.columns:
if "resource_hdx_id" in df.columns and "reference_period_start" in df.columns:
df["latest"] = 0
df["reference_period_start"] = pd.to_datetime(df["reference_period_start"])
df["latest"] = df.groupby("dataset_hdx_stub")[
df["latest"] = df.groupby("resource_hdx_id")[
"reference_period_start"
].transform(lambda x: x == x.max())

77 changes: 59 additions & 18 deletions ingestion/ingest.py
@@ -79,7 +79,13 @@ def get_api_data(endpoint, params, data_node=None):


def download_openapi_data(
api_host, openapi_def, excluded_endpoints, data_node, save_path, query_extra=""
api_host,
openapi_def,
excluded_endpoints,
data_node,
save_path,
query_extra="",
skip_downloaded=False,
):
"""
Downloads data based on the functions specified in the openapi.json definition file.
@@ -94,17 +100,19 @@ excluded_endpoints (list): List of endpoints to exclude
excluded_endpoints (list): List of endpoints to exclude
data_node (str): The node in the openapi JSON file where the data is stored
query_extra (str): Extra query parameters to add to the request
skip_downloaded (bool): If True, skip downloading data that already exists

"""

limit = 1000
offset = 0

files = os.listdir(save_path)
for f in files:
if "openapi.json" not in f:
filename = f"{save_path}/{f}"
os.remove(filename)
if skip_downloaded is False:
for f in files:
if "openapi.json" not in f:
filename = f"{save_path}/{f}"
os.remove(filename)

for endpoint in openapi_def["paths"]:
if endpoint in excluded_endpoints:
@@ -118,6 +126,15 @@

print(url)

endpoint_clean = endpoint.replace("/", "_")
if endpoint_clean[0] == "_":
endpoint_clean = endpoint_clean[1:]
file_name = f"{save_path}/{endpoint_clean}.csv"

if skip_downloaded and os.path.exists(file_name):
print(f"Skipping {endpoint} as {file_name} already exists")
continue

data = []
offset = 0
output = []
@@ -135,14 +152,9 @@
time.sleep(1)

if len(data) > 0:
endpoint_clean = endpoint.replace("/", "_")
if endpoint_clean[0] == "_":
endpoint_clean = endpoint_clean[1:]

print(len(data), "Before DF")
df = pd.DataFrame(data)
print(df.shape[0], "After DF")
file_name = f"{save_path}/{endpoint_clean}.csv"
df.to_csv(file_name, index=False)
with open(f"{save_path}/{endpoint_clean}_meta.json", "w") as f:
full_meta = openapi_def["paths"][endpoint]
@@ -241,6 +253,9 @@ def process_openapi_data(api_name, files_dir, field_map, standard_names):
None
"""
datafiles = os.listdir(files_dir)
processed_dir = f"{files_dir}/processed"
if not os.path.exists(processed_dir):
os.makedirs(processed_dir)
for f in datafiles:
if f.endswith(".csv"):
filename = f"{files_dir}/{f}"
@@ -257,7 +272,8 @@ df = eval(post_process_str)
df = eval(post_process_str)
print(" After shape", df.shape)

df.to_csv(filename, index=False)
processed_filename = f"{processed_dir}/{f}"
df.to_csv(processed_filename, index=False)


def save_openapi_data(files_dir, conn, api_name):
@@ -283,7 +299,7 @@ df.to_sql(table, conn, if_exists="replace", index=False)
df.to_sql(table, conn, if_exists="replace", index=False)

# Collate metadata
meta_file = f"{files_dir}/{f.replace('.csv', '_meta.json')}"
meta_file = f"{files_dir.replace('/processed', '')}/{f.replace('.csv', '_meta.json')}"
if os.path.exists(meta_file):
with open(meta_file) as mf:
meta = json.load(mf)
@@ -297,10 +313,16 @@ r["api_description"] += f' : {meta["get"]["description"]}'
r["api_description"] += f' : {meta["get"]["description"]}'
r["api_definition"] = str(meta)
r["file_name"] = f
for field in ["location_name", "origin_location_name"]:
if field in df.columns:
r["countries"] = sorted(df[field].unique())

table_metadata.append(r)

# We could also use Postgres comments, but this is simpler for LLM agents for now
print("Saving metadata")
table_metadata = pd.DataFrame(table_metadata)
print(table_metadata.shape)
table_metadata.to_sql("table_metadata", conn, if_exists="replace", index=False)


@@ -335,6 +357,12 @@

shape_files_table = "hdx_shape_files"

with conn.connect() as connection:
print(f"Creating metadata for {shape_files_table}")
statement = text("CREATE EXTENSION IF NOT EXISTS postgis;")
connection.execute(statement)
connection.commit()

df_list = []
for f in os.listdir(files_dir):
if f.endswith(".shp"):
@@ -375,9 +403,16 @@ def map_field_names(df, field_map):
return df


def main():
def main(skip_downloaded=False):
"""
Main function for data ingestion.

Args:
skip_downloaded (bool, optional): Flag to skip downloaded data. Defaults to False.
"""
apis, field_map, standard_names = read_integration_config(INTEGRATION_CONFIG)
conn = connect_to_db()

for api in apis:

openapi_def = get_api_def(api)
@@ -406,27 +441,33 @@

# Extract data from remote APIs which are defined in apis.config
download_openapi_data(
api_host, openapi_def, excluded_endpoints, data_node, save_path, query_extra
api_host,
openapi_def,
excluded_endpoints,
data_node,
save_path,
query_extra,
skip_downloaded,
)

# Standardize column names
process_openapi_data(api_name, save_path, field_map, standard_names)

# Upload CSV files to the database, with supporting metadata
save_openapi_data(save_path, conn, api_name)
save_openapi_data(f"{save_path}/processed", conn, api_name)

# Download shapefiles from HDX. Note, this also standardizes column names
download_hdx_boundaries(
datafile="./api/hapi/api_v1_themes_population.csv",
datafile="./api/hapi/processed/api_v1_population-social_population.csv",
datafile_country_col=standard_names["country_code_field"],
target_dir="./api/hdx/",
field_map=field_map,
map_field_names=map_field_names,
)

# Upload shapefiles to the database
upload_hdx_shape_files("./api/hdx", conn)
upload_hdx_shape_files("./api/hdx/", conn)


if __name__ == "__main__":
main()
main(skip_downloaded=True)
7 changes: 5 additions & 2 deletions ingestion/shapefiles.py
@@ -110,7 +110,7 @@ def normalize_hdx_boundaries(
None
"""

output_dir = "./tmp/normalized/"
output_dir = "./tmp/processed/"
if not os.path.exists(output_dir):
os.makedirs(output_dir)

@@ -146,7 +146,7 @@


def download_hdx_boundaries(
datafile="./api/hapi/hapi_population.csv",
datafile="./api/hapi/api_v1_population-social_population.csv",
datafile_country_col="location_code",
target_dir="./api/hdx/",
field_map={},
@@ -175,6 +175,9 @@ get_hdx_config()
get_hdx_config()

df = pd.read_csv(datafile)

print(df.columns)

countries = df[datafile_country_col].unique()
countries = [c.lower() for c in countries]
