release 0.9.0 #2239

Open · wants to merge 405 commits into base: master
b58720c
[CoreEngine] In order to make the inference logs work, we save the co…
fedml-alex May 29, 2024
f4c49c9
Merge pull request #2135 from FedML-AI/alexleung/dev_branch_latest
fedml-alex May 29, 2024
28e4af4
Merge pull request #2136 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex May 30, 2024
38e4453
Merge pull request #2137 from FedML-AI/dev/v0.7.0
fedml-alex May 30, 2024
1377a0d
Merge pull request #2138 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex May 30, 2024
7625075
[Deploy] Avoid re-download the same model serving package.
Raphael-Jin May 30, 2024
e70837e
Merge pull request #2139 from FedML-AI/raphael/fix-pkg-download
Raphael-Jin May 30, 2024
9d8b0df
Add inference gateway logs
alaydshah May 30, 2024
c2fd5bd
Merge remote-tracking branch 'origin' into alaydshah/inference_gatewa…
alaydshah May 30, 2024
94757eb
Merge branch 'dev/v0.7.0' into alaydshah/inference_gateway_logging
alaydshah May 31, 2024
3fb45aa
Make Inference Gateway Daemon Process
Jun 3, 2024
8595e0f
Adding fail fast and timeout enforcement per request policies.
fedml-dimitris Jun 3, 2024
9296884
[Deploy] Fix config reading from redis.
Raphael-Jin Jun 3, 2024
b1312e1
Add global env file
alaydshah Jun 4, 2024
19160a2
Nits
alaydshah Jun 4, 2024
1b2eefe
Bug fix
alaydshah Jun 4, 2024
3481aa8
Write it to release by default
alaydshah Jun 4, 2024
b2ea4d0
Nit
alaydshah Jun 4, 2024
21138dd
Nit
alaydshah Jun 4, 2024
1cc1552
Hotfix mqtt timeout inference constant after refactoring.
fedml-dimitris Jun 4, 2024
8e03183
Improving pending requests counter robustness.
fedml-dimitris Jun 4, 2024
b0a55ad
Returning well formatted json messages in the case of errored requests.
fedml-dimitris Jun 4, 2024
2e53536
Fix bug
alaydshah Jun 4, 2024
b5e4c25
Make env variables override system, abstract dotenv api calls into fu…
alaydshah Jun 4, 2024
9a8f307
Merge pull request #2143 from FedML-AI/alaydshah/global/env_file
alaydshah Jun 4, 2024
e1e09c0
Merge branch 'dev/v0.7.0' into alaydshah/inference_gateway_logging
alaydshah Jun 4, 2024
10c5e17
Merge pull request #2142 from FedML-AI/dimitris/fail_fast_policy_merge
fedml-dimitris Jun 5, 2024
e2430fc
Renaming endpoint_id key to end_point_id
fedml-dimitris Jun 5, 2024
826efa8
Merge pull request #2144 from FedML-AI/hotfix/endpoint_metrics_attribute
fedml-dimitris Jun 5, 2024
600905f
[Deploy] Fix multi sub folder issue during deployment.
Raphael-Jin Jun 5, 2024
2ce07f4
Optimize Inference
alaydshah Jun 5, 2024
c1e37af
Nit.
Raphael-Jin Jun 5, 2024
d6799eb
Merge branch 'dev/v0.7.0' into alaydshah/inference_gateway_logging
alaydshah Jun 5, 2024
85c3ad8
Merge pull request #2145 from FedML-AI/raphael/fix-multi-deploy-subfd
fedml-dimitris Jun 5, 2024
a4b8ad2
Pipe in Mqtt config directly instead of deserializing object
alaydshah Jun 5, 2024
0ad81ce
Nits
alaydshah Jun 5, 2024
d713876
Fix bugs
alaydshah Jun 6, 2024
134f63e
Remove info logging added for debugging
alaydshah Jun 6, 2024
fd446b0
Fix
alaydshah Jun 6, 2024
2478350
Merge pull request #2146 from FedML-AI/dev/v0.7.0
fedml-alex Jun 6, 2024
19abac1
[CoreEngine] update the version and dependent libs.
fedml-alex Jun 6, 2024
3e39975
Merge pull request #2148 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex Jun 6, 2024
27ad2e7
[CoreEngine] remove the deprecated files in the scheduler.
fedml-alex Jun 6, 2024
2c7d434
Merge pull request #2149 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex Jun 6, 2024
f487b12
Merge pull request #2140 from FedML-AI/alaydshah/inference_gateway_lo…
alaydshah Jun 6, 2024
493463e
[Deploy] Recursively find the model serving package folder
Raphael-Jin Jun 6, 2024
6d5f62b
Merge branch 'raphael/fix-multi-subfd' of https://github.com/FedML-AI…
fedml-dimitris Jun 6, 2024
b4cb7c5
Making sure the unzipped file is a directory during initial deployment.
fedml-dimitris Jun 6, 2024
5a05310
Merge pull request #2150 from FedML-AI/raphael/fix-multi-subfd
Raphael-Jin Jun 6, 2024
f76d88e
[Deploy] Hot fix grammar.
Raphael-Jin Jun 6, 2024
8247dd2
Merge pull request #2152 from FedML-AI/raphael/hot-fix-grammar
Raphael-Jin Jun 6, 2024
4b11270
Hot fix to support local debugging
alaydshah Jun 6, 2024
2de8c37
Bug fix
alaydshah Jun 7, 2024
343b940
Merge pull request #2153 from FedML-AI/alaydshah/inference_gateway/ho…
Raphael-Jin Jun 7, 2024
38bc898
Adding sequential uploads & download using presigned URL
bhargav191098 Jun 7, 2024
aa62a94
minor comments and some error handling
bhargav191098 Jun 7, 2024
14bae99
[CoreEngine] 1. fixed the issue that the fork method is not support i…
fedml-alex Jun 7, 2024
28ff0f3
[CoreEngine] add the missed import.
fedml-alex Jun 7, 2024
6b33065
Merge pull request #2155 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex Jun 7, 2024
c151831
Adding hash set for counting the number of pending requests per endp…
fedml-dimitris Jun 6, 2024
c29cf1d
[Deploy] Unified timeout key.
Raphael-Jin Jun 10, 2024
e667ded
Merge pull request #2151 from FedML-AI/dimitris/fix_pending_requests_…
Raphael-Jin Jun 10, 2024
5214078
Merge pull request #2154 from FedML-AI/bhargav191098/storage_presigne…
alaydshah Jun 10, 2024
c4a8714
[Deploy] Report worker's connectivity when it finished.
Raphael-Jin Jun 11, 2024
ea03b60
[Deploy] Refactor the quick start example, use public ip as default.
Raphael-Jin Jun 11, 2024
af026fb
Merge pull request #2158 from FedML-AI/raphael/refactor-quick-start
alaydshah Jun 11, 2024
31d8e7c
[CoreEngine] Adjust the design of FedML Python Agent to a decentraliz…
fedml-alex Jun 11, 2024
cc90279
Merge pull request #2159 from FedML-AI/dev/v0.7.0
fedml-alex Jun 11, 2024
9c227bb
Merge pull request #2160 from FedML-AI/dev/v0.7.0
fedml-alex Jun 11, 2024
edd148e
[CoreEngine] Use the fork process on the MacOS and linux to avoid the…
fedml-alex Jun 11, 2024
fd5af7e
[CoreEngine] Use the fork process on the MacOS and linux to avoid the…
fedml-alex Jun 11, 2024
2248621
Merge pull request #2162 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 11, 2024
207b5fb
Merge branch 'raphael/unify-connectivity' of https://github.com/FedML…
fedml-dimitris Jun 11, 2024
4a9622c
Adding default http connectivity type constant. Fixing minor typos an…
fedml-dimitris Jun 11, 2024
653fe66
[CoreEngine] make the multiprocess work on windows, linux and mac.
fedml-alex Jun 11, 2024
a7567ee
Merge pull request #2164 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 11, 2024
34fdba0
Merge pull request #2157 from FedML-AI/raphael/unify-connectivity
Raphael-Jin Jun 11, 2024
23d88fc
[Deploy] Remove unnecessary logic.
Raphael-Jin Jun 11, 2024
e0ad9b5
[Deploy] Remove unnecessary logic; Rename readiness check function; F…
Raphael-Jin Jun 11, 2024
64e8c77
[Deploy] Nit
Raphael-Jin Jun 11, 2024
9194f84
[Deploy] Hide unnecessary log.
Raphael-Jin Jun 11, 2024
8530973
Merge pull request #2165 from FedML-AI/raphael/refactor-container-dep…
fedml-dimitris Jun 11, 2024
243be07
[Deploy] Read port info from env.
Raphael-Jin Jun 12, 2024
0b23499
[CoreEngine] make the status center work in the united agents.
fedml-alex Jun 12, 2024
c27edd0
Merge pull request #2166 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 12, 2024
3a03471
[Deploy] Nit.
Raphael-Jin Jun 12, 2024
f0dd29e
[Deploy] Nit.
Raphael-Jin Jun 12, 2024
21a8a4c
[Deploy] Change few more places relate to gateway port.
Raphael-Jin Jun 12, 2024
e7e974d
[Deploy] Write port info into env file.
Raphael-Jin Jun 12, 2024
9c8ce99
[Deploy] Nit.
Raphael-Jin Jun 12, 2024
bec28a6
Merge pull request #2167 from FedML-AI/raphael/hotfix-inference-port
Raphael-Jin Jun 13, 2024
505103f
removing zip from upload
bhargav191098 Jun 14, 2024
03c58a2
changes in the download to support files
bhargav191098 Jun 14, 2024
cb7da70
print statement removal
bhargav191098 Jun 14, 2024
394906e
name issue
bhargav191098 Jun 14, 2024
2170797
\Adding Enum for data type
bhargav191098 Jun 15, 2024
5fb5ed4
adding user_id to bucket path
bhargav191098 Jun 15, 2024
14a0182
Merge pull request #2168 from FedML-AI/bhargav191098/removing_archive
bhargav191098 Jun 15, 2024
a1af615
[CoreEngine] refactor to support to pass the communication manager, s…
fedml-alex Jun 17, 2024
07ae3a9
Merge pull request #2173 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 17, 2024
6e8788c
[CoreEngine] refactor to support to pass the communication manager, s…
fedml-alex Jun 17, 2024
ca16d2a
Merge pull request #2174 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 17, 2024
78e310c
[CoreEngine] stop the status center, message center and other process…
fedml-alex Jun 17, 2024
7233d62
Merge pull request #2176 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 17, 2024
aecafb8
Fix compatibility by limiting numpy latest version.
Raphael-Jin Jun 17, 2024
87e11f7
Merge pull request #2177 from FedML-AI/raphael/fix-compat
Raphael-Jin Jun 17, 2024
1af78e7
[CoreEngine] replace the queue with the managed queue to avoid the mu…
fedml-alex Jun 18, 2024
1cac911
Merge pull request #2178 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 18, 2024
89219fb
Workaround device mapping inconsistency
alaydshah Jun 18, 2024
4ceba31
Merge pull request #2179 from FedML-AI/alaydshah/qualcomm/workaround/…
alaydshah Jun 18, 2024
1d5a05d
[Deploy][Autoscale] Bug fix: continue the for loop if no scale op.
Raphael-Jin Jun 19, 2024
a388915
Merge pull request #2182 from FedML-AI/raphael/fix-deploy
alaydshah Jun 19, 2024
31c57e0
Polishing the autoscaler real test.
fedml-dimitris Jun 19, 2024
4cb53fe
Replacing e_id.
fedml-dimitris Jun 19, 2024
4cc39fb
Merge pull request #2185 from FedML-AI/feature/autoscaler-real-test
fedml-dimitris Jun 19, 2024
1422fa1
[CoreEngine] check the nil pointer and update the numpy version.
fedml-alex Jun 19, 2024
c088de4
Merge pull request #2186 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 19, 2024
158eb9c
[CoreEngine] remove the deprecated action runners.
fedml-alex Jun 19, 2024
86b3db0
Merge pull request #2187 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 19, 2024
c485282
[CoreEngine] remove the unused files.
fedml-alex Jun 19, 2024
6b9cb03
Merge pull request #2188 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 19, 2024
9f996ab
[CoreEngine] when the deploy master reports finished status, we shoul…
fedml-alex Jun 19, 2024
82ca218
Merge pull request #2189 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 19, 2024
f28adea
[CoreEngine] Fix the stuck issue in the deploy master agent.
fedml-alex Jun 19, 2024
6b5e56f
Merge pull request #2190 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 19, 2024
31b7ae0
[Deploy] Hotfix: job runner context lost when logout.
Raphael-Jin Jun 20, 2024
eb0f207
Merge pull request #2191 from FedML-AI/raphael/hotfix-jobrunner
Raphael-Jin Jun 20, 2024
942b223
[ TEST ]: Initialize a GitHub Actions framework for CI tests
Jun 20, 2024
afe4147
[CoreEngine] in order to debug easily for multiprocessing, add the pr…
fedml-alex Jun 20, 2024
c47f527
Merge pull request #2193 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 20, 2024
fd257b8
[CoreEngine] update the dependant libs.
fedml-alex Jun 20, 2024
29a397f
Merge pull request #2194 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 20, 2024
7ccf195
[Deploy] Support arbitrary container image onboarding.
Raphael-Jin Jun 15, 2024
9ca6ecc
[Deploy] Add LoraX and Triton examples; Add url match pattern.
Raphael-Jin Jun 18, 2024
786718b
[Deploy] Support serverless container.
Raphael-Jin Jun 20, 2024
c0f691c
[Deploy] Nit.
Raphael-Jin Jun 20, 2024
67e93e8
Merge pull request #2195 from FedML-AI/raphael/pr/support-arbitrary-i…
Raphael-Jin Jun 20, 2024
7a0963e
[TEST]: add windows runners tests
Jun 21, 2024
4355c35
[doc]: make sure the workflow documents are more readable.
Jun 21, 2024
be60443
[doc]: make sure the workflow documents are more readable.
Jun 21, 2024
b63d960
[Merge]
Jun 21, 2024
d7481be
[CoreEngine] set the name of all monitor processes, remove the redund…
fedml-alex Jun 21, 2024
aa813a0
[CoreEngine] remove the API key.
fedml-alex Jun 21, 2024
11ef2a5
Merge pull request #2197 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 21, 2024
0491bb7
Merge pull request #2192 from Qigemingziba/github_action
fedml-alex Jun 21, 2024
fd038b5
Merge pull request #2199 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex Jun 21, 2024
33fb5b4
[Deploy] Pass down the api key to container.
Raphael-Jin Jun 21, 2024
f412a26
[Deploy] Nit.
Raphael-Jin Jun 21, 2024
6ec7379
Merge pull request #2200 from FedML-AI/raphael/pass-api-key
Raphael-Jin Jun 21, 2024
d6c9411
[Deploy] Remove example.
Raphael-Jin Jun 21, 2024
dcc0845
Merge pull request #2201 from FedML-AI/raphael/remove-example
fedml-dimitris Jun 21, 2024
5ae6904
[CoreEngine] make the job stopping feature work.
fedml-alex Jun 25, 2024
0db8666
Merge pull request #2203 from FedML-AI/alexleung/dev_branch_latest
fedml-alex Jun 25, 2024
fa44ccc
[Deploy] Return custom path other than /predict.
Raphael-Jin Jun 25, 2024
bd89be1
[Deploy] Add sqlite backup for get_all_deployment_result_list.
Raphael-Jin Jun 25, 2024
43f99cf
[Deploy] Nit.
Raphael-Jin Jun 25, 2024
766c52a
[Deploy] Nit.
Raphael-Jin Jun 25, 2024
0c29c49
[Deploy] Hot fix hash exist.
Raphael-Jin Jun 26, 2024
9930d2d
Merge pull request #2204 from FedML-AI/raphael/refactor-inf-service
Raphael-Jin Jun 26, 2024
36378f8
[Deploy] Indicate worker connection type through cli and api.
Raphael-Jin Jun 26, 2024
5097ff2
[Deploy] Nit.
Raphael-Jin Jun 26, 2024
72ed9a3
Merge pull request #2205 from FedML-AI/raphael/hot-fix-hash-exist
Raphael-Jin Jun 26, 2024
7193577
Merge pull request #2206 from FedML-AI/raphael/indicate-connection-type
bhargav191098 Jun 26, 2024
a932082
Merge branch 'dev/v0.7.0' into alexleung/dev_v070_for_refactor
fedml-alex Jun 26, 2024
a5bbcd2
Merge pull request #2163 from FedML-AI/alexleung/dev_v070_for_refactor
fedml-alex Jun 28, 2024
084781f
Add logs in occupy_gpu_ids, and funcs in hardware_utils for debugging
alaydshah Jul 2, 2024
37be694
Revert "Adjust the design of FedML Python Agent to a decentralized ar…
Jul 2, 2024
babf08c
Merge pull request #2208 from FedML-AI/revert-2163-alexleung/dev_v070…
Raphael-Jin Jul 2, 2024
62c4bb8
Merge pull request #2207 from FedML-AI/alay_and_raphael/debug/deploym…
alaydshah Jul 2, 2024
f5ad35b
[Deploy] Fix round-robin algorithm; Format code.
Raphael-Jin Jul 8, 2024
c8e5755
[Deploy] Use terminology expose_subdomains.
Raphael-Jin Jul 22, 2024
7d47d27
Merge pull request #2214 from FedML-AI/raphael/change-terminology
Raphael-Jin Jul 24, 2024
281f8c0
Add marketplace_type, price_per_hour as optional login parameters
alaydshah Aug 1, 2024
24e4ce4
Fixes
alaydshah Aug 5, 2024
b692734
Nits
alaydshah Aug 5, 2024
06f70a4
Merge pull request #2210 from FedML-AI/raphael/refac-round-robin
alaydshah Aug 5, 2024
043fa6e
Bugfix
alaydshah Aug 6, 2024
682a6c4
Merge pull request #2216 from FedML-AI/alaydshah/update/provider_login
alaydshah Aug 7, 2024
bbf2493
Adding validation and price range restriction
alaydshah Aug 8, 2024
d9b4b8a
Merge pull request #2217 from FedML-AI/alaydshah/fix/provider_login
alaydshah Aug 8, 2024
a990a60
[Deploy] Automatically mount the workspace to container in the defaul…
Raphael-Jin Aug 12, 2024
dfd8308
[Deploy] Support bootstrap and CMD be indicated together.
Raphael-Jin Aug 12, 2024
1a09b0e
[Deploy] Nit.
Raphael-Jin Aug 12, 2024
7e5f6a1
[Deploy] Add example.
Raphael-Jin Aug 13, 2024
0d75918
Merge pull request #2218 from FedML-AI/raphael/refactor-mount-logic
Raphael-Jin Aug 16, 2024
47efcde
feat: Add name parameter to the bindingEdge method
alaydshah Aug 22, 2024
135c55b
Pass name into login
alaydshah Aug 22, 2024
4cf8066
Fixing grpc and trpc ipconfig from 127.0.0.0 to 0.0.0.0
Sep 4, 2024
be6196e
Merge pull request #2221 from FedML-AI/alaydshah/render/login/name
alaydshah Sep 4, 2024
277f4ca
Remove if condition, add log
alaydshah Sep 5, 2024
c7bfa63
Merge pull request #2222 from FedML-AI/alaydshah/fix/name
alaydshah Sep 5, 2024
17dd2b7
Stringify name
alaydshah Sep 12, 2024
4a198eb
Set name arg required to True
alaydshah Sep 12, 2024
16417d5
Making name optional
alaydshah Sep 12, 2024
03f37b8
Merge pull request #2223 from FedML-AI/alaydshah/bugfix/name
alaydshah Sep 12, 2024
53aead3
add the new certs.
fedml-alex Sep 13, 2024
f46cd1e
update new certs.
fedml-alex Sep 13, 2024
fa11d0b
Merge pull request #2224 from FedML-AI/alexleung/dev_v0700_for_merge
fedml-alex Sep 13, 2024
e046f5b
Fixing grpc compatibility with the fedml.ai platform and simplifying …
Sep 13, 2024
87ae30a
Merge branch 'dev/v0.7.0' into dimitris/grpc_fix
Sep 14, 2024
303f29b
Removing empty line.
Sep 14, 2024
1420898
[CoreEngine] set the cuda visible id into the docker container when t…
fedml-alex Sep 16, 2024
d0826d9
Merge pull request #2226 from FedML-AI/alexleung/dev_v0700_for_merge
fedml-alex Sep 16, 2024
30cfe02
set the gpu ids when training.
fedml-alex Sep 18, 2024
99903b2
Merge pull request #2227 from FedML-AI/alexleung/dev_v0700_for_merge
ASCE1885 Sep 18, 2024
f6b8c44
Merge pull request #2225 from FedML-AI/dimitris/grpc_fix
fedml-alex Sep 20, 2024
16a79d9
Adding simple local env docker client checker.
Oct 16, 2024
f299a8e
Adding more docker client existence checkpoints.
Oct 16, 2024
3349667
Fixing grpc readme file.
Oct 17, 2024
d2484fa
Remove circular dependency.
Oct 17, 2024
a959802
Extending grpc support to also consider docker container ips.
Oct 17, 2024
aa69122
Fixing notation and attribute names in grpc config files.
Oct 17, 2024
c302749
testing with ingress ip.
Oct 18, 2024
292bfb3
Polishing grpc + docker examples.
Oct 18, 2024
c6d4daf
Merge pull request #2229 from FedML-AI/dimitris/grpc_with_docker
fedml-alex Oct 24, 2024
55ff447
Parameterizing deploy host, port.
fedml-dimitris Nov 4, 2024
9fc5b4d
Merge pull request #2230 from FedML-AI/inference_runner_custom_host_port
fedml-alex Nov 6, 2024
a108a8a
[Deploy] Edge Case Handling.
Raphael-Jin Nov 11, 2024
98e084a
Merge pull request #2232 from FedML-AI/raphael/quick-fix-error-catch
fedml-alex Nov 11, 2024
698e95e
[fixbug]
charlieyl Dec 2, 2024
56f6059
Merge pull request #2233 from FedML-AI/charlie/dev/v0.7.0
charlieyl Dec 2, 2024
757e5f0
add log in get_available_gpu_ids[hardware_utils.py]
charlieyl Dec 10, 2024
c5c22c8
add logs
charlieyl Dec 10, 2024
ffa54a3
add logs
charlieyl Dec 10, 2024
589dc47
add logs
charlieyl Dec 10, 2024
40735d9
add logs
charlieyl Dec 10, 2024
90c1191
[bugfix]Enhance GPU management(need compare the readtime availabe gpu…
charlieyl Dec 11, 2024
30e1f70
[debug]add logs
charlieyl Dec 11, 2024
21a374c
[bugfix] Enhance GPU cache management by setting initial available GP…
charlieyl Dec 11, 2024
02b87f4
[bugfix]calculate the difference between realtime_available_gpu_ids a…
charlieyl Dec 11, 2024
9ff4a56
[bugfix]set shm_size to 8G if not specified
charlieyl Dec 11, 2024
fee49a4
Merge pull request #2234 from FedML-AI/dev/v0.7.0
charlieyl Dec 17, 2024
d5831b9
Revert "Merge pull request #2233 from FedML-AI/charlie/dev/v0.7.0"
charlieyl Dec 17, 2024
9fa8499
Merge pull request #2235 from FedML-AI/revert-2234-dev/v0.7.0
charlieyl Dec 17, 2024
2055d68
Merge pull request #2237 from FedML-AI/charlie/dev/v0.7.0
charlieyl Dec 17, 2024
bb31c93
remove debug logs
charlieyl Dec 18, 2024
cb489f8
Merge pull request #2238 from FedML-AI/charlie/dev/v0.7.0
charlieyl Dec 18, 2024
e27b830
check the gpu avaiablity using the random api to adapte the rental gpus.
Dec 20, 2024
181621a
Merge pull request #2240 from FedML-AI/alexleung/dev_v0700_4_sync
charlieyl Dec 20, 2024
77c6906
[bugfix]Adapt log method(in transformers/trainer.py) parameters
charlieyl Dec 20, 2024
038faaf
Merge branch 'dev/v0.7.0' into charlie/dev/v0.7.0
charlieyl Dec 20, 2024
a5708ea
Merge branch 'dev/v0.7.0' into charlie/dev/v0.7.0
charlieyl Dec 20, 2024
74803fe
[bugfix]Add the. zip suffix to the s3 key of the model card
charlieyl Dec 20, 2024
63f2110
Merge pull request #2241 from FedML-AI/charlie/dev/v0.7.0
fedml-alex Dec 20, 2024
93f9760
[update]Upgrade official website address: https://tensoropera.ai , an…
charlieyl Dec 20, 2024
ca5e764
Merge pull request #2242 from FedML-AI/charlie/dev/v0.7.0
fedml-alex Dec 20, 2024
4acb0f0
undo "Welcome to FedML.ai!"
charlieyl Dec 20, 2024
24469d2
Merge pull request #2243 from FedML-AI/charlie/dev/v0.7.0
fedml-alex Dec 20, 2024
07ae5ec
[bugfix]start_job_perf on execute_job_task
charlieyl Dec 25, 2024
f08a1ab
Merge pull request #2244 from FedML-AI/charlie/dev/v0.7.0
charlieyl Dec 25, 2024
4 changes: 2 additions & 2 deletions python/README.md
@@ -43,5 +43,5 @@ Other low-level APIs related to security and privacy are also supported. All alg

**utils**: Common utilities shared by other modules.

## About FedML, Inc.
https://FedML.ai
## About TensorOpera, Inc.
https://tensoropera.ai
14 changes: 7 additions & 7 deletions python/examples/README.md
@@ -2,14 +2,14 @@
# FEDML Examples (Including Prebuilt Jobs in Jobs Store)

- `FedML/python/examples` -- examples for training, deployment, and federated learning
- `FedML/python/examples/launch` -- examples for FEDML®Launch
- `FedML/python/examples/serving` -- examples for FEDML®Deploy
- `FedML/python/examples/train` -- examples for FEDML®Train
- `FedML/python/examples/cross_cloud` -- examples for FEDML®Train cross-cloud distributed training
- `FedML/python/examples/launch` -- examples for TensorOpera®Launch
- `FedML/python/examples/serving` -- examples for TensorOpera®Deploy
- `FedML/python/examples/train` -- examples for TensorOpera®Train
- `FedML/python/examples/cross_cloud` -- examples for TensorOpera®Train cross-cloud distributed training
- `FedML/python/examples/federate/prebuilt_jobs` -- examples for federated learning prebuilt jobs (FedCV, FedNLP, FedGraphNN, Healthcare, etc.)
- `FedML/python/examples/federate/cross_silo` -- examples for cross-silo federated learning
- `FedML/python/examples/federate/cross_device` -- examples for cross-device federated learning
- `FedML/python/examples/federate/simulation` -- examples for federated learning simulation
- `FedML/python/examples/federate/security` -- examples for FEDML®Federate security related features
- `FedML/python/examples/federate/privacy` -- examples for FEDML®Federate privacy related features
- `FedML/python/examples/federate/federated_analytics` -- examples for FEDML®Federate federated analytics (FA)
- `FedML/python/examples/federate/security` -- examples for TensorOpera®Federate security related features
- `FedML/python/examples/federate/privacy` -- examples for TensorOpera®Federate privacy related features
- `FedML/python/examples/federate/federated_analytics` -- examples for TensorOpera®Federate federated analytics (FA)
2 changes: 1 addition & 1 deletion python/examples/deploy/complex_example/README.md
@@ -16,7 +16,7 @@ Use -cf to indicate the configuration file.
curl -XPOST localhost:2345/predict -d '{"text": "Hello"}'
```

## Option 2: Deploy to the Cloud (Using fedml®launch platform)
## Option 2: Deploy to the Cloud (Using TensorOpera®launch platform)
- Uncomment the following line in config.yaml

For information about the configuration, please refer to fedml ® launch.
2 changes: 1 addition & 1 deletion python/examples/deploy/complex_example/config.yaml
@@ -15,7 +15,7 @@ environment_variables:
LOCAL_RANK: "0"

# If you do not have any GPU resource but want to serve the model
# Try FedML® Nexus AI Platform, and Uncomment the following lines.
# Try TensorOpera® Nexus AI Platform, and Uncomment the following lines.
# ------------------------------------------------------------
computing:
minimum_num_gpus: 1 # minimum # of GPUs to provision
48 changes: 0 additions & 48 deletions python/examples/deploy/custom_inference_image/README.md

This file was deleted.

This file was deleted.

@@ -0,0 +1,12 @@
workspace: "."
inference_image: "your_docker_hub_repo/your_image_name"

workspace_mount_path: "/my_workspace" # Default is "/home/fedml/models_serving"

container_run_command: "echo hello && python3 /my_workspace/main_entry.py"

# If you want to install some packages
# Please write the command in the bootstrap.sh
bootstrap: |
  echo "Install some packages..."
  echo "Install finished!"
@@ -0,0 +1,28 @@
from fedml.serving import FedMLPredictor
from fedml.serving import FedMLInferenceRunner
import uuid


class Bot(FedMLPredictor):  # Inherit FedMLPredictor
    def __init__(self):
        super().__init__()

        # --- Your model initialization code here; here is an example ---
        self.uuid = uuid.uuid4()
        # -------------------------------------------

    def predict(self, request: dict):
        input_dict = request
        question: str = input_dict.get("text", "").strip()

        # --- Your model inference code here ---
        response = f"I am a replica, my id is {self.uuid}"
        # ---------------------------------------

        return {"v1_generated_text": f"V1: The answer to your question {question} is: {response}"}


if __name__ == "__main__":
    chatbot = Bot()
    fedml_inference_runner = FedMLInferenceRunner(chatbot)
    fedml_inference_runner.run()
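The predictor above follows a simple dict-in/dict-out contract. A dependency-free stand-in (illustrative only — `EchoBot` is not part of the FedML SDK; the key names mirror the example) sketches the same flow:

```python
import uuid


class EchoBot:
    """Illustrative stand-in for the FedMLPredictor subclass above
    (no fedml dependency; key names mirror the example)."""

    def __init__(self):
        # "Model initialization": just a replica id, as in the example.
        self.uuid = uuid.uuid4()

    def predict(self, request: dict) -> dict:
        question = request.get("text", "").strip()
        response = f"I am a replica, my id is {self.uuid}"
        return {"v1_generated_text": f"V1: The answer to your question {question} is: {response}"}


result = EchoBot().predict({"text": "Hello"})
print(result["v1_generated_text"])
```

Each replica answers with its own id, which is what makes this example useful for verifying load balancing across deployed workers.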
16 changes: 0 additions & 16 deletions python/examples/deploy/custom_inference_image/serve_main.py

This file was deleted.

22 changes: 22 additions & 0 deletions python/examples/deploy/custom_inference_image/template.yaml
@@ -0,0 +1,22 @@
# Required
workspace: "./"  # We will package all the files in the workspace directory
expose_subdomains: true  # For a customized image, if you want to route all the subdomains, set to true. e.g. localhost:2345/{all-subdomain}
inference_image_name: ""  # Container image name
container_run_command: ""  # str or list, similar to CMD in the Dockerfile
port: 80  # Service port; currently you can only indicate one arbitrary port

# Optional, these are the default values
readiness_probe:  # Probe for checking whether a container is ready for inference
  httpGet:
    path: ""
environment_variables: {}  # Environment variables inside the container
volumes:  # Volumes to mount to the container
  - workspace_path: ""  # Path to the volume in the workspace
    mount_path: ""  # Path to mount the volume inside the container
deploy_timeout_sec: 900  # Maximum time waiting for deployment to finish (does not include the time to pull the image)
request_input_example: {}  # Example of an input request; will be shown in the UI
registry_specs:  # Registry information for pulling the image
  registry_name: ""
  registry_provider: "DockerHub"
  registry_user_name: ""
  registry_user_password: ""
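A config like this can be sanity-checked before it is handed to `fedml model create`; the helper below is purely illustrative (not part of the FedML SDK) and only verifies the keys the template lists under `# Required`:

```python
# Illustrative validator; the key set follows the template's "# Required"
# section, but this helper is an assumption, not part of the FedML SDK.
REQUIRED_KEYS = {"workspace", "expose_subdomains", "inference_image_name",
                 "container_run_command", "port"}


def missing_required_keys(config: dict) -> set:
    """Return the required keys absent from a custom-image deploy config."""
    return REQUIRED_KEYS - set(config)


cfg = {
    "workspace": "./",
    "expose_subdomains": True,
    "inference_image_name": "nvcr.io/nvidia/tritonserver:24.05-py3",
    "container_run_command": "tritonserver --model-repository=/repo_inside_container",
    "port": 8000,
}
print(missing_required_keys(cfg))  # set() when everything required is present
```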
@@ -0,0 +1,17 @@
workspace: "./"

expose_subdomains: true
inference_image_name: "fedml/llama3-8b-tensorrtllm"

# If you put the model repository in $workspace/model_repository, it will be mounted to /home/fedml/models_serving/model_repository
container_run_command: ["sh", "-c", "cd / && huggingface-cli login --token $your_hf_token && pip install sentencepiece protobuf && python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1 && tail -f /dev/null"]

readiness_probe:
  httpGet:
    path: "/v2/health/ready"

port: 8000

deploy_timeout_sec: 1600


@@ -0,0 +1,20 @@
workspace: "./"

expose_subdomains: true
inference_image_name: "nvcr.io/nvidia/tritonserver:24.05-py3"

volumes:
  - workspace_path: "./model_repository"
    mount_path: "/repo_inside_container"

container_run_command: "tritonserver --model-repository=/repo_inside_container"

readiness_probe:
  httpGet:
    path: "/v2/health/ready"

port: 8000

deploy_timeout_sec: 1600

request_input_example: {"text_input": "Hello"}
@@ -0,0 +1,25 @@
import json
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.model_name = args['model_name']

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        auto_complete_model_config.add_input({"name": "text_input", "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.add_output({"name": "text_output", "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config

    def execute(self, requests):
        responses = []
        for request in requests:
            in_numpy = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            assert np.object_ == in_numpy.dtype, 'in this demo, triton passes in a numpy array of size 1 with object_ dtype; this dtype encapsulates a python bytes-array'
            print('in this demo len(in_numpy) is 1:', len(in_numpy.tolist()))
            out_numpy = np.array([(self.model_name + ': ' + python_byte_array.decode('utf-8') + ' World').encode('utf-8') for python_byte_array in in_numpy.tolist()], dtype=np.object_)
            out_pb = pb_utils.Tensor("text_output", out_numpy)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_pb]))
        return responses
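Stripped of the Triton plumbing, `execute` just decodes each input byte string, prepends the model name, and appends " World". A dependency-free sketch of that transform (the `transform` helper is illustrative; real inputs arrive as a numpy `object_` array):

```python
def transform(model_name: str, inputs: list) -> list:
    """Mirror the byte-string transform in TritonPythonModel.execute
    (illustrative; plain Python list instead of a numpy object_ array)."""
    return [
        (model_name + ': ' + raw.decode('utf-8') + ' World').encode('utf-8')
        for raw in inputs
    ]


out = transform("demo_model", [b"Hello"])
print(out)  # [b'demo_model: Hello World']
```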
@@ -0,0 +1,22 @@
workspace: "./"

inference_image_name: "fedml/trt-llm-openai"

# The image has its own self-contained CMD; no need to override the command
container_run_command: null

port: 3000

readiness_probe:
  httpGet:
    path: "/health_check"

# If you do not use serverless container mode, and you want to indicate another resource path,
# e.g. localhost:3000/v1/chat/completions, you can set the following uri:
service:
  httpPost:
    path: "/v1/chat/completions"

deploy_timeout_sec: 1600

endpoint_api_type: "text2text_llm_openai_chat_completions"
10 changes: 10 additions & 0 deletions python/examples/deploy/debug/inference_timeout/config.yaml
@@ -0,0 +1,10 @@
workspace: "./src"
entry_point: "serve_main.py"
bootstrap: |
  echo "Bootstrap start..."
  sleep 5
  echo "Bootstrap finished"
auto_detect_public_ip: true
use_gpu: true

request_timeout_sec: 10
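The `request_timeout_sec` knob above pairs with the fail-fast and per-request timeout policies added earlier in this PR. Client-side, the effect can be sketched with a generic timeout wrapper (illustrative only — this is not FedML's implementation; `slow_predict` is a stand-in):

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_sec, *args):
    """Run fn(*args), raising TimeoutError if it exceeds timeout_sec."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        return future.result(timeout=timeout_sec)


def slow_predict(request):
    time.sleep(0.5)  # stand-in for a long-running inference call
    return {"echo": request}


try:
    call_with_timeout(slow_predict, 0.05, {"text": "Hello"})
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
print(timed_out)  # True: the 0.05 s budget was exceeded
```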
32 changes: 32 additions & 0 deletions python/examples/deploy/debug/inference_timeout/src/serve_main.py
@@ -0,0 +1,32 @@
from fedml.serving import FedMLPredictor
from fedml.serving import FedMLInferenceRunner
import uuid
import torch

# Calculate the number of elements
num_elements = 1_073_741_824 // 4 # using integer division for whole elements


class DummyPredictor(FedMLPredictor):
    def __init__(self):
        super().__init__()
        # Create a tensor with this many elements
        tensor = torch.empty(num_elements, dtype=torch.float32)

        # Move the tensor to GPU
        tensor_gpu = tensor.cuda()

        # for debug
        with open("/tmp/dummy_gpu_occupier.txt", "w") as f:
            f.write("GPU is occupied")

        self.worker_id = uuid.uuid4()

    def predict(self, request):
        return {f"AlohaV0From{self.worker_id}": request}


if __name__ == "__main__":
    predictor = DummyPredictor()
    fedml_inference_runner = FedMLInferenceRunner(predictor)
    fedml_inference_runner.run()
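The `num_elements` constant above is 1 GiB expressed in float32 elements; a quick check of that arithmetic:

```python
GIB = 1_073_741_824   # bytes in 1 GiB (2**30)
FLOAT32_BYTES = 4     # element size of torch.float32

num_elements = GIB // FLOAT32_BYTES
print(num_elements)                         # 268435456 elements
print(num_elements * FLOAT32_BYTES == GIB)  # True: the tensor occupies exactly 1 GiB
```

Moving a tensor of that size to the GPU deliberately pins roughly 1 GiB of device memory, which is what this debug example uses to exercise the timeout path.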
4 changes: 2 additions & 2 deletions python/examples/deploy/mnist/README.md
@@ -11,9 +11,9 @@ curl -XPOST localhost:2345/predict -d '{"arr":[$DATA]}'
#For $DATA, please check the request_input_example, it is a 28*28=784 float array
#Output:{"generated_text":"tensor([0.2333, 0.5296, 0.4350, 0.4537, 0.5424, 0.4583, 0.4803, 0.2862, 0.5507,\n 0.8683], grad_fn=<SigmoidBackward0>)"}
```
## Option 2: Deploy to the Cloud (Using fedml® launch platform)
## Option 2: Deploy to the Cloud (Using TensorOpera® launch platform)
Uncomment the following line in mnist.yaml,
for infomation about the configuration, please refer to fedml® launch.
for information about the configuration, please refer to TensorOpera® launch.
```yaml
# computing:
# minimum_num_gpus: 1
2 changes: 1 addition & 1 deletion python/examples/deploy/mnist/mnist.yaml
@@ -5,7 +5,7 @@ data_cache_dir: ""
bootstrap: ""

# If you do not have any GPU resource but want to serve the model
# Try FedML® Nexus AI Platform, and Uncomment the following lines.
# Try TensorOpera® Nexus AI Platform, and Uncomment the following lines.
# ------------------------------------------------------------
computing:
minimum_num_gpus: 1 # minimum # of GPUs to provision
6 changes: 3 additions & 3 deletions python/examples/deploy/multi_service/README.md
@@ -15,7 +15,7 @@ fedml model create --name $model_name --config_file config.yaml
```

## On-premise Deploy
Register an account on FedML website: https://fedml.ai
Register an account on TensorOpera website: https://tensoropera.ai

You will have a user id and api key, which can be found in the profile page.

@@ -44,8 +44,8 @@ You will have a user id and api key, which can be found in the profile page.
```
- Result

See the deployment result in https://fedml.ai
See the deployment result in https://tensoropera.ai

- OPT2: Deploy - UI

Follow the instructions on https://fedml.ai
Follow the instructions on https://tensoropera.ai
2 changes: 1 addition & 1 deletion python/examples/deploy/quick_start/README.md
@@ -16,7 +16,7 @@ Use -cf to indicate the configuration file.
curl -XPOST localhost:2345/predict -d '{"text": "Hello"}'
```

## Option 2: Deploy to the Cloud (Using fedml®launch platform)
## Option 2: Deploy to the Cloud (Using TensorOpera®launch platform)
- Uncomment the following line in config.yaml

For information about the configuration, please refer to fedml ® launch.