json output for thermal data. #434

Closed
wants to merge 57 commits into from
57 commits
5b30b13
add variorum_get_thermal_json function interfaces
kfan326 Jun 23, 2023
2fd9be9
push progress on thermal json api for broadwell *not yet working*
kfan326 Jun 23, 2023
d86b3c0
test implementation of thermals json api - currently prints in terminal
kfan326 Jun 26, 2023
aa2f863
broadwell thermal json api now working, but no nested access (objects…
kfan326 Jun 27, 2023
076d6ac
changed thermal json output to average pkg temp per socket
kfan326 Jun 27, 2023
fc8371b
added thermal json for Skylake through Kabylake
kfan326 Jun 28, 2023
7f3457a
changed Intel temp json to output per core thermal data
kfan326 Jun 28, 2023
40dd235
minor cleanup for thermal json
kfan326 Jun 28, 2023
1337ff0
remove unused parameters in thermal json api
kfan326 Jun 28, 2023
f9fa837
added thermal json support for nvidia volta gpu
kfan326 Jun 28, 2023
aae57d6
removed unused static int
kfan326 Jun 28, 2023
ac85c6a
implemented AMD GPU temp reporting json, changed Intel and Nvidia jso…
kfan326 Jun 29, 2023
d00c0cb
ran check format script
kfan326 Jun 29, 2023
80f0e77
debugging nvidia implementation still in progress
Jun 29, 2023
e5ae00e
added thermal json example
Jun 29, 2023
35ba990
fix typo
Jun 29, 2023
abe1c6e
nvidia thermal reporting now verified to work on lassen
Jun 29, 2023
50f85de
updated examples cmake list to include thermal example
Jun 29, 2023
7b0c3b1
thermal json verified to be working on tioga
Jun 30, 2023
6a5e6f9
ran code format script
kfan326 Jun 29, 2023
93d32de
ran check code format script
kfan326 Jun 30, 2023
36a8027
renamed json thermal api example
kfan326 Jun 30, 2023
89c5b96
optimized json output for intel thermals
kfan326 Jul 3, 2023
a4a695b
forgot to update thermal features
kfan326 Jul 3, 2023
d8adab7
ran code format script
kfan326 Jul 3, 2023
0988913
optimized nvidia json output format
kfan326 Jul 3, 2023
ee54da6
fixed nvidia json output bugs, tested on lassen to be working correctly
Jul 3, 2023
6109452
updated json api format for amd gpu - tested on tioga to be working
Jul 3, 2023
6db71eb
removed unused json object from thermal_features and ran code format …
kfan326 Jul 6, 2023
002349a
removed hostname from json output
kfan326 Jul 6, 2023
95ff781
removed unused json_t *parent object
kfan326 Jul 10, 2023
1d72ade
fixed error control reaches end of non_void function
Jul 10, 2023
e015aae
moved json object to variorum.c get thermal json function
Jul 17, 2023
61dafae
fix typo in thermal_features.c
Jul 17, 2023
1a292f8
ran code check script
kfan326 Jul 17, 2023
4d0659e
moved timestamp to cpu json object
kfan326 Jul 18, 2023
e5a8594
added IBM thermal json - tested on lassen
Jul 20, 2023
27b184a
ran code format script
kfan326 Jul 20, 2023
91f0035
added documentation for thermal json api
Jul 21, 2023
a873410
ran code check script
kfan326 Jul 21, 2023
d5d872c
addec copyright header to json.rst
kfan326 Jul 21, 2023
047fdf0
removed extra space in doc
Jul 31, 2023
39c4856
fix typo in doc
Aug 1, 2023
3d81c86
changes to thermal api - intel
Aug 23, 2023
389237a
vendor neutral compatibility changes
Aug 23, 2023
de1c605
fix typo
Aug 23, 2023
db34085
fixes
Aug 23, 2023
4baf5b3
ran check code script
kfan326 Aug 23, 2023
7720055
moved node object to variorum.h to save json function calls
Aug 24, 2023
9c91fa3
fix rebase
Aug 24, 2023
ba72ea3
fix typo
Aug 24, 2023
94b2c9b
fix typos on intel implementation and ran code format script
kfan326 Aug 24, 2023
0203fb0
moved timestamp to variorum.c
kfan326 Aug 24, 2023
3b6c73b
updated doc
kfan326 Aug 30, 2023
1a8052d
minor formatting
slabasan Nov 7, 2023
20f05ca
formatting
slabasan Nov 15, 2023
2df3b19
fix formatting
slabasan Nov 15, 2023
1 change: 1 addition & 0 deletions src/docs/sphinx/VariorumAPI.rst
@@ -26,6 +26,7 @@ implementations in Variorum are described in the following sections:
- :doc:`api/cap_functions`
- :doc:`api/json_support_functions`
- :doc:`api/enable_disable_functions`
- :doc:`api/json`

*******************
Variorum Wrappers
48 changes: 48 additions & 0 deletions src/docs/sphinx/api/json.rst
@@ -0,0 +1,48 @@
.. # Copyright 2019-2023 Lawrence Livermore National Security, LLC and other
# Variorum Project Developers. See the top-level LICENSE file for details.
#
# SPDX-License-Identifier: MIT

##########
JSON API
##########

*******************************
Obtaining Thermal Information
*******************************

The API to obtain node thermal data takes a string (``char**``) by reference as
input and populates it with a nested JSON object. The object is keyed by
hostname, then by socket_{number}, then by CPU and/or GPU (depending on the
platform, one or both may be present). The CPU object is further divided into
Core and Mem.

The ``variorum_get_thermals_json(char **)`` function stores the nested JSON
object in its argument as a string, which the caller is responsible for
freeing. An example is provided below::

    {
        "hostname": {
            "Socket_0": {
                "CPU": {
                    "Core": {
                        "temp_celsius_core_0": (Integer),
                        ...
                        "temp_celsius_core_i": (Integer)
                    },
                    "Mem": {
                        "temp_celsius_dimm_0": (Integer),
                        ...
                        "temp_celsius_dimm_i": (Integer)
                    }
                },
                "GPU": {
                    "temp_celsius_gpu_0": (Integer),
                    ...
                    "temp_celsius_gpu_i": (Integer)
                }
            },
            "timestamp" : (Integer)
        }
    }

Here, ``i`` is the index of the core, DIMM, or GPU, and ``0 <= i < n``, where
``n`` is the number of cores, DIMMs, or GPUs, respectively.
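
As a rough illustration of the key naming used in the layout above, the following sketch builds a per-sensor key string with ``snprintf``. The helper name ``format_sensor_key`` and its parameters are hypothetical and not part of Variorum; only the ``temp_celsius_<kind>_<index>`` pattern is taken from the example.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Variorum): writes a key such as
 * "temp_celsius_core_3" into buf. Returns 0 on success, or -1 if the
 * key would not fit in size bytes. */
static int format_sensor_key(char *buf, size_t size, const char *kind,
                             unsigned index)
{
    int n = snprintf(buf, size, "temp_celsius_%s_%u", kind, index);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Checking the ``snprintf`` return value against the buffer size means a truncated key is reported as an error rather than silently stored.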
1 change: 1 addition & 0 deletions src/examples/CMakeLists.txt
@@ -18,6 +18,7 @@ set(BASIC_EXAMPLES
variorum-enable-turbo-example
variorum-get-node-power-json-example
variorum-get-node-power-domain-info-json-example
variorum-get-node-thermal-json-example
variorum-integration-using-json-example
variorum-get-topology-info-example
variorum-monitoring-to-file-example
51 changes: 51 additions & 0 deletions src/examples/variorum-get-node-thermal-json-example.c
@@ -0,0 +1,51 @@
// Copyright 2019-2023 Lawrence Livermore National Security, LLC and other
// Variorum Project Developers. See the top-level LICENSE file for details.
//
// SPDX-License-Identifier: MIT

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

#include <variorum.h>
#include <variorum_topology.h>

int main(int argc, char **argv)
{
    int ret;
    char *s = NULL;

    const char *usage = "Usage: %s [-h] [-v]\n";
    int opt;
    while ((opt = getopt(argc, argv, "hv")) != -1)
    {
        switch (opt)
        {
            case 'h':
                printf(usage, argv[0]);
                return 0;
            case 'v':
                printf("%s\n", variorum_get_current_version());
                return 0;
            default:
                fprintf(stderr, usage, argv[0]);
                return -1;
        }
    }

    ret = variorum_get_thermals_json(&s);
    if (ret != 0)
    {
        /* Report the failure on stderr so it is not mixed into JSON output */
        fprintf(stderr, "JSON get thermals failed!\n");
        free(s);
        exit(-1);
    }

    /* Print the entire JSON object */
    puts(s);

    /* Deallocate the string */
    free(s);

    return ret;
}
84 changes: 82 additions & 2 deletions src/variorum/AMD_GPU/amd_gpu_power_features.c
@@ -273,6 +273,7 @@ void get_thermals_data(int chipid, int total_sockets, int verbose, FILE *output)
static int init = 0;
static struct timeval start;
struct timeval now;
int i;

gethostname(hostname, 1024);

@@ -316,8 +317,7 @@ void get_thermals_data(int chipid, int total_sockets, int verbose, FILE *output)

gettimeofday(&now, NULL);

for (int i = chipid * gpus_per_socket;
i < (chipid + 1) * gpus_per_socket; i++)
for (i = chipid * gpus_per_socket; i < (chipid + 1) * gpus_per_socket; i++)
{
int64_t temp_val = -1;
double temp_val_flt = -1.0;
@@ -379,6 +379,86 @@ void get_thermals_data(int chipid, int total_sockets, int verbose, FILE *output)
}
}

void get_thermals_json(int chipid, int total_sockets, json_t *output)
{
    rsmi_status_t ret;
    uint32_t num_devices;
    int gpus_per_socket;
    char hostname[1024];

    gethostname(hostname, 1024);

    ret = rsmi_init(0);
    if (ret != RSMI_STATUS_SUCCESS)
    {
        variorum_error_handler("Could not initialize RSMI",
                               VARIORUM_ERROR_PLATFORM_ENV,
                               getenv("HOSTNAME"), __FILE__, __FUNCTION__,
                               __LINE__);
        exit(-1);
    }

    ret = rsmi_num_monitor_devices(&num_devices);
    if (ret != RSMI_STATUS_SUCCESS)
    {
        variorum_error_handler("Could not get number of GPU devices",
                               VARIORUM_ERROR_PLATFORM_ENV,
                               getenv("HOSTNAME"), __FILE__, __FUNCTION__,
                               __LINE__);
        // num_devices is not valid here, so stop instead of using it.
        rsmi_shut_down();
        return;
    }

    gpus_per_socket = num_devices / total_sockets;

    // "socket_" plus the widest possible int plus the terminator fits in 32.
    char socketid[32];
    snprintf(socketid, sizeof(socketid), "socket_%d", chipid);

    // check if socket object is in node object
    json_t *socket_obj = json_object_get(output, socketid);
    if (socket_obj == NULL)
    {
        socket_obj = json_object();
        json_object_set_new(output, socketid, socket_obj);
    }

    // general gpu object
    json_t *gpu_obj = json_object();
    json_object_set_new(socket_obj, "GPU", gpu_obj);

    int i;
    for (i = chipid * gpus_per_socket; i < (chipid + 1) * gpus_per_socket; i++)
    {
        int64_t temp_val = -1;
        double temp_val_flt = -1.0;

        ret = rsmi_dev_temp_metric_get(i, RSMI_TEMP_TYPE_EDGE, RSMI_TEMP_CURRENT,
                                       &temp_val);
        if (ret != RSMI_STATUS_SUCCESS)
        {
            variorum_error_handler("RSMI API was not successful",
                                   VARIORUM_ERROR_PLATFORM_ENV,
                                   getenv("HOSTNAME"), __FILE__, __FUNCTION__,
                                   __LINE__);
            // Skip this GPU rather than reporting a bogus temperature.
            continue;
        }

        // Convert millidegrees to Celsius; cast first to keep the fraction.
        temp_val_flt = (double)temp_val / 1000.0;

        // gpu entry
        char gpuid[32];
        snprintf(gpuid, sizeof(gpuid), "temp_celsius_gpu_%d", i);
        json_object_set_new(gpu_obj, gpuid, json_real(temp_val_flt));
    }

    ret = rsmi_shut_down();

    if (ret != RSMI_STATUS_SUCCESS)
    {
        variorum_error_handler("Could not shutdown RSMI",
                               VARIORUM_ERROR_PLATFORM_ENV,
                               getenv("HOSTNAME"), __FILE__, __FUNCTION__,
                               __LINE__);
    }
}
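
The function above does two small pieces of arithmetic: it selects the GPU indices that belong to one socket, and it scales the raw reading by 1000 to get degrees Celsius. A minimal stand-alone sketch of both follows, assuming GPUs are split evenly across sockets as in the loop above. The helper names `gpu_range_for_socket` and `millideg_to_celsius` are hypothetical and do not exist in Variorum or ROCm SMI.

```c
#include <stdint.h>

/* Hypothetical helper: convert a millidegree reading to degrees Celsius.
 * The cast happens before the division so the fraction is preserved. */
static double millideg_to_celsius(int64_t millideg)
{
    return (double)millideg / 1000.0;
}

/* Hypothetical helper: compute the half-open range [first, last) of GPU
 * indices owned by socket chipid, assuming an even split across sockets.
 * Returns 0 on success, -1 on invalid arguments. */
static int gpu_range_for_socket(unsigned num_devices, unsigned total_sockets,
                                unsigned chipid, unsigned *first,
                                unsigned *last)
{
    if (total_sockets == 0 || chipid >= total_sockets)
    {
        return -1;
    }
    unsigned per_socket = num_devices / total_sockets;
    *first = chipid * per_socket;
    *last = (chipid + 1) * per_socket;
    return 0;
}
```

Note that in C an integer division followed by a cast to `double` truncates first, so a reading of 45500 would become 45.0 rather than 45.5.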

void get_clocks_data(int chipid, int total_sockets, int verbose, FILE *output)
{
rsmi_status_t ret;
3 changes: 3 additions & 0 deletions src/variorum/AMD_GPU/amd_gpu_power_features.h
@@ -8,6 +8,7 @@

#include <stdint.h>
#include <stdio.h>
#include <jansson.h>

#include <rocm_smi/rocm_smi.h>

@@ -22,4 +23,6 @@ void get_gpu_utilization_data(int chipid, int total_sockets, int verbose,
void cap_each_gpu_power_limit(int chipid, int total_sockets,
unsigned int powerlimit);

void get_thermals_json(int chipid, int total_sockets, json_t *output);

#endif
2 changes: 2 additions & 0 deletions src/variorum/AMD_GPU/config_amd_gpu.c
@@ -37,6 +37,8 @@ int set_amd_gpu_func_ptrs(int idx)
/* Initialize control interfaces */
g_platform[idx].variorum_cap_each_gpu_power_limit =
amd_gpu_instinct_cap_each_gpu_power_limit;
g_platform[idx].variorum_get_thermals_json =
amd_gpu_instinct_get_thermals_json;
}
else
{
21 changes: 21 additions & 0 deletions src/variorum/AMD_GPU/instinctGPU.c
@@ -77,6 +77,27 @@ int amd_gpu_instinct_get_thermals(int verbose)
return 0;
}

int amd_gpu_instinct_get_thermals_json(json_t *get_thermal_obj)
{
    char *val = getenv("VARIORUM_LOG");
    if (val != NULL && atoi(val) == 1)
    {
        printf("Running %s\n", __FUNCTION__);
    }

    unsigned iter = 0;
    unsigned nsockets;

    variorum_get_topology(&nsockets, NULL, NULL, P_AMD_GPU_IDX);

    for (iter = 0; iter < nsockets; iter++)
    {
        get_thermals_json(iter, nsockets, get_thermal_obj);
    }

    return 0;
}
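
The `VARIORUM_LOG` check at the top of the function above is a pattern that recurs across the platform back ends. As a sketch only, it could be factored into a small helper like the one below. The name `log_enabled_from_env` is hypothetical and is not a Variorum function.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 when the VARIORUM_LOG environment
 * variable is set and parses to 1, and 0 otherwise. */
static int log_enabled_from_env(void)
{
    const char *val = getenv("VARIORUM_LOG");
    return (val != NULL && atoi(val) == 1) ? 1 : 0;
}
```

The lookup has to go through `getenv`. Assigning the string literal `"VARIORUM_LOG"` directly would always parse to 0 with `atoi`, so logging would never be enabled.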

int amd_gpu_instinct_get_clocks(int verbose)
{
char *val = getenv("VARIORUM_LOG");
4 changes: 4 additions & 0 deletions src/variorum/AMD_GPU/instinctGPU.h
@@ -6,11 +6,15 @@
#ifndef INSTINCTGPU_H_INCLUDE
#define INSTINCTGPU_H_INCLUDE

#include <jansson.h>
#include <sys/time.h>

int amd_gpu_instinct_get_power(int verbose);
int amd_gpu_instinct_get_power_limit(int verbose);
int amd_gpu_instinct_get_thermals(int verbose);
int amd_gpu_instinct_get_clocks(int verbose);
int amd_gpu_instinct_get_gpu_utilization(int verbose);
int amd_gpu_instinct_cap_each_gpu_power_limit(unsigned int powerlimit);
int amd_gpu_instinct_get_thermals_json(json_t *get_thermal_obj);

#endif
62 changes: 62 additions & 0 deletions src/variorum/IBM/Power9.c
@@ -471,6 +471,68 @@ int ibm_cpu_p9_get_node_power_json(char **get_power_obj_str)
return 0;
}

int ibm_cpu_p9_get_node_thermal_json(json_t *get_thermal_obj)
{
    char *val = getenv("VARIORUM_LOG");
    if (val != NULL && atoi(val) == 1)
    {
        printf("Running %s\n", __FUNCTION__);
    }

    void *buf;
    int fd;
    int rc;
    int bytes;
    unsigned iter = 0;
    // Stays 0 (no sockets read) if the topology query is compiled out.
    unsigned nsockets = 0;

#ifdef VARIORUM_WITH_IBM_CPU
    variorum_get_topology(&nsockets, NULL, NULL, P_IBM_CPU_IDX);
#endif

    fd = open("/sys/firmware/opal/exports/occ_inband_sensors", O_RDONLY);
    if (fd < 0)
    {
        printf("Failed to open occ_inband_sensors file\n");
        return -1;
    }

    for (iter = 0; iter < nsockets; iter++)
    {
        lseek(fd, iter * OCC_SENSOR_DATA_BLOCK_SIZE, SEEK_SET);

        buf = malloc(OCC_SENSOR_DATA_BLOCK_SIZE);
        if (!buf)
        {
            printf("Failed to allocate\n");
            close(fd);
            return -1;
        }

        // read() may return fewer bytes than requested, so loop until the
        // whole sensor block has been read or an error/EOF occurs.
        for (rc = bytes = 0; bytes < OCC_SENSOR_DATA_BLOCK_SIZE; bytes += rc)
        {
            rc = read(fd, (char *)buf + bytes,
                      OCC_SENSOR_DATA_BLOCK_SIZE - bytes);

            if (rc <= 0)
            {
                break;
            }
        }

        if (bytes != OCC_SENSOR_DATA_BLOCK_SIZE)
        {
            printf("Failed to read data\n");
            free(buf);
            close(fd);
            return -1;
        }

        json_get_thermal_sensors(iter, get_thermal_obj, buf);
        free(buf);
    }

    close(fd);
    return 0;
}
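
The inner loop above handles the fact that POSIX `read` may return fewer bytes than requested. A minimal sketch of the same idea as a reusable helper is shown below. The name `read_full` is hypothetical and is not part of Variorum, and the retry on `EINTR` is an addition that the function above does not perform.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes from fd into buf.
 * Returns 0 on success, -1 on error or premature end of file. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    while (done < len)
    {
        ssize_t rc = read(fd, p + done, len - done);
        if (rc < 0)
        {
            if (errno == EINTR)
            {
                continue; /* interrupted by a signal, retry */
            }
            return -1;
        }
        if (rc == 0)
        {
            return -1; /* end of file before len bytes arrived */
        }
        done += (size_t)rc;
    }
    return 0;
}
```

With a helper of this shape, the caller has a single return value to check before releasing the buffer and closing the file descriptor on the error path.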

int ibm_cpu_p9_get_node_power_domain_info_json(char **get_domain_obj_str)
{
char *val = ("VARIORUM_LOG");
2 changes: 2 additions & 0 deletions src/variorum/IBM/Power9.h
@@ -24,4 +24,6 @@ int ibm_cpu_p9_get_node_power_json(char **get_power_obj_str);

int ibm_cpu_p9_get_node_power_domain_info_json(char **get_domain_obj_str);

int ibm_cpu_p9_get_node_thermal_json(json_t *get_thermal_obj);

#endif
2 changes: 2 additions & 0 deletions src/variorum/IBM/config_ibm.c
@@ -36,6 +36,8 @@ int set_ibm_func_ptrs(int idx)
g_platform[idx].variorum_get_node_power_json = ibm_cpu_p9_get_node_power_json;
g_platform[idx].variorum_get_node_power_domain_info_json =
ibm_cpu_p9_get_node_power_domain_info_json;
g_platform[idx].variorum_get_thermals_json =
ibm_cpu_p9_get_node_thermal_json;
}
else
{