Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slurm collector crashes after switch to winter/standard time #518

Closed
QuantumDancer opened this issue Oct 30, 2023 · 1 comment · Fixed by #521
Closed

Slurm collector crashes after switch to winter/standard time #518

QuantumDancer opened this issue Oct 30, 2023 · 1 comment · Fixed by #521
Assignees

Comments

@QuantumDancer
Copy link
Contributor

QuantumDancer commented Oct 30, 2023

systemd[1]: Started AUDITOR SLURM collector.
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[SLURM COLLECTOR - START]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.734550119Z","target":"auditor_slurm_collector","line":84,"file":"collectors/slurm/src/main.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[STARTING SACCT MONITOR - START]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.736869247Z","target":"auditor_slurm_collector::sacctcaller","line":40,"file":"collectors/slurm/src/sacctcaller.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[STARTING SACCT MONITOR - END]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.736968219Z","target":"auditor_slurm_collector::sacctcaller","line":40,"file":"collectors/slurm/src/sacctcaller.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766","elapsed_milliseconds":0}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[STARTING AUDITORSENDER - START]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.737816231Z","target":"auditor_slurm_collector::auditorsender","line":28,"file":"collectors/slurm/src/auditorsender.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[STARTING AUDITORSENDER - END]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.737872625Z","target":"auditor_slurm_collector::auditorsender","line":28,"file":"collectors/slurm/src/auditorsender.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766","elapsed_milliseconds":0}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[CALLING SACCT AND PARSING OUTPUT - START]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.738162231Z","target":"auditor_slurm_collector::sacctcaller","line":94,"file":"collectors/slurm/src/sacctcaller.rs"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: thread 'tokio-runtime-worker' panicked at 'Ambiguous local time, ranging from 2023-10-29T02:22:07+01:00 to 2023-10-29T02:22:07+02:00', collectors/slurm/src/database.rs:93:59
Oct 30 09:31:27 auditor-slurm-collector[17165]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[CALLING SACCT AND PARSING OUTPUT - END]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.738773842Z","target":"auditor_slurm_collector::sacctcaller","line":94,"file":"collectors/slurm/src/sacctcaller.rs","elapsed_milliseconds":0}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[SLURM COLLECTOR - EVENT] Shutdown signal from inside application received.","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.738940071Z","target":"auditor_slurm_collector","line":139,"file":"collectors/slurm/src/main.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[SENDING SHUTDOWN SIGNAL TO SACCTCALLER AND AUDITORSENDER - START]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.738975333Z","target":"auditor_slurm_collector::shutdown","line":50,"file":"collectors/slurm/src/shutdown.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[SENDING SHUTDOWN SIGNAL TO SACCTCALLER AND AUDITORSENDER - END]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.73906013Z","target":"auditor_slurm_collector::shutdown","line":50,"file":"collectors/slurm/src/shutdown.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766","elapsed_milliseconds":0}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[SLURM COLLECTOR - EVENT] Could not send shutdown signal to tasks: channel closed\n\nLocation:\n    collectors/slurm/src/shutdown.rs:56:13","level":50,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.73913181Z","target":"auditor_slurm_collector","line":144,"file":"collectors/slurm/src/main.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766"}
Oct 30 09:31:27 auditor-slurm-collector[17165]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"[SLURM COLLECTOR - END]","level":30,"hostname":"slurm.bfg.uni-freiburg.de","pid":17165,"time":"2023-10-30T08:31:27.739205015Z","target":"auditor_slurm_collector","line":84,"file":"collectors/slurm/src/main.rs","run_id":"c5762663-b115-4d8d-8e5d-8ea09b30d766","elapsed_milliseconds":4}
systemd[1]: auditor-slurm-collector.service holdoff time over, scheduling restart.
systemd[1]: Stopped AUDITOR SLURM collector.

Not sure if related to #249

@QuantumDancer
Copy link
Contributor Author

QuantumDancer commented Oct 30, 2023

This happens because a timestamp with a negative TZ is stored in the DB:

sqlite> SELECT lastcheck, jobid FROM lastcheck;
2023-10-29T02:22:07-01:00|slurm-6811184

The crash is caused when reading the lastcheck datetime object from the database:

pub(crate) async fn get_lastcheck(&self) -> Result<(DateTime<Local>, String)> {
struct Row {
lastcheck: NaiveDateTime,
jobid: String,
}
match sqlx::query_as!(Row, r#"SELECT lastcheck, jobid FROM lastcheck"#)
.fetch_optional(&self.db_pool)
.await?
{
Some(Row { lastcheck, jobid }) => {
Ok((Local.from_local_datetime(&lastcheck).unwrap(), jobid))
}
None => {
let datetime = CONFIG.earliest_datetime;
tracing::info!(
"No last check date found in database. Assuming {}",
datetime.format("%FT%T")
);
Ok((datetime, String::new()))
}
}
}

Minimal working example to reproduce the crash:

use chrono::{offset::Local, offset::TimeZone, NaiveDateTime};

fn main() {
    let lastcheck =
        NaiveDateTime::parse_from_str("2023-10-29T02:22:07-01:00", "%Y-%m-%dT%H:%M:%S%:z").unwrap();
    let datetime = Local.from_local_datetime(&lastcheck).unwrap();
    println!("{}", datetime);
}

Local.from_local_datetime() returns a LocalResult. Instead of using unwrap(), we must handle the three cases independently.

Something like this:

use chrono::{offset::Local, offset::TimeZone, NaiveDateTime};

fn main() {
    let lastcheck =
        NaiveDateTime::parse_from_str("2023-10-29T02:22:07-01:00", "%Y-%m-%dT%H:%M:%S%:z").unwrap();
    let datetime = match Local.from_local_datetime(&lastcheck) {
        chrono::LocalResult::None => todo!(),  // probably return an error here
        chrono::LocalResult::Single(datetime) => datetime,
        chrono::LocalResult::Ambiguous(datetime1, datetime2) => {
            println!("datetime1: {}", datetime1);
            println!("datetime2: {}", datetime2);
            datetime1  // not yet sure whether to pick datetime1 or datetime2
        }
    };
}

This returns

datetime1: 2023-10-29 02:22:07 +01:00
datetime2: 2023-10-29 02:22:07 +02:00

For me, it's unclear yet whether to return the first or second timestamp in the Ambiguous case. Probably, it makes sense to just return the earlier timestamp (not sure if this is always datetime1 -> check docs/code), because then we are on the safe side, i.e. don't lose any jobs. Duplicate jobs should not be a problem, since we also have the last job id stored in the DB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants