Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make acl-config more resilient #1188

Merged
merged 13 commits into from
Feb 3, 2020
Merged

Conversation

ryanrath
Copy link
Contributor

@ryanrath ryanrath commented Dec 12, 2019

Description

  • acl-config:
    • Added Lock File protection to protect against multiple copies of acl-config running simultaneously
    • Added a signal handler that should ensure the lock file is unlocked when acl-config is ctrl-c'd etc.
    • We no longer need to run acls-xdmod-management multiple times within acl-config, we now run
      it once after the backup tables are handled.
    • We've refactored the way that acl-config works, previously we wholesale Truncated / re-
      populated the "managed" tables ( any table that we can completely recreate from available
      configuration files ) and then re-populated the "non-managed" tables i.e. user_acls and
      user_acl_group_by_parameters. This method did not allow us to utilize db transactions as both
      TRUNCATE and DDL ( drop / create / update table ) force an implicit commit on execution. This
      left us open to having the database in an inconsistent state. Now, instead of truncating / populating
      the tables directly we instead create / populate a set of staging tables from the configuration files
      first. This leaves the production tables in tact in the event some problem arises. Then, wrapped in
      a db transaction, we DELETE FROM && INSERT INTO each production table from it's associated
      staging table.

Motivation and Context

We have run into a number of problems lately that we believe to be caused by multiple copies of acl-config running at the same time and or interrupted runs of acl-config. It's hoped that these changes will resolve these issues.

Tests performed

  • All automated tests in an OpenXDMoD docker: upgrade / fresh install
  • Manual testing:
    • Inserting an Exception at random times to make sure that existing functionality is not compromised with an incomplete execution. Also, to make sure that we can still utilize the lock file.
    • Ctrl+C out of an execution at various times to make sure that we don't bork things with an incomplete execution.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

bin/acl-config Outdated Show resolved Hide resolved
@jpwhite4
Copy link
Member

Why do you create _bkup tables? Surely they are not necessary if you are using staging tables.

We've run into cases where multiple instances of `acl-config` have been run at
the same time. This bit of code should make it so there's only one instance of
`acl-config` running at any one time.
- `acl-config` should only concern itself with synchronizing the contents of the
  configuration files w/ the database. The initial creation of the acl tables
  can / should be handled when setting up the database.
There have been a couple of instances now where `acl-config` has not completed
successfully which has left the db in an inconsistent state. We previously had
limited protection against this for the "non-managed" tables ( user_acls,
user_acl_group_by_parameters). But really we should be protecting the entire
process against this class of error. To that end `acl-config` has been modified
to work as such:
  - we create a backup table for the "non-managed" tables. this will be used
    later in the process.
  - a `_staging` table is created for each acl related table. This new staging
    table is identical to the production table in all but name.
  - we process each module, hierarchy, and set of realms, group bys statistics
    etc. but instead of inserting the information into the production tables we
    populate the staging tables.
  - we populate the staging version of the "non-managed" tables from the backup
    tables.
  - drop the "non-managed" backup tables.
  - In a single DB transaction, we DELETE FROM all of the production tables and
    then INSERT INTO all the data from the corresponding staging tables.
  - finally, we drop the staging tables as they are no longer required.

This new process means that none of the information in the production tables is
modified until we reach the second to last step, and this entire step is
protected by a DB transaction.
- So it turns out that the vast majority of what `acls-import` was doing is not
  needed anymore. Currently we only need to keep it around for making sure that
  the public user is created ( which may be able to be moved later ) and the
  population of `report_template_acls` which needs to occur *after* `acl-config`
  is run.

- I've renamed `refreshAclTables` to `syncTableData`and added `$tables` as a
  function argument to make what it does a bit more explicit.
- I've added a bit of "optimization" to `syncTableData` that takes advantage of
  how we have the tables laid out. Basically every table is linked to either
  modules or a table that's linked to modules so instead of deleting /
  populating each table we can just delete modules and that will cascade to the
  rest of the tables. Then we can populate everything. I'm still up in the air
  on this one, it's probably safer to just delete / populate each table, but
  I'll toss it up for comment.
Without this here, on a fresh install, the Users `user_acl_group_by_parameters`
would not be populated.
- These should have been here to begin with but are being added now because of
  bugs encountered during development. Specifically, during an upgrade where
  more then one migration is executed ( i.e. 8.5.0 -> 8.5.1, 8.5.1 -> 9.0.0 ),
  even though we have the population of tables wrapped in a commit, we would end
  up with 3 duplicate sets of `realms` entries which in turn messed everything
  else up. Once these unique indexes were added that behavior went away ( but
  oddly enough, no errors were ever thrown before or after ). I checked the
  counts of each table before deletion, after deletion, and after population and
  they all checked out. I even directly queried the `realms` table after the
  commit and just before `acl-config` was done and it always came back `3`. This
  is something that I would like to pursue further, but it will have to be one
  of the things on my "Poke at in between finishing actual work" list.
After conducting some tests it was found that this piece of code is not needed.
If `acl-config` is interrupted the lock file is unlocked automatically.

**NOTE:** I did investigate the `@flock, @fclose, @unlink` blocks and whether or
  not they are needed. I ran into problems during the upgrade in which the
  `acl-config` runs that occur as part of the migration would complete
  successfully, but the one that runs on ingestion would fail to acquire a lock
  and bork the whole process. Also, during development when an exception would
  be thrown early in one of the migration `acl-config` runs the ingestion
  `acl-config` would also not be able to acquire a lock.
  Once I added the `@flock, @fclose, @unlink` blocks the ingestion `acl-config`
  worked correctly in both cases.

  I did also attempt to not include the `@unlink` but this brought back the
  problem with the ingestion `acl-config` not being able to acquire a lock
  again.
Updated the population sql queries for the non-managed tables to correctly
work with the staging tables.
All of this code is no longer necessary w/ the way that `acl-config` now
operates.
- A user should only ever have one instance of an acl, this index enforces that
  at the table level.
@ryanrath ryanrath merged commit 93af2fe into ubccr:xdmod9.0 Feb 3, 2020
@jtpalmer jtpalmer mentioned this pull request Jul 19, 2020
6 tasks
@jtpalmer jtpalmer added Category:ACL Access Control Lists enhancement Enhancement of the functionality of an existing feature labels Aug 10, 2020
@jtpalmer jtpalmer added this to the 9.0.0 milestone Aug 10, 2020
@jtpalmer jtpalmer changed the title Cleanup / make acl-config more resilient Make acl-config more resilient Aug 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category:ACL Access Control Lists enhancement Enhancement of the functionality of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants