fix(elasticsearch) Analytics indices creation on AWS ES #5502

Merged on Sep 29, 2022 (17 commits)

Contributor

@tomas-kubin tomas-kubin commented Jul 27, 2022

🥅 Goal

Fix issue #5376: analytics Elasticsearch indices are created incorrectly on AWS ES, which leaves the DataHub Analytics page broken.

🔍 Details

When running against AWS Elasticsearch (aka Amazon OpenSearch), analytics indices tend to have problems (see issue #5376 or search Slack for datahub_usage_event-000001). This PR introduces three changes in the create-indices.sh script:

  1. refactoring the script: It contained a lot of copy-pasted code and was not easy to follow or maintain. The refactoring adds comments, extracts repeatedly used operations into functions, and unifies approaches.
  2. adding index fix: When the script detects that the datahub_usage_event index was created incorrectly (probably by GMS when running with USE_AWS_ELASTICSEARCH incorrectly not set), it drops it and recreates it. This should help many struggling developers.
  3. configuration hint: When an ES endpoint call fails, the script tries to detect whether USE_AWS_ELASTICSEARCH should have been set and prints a hint about its usage.
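
The refactored flow described above might be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `resource_action` is a hypothetical helper, the handling of statuses other than 200/404/401/405 is an assumption, and the `sed` substitution assumes the prefix contains no `/` characters.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the actual create-indices.sh.

# Hypothetical helper: map the HTTP status of a GET on a resource to an action.
resource_action() {
  case "$1" in
    200)     echo "exists" ;;  # resource already present -> nothing to do
    404)     echo "create" ;;  # resource missing -> create it
    401|405) echo "hint"   ;;  # possibly a wrong USE_AWS_ELASTICSEARCH setting
    *)       echo "fail"   ;;  # assumption: anything else is treated as fatal
  esac
}

# Create the resource at $1 from the definition file $2 unless it already exists.
create_if_not_exists() {
  local address="$1" definition="$2" status
  status=$(curl -o /dev/null -s -w "%{http_code}" \
    --header "$ELASTICSEARCH_AUTH_HEADER" "$ELASTICSEARCH_URL/$address")

  case "$(resource_action "$status")" in
    exists)
      echo ">>> $address already exists"
      ;;
    create)
      echo ">>> creating $address ..."
      # replace every occurrence of PREFIX in the definition with the actual prefix
      sed "s/PREFIX/${PREFIX}/g" "$INDEX_DEFINITIONS_ROOT/$definition" |
        curl -s -XPUT --header "$ELASTICSEARCH_AUTH_HEADER" \
          --header "Content-Type: application/json" \
          --data @- "$ELASTICSEARCH_URL/$address"
      ;;
    hint)
      echo ">>> failed to GET $address ($status) !"
      echo "... make sure you have correct USE_AWS_ELASTICSEARCH env value set (current=$USE_AWS_ELASTICSEARCH)"
      return 1
      ;;
    *)
      echo ">>> failed to GET $address ($status) !"
      return 1
      ;;
  esac
}
```

Keeping the status-to-action mapping in its own function makes the decision logic checkable without a running cluster, for example `resource_action 404` prints `create`.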

🧪 Testing

I built the modified elasticsearch-setup-job image, used it in my DataHub Helm charts, and deployed with those charts.

My setup uses Amazon OpenSearch. I did not test against non-AWS Elasticsearch.

Case 1: clean slate

  • Nuke everything
  • Deploy the helm charts
  • Result: indexes created successfully
elasticsearch-setup-job log
2022/07/28 17:12:40 Waiting for: https://xxx.es.amazonaws.com:443
2022/07/28 17:12:40 Received 200 from https://xxx.es.amazonaws.com:443

>>> creating _opendistro/_ism/policies/datahub_usage_event_policy ...
{
  "policy": {
    "policy_id": "datahub_usage_event_policy",
    "description": "Datahub Usage Event Policy",
    "default_state": "Rollover",
    "schema_version": 1,
    "states": [
      {
        "name": "Rollover",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "ReadOnly",
            "conditions": {
              "min_index_age": "7d"
            }
          }
        ]
      },
      {
        "name": "ReadOnly",
        "actions": [
          {
            "read_only": {}
          }
        ],
        "transitions": [
          {
            "state_name": "Delete",
            "conditions": {
              "min_index_age": "60d"
            }
          }
        ]
      },
      {
        "name": "Delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": [
        "datahub_usage_event-*"
      ],
      "priority": 100
    }
  }
}{"_id":"datahub_usage_event_policy","_version":1,"_primary_term":1,"_seq_no":0,"policy":{"policy":{"policy_id":"datahub_usage_event_policy","description":"Datahub Usage Event Policy","last_updated_time":1659028360937,"schema_version":1,"error_notification":null,"default_state":"Rollover","states":[{"name":"Rollover","actions":[{"rollover":{"min_index_age":"1d"}}],"transitions":[{"state_name":"ReadOnly","conditions":{"min_index_age":"7d"}}]},{"name":"ReadOnly","actions":[{"read_only":{}}],"transitions":[{"state_name":"Delete","conditions":{"min_index_age":"60d"}}]},{"name":"Delete","actions":[{"delete":{}}],"transitions":[]}],"ism_template":[{"index_patterns":["datahub_usage_event-*"],"priority":100,"last_updated_time":1659028360937}]}}}
>>> creating _template/datahub_usage_event_index_template ...
{
  "index_patterns": ["datahub_usage_event-*"],
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "type": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date"
      },
      "userAgent": {
        "type": "keyword"
      },
      "browserId": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index.opendistro.index_state_management.rollover_alias": "datahub_usage_event"
  }
}{"acknowledged":true}
>>> creating datahub_usage_event-000001 ...
{
  "aliases": {
    "datahub_usage_event": {
      "is_write_index": true
    }
  }
}
2022/07/28 17:12:41 Command finished successfully.

Case 2: invalid index

  • Nuke everything
  • Deploy with USE_AWS_ELASTICSEARCH not set -> elasticsearch-setup-job fails (see log below)
  • Restart GMS
  • Result: analytics not working, but there is a configuration hint in the elasticsearch-setup-job logs
elasticsearch-setup-job log
2022/07/28 17:20:49 Waiting for: https://xxx.es.amazonaws.com:443
2022/07/28 17:20:49 Received 200 from https://xxx.es.amazonaws.com:443

>>> failed to GET _ilm/policy/datahub_usage_event_policy (401) !
... looks like AWS OpenSearch is used; please set USE_AWS_ELASTICSEARCH env value to true
2022/07/28 17:20:49 Command exited with error: exit status 1
  • Redeploy with correctly set USE_AWS_ELASTICSEARCH=true
  • Result: elasticsearch-setup-job runs successfully, analytics now working correctly
elasticsearch-setup-job log
2022/07/28 17:26:10 Received 200 from https://xxx.es.amazonaws.com:443

>>> creating _opendistro/_ism/policies/datahub_usage_event_policy ...
{
  "policy": {
    "policy_id": "datahub_usage_event_policy",
    "description": "Datahub Usage Event Policy",
    "default_state": "Rollover",
    "schema_version": 1,
    "states": [
      {
        "name": "Rollover",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "ReadOnly",
            "conditions": {
              "min_index_age": "7d"
            }
          }
        ]
      },
      {
        "name": "ReadOnly",
        "actions": [
          {
            "read_only": {}
          }
        ],
        "transitions": [
          {
            "state_name": "Delete",
            "conditions": {
              "min_index_age": "60d"
            }
          }
        ]
      },
      {
        "name": "Delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": [
        "datahub_usage_event-*"
      ],
      "priority": 100
    }
  }
}{"_id":"datahub_usage_event_policy","_version":1,"_primary_term":1,"_seq_no":0,"policy":{"policy":{"policy_id":"datahub_usage_event_policy","description":"Datahub Usage Event Policy","last_updated_time":1659029170348,"schema_version":1,"error_notification":null,"default_state":"Rollover","states":[{"name":"Rollover","actions":[{"rollover":{"min_index_age":"1d"}}],"transitions":[{"state_name":"ReadOnly","conditions":{"min_index_age":"7d"}}]},{"name":"ReadOnly","actions":[{"read_only":{}}],"transitions":[{"state_name":"Delete","conditions":{"min_index_age":"60d"}}]},{"name":"Delete","actions":[{"delete":{}}],"transitions":[]}],"ism_template":[{"index_patterns":["datahub_usage_event-*"],"priority":100,"last_updated_time":1659029170348}]}}}
>>> creating _template/datahub_usage_event_index_template ...
{
  "index_patterns": ["datahub_usage_event-*"],
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "type": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date"
      },
      "userAgent": {
        "type": "keyword"
      },
      "browserId": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index.opendistro.index_state_management.rollover_alias": "datahub_usage_event"
  }
}{"acknowledged":true}
>>> deleting invalid datahub_usage_event ...
{"acknowledged":true}
>>> creating datahub_usage_event-000001 ...
{
  "aliases": {
    "datahub_usage_event": {
      "is_write_index": true
    }
  }
}
2022/07/28 17:26:11 Command finished successfully.

Case 3: no-change

  • Redeploy with some unrelated bogus change
  • Result: analytics still working
elasticsearch-setup-job log
2022/07/28 17:28:32 Waiting for: https://xxx.es.amazonaws.com:443
2022/07/28 17:28:32 Received 200 from https://xxx.es.amazonaws.com:443

>>> _opendistro/_ism/policies/datahub_usage_event_policy already exists ✓

>>> _template/datahub_usage_event_index_template already exists ✓

>>> datahub_usage_event-000001 already exists ✓
2022/07/28 17:28:33 Command finished successfully.

☑️ Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues
  • Tests for the changes have been added/updated (not applicable)
  • Docs related to the changes have been added/updated (adding several comments into the script itself)
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub (no downtime expected)

The script contains a lot of copy-pasted code and is not easy to follow.
Add comments, extract commonly used operations into functions, unify approaches.
Fix the issue where Amazon OpenSearch (AWS ES) indices are incorrectly initialised
and the Analytics screen shows only errors.
@anshbansal anshbansal added community-contribution PR or Issue raised by member(s) of DataHub Community devops PR or Issue related to DataHub backend & deployment labels Jul 27, 2022

github-actions bot commented Jul 27, 2022

Unit Test Results (build & test)

584 tests  ±0   580 ✔️ ±0   12m 48s ⏱️ -7s
143 suites ±0       4 💤 ±0 
143 files   ±0       0 ±0 

Results for commit 0873672. ± Comparison against base commit 9e7bd1a.

♻️ This comment has been updated with latest results.

mention USE_AWS_ELASTICSEARCH env value if it seems it's set the wrong way
@tomas-kubin tomas-kubin marked this pull request as ready for review July 28, 2022 17:32
fi

# path where index definitions are stored
INDEX_DEFINITIONS_ROOT=/index/usage-event

Collaborator

thank you!

elif [ $RESOURCE_STATUS -eq 404 ]; then
# resource doesn't exist -> need to create it
echo -e "\n>>> creating $RESOURCE_ADDRESS ..."
# use the given path as definition, but first replace all occurrences of PREFIX with the actual prefix

Collaborator

These comments are super helpful. Thank you.


elif [ $RESOURCE_STATUS -eq 401 ] || [ $RESOURCE_STATUS -eq 405 ]; then
echo -e "\n>>> failed to GET $RESOURCE_ADDRESS ($RESOURCE_STATUS) !"
echo "... make sure you have correct USE_AWS_ELASTICSEARCH env value set (current=$USE_AWS_ELASTICSEARCH)"

Collaborator

Why AWS specific?

Contributor Author

Not having USE_AWS_ELASTICSEARCH set to true is a common issue that several people have run into, and it then causes Analytics to go wrong. I observed that the 401/405 errors are linked to this particular misconfiguration and figured it might be helpful for people to be aware of the obvious fix. (See the log in the Testing / Case 2 section of the PR description.)

However, I can see this looks too specific for a general-purpose function. Let me add a condition checking whether $ELASTICSEARCH_URL contains some AWS-specific substring and display the message only then.
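
A minimal sketch of such a condition, under stated assumptions: the substring `amazonaws.com` is a guess at what "AWS-specific" could mean (it matches the endpoint in the Case 2 log), and both function names are hypothetical rather than taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only print the AWS hint when the URL looks like AWS.

# Assumption: AWS-hosted endpoints contain "amazonaws.com" in their host name.
looks_like_aws() {
  case "$1" in
    *amazonaws.com*) return 0 ;;
    *)               return 1 ;;
  esac
}

# Print a hint for endpoint URL $1, given the current USE_AWS_ELASTICSEARCH value $2.
print_config_hint() {
  local url="$1" current="${2:-}"
  if looks_like_aws "$url" && [ "$current" != "true" ]; then
    echo "... looks like AWS OpenSearch is used; please set USE_AWS_ELASTICSEARCH env value to true"
  fi
}
```

It could be called from the 401/405 branch as `print_config_hint "$ELASTICSEARCH_URL" "$USE_AWS_ELASTICSEARCH"`, which stays silent for non-AWS endpoints and when the flag is already `true`.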

USAGE_EVENT_STATUS=$(curl -o /dev/null -s -w "%{http_code}\n" --header "$ELASTICSEARCH_AUTH_HEADER" "$ELASTICSEARCH_URL/${PREFIX}datahub_usage_event")
if [ $USAGE_EVENT_STATUS -eq 200 ]; then
USAGE_EVENT_DEFINITION=$(curl -s --header "$ELASTICSEARCH_AUTH_HEADER" "$ELASTICSEARCH_URL/${PREFIX}datahub_usage_event")
# the definition is expected to contain "datahub_usage_event-000001" string

Collaborator

Why is that?

Collaborator

(Consider adding to comment here)

if [[ $USAGE_EVENT_DEFINITION != *"datahub_usage_event-000001"* ]]; then
# ... if it doesn't, we need to drop it
echo -e "\n>>> deleting invalid datahub_usage_event ..."
curl -s -XDELETE --header "$ELASTICSEARCH_AUTH_HEADER" "$ELASTICSEARCH_URL/${PREFIX}datahub_usage_event"

Collaborator

Deleting an index in the setup job seems dangerous. We need to be EXTREMELY careful that this doesn't affect existing deployments. Can you explain a bit more why this drop is necessary?

Collaborator

Will analytics necessarily not work if the condition to enter this block is met?

Contributor Author

Yes. If the index was previously created with USE_AWS_ELASTICSEARCH not set, and the setting is later switched to true, the Analytics tab doesn't work and no analytics events ever get recorded. (The name matches, but the contents do not.) The only solution (and the core fix of this PR) is to drop the wrong index and recreate it using aws_es_usage_event.json as the source.

This block is executed only if USE_AWS_ELASTICSEARCH=true (we are on AWS) and only if there is an existing datahub_usage_event index which doesn't contain the AWS-specific part — indicating the case described above.

What could go wrong? Maybe if the name of the index is later changed and this condition is not adjusted appropriately. Let me extract it to a constant to be safe and add some more comments in the code.
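
The guard described above could be sketched as follows. The constant name `USAGE_EVENT_FIRST_INDEX` and the function name are hypothetical; only the index name and the two conditions come from this thread. As far as I understand, a GET on an alias returns the definition keyed by the concrete index name, which would explain why a correct setup contains the `-000001` string.

```shell
#!/usr/bin/env bash
# Hypothetical constant name; the value is the index name used in the script.
USAGE_EVENT_FIRST_INDEX="datahub_usage_event-000001"

# Succeeds (exit 0) when the existing datahub_usage_event should be dropped:
#   $1 = value of USE_AWS_ELASTICSEARCH
#   $2 = JSON body returned by GET $ELASTICSEARCH_URL/${PREFIX}datahub_usage_event
should_drop_usage_event() {
  local use_aws="$1" definition="$2"
  # never drop anything unless we are configured for AWS
  [ "$use_aws" = "true" ] || return 1
  case "$definition" in
    *"$USAGE_EVENT_FIRST_INDEX"*) return 1 ;;  # definition mentions the rollover index -> keep
    *)                            return 0 ;;  # AWS-specific part is missing -> drop
  esac
}
```

Extracting the check keeps the destructive DELETE call behind a single, named condition, so a later rename of the index only needs to touch the constant.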

Collaborator

@RyanHolstien RyanHolstien Aug 30, 2022

Dropping is probably too harsh here; we have an alias. If it's a mapping that needs to be fixed, it should use the reindex API rather than a full drop of the index.
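
For illustration, the reindex alternative suggested here might look roughly like the sketch below. It is not what the PR implements: the function names are hypothetical, and the caller would have to choose the source and destination index names.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the reindex alternative; not part of the PR.

# Build the JSON body for POST _reindex, copying documents from $1 into $2.
reindex_body() {
  printf '{"source":{"index":"%s"},"dest":{"index":"%s"}}' "$1" "$2"
}

# Copy all documents from index $1 into index $2 via the _reindex API.
reindex_usage_events() {
  local source_index="$1" dest_index="$2"
  reindex_body "$source_index" "$dest_index" |
    curl -s -XPOST --header "$ELASTICSEARCH_AUTH_HEADER" \
      --header "Content-Type: application/json" \
      --data @- "$ELASTICSEARCH_URL/_reindex"
}
```

As far as I know, an alias cannot share its name with an existing index, so the old `datahub_usage_event` index would presumably still need to be removed after the copy before the alias can be created. Reindexing would preserve the recorded events rather than avoid the delete entirely.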

# create indices for ES (non-AWS)
function create_datahub_usage_event_datastream() {
create_if_not_exists "_ilm/policy/${PREFIX}datahub_usage_event_policy" policy.json
create_if_not_exists "_index_template/${PREFIX}datahub_usage_event_index_template" index_template.json

Collaborator

Do we not need a line like this here:

 create_if_not_exists "${PREFIX}datahub_usage_event" aws_es_usage_event.json

to ensure that the index is actually created for non aws cases?

Collaborator

Because I do think there's a separate issue actively open where on fresh quickstarts without any usage events, the analytics index will be missing

Contributor Author

I'm not sure how this works outside of AWS. The behavior in a non-AWS environment should be the same as before the refactoring.

Can you link the issue mentioned?

Collaborator

It's this PR in question: #5974

@pedro93 pedro93 merged commit 596d484 into datahub-project:master Sep 29, 2022
@pedro93 pedro93 mentioned this pull request Oct 5, 2022
5 tasks
@mattmatravers mattmatravers mentioned this pull request Oct 26, 2022
5 tasks