Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] BWC Tests get failed with different shard number for text chunking processor #690

Closed
yuye-aws opened this issue Apr 12, 2024 · 28 comments
Assignees
Labels
bug Something isn't working

Comments

@yuye-aws
Copy link
Member

yuye-aws commented Apr 12, 2024

What is the bug?

The BWC tests get failed after changing the shard number in index. The error is due to "index not found" even if we assert the response after index creation is true.

How can one reproduce the bug?

Follow the code change in these two issues to change the index setting: #684 #685. And then run the BWC tests.

What is the expected behavior?

The BWC tests should keep consistent and regardless of the index setting.

What is your host/environment?

Linux environment on Github.

Do you have any screenshots?

Error log in these two issues: #684 #685

Do you have any additional context?

No

@yuye-aws yuye-aws added bug Something isn't working untriaged labels Apr 12, 2024
@yuye-aws
Copy link
Member Author

Changing the index setting may not affect other bwc tests. I will close the PR: #685

@yuye-aws
Copy link
Member Author

Reopned PR for debugging: #685

@yuye-aws
Copy link
Member Author

I will take a look into this feature in the next few days. Feel free to assign this issue to me.

@yuye-aws yuye-aws changed the title [BUG] BWC Tests get failed with different index setting [BUG] BWC Tests get failed with different index setting in text chunking processor Apr 27, 2024
@yuye-aws
Copy link
Member Author

yuye-aws commented May 7, 2024

Latest observation: BWC tests get passed for 2.14.0-SNAPSHOT but still fail in 2.13.0.

@yuye-aws
Copy link
Member Author

yuye-aws commented May 7, 2024

BWC tests are failing in version 2.13.0 because the neural search code are fetched from url: https://ci.opensearch.org/ci/dbc/distribution-build-opensearch. Here are a few options to resolve this issue:

  1. Disable text chunking BWC tests for version 2.13.
  2. Fetch artifacts for 2.13 BWC tests: https://artifacts.opensearch.org/releases/bundle/opensearch.
  3. Simply close this issue and do nothing with the code because BWC tests is already passing with shard number at three.
  4. Replace the bundle in https://ci.opensearch.org/ci/dbc/distribution-build-opensearch.
  5. Change the version from "2.13.0" to "2.13.0-SNAPSHOT" in backwards_compatibility_tests_workflow.yml

@yuye-aws yuye-aws changed the title [BUG] BWC Tests get failed with different index setting in text chunking processor [BUG] BWC Tests get failed with different shard number for text chunking processor May 7, 2024
@yuye-aws
Copy link
Member Author

yuye-aws commented May 7, 2024

I personally prefer Option 1 and 3.

@vibrantvarun
Copy link
Member

@martin-gaievski and @navneet1v what are your thoughts on this? Shall we go for 2.13.1?

@martin-gaievski
Copy link
Member

I think we need a proper fix for this issue, BWC should be executed otherwise we're flying blind. Sounds like options 1 and 3 are essentially ignoring the failure.
What are differences for artifacts that are in https://artifacts.opensearch.org/releases/bundle/opensearch and https://ci.opensearch.org/ci/dbc/distribution-build-opensearch? For 2.13 release that is already finalize some couple of months back it should be same in both locations.

@yuye-aws
Copy link
Member Author

https://artifacts.opensearch.org/releases/bundle/opensearch is for unreleased version. These snapshots will be updated according to our latest code.

https://ci.opensearch.org/ci/dbc/distribution-build-opensearch is for released version. They are fixed once the certain version get released.

@yuye-aws
Copy link
Member Author

I think we need a proper fix for this issue, BWC should be executed otherwise we're flying blind. Sounds like options 1 and 3 are essentially ignoring the failure.

Actually the failure has been fixed since 2.14. Only open source 2.13 has this failure when user misconfigure their index with improper shard number.

@yuye-aws
Copy link
Member Author

Hi @martin-gaievski and @vibrantvarun , what's your suggestion on the "proper fix" for this issue?

@vibrantvarun
Copy link
Member

Releasing 2.13.1 is the only option for bwc tests to pick up the artifact from ci url . However, you can once try running the tests with 2.13.0-SNAPSHOT. If it passes with 2.13.0-SNAPSHOT I have no issues in closing this.

cc: @martin-gaievski

@martin-gaievski
Copy link
Member

+1 to @vibrantvarun comment, having successful test run for 2.13.0 should be enough

@yuye-aws
Copy link
Member Author

Releasing 2.13.1 is the only option for bwc tests to pick up the artifact from ci url . However, you can once try running the tests with 2.13.0-SNAPSHOT. If it passes with 2.13.0-SNAPSHOT I have no issues in closing this.

cc: @martin-gaievski

I can raise a PR changing the snapshot to 2.13.0-SNAPSHOT, but I am not sure whether it makes sense to change the workflow due to a single test, even when the test is actually passing with the current setting.

@yuye-aws
Copy link
Member Author

Releasing 2.13.1 is the only option for bwc tests to pick up the artifact from ci url . However, you can once try running the tests with 2.13.0-SNAPSHOT. If it passes with 2.13.0-SNAPSHOT I have no issues in closing this.

cc: @martin-gaievski

I am not sure whether you mean we need to update our CI workflow with this PR: #752

@yuye-aws
Copy link
Member Author

From my observation, CI still gets failed in https://github.com/opensearch-project/neural-search/actions/runs/9157136803/job/25172907185?pr=684 with updated CI workflow. We can deep dive this error during the meeting.

@martin-gaievski
Copy link
Member

I've restarted that run for 2.13.0-SNAPSHOT and this time it's successful https://github.com/opensearch-project/neural-search/actions/runs/9157136803/job/25247195168?pr=684. Same versions for different platform is also green - https://github.com/opensearch-project/neural-search/actions/runs/9157136803/job/25247196478?pr=684.

It's looks ok at the first glance, I understand we may have flaky tests. @vibrantvarun do you have context, is this a flaky test issue or it was failing constantly?

@yuye-aws
Copy link
Member Author

Here is the failing link: https://github.com/opensearch-project/neural-search/actions/runs/9157136803/job/25172907185?pr=684. It is the first time this issue gets falky. I am wondering whether this issue is related to BWC workflow. Does @vibrantvarun know more details?

@martin-gaievski
Copy link
Member

martin-gaievski commented May 22, 2024

I've made few test runs for #684 , got mixed results https://github.com/opensearch-project/neural-search/actions/runs/9157136803/job/25255088325?pr=684. I think now it's more like a flaky test, as per my understanding it's different from what we had initially when test always failed.

jdk11, linux, PASS
jdk11, win, PASS
jdk17, linux, PASS
jdk17, win, FAIL
jdk21, linux, FAIL
jdk21, win, PASS

@yuye-aws
Copy link
Member Author

Thanks @martin-gaievski for providing more results? Can we check the snapshots in https://artifacts.opensearch.org/releases/bundle/opensearch?

@yuye-aws
Copy link
Member Author

yuye-aws commented May 23, 2024

Our conclusion:
Try running 2.13.0-SNAPSHOT BWC tests on local machine. If the tests get pass with shard number 1, I will paste the results and @vibrantvarun will close this issue. Also try 2.14.0 BWC tests.

@yuye-aws
Copy link
Member Author

Latest update: 2.13.0-SNAPSHOT does not include the PR fixing the bug.

@yuye-aws
Copy link
Member Author

BWC tests are failing in version 2.13.0 because the neural search code are fetched from url: https://ci.opensearch.org/ci/dbc/distribution-build-opensearch. Here are a few options to resolve this issue:

  1. Disable text chunking BWC tests for version 2.13.
  2. Fetch artifacts for 2.13 BWC tests: https://artifacts.opensearch.org/releases/bundle/opensearch.
  3. Simply close this issue and do nothing with the code because BWC tests is already passing with shard number at three.
  4. Replace the bundle in https://ci.opensearch.org/ci/dbc/distribution-build-opensearch.
  5. Change the version from "2.13.0" to "2.13.0-SNAPSHOT" in backwards_compatibility_tests_workflow.yml

Option 5 is not valid.

@martin-gaievski
Copy link
Member

Please try following steps:

@yuye-aws
Copy link
Member Author

Hi @martin-gaievski and @vibrantvarun ! I have double confirmed that 2.13 branch can fix the BWC test PR. First, here is the error log when directly running BWC tests. bwc_test_error_logs.txt Then, I follow your steps. I build a 2.13.0 jar file for neural search plugin and then replace it. Here is the successful log:

./gradlew ':qa:restart-upgrade:testAgainstNewCluster' --tests "org.opensearch.neuralsearch.bwc.TextChunkingProcessorIT.testTextChunkingProcessor_E2EFlow" -Dtests.bwc.version=2.13.0-SNAPSHOT
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.4
  OS Info               : Linux 5.10.218-186.862.amzn2int.x86_64 (amd64)
  JDK Version           : 17 (Amazon Corretto JDK)
  JAVA_HOME             : /usr/lib/jvm/java-17-amazon-corretto.x86_64
  Random Testing Seed   : D094611A8B47EBAB
  In FIPS 140 mode      : false
=======================================

> Task :generatePomFileForNebulaPublication
Maven publication 'nebula' pom metadata warnings (silence with 'suppressPomMetadataWarningsFor(variant)'):
  - Variant testFixturesApiElements:
      - Declares capability org.opensearch:neural-search-test-fixtures:3.0.0.0-SNAPSHOT which cannot be mapped to Maven
  - Variant testFixturesRuntimeElements:
      - Declares capability org.opensearch:neural-search-test-fixtures:3.0.0.0-SNAPSHOT which cannot be mapped to Maven
These issues indicate information that is lost in the published 'pom' metadata file, which may be an issue if the published library is consumed by an old Gradle version or Apache Maven.
The 'module' metadata file, which is used by Gradle 6+ is not affected.

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.4/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD SUCCESSFUL in 2m 37s
38 actionable tasks: 14 executed, 24 up-to-date

@yuye-aws
Copy link
Member Author

I have also tested with ./gradlew publishToMavenLocal and then run the bwc tests. You can confirm that the BWC test would get passed after publishing to .m2 repository.

@yuye-aws
Copy link
Member Author

Since 2.13.0-SNAPSHOT bwc tests is good, can we close this issue?

@yuye-aws
Copy link
Member Author

Closing this issue as it has been resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants