Skip to content
This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

feat: support metadata in Pinecone connector #87

Merged
merged 5 commits into from
Dec 28, 2023
Merged

Conversation

jvallesm
Copy link
Collaborator

@jvallesm jvallesm commented Dec 27, 2023

Because

  • RAG pipelines lack the ability to insert / query metadata from Pinecone

This PR

  • Adds a metadata input to the Pinecone UPSERT task (thx @donch1989)
  • Parses the user input and sends it to Pinecone
  • When the metadata flag is enabled in QUERY actions, the metadata field in a vector is shown in the output

🔨 QA

2 pipelines, one upserts a 3D vector with metadata and the other fetches the closest vector

CleanShot 2023-12-27 at 15 13 57
CleanShot 2023-12-27 at 15 15 23

> curl -X POST 'http://localhost:8080/vdp/v1beta/users/admin/pipelines/pinecone-query/trigger' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $API_TOKEN" \
--data-raw '{"inputs":[{"coordinates":{"vector":[1.234,1,0.45]}}]}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   236  100   182  100    54    451    133 --:--:-- --:--:-- --:--:--   585
{
  "outputs": [
    {
      "closest_match": [
        {
          "id": "b",
          "metadata": {
            "color": "magenta"
          },
          "score": 0.770666,
          "values": [
            2.234,
            9.2342,
            0.123345
          ]
        }
      ],
      "input": [
        1.234,
        1,
        0.45
      ]
    }
  ],
  "metadata": {
    "traces": {}
  }
}

🗒️ Notes

  • ⏭️ Next I'll implement the query by metadata.

  • I'm not the happiest with submitting this without coverage, let me know if it's a blocker for merging. At this point the functionality is more important than the coverage so I'll do ☝️first.

  • I didn't manage to use directly an object input in the start component as metadata. However, using {start.myObjectInput.metadata} worked just fine. I think it has more to do with references than with the Pinecone connector, I'll check if it's an identified bug and otherwise I'll create a ticket for it.

  • I could trigger the pipelines that have a JSON object as input from the canvas and the CLI but not from the pipeline view. Same CTA as the point above.

CleanShot 2023-12-27 at 15 09 07

@jvallesm jvallesm self-assigned this Dec 27, 2023
Copy link

linear bot commented Dec 27, 2023

@jvallesm jvallesm added enhancement New feature or request and removed instill vdp labels Dec 27, 2023
Copy link

codecov bot commented Dec 27, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3288445) 21.31% compared to head (8c0c2f9) 27.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
+ Coverage   21.31%   27.41%   +6.10%     
==========================================
  Files          11       12       +1     
  Lines         610      715     +105     
==========================================
+ Hits          130      196      +66     
- Misses        464      491      +27     
- Partials       16       28      +12     
Flag Coverage Δ
unittests 27.41% <100.00%> (+6.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jvallesm jvallesm marked this pull request as ready for review December 27, 2023 14:17
@jvallesm
Copy link
Collaborator Author

Added filter to query:

CleanShot 2023-12-27 at 18 30 52
CleanShot 2023-12-27 at 18 31 29
CleanShot 2023-12-27 at 18 37 51

@jvallesm
Copy link
Collaborator Author

Small extra fix (screenshots before and after):
CleanShot 2023-12-27 at 18 46 00
CleanShot 2023-12-27 at 18 54 18

@donch1989
Copy link
Member

donch1989 commented Dec 28, 2023

I'm not the happiest with submitting this without coverage, let me know if it's a blocker for merging. At this point the functionality is more important than the coverage so I'll do ☝️first.

Currently, this won't block the merging of pull requests, I'll check this branch later today. But I believe it's time for us to start addressing coverage tests.

At present, our testing framework for the connector is not complete. If you have the time, it will be great if you can assist in planning this testing framework. One of the key discussions revolves around how to handle vendor secrets. Currently, I have thought of roughly two methods:

  1. We need to inject vendor secrets into tests using environment variables. The advantage is that it provides a realistic test scenario. However, the drawback is that, due to the potentially extensive execution of tests and CI, this may result in additional testing costs.

  2. We create a mock vendor for each vendor. Most of the time, we rely on a mock server for testing, and we only execute real vendor tests in the release version of VDP. The advantage is that costs can be more controlled, but the downside is that we must maintain the mock server (although this issue might be resolved by directly downloading the vendor's OpenAPI schema and generating mock server from the schema).

@jvallesm
Copy link
Collaborator Author

I believe it's time for us to start addressing coverage tests.

I absolutely agree, we need at least a basic safeguard to prevent breaking changes, in this PR for instance I QA'd what I expected from the changes I introduced but not that any previous changes in the connector are still working. So I'd advocate for a minimal input-output unit test for each task we modify, at least that will cover marshalling and error handling.

I'd love to assist in the framework testing planning. Some thoughts:

  • Cost and time of tests using real vendors is always of concern. Especially being an open repository, we don't control how many times the CI is triggered and it doesn't even depend on the activity of the team. We may find ourselves blocking or blocked by external contributions due to e.g. reaching the quota limit.
  • This makes me lean towards faking our vendors and making requests against them. It takes time but I believe it will pay off, especially when more people contribute to the same codebase and the stability expectations increase. Creating mocks from OpenAPI schemas, if available, is a good idea and will save some time on this. Sometimes we'll want to create complex scenarios (e.g. in Pinecone, UPSERT with metadata and query with / by metadata) and we might want to implement a dummy server that stores values in memory and serves them later.
  • But I do agree is positive running a test suite against the real vendors. This can be a daily job or, as you proposed, something we trigger as part of the release process. The good thing is that this tests our code end-to-end, and also verifies our assumptions on vendors, it serves as a contract test.

Using both produces an error:

```
Component pinecone_0 failed to execute. non-200 status code: 400, while
calling URL: {url} response body: {"code":3,"message":"Cannot provide
both 'ID' and 'vector' at the same time","details":[]}
```
@jvallesm
Copy link
Collaborator Author

@donch1989 I added some unit tests to increase the coverage 27.41% <100.00%> (+6.10%)

Copy link
Member

@donch1989 donch1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@donch1989 donch1989 merged commit 3734773 into main Dec 28, 2023
8 checks passed
@donch1989 donch1989 deleted the jvalles/ins-2326 branch December 28, 2023 16:08
donch1989 pushed a commit that referenced this pull request Jan 2, 2024
🤖 I have created a release *beep* *boop*
---


##
[0.9.0-beta](v0.8.1-beta...v0.9.0-beta)
(2024-01-01)


### Features

* **airbyte:** wrap all Airbyte connectors into one
([#79](#79))
([30fe290](30fe290))
* **numbers:** migrate to Capture API
([#89](#89))
([e976854](e976854))
* support metadata in Pinecone connector
([#87](#87))
([3734773](3734773))


### Bug Fixes

* **instill:** fix wrong Airbyte image_name
([#91](#91))
([52e8409](52e8409))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request instill vdp
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants