[RFC] Integrate a Data Importer Plugin into Dashboards #9199

Open
huyaboo opened this issue Jan 17, 2025 · 3 comments
Assignees: huyaboo
Labels: enhancement (New feature or request), RFC (Substantial changes or new features that require community input to garner consensus.)

Comments

@huyaboo (Member) commented Jan 17, 2025

Overview

Currently in OpenSearch Dashboards (OSD), users can ingest documents into OpenSearch and visualize that data in Dashboards. However, there is no existing mechanism to easily import custom static data through the Dashboards UI. Sample data exists, but there is no way to import custom data. #1791 articulates the same issue. In short, there are several use cases for enabling data import through Dashboards:

  1. For users:
    1. They can quickly try out Dashboards with their data
    2. They do not need to write one-off scripts to ingest documents into OpenSearch, especially if they don't have the OpenSearch endpoint on hand or cannot install NPM packages such as OpenSearch-CSV
  2. For developers:
    1. Developers can quickly test OSD features with datasets that are more relevant to their use case
    2. Developers can write custom file parsers (elaborated later)

Requirements

This list is by no means exhaustive, but several basic capabilities must be part of this feature.

As a user:

  • I should be able to type my data in the desired format and ingest it into Dashboards
  • I should be able to upload a file of my choice in the desired format and ingest it into Dashboards
  • I should be able to select which index to ingest the data into
  • I should be able to select which datasource to ingest the data into

As a developer/admin:

  • I should have the ability to register my own custom file parsing logic
  • I should be able to configure exactly what file types are exposed to end users
  • I should be able to configure the file size limit and text character limit

Out of scope

  • A way to create the index and any mappings/settings necessary before import. This is out of scope because the user flow hasn't been finalized; the main purpose of this RFC is to establish a method of ingesting data via OSD directly. However, this plugin should be extensible enough that this requirement can be accommodated later.
  • A custom setup page/data analysis page. Like the bullet point above, this is better suited as a separate issue but should be considered.
  • Bulk ingesting multiple datasets/ingesting large datasets. This feature focuses on importing smaller static datasets. For larger datasets (in this context, datasets on the order of gigabytes), users are better served ingesting directly into OpenSearch, which is designed for that purpose.

Approach

To integrate this support, we split the approach into UI and server components.

UI

The OUI component library should contain the necessary components for the user to execute the actions specified in the requirements (a rough sketch of how they might compose follows the list):

  • OuiFilePicker: For uploading data
  • OuiCodeEditor/monaco: For inputting text
  • OuiSelect: For choosing file type and import type (file/text upload)
  • OuiSelectable: For choosing the index name (fetching can be done with the IndexPatternsService or, as a last resort, by exposing an API to query for index names)
  • DataSourceSelectable: This is exposed via the dataSourceManagement plugin and should handle the datasource fetching for us
  • OuiButton: For executing the import process
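
To make this concrete, below is a rough sketch of how these components might be wired together on the import page. This is not taken from the PoC; the imports, component props, hardcoded index name, and the call to the tentatively named _import_file route (described later) are illustrative assumptions.

import React, { useState } from 'react';
import { OuiButton, OuiFilePicker, OuiSelect, OuiSpacer } from '@opensearch-project/oui';
import { HttpStart } from 'src/core/public';

export const DataImporterApp = ({ http }: { http: HttpStart }) => {
  const [file, setFile] = useState<File | undefined>();
  const [fileType, setFileType] = useState('csv');
  const [loading, setLoading] = useState(false);

  const onImport = async () => {
    if (!file) return;
    setLoading(true);
    const body = new FormData();
    body.append('file', file);
    // Route name and query params are the tentative ones proposed later in this RFC;
    // indexName would come from OuiSelectable in the real UI
    await http.post('/api/data_importer_plugin/_import_file', {
      body,
      query: { fileType, indexName: 'my-index' },
      headers: { 'Content-Type': undefined }, // let the browser set the multipart boundary
    });
    setLoading(false);
  };

  return (
    <>
      <OuiSelect
        options={[
          { value: 'csv', text: 'CSV' },
          { value: 'ndjson', text: 'NDJSON' },
          { value: 'json', text: 'JSON' },
        ]}
        value={fileType}
        onChange={(e) => setFileType(e.target.value)}
      />
      <OuiSpacer size="m" />
      <OuiFilePicker onChange={(files) => setFile(files?.item(0) ?? undefined)} />
      <OuiSpacer size="m" />
      <OuiButton fill isLoading={loading} onClick={onImport}>
        Import
      </OuiButton>
    </>
  );
};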

Server

OSD core already has a similar mechanism for importing files via Saved Objects import.

On the client side:

/**
 * Import
 *
 * Does the initial import of a file, resolveImportErrors then handles errors and retries
 */
import = async () => {
  const { http, dataSourceEnabled } = this.props;
  const { file, importMode, selectedDataSourceId } = this.state;
  this.setState({ status: 'loading', error: undefined });
  // Import the file
  try {
    const response = await importFile(
      http,
      file!,
      importMode,
      selectedDataSourceId,
      dataSourceEnabled
    );
    this.setState(processImportResponse(response), () => {
      // Resolve import errors right away if there's no index patterns to match
      // This will ask about overwriting each object, etc
      if (this.state.unmatchedReferences?.length === 0) {
        this.resolveImportErrors();
      }
    });
  } catch (e) {
    this.setState({
      status: 'error',
      error: getErrorMessage(e),
    });
    return;
  }
};

On the server side:

router.post(
  {
    path: '/_import',
    options: {
      body: {
        maxBytes: maxImportPayloadBytes,
        output: 'stream',
        accepts: 'multipart/form-data',
      },
    },
    validate: {
      query: schema.object(
        {
          overwrite: schema.boolean({ defaultValue: false }),
          createNewCopies: schema.boolean({ defaultValue: false }),
          dataSourceId: schema.maybe(schema.string({ defaultValue: '' })),
          workspaces: schema.maybe(
            schema.oneOf([schema.string(), schema.arrayOf(schema.string())])
          ),
          dataSourceEnabled: schema.maybe(schema.boolean({ defaultValue: false })),
        },
        {
          validate: (object) => {
            if (object.overwrite && object.createNewCopies) {
              return 'cannot use [overwrite] with [createNewCopies]';
            }
          },
        }
      ),
      body: schema.object({
        file: schema.stream(),
      }),
    },
  },
  // ... route handler omitted for brevity
);

We can follow a similar approach and expose two routes (tentatively named):

/api/data_importer_plugin/_import_text
/api/data_importer_plugin/_import_file

As the names suggest, text input and file input are handled by separate routes. There are two reasons for the split (a rough sketch of the route registration follows this list):

  1. It makes the schemas smaller, easier to write, and easier to validate
  2. Some file types (Excel sheets, for instance) can only be uploaded and cannot be entered as text. The reverse may also exist, i.e. formats that can only be textual, though no examples come to mind. Either way, handling a file should be separate from handling text, especially since file uploads will typically involve larger datasets than typed input.
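
As a rough sketch, the two route registrations could look something like the following, assuming the core router and @osd/config-schema. The query parameters (fileType, indexName, dataSourceId), the configurable limits, and the import path are illustrative, not final.

import { schema } from '@osd/config-schema';
import { IRouter } from 'src/core/server'; // import path depends on where the plugin lives

// Illustrative registration of the two proposed routes; the handlers are sketched below
export function registerImportRoutes(
  router: IRouter,
  maxFileSizeBytes: number,
  maxTextLength: number
) {
  const commonQuery = {
    fileType: schema.string(),
    indexName: schema.string(),
    dataSourceId: schema.maybe(schema.string()),
  };

  router.post(
    {
      path: '/api/data_importer_plugin/_import_text',
      validate: {
        query: schema.object(commonQuery),
        body: schema.object({ text: schema.string({ maxLength: maxTextLength }) }),
      },
    },
    async (context, request, response) => {
      // validateText() + ingestText() (see the text flow below)
      return response.ok({ body: { success: true } });
    }
  );

  router.post(
    {
      path: '/api/data_importer_plugin/_import_file',
      options: {
        body: { maxBytes: maxFileSizeBytes, output: 'stream', accepts: 'multipart/form-data' },
      },
      validate: {
        query: schema.object(commonQuery),
        body: schema.object({ file: schema.stream() }),
      },
    },
    async (context, request, response) => {
      // ingestFile() (see the streaming sketch below)
      return response.ok({ body: { success: true } });
    }
  );
}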

For text input, the flow is as follows (sketched below):

  1. Call validateText() to ensure the data is well formed
  2. Call ingestText() to ingest the documents into OpenSearch
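
A minimal sketch of what the _import_text handler body could look like is shown below. The FileParserService lookup method, ValidationOptions, and IngestOptions shapes are hypothetical, since those details are left to the implementation:

import { OpenSearchClient } from 'src/core/server';
// FileParserService, IFileParser, and IngestResponse are the types proposed in this RFC

async function handleImportText(
  fileParsers: FileParserService,
  client: OpenSearchClient,
  fileType: string,
  indexName: string,
  text: string
): Promise<IngestResponse> {
  const parser = fileParsers.getFileParser(fileType); // hypothetical lookup method
  if (!parser || !parser.supportsText() || !parser.validateText || !parser.ingestText) {
    throw new Error(`File type "${fileType}" does not support text input`);
  }

  // 1. Ensure the data is well formed (throws if it isn't)
  await parser.validateText(text, {});

  // 2. Ingest the documents into OpenSearch
  return parser.ingestText(text, { client, indexName });
}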

For file input, the flow is as follows (sketched below):

  1. Call ingestFile() and use the stream to validate and ingest into OpenSearch in chunks

Because the underlying file stream may be arbitrarily large (we can add a config to cap the maximum size in bytes, but that limit is set by the user), we cannot hold the entire contents in memory; we must process the input as a stream. This means there is no separate pre-validation step, and it is possible that only some documents get ingested. How to handle these failed records is an implementation detail, but the response body can specify which documents succeeded and which failed to ingest.
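
As a concrete example of the chunked approach, here is a minimal sketch of stream-based NDJSON ingestion: documents are bulk-indexed in batches as the stream is read, and per-line failures are recorded and returned rather than aborting the whole import. The OpenSearchClient usage and the response shape are assumptions:

import { createInterface } from 'readline';
import { Readable } from 'stream';
import { OpenSearchClient } from 'src/core/server';

// Sketch: ingest an NDJSON stream in batches, tracking which lines failed
async function ingestNdjsonStream(
  file: Readable,
  client: OpenSearchClient,
  indexName: string,
  batchSize = 500
): Promise<{ totalLines: number; failedLines: number[] }> {
  const failedLines: number[] = [];
  let batch: Array<{ lineNo: number; doc: object }> = [];
  let lineNo = 0;

  const flush = async () => {
    if (batch.length === 0) return;
    const body = batch.flatMap(({ doc }) => [{ index: { _index: indexName } }, doc]);
    const { body: bulkResponse } = await client.bulk({ body });
    bulkResponse.items.forEach((item: any, i: number) => {
      if (item.index?.error) failedLines.push(batch[i].lineNo);
    });
    batch = [];
  };

  for await (const line of createInterface({ input: file })) {
    lineNo++;
    if (!line.trim()) continue;
    try {
      batch.push({ lineNo, doc: JSON.parse(line) });
    } catch {
      failedLines.push(lineNo); // malformed line: record it and keep going
    }
    if (batch.length >= batchSize) await flush();
  }
  await flush();

  return { totalLines: lineNo, failedLines };
}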

To accommodate the many possible file types, each file type needs a dedicated parser, called an IFileParser. The structure of this parser is as follows:

/**
 * Parser that handles a particular file type
 */
export interface IFileParser {
  /**
   * Can this file type support text input? If true, validateText() and ingestText() MUST be supplied
   * @returns
   */
  supportsText: () => boolean;

  /**
   * Given text input, validate that it is in the expected format; required if supportsText() returns true
   * @param text
   * @param options
   * @returns
   * @throws Can throw an error if text doesn't match expected format
   */
  validateText?: (text: string, options: ValidationOptions) => Promise<boolean>;

  /**
   * Assuming valid text input, handle the ingestion into OpenSearch; required if supportsText() returns true
   * @param text
   * @param options
   * @returns
   * @throws Can throw server errors when attempting to ingest into OpenSearch
   */
  ingestText?: (text: string, options: IngestOptions) => Promise<IngestResponse>;

  /**
   * Can this file type support file input? If true, ingestFile() MUST be supplied
   * @returns
   */
  supportsFile: () => boolean;

  /**
   * Given an arbitrary file stream, handle the validation and ingestion into OpenSearch; required if supportsFile() returns true
   * @param file
   * @param options
   * @returns
   * @throws Can throw server errors when attempting to ingest into OpenSearch
   */
  ingestFile?: (file: Readable, options: IngestOptions) => Promise<IngestResponse>;
}
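
For illustration, a minimal NDJSON parser implementing this interface might look like the following. The ValidationOptions, IngestOptions, and IngestResponse shapes (client, target index, failed rows) are assumptions, since the RFC leaves them as implementation details; ingestFile reuses the ingestNdjsonStream sketch from earlier:

import { Readable } from 'stream';

// Minimal sketch of an IFileParser for NDJSON; option/response shapes are assumed
export class NDJSONParser implements IFileParser {
  public supportsText = () => true;
  public supportsFile = () => true;

  public validateText = async (text: string, _options: ValidationOptions) => {
    // Throws if any non-empty line is not valid JSON
    text
      .split('\n')
      .filter((line) => line.trim().length > 0)
      .forEach((line) => JSON.parse(line));
    return true;
  };

  public ingestText = async (text: string, options: IngestOptions): Promise<IngestResponse> => {
    const documents = text
      .split('\n')
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line));
    const body = documents.flatMap((doc) => [{ index: { _index: options.indexName } }, doc]);
    const response = await options.client.bulk({ body });
    const failedRows = response.body.items
      .map((item: any, i: number) => (item.index?.error ? i + 1 : -1))
      .filter((row: number) => row !== -1);
    return { total: documents.length, failedRows };
  };

  public ingestFile = async (file: Readable, options: IngestOptions): Promise<IngestResponse> => {
    // Delegate to the stream-based batching approach sketched earlier
    const { totalLines, failedLines } = await ingestNdjsonStream(
      file,
      options.client,
      options.indexName
    );
    return { total: totalLines, failedRows: failedLines };
  };
}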

Registering custom file formats

By default, the DataImporterPlugin will supply three parsers: .ndjson, .csv, and .json. Parsers have to be registered in the FileParserService, which holds the IFileParsers provided by the DataImporterPlugin as well as by any other plugins. For the latter, DataImporterPluginSetup will expose a registerFileParser() function so other plugins can register a custom IFileParser (for example .xlsx, .geojson, .gltf, .biojson, etc.):

export interface DataImporterPluginSetup {
  /**
   * Register custom file type parsers to ingest into OpenSearch
   * @param fileType The file type to register a parser for (should NOT be csv, ndjson, or json filetypes)
   * @param fileParser
   * @throws errors if a filetype is already registered in this plugin or another plugin or if the IFileParser does not implement some methods
   */
  registerFileParser: (fileType: string, fileParser: IFileParser) => void;
}

export class DataImporterPlugin
  implements Plugin<DataImporterPluginSetup, DataImporterPluginStart> {
  private readonly fileParsers: FileParserService = new FileParserService();

  constructor(private readonly initializerContext: PluginInitializerContext) {}

  public async setup(
    core: CoreSetup,
    deps: DataImporterPluginSetupDeps
  ): Promise<DataImporterPluginSetup> {
    // Register default file parsers
    this.fileParsers.registerFileParser(CSV_FILE_TYPE, new CSVParser());
    this.fileParsers.registerFileParser(NDJSON_FILE_TYPE, new NDJSONParser());
    this.fileParsers.registerFileParser(JSON_FILE_TYPE, new JSONParser());

    // Handle other setup() logic

    return {
      // Exposed function for other plugins to consume
      registerFileParser: (fileType, fileParser) => {
        this.fileParsers.registerFileParser(fileType, fileParser);
      },
    };
  }
}
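
For completeness, a downstream plugin could consume this setup contract roughly as follows. The plugin and its GeoJSONParser are hypothetical, and such a plugin would also need to declare the data importer plugin as a required or optional dependency in its opensearch_dashboards.json:

import { CoreSetup, Plugin, PluginInitializerContext } from 'src/core/server';
import { DataImporterPluginSetup } from '../../data_importer_plugin/server';
import { GeoJSONParser } from './geojson_parser'; // hypothetical custom parser

interface GeoDataPluginSetupDeps {
  dataImporterPlugin: DataImporterPluginSetup;
}

// Sketch of another plugin registering a custom file type with the data importer
export class GeoDataPlugin implements Plugin<void, void> {
  constructor(private readonly initializerContext: PluginInitializerContext) {}

  public setup(core: CoreSetup, { dataImporterPlugin }: GeoDataPluginSetupDeps) {
    // Registers a .geojson parser; throws if the file type is already registered
    dataImporterPlugin.registerFileParser('geojson', new GeoJSONParser());
  }

  public start() {}
}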

POC

A PoC plugin, data-importer-plugin, is introduced to capture this vision. It doesn't implement everything stated in this RFC, but it provides the core feature set outlined in the requirements.

@huyaboo huyaboo added the enhancement New feature or request label Jan 17, 2025
@huyaboo huyaboo self-assigned this Jan 17, 2025
@huyaboo huyaboo added the RFC Substantial changes or new features that require community input to garner consensus. label Jan 17, 2025
@huyaboo huyaboo changed the title Data Importer Plugin RFC [RFC] Integrate a Data Importer Plugin into Dashboards Jan 17, 2025
@ashwin-pc (Member)

I love the idea and the high-level architecture. I do want us to think about the UX a bit more though. Let's embed this experience correctly within the UI for OSD, e.g. the Sample data page is a good place to add an entry point to this. I also don't like it being a separate page. @kgcreative @lauralexis thoughts?

@ruanyl (Member) commented Jan 17, 2025

I believe this is a wanted feature, at least from my perspective :) But I'm not sure if it's necessary to introduce a new plugin for it. Could data ingestion be integrated seamlessly with an existing workflow, for example within the index management plugin, so that you can create an index with data from static data files or select an existing index and then ingest data via files?

Or, as @ashwin-pc suggested, could this ingest experience be part of the existing sample data import page?

@huyaboo (Member, Author) commented Jan 17, 2025

I believe this is a wanted feature, at least from my perspective :) But I'm not sure if it's necessary to introduce a new plugin for it. Could data ingestion be integrated seamlessly with an existing workflow, for example within the index management plugin, so that you can create an index with data from static data files or select an existing index and then ingest data via files?

Or, as @ashwin-pc suggested, could this ingest experience be part of the existing sample data import page?

The plugin itself was more of a way to PoC changes without modifying core too much. I'm not opposed to porting this functionality to the existing sample data experience, especially the UI experience, but the benefit of a separate plugin is to easily enable/disable this feature and provide a separate platform for data import, especially if we consider adding more supported file types (structured/unstructured formats).

Also, thinking out loud, if we were to integrate this into the sample data experience, we should rename "Sample Data" to "Import Data". It's mostly semantics, but users would then have the option to import our "sample data" or their "real data".
