feat: add ParseComponent class and mark existing parsers as legacy #6111

Open
wants to merge 19 commits into
base: main
from

Conversation

Cristhianzl
Member

@Cristhianzl Cristhianzl commented Feb 4, 2025

This pull request introduces a new ParseComponent class that parses DataFrame and Data objects, and marks the existing parsing components as legacy. The main changes are the new ParseComponent class, with its inputs, outputs, and methods, and the legacy marking of ParseDataComponent and ParseDataFrameComponent.

New ParseComponent class:

  • src/backend/base/langflow/components/processing/parse.py: Added ParseComponent class to parse DataFrame or Data objects into plain text using a specified template. This class includes methods to update build configuration, clean arguments, and parse data into combined text or a new DataFrame.
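The template-and-separator behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in parse.py: the function name `parse_rows` is hypothetical, rows are modeled as plain dictionaries rather than Langflow DataFrame or Data objects, and the template is assumed to use `str.format`-style placeholders.

```python
# Hypothetical sketch: format each row with a template, then join with a separator.
# `parse_rows` is an illustrative name and is not part of Langflow.
def parse_rows(rows: list, template: str, sep: str = "\n") -> str:
    """Format every row dict with `template` and join the results with `sep`."""
    return sep.join(template.format(**row) for row in rows)


rows = [
    {"name": "Ada", "role": "engineer"},
    {"name": "Linus", "role": "maintainer"},
]
combined = parse_rows(rows, "{name}: {role}", sep="; ")
print(combined)
```

Each row is rendered independently, so the same per-row strings could also populate a one-column table of formatted text, which is how the PR description characterizes the component's second output.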

Marking existing components as legacy:


…s or Data objects into plain text using a specified template. This component allows users to choose between parsing a DataFrame or a single Data object, format the input using a template, and combine rows/items into a single text with a specified separator. It also provides outputs for the parsed text and a DataFrame with formatted text rows.

🔄 (parse_data.py): Add a legacy flag to the 'ParseDataComponent' to indicate it is a legacy component for converting Data objects into Messages using any field name from input data.

🔄 (parse_dataframe.py): Add a legacy flag to the 'ParseDataFrameComponent' to indicate it is a legacy component for converting DataFrames into text rows.
@Cristhianzl Cristhianzl self-assigned this Feb 4, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Feb 4, 2025
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 4, 2025
Comment on lines 94 to 103
input_type = self.input_type
if input_type == "DataFrame":
    if not isinstance(self.df, DataFrame):
        raise ValueError("Expected a valid DataFrame for input type 'DataFrame'.")
    return self.df, None, self.template, self.sep
if input_type == "Data":
    if not isinstance(self.data, Data):
        raise ValueError("Expected a valid Data object for input type 'Data'.")
    return None, self.data, self.template, self.sep
raise ValueError(f"Unsupported input type: {input_type}")
Contributor

Suggested change
input_type = self.input_type
if input_type == "DataFrame":
    if not isinstance(self.df, DataFrame):
        raise ValueError("Expected a valid DataFrame for input type 'DataFrame'.")
    return self.df, None, self.template, self.sep
if input_type == "Data":
    if not isinstance(self.data, Data):
        raise ValueError("Expected a valid Data object for input type 'Data'.")
    return None, self.data, self.template, self.sep
raise ValueError(f"Unsupported input type: {input_type}")

# Combine input type check and validation in one step for efficiency
input_type_map = {"DataFrame": (self.df, DataFrame), "Data": (self.data, Data)}
# Fetch the data and corresponding type for validation
data, expected_type = input_type_map.get(self.input_type, (None, None))
if data is None:
    raise ValueError(f"Unsupported input type: {self.input_type}")
if not isinstance(data, expected_type):
    raise ValueError(f"Expected a valid {expected_type.__name__} for input type '{self.input_type}'.")
return (
    data if self.input_type == "DataFrame" else None,
    data if self.input_type == "Data" else None,
    self.template,
    self.sep,
)
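One thing to note about the map-based version: when the input type is supported but the matching input is None, the `data is None` check reports "Unsupported input type" instead of the "Expected a valid ..." message the original raises. A possible variant that keeps the lookup table but tests membership first is sketched below. It is only a sketch: `clean_args` is a free function here rather than the component method, and `Data` and `DataFrame` are minimal stand-ins, not the Langflow classes.

```python
from dataclasses import dataclass, field


@dataclass
class Data:
    """Stand-in for langflow.schema.Data (assumption, for illustration only)."""
    data: dict = field(default_factory=dict)


class DataFrame(list):
    """Stand-in for langflow.schema.DataFrame (assumption, for illustration only)."""


def clean_args(input_type, df, data, template, sep):
    """Validate the selected input and return (df, data, template, sep)."""
    candidates = {"DataFrame": (df, DataFrame), "Data": (data, Data)}
    # Check the type name first so a missing value is not reported as an unsupported type.
    if input_type not in candidates:
        raise ValueError(f"Unsupported input type: {input_type}")
    value, expected_type = candidates[input_type]
    if not isinstance(value, expected_type):
        raise ValueError(
            f"Expected a valid {expected_type.__name__} for input type '{input_type}'."
        )
    return (
        value if input_type == "DataFrame" else None,
        value if input_type == "Data" else None,
        template,
        sep,
    )


result = clean_args("Data", None, Data({"a": 1}), "{a}", ",")
```

With this ordering, a None value falls through to the `isinstance` check and produces the same kind of error as the original if/else chain.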

Contributor

codeflash-ai bot commented Feb 4, 2025

⚡️ Codeflash found optimizations for this PR

📄 12% (0.12x) speedup for ParseComponent._clean_args in src/backend/base/langflow/components/processing/parse.py

⏱️ Runtime : 15.2 microseconds → 13.5 microseconds (best of 36 runs)

📝 Explanation and details

Explanation of Optimizations.

  1. Merged Input Type Check and Validation:

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 4 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | undefined |
🌀 Generated Regression Tests Details
# imports
import pandas as pd
import pytest  # used for our unit tests
from langflow.components.processing.parse import ParseComponent
from langflow.custom import Component
from langflow.schema import Data, DataFrame  # shadowed by the local class definitions below
from pydantic import BaseModel

# function to test


class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects.

    This class extends pandas.DataFrame to provide seamless integration between
    Langflow's Data objects and pandas' powerful data manipulation capabilities.

    Args:
        data: Input data in various formats:
            - List[Data]: List of Data objects
            - List[Dict]: List of dictionaries
            - Dict: Dictionary of arrays/lists
            - pandas.DataFrame: Existing DataFrame
            - Any format supported by pandas.DataFrame
        **kwargs: Additional arguments passed to pandas.DataFrame constructor
    """

    def __init__(self, data: list[dict] | list[Data] | pd.DataFrame | None = None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return

        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, dict | pd.DataFrame):
            kwargs["data"] = data

        super().__init__(**kwargs)

    def to_data_list(self) -> list[Data]:
        """Converts the DataFrame back to a list of Data objects."""
        list_of_dicts = self.to_dict(orient="records")
        return [Data(data=row) for row in list_of_dicts]

    def add_row(self, data: dict | Data) -> "DataFrame":
        """Adds a single row to the dataset.

        Args:
            data: Either a Data object or a dictionary to add as a new row

        Returns:
            DataFrame: A new DataFrame with the added row
        """
        if isinstance(data, Data):
            data = data.data
        new_df = self._constructor([data])
        return pd.concat([self, new_df], ignore_index=True)

    def add_rows(self, data: list[dict | Data]) -> "DataFrame":
        """Adds multiple rows to the dataset.

        Args:
            data: List of Data objects or dictionaries to add as new rows

        Returns:
            DataFrame: A new DataFrame with the added rows
        """
        processed_data = []
        for item in data:
            if isinstance(item, Data):
                processed_data.append(item.data)
            else:
                processed_data.append(item)
        new_df = self._constructor(processed_data)
        return pd.concat([self, new_df], ignore_index=True)

    @property
    def _constructor(self):
        def _c(*args, **kwargs):
            return DataFrame(*args, **kwargs).__finalize__(self)

        return _c

    def __bool__(self):
        """Truth value testing for the DataFrame.

        Returns True if the DataFrame has at least one row, False otherwise.
        """
        return not self.empty

class Data(BaseModel):
    """Represents a record with text and optional data.

    Attributes:
        data (dict, optional): Additional data associated with the record.
    """

    text_key: str = "text"
    data: dict = {}
    default_value: str | None = ""

    @classmethod
    def validate_data(cls, values):
        if not isinstance(values, dict):
            msg = "Data must be a dictionary"
            raise ValueError(msg)
        if not values.get("data"):
            values["data"] = {}
        for key in values:
            if key not in values["data"] and key not in {"text_key", "data", "default_value"}:
                values["data"][key] = values[key]
        return values

    def get_text(self):
        """Retrieves the text value from the data dictionary.

        If the text key is present in the data dictionary, the corresponding value is returned.
        Otherwise, the default value is returned.
        """
        return self.data.get(self.text_key, self.default_value)

    def set_text(self, text: str | None) -> str:
        """Sets the text value in the data dictionary."""
        new_text = "" if text is None else str(text)
        self.data[self.text_key] = new_text
        return new_text


# unit tests
class TestParseComponent:
    def setup_method(self):
        """Setup common attributes for the tests."""
        self.component = ParseComponent()
        self.component.template = "template"
        self.component.sep = ","

    

# imports
import pandas as pd
import pytest  # used for our unit tests
from langflow.components.processing.parse import ParseComponent
from langflow.custom import Component
from langflow.schema import Data, DataFrame  # shadowed by the local class definitions below


# function to test
class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects.

    This class extends pandas.DataFrame to provide seamless integration between
    Langflow's Data objects and pandas' powerful data manipulation capabilities.

    Args:
        data: Input data in various formats:
            - List[Data]: List of Data objects
            - List[Dict]: List of dictionaries
            - Dict: Dictionary of arrays/lists
            - pandas.DataFrame: Existing DataFrame
            - Any format supported by pandas.DataFrame
        **kwargs: Additional arguments passed to pandas.DataFrame constructor
    """

    def __init__(self, data=None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return

        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, (dict, pd.DataFrame)):
            kwargs["data"] = data

        super().__init__(**kwargs)

class Data:
    """Represents a record with text and optional data."""
    def __init__(self, data=None):
        self.data = data or {}


# unit tests
class TestCleanArgs:
    # Setup a mock ParseComponent with necessary attributes
    class MockParseComponent(ParseComponent):
        def __init__(self, input_type, df=None, data=None, template=None, sep=None):
            self.input_type = input_type
            self.df = df
            self.data = data
            self.template = template
            self.sep = sep

    # Valid Input Scenarios


… maintainability

📝 (parse.py): Update ParseComponent description for better clarity and understanding of functionality
📝 (parse.py): Update ParseComponent input descriptions for improved user guidance
📝 (parse.py): Update ParseComponent output descriptions for better indication of returned data
📝 (parse.py): Update ParseComponent method comments for clearer explanation of functionality
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 5, 2025

codspeed-hq bot commented Feb 5, 2025

CodSpeed Performance Report

Merging #6111 will not alter performance

Comparing cz/parse-data (550dc3c) with main (17f1ecf)

Summary

✅ 9 untouched benchmarks

…omponent class to improve code readability and maintainability.
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 6, 2025
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 10, 2025
Member

@italojohnny italojohnny left a comment

Hey @Cristhianzl

All Component PRs need tests. Could you, please, add them?

…s_mapping

✨ (test_parse_data_component.py): Add unit tests for ParseDataComponent class including basic setup, cleaning arguments, parsing data with custom template and separator, handling empty list, parsing data as list, nested fields, missing required fields, invalid template fields, and preserving original data after parsing.
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 12, 2025
…ify the code

♻️ (test_parse_data_component.py): refactor import statement to use DID_NOT_EXIST constant from base module for better readability and maintainability
…parison of data fields and templates in ParseDataComponent class
…t or None by converting to list and filtering out None values for consistency and improved error handling

📝 (parse.py): Update _format_dataframe_row method to handle case where template_to_parse is not provided by defaulting to JSON format
📝 (parse.py): Refactor _format_data_object method to handle None values in data list for improved error handling
📝 (test_parse_data_component.py): Remove unnecessary result assignment in test method to improve code readability and efficiency
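The two parse.py changes named in these commit messages (falling back to JSON when no template is given, and skipping None entries) might look roughly like the sketch below. It is an assumption-laden illustration rather than the merged code: `format_row` and `format_items` are hypothetical names, and rows are plain dictionaries rather than Langflow Data objects or DataFrame rows.

```python
import json
from typing import Optional


def format_row(row: dict, template: Optional[str] = None) -> str:
    """Render one row with the template, or as JSON when no template is provided."""
    if not template:
        # Assumed fallback: serialize the whole row as JSON.
        return json.dumps(row)
    return template.format(**row)


def format_items(items: list, template: Optional[str] = None, sep: str = "\n") -> str:
    """Drop None entries, format the remaining rows, and join them with `sep`."""
    present = [item for item in items if item is not None]
    return sep.join(format_row(item, template) for item in present)


with_template = format_items([{"a": 1}, None, {"a": 2}], "{a}", ",")
without_template = format_items([{"a": 1}, None])
```

Filtering before formatting keeps the template logic free of None checks, so a single missing item does not abort the whole parse.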
@Cristhianzl
Member Author

Hey @Cristhianzl

All Component PRs need tests. Could you, please, add them?

Done! Thanks!

Labels
enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.
3 participants