feat: add ParseComponent class and mark existing parsers as legacy #6111

Cristhianzl · 2025-02-04T12:41:32Z

This pull request introduces a new ParseComponent class that parses DataFrame and Data objects, and marks the existing parsing components as legacy. The main changes are the new ParseComponent class, with its methods, inputs, and outputs, and the legacy marking of ParseDataComponent and ParseDataFrameComponent.

New ParseComponent class:

src/backend/base/langflow/components/processing/parse.py: Added ParseComponent class to parse DataFrame or Data objects into plain text using a specified template. This class includes methods to update build configuration, clean arguments, and parse data into combined text or a new DataFrame.
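To illustrate the template-and-separator idea described above, here is a minimal sketch. It is not the code in parse.py: rows are modelled as plain dicts rather than a Langflow DataFrame, and the helper name `rows_to_text` is hypothetical.

```python
from typing import Dict, List


def rows_to_text(rows: List[Dict[str, object]], template: str, sep: str = "\n") -> str:
    """Format every row with `template`, then join the results with `sep`."""
    # Each row is a dict, so its keys can be used as {placeholders} in the template.
    lines = [template.format(**row) for row in rows]
    return sep.join(lines)


rows = [{"name": "Ada", "age": 36}, {"name": "Linus", "age": 55}]
combined = rows_to_text(rows, "{name} is {age}", sep="; ")
print(combined)  # Ada is 36; Linus is 55
```

A single Data-like record can go through the same helper as a one-element list, which is roughly the "choose DataFrame or Data" switch the description mentions.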

Marking existing components as legacy:

src/backend/base/langflow/components/processing/parse_data.py: Marked ParseDataComponent class as legacy.
src/backend/base/langflow/components/processing/parse_dataframe.py: Marked ParseDataFrameComponent class as legacy.
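As a rough picture of what a legacy flag buys, the sketch below uses a stand-in base class, not Langflow's real Component. The `legacy` attribute name follows the commit message; the display names and the `current_components` helper are assumptions made for illustration.

```python
class Component:
    """Stand-in base class (hypothetical); not Langflow's real Component."""

    display_name: str = ""
    legacy: bool = False  # components are current unless they opt in


class ParseComponent(Component):
    display_name = "Parse"  # assumed display name


class ParseDataComponent(Component):
    display_name = "Parse Data"  # assumed display name
    legacy = True


class ParseDataFrameComponent(Component):
    display_name = "Parse DataFrame"  # assumed display name
    legacy = True


def current_components(components):
    """Return only the components that are not flagged as legacy."""
    return [c for c in components if not c.legacy]


all_components = [ParseComponent, ParseDataComponent, ParseDataFrameComponent]
visible = [c.display_name for c in current_components(all_components)]
print(visible)  # ['Parse']
```

The legacy classes stay importable, so existing flows keep working while new flows are steered toward the replacement.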

…s or Data objects into plain text using a specified template. This component allows users to choose between parsing a DataFrame or a single Data object, format the input using a template, and combine rows/items into a single text with a specified separator. It also provides outputs for the parsed text and a DataFrame with formatted text rows.

🔄 (parse_data.py): Add a legacy flag to the 'ParseDataComponent' to indicate it is a legacy component for converting Data objects into Messages using any field name from input data.

🔄 (parse_dataframe.py): Add a legacy flag to the 'ParseDataFrameComponent' to indicate it is a legacy component for converting DataFrames into text rows.

codeflash-ai · 2025-02-04T12:55:00Z

src/backend/base/langflow/components/processing/parse.py

+        input_type = self.input_type
+        if input_type == "DataFrame":
+            if not isinstance(self.df, DataFrame):
+                raise ValueError("Expected a valid DataFrame for input type 'DataFrame'.")
+            return self.df, None, self.template, self.sep
+        if input_type == "Data":
+            if not isinstance(self.data, Data):
+                raise ValueError("Expected a valid Data object for input type 'Data'.")
+            return None, self.data, self.template, self.sep
+        raise ValueError(f"Unsupported input type: {input_type}")


Suggested change

-        input_type = self.input_type
-        if input_type == "DataFrame":
-            if not isinstance(self.df, DataFrame):
-                raise ValueError("Expected a valid DataFrame for input type 'DataFrame'.")
-            return self.df, None, self.template, self.sep
-        if input_type == "Data":
-            if not isinstance(self.data, Data):
-                raise ValueError("Expected a valid Data object for input type 'Data'.")
-            return None, self.data, self.template, self.sep
-        raise ValueError(f"Unsupported input type: {input_type}")
+        # Combine input type check and validation in one step for efficiency
+        input_type_map = {"DataFrame": (self.df, DataFrame), "Data": (self.data, Data)}
+        # Fetch the data and corresponding type for validation
+        data, expected_type = input_type_map.get(self.input_type, (None, None))
+        if data is None:
+            raise ValueError(f"Unsupported input type: {self.input_type}")
+        if not isinstance(data, expected_type):
+            raise ValueError(f"Expected a valid {expected_type.__name__} for input type '{self.input_type}'.")
+        return (
+            data if self.input_type == "DataFrame" else None,
+            data if self.input_type == "Data" else None,
+            self.template,
+            self.sep,
+        )
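One thing to check in the suggestion: a supported input type whose value is None (for example input_type "DataFrame" with self.df unset) takes the `data is None` branch and is reported as "Unsupported input type", whereas the original code reports "Expected a valid DataFrame ...". The sketch below shows one way to keep the lookup table while preserving the original messages. It is self-contained, so `DataFrame`, `Data`, `Inputs`, `INPUT_TABLE` and `clean_args` are stand-ins written for this example, not the Langflow classes or the real `_clean_args`.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


class DataFrame:
    """Stand-in for langflow.schema.DataFrame (hypothetical, for illustration)."""


class Data:
    """Stand-in for langflow.schema.Data (hypothetical, for illustration)."""


@dataclass
class Inputs:
    """Hypothetical holder for the attributes _clean_args reads from self."""

    input_type: str
    df: Any = None
    data: Any = None
    template: str = "{text}"
    sep: str = "\n"


# input type -> (attribute name, expected class, noun used in the error message)
INPUT_TABLE = {
    "DataFrame": ("df", DataFrame, "DataFrame"),
    "Data": ("data", Data, "Data object"),
}


def clean_args(inputs: Inputs) -> Tuple[Optional[DataFrame], Optional[Data], str, str]:
    # Decide "unsupported" from the key alone, never from the value.
    if inputs.input_type not in INPUT_TABLE:
        raise ValueError(f"Unsupported input type: {inputs.input_type}")
    attr, expected, noun = INPUT_TABLE[inputs.input_type]
    value = getattr(inputs, attr)
    # A missing (None) value fails this check and gets the "Expected a valid ..." message.
    if not isinstance(value, expected):
        raise ValueError(f"Expected a valid {noun} for input type '{inputs.input_type}'.")
    return (
        value if expected is DataFrame else None,
        value if expected is Data else None,
        inputs.template,
        inputs.sep,
    )


record = Data()
result = clean_args(Inputs("Data", data=record))

try:
    clean_args(Inputs("DataFrame"))  # supported type, but no DataFrame supplied
except ValueError as exc:
    message = str(exc)
```

Looking up the attribute name instead of the attribute value also avoids reading both self.df and self.data when only one of them is needed.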

codeflash-ai · 2025-02-04T12:55:03Z

⚡️ Codeflash found optimizations for this PR

📄 12% (0.12x) speedup for `ParseComponent._clean_args` in `src/backend/base/langflow/components/processing/parse.py`

⏱️ Runtime : 15.2 microseconds → 13.5 microseconds (best of 36 runs)
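The report does not state how the percentage is derived. One reading that is consistent with the two runtimes above is the old runtime divided by the new runtime, minus one; the arithmetic is sketched here, and the variable names are only illustrative.

```python
original_us = 15.2   # runtime before the change, in microseconds
optimized_us = 13.5  # runtime after the change, in microseconds

# Speedup relative to the optimized runtime (matches the reported ~12%).
speedup = original_us / optimized_us - 1

# Fraction of the original runtime that was saved (a different, smaller number).
saved_fraction = (original_us - optimized_us) / original_us

print(f"speedup: {speedup:.1%}")        # speedup: 12.6%
print(f"saved:   {saved_fraction:.1%}")  # saved:   11.2%
```

In absolute terms the difference is 1.7 microseconds per call.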

📝 Explanation and details

Explanation of Optimizations.

Merged Input Type Check and Validation:

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 4 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	undefined

🌀 Generated Regression Tests Details

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.components.processing.parse import ParseComponent
# function to test
from langflow.custom import Component
from langflow.schema import Data, DataFrame
from pydantic import BaseModel


class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects.

    This class extends pandas.DataFrame to provide seamless integration between
    Langflow's Data objects and pandas' powerful data manipulation capabilities.

    Args:
        data: Input data in various formats:
            - List[Data]: List of Data objects
            - List[Dict]: List of dictionaries
            - Dict: Dictionary of arrays/lists
            - pandas.DataFrame: Existing DataFrame
            - Any format supported by pandas.DataFrame
        **kwargs: Additional arguments passed to pandas.DataFrame constructor
    """

    def __init__(self, data: list[dict] | list[Data] | pd.DataFrame | None = None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return

        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, dict | pd.DataFrame):
            kwargs["data"] = data

        super().__init__(**kwargs)

    def to_data_list(self) -> list[Data]:
        """Converts the DataFrame back to a list of Data objects."""
        list_of_dicts = self.to_dict(orient="records")
        return [Data(data=row) for row in list_of_dicts]

    def add_row(self, data: dict | Data) -> "DataFrame":
        """Adds a single row to the dataset.

        Args:
            data: Either a Data object or a dictionary to add as a new row

        Returns:
            DataFrame: A new DataFrame with the added row
        """
        if isinstance(data, Data):
            data = data.data
        new_df = self._constructor([data])
        return pd.concat([self, new_df], ignore_index=True)

    def add_rows(self, data: list[dict | Data]) -> "DataFrame":
        """Adds multiple rows to the dataset.

        Args:
            data: List of Data objects or dictionaries to add as new rows

        Returns:
            DataFrame: A new DataFrame with the added rows
        """
        processed_data = []
        for item in data:
            if isinstance(item, Data):
                processed_data.append(item.data)
            else:
                processed_data.append(item)
        new_df = self._constructor(processed_data)
        return pd.concat([self, new_df], ignore_index=True)

    @property
    def _constructor(self):
        def _c(*args, **kwargs):
            return DataFrame(*args, **kwargs).__finalize__(self)

        return _c

    def __bool__(self):
        """Truth value testing for the DataFrame.

        Returns True if the DataFrame has at least one row, False otherwise.
        """
        return not self.empty

class Data(BaseModel):
    """Represents a record with text and optional data.

    Attributes:
        data (dict, optional): Additional data associated with the record.
    """

    text_key: str = "text"
    data: dict = {}
    default_value: str | None = ""

    @classmethod
    def validate_data(cls, values):
        if not isinstance(values, dict):
            msg = "Data must be a dictionary"
            raise ValueError(msg)
        if not values.get("data"):
            values["data"] = {}
        for key in values:
            if key not in values["data"] and key not in {"text_key", "data", "default_value"}:
                values["data"][key] = values[key]
        return values

    def get_text(self):
        """Retrieves the text value from the data dictionary.

        If the text key is present in the data dictionary, the corresponding value is returned.
        Otherwise, the default value is returned.
        """
        return self.data.get(self.text_key, self.default_value)

    def set_text(self, text: str | None) -> str:
        """Sets the text value in the data dictionary."""
        new_text = "" if text is None else str(text)
        self.data[self.text_key] = new_text
        return new_text


# unit tests
class TestParseComponent:
    def setup_method(self):
        """Setup common attributes for the tests."""
        self.component = ParseComponent()
        self.component.template = "template"
        self.component.sep = ","

    

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.components.processing.parse import ParseComponent
from langflow.custom import Component
from langflow.schema import Data, DataFrame


# function to test
class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects.

    This class extends pandas.DataFrame to provide seamless integration between
    Langflow's Data objects and pandas' powerful data manipulation capabilities.

    Args:
        data: Input data in various formats:
            - List[Data]: List of Data objects
            - List[Dict]: List of dictionaries
            - Dict: Dictionary of arrays/lists
            - pandas.DataFrame: Existing DataFrame
            - Any format supported by pandas.DataFrame
        **kwargs: Additional arguments passed to pandas.DataFrame constructor
    """

    def __init__(self, data=None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return

        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, (dict, pd.DataFrame)):
            kwargs["data"] = data

        super().__init__(**kwargs)

class Data:
    """Represents a record with text and optional data."""
    def __init__(self, data=None):
        self.data = data or {}


# unit tests
class TestCleanArgs:
    # Setup a mock ParseComponent with necessary attributes
    class MockParseComponent(ParseComponent):
        def __init__(self, input_type, df=None, data=None, template=None, sep=None):
            self.input_type = input_type
            self.df = df
            self.data = data
            self.template = template
            self.sep = sep

    # Valid Input Scenarios

… maintainability

📝 (parse.py): Update ParseComponent description for better clarity and understanding of functionality

📝 (parse.py): Update ParseComponent input descriptions for improved user guidance

📝 (parse.py): Update ParseComponent output descriptions for better indication of returned data

📝 (parse.py): Update ParseComponent method comments for clearer explanation of functionality

codspeed-hq · 2025-02-05T13:37:56Z

CodSpeed Performance Report

Merging #6111 will not alter performance

Comparing cz/parse-data (550dc3c) with main (17f1ecf)

Summary

✅ 9 untouched benchmarks

src/backend/base/langflow/components/processing/parse.py

…omponent class to improve code readability and maintainability.

italojohnny

Hey @Cristhianzl

All Component PRs need tests. Could you, please, add them?

…s_mapping

✨ (test_parse_data_component.py): Add unit tests for ParseDataComponent class including basic setup, cleaning arguments, parsing data with custom template and separator, handling empty list, parsing data as list, nested fields, missing required fields, invalid template fields, and preserving original data after parsing.

…ify the code

♻️ (test_parse_data_component.py): refactor import statement to use DID_NOT_EXIST constant from base module for better readability and maintainability

…parison of data fields and templates in ParseDataComponent class

…t or None by converting to list and filtering out None values for consistency and improved error handling

📝 (parse.py): Update _format_dataframe_row method to handle case where template_to_parse is not provided by defaulting to JSON format

📝 (parse.py): Refactor _format_data_object method to handle None values in data list for improved error handling

📝 (test_parse_data_component.py): Remove unnecessary result assignment in test method to improve code readability and efficiency
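The two behaviours named in these commit messages, normalizing the input to a list without None entries and falling back to JSON when no template is given, can be sketched as follows. The helper names `normalize_items` and `format_row` are hypothetical, and rows are plain dicts; the actual `_format_dataframe_row` and `_format_data_object` methods may differ.

```python
import json
from typing import Any, Dict, List, Optional


def normalize_items(data: Any) -> List[Any]:
    """Accept a single item, a list, or None and return a list without None entries."""
    if data is None:
        return []
    items = data if isinstance(data, list) else [data]
    return [item for item in items if item is not None]


def format_row(row: Dict[str, Any], template: Optional[str] = None) -> str:
    """Format one row with the template, or dump it as JSON when no template is given."""
    if not template:
        return json.dumps(row)
    return template.format(**row)


rows = normalize_items([{"text": "hello"}, None, {"text": "world"}])
with_template = [format_row(r, "Text: {text}") for r in rows]
without_template = [format_row(r) for r in rows]
print(with_template)     # ['Text: hello', 'Text: world']
print(without_template)  # ['{"text": "hello"}', '{"text": "world"}']
```

Normalizing first means the formatting code only ever sees a list of real items, so the None handling lives in one place.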

Cristhianzl · 2025-02-12T18:51:02Z

Hey @Cristhianzl

All Component PRs need tests. Could you, please, add them?

Done! Thanks!

Cristhianzl requested review from italojohnny, rodrigosnader and ogabrielluiz February 4, 2025 12:41

Cristhianzl self-assigned this Feb 4, 2025

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Feb 4, 2025

github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 4, 2025

[autofix.ci] apply automated fixes

5971ef6

[autofix.ci] apply automated fixes (attempt 2/3)

fdbb90d

codeflash-ai bot reviewed Feb 4, 2025

Cristhianzl added 2 commits February 5, 2025 10:21

merge fix

f315c76

[autofix.ci] apply automated fixes

031dc01

[autofix.ci] apply automated fixes (attempt 2/3)

3eff526

ogabrielluiz requested changes Feb 5, 2025

src/backend/base/langflow/components/processing/parse.py (resolved)

📝 (parse.py): Add type hints and docstrings for methods in the ParseComponent class to improve code readability and maintainability.

66233cd

merge fix

b2fe716

italojohnny requested changes Feb 12, 2025
