feat: add ParseComponent class and mark existing parsers as legacy #6111
base: main
…s or Data objects into plain text using a specified template. This component allows users to choose between parsing a DataFrame or a single Data object, format the input using a template, and combine rows/items into a single text with a specified separator. It also provides outputs for the parsed text and a DataFrame with formatted text rows.
🔄 (parse_data.py): Add a legacy flag to the 'ParseDataComponent' to indicate it is a legacy component for converting Data objects into Messages using any field name from input data.
🔄 (parse_dataframe.py): Add a legacy flag to the 'ParseDataFrameComponent' to indicate it is a legacy component for converting DataFrames into text rows.
input_type = self.input_type
if input_type == "DataFrame":
    if not isinstance(self.df, DataFrame):
        raise ValueError("Expected a valid DataFrame for input type 'DataFrame'.")
    return self.df, None, self.template, self.sep
if input_type == "Data":
    if not isinstance(self.data, Data):
        raise ValueError("Expected a valid Data object for input type 'Data'.")
    return None, self.data, self.template, self.sep
raise ValueError(f"Unsupported input type: {input_type}")
# Combine input type check and validation in one step for efficiency
input_type_map = {"DataFrame": (self.df, DataFrame), "Data": (self.data, Data)}
# Fetch the data and corresponding type for validation
data, expected_type = input_type_map.get(self.input_type, (None, None))
if data is None:
    raise ValueError(f"Unsupported input type: {self.input_type}")
if not isinstance(data, expected_type):
    raise ValueError(f"Expected a valid {expected_type.__name__} for input type '{self.input_type}'.")
return (
    data if self.input_type == "DataFrame" else None,
    data if self.input_type == "Data" else None,
    self.template,
    self.sep,
)
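One behavioral difference in the suggested lookup version is worth noting: when the selected input is `None`, it raises "Unsupported input type" instead of the original "Expected a valid ..." error. The following is a minimal sketch, not the PR's code, of a dictionary-dispatch variant that keeps the two error cases distinct. `FakeFrame`, `FakeData`, and the free function `clean_args` are hypothetical stand-ins so the snippet runs without Langflow installed.

```python
class FakeFrame:
    """Hypothetical stand-in for langflow.schema.DataFrame."""


class FakeData:
    """Hypothetical stand-in for langflow.schema.Data."""


def clean_args(input_type, df, data, template, sep):
    """Validate the selected input and return (df, data, template, sep)."""
    candidates = {"DataFrame": (df, FakeFrame), "Data": (data, FakeData)}
    # Unknown input types are rejected before looking at the values.
    if input_type not in candidates:
        raise ValueError(f"Unsupported input type: {input_type}")
    value, expected_type = candidates[input_type]
    # A missing (None) or wrongly typed value gets the "Expected a valid ..." error.
    if not isinstance(value, expected_type):
        raise ValueError(
            f"Expected a valid {expected_type.__name__} for input type '{input_type}'."
        )
    if input_type == "DataFrame":
        return value, None, template, sep
    return None, value, template, sep
```

With this shape, `clean_args("DataFrame", None, None, "t", ",")` reports an invalid value, while `clean_args("Other", None, None, "t", ",")` reports an unsupported type.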
⚡️ Codeflash found optimizations for this PR
📄 12% (0.12x) speedup for
Test | Status |
---|---|
⚙️ Existing Unit Tests | 🔘 None Found |
🌀 Generated Regression Tests | ✅ 4 Passed |
⏪ Replay Tests | 🔘 None Found |
🔎 Concolic Coverage Tests | 🔘 None Found |
📊 Tests Coverage | undefined |
🌀 Generated Regression Tests Details
import pandas as pd
# imports
import pytest # used for our unit tests
from langflow.components.processing.parse import ParseComponent
# function to test
from langflow.custom import Component
from langflow.schema import Data, DataFrame
from pydantic import BaseModel
class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects.

    This class extends pandas.DataFrame to provide seamless integration between
    Langflow's Data objects and pandas' powerful data manipulation capabilities.

    Args:
        data: Input data in various formats:
            - List[Data]: List of Data objects
            - List[Dict]: List of dictionaries
            - Dict: Dictionary of arrays/lists
            - pandas.DataFrame: Existing DataFrame
            - Any format supported by pandas.DataFrame
        **kwargs: Additional arguments passed to pandas.DataFrame constructor
    """

    def __init__(self, data: list[dict] | list[Data] | pd.DataFrame | None = None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return
        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, dict | pd.DataFrame):
            kwargs["data"] = data
        super().__init__(**kwargs)

    def to_data_list(self) -> list[Data]:
        """Converts the DataFrame back to a list of Data objects."""
        list_of_dicts = self.to_dict(orient="records")
        return [Data(data=row) for row in list_of_dicts]

    def add_row(self, data: dict | Data) -> "DataFrame":
        """Adds a single row to the dataset.

        Args:
            data: Either a Data object or a dictionary to add as a new row

        Returns:
            DataFrame: A new DataFrame with the added row
        """
        if isinstance(data, Data):
            data = data.data
        new_df = self._constructor([data])
        return pd.concat([self, new_df], ignore_index=True)

    def add_rows(self, data: list[dict | Data]) -> "DataFrame":
        """Adds multiple rows to the dataset.

        Args:
            data: List of Data objects or dictionaries to add as new rows

        Returns:
            DataFrame: A new DataFrame with the added rows
        """
        processed_data = []
        for item in data:
            if isinstance(item, Data):
                processed_data.append(item.data)
            else:
                processed_data.append(item)
        new_df = self._constructor(processed_data)
        return pd.concat([self, new_df], ignore_index=True)

    @property
    def _constructor(self):
        def _c(*args, **kwargs):
            return DataFrame(*args, **kwargs).__finalize__(self)

        return _c

    def __bool__(self):
        """Truth value testing for the DataFrame.

        Returns True if the DataFrame has at least one row, False otherwise.
        """
        return not self.empty
class Data(BaseModel):
    """Represents a record with text and optional data.

    Attributes:
        data (dict, optional): Additional data associated with the record.
    """

    text_key: str = "text"
    data: dict = {}
    default_value: str | None = ""

    @classmethod
    def validate_data(cls, values):
        if not isinstance(values, dict):
            msg = "Data must be a dictionary"
            raise ValueError(msg)
        if not values.get("data"):
            values["data"] = {}
        for key in values:
            if key not in values["data"] and key not in {"text_key", "data", "default_value"}:
                values["data"][key] = values[key]
        return values

    def get_text(self):
        """Retrieves the text value from the data dictionary.

        If the text key is present in the data dictionary, the corresponding value is returned.
        Otherwise, the default value is returned.
        """
        return self.data.get(self.text_key, self.default_value)

    def set_text(self, text: str | None) -> str:
        """Sets the text value in the data dictionary."""
        new_text = "" if text is None else str(text)
        self.data[self.text_key] = new_text
        return new_text
from langflow.components.processing.parse import ParseComponent
# unit tests
class TestParseComponent:
    def setup_method(self):
        """Setup common attributes for the tests."""
        self.component = ParseComponent()
        self.component.template = "template"
        self.component.sep = ","
import pandas as pd
# imports
import pytest # used for our unit tests
from langflow.components.processing.parse import ParseComponent
from langflow.custom import Component
from langflow.schema import Data, DataFrame
# function to test
class DataFrame(pd.DataFrame):
    """A pandas DataFrame subclass specialized for handling collections of Data objects.

    This class extends pandas.DataFrame to provide seamless integration between
    Langflow's Data objects and pandas' powerful data manipulation capabilities.

    Args:
        data: Input data in various formats:
            - List[Data]: List of Data objects
            - List[Dict]: List of dictionaries
            - Dict: Dictionary of arrays/lists
            - pandas.DataFrame: Existing DataFrame
            - Any format supported by pandas.DataFrame
        **kwargs: Additional arguments passed to pandas.DataFrame constructor
    """

    def __init__(self, data=None, **kwargs):
        if data is None:
            super().__init__(**kwargs)
            return
        if isinstance(data, list):
            if all(isinstance(x, Data) for x in data):
                data = [d.data for d in data if hasattr(d, "data")]
            elif not all(isinstance(x, dict) for x in data):
                msg = "List items must be either all Data objects or all dictionaries"
                raise ValueError(msg)
            kwargs["data"] = data
        elif isinstance(data, (dict, pd.DataFrame)):
            kwargs["data"] = data
        super().__init__(**kwargs)


class Data:
    """Represents a record with text and optional data."""

    def __init__(self, data=None):
        self.data = data or {}
from langflow.components.processing.parse import ParseComponent
# unit tests
class TestCleanArgs:
    # Setup a mock ParseComponent with necessary attributes
    class MockParseComponent(ParseComponent):
        def __init__(self, input_type, df=None, data=None, template=None, sep=None):
            self.input_type = input_type
            self.df = df
            self.data = data
            self.template = template
            self.sep = sep

    # Valid Input Scenarios
… maintainability
📝 (parse.py): Update ParseComponent description for better clarity and understanding of functionality
📝 (parse.py): Update ParseComponent input descriptions for improved user guidance
📝 (parse.py): Update ParseComponent output descriptions for better indication of returned data
📝 (parse.py): Update ParseComponent method comments for clearer explanation of functionality
CodSpeed Performance Report
Merging #6111 will not alter performance
…omponent class to improve code readability and maintainability.
Hey @Cristhianzl
All Component PRs need tests. Could you please add them?
…s_mapping
✨ (test_parse_data_component.py): Add unit tests for ParseDataComponent class including basic setup, cleaning arguments, parsing data with custom template and separator, handling empty list, parsing data as list, nested fields, missing required fields, invalid template fields, and preserving original data after parsing.
…ify the code
♻️ (test_parse_data_component.py): refactor import statement to use DID_NOT_EXIST constant from base module for better readability and maintainability
…parison of data fields and templates in ParseDataComponent class
…t or None by converting to list and filtering out None values for consistency and improved error handling
📝 (parse.py): Update _format_dataframe_row method to handle case where template_to_parse is not provided by defaulting to JSON format
📝 (parse.py): Refactor _format_data_object method to handle None values in data list for improved error handling
📝 (test_parse_data_component.py): Remove unnecessary result assignment in test method to improve code readability and efficiency
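The two behaviors named in these commit messages (converting a single item or `None` into a list with `None` entries filtered out, and falling back to JSON when no template is given) can be illustrated roughly as follows. This is a sketch under assumptions: `normalize_items` and `format_row` are hypothetical helper names, and the actual `_format_dataframe_row` and `_format_data_object` implementations are not shown in this conversation.

```python
import json
from typing import Any, Optional


def normalize_items(items: Any) -> list:
    """Wrap a single item in a list and drop None entries (hypothetical helper)."""
    if items is None:
        return []
    if not isinstance(items, list):
        items = [items]
    return [item for item in items if item is not None]


def format_row(row: dict, template: Optional[str] = None) -> str:
    """Format a row with the template, or fall back to JSON when none is given."""
    if not template:
        # default=str keeps non-JSON-serializable values (e.g. dates) printable.
        return json.dumps(row, default=str)
    return template.format(**row)
```

Normalizing first means the formatting step only ever sees a list of real items, which keeps the `None` handling in one place.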
Done! Thanks!
This pull request introduces a new `ParseComponent` class to handle parsing of `DataFrame` and `Data` objects, and marks existing parsing components as legacy. The most important changes include the addition of the `ParseComponent` class with its methods and inputs/outputs, and marking the `ParseDataComponent` and `ParseDataFrameComponent` as legacy.

New `ParseComponent` class:
- `src/backend/base/langflow/components/processing/parse.py`: Added `ParseComponent` class to parse `DataFrame` or `Data` objects into plain text using a specified template. This class includes methods to update build configuration, clean arguments, and parse data into combined text or a new `DataFrame`.

Marking existing components as legacy:
- `src/backend/base/langflow/components/processing/parse_data.py`: Marked `ParseDataComponent` class as legacy.
- `src/backend/base/langflow/components/processing/parse_dataframe.py`: Marked `ParseDataFrameComponent` class as legacy.
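
To make the template-and-separator behavior described above concrete, here is a minimal sketch of the general idea. It is not the component's code: `parse_rows` is a hypothetical function, and rows are plain dictionaries (the shape `DataFrame.to_dict(orient="records")` returns) so the snippet runs without pandas or Langflow.

```python
def parse_rows(rows: list, template: str, sep: str = "\n") -> str:
    """Format each row dict with the template and join the results with sep."""
    return sep.join(template.format(**row) for row in rows)


# Each dict stands in for one DataFrame row or one Data object's fields.
rows = [{"name": "Ada", "age": 36}, {"name": "Linus", "age": 54}]
combined = parse_rows(rows, "{name} is {age}", sep="; ")
print(combined)  # Ada is 36; Linus is 54
```

A single `Data` object corresponds to the one-row case, `parse_rows([fields], template)`, where the separator is never used.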