feat: Add JSON field extraction and enhanced URL validation #6051
Open
Cristhianzl wants to merge 28 commits into main from cz/url-improve
Commits
4ca1bf1 URL component improvement - JSON URL (Cristhianzl)
6006b07 [autofix.ci] apply automated fixes (autofix-ci[bot])
2b93e89 [autofix.ci] apply automated fixes (attempt 2/3) (autofix-ci[bot])
7b44f8b ♻️ (url.py): refactor URLComponent class to simplify data_dict creati… (Cristhianzl)
3d0e49f 📝 (url.py): import json module for JSON operations (Cristhianzl)
e9f4cc9 [autofix.ci] apply automated fixes (autofix-ci[bot])
fc37187 [autofix.ci] apply automated fixes (attempt 2/3) (autofix-ci[bot])
b307dff 📝 (url.py): improve formatting of info string for DropdownInput in UR… (Cristhianzl)
5df7bb6 [autofix.ci] apply automated fixes (autofix-ci[bot])
075a4b8 Merge branch 'main' into cz/url-improve (Cristhianzl)
e183959 ✨ (url.py): Add BoolInput and StrInput to support new features in URL… (Cristhianzl)
a28723b [autofix.ci] apply automated fixes (autofix-ci[bot])
ec94e39 [autofix.ci] apply automated fixes (attempt 2/3) (autofix-ci[bot])
6232a4d ♻️ (url.py): remove unnecessary comments and improve code readability… (Cristhianzl)
225670e Merge branch 'cz/url-improve' of https://github.com/langflow-ai/langf… (Cristhianzl)
39154fd [autofix.ci] apply automated fixes (autofix-ci[bot])
70be801 merge fix (Cristhianzl)
cda054e 📝 (url.py): improve readability by splitting long description and inf… (Cristhianzl)
060aa15 🔧 (Blog Writer.json, Custom Component Maker.json, Graph Vector Store … (Cristhianzl)
950fee3 [autofix.ci] apply automated fixes (autofix-ci[bot])
c3e5ed6 Merge branch 'main' into cz/url-improve (Cristhianzl)
da22270 merge fix (Cristhianzl)
d93b6f9 Merge branch 'cz/url-improve' of https://github.com/langflow-ai/langf… (Cristhianzl)
89bbc2c Merge branch 'main' into cz/url-improve (Cristhianzl)
c63a1d7 🐛 (url.py): fix validation of JSON content from URLs to ensure correc… (Cristhianzl)
9bddfc3 Merge branch 'cz/url-improve' of https://github.com/langflow-ai/langf… (Cristhianzl)
a942ef5 [autofix.ci] apply automated fixes (autofix-ci[bot])
a43d82d [autofix.ci] apply automated fixes (attempt 2/3) (autofix-ci[bot])
@@ -1,18 +1,23 @@
+import asyncio
+import json
 import re

+import aiohttp
 from langchain_community.document_loaders import AsyncHtmlLoader, WebBaseLoader

 from langflow.custom import Component
-from langflow.helpers.data import data_to_text
-from langflow.io import DropdownInput, MessageTextInput, Output
+from langflow.io import BoolInput, DropdownInput, MessageTextInput, Output, StrInput
 from langflow.schema import Data
 from langflow.schema.dataframe import DataFrame
 from langflow.schema.message import Message


 class URLComponent(Component):
     display_name = "URL"
-    description = "Load and retrive data from specified URLs."
+    description = (
+        "Load and retrieve data from specified URLs. Supports output in plain text, raw HTML, "
+        "or JSON, with options for cleaning and separating multiple outputs."
+    )
     icon = "layout-template"
     name = "URL"

@@ -28,69 +33,143 @@ class URLComponent(Component):
         DropdownInput(
             name="format",
             display_name="Output Format",
-            info="Output Format. Use 'Text' to extract the text from the HTML or 'Raw HTML' for the raw HTML content.",
-            options=["Text", "Raw HTML"],
+            info=(
+                "Output Format. Use 'Text' to extract text from the HTML, 'Raw HTML' for the raw HTML "
+                "content, or 'JSON' to extract JSON from the HTML."
+            ),
+            options=["Text", "Raw HTML", "JSON"],
             value="Text",
+            real_time_refresh=True,
         ),
+        StrInput(
+            name="separator",
+            display_name="Separator",
+            value="\n\n",
+            show=True,
+            info=(
+                "Specify the separator to use between multiple outputs. Default for Text is '\\n\\n'. "
+                "Default for Raw HTML is '\\n<!-- Separator -->\\n'."
+            ),
+        ),
+        BoolInput(
+            name="clean_extra_whitespace",
+            display_name="Clean Extra Whitespace",
+            value=True,
+            show=True,
+            info="Whether to clean excessive blank lines in the text output. Only applies to 'Text' format.",
+        ),
     ]

     outputs = [
         Output(display_name="Data", name="data", method="fetch_content"),
-        Output(display_name="Message", name="text", method="fetch_content_text"),
+        Output(display_name="Text", name="text", method="fetch_content_text"),
         Output(display_name="DataFrame", name="dataframe", method="as_dataframe"),
     ]

-    def ensure_url(self, string: str) -> str:
-        """Ensures the given string is a URL by adding 'http://' if it doesn't start with 'http://' or 'https://'.
-
-        Raises an error if the string is not a valid URL.
+    async def validate_json_content(self, url: str) -> bool:
+        """Validates if the URL content is actually JSON."""
+        try:
+            async with aiohttp.ClientSession() as session, session.get(url) as response:
+                http_ok = 200
+                if response.status != http_ok:
+                    return False
+
+                content = await response.text()
+                try:
+                    json.loads(content)
+                except json.JSONDecodeError:
+                    return False
+                else:
+                    return True
+        except (aiohttp.ClientError, asyncio.TimeoutError):
+            # Log specific error for debugging if needed
+            return False
+
+    def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None) -> dict:
+        """Dynamically update fields based on selected format."""
+        if field_name == "format":
+            is_text_mode = field_value == "Text"
+            is_json_mode = field_value == "JSON"
+            build_config["separator"]["value"] = "\n\n" if is_text_mode else "\n<!-- Separator -->\n"
+            build_config["clean_extra_whitespace"]["show"] = is_text_mode
+            build_config["separator"]["show"] = not is_json_mode
+        return build_config

-        Parameters:
-            string (str): The string to be checked and possibly modified.
-
-        Returns:
-            str: The modified string that is ensured to be a URL.
-
-        Raises:
-            ValueError: If the string is not a valid URL.
-        """
+    def ensure_url(self, string: str) -> str:
+        """Ensures the given string is a valid URL."""
         if not string.startswith(("http://", "https://")):
             string = "http://" + string

-        # Basic URL validation regex
         url_regex = re.compile(
-            r"^(https?:\/\/)?"  # optional protocol
-            r"(www\.)?"  # optional www
-            r"([a-zA-Z0-9.-]+)"  # domain
-            r"(\.[a-zA-Z]{2,})?"  # top-level domain
-            r"(:\d+)?"  # optional port
-            r"(\/[^\s]*)?$",  # optional path
+            r"^(https?:\/\/)?"
+            r"(www\.)?"
+            r"([a-zA-Z0-9.-]+)"
+            r"(\.[a-zA-Z]{2,})?"
+            r"(:\d+)?"
+            r"(\/[^\s]*)?$",
             re.IGNORECASE,
         )

+        error_msg = "Invalid URL - " + string
         if not url_regex.match(string):
-            msg = f"Invalid URL: {string}"
-            raise ValueError(msg)
+            raise ValueError(error_msg)

         return string

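As a quick way to see what the validation in `ensure_url` accepts, the sketch below copies the regex from the diff into a standalone function. The free function `ensure_url`, the helper `is_accepted`, and the sample inputs are illustrative assumptions and are not part of the PR.

```python
import re

# Same pattern as in the diff, compiled once at module level for the experiment.
URL_REGEX = re.compile(
    r"^(https?:\/\/)?"
    r"(www\.)?"
    r"([a-zA-Z0-9.-]+)"
    r"(\.[a-zA-Z]{2,})?"
    r"(:\d+)?"
    r"(\/[^\s]*)?$",
    re.IGNORECASE,
)


def ensure_url(string: str) -> str:
    """Standalone copy of the method's logic, for experimentation only."""
    if not string.startswith(("http://", "https://")):
        string = "http://" + string
    if not URL_REGEX.match(string):
        raise ValueError("Invalid URL - " + string)
    return string


def is_accepted(candidate: str) -> bool:
    """Return True if ensure_url accepts the candidate, False if it raises."""
    try:
        ensure_url(candidate)
    except ValueError:
        return False
    return True
```

Because the top-level-domain group is optional, single-label hosts such as `localhost:8000` pass, while inputs containing whitespace or characters outside `[a-zA-Z0-9.-]` in the host are rejected.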
     def fetch_content(self) -> list[Data]:
-        urls = [self.ensure_url(url.strip()) for url in self.urls if url.strip()]
+        """Fetch content based on selected format."""
+        urls = list({self.ensure_url(url.strip()) for url in self.urls if url.strip()})
+
+        no_urls_msg = "No valid URLs provided."
+        if not urls:
+            raise ValueError(no_urls_msg)
+
+        # If JSON format is selected, validate JSON content first
+        if self.format == "JSON":
+            for url in urls:
+                is_json = asyncio.run(self.validate_json_content(url))
+                if not is_json:
+                    error_msg = "Invalid JSON content from URL - " + url
+                    raise ValueError(error_msg)
+
         if self.format == "Raw HTML":
             loader = AsyncHtmlLoader(web_path=urls, encoding="utf-8")
         else:
             loader = WebBaseLoader(web_paths=urls, encoding="utf-8")
+
         docs = loader.load()
-        data = [Data(text=doc.page_content, **doc.metadata) for doc in docs]
-        self.status = data
-        return data
+
+        if self.format == "JSON":
+            data = []
+            for doc in docs:
+                try:
+                    json_content = json.loads(doc.page_content)
+                    data_dict = {"text": json.dumps(json_content, indent=2), **json_content, **doc.metadata}
+                    data.append(Data(**data_dict))
+                except json.JSONDecodeError as err:
+                    source = doc.metadata.get("source", "unknown URL")
+                    error_msg = "Invalid JSON content from " + source
+                    raise ValueError(error_msg) from err
+            return data
+
+        return [Data(text=doc.page_content, **doc.metadata) for doc in docs]

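One detail of the JSON branch in `fetch_content` may be worth a second look: `**json_content` only works when the top-level JSON value is an object. A top-level array or scalar parses fine with `json.loads` but raises `TypeError` when unpacked, and the `except json.JSONDecodeError` clause does not catch that. A minimal sketch of one possible guard follows; the helper name `build_data_dict` and the `items` key are assumptions for illustration and do not appear in the PR.

```python
import json
from typing import Any


def build_data_dict(page_content: str, metadata: dict[str, Any]) -> dict[str, Any]:
    """Build record fields from JSON text and loader metadata.

    Hypothetical helper: keeps the merge order used in the diff
    (text, then payload, then metadata) but only spreads the payload
    when it is a JSON object.
    """
    json_content = json.loads(page_content)  # raises json.JSONDecodeError on bad input
    data_dict: dict[str, Any] = {"text": json.dumps(json_content, indent=2)}
    if isinstance(json_content, dict):
        data_dict.update(json_content)  # later keys win, as with ** unpacking
    else:
        data_dict["items"] = json_content  # assumed key name for arrays and scalars
    data_dict.update(metadata)
    return data_dict
```

As in the diff, a payload that itself contains a `text` key overrides the pretty-printed `text` field, and metadata keys override both.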
     def fetch_content_text(self) -> Message:
+        """Fetch content and return as formatted text."""
         data = self.fetch_content()

-        result_string = data_to_text("{text}", data)
-        self.status = result_string
-        return Message(text=result_string)
+        if self.format == "JSON":
+            text_list = [item.text for item in data]
+            result = "\n".join(text_list)
+        else:
+            text_list = [item.text for item in data]
+            if self.format == "Text" and self.clean_extra_whitespace:
+                text_list = [re.sub(r"\n{3,}", "\n\n", text) for text in text_list]
+            result = self.separator.join(text_list)
+
+        self.status = result
+        return Message(text=result)

     def as_dataframe(self) -> DataFrame:
+        """Return fetched content as a DataFrame."""
         return DataFrame(self.fetch_content())
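
To make the text post-processing in `fetch_content_text` concrete, here is a small standalone sketch of the same two steps: collapsing runs of three or more newlines, then joining with the separator. The function name `join_texts` and its parameters are illustrative assumptions, not part of the PR.

```python
import re


def join_texts(texts: list[str], separator: str = "\n\n", clean_extra_whitespace: bool = True) -> str:
    """Collapse runs of three or more newlines, then join with the separator."""
    if clean_extra_whitespace:
        texts = [re.sub(r"\n{3,}", "\n\n", text) for text in texts]
    return separator.join(texts)
```

Note that the pattern only matches consecutive newline characters, so blank lines that contain spaces or tabs are left as they are.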