Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] We should use a ReportOOI entity per Report, or update a Report in-place and traverse the history API #3729

Open
4 of 6 tasks
Donnype opened this issue Oct 24, 2024 · 8 comments · May be fixed by #4047
Open
4 of 6 tasks
Assignees
Labels
epic octopoes Issues related to octopoes report
Milestone

Comments

@Donnype
Copy link
Contributor

Donnype commented Oct 24, 2024

TODO

First steps

Subsequent issues

Follow up issues:

Notes:

  • This also means that the report recipe can contain a query, meaning that the Report OOI will have a changing input_oois field. Hence it is not guaranteed in this case that a recipe will keep updating a single AssetReport.

To be discussed

Given some new idea/input on reporting, and especially given the fact that we want to re-use ReportOOIs as per the conclusion at the bottom of this ticket, some practical issues have arisen from the current data model. The parent-report vs. subreport dynamic is a bit strange in the new scenario because the parent-child relation if no longer one-to-many, but many-to-many. This is because I could collect the same dns-report on example.org in multiple parent reports. Also, would we have to re-use parent OOIs as well? This only makes sense if the parent points to (a list of) child reports.

Questions arise:

  • Should we refactor the ReportOOIs data model?
  • And should we somehow refactor so we could stop checking if we should create a parent ReportOOI or not? We could perhaps do that by making every Report have a parent, or always have a list of "child" reports that in some cases is just 1 report.
  • I suppose we should not reuse ReportRecipes?
  • Reports have a field called input_oois: list[str]. How do we re-use these in the case of multiple oois? Or should this be input_ooi: str now? Then we need to wrap a multi-report in a different way

Notes frontend meeting:

@Rieven:

  • Perhaps use a timeline instead of pagination perhaps in the case pagination is going to be restructured?
  • I worry about the use-case where we use a query and the assets we report on grow?
  • If we update AssetReports in-place and hence has new Bytes-data, aggregated report data is no longer in-sync: the data in the aggregate depends on all its children, that can change continuously.

@Donnype:
Let's collect the affected features/pages and other aspects to refine.


First suggestion, perhaps ignore the input_oois being part of the natural key since this will become very unreadable:

class BaseReport(OOI):
    name: str
    template: str | None = None
    date_generated: datetime

    organization_code: str
    organization_name: str
    organization_tags: list[str]
    data_raw_id: str

    observed_at: datetime
    report_recipe: Reference | None = ReferenceField("ReportRecipe", default=None)

    @classmethod
    def format_reference_human_readable(cls, reference: Reference) -> str:
        return f"Report {reference.tokenized.name}"


class Report(BaseReport):
    object_type: Literal["Report"] = "Report"
    report_type: Literal["concatenated-report", "aggregate-organisation-report", "multi-organization-report"]

    input_oois: list[str]

    _natural_key_attrs = ["report_recipe"]


class AssetReport(BaseReport):
    object_type: Literal["AssetReport"] = "AssetReport"
    report_type: str

    input_ooi: str

    _natural_key_attrs = ["input_ooi", "report_type"]

Per Report, we now create a ReportOOI:

class Report(OOI):
    object_type: Literal["Report"] = "Report"

    name: str
    report_type: str
    template: str | None = None
    date_generated: datetime

    input_oois: list[str]

    report_id: UUID

    organization_code: str
    organization_name: str
    organization_tags: list[str]
    data_raw_id: str

    observed_at: datetime
    parent_report: Reference | None = ReferenceField("Report", default=None)
    report_recipe: Reference | None = ReferenceField("ReportRecipe", default=None)
    has_parent: bool

# as a reference:
class ReportRecipe(OOI):
    object_type: Literal["ReportRecipe"] = "ReportRecipe"

    recipe_id: UUID

    report_name_format: str
    subreport_name_format: str | None = None

    input_recipe: dict[str, Any]  # can contain a query which maintains a live set of OOIs or manually picked OOIs.
    parent_report_type: str | None = None
    report_types: list[str]

    cron_expression: str

    _natural_key_attrs = ["recipe_id"]

This means that every run of a recipe, we add a new object to XTDB (and Bytes) that has its own history, which shows it as a seperate row in the current table overview.

Another option is to update a ReportOOI in-place, reducing the list but meaning we should query the history API more often. The question is if we should change the implementation to perform

Pros and Cons

Pros

  • We have a smaller reports overview page.
  • Seeing changes in a single over time is more natively supported through the history API.
  • We could re-use this per-entity history view logic for OOI detail pages.

Cons

(@Donnype: When I try to model for XTDB, I first ask myself if it makes sense to save an entity over time in a regular relational database, which in the case of a Report I have to answer with "Yes". Older versions of reports are too important to hide in a historic version.)

  • Any query involving looking for older versions of reports become more complicated and/or slow(er):
    • "Give me the reports generated between 10-10-2024 and 21-10-2024" implies fetching the history of every report.
  • You always need to provide timestamps or timestamped urls when sharing or talking about historic reports.
  • You cannot provide a proper per-generated-report overview (i.e. query across historic reports) unless you start filtering and querying in memory over the history API.

Conclusion after discussion on 25-10-2024

After a vote we concluded that we will re-use the Report OOI. We expect XTDB 2.0 to be able to resolve any use-cases that pop up and use the history API to traverse any other queries/use-cases. Both @dekkers and @underdarknl think using the history for new versions of a Report is a more adequate representation of the actual situation. (@Donnype thinks creating new objects is a more adequate representation that would also save us from potentially not being able to handle more intricate queries across reports.)

@Donnype Donnype added this to KAT Oct 24, 2024
@Donnype Donnype converted this from a draft issue Oct 24, 2024
@Donnype Donnype added octopoes Issues related to octopoes report labels Oct 24, 2024
@originalsouth
Copy link
Contributor

Give me the reports generated between 10-10-2024 and 21-10-2024" implies fetching the history of every report.

If I understand correctly, only if the result changes between the two valid_times which can be queried.

@dekkers dekkers self-assigned this Nov 11, 2024
@Donnype Donnype changed the title Should we use a ReportOOI entity per Report, or update a Report in-place and traverse the history API ~Should we~ *We should* use a ReportOOI entity per Report, or update a Report in-place and traverse the history API Nov 12, 2024
@Donnype Donnype changed the title ~Should we~ *We should* use a ReportOOI entity per Report, or update a Report in-place and traverse the history API We should use a ReportOOI entity per Report, or update a Report in-place and traverse the history API Nov 12, 2024
@Donnype Donnype moved this from Backlog / To do to In Progress in KAT Dec 2, 2024
@Donnype
Copy link
Contributor Author

Donnype commented Dec 2, 2024

Questions:

  • I suppose we can update the input_ooi field when reusing the report and see what were the inputs over time?
  • A report recipe does not have an organization code/name/etc. Is it safe to assume that since it can only live in 1 XTDB database that it refers to only 1 organization?
  • Are these the fields we should filter on to get the right Report to update? Writing this down I just remembered once drawing the conclusion that the "ReportRecipe -> Report" relation would be one-to-one. Looking at the code I should correct that and say that this only holds for "ReportRecipe -> ParentReport" when we have multiple reports. So the steps to update the reports in place would be:
    1. Determine if there are multiple reports by looking at the report_data
    2. If there are multiple, find the parent report that points to this ReportRecipe, else search for a single report that points to this recipe
    3. Find all subreports of this parent report if we have a parent
    4. Based on the report type and input ooi, try to find the subreports that already exist to update those in-place as well
  • Or should we not update subreports in-place? It might be strange that the amount of updates vs. creates happening during reporting can depend on the input oois that result from the input query.

@underdarknl
Copy link
Contributor

underdarknl commented Dec 2, 2024

Im looking at this in a different Light I think:

An aggregate report is the result of the underlying reports being combined. We currently have the relation the wrong way round I think..
A recipe holds a list of Input OOI's, or a query (resulting in a live list of input OOI's possibly changing over time).
The reporting job derived from the recipe's schedule on a given moment in time produces a set of Asset-reports (input-ooi * report-type). those reports should get a reference to the recipe that was used to create them, and a reference to the job. (no news there I suppose?).
The asset-reports should get a (I think) deterministic OOI-ID, which encodes the input-ooi, the used report type, and any settings that change the actual data being stored (settings that only apply to the rendered version, but not the json dont need to be part of the hash/ooi-ID. [1]
The aggregate report should now hold the underlying report-ooi's OOI-ID's and their valid-time at the time of combination as its input-OOI list. It could also hold a reference to the recipe or job that triggered the aggregation to run.
If a given OOI is no longer available for the asset-reporting, we could include the most recent report or decide to not include the Asset-report anymore. The underlying report might have been generated yesterday, or might have been manually uploaded by a user, in any case we do or don't include it in the input-ooi's for the given agregate-report.

1: This means, multiple recipe's and reporting jobs might write into the same report-OOI at different intervals. This would mean we are making more 'snapshots' of the data at those points.

@Donnype Donnype moved this from In Progress to To be discussed in KAT Dec 9, 2024
@Donnype Donnype moved this from To be discussed to In Progress in KAT Dec 16, 2024
@Rieven
Copy link
Contributor

Rieven commented Dec 18, 2024

  • We have here a scenario where we select 2 OOIs and want to make 2 reports.
  • We want an Aggregate Report
  • So we have in total 4 sub/child reports. We create a Report Recipe with:

Parent report: Aggregate Organisation Report
assets: mispo.es, minvws.nl
subreports: DNS Report, Findings Reports

flowchart TD
    A[Report Recipe]
    B[Scheduler]
    C[Octopoes]
    D[Rocky Worker]
    E[Bytes holds Report Data]
    F(DNS Report for mispo.es)
    G(Findings Report for mispo.es)
    H(DNS Report for minvws.nl)
    I(Findings Report for minvws.nl)
    J[Aggregate Data]
    K[Aggregate Report]
    A -- create recipe OOI --> C
    A -- schedule --> B
    B --> D
    D --> C
    C -- Report OOI --> F
    C -- Report OOI --> G
    C -- Report OOI --> H
    C -- Report OOI --> I
    F --> E
    G --> E
    H --> E
    I --> E
    E --> J
    J --> K
Loading
  • What I see here is that an Aggregate Report is basically a way of formatting the data, we aggregate the data, make summaries.
  • So it make sense that an Aggregate Report holds a list of ReportData or we have to fetch for each report OOI the bytes data

We can maybe use ReportData which is already an OOI. ReportData is the evidence and what we are comparing.

class ReportData(OOI):
    object_type: Literal["ReportData"] = "ReportData"
    organization_code: str
    organization_name: str
    organization_tags: list[str]
    data: dict

    _natural_key_attrs = ["organization_code"]

    @classmethod
    def format_reference_human_readable(cls, reference: Reference) -> str:
        return f"Report data of organization {reference.tokenized.organization_code}"

What really is updating is the data it generates, which means that we can also have an edited Report data. We can compare ReportData's or we can aggregate report datas, or we can show them separately in the case of an Concat report.

report data matches template is what we see in a report.

So what if we keep track of the data and also keep updating the OOIs

I would remove an aggregate OOI so parent reports does not really exists, it is a way of displaying the underlying reports.

@madelondohmen
Copy link
Contributor

I would remove an aggregate OOI so parent reports does not really exists, it is a way of displaying the underlying reports.

@Rieven If we do this, we still need a field in the ReportRecipe to know what kind of report should be created.

@Rieven
Copy link
Contributor

Rieven commented Dec 18, 2024

I would remove an aggregate OOI so parent reports does not really exists, it is a way of displaying the underlying reports.

@Rieven If we do this, we still need a field in the ReportRecipe to know what kind of report should be created.

We already have those fields, we have the report types and the input oois or live set.

@madelondohmen
Copy link
Contributor

We already have those fields, we have the report types and the input oois or live set.

@Rieven
Currently we use the field parent_report_type in the ReportRecipe to tell if it's an aggregate report or not. The fields report_types and input_recipe (with input_oois/query) just tells if it's a single report or a concatinated report. So what I wanted to say is that even if we remove the parent Report OOI, we should make sure not to remove the parent_report_type in the ReportRecipe. Otherwise we would never know if it's an aggregate report or not.

Besides that, I think the problem with removing the parent Report OOI entirely would be opening the aggregate report (on the history page and scheduled reports page). Where would you link to when clicking on the generated report?

@underdarknl
Copy link
Contributor

underdarknl commented Dec 30, 2024

Possible future improvement. Store the various report-export-formats in bytes at runtime of the job instead of storing only the json. This would avoid concatting / producing these rich output formats at runtime once the user clicks on the html/pdf links.
Eg:

class BaseReport(OOI):
    data_raw_id: str
    data_raw_html_id: str | None = None
    data_raw_pdf_id: str | None = None
    data_raw_html_concat_id: str | None = None
    data_raw_html_concat_zip_id: str | None = None
    data_raw_pdf_concat_id: str | None = None
    data_raw_pdf_concat_zip: str | None = None

@Donnype Donnype added the epic label Jan 7, 2025
@Donnype Donnype changed the title We should use a ReportOOI entity per Report, or update a Report in-place and traverse the history API [EPIC] We should use a ReportOOI entity per Report, or update a Report in-place and traverse the history API Jan 9, 2025
@Donnype Donnype moved this from In Progress to Backlog / To do in KAT Jan 9, 2025
@madelondohmen madelondohmen moved this from Backlog / To do to In Progress in KAT Jan 23, 2025
@madelondohmen madelondohmen moved this from In Progress to Backlog / To do in KAT Jan 23, 2025
@madelondohmen madelondohmen moved this from Backlog / To do to Review in KAT Jan 27, 2025
@underdarknl underdarknl added this to the OpenKAT v1.18 milestone Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic octopoes Issues related to octopoes report
Projects
Status: Review
Development

Successfully merging a pull request may close this issue.

7 participants