new field set: for log documents themselves #2258

Open
rsk0 opened this issue Aug 20, 2023 · 0 comments
Labels
enhancement New feature or request

Summary

Create a new field set for information about log documents themselves.

Motivation:

I've identified several valuable pieces of information that pertain to log documents, but there is no field set for such things.

(The pieces of information include two already captured in ECS but in the wrong set: 1, 2.)

Detailed Design:

(To maybe help reduce confusion, please note that I variously use the terms "log document", "document", and "record" to mean the same thing. Other possible terms are discussed in the "naming the field set" section.)

In issue 700 I saw a proposal for a field event.age to capture the amount of time it takes a log message to travel through a logs pipeline into the data store. This and other things got me thinking about the idea of a field set where one could store information about log documents themselves. I also thought about how log documents are not actual events but we often use the same word, "event", to mean both. Issue 1059 discusses this, but here's some recap:

event field set: for events, not records of events

The event field set is meant for "context information about the log or metric event itself" and doesn't seem to be about records of events. event.duration, for example, talks about how long "the thing that occurred" lasted. However, there are a couple of fields in there that refer not to "the thing that occurred" but to "the record of the thing that occurred."

information about log documents

document creation and ingestion

In the event field set there's a field, event.created, which is not about the original event; it's about a record of it, specifically when the record got into the pipeline. (Again, see issue 1059 for further discussion.)

I think you could say that records of events also belong in the same field set as events themselves (event.*), but how would you know which field is talking about which phenomenon? You couldn't tell by name alone; you'd have to do the extra work of reading the reference doc. You could easily think event.created was talking about the event. (It's not; it's talking about the record of the event.) It's probably better to have distinct field sets to avoid this confusion.

If ECS continues to have distinct fields for when a document gets into the pipeline ("created") and when it arrives in the data store ("ingested"), then those could also be in this new field set meant for documents. (I say "if" here because I think there might be a chance that some generalized scheme for capturing pipeline transit information could replace these fields.)

pipeline transit more generally

There are other suggestions, as in issue 453, around pipeline transit information (rolled up into issue 940), which, from one perspective, is about the log document's movement through a pipeline.

document aggregation

Another datum that would be good to capture, and that I think might also belong in a document-oriented field set, is "count of messages". (See this forum discussion, "ECS field for pre-aggregated messages".) I say "might" because the aggregation count could refer to two different things: either a count of actual events or a count of log documents (that happen to represent events). It's a fine distinction, but it makes a difference. (Schema development is hard.) If this aggregation count is about log documents (not the count of events), then it's related to the transit time discussed above, since the transit time is about the log documents themselves travelling to the data store.

Document/event aggregation counts deserve their own discussion; I think there's a lot to consider in whether the count should be meant to refer to records or to actual events (or either/both?). The aforelinked forum thread is closed due to timeout; using a GitHub issue would allow more discussion time.

document size

Some other document information that could be useful: Size of the document, at least in bytes, maybe also in field count, maybe other dimensions I haven't considered. (Measuring field counts would help us monitor for abusive data sources eating up too many fields.)
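As a rough illustration of the two size dimensions mentioned above, here is a minimal Python sketch. The function names and the field_count key are hypothetical (they are not part of ECS), and measuring the compact JSON serialization is just one assumed way to define "bytes"; a real pipeline might measure the raw message or the stored representation instead.

```python
import json


def count_leaf_fields(value):
    """Count leaf fields, descending into nested objects (dicts)."""
    if isinstance(value, dict):
        return sum(count_leaf_fields(v) for v in value.values())
    # Scalars and arrays are each counted as a single field here (an assumption).
    return 1


def document_size(doc):
    """Return the size of a document in bytes and in leaf field count."""
    # Assumption: "bytes" means the UTF-8 length of the compact JSON form.
    serialized = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    return {
        "bytes": len(serialized.encode("utf-8")),
        "field_count": count_leaf_fields(doc),  # hypothetical field name
    }


doc = {
    "message": "Executed an unexpected process",
    "event": {"duration": 810372, "reason": "Executed an unexpected process"},
}
print(document_size(doc))
```

A field count like this could feed an alert when a data source starts emitting documents with an unusually large number of fields.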

field set about log documents

So, with all these data, maybe we need a field set about log documents.

E.g.:

{
  "log_document_field_set_with_a_meaningful_but_simple_and_catchy_name": {
    "count": 1234,
    "bytes": 2047,
    "delivery_duration_ms": 1000,
    "created": 1692308448000,
    "ingested": 1692308449000
  }
}

  • count - This record/message/log document represents an aggregation of 1234 original messages.
  • bytes - This record/message/log document is 2047 bytes in size.
  • delivery_duration_ms - This record/message/log document took 1000 milliseconds from the moment it was created to arrival in the data store.
  • created - This record/message/log document was first seen by our logs infrastructure at this time.
  • ingested - This record/message/log document arrived in our logs data store at this time.
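
To make the relationship between these fields concrete, here is a minimal Python sketch of how a pipeline stage might assemble the field set. The function name document_metadata and its parameters are hypothetical, and it assumes created and ingested are epoch timestamps in milliseconds, as in the example above.

```python
def document_metadata(created_ms, ingested_ms, size_bytes, count=1):
    """Assemble the proposed document-level fields (hypothetical helper).

    created_ms / ingested_ms: epoch milliseconds (assumed, as in the example).
    size_bytes: size of the document in bytes.
    count: number of original messages this document aggregates.
    """
    return {
        "count": count,
        "bytes": size_bytes,
        # Derived field: time from entering the pipeline to arriving in the
        # data store. A negative value would suggest clock skew between hosts.
        "delivery_duration_ms": ingested_ms - created_ms,
        "created": created_ms,
        "ingested": ingested_ms,
    }


meta = document_metadata(1692308448000, 1692308449000, 2047, count=1234)
print(meta["delivery_duration_ms"])  # 1000
```

Since delivery_duration_ms is derivable from created and ingested, storing it is a convenience for querying rather than new information.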

You can think of this as a "meta" field set in the sense that it's information about the document that contains the information about the event.

naming the field set

Candidates for a term to refer to a field set about log documents:

  • message - often used for this purpose (referring to a log document), but conflicts with base field so is a non-starter
  • log - rarely but sometimes used for this purpose, but conflicts with an existing field set (about log document generators/processors/handlers rather than log documents themselves) so is a non-starter
  • document - often used for this purpose, easily comprehensible/meaningful, slight potential for conflict with documents involved in events
  • log_document - aesthetically distasteful, awkward and exceptional use of multi-word term for field set name [edit: data_stream is multi-word]; crucially, ECS is not just for logs
  • event - often used in discussion to refer to log documents, but problematically conflicts with the term meaning actual events and of course with the existing event field set
  • record - not often used for this purpose (not attested in ECS docs), maybe adequately comprehensible/meaningful, very slight potential conflict with any "record" information involved in events
  • event_record - no one uses this term and it's a bit ugly (being a multi-word term), but it's very clear, though it doesn't exactly express "we're talking about the document you're looking at"
  • event_document - same issues as event_record; feels like the best fit for meaning and clarity
  • _document - nicely distinct with a styling that possibly connotes its special case (being for this particular meta purpose), but probably conflicts/collides with the Elasticsearch convention of styling metadata fields (also has a faint whiff of relationship to doc typing)
  • meta - somewhat semantically aligned but an ambiguous and possibly overly specific use of a field set concept that could be for other purposes in the future; also liable to conflict with Facebook's renaming -- I would say that that's Meta's fault/problem, though, for choosing a company name that's a generic concept. (If you called your company "Proxy" I would not sacrifice a proxy.* tree to provide you space.)

One of the problems in choosing (field set) naming is that there isn't a common colloquial term people use to refer to log documents that works well. Most often folks say "event", but that's very problematic given that it refers to the actual event.

I lean towards document, record, and event_document.

alternatively, a subtree in event

Maybe instead of a top-level field set we can add an event.record or event.document subtree:

{
    "event": {
        "duration": 810372,
        "reason": "Executed an unexpected process",
        "record": {
            "count": 1234,
            "bytes": 2047,
            "created": 1692308448000,
            "ingested": 1692308449000,
            "delivery_duration_ms": 1000
        }
    }
}

If nested under the event set, I prefer record.
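
For illustration only, here is a small Python sketch of what relocating the existing event.created and event.ingested values into an event.record subtree could look like. The function name nest_record_fields is hypothetical, and event.record is the subtree proposed here, not an existing ECS field set.

```python
import copy


def nest_record_fields(doc, keys=("created", "ingested")):
    """Return a copy of doc with the given event.* keys moved to event.record.*.

    Hypothetical helper; "record" is the subtree name proposed in this issue.
    """
    result = copy.deepcopy(doc)  # leave the caller's document untouched
    event = result.setdefault("event", {})
    record = event.setdefault("record", {})
    for key in keys:
        if key in event:
            record[key] = event.pop(key)
    return result


original = {
    "event": {
        "duration": 810372,
        "created": 1692308448000,
        "ingested": 1692308449000,
    }
}
print(nest_record_fields(original))
```

Fields that describe the event itself, such as event.duration, stay where they are; only the fields about the record move.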

Please pardon the long post. Thanks for your review.

EOF
