[RFC] Asset Integrations & Entity Store RFC - Stage 2 #2233
Initial touch
Including examples of source documents and mappings
User fieldset
Adding level keys
I am removing the 'phone' fields from the proposal to reduce the risk of PII exposure.
Left some initial comments on a first review. As a general comment, I suggest reviewing what can be used in the existing schema to avoid adding overlapping fields as much as possible.
For example, is `user.profile.organization` necessary when we have `organization.name` and `organization.id` fields? Reusing existing fields avoids adding more fields, and it also allows users to query across potentially other data sources that also populate the `organization.*` fields.
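A minimal Python sketch of the reuse argument, under stated assumptions: the source record keys (`login`, `org_id`, `org_name`) and the sample documents are hypothetical, and only `organization.id` and `organization.name` come from the existing schema. It shows that one filter on `organization.name` matches documents from unrelated sources once they share the same fields.

```python
from typing import Any


def map_directory_user(record: dict[str, Any]) -> dict[str, Any]:
    """Map a hypothetical directory record to an ECS-style document.

    The organization is written to the existing organization.* fields
    instead of a new user.profile.organization field.
    """
    return {
        "user": {"name": record["login"]},
        "organization": {"id": record["org_id"], "name": record["org_name"]},
    }


def in_organization(docs: list[dict[str, Any]], org_name: str) -> list[dict[str, Any]]:
    """Return every document whose organization.name equals org_name."""
    return [d for d in docs if d.get("organization", {}).get("name") == org_name]


docs = [
    map_directory_user({"login": "alice", "org_id": "org-1", "org_name": "Example Corp"}),
    # Hypothetical document from another source that also populates organization.*
    {"host": {"name": "web-01"}, "organization": {"id": "org-1", "name": "Example Corp"}},
    {"host": {"name": "db-01"}, "organization": {"id": "org-2", "name": "Other Org"}},
]

matches = in_organization(docs, "Example Corp")
```

With a dedicated `user.profile.organization` field, the same question would need a second clause for every source that keeps using `organization.*`.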
Co-authored-by: Eric Beahan <[email protected]>
LGTM 👍
I have updated stage 1 RFC artifacts per my prior comment. @jasonrhodes @chrisdistasio please give this PR a formal approval so @ebeahan can help move the RFC to stage 2. Thank you!
@ebeahan, resurfacing this so we can move this to Stage 2. Thanks for your help!
rfcs/text/0041-asset-integration.md
@@ -1,7 +1,7 @@
 # 0041: Asset Integration
 <!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->

-- Stage: **0 (strawperson)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
+- Stage: **1 (Draft)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
Are we still targeting stage 1 here @SourinPaul? You mentioned stage 2 in a different conversation.
@ebeahan, thanks for the ping. We are currently in stage 1 and targeting stage 2.
Do I need to update this section? Please advise, or feel free to update before merging.
Before merging, can @elastic/sec-deployment-and-devices and/or @trisch-me also review these changes?
an infrastructure. These fields can be nested under other objects that
identifies an asset such as host, user, network, and cloud schemas.
reusable:
  top_level: false
Why should asset not be used on the top level?
Probably because these fields are tied to a main object and have no meaning on their own. Good question, though.
@MikePaquette @oatkiller @jaredburgettelastic, do you have any insights on why `asset` should not be used at the top level?
IMO, the asset fieldset should and can be used at the top level. There are valid use cases where an asset needs to be represented independently of any specific context like host, user, network, or cloud. For example, consider an IT asset management system (some CMDB) that tracks all the assets in an organization. This includes not only physical assets like workstations and servers but also mobile devices and other assets that might not fit neatly into the host, user, network, or cloud fieldsets.
Also, based on our use cases, we often need to query or analyze assets across different contexts (e.g., find all assets owned by a specific user, regardless of whether they're associated with a host, network, or cloud). This would be easier to achieve if asset is a top-level field.
With this in mind, I propose that we use the asset fieldset at the top level. This would give us the flexibility to represent assets that are not directly associated with a host, user, network, or cloud environment.
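To make the cross-context query concrete, here is a small Python sketch. It assumes a hypothetical top-level `asset.owner` field and made-up sample documents; neither is confirmed by the RFC text quoted in this thread.

```python
from typing import Any


def assets_owned_by(docs: list[dict[str, Any]], owner: str) -> list[dict[str, Any]]:
    """Return documents whose top-level asset.owner equals owner.

    asset.owner is a hypothetical field name used only for illustration.
    """
    return [d for d in docs if d.get("asset", {}).get("owner") == owner]


docs = [
    {"asset": {"id": "a-1", "owner": "alice"}, "host": {"name": "laptop-01"}},
    {"asset": {"id": "a-2", "owner": "alice"}, "cloud": {"provider": "aws"}},
    # A CMDB-style record (for example a mobile device) with no host/user/network/cloud context
    {"asset": {"id": "a-3", "owner": "alice"}},
    {"asset": {"id": "a-4", "owner": "bob"}, "host": {"name": "laptop-02"}},
]

owned_ids = [d["asset"]["id"] for d in assets_owned_by(docs, "alice")]
```

If `asset` could only be nested, the same lookup would have to check `host.asset.owner`, `user.asset.owner`, `network.asset.owner`, and `cloud.asset.owner` separately, and the third document would have nowhere to live.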
@tinnytintin10 agreed. Because of the open source nature of ECS, we should not limit ourselves to the vision of solution (Security & Observability) use cases only, as any Elastic user may desire to map data into the asset schema the likes of which we haven't considered.
@trisch-me, could you please review this from an Otel perspective? Should we be planning a contribution to Semantic Conventions?
@@ -0,0 +1,170 @@
---
- name: user.profile.id
Are we not creating `profile` as a separate namespace because it is tailored to, and should only be used within, the `user` namespace?
multi_fields:
- type: text
example: [email protected]
description: Array of additional user identities (usually email addresses).
Shouldn't it have the following, since the description says it is an array?
normalize:
  - array
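As an illustration of what array normalization implies for producers, here is a hedged Python sketch of an ingest-side helper that always emits a list for such a field. The helper name and the document key `other_identities` are hypothetical; the field's actual name is not visible in this part of the diff.

```python
from typing import Any


def normalize_array(value: Any) -> list[Any]:
    """Coerce a value to a list, as an array-normalized field expects.

    None becomes an empty list, lists and tuples are copied into a list,
    and any scalar is wrapped in a single-element list.
    """
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Hypothetical source record where the identities arrive as a single string
raw = {"other_identities": "alice@example.com"}
doc = {"user": {"profile": {"other_identities": normalize_array(raw["other_identities"])}}}
```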
level: extended
type: keyword
example: Regular
description: Further classification type for the user account.
How does this relate to plain `user.profile.type`? Should we add more detail to the description to answer this question for ECS users?
type: keyword
example: US - Washington - Distributed
description: Assigned location for the user account.
- name: user.profile.mobile_phone
I thought it was stated earlier that these two phone fields would be removed?
level: extended
type: date
example: June 5, 2023 @ 18:25:57.000
description: Date account was activated.
The description should state the expected date format.
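To show why the format matters, here is a hedged Python sketch that parses the example value from the diff and re-emits it as ISO 8601. Two assumptions are mine rather than the RFC's: the display-style pattern is inferred from that single example, and the timestamp is treated as UTC because the example carries no zone. Month-name parsing with `%B` also assumes an English locale.

```python
from datetime import datetime, timezone

# Pattern inferred from the single example "June 5, 2023 @ 18:25:57.000" (assumption)
DISPLAY_FORMAT = "%B %d, %Y @ %H:%M:%S.%f"


def to_iso8601(display_value: str) -> str:
    """Parse a display-style timestamp and return it as ISO 8601 with milliseconds.

    The input has no time zone, so UTC is assumed here.
    """
    parsed = datetime.strptime(display_value, DISPLAY_FORMAT)
    return parsed.replace(tzinfo=timezone.utc).isoformat(timespec="milliseconds")


iso_value = to_iso8601("June 5, 2023 @ 18:25:57.000")
```

A description that names the expected format would let integration authors skip this kind of guesswork.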
From an OTel point of view, we don't have assets as a namespace there, and it is currently not a good place to start adding a new namespace. To open a PR there we would need a use-case story explaining why we should have those fields.
@tehilashn @eyalkraft @oren-zohar @kfirpeled @romulets, I have taken a first look at some of the suggestions here that I believe will impact some of the experiences we want to build out. See my thoughts and feedback below. I will soon take a second pass-through of the remaining asset and user fields. In the meantime, please review and provide your thoughts.
- cloud
type: group
fields:
- name: category
@MikePaquette @oatkiller @jaredburgettelastic Have we considered establishing a more detailed taxonomy up-front for asset categorization, which will include a set of allowed values (like we do with the event fieldset) that could expand as needed?
A detailed taxonomy could significantly enhance the schema’s flexibility and precision. For instance, consider the following:
| Field Name | Description |
|---|---|
| `asset.category` | The top-level classification of assets, reflecting the primary nature of the asset groups, like `Software`, `Infrastructure`, or `Identity`. See mind-map for more details |
| `asset.subcategory` | A division within a category that encompasses a range of assets sharing common characteristics, such as 'Applications' under 'Software', or 'Compute' and 'Storage' under 'Infrastructure'. See mind-map for more details |
| `asset.type` | A specific classification of assets within a subcategory, characterizing the primary function or purpose, like 'Operating System' or 'Development Tools' under the 'Applications' subcategory. See mind-map for more details |
| `asset.subtype` | The most granular classification that provides specific details about the asset, including the exact version, model, or configuration, such as 'Microsoft Windows 10' or 'Ubuntu Server 20.04' under the 'Operating System' type or AWS EC2 under the Virtual Machine type. See mind-map for more details |
This level of granularity/detail in our schema/taxonomy will enable us to deliver experiences tailored to an asset's specific classification—like specific security suggestions for workstations versus cloud storage vs cloud compute assets (i.e., allow our product to behave more intelligently). Precise asset classification will also enable us to provide sophisticated billing models in our serverless offering by allowing us to track assets more accurately (in a transparent and explainable way we can expose to users for billing). And of course, as new technologies emerge that we need to track, this schema can evolve without overhauling the existing framework, ensuring the product adapts to future developments.
That being said, over-flexibility at higher classification levels could lead to inconsistency and an unpredictable product experience. Therefore, IMO for `asset.category` and `asset.subcategory`, a defined set of allowed values (similar to what we do with event categorization fields) is crucial to maintain a standardized, navigable, and intuitive interface. This will ensure that as users interact with different assets, the experience remains coherent and aligned with the overall experience we aim to provide:
| Field Name | Allowed Values | Rationale |
|---|---|---|
| `asset.category` | Yes | Ensures consistency and predictability at the highest level of asset classification. This aids in defining standard operational procedures and analytics across the organization. |
| `asset.subcategory` | Yes | Provides a controlled expansion of categories, ensuring relevant details are captured while maintaining uniformity in data segmentation for reliable analysis and reporting. |
| `asset.type` | No | Allows for specific and varied asset identification within subcategories, accommodating unique and diverse organizational assets without modifying the overarching schema. |
| `asset.subtype` | No | Permits detailed and nuanced asset distinctions, reflecting the granular variations and characteristics specific to the asset types for in-depth management and tracking. |
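The split between allowlisted and free-form levels can be sketched in Python as follows. This is an illustration only: the allowed values are just the examples named in this comment (the mind-map is not reproduced here), `Identity` has no subcategories listed so it is left empty, and the function name is hypothetical.

```python
from typing import Any

# Allowed values taken only from the examples in the comment above (incomplete by design)
ALLOWED_SUBCATEGORIES: dict[str, set[str]] = {
    "Software": {"Applications"},
    "Infrastructure": {"Compute", "Storage"},
    "Identity": set(),  # no subcategories were named in the comment
}


def validate_asset(asset: dict[str, Any]) -> list[str]:
    """Check asset.category and asset.subcategory against the allowlist.

    asset.type and asset.subtype are intentionally not validated,
    mirroring the "No" rows in the table above.
    """
    errors: list[str] = []
    category = asset.get("category")
    subcategory = asset.get("subcategory")
    if category not in ALLOWED_SUBCATEGORIES:
        errors.append(f"asset.category {category!r} is not an allowed value")
    elif subcategory is not None and subcategory not in ALLOWED_SUBCATEGORIES[category]:
        errors.append(f"asset.subcategory {subcategory!r} is not allowed under {category!r}")
    return errors


ok = validate_asset(
    {"category": "Infrastructure", "subcategory": "Compute", "type": "Virtual Machine", "subtype": "AWS EC2"}
)
bad_category = validate_asset({"category": "Machinery", "type": "Cutting Machine"})
bad_subcategory = validate_asset({"category": "Software", "subcategory": "Storage"})
```

Rejecting an unknown category is exactly the cost side of the trade-off: a new kind of asset needs a schema change before it can be ECS-compliant.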
FYI @dimadavid @r4zr32d3k1l, as I believe decisions here will have an impact on our ultimate UX for the asset inventory experience we want to build out
In general, I agree with the sentiment of "establishing a more detailed taxonomy" 👍
As for the fields, I wonder if we should instead leverage the existing field names for categorization that ECS already provides. These are (in largest-to-smallest bucket order): `kind`, `category`, `type`, `outcome`, and can be found here.
The only one of those that doesn't fit the use case of assets would be `outcome`, and therefore it may be appropriate to incorporate part of your proposal, leaving us with `kind`, `category`, `type`, and `subtype`.
As for whether each categorization bucket has a list of allowed values, I'm curious to hear the thoughts of others. I believe I agree with your assessment that the highest values should be allowlisted (in my proposal, that would be `kind` and `category`, while in yours that would be `category` and `subcategory`). The counterpoint, though, is that I believe the primary reason `events` categorization had allowlisted values is because events data can be seen as metadata, while asset data is domain-specific, and we don't know every desired domain up front. If we went with the taxonomy currently defined in the linked mind-map, a new ECS RFC would have to be created to update those allowed values if an Elastic user wanted to begin considering their production cutting machine (as a random example) as an asset, and have that data be ECS-compliant. And maybe that's fine! But food for thought. Would the alternative, that solutions define and standardize on their own allowlisted values for these fields and ECS remains agnostic to them, be a more desirable approach?
type: keyword
example: [email protected]
description: The primary user entity who owns the 'Host' asset
- name: priority
@oatkiller @jaredburgettelastic, is there a clear distinction between asset `priority` and asset `criticality`? If so, what's the difference? See below:
- name: priority
  level: extended
  type: keyword
  example: Priority 1
  description: A priority classification for the asset obtained from outside this
    system, such as from external CMDB or Directory service.
- name: criticality
  level: extended
  type: keyword
  example: Critical
  description: A business criticality classification assigned to the asset.
The only difference seems to be where the context (how important this asset is to the organization/security team) came from, either from an external source or natively provided in our system. I don't think this warrants separate mapping and I also think it will be confusing for users.
example: workstation
description: "A sub-classification of assets. Possible values for host assets:
workstation, S3,Compute. Possible values for host assets: (NULL/ TBD)"
- name: id
Should we include multiple forms of asset identifiers (IDs) in the schema to capture both an asset's native/external ID (such as the ARN for an AWS resource) and a unique internal identifier (some UUID or a hash value of some sort we come up with) to account for assets that might have native/external IDs that could lead to collisions or lack of uniqueness? I could see this being an issue with k8s components, some IPs we have limited metadata on, etc.
This is a technical consideration, so I don't have a strong opinion one way or the other. However, we should keep two things in mind:
- If we end up going with only one ID value, we should ensure (where possible/relevant) we use an asset's native ID (like an AWS asset's ARN) as the value for this field.
- If we do end up going with a dual ID approach (an internal UUID and native ID), then we should capture these two data points separately in a clear/easy-to-understand field. Ideally, from a UX perspective, we don't surface this UUID but instead, the ID the user is more familiar with (native ID).
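One way the dual-ID approach could look, sketched in Python under stated assumptions: the internal ID is a deterministic UUIDv5 derived from a source label plus the native ID, and it is stored next to a hypothetical `asset.native_id` field. The namespace URL, the field name, and the source labels are all made up for illustration; the RFC does not define them.

```python
import uuid
from typing import Any

# Hypothetical namespace for internal asset IDs (any fixed UUID would do)
ASSET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/asset-ids")


def internal_asset_id(source: str, native_id: str) -> str:
    """Derive a stable internal ID from the source and the native ID.

    Including the source keeps two assets apart when different sources
    reuse the same native ID (for example, pods in two clusters).
    """
    return str(uuid.uuid5(ASSET_NAMESPACE, f"{source}:{native_id}"))


def build_asset(source: str, native_id: str) -> dict[str, Any]:
    """Build a document that keeps both identifiers in separate fields."""
    return {
        "asset": {
            "id": internal_asset_id(source, native_id),
            "native_id": native_id,  # hypothetical field name; this is what a UI would surface
        }
    }


pod_a = build_asset("k8s-cluster-a", "pod/default/web")
pod_b = build_asset("k8s-cluster-b", "pod/default/web")
```

Because the internal ID is derived rather than random, re-ingesting the same asset yields the same value, while the familiar native ID stays available for display.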
Experience and external standards suggest that just a singular "ID" field, without context, is not enough when IDs are exposed beyond some internal bound. With that said, I'd go even further: I don't feel we should only limit ourselves to "native"/"internal" or surrogate/non-surrogate IDs.
Some concrete examples to make this case:
- Microsoft uses SIDs for Windows Server, which are made up of not only an identifier, but also the authority that assigned that identifier (the authority being the "context" mentioned above).
- AWS EC2 instances have an ARN, but in most AWS APIs you don't use the ARN to identify that EC2 instance. Instead, you use its "Instance ID", which would look something like "i-01e8de571c1ea7903". But you can compute the ARN if you have more information on where that instance is hosted (provided the region, account id, and instance ID). And both of these are useful depending on the context, as you'd need the ARN if you were writing an AWS policy, but the instance ID if you were trying to stop the instance. So ID context is helpful (in this case, the identifier type such as "instance ID" vs "ARN", is the "context" mentioned above).
Although prior art in ECS doesn't match this (such as the ECS host.id field), my recommendation would be for `asset.id` to represent something like an array of identifiers, or subfields, having context data, such as (at least) a type. An example document might look like:
{
  "asset": {
    "id": {
      "arn": "arn:aws:ec2:us-east-1:123456789012:instance/i-01e8de571c1ea7903",
      "instance_id": "i-01e8de571c1ea7903"
    }
  }
}
or even
{
  "asset": {
    "ids": [
      {
        "id_type": "arn",
        "id_value": "arn:aws:sns:us-east-1:123456789012:example-sns-topic-name"
      }
    ]
  }
}
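The point about computing one identifier from another can be sketched in Python. The sketch assumes the `instance/` resource prefix that AWS documents for EC2 instance ARNs, reuses the `id_type`/`id_value` shape from the second example, and uses hypothetical helper names.

```python
from typing import Optional


def ec2_instance_arn(region: str, account_id: str, instance_id: str, partition: str = "aws") -> str:
    """Compose an EC2 instance ARN from its region, account ID, and instance ID."""
    return f"arn:{partition}:ec2:{region}:{account_id}:instance/{instance_id}"


def typed_ids(region: str, account_id: str, instance_id: str) -> list[dict[str, str]]:
    """Build the list of typed identifiers for one EC2 instance."""
    return [
        {"id_type": "instance_id", "id_value": instance_id},
        {"id_type": "arn", "id_value": ec2_instance_arn(region, account_id, instance_id)},
    ]


def find_id(ids: list[dict[str, str]], id_type: str) -> Optional[str]:
    """Return the first identifier value of the requested type, or None."""
    for entry in ids:
        if entry["id_type"] == id_type:
            return entry["id_value"]
    return None


doc = {"asset": {"ids": typed_ids("us-east-1", "123456789012", "i-01e8de571c1ea7903")}}
arn = find_id(doc["asset"]["ids"], "arn")
```

A consumer writing a policy would ask for the `arn` type, while one stopping the instance would ask for `instance_id`, without either having to guess which form a bare `asset.id` holds.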
On the one hand, I like how rich it becomes. On the other hand, I wonder how easy it will be to correlate data.
For example, how easy would it be to have an ES|QL query joining documents
<!--
Stage 1: If the changes include field additions or modifications, please create a folder titled as the RFC number under rfcs/text/. This will be where proposed schema changes as standalone YAML files or extended example mappings and larger source documents will go as the RFC is iterated upon.
-->

This proposal extends the existing ECS field set to store inventory metadata for hosts and users from external application repositories. Using ECS to store such fields will improve metadata querying and retrieval across various use cases.
This description primarily focuses on hosts and users. Have we considered the benefits of expanding the asset schema to encompass a broader range of assets from the start?
An asset schema that enables us to capture a broader range of assets is essential for developing new asset-centric features and experiences, such as the proposed asset inventory experience/workflow.
An extensible schema will also be crucial for specific use cases like our proposed enhancements to SIEM to enable better Cloud Detection and Response (CDR), where accurately modeling and representing a diverse array of cloud assets beyond hosts and users is a crucial requirement for the experiences we want to deliver. For example, today, in our SIEM, we have threat detection rules for AWS and other CSPs that detect malicious activity related to non-host and user entities. When these detection rules trigger, our current alert flyout will only highlight host and user assets as being present, even though the detection rule explicitly mentions other assets too (ex., RDS database, SecurityGroup, etc.); having a standardized schema is one of the first steps in addressing enhancements like this.
```

#### AzureAD Hosts
@oatkiller @jaredburgettelastic do you have any sample AzureAD data on hand? If so, can you help out with this section or should we drop it?
@jaredburgettelastic @oatkiller, while working on concrete examples for my ask in this epic, I realized there might be an issue with using the term "asset" to model the wide variety of entities present in our logs. The core concern is that "asset" implies ownership or inherent value to the user, which isn't always the case for some resources in the logs. For example, a network flow log or cloud trail log might reference an external IP address belonging to a third party or even a malicious actor. Modeling these with the "asset" field set could be misleading and cause confusion. In short, while every asset is an entity, not all entities are assets. Instead, I propose that we consistently use the term "entity" across our entire security solution, from the data model being proposed here to the in-product experiences (e.g., This change would provide several key benefits:
My ask for you:
@tinnytintin10 thank you for the input! @tommyers-elastic on the Observability Solution mentioned that they are also shifting away from the term "assets" and toward "entities", so this aligns well from that perspective. However, I'm unsure of:
@tommyers-elastic could you provide input from the perspective of Observability? I know of only one feature in the Security Solution today that adheres to the asset portion of this RFC, which is "Asset Criticality". There are two pieces to this: one is an index that stores asset criticality information, and the other is an enrichment on alert documents that adds the The asset criticality feature is currently behind an advanced setting, so there could be wiggle room there in our ability to change the data structures. However, from a nomenclature perspective, "Asset Criticality" explicitly applies to resources that are owned by the customer/Elastic Security user, because they must explicitly assign those classifications. I don't know if our product partners have a desire to change that nomenclature in the Security Solution platform. @MikePaquette could you please provide input from a product perspective on this portion? cc @oatkiller
@tinnytintin10, I completely agree with your underlying point that using the term "asset" to model the wide variety of entities in our logs can be misleading, as it implies ownership or inherent value to the user, which isn't always the case. However, I suggest keeping "Asset Inventory" as is. The asset inventory integration focuses on managed entities, aligning with your definition of "assets." Additionally, the collected assets could still be indexed into an "entity" ECS, which makes sense in my opinion. This distinction helps maintain clarity and consistency for user/organization-owned resources. What do you think about this?
@tinnytintin10 you are probably aware of this, but to add to @oren-zohar's point from an InfoSec perspective: in InfoSec we are using the term asset, and this comes from the ISO definition of an asset: "An asset is an item, thing or entity that has potential or actual value to an organisation" (although other definitions of an asset would imply ownership, this one does not address ownership explicitly). I acknowledge terminology is extremely important, and there are of course many different inputs to this. Hope this helps!
@jaredburgettelastic The o11y team has chosen to use "entity" (instead of asset). I'm not aware of any o11y feature(s) that use
Hi! We just realized that we haven't looked into this PR in a while. We're We're labeling this PR as Thank you for your contribution! |
@MikePaquette @lauravoicu @chrisdistasio - any progress on reviewing this?
Stage1 checklist:
Key change-log:
`user.phone.*` fields from this stage-1 of this proposal.