Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

about making breaking changes to serialization format #392

Open
ambyjkl opened this issue Aug 29, 2024 · 4 comments
Open

about making breaking changes to serialization format #392

ambyjkl opened this issue Aug 29, 2024 · 4 comments

Comments

@ambyjkl
Copy link
Contributor

ambyjkl commented Aug 29, 2024

As I was looking at potentially implementing more features from #1 and uBO-parity pending features that are already available in uBO , I noticed that NetworkFilterMask is already full, and in order to add more bit flags, it would need to be expanded from u32 to u64. That will create a breaking change to the on-disk serialization format. What is the current policy about this?

@antonok-edm
Copy link
Collaborator

@ambyjkl this is a fun topic 😅

The biggest issue with expanding the bit flags beyond u32 is that it will cause the in-memory representation of each network filter to be increased significantly. That means the adblock engine will consume much more memory at runtime and will also make queries a bit slower due to cache effects.

In terms of the on-disk serialization format - we can always append additional data to the end of the format, which will be ignored by earlier versions of adblock-rust, and then use it to "reconstruct" missing information elsewhere in newer versions only. So that's not a major concern, although it can be slightly annoying to build the glue code for this.

The vision I have for #1 is to actually free up some of the existing flags using external mechanisms. As an example, if you look at Blocker, you'll see that we already sort all the $important filters into a separate Vec, so the IS_IMPORTANT bitflag isn't really adding any utility for runtime matching once the engine is compiled. Pulling this off will require some form of intermediate representation for network filters that stores e.g. IS_IMPORTANT in a format that can be safely discarded later on.

@ambyjkl
Copy link
Contributor Author

ambyjkl commented Aug 29, 2024

The biggest issue with expanding the bit flags beyond u32 is that it will cause the in-memory representation of each network filter to be increased significantly.

On my 64 bit system, after rust's struct alignment, both u32 and u64 versions use 208 bytes per NetworkFilter

#[test]
fn compare() {
    println!(
        "with u32 {} with u64 {}",
        std::mem::size_of::<NetworkFilter>(),
        std::mem::size_of::<NetworkFilterNew>(),
    );
}

outputs:

with u32 208 with u64 208

With 32 bit, there would be a 4 byte increase, but I'm not sure how much that would matter, since the struct is already quite big. I see some low-hanging fruit that should immediately address some of this though. As for cache effects, there are some ways to deal with that, such as using a struct of arrays instead of an array of structs (Vec<NetworkFilter>) as it is currently. Doing that can also enable usage of SIMD in the checking of the bit masks.

@ambyjkl
Copy link
Contributor Author

ambyjkl commented Aug 29, 2024

swapping out all the Vec and String with Box brings it down to 176 already:

pub struct NetworkFilterNew {
    pub mask: u64,
    pub filter: FilterPart,
    pub opt_domains: Option<Box<[Hash]>>,
    pub opt_not_domains: Option<Box<[Hash]>>,
    /// Used for `$redirect`, `$redirect-rule`, `$csp`, and `$removeparam` - only one of which is
    /// supported per-rule.
    pub modifier_option: Option<Box<str>>,
    pub hostname: Option<Box<str>>,
    pub(crate) tag: Option<Box<str>>,

    pub raw_line: Option<Box<str>>,

    pub id: Hash,

    // All domain option values (their hashes) OR'ed together to quickly dismiss mis-matches
    pub opt_domains_union: Option<Hash>,
    pub opt_not_domains_union: Option<Hash>,
}
old 208 new 176

@ambyjkl
Copy link
Contributor Author

ambyjkl commented Aug 30, 2024

@antonok-edm I managed to find another memory optimization, a rather massive one at that, I would appreciate it if you took a look at it #393

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants