Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ak.unzip touches all contents whilst typetracing #2372

Closed
jpivarski opened this issue Apr 7, 2023 · 6 comments · Fixed by #2373
Closed

ak.unzip touches all contents whilst typetracing #2372

jpivarski opened this issue Apr 7, 2023 · 6 comments · Fixed by #2373
Labels
bug The problem described is something that must be fixed

Comments

@jpivarski
Copy link
Member

Version of Awkward Array

HEAD, after #2370

Description and code to reproduce

Even after applying #2370 (which fixes a similar bug), the following code:

import json
import awkward as ak


def delta_r2(a, b):
    return (a.eta - b.eta) ** 2 + (a.phi - b.phi) ** 2


fromjson = {
    "class": "RecordArray",
    "fields": ["muon", "jet"],
    "contents": [
        {
            "class": "ListOffsetArray",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "fields": ["pt", "eta", "phi", "crossref"],
                "contents": [
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "muon_pt!",
                    },
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "muon_eta!",
                    },
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "muon_phi!",
                    },
                    {
                        "class": "ListOffsetArray",
                        "offsets": "i64",
                        "content": {
                            "class": "NumpyArray",
                            "primitive": "int64",
                            "inner_shape": [],
                            "parameters": {},
                            "form_key": "muon_crossref_content!",
                        },
                        "parameters": {},
                        "form_key": "muon_crossref_index!",
                    },
                ],
                "parameters": {},
                "form_key": "muon_record!",
            },
            "parameters": {},
            "form_key": "muon_list!",
        },
        {
            "class": "ListOffsetArray",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "fields": ["pt", "eta", "phi", "crossref", "thing1"],
                "contents": [
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "jet_pt!",
                    },
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "jet_eta!",
                    },
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "jet_phi!",
                    },
                    {
                        "class": "ListOffsetArray",
                        "offsets": "i64",
                        "content": {
                            "class": "NumpyArray",
                            "primitive": "int64",
                            "inner_shape": [],
                            "parameters": {},
                            "form_key": "jet_crossref_content!",
                        },
                        "parameters": {},
                        "form_key": "jet_crossref_index!",
                    },
                    {
                        "class": "NumpyArray",
                        "primitive": "int64",
                        "inner_shape": [],
                        "parameters": {},
                        "form_key": "jet_thing1!",
                    },
                ],
                "parameters": {},
                "form_key": "jet_record!",
            },
            "parameters": {},
            "form_key": "jet_list!",
        },
    ],
    "parameters": {},
    "form_key": "outer!",
}

form = ak.forms.from_json(json.dumps(fromjson))

ttlayout, report = ak._nplikes.typetracer.typetracer_with_report(form)

ttarray = ak.Array(ttlayout)

a, b = ak.unzip(ak.cartesian([ttarray.muon, ttarray.jet], axis=1, nested=True))

print("ab>>>", report.data_touched, "\n")

mval = delta_r2(a, b)

print("dr>>>>", report.data_touched, "\n")

mmin = ak.argmin(mval, axis=2)

print("mmin>>", report.data_touched, "\n")

ak.firsts(b[mmin], axis=2).pt

print("pt>>>>", report.data_touched, "\n")

produces the following output:

ab>>> ['muon_list!', 'jet_list!', 'muon_crossref_index!', 'jet_crossref_index!'] 

dr>>>> ['muon_list!', 'jet_list!', 'muon_crossref_index!', 'jet_crossref_index!', 'muon_eta!', 'jet_eta!', 'muon_phi!', 'jet_phi!'] 

mmin>> ['muon_list!', 'jet_list!', 'muon_crossref_index!', 'jet_crossref_index!', 'muon_eta!', 'jet_eta!', 'muon_phi!', 'jet_phi!'] 

pt>>>> ['muon_list!', 'jet_list!', 'muon_crossref_index!', 'jet_crossref_index!', 'muon_eta!', 'jet_eta!', 'muon_phi!', 'jet_phi!', 'jet_pt!', 'jet_crossref_content!', 'jet_thing1!'] 

It's a little mysterious (but not a major pain point) that the "crossref indexes" are touched by the ak.cartesian step, which ought to treat the muons and jets as opaque objects. A little mysterious, and maybe related.

But the real problem is represented by the "jet thing" in the final output. The "jet thing" should never have been accessed by any of these operations: delta_r2, ak.argmin, the b[mmin] slice, or ak.firsts. It's also a stand-in for dozens of large fields, so it's a real pain point that needs to be addressed.

@jpivarski jpivarski added the bug The problem described is something that must be fixed label Apr 7, 2023
@jpivarski
Copy link
Member Author

From @lgray:

so in particular it's when there is a ListOffsetArray in the records that are being cartesian'd and then [index]'d

and

and it particular I'd also say that tagging the ListOffsetArrays in the cartesian is also wrong? Unless there's some mandatory manipulation that needs to happy to the objects? That makes a lot of assumptions about what those ListOffsetArrays are doing that doesn't make much sense (at least on first thought).

@lgray, if you mean the appearance of 'muon_list!', 'jet_list!' in ab>>>, I expect them to be there because ak.cartesian needs to know the structure of the lists at the axis value where it's enumerating the Cartesian product. In an imperative implementation, it would be doing for loops over those lists, so it definitely has to see the lists and their lengths.

But if you mean the appearance of 'muon_crossref_index!', 'jet_crossref_index!', yeah, that's what I called mysterious. It might be a clue.

@lgray
Copy link
Contributor

lgray commented Apr 7, 2023

But if you mean the appearance of 'muon_crossref_index!', 'jet_crossref_index!', yeah, that's what I called mysterious. It might be a clue.

Yes, I mean this exactly. muon_list! and jet_list! are expected.
The cross refs make little sense that they should be there, they play no role in broadcasting.

@lgray
Copy link
Contributor

lgray commented Apr 7, 2023

I'll also mention that the cross refs can be a pain since they're doubly nested indices, so their on disk / in-memory representations can be pretty large (since they're N^2 in the size of the collection, for two large collections). It would be best to avoid loading them spuriously.

@agoose77 agoose77 changed the title ak.cartesian with ListOffsetArray in a RecordArray causes unrelated fields to get touched by TypeTracerArray analysis ak.unzip touches all contents Apr 7, 2023
@agoose77 agoose77 changed the title ak.unzip touches all contents ak.unzip touches all contents whilst typetracing Apr 7, 2023
@agoose77
Copy link
Collaborator

agoose77 commented Apr 7, 2023

The cause of this bug is recursively_apply, which visits all of the nodes in ak.unzip. Currently, recursively_apply invokes touch_data on each node that it visits.

I think this is reasonable for recursively_apply, which trims the contents as it goes. The real bug here is in our visitor, which isn't quite behaving properly.

@jpivarski
Copy link
Member Author

jpivarski commented Apr 7, 2023

Trimming content (which recursively_apply must do) shouldn't/doesn't? count as a touch. That's part of why slicing

some_array[from_here:to_there]

doesn't count as a touch.

I hope you're right that it's something that can be fixed by fixing the visitor.

@agoose77
Copy link
Collaborator

agoose77 commented Apr 7, 2023

@jpivarski actually yes, it doesn't. I was looking at this branch (not related to the layout in this issue, but I was already on the tab), which does (is this a bug?)

self._touch_data(recursive=False)

I have to wrap my hand around how we want touch_data to behave. In theory, we want any operation that involves buffers to touch data, but the assertion

some_array[from_here:to_there]

doesn't count as a touch.

disagrees with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants