GPU-like Memory Hierarchy using Merlin #1161

Closed
mkhairy opened this issue Nov 6, 2018 · 16 comments

@mkhairy

mkhairy commented Nov 6, 2018

Hello,

I am trying to use Merlin to model a non-coherent cache hierarchy, similar to recent NVIDIA GPU cache hierarchies:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

So, I connect a processor-side non-coherent L1 cache to a Merlin crossbar router on one side and directory-less memory-side L2 caches on the other side.

However, I ran into the error listed below. It seems like Merlin has to be connected to a directory interface. Is that true? Does Merlin need a coherence protocol to work?

I have attached the Python configuration file that tries to model the GPU-like memory hierarchy.
python file.zip

#0 0x00000000004caa9a in SST::Units::operator== (this=this@entry=0x20515f40,
lhs=...) at unitAlgebra.cc:278
#1 0x00000000004cce82 in SST::Units::operator!= (lhs=..., this=0x20515f40)
at unitAlgebra.h:90
#2 SST::UnitAlgebra::operator> (this=this@entry=0x20515f38, v=...)
at unitAlgebra.cc:480
#3 0x00007fffe8f3ffb5 in SST::Merlin::LinkControl::init (this=0x20515ef0,
phase=<optimized out>) at linkControl.cc:161
#4 0x00007fffe619d7b8 in SST::MemHierarchy::MemNIC::init (this=0x20514de0,
phase=1) at memNIC.cc:224
#5 0x00007fffe6135f8c in SST::MemHierarchy::Cache::init (this=0x203c37f0,
phase=1) at cacheEventProcessing.cc:456
#6 0x00000000004ba874 in SST::Simulation::initialize (this=0xb85c00)
at simulation.cc:430
#7 0x000000000046382a in start_simulation (tid=tid@entry=0, info=...,
barrier=...) at main.cc:321
#8 0x0000000000456de3 in main (argc=3, argv=0x7fffffffe278) at main.cc:679

@hughes-c
Member

hughes-c commented Nov 7, 2018

Same error with a smaller configuration:
single_core.py

@gvoskuilen
Contributor

@mkhairy There are some port mismatches in the configuration. Some ports only work over a network and some will not work over a network; in the configuration there are network ports connecting to non-network ports and non-network ports connecting to Merlin. It looks like what you're trying to do for the GPU cache hierarchy is:
GPU <--> L1 <--> Network <--> L2 <--> Memory
Is that right? Or is memory supposed to be on the network too?

@mkhairy
Author

mkhairy commented Nov 8, 2018

Yes, that is right, as you described:
GPU <--> L1 <--> Network <--> L2 <--> Memory

The L1 caches are non-coherent, so there is no directory on the L2 cache side.
Only the L2 caches are connected to the network, not the memory.

@mkhairy
Author

mkhairy commented Nov 8, 2018

Could you please share with me the right way to connect the components to Merlin to describe the above model?
Thanks!

@mkhairy
Author

mkhairy commented Nov 8, 2018

This PDF attachment contains a diagram of the model.
stt-gpu-mem.pdf

I have also attached the original Python script.
gpu-test.txt

@gvoskuilen
Contributor

Thanks! We've never tested "network <-> L2 <-> memory" so there's a small change to the cache constructor that will be needed. I'll push it soon (today/tomorrow) and let you know, along with the right ports and memNIC parameters to set. Essentially you'll have L1 connect to the GPU via the "high_network_0" port and connect to the network via the "cache" port. Then you'll have L2 connect to the network via the "cache" port and to the memory via the "low_network_0" port. The memory will stay on the "direct_link" port. Then there will be a few memNIC parameters to set since you don't have a directory in that path.
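
A minimal sketch of that wiring in an SST Python config might look like the following (the GPU component type "myGpu.Core" and its "cache_link" port are placeholders for whatever GPU model is used, the 100ps latencies are arbitrary, and the usual cache/router parameters are omitted):

import sst

gpu = sst.Component("gpu0", "myGpu.Core")            # placeholder GPU model
l1  = sst.Component("l1_0", "memHierarchy.Cache")
l2  = sst.Component("l2_0", "memHierarchy.Cache")
mem = sst.Component("mem0", "memHierarchy.MemController")
rtr = sst.Component("xbar", "merlin.hr_router")      # router params (id, num_ports, topology, ...) not shown

# GPU <--> L1 via the L1's "high_network_0" port
link_gpu_l1 = sst.Link("gpu_l1_0")
link_gpu_l1.connect( (gpu, "cache_link", "100ps"), (l1, "high_network_0", "100ps") )

# L1 <--> network via the L1's "cache" port (merlin router ports are "port0", "port1", ...)
link_l1_net = sst.Link("l1_net_0")
link_l1_net.connect( (l1, "cache", "100ps"), (rtr, "port0", "100ps") )

# network <--> L2 via the L2's "cache" port
link_net_l2 = sst.Link("net_l2_0")
link_net_l2.connect( (rtr, "port1", "100ps"), (l2, "cache", "100ps") )

# L2 <--> memory via the L2's "low_network_0" port and the memory's "direct_link" port
link_l2_mem = sst.Link("l2_mem_0")
link_l2_mem.connect( (l2, "low_network_0", "100ps"), (mem, "direct_link", "100ps") )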

@mkhairy
Author

mkhairy commented Nov 9, 2018

Thanks a lot!

@gvoskuilen
Contributor

The fix is getting merged right now (the branch is "mh_ports" if you want to grab it in the meantime). For the GPU L1s, connect them to the GPU via "high_network_0" and to the network via "cache". Then add the following parameters to the L1s:
"memNIC.group" : 1, # Think of this as the cache level on the network so L1s get '1' and L2s get '2'
"memNIC.network_bw" : "2155GB/s", # Or whatever it should be

For the GPU L2s, connect them to the network via "cache" and to the memories via "low_network_0".
Then add these parameters to the L2s:
"memNIC.group" : 2,
"memNIC.network_bw" : "2155GB/s", # Or whatever it should be

Since you have one L2 per memory, is the L2 for each memory the "home" for that memory's addresses? If so, you'll need to set each L2 memNIC's address ranges so that the L1s can route requests correctly. As long as each memory is getting a contiguous block of addresses (so mem0 gets 0-X, mem1 gets X-Y, etc.), you shouldn't need to touch the cache "slice" parameters. Just add the region parameters to the L2s:
"memNIC.addr_range_start" : X, # starting address for the associated memory
"memNIC.addr_range_end" : X + memSize - 1, # Last address for the associated memory

Hope this solves the problem! If it does, go ahead and close the issue. Also, if you want to interleave addresses across memories instead of using contiguous chunks, that is also possible but will probably require another change to the cache slice addressing policy.
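
Pulling those parameters together, a sketch for one L1/L2 pair might look like this (the "l1"/"l2" handles follow the wiring sketch above, the bandwidth is the example value, and the 0 start address and 1 GiB memory size are purely illustrative; normal cache parameters such as size and associativity are not shown):

l1.addParams({
    "memNIC.group" : 1,                        # L1s are level 1 on the network
    "memNIC.network_bw" : "2155GB/s",
})

memSize = 1024 * 1024 * 1024                   # illustrative: 1 GiB per memory
l2.addParams({
    "memNIC.group" : 2,                        # L2s are level 2 on the network
    "memNIC.network_bw" : "2155GB/s",
    "memNIC.addr_range_start" : 0,             # first address of the associated memory
    "memNIC.addr_range_end" : memSize - 1,     # last address of the associated memory
})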

@mkhairy
Author

mkhairy commented Nov 9, 2018

Thanks! It is working, and the simulation runs successfully.
Now I am wondering how to interleave data across the L2s every 256 bytes.
I found this parameter:
"memNIC.interleave_size" : "256B"
Is this the right parameter to use? Do I need to change the start and end addresses for interleaving, or can I keep them the same? When I use just the above parameter, I get this error:

[tgrogers-pc02:32089] Signal: Floating point exception (8)
[tgrogers-pc02:32089] Signal code: Integer divide-by-zero (1)
[tgrogers-pc02:32089] Failing at address: 0x7f3c0347819f

@mkhairy
Author

mkhairy commented Nov 18, 2018

Hi, could you please help me with the error above? What is the right way to do the interleaving?

@hughes-c
Member

@mkhairy You need to set the interleave_step size. In your case, this should be (numL2s * interleave_size).
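
For example (with a purely illustrative count of 16 L2 slices and the 256-byte interleave size discussed above), the step works out as follows:

num_l2s = 16                                   # illustrative number of L2 slices
interleave_bytes = 256
l2.addParams({
    "memNIC.interleave_size" : str(interleave_bytes) + "B",            # "256B"
    "memNIC.interleave_step" : str(num_l2s * interleave_bytes) + "B",  # "4096B"
})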

@gvoskuilen
Contributor

@mkhairy Also, there will be an issue with cache sets because the set mapping policy is currently hard-coded, so some sets may not get valid addresses mapped to them. I'm fixing this and will let you know when it's ready. Just to clarify: the interleaving across L2s matches the interleaving across memories, right? Make sure to set the "memNIC." address range/interleave parameters for the L2s and the "cpulink." parameters (same parameter names, different prefix) for the memories so that they match. I'll push the fix to make sure the cache array picks up the interleaving and maps addresses into sets correctly.

@mkhairy
Author

mkhairy commented Nov 20, 2018

@hughes-c Thanks for your help, it's working successfully.
@gvoskuilen Yes, the interleaving across the L2s is the same as across the memories. I set the memNIC parameters for the L2s and the corresponding parameters on the memory controllers as described above. Now I am wondering: what do you mean by setting the interleaving parameters for "cpulink"?

@gvoskuilen
Contributor

That's the memory controller's link manager that faces the L2 (towards the 'cpu'). The memory and the L2 should have the same parameters on their link managers. So set on the memory controllers:
cpulink.addr_range_start
cpulink.interleave_size
cpulink.interleave_step
Set them to the same as what you set the corresponding "memNIC" parameters to in the L2s. This way memory knows it's only storing every nth block of 256 bytes instead of everything.
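
A sketch of the matching memory-controller side, reusing the illustrative 256B-size / 4096B-step values from above (each value must mirror whatever the paired L2's "memNIC." parameter is set to):

mem.addParams({
    "cpulink.addr_range_start" : 0,            # same as the paired L2's memNIC.addr_range_start
    "cpulink.interleave_size" : "256B",        # same as the paired L2's memNIC.interleave_size
    "cpulink.interleave_step" : "4096B",       # same as the paired L2's memNIC.interleave_step
})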

@mkhairy
Author

mkhairy commented Nov 20, 2018

Hi @gvoskuilen, just to make it clear: I have the L2 cache interleaving parameters set like this:
"memNIC.addr_range_start" : memID * 256,
"memNIC.addr_range_end" : self.gpu_memory_capacity_inB - (256 * memID),
"memNIC.interleave_size" : "256B",
"memNIC.interleave_step" : str(self.gpu_mem_parts * 256) + "B",

and I have set the memory controller interleaving parameters to the same values as above. I connect the memory controller and the L2 using a link, as shown below.

link = sst.Link("l2g_mem_link_%d"%next_mem_id)
link.connect( (c0, port0, latency), (c1, port1, latency) )

So, you want me to update this link config to have the same interleaving parameters as the L2 and the MC, as shown below. Is that correct?

    "link.addr_range_start" : memID * 256,
        "link.addr_range_end" : self.gpu_memory_capacity_inB - (256 * memID),
        "link.interleave_size" : "256B",
        "link.interleave_step" : str(self.gpu_mem_parts * 256) + "B",

@gvoskuilen
Contributor

@mkhairy I didn't realize you'd set the memory parameters for address range/interleaving. As long as you either set those or set the same parameters prefixed with 'cpulink' in the memory controller, you'll get the correct behavior. The cache set address fix is located in a branch called 'mh_cache_interleave' and is currently going through the process of getting merged into the devel branch.
