GPU-like Memory Hierarchy using Merlin #1161

Closed
mkhairy opened this issue Nov 6, 2018 · 16 comments

@mkhairy

mkhairy commented Nov 6, 2018

Hello,

I am trying to use Merlin to model a non-coherent cache hierarchy, similar to recent NVIDIA GPU cache hierarchies:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

So, I connect a processor-side non-coherent L1 cache to a Merlin crossbar router on one side and directory-less memory-side L2 caches on the other side.

However, I ran into the error listed below. It seems like Merlin has to be connected to a directory interface. Is that true? Does Merlin need a coherence protocol to work?

I have attached the Python configuration file that tries to model the GPU-like memory hierarchy.
python file.zip

#0 0x00000000004caa9a in SST::Units::operator== (this=this@entry=0x20515f40,
lhs=...) at unitAlgebra.cc:278
#1 0x00000000004cce82 in SST::Units::operator!= (lhs=..., this=0x20515f40)
at unitAlgebra.h:90
#2 SST::UnitAlgebra::operator> (this=this@entry=0x20515f38, v=...)
at unitAlgebra.cc:480
#3 0x00007fffe8f3ffb5 in SST::Merlin::LinkControl::init (this=0x20515ef0,
phase=<optimized out>) at linkControl.cc:161
#4 0x00007fffe619d7b8 in SST::MemHierarchy::MemNIC::init (this=0x20514de0,
phase=1) at memNIC.cc:224
#5 0x00007fffe6135f8c in SST::MemHierarchy::Cache::init (this=0x203c37f0,
phase=1) at cacheEventProcessing.cc:456
#6 0x00000000004ba874 in SST::Simulation::initialize (this=0xb85c00)
at simulation.cc:430
#7 0x000000000046382a in start_simulation (tid=tid@entry=0, info=...,
barrier=...) at main.cc:321
#8 0x0000000000456de3 in main (argc=3, argv=0x7fffffffe278) at main.cc:679

@hughes-c
Member

hughes-c commented Nov 7, 2018

Same error with a smaller configuration:
single_core.py

@gvoskuilen
Contributor

@mkhairy There are some port mismatches in the configuration. Some ports only work over a network and some will not work over a network; in the configuration there are network ports connecting to non-network ports and non-network ports connecting to Merlin. It looks like what you're trying to do for the GPU cache hierarchy is:
GPU <--> L1 <--> Network <--> L2 <--> Memory
Is that right? Or is memory supposed to be on the network too?

@mkhairy
Author

mkhairy commented Nov 8, 2018

Yes, that is right, as you described:
GPU <--> L1 <--> Network <--> L2 <--> Memory

The L1 caches are non-coherent, so there is no directory on the L2 cache side.
Only the L2 caches are connected to the network, not the memory.

@mkhairy
Author

mkhairy commented Nov 8, 2018

Could you please share with me the right way to connect the components to Merlin to describe the above model?
Thanks!

@mkhairy
Author

mkhairy commented Nov 8, 2018

This PDF attachment contains a diagram of the model.
stt-gpu-mem.pdf

I have also attached the original Python script.
gpu-test.txt

@gvoskuilen
Contributor

Thanks! We've never tested "network <-> L2 <-> memory" so there's a small change to the cache constructor that will be needed. I'll push it soon (today/tomorrow) and let you know, along with the right ports and memNIC parameters to set. Essentially you'll have L1 connect to the GPU via the "high_network_0" port and connect to the network via the "cache" port. Then you'll have L2 connect to the network via the "cache" port and to the memory via the "low_network_0" port. The memory will stay on the "direct_link" port. Then there will be a few memNIC parameters to set since you don't have a directory in that path.
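
A minimal sketch of that wiring in an SST Python config might look like the following (the GPU component type "myGpu.Core" and its "cache_link" port are placeholders for whatever GPU model is used, the 100ps latencies are arbitrary, and the usual cache/router parameters are omitted):

import sst

gpu = sst.Component("gpu0", "myGpu.Core")            # placeholder GPU model
l1  = sst.Component("l1_0", "memHierarchy.Cache")
l2  = sst.Component("l2_0", "memHierarchy.Cache")
mem = sst.Component("mem0", "memHierarchy.MemController")
rtr = sst.Component("xbar", "merlin.hr_router")      # router params (id, num_ports, topology, ...) not shown

# GPU <--> L1 via the L1's "high_network_0" port
link_gpu_l1 = sst.Link("gpu_l1_0")
link_gpu_l1.connect( (gpu, "cache_link", "100ps"), (l1, "high_network_0", "100ps") )

# L1 <--> network via the L1's "cache" port (merlin router ports are "port0", "port1", ...)
link_l1_net = sst.Link("l1_net_0")
link_l1_net.connect( (l1, "cache", "100ps"), (rtr, "port0", "100ps") )

# network <--> L2 via the L2's "cache" port
link_net_l2 = sst.Link("net_l2_0")
link_net_l2.connect( (rtr, "port1", "100ps"), (l2, "cache", "100ps") )

# L2 <--> memory via the L2's "low_network_0" port and the memory's "direct_link" port
link_l2_mem = sst.Link("l2_mem_0")
link_l2_mem.connect( (l2, "low_network_0", "100ps"), (mem, "direct_link", "100ps") )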

@mkhairy
Author

mkhairy commented Nov 9, 2018

Thanks a lot!

@gvoskuilen
Contributor

The fix is getting merged right now (the branch is "mh_ports" if you want to grab it in the meantime). For the GPU L1s, connect them to the GPU via "high_network_0" and to the network via "cache". Then add the following parameters to the L1s:
"memNIC.group" : 1, # Think of this as the cache level on the network so L1s get '1' and L2s get '2'
"memNIC.network_bw" : "2155GB/s", # Or whatever it should be

For the GPU L2s, connect them to the network via "cache" and to the memories via "low_network_0".
Then add these parameters to the L2s:
"memNIC.group" : 2,
"memNIC.network_bw" : "2155GB/s", # Or whatever it should be

Since you have one L2 per memory, is the L2 for each memory the "home" for that memory's addresses? If so, you'll need to set each L2 memNIC's address ranges so that the L1s can route requests correctly. As long as each memory is getting a contiguous block of addresses (so mem0 gets 0-X, mem1 gets X-Y, etc.), you shouldn't need to touch the cache "slice" parameters. Just add the region parameters to the L2s:
"memNIC.addr_range_start" : X, # starting address for the associated memory
"memNIC.addr_range_end" : X + memSize - 1, # Last address for the associated memory

Hope this solves the problem! If it does, go ahead and close the issue. Also, if you want to interleave addresses across memories instead of using contiguous chunks, that is also possible but will probably require another change to the cache slice addressing policy.
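
Pulling those parameters together, a sketch for one L1/L2 pair might look like this (the "l1"/"l2" handles follow the wiring sketch above, the bandwidth is the example value, and the 0 start address and 1 GiB memory size are purely illustrative; normal cache parameters such as size and associativity are not shown):

l1.addParams({
    "memNIC.group" : 1,                        # L1s are level 1 on the network
    "memNIC.network_bw" : "2155GB/s",
})

memSize = 1024 * 1024 * 1024                   # illustrative: 1 GiB per memory
l2.addParams({
    "memNIC.group" : 2,                        # L2s are level 2 on the network
    "memNIC.network_bw" : "2155GB/s",
    "memNIC.addr_range_start" : 0,             # first address of the associated memory
    "memNIC.addr_range_end" : memSize - 1,     # last address of the associated memory
})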

@mkhairy
Author

mkhairy commented Nov 9, 2018

Thanks! It is working, and the simulation runs successfully.
Now I am wondering how to interleave data across the L2s every 256 bytes.
I found this parameter:
"memNIC.interleave_size" : "256B"
Is this the right parameter to use? Do I need to change the start and end addresses for interleaving, or can I keep them the same? When I use just the above parameter, I get this error:

[tgrogers-pc02:32089] Signal: Floating point exception (8)
[tgrogers-pc02:32089] Signal code: Integer divide-by-zero (1)
[tgrogers-pc02:32089] Failing at address: 0x7f3c0347819f

@mkhairy
Author

mkhairy commented Nov 18, 2018

Hi, could you please help me with the error above? What is the right way to do the interleaving?

@hughes-c
Member

@mkhairy You need to set the interleave_step size. In your case, this should be (numL2s * interleave_size).
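
For example (with a purely illustrative count of 16 L2 slices and the 256-byte interleave size discussed above), the step works out as follows:

num_l2s = 16                                   # illustrative number of L2 slices
interleave_bytes = 256
l2.addParams({
    "memNIC.interleave_size" : str(interleave_bytes) + "B",            # "256B"
    "memNIC.interleave_step" : str(num_l2s * interleave_bytes) + "B",  # "4096B"
})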

@gvoskuilen
Contributor

@mkhairy Also, there will be an issue with cache sets because the set mapping policy is currently hard-coded, so some sets may not get valid addresses mapped to them. I'm fixing this and will let you know when it's ready. Just to clarify: the interleaving across L2s matches the interleaving across memories, right? Make sure to set the "memNIC." address range/interleave parameters for the L2s and the "cpulink." parameters (same parameter names, different prefix) for the memories so that they match. I'll push the fix to make sure the cache array picks up the interleaving and maps addresses into sets correctly.

@mkhairy
Author

mkhairy commented Nov 20, 2018

@hughes-c Thanks for your help, it's working successfully.
@gvoskuilen Yes, the interleaving across the L2s is the same as across the memories. I set the memNIC parameters for the L2s and the corresponding parameters on the memory controllers as described above. Now I am wondering: what do you mean by setting the interleaving parameters for "cpulink"?

@gvoskuilen
Contributor

That's the memory controller's link manager that faces the L2 (towards the 'cpu'). The memory and the L2 should have the same parameters on their link managers. So set on the memory controllers:
cpulink.addr_range_start
cpulink.interleave_size
cpulink.interleave_step
Set them to the same as what you set the corresponding "memNIC" parameters to in the L2s. This way memory knows it's only storing every nth block of 256 bytes instead of everything.
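
A sketch of the matching memory-controller side, reusing the illustrative 256B-size / 4096B-step values from above (each value must mirror whatever the paired L2's "memNIC." parameter is set to):

mem.addParams({
    "cpulink.addr_range_start" : 0,            # same as the paired L2's memNIC.addr_range_start
    "cpulink.interleave_size" : "256B",        # same as the paired L2's memNIC.interleave_size
    "cpulink.interleave_step" : "4096B",       # same as the paired L2's memNIC.interleave_step
})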

@mkhairy
Author

mkhairy commented Nov 20, 2018

Hi @gvoskuilen, just to make it clear: I have the L2 cache interleaving parameters set like this:
"memNIC.addr_range_start" : memID * 256,
"memNIC.addr_range_end" : self.gpu_memory_capacity_inB - (256 * memID),
"memNIC.interleave_size" : "256B",
"memNIC.interleave_step" : str(self.gpu_mem_parts * 256) + "B",

and I have set the memory controller interleaving parameters to the same values as above. I connect the memory controller and the L2 using a link, as shown below.

link = sst.Link("l2g_mem_link_%d"%next_mem_id)
link.connect( (c0, port0, latency), (c1, port1, latency) )

So, you want me to update this link config to have the same interleaving parameters as the L2 and the MC, as shown below. Is that correct?

    "link.addr_range_start" : memID * 256,
        "link.addr_range_end" : self.gpu_memory_capacity_inB - (256 * memID),
        "link.interleave_size" : "256B",
        "link.interleave_step" : str(self.gpu_mem_parts * 256) + "B",

@gvoskuilen
Contributor

@mkhairy I didn't realize you'd set the memory parameters for address range/interleaving. As long as you either set those or set the same parameters prefixed with 'cpulink' in the memory controller, you'll get the correct behavior. The cache set address fix is located in a branch called 'mh_cache_interleave' and is currently going through the process of getting merged into the devel branch.
