ARROW-1372: [Plasma] enable HUGETLB support on Linux to improve plasma put performance #974
Conversation
cpp/src/plasma/malloc.cc (outdated):
```cpp
}
ARROW_LOG(INFO) << "mlocking pointer " << pointer << " size " << size
                << " success " << rv;
memset(pointer, 0xff, size);
```
By getting rid of this memset, writing in my benchmarks takes 2x to 3x as long (that's the only performance difference I can measure from this commit); if there is no other way to achieve the same effect we will need to bring it back.
@atumanov Yeah, I think the best way to deal with the case where -h is specified and -d is not is to give an error; there is no canonical hugetlbfs directory on Linux in the way that there is /dev/shm (systemd seems to use /dev/hugepages and Debian seems to prefer /hugepages).
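A minimal sketch of that check, with hypothetical variable names for the parsed -h and -d options (the store's actual option-handling code may differ):

```cpp
// hugepages_enabled / plasma_directory are hypothetical names for the values
// parsed from -h and -d.
if (hugepages_enabled && plasma_directory.empty()) {
  // There is no canonical hugetlbfs mount point on Linux, so we cannot pick a
  // sensible default; the user has to supply one with -d.
  ARROW_LOG(FATAL) << "please specify -d (the hugetlbfs mount point) when "
                   << "huge pages are enabled with -h";
}
```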
cpp/src/plasma/malloc.cc (outdated):
```diff
@@ -40,7 +42,7 @@ int fake_munmap(void*, int64_t);
 #define USE_DL_PREFIX
 #define HAVE_MORECORE 0
 #define DEFAULT_MMAP_THRESHOLD MAX_SIZE_T
-#define DEFAULT_GRANULARITY ((size_t)128U * 1024U)
+#define DEFAULT_GRANULARITY ((size_t) 1024U * 1024U * 1024U)  // 1GB
```
Will this behave reasonably on machines with less than 1GB memory?
My plan is to change it back to the default if huge pages are not active (unfortunately this can only be done at compile time, since it is a macro and we can't set it at runtime).
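A sketch of what that compile-time switch could look like, consistent with the ARROW_PLASMA_HUGEPAGES define that appears further down in this thread:

```cpp
#ifdef ARROW_PLASMA_HUGEPAGES
// Large granularity when compiled with huge page support.
#define DEFAULT_GRANULARITY ((size_t)1024U * 1024U * 1024U)  // 1GB
#else
// The previous default when huge pages are not compiled in.
#define DEFAULT_GRANULARITY ((size_t)128U * 1024U)  // 128KB
#endif
```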
As part of a separate PR we should implement reserving and mmapping as much of the configured memory capacity as possible on startup. This would eliminate the need for DEFAULT_GRANULARITY altogether.
In fact, one improvement I've been thinking about is exponential backoff on the default granularity plus an mmap retry on mmap failure. As the files we attempt to mmap double in size, we can easily get stuck with only 50% of the memory capacity mmapped and unable to mmap the remaining 50% (due to ENOMEM).
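A rough sketch of that retry loop (kMinimumMapSize is a hypothetical lower bound, and the caller is assumed to create and size the backing file accordingly before each attempt):

```cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstddef>

// Retry mmap with half the size whenever we hit ENOMEM instead of failing.
void* mmap_with_backoff(int fd, size_t size) {
  constexpr size_t kMinimumMapSize = (size_t)1024 * 1024;  // hypothetical floor
  while (size >= kMinimumMapSize) {
    void* pointer = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pointer != MAP_FAILED) {
      return pointer;  // success at the current granularity
    }
    if (errno != ENOMEM) {
      return MAP_FAILED;  // a different error; don't retry
    }
    size /= 2;  // back off exponentially and try a smaller mapping
  }
  return MAP_FAILED;
}
```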
cpp/src/plasma/plasma.h (outdated):
```diff
@@ -129,6 +130,8 @@ struct PlasmaStoreInfo {
   /// The amount of memory (in bytes) that we allow to be allocated in the
   /// store.
   int64_t memory_capacity;
+  bool hugetlb_enabled;
+  std::string directory;
```
These two fields should be documented.
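One possible wording for these, as a sketch (the author may phrase it differently):

```cpp
/// True if the store was started with huge page (hugetlbfs) support enabled.
bool hugetlb_enabled;
/// The directory in which the memory-mapped files backing the store are
/// created. When huge pages are enabled, this must be a hugetlbfs mount point.
std::string directory;
```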
python/doc/source/plasma.rst (outdated):
```
sudo mkdir -p /mnt/hugepages
sudo mount -t hugetlbfs -o uid=ubuntu -o gid=adm none /mnt/hugepages
```
`ubuntu` -> `$USER`, and `gid=adm` might need to be changed also, right?
There's some flexibility in setting up huge pages. We've been talking about creating an AMI with an entry in /etc/fstab, for instance. These instructions are just one example. I can make it more generic if needed, for instance:

```
gid=`id -g`
uid=`id -u`
sudo mount -t hugetlbfs -o uid=$uid -o gid=$gid none /mnt/hugepages
sudo bash -c "echo $gid > /proc/sys/vm/hugetlb_shm_group"
sudo bash -c "echo 2048 > /proc/sys/vm/nr_hugepages"
```
These are the benchmarks I'm running:
Here are the results (timing is in seconds; no error means it is very small).
Put performance is creating + writing. The time spent in creating for MLP is the latency from memsetting, locking, or populating pages in the store, and the error is the influence of the size of the memory-mapped file (we increase the size by 2x every time we map a new file).
Just to elaborate on Philipp's microbenchmark results: the overall performance improvement with H + {M | L | P} on write performance is ~5x! (both for 20M and 200M objects) This is consistent with
BTW, you may also try to increase the number of threads from 1 to 8 as another dimension. I think the default is 1.
Yeah, good point! The default threadpool size is indeed 1, see arrow/cpp/src/arrow/io/memory.cc, line 122 (commit c9805d6).
Thanks for the improvements, that looks great! I'll go ahead and fix the linting and the other points we collected above, and then we can get this merged.
Based on the observations so far, here is another set of benchmarks. This time it measures the quantity that's most important for performance in practice, namely the throughput in the steady state (when all the memory-mapped files have been created):

```python
import numpy as np
import pyarrow as pa
import pyarrow.plasma as plasma
import time

client = plasma.connect("/tmp/plasma", "", 0)

create_times = []
write_times = []

data = np.random.randn(200000000)
tensor = pa.Tensor.from_numpy(data)

for i in range(200):
    print("iteration", i)
    object_id = plasma.ObjectID(np.random.bytes(20))
    a = time.time()
    buf = client.create(object_id, pa.get_tensor_size(tensor))
    if i >= 20:
        create_times.append(time.time() - a)
    stream = pa.FixedSizeBufferOutputStream(buf)
    a = time.time()
    pa.write_tensor(tensor, stream)
    if i >= 20:
        write_times.append(time.time() - a)
    client.seal(object_id)

print("create mean is", np.mean(create_times))
print("create stddev is", np.std(create_times))
print("write mean is", np.mean(write_times))
print("write stddev is", np.std(write_times))
```
I'll incorporate Robert's comment and put DEFAULT_GRANULARITY back to 128K on Mac and non-hugepage Linux, and remove the warmup (it only makes a difference if DEFAULT_GRANULARITY is large) as well as the memset and mlock in the store, since these don't make a difference in the steady state. This makes the behavior of the system similar to before the PR if huge pages are deactivated. If desired, we can create a follow-up PR that allocates all the memory in advance, as Alexey suggests.
With huge pages we now get the following results for copying 1GB of data (with different numbers of threads and blocksizes):

The best setting corresponds to 24 GB/s. The benchmark script is:

```python
import numpy as np
import pyarrow as pa
import pyarrow.plasma as plasma
import time

client = plasma.connect("/tmp/plasma", "", 0)

data = np.random.randn(1024 * 1024 * 1024 // 8)
tensor = pa.Tensor.from_numpy(data)

for num_threads in range(1, 16):
    print("\nnum_threads =", num_threads, end=": ")
    for blocksize in [1, 2, 4, 8, 16, 32, 64, 128]:
        print(blocksize, end=" ")
        times = []
        for i in range(10):
            object_id = plasma.ObjectID(np.random.bytes(20))
            buf = client.create(object_id, pa.get_tensor_size(tensor))
            stream = pa.FixedSizeBufferOutputStream(buf)
            stream.set_memcopy_threads(num_threads)
            stream.set_memcopy_blocksize(blocksize)
            a = time.time()
            pa.write_tensor(tensor, stream)
            times.append(time.time() - a)
            client.seal(object_id)
        print("%.4f" % np.mean(times), end="; ")
```
cpp/src/plasma/CMakeLists.txt (outdated):
```
@@ -34,6 +34,14 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=20

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wno-conversion")

option(ARROW_PLASMA_HUGEPAGES
```
Could we make this a command-line option instead of a compile-time one?
Ah, this only enables compilation of the related code.
Got rid of this now; it's not really needed.
cpp/src/plasma/malloc.cc (outdated):
```cpp
#define DEFAULT_GRANULARITY ((size_t)128U * 1024U)

#ifndef ARROW_PLASMA_HUGEPAGES
#define DEFAULT_GRANULARITY ((size_t)128U * 1024U)  // 128KB
```
Would making this a configuration option that can be changed upon initialization of Plasma be expensive?
Actually looking at this again, I think it can be done without too much effort, let me try.
The code is refactored and this is set at runtime now. Let's not expose it to the user via a configuration option yet; it doesn't seem that the benefits would justify the additional complexity. I made it 1GB in the huge page case; it seems unlikely that anybody would use huge pages with less than 1GB of memory. Happy to revisit this in the future if desired.
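A minimal sketch of the runtime selection described here, using hypothetical names (the actual refactor in the PR may be structured differently):

```cpp
#include <cstdint>

// Choose the mmap granularity at store startup instead of via a macro.
int64_t ChooseGranularity(bool hugepages_enabled) {
  // 1GB when huge pages are in use; the old 128KB default otherwise.
  return hugepages_enabled ? (int64_t)1024 * 1024 * 1024 : (int64_t)128 * 1024;
}
```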
cpp/src/plasma/malloc.cc (outdated):
```
@@ -122,8 +130,18 @@ void* fake_mmap(size_t size) {

int fd = create_buffer(size);
ARROW_CHECK(fd >= 0) << "Failed to create buffer during mmap";
#ifdef __linux__
void* pointer =
    mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0);
```
Document why we are using MAP_POPULATE and what it does.
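A possible wording for that comment, as a sketch:

```cpp
// MAP_POPULATE tells the kernel to pre-fault (populate) the pages of the
// mapping at mmap time instead of lazily on first access. This moves the
// page-fault cost out of the first client write and into fake_mmap, making
// steady-state put latency lower and more predictable.
void* pointer =
    mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0);
```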
cpp/src/plasma/malloc.cc (outdated):
```cpp
    ARROW_LOG(FATAL) << "ftruncate error";
    return -1;
if (!plasma::plasma_config->hugepages_enabled) {
  if (ftruncate(fd, (off_t)size) != 0) {
```
Document what ftruncate does and why we are calling it.
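A possible wording, as a sketch; the precise reason the call is restricted to the non-hugepage path is best stated by the author:

```cpp
// ftruncate grows the freshly created, empty file to `size` bytes so that the
// mmap call below has that much backing storage to map.
if (!plasma::plasma_config->hugepages_enabled) {
  if (ftruncate(fd, (off_t)size) != 0) {
    ARROW_LOG(FATAL) << "ftruncate error";
    return -1;
  }
}
```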
cpp/src/plasma/plasma.h (outdated):
```diff
@@ -129,6 +130,11 @@ struct PlasmaStoreInfo {
   /// The amount of memory (in bytes) that we allow to be allocated in the
   /// store.
   int64_t memory_capacity;
+  /// Boolean flag indicating whether to start the object store with hugepages
+  /// support enabled.
+  bool hugepages_enabled;
```
Consider also explaining why huge pages improve performance.
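For example, the comment could be extended along these lines (a sketch; actual huge page sizes depend on the system configuration):

```cpp
/// Boolean flag indicating whether to start the object store with hugepages
/// support enabled. Huge pages (e.g. 2MB or 1GB rather than the usual 4KB)
/// reduce the number of page faults and TLB misses incurred while writing
/// large objects into the store, which significantly speeds up plasma put.
bool hugepages_enabled;
```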
cpp/src/plasma/malloc.cc (outdated):
```cpp
  return -1;
}
if (unlink(file_name) != 0) {
  ARROW_LOG(FATAL) << "unlink error";
if (unlink(&file_name[0]) != 0) {
```
Not related to this PR, but we should really document why we are doing this unlink call.
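A possible wording, as a sketch of my understanding (worth double-checking against how the store actually shares the file descriptors with clients):

```cpp
// Unlink the file right after creating and mmapping it. The existing mapping
// and the file descriptors that are passed around keep the memory alive, but
// removing the directory entry guarantees the backing file is reclaimed
// automatically once the store and its clients release it, even on a crash.
if (unlink(&file_name[0]) != 0) {
  ARROW_LOG(FATAL) << "unlink error";
}
```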
+1. The AppVeyor build failure is due to ARROW-1375.
This PR makes it possible to use the Plasma object store backed by a pre-mounted hugetlbfs.