-
Notifications
You must be signed in to change notification settings - Fork 526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
curvefs: support curvebs as data backend #1207
Conversation
48cbede
to
f6202d8
Compare
recheck |
2 similar comments
recheck |
recheck |
4172780
to
85ec688
Compare
@@ -51,6 +51,7 @@ spaceserver.rpcTimeoutMs=1000 | |||
|
|||
#### bdev | |||
bdev.confpath=/etc/curve/client.conf | |||
bdev.threadnum=10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this threadnum used for? You can add some notes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't you say that you have used other learning interfaces before?
volumeBlockSize=4096 | ||
volumeName=volume | ||
volumeUser=user | ||
volumePassword=password | ||
volumeBlockGroupSize=134217728 | ||
# support |AtStart| and |AtEnd| | ||
volumeBitmapLocation=AtStart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each bolckGroup has a bitmap and the blockGroupSize is 128MiB and 4MiB is signed by one bit of bitmap. Only 4 bytes of bitmap on each blockGroup? 4B + 128MiB ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimum allocate granularity(block size) is 4KiB, so we need a 4KiB bitmap and each bit represents whether a block is used or not.
And 4MiB is an optimization for continuity of file data-block allocation.
you can see #959
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change from AtStart to AtEnd?
curvefs/proto/metaserver.proto
Outdated
@@ -150,6 +150,7 @@ enum FsFileType { | |||
TYPE_FILE = 2; | |||
TYPE_SYM_LINK = 3; | |||
TYPE_S3 = 4; | |||
TYPE_VOLUME = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the meaning of FsFileType here? Kind of backend S3 or Volume ? The file type of filesystem like regular file , dir , sym_link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are TYPE_S3 and TYPE_VOLUME the same level as the other three?
curvefs/proto/metaserver.proto
Outdated
@@ -194,7 +195,7 @@ message Inode { | |||
required FsFileType type = 14; | |||
optional string symlink = 15; // TYPE_SYM_LINK only | |||
optional uint64 rdev = 16; | |||
optional VolumeExtentList volumeExtentList = 17; // TYPE_FILE only | |||
map<uint64, VolumeExtentList> volumeExtentMap = 17; // TYPE_FILE only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaning of key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
@@ -67,15 +68,19 @@ class InodeWrapper : public std::enable_shared_from_this<InodeWrapper> { | |||
status_(InodeStatus::Normal), | |||
metaClient_(metaClient), | |||
openCount_(0), | |||
dirty_(false) {} | |||
dirty_(false) { | |||
BuildExtentCache(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will rebuild each time cost much when 'VolumeExtentList' is bigger in inode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, so we should separate inode's attributes and s3chunkinfo/volumeextentlist later.
// 1. open a file with O_RDWR | ||
// 2. write "hello, world" | ||
// 3. unlink this file | ||
// 4. pread from 0 to 13, expected "hello, world" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why expected "hello, world" after unlink the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the short answer is, unlink is delayed until all processes close the file.
you can see unlink's descriptions for more information.
curvefs/src/mds/space/usage.cpp
Outdated
availableSize_(availableSize), | ||
totalGroups_(size / blockGroupSize), | ||
allocatedGroups_(allocatedGroups.size()) { | ||
for (auto& g : allocatedGroups) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto& g
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
@@ -43,7 +43,7 @@ DEFINE_bool(detail, false, "show more infomation"); | |||
DEFINE_string(partitionId, "1,2,3", "partition id"); | |||
DEFINE_string(metaserverId, "1,2,3", "metaserver id"); | |||
DEFINE_bool(noconfirm, false, "execute command without confirmation"); | |||
DEFINE_uint64(blockSize, 1048576, "block size"); | |||
DEFINE_uint64(blockSize, 0, "volume size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set default value same as the item in conf '4096' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
6f2e72c
to
fb221fd
Compare
@@ -43,7 +43,7 @@ DEFINE_bool(detail, false, "show more infomation"); | |||
DEFINE_string(partitionId, "1,2,3", "partition id"); | |||
DEFINE_string(metaserverId, "1,2,3", "metaserver id"); | |||
DEFINE_bool(noconfirm, false, "execute command without confirmation"); | |||
DEFINE_uint64(blockSize, 1048576, "block size"); | |||
DEFINE_uint64(blockSize, 4096, "volume block size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following ‘volumeBlockSize’ is the blocksize of the volume, not this ‘blockszie’.
The interface of createfs has a blocksize, and in fsdetail, s3 and volume have their own blockszie.
@@ -312,7 +312,7 @@ int Metaserver::PersistDataToLocalFile(std::shared_ptr<LocalFileSystem> fs, | |||
} | |||
|
|||
int writtenCount = fs->Write(fd, data.c_str(), 0, data.size()); | |||
if (writtenCount < data.size()) { | |||
if (writtenCount < 0 || static_cast<size_t>(writtenCount) != data.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why writtenCount < data.size() does not satisfy this requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, according to the rule, when a signed integral compare with an unsigned integral, the signed value will be converted to unsigned.
for example, data's size is 4096, and fs->Write
returns -1, previous if (writtenCount < data.size)
will be false.
if (request.has_volumeextentlist()) { | ||
VLOG(1) << "update inode has extent"; | ||
old->mutable_volumeextentlist()->CopyFrom(request.volumeextentlist()); | ||
// FIXME(wuhanqing): turncate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why truncate cannot be processed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncate needs release space, and we don't implement yet
@@ -51,6 +51,7 @@ spaceserver.rpcTimeoutMs=1000 | |||
|
|||
#### bdev | |||
bdev.confpath=/etc/curve/client.conf | |||
bdev.threadnum=10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't you say that you have used other learning interfaces before?
volumeBlockSize=4096 | ||
volumeName=volume | ||
volumeUser=user | ||
volumePassword=password | ||
volumeBlockGroupSize=134217728 | ||
# support |AtStart| and |AtEnd| | ||
volumeBitmapLocation=AtStart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change from AtStart to AtEnd?
required uint64 blockGroupSize = 6; | ||
required BitmapLocation bitmapLocation = 7; | ||
|
||
// TODO(all): maybe we need curvebs cluster's mds ip and port in here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use the id to mark the curvebs cluster. The correspondence between the id and the curvebs cluster is recorded in the curvefs-mds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't you say that you have used other learning interfaces before?
it's because, for unaligned write operation, RMW is performed, so there might be a data consistent problem, like #985. and if this really happens, we should redesign the implementation here. so I just add a dedicated thread pool now.
Can we change from AtStart to AtEnd?
yep, both are supported, but they have no much difference.
Maybe we can use the id to mark the curvebs cluster. The correspondence between the id and the curvebs cluster is recorded in the curvefs-mds.
currently, it's only affects how we extend the volume, now we can only add an extra client.conf in mds.
and if we are going to support volume in different curvebs clusters, we may need ip/port or id.
@@ -306,5 +280,13 @@ void FuseVolumeClient::FlushData() { | |||
// TODO(xuchaojie) : flush volume data | |||
} | |||
|
|||
void FuseVolumeClient::SetSpaceManagerForTesting(SpaceManager *manager) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement FuseOpFlush. If cto, persist metadata and data, otherwise only persist data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
curvefs/src/client/inode_wrapper.cpp
Outdated
@@ -350,5 +396,34 @@ InodeWrapper::UpdateInodeStatus(InodeOpenStatusChange statusChange) { | |||
return CURVEFS_ERROR::OK; | |||
} | |||
|
|||
// TODO(wuhanqing): update extents incrementally | |||
CURVEFS_ERROR InodeWrapper::SyncVolumeExtents() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an interface for SyncVolumeExtents, why SyncFullInode doesn't just update other parts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SyncVolumeExtents is unused and is been deleted.
But, I think this class is going to be unmaintainable, it has different responsibilities, and somewhere use this interface and somewhere else, but may do the same thing. And, it has to know the data backend.
@@ -148,6 +148,60 @@ void MDSBaseClient::AllocS3ChunkId(uint32_t fsId, | |||
curvefs::mds::MdsService_Stub stub(channel); | |||
stub.AllocateS3Chunk(cntl, &request, response, nullptr); | |||
} | |||
|
|||
// TODO(all): do we really need pass `fsId` all the time? | |||
// each curve-fuse process only mount one filesystem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fsid is used to mark curve-fuse. The block group is uniquely identified by the owner,
the representation in the s3 scene is fsid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we can store fsid as a global variable, so that, we can just use this variable, and don't have to pass it as a parameter all the time and everywhere.
<< ", len: " << len << ", block write requests: " << writes; | ||
|
||
ssize_t nr = blockDeviceClient_->Writev(writes); | ||
if (nr < 0 /*|| nr != len*/) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why comment out the latter part: nr != len
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for unaligned requests, the return value will be bigger than the origin read/write length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mutable curve::common::RWLock lock_; | ||
|
||
// key is offset | ||
std::unordered_map<uint64_t, std::map<uint64_t, PExtent>> extents_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extents_ needs to be limited in size or length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, it's the latest extents of an inode.
a9a9f52
to
a155b56
Compare
WORKSPACE
Outdated
@@ -20,7 +20,7 @@ load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository") | |||
|
|||
git_repository( | |||
name = "com_github_baidu_braft", | |||
remote = "https://github.com/baidu/braft", | |||
remote = "https://gitee.com/baidu/braft", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to modify here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
|
||
LockGuard lk(mtx_); | ||
for (auto& group : allocatedGroups_) { | ||
auto err = ClearBlockGroup(group.second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mark as TODO here: etcdclient need to implement remove by prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
return MakeNode(req, parent, name, mode, FsFileType::TYPE_FILE, rdev, e); | ||
VLOG(3) << "FuseOpMkNod, parent: " << parent << ", name: " << name | ||
<< ", mode: " << mode << ", rdev: " << rdev; | ||
return MakeNode(req, parent, name, mode, FsFileType::TYPE_VOLUME, rdev, e); | ||
} | ||
|
||
CURVEFS_ERROR FuseVolumeClient::FuseOpFsync(fuse_req_t req, fuse_ino_t ino, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
<< ", len: " << len << ", block write requests: " << writes; | ||
|
||
ssize_t nr = blockDeviceClient_->Writev(writes); | ||
if (nr < 0 /*|| nr != len*/) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5493b25
to
386fa73
Compare
@@ -298,13 +274,23 @@ CURVEFS_ERROR FuseVolumeClient::Truncate(Inode *inode, uint64_t length) { | |||
|
|||
CURVEFS_ERROR FuseVolumeClient::FuseOpFlush(fuse_req_t req, fuse_ino_t ino, | |||
struct fuse_file_info *fi) { | |||
// Todo: flush data and note cto | |||
return CURVEFS_ERROR::OK; | |||
VLOG(9) << "FuseOpFlush, ino: " << ino; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curvefs::client::common::FLAGS_enableCto = true, FuseOpFlush need to sync inode for cto(close will call flush internal)
386fa73
to
846ec93
Compare
What problem does this PR solve?
Issue Number: close #1032
Problem Summary:
What is changed and how it works?
What's Changed:
How it Works:
Side effects(Breaking backward compatibility? Performance regression?):
Check List