This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

# Decouple hdfs storage from global storage, support config multiple folders for hdfs (#1922)

Merged · 5 commits · Dec 21, 2018
6 changes: 6 additions & 0 deletions deployment/quick-start/services-configuration.yaml.template
@@ -58,6 +58,12 @@ cluster:
# description: Default VC.
# capacity: 100

#Uncomment the following lines if you want to customize hdfs
#hadoop-data-node:
# # Storage path for hdfs; supports a comma-delimited list of directories, e.g. /path/to/folder1,/path/to/folder2 ...
# # If left empty, cluster.common.data-path/hdfs/data will be used
# storage_path:
**Contributor:**

What's the default value?

Why not have it contain the default value, while still giving the user the flexibility to configure multiple paths? e.g.:

storage_path: /datastorage

**Member Author @mzmssg (Dec 21, 2018):**

@YanjieGao
The default value is cluster_config[common][data-path]/hdfs/data, which might not be /datastorage.
The logic is: if the admin gives a specific hdfs storage path, use it; if not, use the global storage.
Ideally, if we allowed referencing other services' config in the yaml, we could set $cluster_config.common.data-path/hdfs/data here, which would be clearer.
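
A sketch of the two cases described here (the folder paths are hypothetical examples, not defaults):

```yaml
# Case 1: the admin sets an explicit hdfs storage path (one or more folders)
hadoop-data-node:
  storage_path: /mnt/disk1/hdfs,/mnt/disk2/hdfs

# Case 2: storage_path is left empty, so the parser falls back to the
# global storage, i.e. cluster.common.data-path + "/hdfs/data"
#hadoop-data-node:
#  storage_path:
```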

**Contributor:**

A user who only views this config file can't find where cluster.common.data-path lives, which will be confusing, because not every user knows to search the github code base for the cluster_config[common][storage] cluster object model config.

It is better to tell the user clearly where the default path is.

We should assume the user may only have the context of the current config file.

**Member Author @mzmssg (Dec 21, 2018):**

Actually, I don't think users should need to look here; the default value is for advanced users or devs. In our design, the user should overwrite this value in services-configuration.yaml, which contains the necessary context.

If we introduce a hard-coded path here (even only in comments), it will couple this file with another service.

**Contributor:**

I see; I misunderstood this as the end-user yaml file. And my intent was not to hard-code anything; it was to tell the user that this config's default value can be found in this file's data-path config.

In my understanding the default value is for non-advanced users, and advanced users will know how to customize it.

**Contributor:**

Not a big problem; we can continue.



# Uncomment the following lines if you want to customize yarn-frameworklauncher
#yarn-frameworklauncher:
7 changes: 7 additions & 0 deletions examples/cluster-configuration/services-configuration.yaml
@@ -60,6 +60,13 @@
# description: Default VC.
# capacity: 100

#Uncomment the following lines if you want to customize hdfs
#hadoop-data-node:
# # Storage path for hdfs; supports a comma-delimited list of directories, e.g. /path/to/folder1,/path/to/folder2 ...
# # If left empty, cluster.common.data-path/hdfs/data will be used
# storage_path:



#Uncomment the following lines if you want to customize yarn-frameworklauncher
#yarn-frameworklauncher:
44 changes: 44 additions & 0 deletions src/hadoop-data-node/config/hadoop-data-node.md
@@ -0,0 +1,44 @@
## Hadoop data node section parser

**Contributor:**

The problem is the same as the one below: if this doc is just for config, we could later surface this info in the service-config doc or another part.

- [Default Configuration](#D_Config)
**Contributor:**

This lacks:
1. What a first-time deploying user should do, step by step. [The steps could refer to some deployment doc.]
2. What an upgrading user should do, step by step. [The steps could refer to some deployment doc.]

**Member Author @mzmssg (Dec 21, 2018):**

@YanjieGao
I think in our design this doc should follow some standard items and serve only as a per-service configuration introduction, not for maintenance or other purposes.
e.g.
https://github.com/Microsoft/pai/blob/master/src/rest-server/config/rest-server.md
https://github.com/Microsoft/pai/blob/master/src/grafana/config/grafana.md

Of course, I will update the hdfs docs after this PR.

**Contributor:**

OK.

- [How to Configure](#HT_Config)
- [Generated Configuration](#G_Config)
- [Data Table](#T_Config)

#### Default configuration <a name="D_Config"></a>

[hadoop-data-node default configuration](hadoop-data-node.yaml)

#### How to configure the hadoop-data-node section in services-configuration.yaml <a name="HT_Config"></a>

All configuration in this section is optional. If you want to customize these values, you can configure them in services-configuration.yaml.

- `storage_path` The hdfs storage folders; supports a comma-delimited list of directories.
  If it isn't specified, `cluster.common.data-path/hdfs/data` will be used (see the example below).
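
A minimal sketch of customizing this in services-configuration.yaml (the folder paths are hypothetical):

```yaml
hadoop-data-node:
  # one folder, or several separated by commas
  storage_path: /mnt/disk1/hdfs,/mnt/disk2/hdfs
```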



#### Generated Configuration <a name="G_Config"></a>

After parsing, the object model will contain a comma-delimited string in which every substring is a directory:
```yaml
storage_path: /path/to/folder1,/path/to/folder2,...
```


#### Table <a name="T_Config"></a>

<table>
<tr>
<td>Data in Configuration File</td>
<td>Data in Cluster Object Model</td>
<td>Data in Jinja2 Template</td>
<td>Data type</td>
</tr>
<tr>
<td>hadoop-data-node.storage_path</td>
<td>com["hadoop-data-node"]["storage_path"]</td>
<td>cluster_cfg["hadoop-data-node"]["storage_path"]</td>
<td>Str</td>
</tr>
</table>
1 change: 1 addition & 0 deletions src/hadoop-data-node/config/hadoop-data-node.yaml
@@ -15,3 +15,4 @@
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

storage_path:
16 changes: 14 additions & 2 deletions src/hadoop-data-node/config/hadoop_data_node.py
@@ -22,17 +22,29 @@

class HadoopDataNode:

def __init__(self, cluster_configuration, service_configuration, default_service_configuraiton):
def __init__(self, cluster_configuration, service_configuration, default_service_configuration):
self.logger = logging.getLogger(__name__)

self.cluster_configuration = cluster_configuration
self.service_configuration = self.merge_service_configuration(service_configuration,
default_service_configuration)

def merge_service_configuration(self, overwrite_srv_cfg, default_srv_cfg):
if overwrite_srv_cfg is None:
return default_srv_cfg
srv_cfg = default_srv_cfg.copy()
for k in overwrite_srv_cfg:
srv_cfg[k] = overwrite_srv_cfg[k]
return srv_cfg

def validation_pre(self):
return True, None

def run(self):
com = {}

# com["storage_path"] = self.service_configuration.get("storage_path") or \
# "{}/hdfs/data".format(self.cluster_configuration["cluster"]["common"]["data-path"])
com["storage_path"] = self.service_configuration.get("storage_path")
return com

def validation_post(self, cluster_object_model):
@@ -26,4 +26,5 @@ cp /hadoop-configuration/mapred-site.xml $HADOOP_CONF_DIR/mapred-site.xml
HOST_NAME=`hostname`
/usr/local/host-configure.py -c /host-configuration/host-configuration.yaml -f $HADOOP_CONF_DIR/hdfs-site.xml -n $HOST_NAME

sed -i "s/{HDFS_ADDRESS}/${HDFS_ADDRESS}/g" $HADOOP_CONF_DIR/core-site.xml
sed -i "s/{HDFS_ADDRESS}/${HDFS_ADDRESS}/g" $HADOOP_CONF_DIR/core-site.xml
sed -i "s#{HADOOP_DATANODE_DATA_DIR}#${HADOOP_DATANODE_DATA_DIR}#g" $HADOOP_CONF_DIR/hdfs-site.xml
@@ -39,7 +39,7 @@

<property>
<name>dfs.datanode.data.dir</name>
<value>file:///var/lib/hdfs/data</value>
<value>{HADOOP_DATANODE_DATA_DIR}</value>
<description>
This property specifies the URIs of the directories where the DataNode stores
blocks.
17 changes: 13 additions & 4 deletions src/hadoop-data-node/deploy/hadoop-data-node.yaml.template
@@ -15,6 +15,7 @@
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

{% set folders = cluster_cfg[ "hadoop-data-node" ][ "storage_path" ] or cluster_cfg["cluster"]["common"][ "data-path" ] + "/hdfs/data" %}
apiVersion: apps/v1
kind: DaemonSet
metadata:
@@ -35,8 +36,12 @@ spec:
image: {{ cluster_cfg["cluster"]["docker-registry"]["prefix"] }}hadoop-run:{{ cluster_cfg["cluster"]["docker-registry"]["tag"] }}
imagePullPolicy: Always
volumeMounts:
- mountPath: /var/lib/hdfs/data
name: hadoop-data-storage
{% set mount_points = [] %}
{% for folder in folders.split(",") %}
- mountPath: /var/lib/hdfs/data-{{ loop.index }}
name: hadoop-data-storage-{{ loop.index }}
{% set ignored = mount_points.append("file:///var/lib/hdfs/data-" + loop.index|string) %}
{% endfor %}
- mountPath: /hadoop-configuration
name: hadoop-data-node-config-volume
- mountPath: /host-configuration
@@ -64,15 +69,19 @@ spec:
value: datanode-generate-script.sh
- name: START_SERVICE
value: datanode-start-service.sh
- name: HADOOP_DATANODE_DATA_DIR
value: {{ mount_points|join(",") }}
imagePullSecrets:
- name: {{ cluster_cfg["cluster"]["docker-registry"]["secret-name"] }}
volumes:
- name: hadoop-tmp-storage
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/hadooptmp/datanode
- name: hadoop-data-storage
{% for folder in folders.split(",") %}
- name: hadoop-data-storage-{{ loop.index }}
hostPath:
path: {{ cluster_cfg["cluster"]["common"][ "data-path" ] }}/hdfs/data
path: {{ folder }}
{% endfor %}
- name: hadoop-data-node-config-volume
configMap:
name: hadoop-data-node-configuration
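
For illustration: with a hypothetical storage_path of /mnt/disk1/hdfs,/mnt/disk2/hdfs, the storage-related parts of the template above would render roughly as follows (a sketch, trimmed to the relevant fields):

```yaml
volumeMounts:
# one mount per configured folder, numbered by loop.index
- mountPath: /var/lib/hdfs/data-1
  name: hadoop-data-storage-1
- mountPath: /var/lib/hdfs/data-2
  name: hadoop-data-storage-2
env:
# comma-delimited list collected in mount_points by the template
- name: HADOOP_DATANODE_DATA_DIR
  value: file:///var/lib/hdfs/data-1,file:///var/lib/hdfs/data-2
volumes:
# one hostPath volume per configured folder
- name: hadoop-data-storage-1
  hostPath:
    path: /mnt/disk1/hdfs
- name: hadoop-data-storage-2
  hostPath:
    path: /mnt/disk2/hdfs
```

Each configured folder gets its own hostPath volume and mount, and the datanode receives them as comma-separated file:// URIs through HADOOP_DATANODE_DATA_DIR, which the start script substitutes into dfs.datanode.data.dir in hdfs-site.xml.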