DRILL-8011: Add Dropbox File System to Drill (#2337)
cgivre authored Oct 20, 2021
1 parent c620af7 commit f4ea90c
Showing 10 changed files with 693 additions and 0 deletions.
59 changes: 59 additions & 0 deletions docs/dev/Dropbox.md
@@ -0,0 +1,59 @@
# Dropbox and Drill
As of Drill 1.20.0, it is possible to connect Drill to a Dropbox account and query files stored there. Performance will be much better if the files are stored
locally; however, if your data is located in Dropbox, Drill makes it easy to explore that data.

## Creating an API Token
The first step to enabling Drill to query Dropbox is creating an API token.
1. Navigate to https://www.dropbox.com/developers/apps/create
2. Choose `Scoped Access` under Choose an API.
3. Depending on the access you want to grant, select either full Dropbox access or access limited to a particular folder.
4. In the permissions tab, make sure all the permissions associated with reading data are enabled.

Once you've done that and clicked submit, you'll see a section in your newly created Dropbox app called `Generated Access Token`. Copy the value shown there; this is the
token you will use in your Drill configuration.

## Configuring Drill
Once you've created a Dropbox access token, you are ready to configure Drill to query Dropbox. To create a Dropbox connection, navigate to the Storage tab in Drill's UI,
click `Create New Storage Plugin`, and add the items below:

```json
{
  "type": "file",
  "connection": "dropbox:///",
  "config": {
    "dropboxAccessToken": "<your access token here>"
  },
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  }
}
```
Paste your access token into the `dropboxAccessToken` field, and you should then be able to query Dropbox. Drill treats Dropbox like any other file system, so the
instructions for the file system storage plugin (https://drill.apache.org/docs/file-system-storage-plugin/) and for workspaces (https://drill.apache.org/docs/workspaces/)
on configuring a workspace and adding format plugins apply unchanged.
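
Once the plugin is saved, files are addressed in SQL by plugin name, workspace, and a backtick-quoted path. The sketch below only assembles such a query string and does not connect to Drill. The plugin name `dropbox` and the helper `DropboxQuery.selectAll` are assumptions made for illustration; the workspace `root` comes from the configuration above, and the path reuses a test file mentioned elsewhere in this document.

```java
// Hypothetical helper, not part of Drill: builds a SELECT for a file behind a Dropbox-backed plugin.
final class DropboxQuery {

  private DropboxQuery() {
  }

  /**
   * @param plugin    storage plugin name as saved in the Drill UI (assumed here: "dropbox")
   * @param workspace workspace name, e.g. "root" from the configuration above
   * @param path      Dropbox path of the file, e.g. "/csv/hdf-test.csvh"
   * @param limit     maximum number of rows to return
   */
  static String selectAll(String plugin, String workspace, String path, int limit) {
    if (path.indexOf('`') >= 0) {
      // Backticks delimit the path in the query; this sketch rejects them instead of escaping.
      throw new IllegalArgumentException("path must not contain a backtick: " + path);
    }
    if (limit < 0) {
      throw new IllegalArgumentException("limit must not be negative: " + limit);
    }
    return "SELECT * FROM " + plugin + "." + workspace + ".`" + path + "` LIMIT " + limit;
  }

  public static void main(String[] args) {
    System.out.println(selectAll("dropbox", "root", "/csv/hdf-test.csvh", 5));
  }
}
```

The resulting string can be pasted into Drill's query UI or sent through any Drill client; adjust the plugin name to whatever you entered when creating the storage plugin.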

### Securing Dropbox Credentials
As with any other storage plugin, you have several options for storing the credentials. See [Drill Credentials Provider](./PluginCredentialsProvider.md) for more
information about how to store your credentials securely in Drill.

## Running the Unit Tests
Unfortunately, running the unit tests requires an external API token, so they must be run manually. To run the unit tests:

1. Get your Dropbox API token as explained above and paste it into the `ACCESS_TOKEN` variable of the unit test.
2. In your Dropbox account, create a folder called `csv` and upload the file `hdf-test.csvh` into that folder.
3. Upload the file `http-pcap.json` to the root directory of your Dropbox account.
4. In the `testListFiles` test, update the expected modification dates to match your uploaded files.
5. Run the tests.

### Test Files
Test files can be found in the `java-exec/src/test/resources/dropboxTestFiles`
folder. Copy these files into your Dropbox account, preserving the folder structure.

## Limitations
1. It is not possible to save files to Dropbox from Drill, thus CTAS queries will fail.
2. Dropbox does not expose directory metadata, so it is not possible to obtain the directory size, modification date or access dates.
3. Dropbox does not maintain the last access date as distinct from the modification date of files.
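
In practice, limitation 2 means that folders show up in listings with a size and modification time of zero. The sketch below illustrates that mapping with a simplified stand-in for Hadoop's `FileStatus`; the class `EntryStatus`, its factory methods, and the sample size and timestamp are all made up for illustration and are not part of Drill.

```java
// Simplified stand-in for Hadoop's FileStatus; every name here is illustrative.
final class EntryStatus {
  final String path;
  final boolean directory;
  final long length;            // size in bytes; always 0 for folders
  final long modificationTime;  // epoch milliseconds; always 0 for folders

  private EntryStatus(String path, boolean directory, long length, long modificationTime) {
    this.path = path;
    this.directory = directory;
    this.length = length;
    this.modificationTime = modificationTime;
  }

  /** Folders carry no size or timestamp, so both fields are zeroed. */
  static EntryStatus forFolder(String path) {
    return new EntryStatus(path, true, 0L, 0L);
  }

  /** Files carry the size and client-modified time reported for them. */
  static EntryStatus forFile(String path, long length, long modifiedMillis) {
    return new EntryStatus(path, false, length, modifiedMillis);
  }

  public static void main(String[] args) {
    EntryStatus folder = forFolder("/csv");
    // Size and timestamp below are invented sample values.
    EntryStatus file = forFile("/csv/hdf-test.csvh", 1024L, 1634688000000L);
    System.out.println(folder.path + " size=" + folder.length + " modified=" + folder.modificationTime);
    System.out.println(file.path + " size=" + file.length + " modified=" + file.modificationTime);
  }
}
```

Queries that filter or sort on directory size or modification time should therefore not rely on those fields for folders.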
5 changes: 5 additions & 0 deletions exec/java-exec/pom.xml
@@ -94,6 +94,11 @@
<artifactId>asm-util</artifactId>
<version>${asm.version}</version>
</dependency>
<dependency>
<groupId>com.dropbox.core</groupId>
<artifactId>dropbox-core-sdk</artifactId>
<version>4.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.drill.contrib.data</groupId>
<artifactId>tpch-sample-data</artifactId>
@@ -20,6 +20,8 @@
import org.apache.drill.common.logical.StoragePluginConfig;
import org.apache.drill.exec.store.PluginHandle.PluginType;
import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
* Defines a storage connector: a storage plugin config along with the
@@ -28,6 +30,8 @@
*/
public class ConnectorHandle {

private static final Logger logger = LoggerFactory.getLogger(ConnectorHandle.class);

private final ConnectorLocator locator;
private final Class<? extends StoragePluginConfig> configClass;

@@ -0,0 +1,280 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.drill.exec.store.dfs;

import com.dropbox.core.DbxException;
import com.dropbox.core.DbxRequestConfig;
import com.dropbox.core.v2.DbxClientV2;
import com.dropbox.core.v2.files.FileMetadata;
import com.dropbox.core.v2.files.FolderMetadata;
import com.dropbox.core.v2.files.ListFolderResult;
import com.dropbox.core.v2.files.Metadata;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PositionedReadable;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DropboxFileSystem extends FileSystem {
private static final Logger logger = LoggerFactory.getLogger(DropboxFileSystem.class);

private static final String ERROR_MSG = "Dropbox is read only.";
private Path workingDirectory;
private DbxClientV2 client;
private FileStatus[] fileStatuses;
private final Map<String,FileStatus> fileStatusCache = new HashMap<>();

@Override
public URI getUri() {
try {
return new URI("dropbox:///");
} catch (URISyntaxException e) {
throw new RuntimeException(e);
}
}

@Override
public FSDataInputStream open(Path path, int bufferSize) throws IOException {
FSDataInputStream fsDataInputStream;
String filename = getFileName(path);
client = getClient();
ByteArrayOutputStream out = new ByteArrayOutputStream();
try {
client.files().download(filename).download(out);
fsDataInputStream = new FSDataInputStream(new SeekableByteArrayInputStream(out.toByteArray()));
} catch (DbxException e) {
// Keep the original exception as the cause so the Dropbox stack trace is not lost
throw new IOException(e.getMessage(), e);
}
return fsDataInputStream;
}

@Override
public FSDataOutputStream create(Path f,
FsPermission permission,
boolean overwrite,
int bufferSize,
short replication,
long blockSize,
Progressable progress) throws IOException {
throw new IOException(ERROR_MSG);
}

@Override
public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException {
throw new IOException(ERROR_MSG);
}

@Override
public boolean rename(Path src, Path dst) throws IOException {
return false;
}

@Override
public boolean delete(Path f, boolean recursive) throws IOException {
throw new IOException(ERROR_MSG);
}

@Override
public FileStatus[] listStatus(Path path) throws IOException {
client = getClient();
List<FileStatus> fileStatusList = new ArrayList<>();

// Get files and folder metadata from Dropbox root directory
try {
ListFolderResult result = client.files().listFolder("");
while (true) {
for (Metadata metadata : result.getEntries()) {
fileStatusList.add(getFileInformation(metadata));
}
if (!result.getHasMore()) {
break;
}
result = client.files().listFolderContinue(result.getCursor());
}
} catch (DbxException e) {
// Keep the original exception as the cause so the Dropbox stack trace is not lost
throw new IOException(e.getMessage(), e);
}

// Convert to Array
fileStatuses = fileStatusList.toArray(new FileStatus[0]);

return fileStatuses;
}

@Override
public void setWorkingDirectory(Path new_dir) {
logger.debug("Setting working directory to: {}", new_dir.getName());
workingDirectory = new_dir;
}

@Override
public Path getWorkingDirectory() {
return workingDirectory;
}

@Override
public boolean mkdirs(Path f, FsPermission permission) throws IOException {
throw new IOException(ERROR_MSG);
}

@Override
public FileStatus getFileStatus(Path path) throws IOException {
String filePath = Path.getPathWithoutSchemeAndAuthority(path).toString();
/*
* Dropbox does not allow metadata calls on the root directory
*/
if (filePath.equalsIgnoreCase("/")) {
return new FileStatus(0, true, 1, 0, 0, new Path("/"));
}
client = getClient();
try {
Metadata metadata = client.files().getMetadata(filePath);
return getFileInformation(metadata);
} catch (Exception e) {
throw new IOException("Error accessing file " + filePath + "\n" + e.getMessage(), e);
}
}

private FileStatus getFileInformation(Metadata metadata) {
if (fileStatusCache.containsKey(metadata.getPathLower())){
return fileStatusCache.get(metadata.getPathLower());
}

FileStatus result;
if (isDirectory(metadata)) {
// Note: At the time of implementation, DropBox does not provide an efficient way of
// getting the size and/or modification times for folders.
result = new FileStatus(0, true, 1, 0, 0, new Path(metadata.getPathLower()));
} else {
FileMetadata fileMetadata = (FileMetadata) metadata;
result = new FileStatus(fileMetadata.getSize(), false, 1, 0, fileMetadata.getClientModified().getTime(), new Path(metadata.getPathLower()));
}

fileStatusCache.put(metadata.getPathLower(), result);
return result;
}

private DbxClientV2 getClient() {
if (this.client != null) {
return client;
}

// read preferred client identifier from config or use "Apache/Drill"
String clientIdentifier = this.getConf().get("clientIdentifier", "Apache/Drill");
logger.info("Creating dropbox client with client identifier: {}", clientIdentifier);
DbxRequestConfig config = DbxRequestConfig.newBuilder(clientIdentifier).build();

// read access token from config or credentials provider
logger.info("Reading dropbox access token from configuration or credentials provider");
String accessToken = this.getConf().get("dropboxAccessToken", "");

this.client = new DbxClientV2(config, accessToken);
return this.client;
}

private boolean isDirectory(Metadata metadata) {
return metadata instanceof FolderMetadata;
}

private boolean isFile(Metadata metadata) {
return metadata instanceof FileMetadata;
}

private String getFileName(Path path){
return path.toUri().getPath();
}

static class SeekableByteArrayInputStream extends ByteArrayInputStream implements Seekable, PositionedReadable {

public SeekableByteArrayInputStream(byte[] buf)
{
super(buf);
}
@Override
public long getPos() throws IOException{
return pos;
}

@Override
public void seek(long pos) throws IOException {
if (mark != 0) {
throw new IllegalStateException();
}

reset();
long skipped = skip(pos);

if (skipped != pos) {
throw new IOException();
}
}

@Override
public boolean seekToNewSource(long targetPos) throws IOException {
return false;
}

@Override
public int read(long position, byte[] buffer, int offset, int length) throws IOException {

if (position >= buf.length) {
throw new IllegalArgumentException();
}
if (position + length > buf.length) {
throw new IllegalArgumentException();
}
if (length > buffer.length) {
throw new IllegalArgumentException();
}

System.arraycopy(buf, (int) position, buffer, offset, length);
return length;
}

@Override
public void readFully(long position, byte[] buffer) throws IOException {
read(position, buffer, 0, buffer.length);

}

@Override
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException {
read(position, buffer, offset, length);
}
}
}
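
The `SeekableByteArrayInputStream` above holds the whole downloaded file in memory and serves seeks and positional reads from that buffer. The following is a minimal standalone sketch of the same idea with stricter bounds handling: seeking sets the cursor directly, and a positional read returns a short count or `-1` at the end of the data instead of throwing. It deliberately leaves out the Hadoop `Seekable` and `PositionedReadable` interfaces so it compiles with the JDK alone; the class name `InMemorySeekableStream` is hypothetical and this is not the code from the commit.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch only: in-memory stream with seek and positional reads, JDK classes only.
class InMemorySeekableStream extends ByteArrayInputStream {

  InMemorySeekableStream(byte[] data) {
    super(data);
  }

  /** Current offset of the sequential read cursor. */
  long getPos() {
    return pos;
  }

  /** Moves the sequential cursor; offsets outside [0, length] are rejected. */
  void seek(long target) throws IOException {
    if (target < 0 || target > count) {
      throw new EOFException("Cannot seek to " + target + " in a stream of " + count + " bytes");
    }
    pos = (int) target;
  }

  /**
   * Positional read: copies up to length bytes starting at position without
   * moving the sequential cursor. Returns -1 when position is at or past the end.
   */
  int read(long position, byte[] buffer, int offset, int length) {
    if (position < 0 || offset < 0 || length < 0 || length > buffer.length - offset) {
      throw new IndexOutOfBoundsException();
    }
    if (length == 0) {
      return 0;
    }
    if (position >= count) {
      return -1;
    }
    int n = (int) Math.min(length, count - position);
    System.arraycopy(buf, (int) position, buffer, offset, n);
    return n;
  }

  /** Like the positional read, but fails instead of returning a short count. */
  void readFully(long position, byte[] buffer, int offset, int length) throws IOException {
    int n = read(position, buffer, offset, length);
    if (n < length) {
      throw new EOFException("Requested " + length + " bytes at position " + position);
    }
  }
}
```

Because the entire file is buffered before the first byte is returned, this approach suits small and medium files; very large files would need a streaming or range-request strategy instead.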
@@ -89,6 +89,7 @@ public FileSystemPlugin(FileSystemConfig config, DrillbitContext context, String

fsConf.set(FileSystem.FS_DEFAULT_NAME_KEY, config.getConnection());
fsConf.set("fs.classpath.impl", ClassPathFileSystem.class.getName());
fsConf.set("fs.dropbox.impl", DropboxFileSystem.class.getName());
fsConf.set("fs.drill-local.impl", LocalSyncableFileSystem.class.getName());
CredentialsProvider credentialsProvider = config.getCredentialsProvider();
if (credentialsProvider != null) {
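
The `fs.dropbox.impl` entry added in this hunk is what ties the `dropbox:///` connection string to `DropboxFileSystem`: Hadoop resolves a file system class by looking up the configuration key `fs.<scheme>.impl` for the URI's scheme. The sketch below mimics only that lookup with a plain map. `SchemeRegistry` is a hypothetical stand-in for Hadoop's `Configuration`, and it ignores the caching and service-loader discovery that Hadoop also performs.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the scheme lookup Hadoop performs on its Configuration.
final class SchemeRegistry {
  private final Map<String, String> conf = new HashMap<>();

  void set(String key, String value) {
    conf.put(key, value);
  }

  /** Returns the class name registered under fs.<scheme>.impl, or null if there is none. */
  String implementationFor(URI uri) {
    String scheme = uri.getScheme();
    return scheme == null ? null : conf.get("fs." + scheme + ".impl");
  }

  public static void main(String[] args) {
    SchemeRegistry registry = new SchemeRegistry();
    registry.set("fs.dropbox.impl", "org.apache.drill.exec.store.dfs.DropboxFileSystem");
    System.out.println(registry.implementationFor(URI.create("dropbox:///")));
  }
}
```

Registering the class under the scheme key is what lets the generic file storage plugin reuse its existing workspace and format handling for Dropbox without a dedicated plugin type.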