Skip to content

Conversation

@volmasoft
Copy link
Contributor

@volmasoft volmasoft commented Apr 22, 2020

Overview

Note this is a work in progress, still quite a bit of work to go but I thought I'd expose some of the work early.

I've tried to take the best of a few worlds and tried to keep consistency with both the accumulo-docker repo (https://github.com/apache/accumulo-docker) and the Apache guides here: (https://github.com/docker-library/official-images)

This also assumes that your instance is called "myinstance" and your zookeeper is available at "localhost:2181" from the container, this is NOT the case on some configurations such as Docker for Mac, but I'e left these as defaults for now until we figure out "(2) Configuration" below.

Note, I have left some commands commented out on the Dockerfile currently that make life easier to debug, these will be stripped before they get merged finally.

Currently I have a large number of TODOs and questions for people's input, each is numbered so feel free to answer some or all of them with your thoughts :)

If no one has any strong feelings about some of the questions below then I'll just press on.


(1) Repository

Given that accumulo has it's docker code in accumulo-docker do we need to do the same with this repo e.g. accumulo-proxy-docker or are we happy to merge it into the main code base?


(2) Configuration

I've been debating at length how to allow the default configuration to be overiden.

My current thought is we should allow users to volume mount in (or rebuild the container) with a new proxy.properties and we could document both approaches, this would allow full configuration to be done and versioned outside of our accumulo-proxy project.

However we probably ought to make it easy to 'get started' and therefore we could also provide a further approach to overriding:

A) Allow people to provide an environment variable to override properties e.g. PROXY_CONFIG="key:pair" and having the Java code parse this after the default proxy.properties and update the properties - This would allow any property to be updated.

B) Allow people to only override 2 properties via environment variables (instance.name, instance.zookeepers) since these are more than likely to meet 99% of use cases, we can always add additional ones in the future.

My gut feeling is a mixture of the volume mounting/rebuild guide, plus option (B) is probably the right balance for now and we can always change these later on.


(3) Software versions

Currently this change doesn't support selecting zookeeper/accumulo/hadoop versions, they're hard coded.

Do we want to allow this support?

Does this need doing on the initial commit or would it be acceptable to follow it up.


(4) Automated Build

We could use one of the maven plugins to automatically create the docker container with a new profile, e.g. instead of mvn package -Ptarball we could provide a mvn package -Pdocker or similiar.

This would be different to the standard accumulo-docker repo which details how to build using the docker build command.

I'm happy either way.


(5) Documentation in container

Looking at the accumulo-docker repo it was decided to put documentation into the container (README.md on this line)

I'm not sure it provides value but I've added both the README.md and DOCKER.md into this container currently.

Should we strip this out?


(6) CLASSPATH setting

When debugging this application I found that for some reason accumulo classpath provides a classpath which includes zookeeper, however it only includes (in my case) /opt/apache-zookeeper/ and we actually need /opt/apache-zookeeper/lib.

Look at my diff Dockerfile and look for the line ENV CLASSPATH=/opt/apache-zookeeper/lib/*

Anyone know if I'm missing something in my configuration?


(7) Install path shortcut

If you look at Dockerfile and look for the part where I add the dependencies to /opt/ (roughly line 61 starting with "# Install the dependencies into /opt/"

I take the versions of software and strip their first folder so that they are in folders similiar to /opt/hadoop/ in one step.

This may make the installed versions a bit less transparent if you were to jump into the container (e.g. it's not /opt/hadoop-3.2.1/)

Does anyone have any strong feelings about this?

--- COMMIT LOG ---

Initial commit of the work to stand up an accumulo-proxy inside a doker container.

I have only implemented support for Accumulo 2.x and by default this first commit contains:

  • Accumulo 2.0.0
  • Hadoop 3.2.1
  • Zookeeper 3.5.7

A new document (DOCKER.md) has been created to start to document the implementation and usage guide which should allow others to test this if they so wished.

A number of outstanding questions will be posted on the issue, there is also a number of TODOs still required to be implemented that I'm tracking.

…ker container.

I have only implemented support for Accumulo 2.x and by default this first commit contains:
* Accumulo 2.0.0
* Hadoop 3.2.1
* Zookeeper 3.5.7

A new document (DOCKER.md) has been created to start to document the implementation and usage guide which should allow others to test this if they so wished.

A number of outstanding questions will be posted on the issue, there is also a number of TODOs still required to be implemented that I'm tracking.
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@volmasoft responses to your questions below

Given that accumulo has it's docker code in accumulo-docker do we need to do the same with this repo e.g. accumulo-proxy-docker or are we happy to merge it into the main code base?

I like putting it here for now, and I like what you did in this PR.

My current thought is we should allow users to volume mount in (or rebuild the container) with a new proxy.properties and we could document both approaches

I think allowing to mount in a props files is good for now, fairly easy, and your documentation for it is good. I would be in favor of follow on work to allow passing options on the command line like docker run --name accumulo-proxy accumulo-proxy:latest -o <proxy prop1>=<val1> -o <proxy prop2>=<val2>. The accumulo script supports this behavior.

Currently this change doesn't support selecting zookeeper/accumulo/hadoop versions, they're hard coded.

I think the version ARGS in the dockerfile would allow this. So if a user wants to use certain version, then they need to run docker build. This could be documented.

e.g. instead of mvn package -Ptarball we could provide a mvn package -Pdocker or similiar.

I am not sure what mvn package -Pdocker would do, so its hard for me to comment on it. That could be a follow on issue, or if you have something specific in mind it could be pushed as part of this PR for discussion.

Looking at the accumulo-docker repo it was decided to put documentation into the container (README.md on this line)

My memory is a bit fuzzy, but I think the readme copy in accumulo-docker was a bit of a hack. I think all of the other files on that copy command were optional. However, if the docker copy command has zero things to copy it would fail. So the readme was added just to ensure the copy command always had at least one file.

Anyone know if I'm missing something in my configuration?

I have not had a chance to look into the specific problem you are seeing, but I do know that the latest version of ZK has moved their jars around inside their tarball and this causes problems for the default Accumulo 2.0 config.. The recent PR apache/accumulo-docker#12 added sed command to update the Accumulo 2.0 conf for ZK.

This may make the installed versions a bit less transparent if you were to jump into the container (e.g. it's not /opt/hadoop-3.2.1/). Does anyone have any strong feelings about this?

I think this was done in accumulo-docker to simplify/shorten the dockerfile. That is a good point about someone jumping into the file and looking around. They could still find the versions by digging a bit deeper and finding Hadoop or ZK jars. However I would not be opposed to making the dir names have the version.

…lso tidies up a number of other aspects.

* These are now configured in the same mechanism as accumulo-docker see: apache/accumulo-docker@ff8cedf
* The CLASSPATH env override has been removed.

* The DOCKER.md and README.md file has been removed from the container.
* Updated the DOCKER.md to try and be more explicit about networking choices depending upon your environment.

* I've updated the install process for hadoop, zookeeper and accumulo to retain their version numbers in their install paths.
* Symlinks have been added to avoid the need to update references should dependency versions change.
…the description and a renaming of ZooKeeper to it's offical uppercase Z uppercase K format.
@volmasoft volmasoft changed the title WIP: accumulo-proxy docker container accumulo-proxy docker container Apr 29, 2020
Copy link
Member

@mjwall mjwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff here, thanks for the work

* Updated all references to "docker" to "Docker" where possible, similiar to ZooKeeper.
* Updated the process to verify network connectivity.
* Made a reference to the DOCKER.md from the main README.md so that people know it exists.
@volmasoft
Copy link
Contributor Author

Thanks for all the feedback 👍

@keith-turner keith-turner merged commit b32e9d7 into apache:master May 1, 2020
@keith-turner
Copy link
Contributor

Thanks for the contribution @volmasoft. If you would like to be added as a contributor on the people page, please submit a PR to add yourself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants