In the past week, my mentor and I tried to fix the Dockerfile that sets up Hadoop in an Ubuntu container from scratch. Since that was becoming tedious, we tried setting up a mini Hadoop cluster instead.
Apache provides a mini Hadoop cluster setup that gives a single-node cluster. I tried building it using a Maven Docker image. The documentation says very little about where Hadoop actually gets downloaded or which ports it connects to by default. My mentor and I debugged the Dockerfile and tried to get it up and running, but there is still a problem with the ports, which I'm working on. We also figured out how to get files from HDFS, which can be either CSV or JSON, and I have implemented those changes as well.
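To give an idea of the CSV/JSON handling, here is a rough sketch in Python. It is not the project's actual code. It assumes the raw bytes of a file have already been pulled out of HDFS somehow (for example with `hdfs dfs -cat`), and the function name `parse_hdfs_payload` and the choice to pick the parser from the file extension are made up for illustration.

```python
import csv
import io
import json
from pathlib import PurePosixPath


def parse_hdfs_payload(path, payload):
    """Turn raw bytes read from an HDFS file into a list of records.

    Hypothetical helper: the parser is chosen from the file extension,
    which is an assumption, not something the project is known to do.
    """
    text = payload.decode("utf-8")
    suffix = PurePosixPath(path).suffix.lower()

    if suffix == ".csv":
        # The first row is treated as the header; every value stays a string.
        return list(csv.DictReader(io.StringIO(text, newline="")))

    if suffix == ".json":
        data = json.loads(text)
        # Wrap a single JSON object so callers always get a list back.
        return data if isinstance(data, list) else [data]

    raise ValueError(f"unsupported file type: {path}")
```

Calling `parse_hdfs_payload("/data/users.csv", raw_bytes)` would give a list of dictionaries keyed by the CSV header, and a `.json` path gives the decoded JSON wrapped in a list if it is a single object.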
Hopefully by next week I can finish this project.