
Project Ideas

I have decided not to put sample project topics here. I feel like if I give you a sample project and then you do it, that's no better than just another assignment. Part of the point of the project is to have you come up with the idea and carry it through. The collection of DataSets often provides good inspiration.

Instead, I'm going to point out a few things you might want to explore to give you a stronger resumé, which is also a reasonable goal for the project.

Technology Ideas

Amazon Web Services

AWS is the largest cloud infrastructure provider, with roughly a third of the market by most estimates. It is the default choice for many people who want to rent computing power, and that includes many employers.

Elastic MapReduce (EMR) is their managed Hadoop offering, which can be used to spin up a cluster of any size when you need it.

Downside: it costs money. (At least remember to shut down and destroy VMs when you're done: it's easy to forget and get a very big charge at the end of the month.)

The GitHub Student Developer Pack contains many useful things, including $50 of AWS credit. That will help.
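To get a feel for how far that credit goes, here is a back-of-the-envelope sketch. The hourly rate below is a made-up placeholder, not a real AWS price: look up the current pricing for the instance type and region you actually use. The function names are also just for illustration.

```python
def hours_of_credit(credit_usd, nodes, usd_per_node_hour):
    """Hours a cluster can run before the credit is used up."""
    return credit_usd / (nodes * usd_per_node_hour)


def cost_if_left_running(nodes, usd_per_node_hour, days):
    """Total charge if the cluster is never shut down for `days` days."""
    return nodes * usd_per_node_hour * 24 * days


# HYPOTHETICAL rate for illustration only; real prices vary by
# instance type and region.
RATE = 0.25  # USD per node per hour (assumed)

hours = hours_of_credit(50, nodes=10, usd_per_node_hour=RATE)
forgotten = cost_if_left_running(nodes=10, usd_per_node_hour=RATE, days=30)

print(f"$50 lasts about {hours:.0f} hours on a 10-node cluster")
print(f"Forgetting it for 30 days would cost about ${forgotten:.0f}")
```

Under these assumed numbers, the credit covers less than a day of continuous use, which is why shutting things down promptly matters.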

Scala

Scala is the primary implementation language for Spark. The Scala API is more complete than the Python API; in particular, the typed Dataset API is available in Scala and Java but not in Python. It would be interesting and useful for you to have experience with both Python and Scala with Spark.

Pandas

Pandas is a Python library for data analysis. Pandas' DataFrames are very similar to Spark DataFrames, and were their inspiration.

It's not a big data tool by itself, but can be nicely paired with Spark to do data exploration and visualization.
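One common way to pair them is to do the heavy aggregation in Spark, then hand the (now small) result to Pandas for exploration. The sketch below only shows the Pandas half, so it runs without a cluster. The `summary` data and its column names are invented for illustration; in a real job the DataFrame might come from calling `.toPandas()` on an aggregated Spark DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for a small aggregated result. In a Spark job this
# could come from spark_df.toPandas(), once the data has been reduced to
# something that fits comfortably in one machine's memory.
summary = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C"],
    "year": [2022, 2023, 2022, 2023, 2023],
    "avg_temp": [9.5, 10.1, 4.2, 4.8, 15.0],
})

# One row per station, one column per year, for easier reading.
by_station = summary.pivot(index="station", columns="year", values="avg_temp")

# Year-over-year change; NaN where a station is missing a year.
by_station["change"] = by_station[2023] - by_station[2022]

print(by_station)
```

From here, something like `by_station.plot()` (which needs matplotlib installed) is a quick route to a visualization. The important design point is to shrink the data in Spark first: `.toPandas()` pulls everything onto the driver.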

Other NoSQL Databases

There are many other NoSQL databases, which have their own strengths and weaknesses. You may find that one of them is better at representing or manipulating the data you're using for the project.

Messaging Tools

There are several messaging systems that are designed to transmit messages between systems at phenomenal rates. These can be used as part of a streaming project, or to just move high-velocity data around. Examples include Apache Kafka, RabbitMQ, and ZeroMQ.

Others

There are many more Apache big data projects than we could cover, and many more big data tools beyond those.

Updated Thu Aug. 22 2024, 11:06 by ggbaker.