by BigBoards
In this 2-day workshop you will learn how to build a complete data processing pipeline. The workshop is hands-on using a BigBoards Hex. You will be touching several common Big Data technologies.
The accompanying repository contains all the technologies, resources and solutions you need to complete the workshop.
During this workshop, you will
- ingest data from a rather large relational database that contains weather and sales data;
- store the raw data on a distributed file system as your primary data;
- restructure the data for easier analysis;
- and finally apply machine learning to build a recommendation engine.
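To make the four steps above concrete, here is a minimal sketch in plain Python. It is not the workshop solution: `sqlite3` stands in for the relational database, CSV files in a temporary directory stand in for the distributed file system, and a simple "bought together" count stands in for a real recommendation model. The table names, column names and sample rows are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


def ingest(conn, raw_dir):
    """Steps 1 and 2: dump each table unchanged to a CSV file, the raw copy."""
    for table in ("sales", "weather"):  # hypothetical table names
        cursor = conn.execute(f"SELECT * FROM {table}")
        with open(raw_dir / f"{table}.csv", "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow([column[0] for column in cursor.description])
            writer.writerows(cursor)


def restructure(raw_dir):
    """Step 3: attach the temperature of the day to every sale."""
    with open(raw_dir / "weather.csv", newline="") as handle:
        weather = {row["day"]: float(row["temp_c"]) for row in csv.DictReader(handle)}
    with open(raw_dir / "sales.csv", newline="") as handle:
        return [
            {**row, "temp_c": weather.get(row["day"])}
            for row in csv.DictReader(handle)
        ]


def recommend(records, product, top_n=2):
    """Step 4: rank the products most often bought by buyers of `product`."""
    baskets = defaultdict(set)
    for row in records:
        baskets[row["customer"]].add(row["product"])
    together = Counter()
    for basket in baskets.values():
        if product in basket:
            together.update(basket - {product})
    # Sort by count (descending), then by name, so ties are deterministic.
    ranked = sorted(together.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


def run_demo():
    """Build a tiny in-memory database and run all four steps on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, day TEXT)")
    conn.execute("CREATE TABLE weather (day TEXT, temp_c REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("alice", "umbrella", "2024-01-01"),
            ("alice", "raincoat", "2024-01-01"),
            ("bob", "umbrella", "2024-01-02"),
            ("bob", "raincoat", "2024-01-02"),
            ("bob", "boots", "2024-01-02"),
            ("carol", "sunscreen", "2024-01-03"),
            ("carol", "umbrella", "2024-01-03"),
        ],
    )
    conn.executemany(
        "INSERT INTO weather VALUES (?, ?)",
        [("2024-01-01", 4.5), ("2024-01-02", 6.0), ("2024-01-03", 21.0)],
    )
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = Path(tmp)
        ingest(conn, raw_dir)
        records = restructure(raw_dir)
    conn.close()
    return records, recommend(records, "umbrella")


if __name__ == "__main__":
    demo_records, suggestions = run_demo()
    print(len(demo_records), "restructured records")
    print("Customers who bought an umbrella also bought:", suggestions)
```

The raw files are written before any transformation happens, which mirrors the idea of keeping the untouched data as your primary copy: if a later step turns out to be wrong, you can redo it without going back to the source database.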
We have packaged all the required technologies for this workshop as a BigBoards Tint. With the click of a button you can install everything on a Hex, in the cloud or on your own servers. Just head over to the BigBoards Hive.
The technologies you will use for your end-to-end data pipeline are:
- Apache Hadoop for distributed storage, processing and resource management,
- Apache Sqoop for ingestion of relational data,
- Apache Pig to write data transformations,
- Apache Spark for lightning fast data processing,
- Apache Spark SQL for uniform data access,
- Apache Spark MLlib for machine learning.
For now, we still host the data outside the big data clusters.
You will learn the basics of Big Data and cluster processing through two presentations:
- Big Data Basics - Common explains Big Data and its use cases.
- Big Data Basics - Building a Data Pipeline guides you through the practical exercises. This presentation covers all the technologies and the resources for this project.
The BigBoards Hex gives you everything you need to get your hands dirty.
You can log in to JupyterHub with the default BigBoards username (bb) and password (Swh^bdl).
Made with ♡ for data!
You are free to use the content, presentations and resources from the workshop. Do keep in mind that we have put an awful lot of work into creating these artefacts: please mention us to spread the karma!
Big Data Basics by BigBoards CVBA is licensed under a Creative Commons Attribution 4.0 International Licence.
Based on work from https://github.com/bigboards/bb-stack-training