DATA ANALYSIS
Google BigQuery, Zeppelin Notebook, Google Cloud Dataproc, Apache Spark, Scala, JDBC
Use Spark and Scala in a Zeppelin notebook on a Dataproc cluster to perform general analysis of the data. Four analyses are covered:
(1) Number of posts in each country
(2) Top interests of each country
(3) Top interests of users
(4) Top interests of posts
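As an illustration of analysis (2), the sketch below finds the most frequent interest per country with a group-by followed by a window ranking. The column names `country` and `interest`, the sample rows, and the name `session` are hypothetical; the real schema is not described here, and this is only one possible way to express the aggregation.

```scala
// Run in a Zeppelin %spark paragraph or in spark-shell.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Reuses the session the notebook already started.
val session = SparkSession.builder().getOrCreate()
import session.implicits._

// Hypothetical sample data: one row per (country, interest) occurrence.
val postInterests = Seq(
  ("US", "music"), ("US", "music"), ("US", "sports"),
  ("FR", "food"), ("FR", "food"), ("FR", "music")
).toDF("country", "interest")

// Count how often each interest occurs in each country.
val counts = postInterests.groupBy("country", "interest").count()

// Rank interests within each country by count and keep the top one.
val byCountry = Window.partitionBy("country").orderBy(desc("count"))
val topInterest = counts
  .withColumn("rank", row_number().over(byCountry))
  .filter(col("rank") === 1)
  .drop("rank")

topInterest.show()
```

With this sample data the top interest is music for US and food for FR; the row order of the output is not guaranteed. Changing the filter to `col("rank") <= 3` would keep the top three interests per country instead.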
Code flow:
- Load the JDBC dependency
- Import everything needed and start the Spark session
- Set up the connection to the database
- Go through each of the four analyses:
  - Load data from the database into a DataFrame and perform the aggregation
  - Use z.show() to display the results in Zeppelin
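The flow above might look roughly like the following Zeppelin paragraph for analysis (1). This is a minimal sketch, not the notebook's actual code: the JDBC URL, credentials, the table name `posts`, and the column name `country` are all placeholders, and the MySQL-style URL is only an assumption about the database. Loading the JDBC driver is done in a separate step that depends on the Zeppelin version and is not shown.

```scala
// Zeppelin %spark paragraph: `spark` and `z` are provided by the notebook.
import java.util.Properties
import org.apache.spark.sql.functions.desc

// Placeholder connection details (assumptions, replace with real values).
val jdbcUrl = "jdbc:mysql://DB_HOST:3306/DB_NAME"
val props = new Properties()
props.setProperty("user", "DB_USER")
props.setProperty("password", "DB_PASSWORD")

// Load a hypothetical `posts` table into a DataFrame over JDBC.
val posts = spark.read.jdbc(jdbcUrl, "posts", props)

// Analysis (1): number of posts in each country, largest first.
// groupBy(...).count() adds a column named "count".
val postsPerCountry = posts
  .groupBy("country")
  .count()
  .orderBy(desc("count"))

// Render the result with Zeppelin's built-in table and chart display.
z.show(postsPerCountry)
```

The other three analyses would follow the same pattern, reading the relevant table and changing only the grouping and ranking logic.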