MATCH POSTS TO USER INTERESTS
Google Cloud SQL, IntelliJ, Google Cloud Storage, Google Dataproc, JDBC, Scala, Spark SQL
Write a Spark program in Scala to run on Dataproc that outputs a JSON file for every user and every country, with the posts ordered by the number of matching interests in descending order.
There are two versions of the program: one that runs for all users (SparkPostsAll) and one that runs for a single user, with the UserID passed as an argument (SparkPostsSingle).
Code flow:
- Start Spark Session
- Define parameters
- Connect to database
- Create the relevant lists and DataFrames from the database
- Select all posts for the country
- Get the user's interests
- Select posts where the user is not the author
- Get these posts' interests
- Select the posts with matching interests
- Count the number of matches and order in descending order (sketched below)
- Get the relevant information for these posts
- Write as a single JSON file to Google Cloud Storage as ‘user_id/country’
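The matching step can be expressed with DataFrame joins and an aggregation. A minimal sketch, assuming hypothetical posts, user_interests and post_interests tables; table and column names are illustrative, not the project's actual schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, desc}

// Order a country's posts by how many interests they share with a user.
// Table and column names here are illustrative, not the project's schema.
def matchingPosts(userId: Long,
                  country: String,
                  posts: DataFrame,         // post_id, author_id, country, ...
                  userInterests: DataFrame, // interest_id rows for this user
                  postInterests: DataFrame  // post_id, interest_id
                 ): DataFrame = {
  posts
    .filter(col("country") === country && col("author_id") =!= userId) // user is not the author
    .join(postInterests, "post_id")
    .join(userInterests, "interest_id")      // keep only shared interests
    .groupBy("post_id")
    .agg(count("interest_id").as("matches")) // number of matching interests
    .orderBy(desc("matches"))                // descending order
}
```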
Steps:
(1) Testing: Used Databricks to write and test the code.
- Create a Community Edition login, which can be used for free
- Create tables in its storage to use for testing
- Could not save JSON files etc., only tables
(2) Connect to database: Connect to the database via the JDBC driver (see the sketch below).
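A hedged sketch of the JDBC read; the URL, credentials and table name are placeholders, not the project's real Cloud SQL values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkPosts").getOrCreate()

// Placeholder connection details -- substitute the real Cloud SQL host,
// database name, credentials and table.
val posts = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://<CLOUD_SQL_IP>:5432/<DATABASE>")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "posts")
  .option("user", "<DB_USER>")
  .option("password", "<DB_PASSWORD>")
  .load()
```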
(3) Create jar file: Use IntelliJ to create the jar file.
- Set up IntelliJ
- Get dependencies (a build.sbt sketch follows this list)
- sbt package / sbt run
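A minimal build.sbt sketch; the Scala and Spark versions are assumptions and should be aligned with the Dataproc cluster's versions:

```scala
// build.sbt -- versions are assumptions; align them with the cluster
name := "spark-posts-single"
version := "0.1"
scalaVersion := "2.12.15"

// "provided": the Dataproc cluster supplies Spark at runtime,
// so it is not bundled into the jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided"
```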
(4) Save on cloud: Save the JDBC driver and the new jar file in a bucket on Cloud Storage
(5) Submit job: Go to the Dataproc cluster and submit a job
- Make the job name unique
- Set the region to the same as the cluster (us-central1)
- Enter the GCS path of the jar file (gs://shared-world-dataproc/spark-posts-single.jar)
- Enter any arguments for the program, e.g. the UserID (23); see the sketch after this list
- Under additional jar files, put the address of the JDBC driver (gs://shared-world-dataproc/postgresql-driver.jar)
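In SparkPostsSingle, the argument entered here arrives through main's args array; a sketch, assuming the UserID is the only argument:

```scala
object SparkPostsSingle {
  def main(args: Array[String]): Unit = {
    // The job argument entered in the Dataproc console, e.g. "23"
    require(args.nonEmpty, "usage: SparkPostsSingle <UserID>")
    val userId = args(0).toLong
    // ... build the SparkSession and run the pipeline for userId ...
  }
}
```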
(6) Output: Output as a unique single JSON file to the Google Cloud Storage bucket as ‘order/user/country’
- Write a single CSV file using Spark
- Save a Spark DataFrame to a single CSV
- Scala: write to a file with a variable name (see the sketch below)
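Spark normally writes a directory of part files, so producing one JSON file per user and country takes a coalesce followed by a rename. A sketch using the Hadoop FileSystem API; the bucket name and path layout are placeholders:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Write `df` to a single JSON file named after the user and country.
// Bucket name and path layout are placeholders.
def writeSingleJson(df: DataFrame, bucket: String, userId: Long, country: String): Unit = {
  val tmpDir = s"gs://$bucket/tmp/$userId/$country"
  df.coalesce(1).write.mode("overwrite").json(tmpDir) // yields one part-*.json file

  // Rename the lone part file to the final, variable-named path
  val conf = df.sparkSession.sparkContext.hadoopConfiguration
  val fs   = new Path(tmpDir).getFileSystem(conf)
  val part = fs.globStatus(new Path(s"$tmpDir/part-*.json"))(0).getPath
  fs.rename(part, new Path(s"gs://$bucket/$userId/$country.json"))
  fs.delete(new Path(tmpDir), true)
}
```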