
MATCH POSTS TO USER INTERESTS

[Image: sw_matchposts_title]

Google Cloud SQL, IntelliJ, Google Cloud Storage, Google Dataproc, JDBC, Scala, Spark SQL

A Spark program written in Scala and run on Dataproc that outputs a JSON file for every user and every country, with the posts ordered by number of matching interests in descending order.

The program comes in two versions: one that runs for all users (SparkPostsAll) and one that runs for a single user, whose UserID is passed as an argument (SparkPostsSingle).
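The split between the two entry points can be sketched as a small argument-parsing helper; the object and method names below are hypothetical, not taken from the actual source.

```scala
object SparkPostsArgs {
  // Some(id) -> single-user run (SparkPostsSingle);
  // None     -> all-users run (SparkPostsAll).
  def parseUserId(args: Array[String]): Option[Int] =
    args.headOption.map(_.toInt)
}
```

SparkPostsSingle would then require a UserID argument, while SparkPostsAll ignores the arguments entirely and processes every user.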

Code flow:

  1. Start Spark Session
  2. Define parameters
  3. Connect to database
  4. Create relevant lists and dataframes from database
  5. Select all posts for the country
  6. Get the user's interests
  7. Select posts where the user is not the author
  8. Get these posts' interests
  9. Select the posts with matching interests
  10. Count the number of matches and order in descending order
  11. Get the relevant information for these posts
  12. Write as a single JSON file to Google Cloud Storage as ‘user_id/country’
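The heart of steps 7–11 can be illustrated on plain Scala collections before translating it to DataFrames; the `Post` case class and the interest values below are made-up examples, not the project's real schema.

```scala
case class Post(id: Int, author: Int, interests: Set[String])

// Steps 7-11: drop the user's own posts, count overlapping interests,
// keep only posts with at least one match, and sort by match count
// in descending order.
def matchPosts(userId: Int,
               userInterests: Set[String],
               posts: Seq[Post]): Seq[(Post, Int)] =
  posts
    .filter(_.author != userId)                        // step 7
    .map(p => (p, (p.interests & userInterests).size)) // steps 8 and 10
    .filter(_._2 > 0)                                  // step 9
    .sortBy(-_._2)                                     // step 10
```

In the real program the same logic is expressed with Spark SQL over DataFrames, and step 12 writes the result per user and country.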

Steps:

(1) Testing: Used Databricks to write and test the code.

(2) Connect to database: Connect to the database via the JDBC driver.
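The JDBC connection can be sketched with Spark's built-in JDBC reader; the URL, driver class, table name and credentials below are placeholders, not the project's real values.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkPostsAll").getOrCreate()

// Hypothetical Cloud SQL connection details -- substitute your own.
val posts = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://<CLOUD_SQL_IP>:3306/<DATABASE>")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "posts")
  .option("user", "<USER>")
  .option("password", "<PASSWORD>")
  .load()
```

The same reader is reused for each table needed in step 4 of the code flow.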

(3) Create jar file: Use IntelliJ to create the jar file.
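A minimal build definition for such a project might look like the following; the project name and version numbers are illustrative only. Spark itself is marked `provided` because the Dataproc cluster supplies it at runtime.

```scala
// build.sbt -- illustrative sketch, versions are assumptions.
name := "spark-posts"
scalaVersion := "2.12.18"

// Provided by the Dataproc cluster, so not bundled into the jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.2" % "provided"
)
```

The jar can then be produced from IntelliJ's build artifacts, as described above.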

(4) Save on cloud: Save the JDBC driver and the new jar file in a bucket on Cloud Storage.
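Copying both files to the bucket can be done with `gsutil`; the bucket, jar and driver file names below are placeholders.

```shell
# Hypothetical bucket and file names -- substitute your own.
gsutil cp target/scala-2.12/spark-posts.jar gs://<MY_BUCKET>/jars/
gsutil cp mysql-connector-j-8.0.33.jar gs://<MY_BUCKET>/jars/
```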

(5) Submit job: Go to the Dataproc cluster and submit the job.
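The same submission can be made from the command line instead of the console form; cluster, region, bucket and file names are placeholders, and the trailing argument is the UserID expected by the single-user version.

```shell
# Command-line equivalent of the console's "Submit job" form.
gcloud dataproc jobs submit spark \
  --cluster=<MY_CLUSTER> \
  --region=<MY_REGION> \
  --class=SparkPostsSingle \
  --jars=gs://<MY_BUCKET>/jars/spark-posts.jar,gs://<MY_BUCKET>/jars/mysql-connector-j-8.0.33.jar \
  -- 42   # UserID argument for SparkPostsSingle
```

For SparkPostsAll the trailing UserID argument is simply omitted.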

[Image: sw_matchposts_submitjob]

(6) Output: The output is written as a single JSON file to the Google Cloud Storage bucket as ‘order/user/country’.
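Spark normally writes a directory of part files, so producing a single JSON file typically means coalescing to one partition first; the DataFrame and path names below are placeholders.

```scala
// Write one JSON file per (user, country); path segments are placeholders.
// coalesce(1) forces a single output part file, which Spark places
// inside a directory at the given path.
matched
  .coalesce(1)
  .write
  .mode("overwrite")
  .json(s"gs://<MY_BUCKET>/output/$userId/$country")
```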

[Image: sw_matchposts_joboutput]

Google Dataproc examples: