MATCH POSTS TO USER INTERESTS
Google Cloud SQL, IntelliJ, Google Cloud Storage, Google Dataproc, JDBC, Scala, Spark SQL
Write a Spark program in Scala to run on Dataproc that outputs a JSON file for every user and every country, with the posts ordered by the number of matching interests in descending order.
There are two versions of the program: one that runs for all users (SparkPostsAll) and one that runs for a single user, with the UserID passed as an argument (SparkPostsSingle).
Code flow:
- Start Spark Session
- Define parameters
- Connect to database
- Create the relevant lists and DataFrames from the database
- Select all posts for the country
- Get the user's interests
- Select posts where the user is not the author
- Get these posts' interests
- Select the posts with matching interests
- Count the number of matches and order in descending order (sketched below)
- Get the relevant information for these posts
- Write as a single JSON file to Google Cloud Storage as ‘user_id/country’
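The matching step can be expressed with DataFrame joins and an aggregation. A minimal sketch, assuming hypothetical posts, user_interests and post_interests tables; table and column names are illustrative, not the project's actual schema:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, desc}

// Order a country's posts by how many interests they share with a user.
// Table and column names here are illustrative, not the project's schema.
def matchingPosts(userId: Long,
                  country: String,
                  posts: DataFrame,         // post_id, author_id, country, ...
                  userInterests: DataFrame, // interest_id rows for this user
                  postInterests: DataFrame  // post_id, interest_id
                 ): DataFrame = {
  posts
    .filter(col("country") === country && col("author_id") =!= userId) // user is not the author
    .join(postInterests, "post_id")
    .join(userInterests, "interest_id")      // keep only shared interests
    .groupBy("post_id")
    .agg(count("interest_id").as("matches")) // number of matching interests
    .orderBy(desc("matches"))                // descending order
}
```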
Steps:
(1) Testing: Used Databricks to write and test the code.
- Create a Community Edition login, which can be used for free
- Create tables in its storage to use for testing
- Could not save JSON files etc., only tables
(2) Connect to database: Connect to the database via the JDBC driver (see the sketch below).
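A hedged sketch of the JDBC read; the URL, credentials and table name are placeholders, not the project's real Cloud SQL values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkPosts").getOrCreate()

// Placeholder connection details -- substitute the real Cloud SQL host,
// database name, credentials and table.
val posts = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://<CLOUD_SQL_IP>:5432/<DATABASE>")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "posts")
  .option("user", "<DB_USER>")
  .option("password", "<DB_PASSWORD>")
  .load()
```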
(3) Create jar file: Use IntelliJ to create the jar file.
- Set up IntelliJ
- Get dependencies (a build.sbt sketch follows this list)
- sbt package / sbt run
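A minimal build.sbt sketch; the Scala and Spark versions are assumptions and should be aligned with the Dataproc cluster's versions:

```scala
// build.sbt -- versions are assumptions; align them with the cluster
name := "spark-posts-single"
version := "0.1"
scalaVersion := "2.12.15"

// "provided": the Dataproc cluster supplies Spark at runtime,
// so it is not bundled into the jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided"
```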
(4) Save on cloud: Save the JDBC driver and the new jar file in a bucket on Cloud Storage
(5) Submit job: Go to the Dataproc cluster and submit a job
- Make the job name unique
- Set the region to the same as the cluster (us-central1)
- Enter the GCS path of the jar file (gs://shared-world-dataproc/spark-posts-single.jar)
- Enter any arguments for the program, e.g. the UserID (23); see the sketch after this list
- Under additional jar files, put the address of the JDBC driver (gs://shared-world-dataproc/postgresql-driver.jar)
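In SparkPostsSingle, the argument entered here arrives through main's args array; a sketch, assuming the UserID is the only argument:

```scala
object SparkPostsSingle {
  def main(args: Array[String]): Unit = {
    // The job argument entered in the Dataproc console, e.g. "23"
    require(args.nonEmpty, "usage: SparkPostsSingle <UserID>")
    val userId = args(0).toLong
    // ... build the SparkSession and run the pipeline for userId ...
  }
}
```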
(6) Output: Output as a unique single JSON file to the Google Cloud Storage bucket as ‘order/user/country’
- Write a single CSV file using Spark
- Save a Spark DataFrame to a single CSV
- Scala: write to a file with a variable name (see the sketch below)
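Spark normally writes a directory of part files, so producing one JSON file per user and country takes a coalesce followed by a rename. A sketch using the Hadoop FileSystem API; the bucket name and path layout are placeholders:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Write `df` to a single JSON file named after the user and country.
// Bucket name and path layout are placeholders.
def writeSingleJson(df: DataFrame, bucket: String, userId: Long, country: String): Unit = {
  val tmpDir = s"gs://$bucket/tmp/$userId/$country"
  df.coalesce(1).write.mode("overwrite").json(tmpDir) // yields one part-*.json file

  // Rename the lone part file to the final, variable-named path
  val conf = df.sparkSession.sparkContext.hadoopConfiguration
  val fs   = new Path(tmpDir).getFileSystem(conf)
  val part = fs.globStatus(new Path(s"$tmpDir/part-*.json"))(0).getPath
  fs.rename(part, new Path(s"gs://$bucket/$userId/$country.json"))
  fs.delete(new Path(tmpDir), true)
}
```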