Week 10: Lending Club Project - Part 2 💸


Hey there! 👋

Welcome to the second and final part of our Lending Club project. In this blog, we focus on calculating loan scores based on loan payment history, financial health, and loan defaulters' history. We'll also create permanent and temporary tables for downstream analysis, identify and clean bad data, and finally, calculate and store the loan scores. This blog will guide you through the steps required to achieve a comprehensive and accurate loan scoring system.

Loan score

  • Loan score calculation depends upon three factors:

    1. Loan payment history (loan repayments history for previous loans)

    2. Customer's financial health

    3. Loan defaulters' history (delinquencies, public records, bankruptcies, enquiries)

  • To calculate the loan score based on the above factors, two primary tables are required:

    1. A delinquencies (delinq) table

    2. A table containing the details of public records, bankruptcies, and enquiries

  • The first of these tables is created from the raw loan defaulters dataframe (loans_def_raw_df).

  • Next, we clean the records to create a processed dataframe, from which the second required table can be developed.

  • The processed dataframe is created using the following command.

      from pyspark.sql.functions import col

      loans_def_processed_df = loans_def_raw_df \
          .withColumn("delinq_2yrs", col("delinq_2yrs").cast("integer")) \
          .fillna(0, subset=["delinq_2yrs"])
    
  • After further cleaning of the dataframes, we create a temporary table 'loan_defaulters' on the processed data.

  • Thereafter, we create a detailed dataframe including the columns required for the calculation of loan score.

  • We write the processed data to the 'cleaned' folder, both in CSV and parquet format.
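
Taken together, the steps in this section can be sketched as a single PySpark helper. This is a minimal illustration rather than the project's exact code: the function name, the paths, and the choice of a single cleaned column are assumptions.

```python
def process_defaulters(spark, raw_path, cleaned_path):
    """Clean the raw defaulters data, expose it as a temp table,
    and write it to the 'cleaned' folder in CSV and Parquet."""
    from pyspark.sql.functions import col  # pyspark import kept local to the sketch

    loans_def_raw_df = spark.read.csv(raw_path, header=True, inferSchema=True)

    # Cast delinq_2yrs to integer and default missing values to 0
    loans_def_processed_df = (
        loans_def_raw_df
        .withColumn("delinq_2yrs", col("delinq_2yrs").cast("integer"))
        .fillna(0, subset=["delinq_2yrs"])
    )

    # Temporary table for downstream SQL
    loans_def_processed_df.createOrReplaceTempView("loan_defaulters")

    # Persist to the 'cleaned' folder in both formats
    loans_def_processed_df.write.mode("overwrite") \
        .option("header", True).csv(cleaned_path + "/csv")
    loans_def_processed_df.write.mode("overwrite") \
        .parquet(cleaned_path + "/parquet")
    return loans_def_processed_df
```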

Permanent table creation on cleaned data

Business requirement: Some teams need to analyze the cleaned data, which can be achieved by the creation of permanent tables on top of the cleaned data. Since multiple teams need to access this data, we will create external tables.

  • We create the permanent table for 'customers' as follows.

  • Permanent table for 'loans'.

  • Permanent table for 'loans_repayments'.

  • Permanent table for 'loans_defaulters_delinq'.

  • Permanent table for 'loans_defaulters_detail_rec_enq'.
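
As one example, the DDL for the 'customers' external table might look like the sketch below. The database name, column list, and HDFS location here are illustrative assumptions, not the project's exact values.

```python
# Hypothetical DDL for the 'customers' external table; the database name,
# columns, and LOCATION path are assumptions for illustration.
CREATE_CUSTOMERS_EXTERNAL = """
CREATE EXTERNAL TABLE IF NOT EXISTS lending_club.customers (
    member_id STRING,
    emp_title STRING,
    emp_length INT,
    home_ownership STRING,
    annual_income FLOAT,
    address_state STRING,
    grade STRING,
    sub_grade STRING
)
STORED AS PARQUET
LOCATION '/user/lending_club/cleaned/customers_parquet'
"""

# In the project this would run on the Spark session:
#   spark.sql(CREATE_CUSTOMERS_EXTERNAL)
```

Because the table is external, dropping it removes only the metadata, leaving the cleaned files in place for the other teams.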

Business requirement: The downstream teams require a single consolidated view of all the datasets with the latest up-to-date data.

  • To achieve this, we create a view over the cleaned data. A view computes its results from the underlying tables at query time, so it always reflects the latest data.

  • The query will take time to execute as it joins multiple tables to get the desired data.
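
A sketch of such a consolidated view is shown below; the join keys and column names are assumptions based on the datasets described above.

```python
# Hypothetical consolidated view joining the five cleaned tables.
# Because it is a view, every query against it recomputes the joins,
# which is why it is slow but always up to date.
CREATE_CONSOLIDATED_VIEW = """
CREATE OR REPLACE VIEW lending_club.customers_loan_v AS
SELECT c.member_id, c.annual_income, c.grade,
       l.loan_id, l.funded_amount, l.loan_status,
       r.total_payment_received, r.last_payment_amount,
       d.delinq_2yrs,
       e.pub_rec, e.pub_rec_bankruptcies, e.inq_last_6mths
FROM lending_club.customers c
LEFT JOIN lending_club.loans l ON c.member_id = l.member_id
LEFT JOIN lending_club.loans_repayments r ON l.loan_id = r.loan_id
LEFT JOIN lending_club.loans_defaulters_delinq d ON c.member_id = d.member_id
LEFT JOIN lending_club.loans_defaulters_detail_rec_enq e ON c.member_id = e.member_id
"""
# spark.sql(CREATE_CONSOLIDATED_VIEW)
```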

Business requirement: A team requires quick access to the data, without having to wait for the view results to be processed.

  • To achieve this, we pre-compute the results by executing the join of the tables beforehand and storing the results in another table.

  • In this case, a managed table gets created.
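
One way to materialize the join is a CREATE TABLE AS SELECT over the view; the names below are illustrative.

```python
# Hypothetical CTAS: the join runs once at creation time, and the results
# land in a table that downstream queries can read without re-joining.
CREATE_PRECOMPUTED_TABLE = """
CREATE TABLE IF NOT EXISTS lending_club.customers_loan
AS
SELECT * FROM lending_club.customers_loan_v
"""
# spark.sql(CREATE_PRECOMPUTED_TABLE)
```

Since no EXTERNAL keyword or LOCATION is given, Spark creates a managed table whose data lives in the warehouse directory.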

Criteria for loan score calculation

  • The higher the loan score, the better the chances of loan approval, and vice versa.

  • The three major factors and their percentage contributions to the loan score:

    1. Loan repayment history (20%): based on last_payment and total_payment_received

    2. Loan defaulters history (45%): based on delinq_2yrs, public_rec, public_rec_bankruptcies and inq_last_6mths

    3. Financial health (35%): based on home_ownership, loan_status, funded_amount, grade_points
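
To make the weighting concrete, here is a small, pure-Python illustration of how the three per-criterion point totals combine; the input point values are made up.

```python
def combine_loan_score(payment_pts, defaulters_pts, financial_pts):
    """Weighted sum of the three criteria: 20% + 45% + 35% = 100%."""
    return payment_pts * 0.20 + defaulters_pts * 0.45 + financial_pts * 0.35

# Example with made-up per-criterion points:
score = combine_loan_score(500, 400, 300)
# 500*0.20 + 400*0.45 + 300*0.35 = 100 + 180 + 105 = 385
```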

Identifying the bad data and final cleaning

  • As we can see, there are repeating member IDs. Ideally, there should be only one record associated with a particular member ID.

  • Certain member IDs, for example, have multiple records associated with them, implying bad data.

  • Similarly, there is bad data (multiple member IDs) in loans_defaulters_delinq table as well.

  • Bad data (multiple member IDs) in loans_defaulters_detail_rec_enq.

  • There are as many as 3157 member IDs that repeat themselves in our data.

  • We consolidate all such member IDs into a single dataframe.

  • We create CSV files corresponding to the three dataframes for consolidating all the bad data into a single 'bad' folder.

  • To create a common CSV file containing all the bad data, we perform a union of the three bad dataframes and write the consolidated CSV file into the same 'bad' folder.
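
The detection-and-union step can be sketched as below; the table names follow the narrative above, while the function and path names are hypothetical.

```python
def duplicate_members_sql(table):
    """member_ids with more than one record in `table` indicate bad data."""
    return (f"SELECT member_id FROM {table} "
            f"GROUP BY member_id HAVING count(*) > 1")

def write_bad_member_ids(spark, bad_path):
    """Union the bad member_ids from all three tables into one CSV
    under the 'bad' folder."""
    tables = ["customers",
              "loans_defaulters_delinq",
              "loans_defaulters_detail_rec_enq"]
    bad_dfs = [spark.sql(duplicate_members_sql(t)) for t in tables]
    bad_all_df = bad_dfs[0].union(bad_dfs[1]).union(bad_dfs[2]).distinct()
    bad_all_df.write.mode("overwrite").option("header", True).csv(bad_path)
    return bad_all_df
```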

Segregating the identified bad data from the normal data

  • We create a final cleaned dataframe by excluding the member IDs present in bad data views and store the final cleaned data in 'cleaned_new' folder.

  • Cleaned data for customers.

  • Along similar lines, cleaned data for loans defaulters and loans defaulters details.

  • Then, we create external tables over the final new cleaned data.

  • Now, we have unique member IDs.
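
Assuming the bad member IDs are exposed as a temp view (here called bad_data_customers, a hypothetical name), the exclusion for customers can be sketched as follows.

```python
def exclude_bad_members(spark, cleaned_new_path):
    """Keep only customers whose member_id is absent from the bad-data view,
    then persist the result to the 'cleaned_new' folder."""
    customers_cleaned_df = spark.sql("""
        SELECT * FROM customers
        WHERE member_id NOT IN (
            SELECT member_id FROM bad_data_customers
        )
    """)
    customers_cleaned_df.write.mode("overwrite") \
        .parquet(cleaned_new_path + "/customers_parquet")
    return customers_cleaned_df
```

The same pattern repeats for the loans defaulters and loans defaulters details data.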

Processing and storing the final loan score

  • We define user-defined variables for rated points and grade points, which will be used to calculate the final loan score.
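
For illustration, the variables might look like this; the exact point values below are assumptions, not the project's real numbers.

```python
# Illustrative rated-point scale (lower = worse credit behaviour)
unacceptable_rated_pts = 0
very_bad_rated_pts = 100
bad_rated_pts = 250
good_rated_pts = 500
very_good_rated_pts = 650
excellent_rated_pts = 800

# Illustrative grade-point thresholds used when assigning letter grades
unacceptable_grade_pts = 750
very_bad_grade_pts = 1000
bad_grade_pts = 1500
good_grade_pts = 2000
very_good_grade_pts = 2500

# These can be registered as Spark SQL config variables so that SQL text
# can reference them, e.g.:
#   spark.conf.set("spark.sql.unacceptable_rated_pts", unacceptable_rated_pts)
```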

Loan score calculation criteria 1: Payment history

  • Firstly, we create a temporary view out of the bad customer data.

  • The following query determines the rated points and grade points on the basis of loan repayment history; this criterion contributes 20% towards the loan score.
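
A sketch of such a query is below. The columns, thresholds, and joined table names are assumptions, and the `${spark.sql....}` placeholders assume the point variables were registered via `spark.conf.set` beforehand.

```python
# Hypothetical payment-history points query (20% of the final score).
PAYMENT_HISTORY_SQL = """
SELECT c.member_id,
  CASE
    WHEN p.last_payment_amount < (l.monthly_installment * 0.5)
         THEN ${spark.sql.very_bad_rated_pts}
    WHEN p.last_payment_amount >= (l.monthly_installment * 0.5)
         AND p.last_payment_amount < l.monthly_installment
         THEN ${spark.sql.bad_rated_pts}
    WHEN p.last_payment_amount >= l.monthly_installment
         THEN ${spark.sql.good_rated_pts}
    ELSE ${spark.sql.unacceptable_rated_pts}
  END AS last_payment_pts,
  CASE
    WHEN p.total_payment_received >= (l.funded_amount * 0.50)
         THEN ${spark.sql.very_good_rated_pts}
    WHEN p.total_payment_received > 0
         THEN ${spark.sql.good_rated_pts}
    ELSE ${spark.sql.unacceptable_rated_pts}
  END AS total_payment_pts
FROM loans_repayments p
JOIN loans l ON p.loan_id = l.loan_id
JOIN customers c ON l.member_id = c.member_id
"""
# spark.sql(PAYMENT_HISTORY_SQL).createOrReplaceTempView("ph_pts")
```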

Loan score calculation criteria 2: Loan defaulters history

  • The following query determines the rated points and grade points on the basis of loan defaulters history; this criterion contributes 45% towards the loan score.
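
A sketch of the defaulters-history scoring follows. The thresholds are illustrative, and `ph_pts` is assumed to be an intermediate view holding the payment-history points from the previous step.

```python
# Hypothetical defaulters-history points query (45% of the final score).
DEFAULTERS_HISTORY_SQL = """
SELECT p.*,
  CASE
    WHEN d.delinq_2yrs = 0 THEN ${spark.sql.excellent_rated_pts}
    WHEN d.delinq_2yrs BETWEEN 1 AND 2 THEN ${spark.sql.bad_rated_pts}
    WHEN d.delinq_2yrs BETWEEN 3 AND 5 THEN ${spark.sql.very_bad_rated_pts}
    ELSE ${spark.sql.unacceptable_rated_pts}
  END AS delinq_pts,
  CASE
    WHEN e.pub_rec = 0 THEN ${spark.sql.excellent_rated_pts}
    WHEN e.pub_rec BETWEEN 1 AND 2 THEN ${spark.sql.bad_rated_pts}
    ELSE ${spark.sql.very_bad_rated_pts}
  END AS public_records_pts,
  CASE
    WHEN e.pub_rec_bankruptcies = 0 THEN ${spark.sql.excellent_rated_pts}
    ELSE ${spark.sql.very_bad_rated_pts}
  END AS public_bankruptcies_pts,
  CASE
    WHEN e.inq_last_6mths = 0 THEN ${spark.sql.excellent_rated_pts}
    WHEN e.inq_last_6mths BETWEEN 1 AND 2 THEN ${spark.sql.bad_rated_pts}
    ELSE ${spark.sql.unacceptable_rated_pts}
  END AS enq_pts
FROM ph_pts p
JOIN loans_defaulters_delinq d ON p.member_id = d.member_id
JOIN loans_defaulters_detail_rec_enq e ON p.member_id = e.member_id
"""
# spark.sql(DEFAULTERS_HISTORY_SQL).createOrReplaceTempView("ldh_pts")
```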

Loan score calculation criteria 3: Financial health

  • The following query determines the rated points and grade points on the basis of the financial health of the customer; this criterion contributes 35% towards the loan score.
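
A sketch of the financial-health scoring follows. The status values, thresholds, and the intermediate view names (`ldh_pts`, `customers_loan_t`) are assumptions for illustration.

```python
# Hypothetical financial-health points query (35% of the final score).
FINANCIAL_HEALTH_SQL = """
SELECT l.*,
  CASE
    WHEN LOWER(a.loan_status) = 'fully paid' THEN ${spark.sql.excellent_rated_pts}
    WHEN LOWER(a.loan_status) = 'current' THEN ${spark.sql.good_rated_pts}
    WHEN LOWER(a.loan_status) LIKE 'late%' THEN ${spark.sql.bad_rated_pts}
    ELSE ${spark.sql.unacceptable_rated_pts}
  END AS loan_status_pts,
  CASE
    WHEN LOWER(a.home_ownership) = 'own' THEN ${spark.sql.excellent_rated_pts}
    WHEN LOWER(a.home_ownership) = 'mortgage' THEN ${spark.sql.good_rated_pts}
    WHEN LOWER(a.home_ownership) = 'rent' THEN ${spark.sql.bad_rated_pts}
    ELSE ${spark.sql.unacceptable_rated_pts}
  END AS home_pts,
  CASE
    WHEN a.funded_amount <= 10000 THEN ${spark.sql.excellent_rated_pts}
    WHEN a.funded_amount <= 25000 THEN ${spark.sql.good_rated_pts}
    ELSE ${spark.sql.bad_rated_pts}
  END AS funded_amount_pts
FROM ldh_pts l
JOIN customers_loan_t a ON l.member_id = a.member_id
"""
# spark.sql(FINANCIAL_HEALTH_SQL).createOrReplaceTempView("fh_pts")
```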

Final loan score calculation

  • First, the loan score is calculated using the following query. Note that the corresponding percentage weight is applied to each loan score factor.

  • We store this loan score inside the 'loan_score_eval' view.

  • Next, the final loan score, along with the corresponding grade, is calculated using the following query. Custom queries can then be run on the temporary table thus created.

  • We store the final loan score results under the 'processed' folder in HDFS.
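
Pulling the three criteria together with their weights can be sketched as below; the column names, view names, and output path are assumptions consistent with the narrative above.

```python
# Hypothetical final-score query: each factor is weighted by its percentage
# (20% payment history, 45% defaulters history, 35% financial health).
LOAN_SCORE_SQL = """
SELECT member_id,
       ((last_payment_pts + total_payment_pts) * 0.20) AS payment_history_pts,
       ((delinq_pts + public_records_pts
         + public_bankruptcies_pts + enq_pts) * 0.45) AS defaulters_history_pts,
       ((loan_status_pts + home_pts + funded_amount_pts) * 0.35) AS financial_health_pts
FROM fh_pts
"""

# The sum of the three weighted columns gives the loan score, held in the
# 'loan_score_eval' view; a final query then buckets that score into letter
# grades and the results are written to the 'processed' folder, e.g.:
#   loan_score_final_df.write.mode("overwrite") \
#       .parquet("/user/lending_club/processed/loan_score")
```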

Conclusion

In this concluding part of the Lending Club project, we successfully calculated the loan scores by considering various factors such as loan payment history, financial health, and loan defaulters' history. By creating permanent tables and views, we ensured that the data is easily accessible for analysis. Additionally, we identified and cleaned bad data to maintain data integrity. The final loan scores, stored in the processed folder, provide a reliable metric for loan approval decisions.

Stay tuned for more insights and techniques as we continue our journey in data engineering!
