Week 9: Lending Club Project - Part 1 ๐Ÿ’ธ

ยท

7 min read

Week 9: Lending Club Project - Part 1 ๐Ÿ’ธ

Hey there, data engineering folks! ๐Ÿ‘‹

In this two-part series on the Lending Club project, we delve into the process of creating, cleaning and transforming various datasets derived from a large dataset containing over 2 million records. The first part of the series focuses on the creation and initial cleaning of four key datasets: customers_data, loans_data, loan_repayments, and loan_defaulters. These datasets are integral for performing subsequent analyses and operations. By using Spark SQL and various data manipulation techniques, we ensure that the data is well-organized, accurate, and ready for further processing.

So, let's get started!

Problem statement

Financial institutions like banks and lending clubs (peer-to-peer platforms) require to analyze their customer data for calculating the risk factor involved in approving the loan for a particular customer. With this analysis, they can take decisions regarding approving loans, increasing the interest rates and more.

Dataset creation

  • We have a very big file with 118 columns and 2+ million records, out of which we have to create four datasets: customers_data, loans_data, loan_repayments and loan_defaulters.

Creation of customers_data

  • The customers_data will contain information about the borrowers.

  • Using withColumn, we create a new column 'name_sha2' which will later be renamed as emp_id and used as the unique ID for customers.

  • 'SHA-2' (Secure Hash Algorithm 2) is a cryptographic hash function that produces unique hash values for each customer using the information in 9 existing columns of the dataset.

  • Following is the count of records in the whole table and the number of distinct records in the customer table.

  • As we can see in the following image, there are some discrepancies in some customers' data.

  • The outlier that was showing the count as 33 is nothing but some records with all values as null.

  • Using Spark SQL query, we select the information related to the customers and write the output to the required folder, using repartition(1) to save all the data in a single file.

  • The data gets written successfully.

  • We try to read the customers data and it looks as follows.

Creation of loans_data

  • Loans_data will contain details about the loans given by the institution.

  • Using Spark SQL query, we select the information related to loans and write the output to the required folder, using repartition(1) to save all the data in a single file.

Creation of loans_repayments

  • This table, as the name suggests, contains details about any particular customer's loans repayment history.

  • Again, we use Spark SQL to write the data into a single file.

Creation of loans_defaulters

  • Loans_defaulters contains the data about all the customers who have defaulted in paying their loans.

  • Again, we use Spark SQL and repartition to write the data into a single file.

  • Finally, all the four datasets get written into the 'raw' folder and we can perform further operations upon them as per our requirements.

Cleaning of customers_data

  • To give the correct schema, we explicitly mention the schema while creating the customers_raw dataframe.

  • The defined schema is as below.

  • In order to ensure accurate column names, we utilize the 'withColumnRenamed' function.

  • For mentioning the time at which we process the data, we use the 'current_timestamp()' function.

  • As we can see in the following image, there is some difference between the number of total records and the number of distinct records, indicating the presence of duplicate values.

  • To get rid of the duplicate records, we discard the customers whose annual_income is null and store them into a new dataframe.

  • As the 'emp_length' column is in string format and follows a pattern such as '9 years', we replace the non digit values with blank space using 'regex_replace()' function.

  • We cast the 'emp_length' column from string format to integer format.

  • Firstly, we calculate the average employee duration as follows.

  • Then, we replace all the nulls in the 'emp_length' column with the average employee length as follows.

  • After confirming that the above operation was successful, we create a new dataframe out of it.

  • As we can see, there are many address_state values that have more than 2 characters.

  • There are 254 such address_states and we need to get rid of them.

  • In order to achieve this, we replace all the values having more than 2 characters with 'NA', keeping the remaining values unchanged.

  • We can see the improvement in address_state values.

  • We store this cleaned data to the 'cleaned' folder using DataFrame writer API, firstly in parquet format.

  • Further, we store the same cleaned data in CSV format as well.

Cleaning of loans_data

  • To give the correct schema, we explicitly mention the schema while creating the loans_raw dataframe.

  • For mentioning the time when we process the data, we use the 'current_timestamp()' function.

  • The major discrepancy that exists in the current dataframe is the presence of excessive null values, that need to be get ridden of.

  • We define a list of columns which must not have any null values and then, we drop the records if any of the values corresponding to the specified columns are null.

  • We use 'regex_replace' function to replace ' months' in loan_term_months with blank space, then we cast it into integer. Afterwards, we convert the month into year by dividing the value by 12 and casting it to integer again.

  • As we can see below, there are some unnecessary loan_purposes.

  • We define a list, beyond the values of which we rename the loan_purpose values to 'other'.

  • As we can see, there is a clear improvement in terms of loan_purpose values.

  • As an alternative, we can also use the following syntax to achieve the same results.

  • The cleaned data is stored to the 'cleaned' folder, firstly in CSV format.

  • Finally, we store the loans_data in parquet format as well.

Cleaning of loans_repayments

  • To give the correct schema, we explicitly mention the schema while creating the loans_repay_raw dataframe.

  • For mentioning the time when we process the data, we use the 'current_timestamp()' function.

  • We define a list of columns which must not have any null values and then, we drop the records if any of the values corresponding to the specified columns are null.

  • There may arise some cases when total_payment_received is 0.0 but total_principle_received is not 0.0. indicating that the principle has been paid.

  • In such cases, we replace the total_payment_received with the sum of total_principle_received, total_interest_received and the total_late_fee_received, using a 'when' clause.

  • Now, to remove the columns where total_payment_received is zero, we select all the records where total_payment_received is not zero, using the filter condition.

  • Following are some illegal values for the columns that contain dates.

  • In order to replace the illegal values with null, we use the following queries.

  • Finally, we store this data into the 'cleaned' folder in both CSV and parquet formats.

Cleaning of loans_defaulters

  • Let's have a brief look at the loans_defaulters data first.

  • The values for 'delinq_2yrs' look as follows.

  • To give the correct schema, we explicitly mention the schema while creating the loans_def_raw dataframe.

  • As we can see, the values for 'delinq_2yrs' are in float format but we need these in integer.

  • Using below query, delinq_2yrs gets converted into integer and all the nulls get replaced by zero as time period should be either a positive integer or 0.

  • We store the data into two separate dataframes.

    1. The first dataset contains details of customers who missed or delayed payments, based on conditions delinq_2yrs > 0 or mths_since_last_delinq > 0.

  • The count for such customers is:

  1. The second dataset contains information for only those members for whom either there is a public record, public record of bankruptcy or enquiries in the last six months.

  • The count for such customers is:

  • The first dataset is stored into 'cleaned' folder in both CSV and parquet file formats.

  • The second dataset is also stored in the 'cleaned' folder in both CSV and parquet formats.

Conclusion

In this first part of our Lending Club project series, we successfully created and cleaned four essential datasets: customers_data, loans_data, loan_repayments, and loan_defaulters. Through a series of steps involving schema definition, data transformation, and null value handling, we prepared these datasets for future analysis. The careful cleaning and structuring of data ensure the reliability and accuracy of the insights we can derive.

Stay tuned for the next part, where we will delve deeper into analyzing these datasets and uncovering valuable insights.

ย