College Scorecard Analysis

March 22, 2023

As a submission for an internship with the Carroll and Milton Petrie Foundation, I did some exploratory analysis of the College Scorecard dataset, which contains a wide range of information about colleges in the United States. The full documentation listed on College Scorecard can be found here. The original dataframe derived from the dataset had too many columns to work with.

Skipping over a lot of dimensionality reduction and missing-value cleanup, I settled on the main columns I wanted to work with.
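The column selection step could look roughly like the sketch below. The raw column names are my assumption of how the College Scorecard fields map onto the names used in this post (the debt column in particular is a guess), and `select_columns` is a hypothetical helper, not the code that was actually run.

```python
import pandas as pd

# Assumed mapping from raw College Scorecard names to the names used in this post.
# Which debt field was used is not stated in the post; DEBT_MDN is a guess.
COLUMN_MAP = {
    "REGION": "region",
    "INSTNM": "institution",
    "STABBR": "state",
    "CITY": "city",
    "PCTFLOAN": "fed_loan_percentage",
    "PCTPELL": "pell_grant_percentage",
    "DEBT_MDN": "median_student_debt",
    "UGDS": "total_students",
    "UGDS_MEN": "ugds_men",
    "UGDS_WOMEN": "ugds_women",
    "UGDS_WHITE": "ugds_white",
    "UGDS_BLACK": "ugds_black",
    "UGDS_HISP": "ugds_hisp",
    "UGDS_ASIAN": "ugds_asian",
    "UGDS_AIAN": "ugds_aian",
    "UGDS_NHPI": "ugds_nhpi",
    "UGDS_2MOR": "ugds_2more",
    "UGDS_NRA": "ugds_nr",
    "UGDS_UNKN": "ugds_unkn",
}
TEXT_COLUMNS = ["institution", "state", "city"]


def select_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep only the mapped columns, rename them, and coerce numeric fields."""
    present = [c for c in COLUMN_MAP if c in raw.columns]
    out = raw[present].rename(columns=COLUMN_MAP)
    numeric = [c for c in out.columns if c not in TEXT_COLUMNS]
    # Non-numeric placeholders (for example "PrivacySuppressed") become NaN
    out[numeric] = out[numeric].apply(pd.to_numeric, errors="coerce")
    return out
```

Coercing to numeric up front means suppressed or malformed values show up as `nan`, as in the Charleston School of Law row of the sample.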

df.sample(5)

	region	institution	state	city	fed_loan_percentage	pell_grant_percentage	median_student_debt	total_students	ugds_men	ugds_women	ugds_white	ugds_black	ugds_hisp	ugds_asian	ugds_aian	ugds_nhpi	ugds_2more	ugds_nr	ugds_unkn
4824	5	Charleston School of Law	SC	Charleston	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
1379	2	Morgan State University	MD	Baltimore	0.7861	0.5582	18359	6263	0.3944	0.6056	0.0164	0.8258	0.0442	0.0057	0.0016	0.001	0.035	0.0398	0.0305
1425	1	Clark University	MA	Worcester	0.6256	0.2316	23000	2205	0.3787	0.6213	0.6177	0.0481	0.0971	0.0707	0.0005	0.0005	0.0327	0.0916	0.0413
2717	8	Portland State University	OR	Portland	0.3452	0.5362	16166	16596	0.44	0.56	0.5101	0.0398	0.1864	0.0987	0.0116	0.0066	0.0674	0.0375	0.0421
5070	6	The Art Institute of San Antonio	TX	San Antonio	0.5556	0.5694	19000	374	0.4492	0.5508	0.254	0.1925	0.4545	0.0134	0.0107	0.0027	0.008	0.0241	0.0401

Region key:

	Region
1	New England Division
2	Middle Atlantic
3	East North Central
4	West North Central
5	South Atlantic
6	West South Central
7	Mountain
8	Pacific
9	Virgin Islands, Puerto Rico, American Samoa

The first question I wanted to answer in this exploratory analysis is: what is the average racial distribution of the student body in each region of the United States?

demographics = ['ugds_white', 'ugds_black', 'ugds_hisp', 'ugds_asian', 'ugds_aian', 'ugds_nhpi', 'ugds_nr', 'ugds_2more']

# Average share of each demographic per region (unweighted mean across institutions)
df_all_race_demo_region_percentage = df.groupby('region')[demographics].mean().reset_index()
# Convert fractions to percentages
df_all_race_demo_region_percentage[demographics] *= 100

df_all_race_demo_region_percentage

	region	ugds_white	ugds_black	ugds_hisp	ugds_asian	ugds_aian	ugds_nhpi	ugds_nr	ugds_2more
0	1	57.603	11.6097	13.5798	4.1663	0.388457	0.124373	3.36936	3.49817
1	2	53.8537	19.2548	13.3173	4.35984	0.284763	0.190558	2.4026	2.78183
2	3	59.641	19.4984	9.67792	2.45622	0.785451	0.175403	1.46493	3.19042
3	4	66.6972	10.4764	8.03708	2.32747	3.91095	0.198337	1.95838	3.42687
4	5	46.6168	31.2633	11.474	1.78546	0.473513	0.23736	1.72444	2.67568
5	6	33.6464	16.2351	34.5133	3.00924	3.82658	0.248525	1.19305	3.7166
6	7	62.4206	3.97608	15.7429	2.92041	3.84135	0.704865	1.1418	3.60077
7	8	31.7099	7.25795	33.6544	10.5413	1.31536	1.1694	3.10716	5.22554
8	9	0.181812	0.772101	92.9425	1.05406	0.03	4.29493	0.245145	0.0807971
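The averages above treat every institution equally, regardless of size. As a point of comparison, here is a minimal sketch of an enrollment-weighted version that uses `total_students` as the weight. `weighted_region_shares` is a hypothetical helper, and it drops any institution with a missing share or enrollment, which differs slightly from the per-column handling of missing values in the plain mean.

```python
import pandas as pd


def weighted_region_shares(df, share_cols, weight_col="total_students"):
    """Enrollment-weighted mean share (in percent) of each group per region."""
    # Drop institutions with missing enrollment or shares so weights and values line up
    clean = df.dropna(subset=[weight_col, *share_cols])
    # Convert each share into an estimated head count per institution
    counts = clean[share_cols].mul(clean[weight_col], axis=0)
    counts["region"] = clean["region"]
    # Sum head counts and enrollment per region, then divide
    totals = counts.groupby("region")[share_cols].sum()
    enrollment = clean.groupby("region")[weight_col].sum()
    return totals.div(enrollment, axis=0).mul(100).reset_index()
```

For example, a region with one school of 100 students at 20% and one of 300 students at 60% averages 40% unweighted but 50% when weighted by enrollment.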

To better visualize the student body race distribution in each region, I made a bar graph for each race showing its average percentage per region.

import matplotlib.pyplot as plt
import seaborn as sns

columns = ['ugds_white', 'ugds_black', 'ugds_hisp', 'ugds_asian', 'ugds_aian', 'ugds_nhpi', 'ugds_nr', 'ugds_2more']
fig = plt.figure(figsize = (20, 9))
fig.suptitle('student body race demographics by region')
for i, column in enumerate(columns):
    # One panel per demographic, laid out on a 2 x 4 grid
    ax = plt.subplot(2, 4, i + 1)
    sns.barplot(data = df_all_race_demo_region_percentage, x = 'region', y = column, ax = ax)
    ax.set_title(column)
    ax.set_ylabel('percentage')
    ax.set_ylim(0, 100)  # same scale on every panel so they are comparable
plt.tight_layout()

It is worth noting that white students make up the largest share of the student body on average in six of the nine regions. The exceptions are West South Central and Pacific, where Hispanic students lead by a narrow margin, and the US Virgin Islands/Puerto Rico/American Samoa region, where Hispanic students are the overwhelming majority. Even in West South Central, where the white share is relatively low (about 33.6%), you can add up every other race demographic average, leaving out Hispanic, and the total (about 28.2%) still does not surpass the percentage of white students.
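The West South Central comparison can be checked by hand. The sketch below copies the region 6 row from the table of regional averages and sums every group other than white and Hispanic.

```python
# Regional averages (in percent) copied from the West South Central row (region 6)
west_south_central = {
    "ugds_white": 33.6464,
    "ugds_black": 16.2351,
    "ugds_hisp": 34.5133,
    "ugds_asian": 3.00924,
    "ugds_aian": 3.82658,
    "ugds_nhpi": 0.248525,
    "ugds_nr": 1.19305,
    "ugds_2more": 3.7166,
}

# Sum every group except white and Hispanic
others = sum(
    value
    for name, value in west_south_central.items()
    if name not in ("ugds_white", "ugds_hisp")
)

print(f"white: {west_south_central['ugds_white']:.2f}%")
print(f"all other groups except Hispanic: {others:.2f}%")
print(others < west_south_central["ugds_white"])
```

The combined share of the other groups comes to roughly 28.2%, below the white share of roughly 33.6%.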

This invites a closer look at the dataset. Next, I look at whether there is any correlation between financial aid availability and race demographics across regions.

# Select several columns with a list; tuple-style selection no longer works in recent pandas
df_financial_aid_by_region = df.groupby('region')[['fed_loan_percentage', 'pell_grant_percentage', 'median_student_debt']].mean()
df_financial_aid_by_region.reset_index(inplace = True)

# Join the two regional summaries on their shared 'region' column
df_merged_finaid_allrace = df_all_race_demo_region_percentage.merge(df_financial_aid_by_region, on = 'region')
sns.heatmap(df_merged_finaid_allrace.corr(), annot = True, cmap = 'Greys')

Here are some observations from the correlation heatmap:

1) The share of students who receive federal loans correlates most strongly with the share of white students. This is consistent with the strong correlation between the white student share and median student debt.

2) Native Hawaiian/Pacific Islander and Hispanic students seem to be the only two main student bodies whose share correlates positively with the Pell Grant percentage.

3) The non-resident student share has the second-strongest correlation with the federal loan percentage, which could be attributed to the low percentage of non-resident students overall.
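To read these relationships off without scanning the full heatmap, the correlation matrix can be cut down to just the demographic rows and the financial aid columns. This is a minimal sketch; `aid_vs_demographics` is a hypothetical helper. With only nine regional rows behind each coefficient, a single region (such as region 9) can move the values a lot, so they are best treated as rough indications.

```python
import pandas as pd


def aid_vs_demographics(merged, demographic_cols, aid_cols):
    """Pearson correlations between each demographic share and each aid measure."""
    corr = merged[list(demographic_cols) + list(aid_cols)].corr()
    # Rows: demographic shares; columns: financial aid measures
    return corr.loc[list(demographic_cols), list(aid_cols)]


# Possible usage with the dataframes from this post:
# aid_vs_demographics(
#     df_merged_finaid_allrace,
#     columns,
#     ['fed_loan_percentage', 'pell_grant_percentage', 'median_student_debt'],
# )
```

The resulting table has one row per demographic and one column per aid measure, and it can be passed to `sns.heatmap` in the same way as the full matrix.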