Analysis of The United States Automotive Market

Rahul Jasmin & Krishna Gajjala

Introduction

In America, automobile prices slightly fluctuate at all times throughout the year for various reasons, but significant changes in the automobile market can be analyzed and interpreted in order to determine the effects of large-scale events on the car market. Data on the automobile market is fairly extensive but is usually only compared when the cars are new and compared among each other to determine which car is superior in certain aspects. The goal of this project is to analyze used car data along with new car sales data to better understand the effects large scale events, especially the recent COVID-19 pandemic, on the car market and to see if any differences or trends in the analysis are noteworthy.

Why is this important?

Data from analyzing the car market is useful in determining what factors can affect the car market and finding any trends in both used and new cars sold. This information can be extremely important and helpful to both manufacturers in determining the best way to improve profits and to consumers in determining which car to purchase.

These are only some aspects of how data science is useful in the automobile industry, you can find some more information here: https://www.datacamp.com/community/blog/automotive-data-science

Goal

The goal of this tutorial will be to obtain used car information of previous years and sales data of recent years in order to generate graphs and models that can help visualize the direction and key aspects of automobile market in America

Data Collection

Used Car Data

To start this project, we will need to import some libraries in order to collect the necessary data. If any of these library packages are not installed already, just use following code to install them:

pip install {package name}

Then we will download the used car csv data from kaggle.com, specifically https://www.kaggle.com/austinreese/craigslist-carstrucks-data. This data is raw postings of used cars from craigslist that needs to be modified/proccessed before we can reliably use it for data analyzing.

We make use of Pandas and NumPy to manipulate our data in dataframes. Below is the documentation of these libraries that provide more in depth functionality notes:

Pandas: https://pandas.pydata.org/docs/index.html

NumPy: https://numpy.org/doc/stable/index.html

First, we read the data from vehicles.csv into a dataframe

New Car Data

Next, we will scrape www.goodcarbadcar.net using the requests library in order to gain sales records of cars in 2019 and 2020 and up to Sept 2021. Use following header to act as a browser when making request in order to have access to the site to scrape data

Data Processing

Used Car Data

We need to change the type of posting date to datetime, remove unnecessary columns, remove blank cells, and create a new column called supertype for better analysis We accomplish all of this in the following code blocks.

Sales Data

To process and clean the sales data, we need to remove junk rows, convert object type data into integers and strings and split up the car's name into its make and model. We will also be discarding the 2021 sales data past September, as those are still being uploaded and not completely accounted for. Currently, the data is formatted so that each row is a car, and each column is its monthly sales. Instead, we will be making each row a singular sale observation, so we will need to melt the data. Finally, we will merge all of the sales data into one master dataframe.

Exploratory Analysis

Used Car Data: Visualizing the Price of Used Cars Over Time

Looking at the average price of a used car posting per year, I expected an increase in price per year, but did not expect the average prices to be so high in the recent years. 2019 used car postings average price is above 30,000 dollars which seems very high for a used car.

New Car Data: Visualizing Sales Over Time

Now that we have our data properly processed, let's take a look at some charts to explore some unique characteristics of this dataset. The first chart with this dataset will visualize the sales data for new cars from 2019 until September of 2021.

From this graph, we can see that the Ford F-Series is consistently the most sold vehicle in the U.S. We can also see that there was a major dip in most vehicle sales in March and April of 2020, especially for the highest selling vehicles. This was about the time that the quarantine started and COVID-19 was officially recognized as a Pandemic and National Emergency.

More about that timeline can be found here: https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020

Unfortunately, because of the sheer volume of data, it is difficult to discern the sales data of each individual car, especially for vehicles with lower sales. So, let's aggregate the data and see the total sales data for each individual brand over time.

From this graph, we can see that Ford, Toyota, Chevrolet, Honda, and Nissan are consistently the top 5 best-selling automotive brands in the U.S. It is also clear that the sales dip in March and April of 2020 affected pretty much every brand. Because this difference is so apparent in these higher selling brands, and the sheer volume of their sales make many other brands' sales indiscernible, it might be a good idea to focus on just these 5 major brands.

Regression and Analysis

New Car Data: Predicting Future Sales

Next, we will attempt to predict 2021 sales data using the 2019 and 2020 sales data. We will use linear regression to achieve this. Additionally, because each company only has one sale volume for each time period, and because the data we have is unnaturally affected by covid, we lack enough data to make accurate predictions for each company. Instead, we will predict average sale volume for 2021, and compare our results to the average of the real 2021 sale volume. We will focus only the 5 major brands discussed previously, and this way we will have a large enough sample size for the result to be considered statistically significant, while also reducing the variation of the data.

Now, let's generate a linear regression. First, we have to quantify the date into a numeric format, so that it can be used as X in our regression. We will use decimal years for each month, such as 2019.5 for June of 2019.

Next, we generate the regression using the new quantified dates and the sales. We will use linear regression using the least squares function (ols), from the stats library. Because we are using the 2019 and 2020 data to predict 2021 data, the master dataframe can be used for training.

Unfortunately, from the model summary, we can see that the p-value, noted by the P>|t| column, is just shy of 0.05. A p-value that is less than or equal to 0.05 signifies that there is strong evidence of a correlation, but unfortunately, we were only able to obtain a p-value of 0.06.

We can see in the residuals that most of the sales data varies significantly as each brand in the top 5 generally performs at its own level, such as how Ford always has much better sales numbers than Nissan. However, the residual for April 2020 is much smaller and tighter, while also being the farthest away from 0, demonstrating how unusual the sales figures were at the peak of the pandemic. Normally, we would want to remove such an outlier to improve the regression, however, removing the April 2020 sales data would reduce the statistical significance of the regression as it has comparativley much less variance than the sales at other dates.

Next, we will compare the predictions made from the regression to the real 2021 data.

As we can see from the plot, the predictions unfortunately don't correlate well with reality. However, this is to be expected, as we saw massive volatility in sales in the initial exploratory graphs in 2021. Unfortunately, this regression doesn't fit well with this data, however, it would likely perform much better with a dataset that includes total sales for each year rather than for each month, as there would likely be much less volatility, and a model would be able to make a proper prediction.

Additionally, the regression predicted that sales would likely increase based on previous sales, even though the pandemic caused a major dip in sales. Although this is the opposite of what has happened so far, it is likely that sales will continue to steadily go up in the future and for the year as a whole as long as another pandemic doesn't hit us.

The average for the predictions were relatively close to the average for the real 2021 sales. Additionally, we can see how volatile the real sales are with its massive standard deviation in comparison to the standard deviation of the predictions. This kind of regression would do much better with average sales, likely because of how varying the sales of each brand is.

Used Car Data: Predict The Cost of a Used Car

With the information in the vehicles data, we can predict the price of a used car based on its variables such as year, make, and supertype(car, truck, SUV, other). This is accomplished through multivariable linear regression with least squares that is available in the statsmodel library. The regression will be generated by using 90 percent of the data for training and the other 10 percent for testing.

From the model summary, we can see that these variables are a good indicator of price as their p-values, noted by the P>|t| column, are mostly 0 to 0.05 for each individual component of each variable suggesting that their impact on price is different from a 0-coefficient value. There are some NAN components but that is mostly just makes that don't contain certain supertypes, which is not ideal but fine since then the price will be predicted based on the variables separately making it less accurate but still within a certain range. However, the worst indicator for price seems to be the TRUCK supertype as all its p-values are very large, nearly 1. We can test this effect by running both with and without the supertype truck for our predictions.

The F statistic of our first model seems much higher than the one of model 2, therefore our first model was better for testing. The F values can be determined through the following code.

Now we will plot residuals of our models.

From the residuals we can see that as the year increases, the residuals are closer to zero which is ideal, but the variance in the residuals is still very high suggesting that the model is not able to accurately represent the high variance in the data itself.

Next, we test our regression model with the remaining 10 percent of data.

On average, how far off we are on our estimates from the actual price is -11.4 with a standard deviation of 7571.3. The price mean is 19088.03 with a standard deviation of 13045.16. Considering these factors, most of our estimates are good since at least 68% of them are within 8,000 of the actual price which is less than 1 standard deviation of the price. These values can be calculated from the following code block.

The biggest difference in our estimated versus the actual price was -60778.3. This corresponds to the 2005 dodge viper convertible which still hold a high value in comparisons to other cars in that make and year because its been discontinued and the only way to get one is by buying a used version. Its original MSRP was $81k according to cars.com. These statistics can be seen through the following code.

Used Car Data: Price vs Posting Date

We can also check how the prices of the same make and model car changes over later posting dates. First, we need to groupby the make, model, and year of a car and limit the rows to only rows that contain more than 1 posting date.

Then we can determine the change in price from the first posting date and last posting date over time by using the slope from those two points. We can generate a new column called slope to hold this value.

First, we remove impossible/extremely unlikely slopes, such as slopes over 100 and below -100 since these slopes would indicate price changes more than the total price of the vehicle within one year. With the remaining results, the average price of a used car put on sale barely changes at all with a 0.0016 dollars per day increase till its latest posting which does not seem that bad at first, but the standard deviation of the slope is 35.92, essentially putting most cars to either gain or lose within 36 dollars per day on their used car with an incredibly small amount gain more than lose. The gains are most likely cars that can stand the test of time in terms of quality and some higher end cars that become classics or have good reviews and are sold for higher profits later. The losses are just cars that gradually lose value over time as the miles on the car increase and the ratings of the car are not stellar. These values can be accessed through the following code block.

Conclusions

Overall, we found that the price and sales of vehicles are steadily going up, despite the major set back that was the COVID-19 pandemic. The car market is recovering and maybe the world will too.

With the new car data, we didn't have an effective model at predicting future sales because of the massive variation in sales. However, we found that the regression would worked well for finding average car sales and would likely work better for sales data in a longer time frame. We also saw the impact the pandemic had on all sales, as the residual data showed that April of 2020 significantly reduced the variance and average of all car sales.

Through used car data, we were able to determine important factors of a cars price, specifically make, year, and type of car contributing heavily to the price of the car. Specifically, the car type of SUV seems the hardest to predict with this dataset as it contains very high used car prices in later years compared to earlier years. With this analysis, one can determine what their current car could be valued on craigslist given its make, year, and type to determine whether it is worth it to consider selling their current used car for a new car.

The information about price of used cars that had multiple posting dates showed that on average the price does not change over the years for a car with the same make, model, and year. This is helpful for people on the fence about what time to put their used car up on sale because the analysis shows that on average, it does not affect the price and that you should check for reviews on its longevity and efficiency to determine if it will go up or down.

We hope that this tutorial taught you something about the data science pipeline. By looking at two unique sets of data, we wanted to demonstrate some of the variety in data science applications, and how some models may fail while others may succeed.