Recently, QBA was asked to present a proof of concept (PoC) for a client in the utility industry, with the end goal of drawing insights from the data they provided us. Without getting into specifics, this ask is one of the “purest” forms of analytics we do: “Here is some data. What can you tell us about it?”
With such an open-ended question, the team's ability to quickly iterate on and manipulate the data is crucial. In industry especially, time for “science experiments” is limited, so the faster data can be sifted and analyzed, the sooner we can provide value to the client.
A GOOD DATA PIPELINE
Of course, the foundation of rapidly iterating through data is a good data pipeline. Either the person doing the analysis has the skillset to build and maintain one or, ideally, a data engineer provides that value by managing the infrastructure. At QBA, our data engineering team manages our database needs so that our data analysts can do what they do best: analyze data.
USING OUR SKILLSET TO GET RESULTS
The data analysis we do here is amplified by QBA's decades of experience working with utility companies and their data. This experience gives us an edge in identifying which data features are significant, making our process more focused and efficient. Additionally, our results-oriented modeling approach prioritizes building a model that answers the client's question over using cutting-edge techniques for their own sake. In this PoC, we compared different versions of a bagged Random Forest model, each trained on a different set of features. We chose Random Forest for its robustness to outliers and its fast training time, since our focus was on discovering insights within a quick turnaround. We also discussed other approaches, such as a neural network or a time series model. Both are useful but did not meet our client’s needs: a neural network is still largely a black box, which conflicts with the utility industry’s need for interpretability, while a time series model would have pushed the project past its deadline without necessarily yielding additional insights.
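To make the comparison described above more concrete, here is a minimal sketch of scoring Random Forest models on different feature sets with cross-validation. It is not the code used in the PoC: the data is synthetic, the feature names (`load`, `temperature`, `asset_age`, `noise`) are hypothetical, and treating the problem as regression scored by R^2 is an assumption made only for illustration. It uses scikit-learn's `RandomForestRegressor` and `cross_val_score`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the client's data is not shown in this post.
rng = np.random.default_rng(0)
n_samples = 300
X = rng.normal(size=(n_samples, 4))
# Hypothetical feature names, for illustration only.
feature_names = ["load", "temperature", "asset_age", "noise"]
# The target depends on the first three columns; the fourth is pure noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n_samples)

# Candidate feature sets to compare (column indices into X).
feature_sets = {
    "load_only": [0],
    "load_temperature": [0, 1],
    "all_features": [0, 1, 2, 3],
}


def compare_feature_sets(X, y, feature_sets, n_estimators=100, seed=0):
    """Return the mean 5-fold cross-validated R^2 for each feature set."""
    scores = {}
    for name, cols in feature_sets.items():
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        fold_scores = cross_val_score(model, X[:, cols], y, cv=5, scoring="r2")
        scores[name] = fold_scores.mean()
    return scores


scores = compare_feature_sets(X, y, feature_sets)
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")

# Impurity-based feature importances are one way a Random Forest stays
# inspectable, which is part of why it suits interpretability-minded clients.
full_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(feature_names, full_model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: importance = {value:.3f}")
```

Because each candidate is just a list of columns, adding or dropping a feature set is a one-line change, which suits the kind of quick iteration a short PoC calls for.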
Though we implemented many model parameters and data features, many other ideas were deferred to future work. These include additional hyperparameter tuning and further feature engineering to improve model accuracy. A different ensemble model, such as Gradient Boosted Regression Trees, could also be considered if the optimized Random Forest model turns out to be less accurate than expected.
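As a rough illustration of what that future work might look like, the sketch below tunes a few Random Forest hyperparameters with a grid search and then scores a gradient boosted model on the same folds for comparison. Everything here is an assumption for the sake of the example: the data is synthetic, the parameter grid is arbitrary, and scikit-learn's `GridSearchCV` and `GradientBoostingRegressor` stand in for whatever tooling a follow-up project would actually use.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data, not the client's data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

# A small, arbitrary grid; a real search would be guided by the data.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

# Exhaustive search over the grid, scored by 3-fold cross-validated R^2.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)
best_params = search.best_params_
rf_score = search.best_score_

# Gradient boosted trees with default settings, scored the same way.
gbrt = GradientBoostingRegressor(random_state=0)
gbrt_score = cross_val_score(gbrt, X, y, cv=3, scoring="r2").mean()

print("Best Random Forest parameters:", best_params)
print(f"Tuned Random Forest mean R^2: {rf_score:.3f}")
print(f"Gradient boosting mean R^2:   {gbrt_score:.3f}")
```

Scoring both models with the same metric and fold count keeps the comparison fair, so a switch to gradient boosting would be justified by measured accuracy rather than by novelty.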
After being presented with the results, the client was both impressed and interested in expanding the project. As of this post, QBA is in talks to expand the initial proposal into a long-term effort that promises to reveal insights of greater value to the client by pursuing the future-work ideas that could not be implemented within the constraints of the original PoC.