Project: Refining Your Questions

As project work starts, many students have difficulty working from their original idea to a real project.

First, remember that there will be limits to what you can explain (or predict or confirm or ...). Most real-world situations have a lot going on that isn't explained by any possible CSV file.

What to ask

When approaching a data set with many columns/variables/features, I'd suggest thinking about "inputs" and "outputs" in whatever world you're examining. Imagine you are the person producing these values (the client, author, creator, policy-maker, ...): which ones are under your control, i.e. the independent variables? And which are not under your control, i.e. the dependent variables (sales numbers, user ratings, largest result, ...), where you would like to get "good" values?

If you look at your data with stats, you might ask whether the dependent variables are related to the independent ones: a linear regression if there might be a linear relation, or a T-test if there are two choices whose approximately-normal dependent values can be compared, etc. What might that tell you about the choices that should be made by the "client"?
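As a rough illustration of those two tests, here is a minimal sketch using scipy.stats. The data is synthetic and the names (price, sales, ratings_a, ratings_b) are made up for the example; with real data you would also want to check that the assumptions of each test are reasonable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 'price' is an input the client controls, 'sales' is the outcome.
price = rng.uniform(5, 20, size=200)
sales = 500 - 15 * price + rng.normal(0, 20, size=200)

# Linear regression: is the outcome linearly related to the input?
fit = stats.linregress(price, sales)
print(f"slope={fit.slope:.2f}, r={fit.rvalue:.2f}, p={fit.pvalue:.3g}")

# T-test: do two choices (say, two page layouts) give different mean outcomes?
ratings_a = rng.normal(3.5, 0.8, size=100)
ratings_b = rng.normal(3.9, 0.8, size=100)
ttest = stats.ttest_ind(ratings_a, ratings_b, equal_var=False)  # Welch's T-test
print(f"t={ttest.statistic:.2f}, p={ttest.pvalue:.3g}")
```

A small p-value from the regression suggests the input and outcome are related; a small p-value from the T-test suggests the two choices lead to different average outcomes.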

If you think of it as ML, is it possible to predict the dependent variables from the independent ones? You could treat this as a regression and try to predict the values directly, or reduce them to "good" and "bad" and turn it into a classification problem.
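Both framings can be sketched with scikit-learn. This is only an illustration on synthetic data: the choice of a random forest, the two made-up input features, and the median as the "good"/"bad" cutoff are all assumptions for the example, not a recommendation for your data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical inputs (two features the "client" controls) and a numeric outcome.
X = rng.uniform(0, 10, size=(500, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# As a regression problem: predict the outcome value itself.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
reg_score = reg.score(X_valid, y_valid)  # R^2 on held-out data

# As a classification problem: reduce the outcome to "good"/"bad" at the median.
threshold = np.median(y_train)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train > threshold)
clf_score = clf.score(X_valid, y_valid > threshold)  # accuracy on held-out data

print(f"regression R^2: {reg_score:.2f}, classification accuracy: {clf_score:.2f}")
```

Scoring on held-out data matters here: a score computed on the training data would say little about whether the outcome is actually predictable.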

Asking More

For many projects, you will spend a lot of your time getting into the data set: figuring out the format, fields, limitations, etc. Once you have done that, answering "your question" might be a very small amount of work.

If that's the case, consider looking at the data another way and adding another question or two. This can be a very good cost-benefit ratio: you have already done the work to understand the data, so analyzing it in another way doesn't take much more time.

Explaining ML Results

If you use a tree-based model (random forest, gradient-boosted trees), the trained model will have a .feature_importances_ attribute (that is the scikit-learn name) which gives the relative weight of each feature in the model's predictions.

This can be very useful in explaining why decisions are being made and which features actually matter, which makes it good material for your report when summarizing what's really going on.
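A minimal sketch of reading the importances from a scikit-learn random forest follows. The feature names and the synthetic data are hypothetical: the outcome depends strongly on the first column, weakly on the second, and not at all on the third.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

feature_names = ["price", "ad_spend", "noise"]  # hypothetical column names
X = rng.uniform(0, 10, size=(500, 3))
# The outcome ignores the third column entirely.
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ has one value per input column; the values sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

These values are relative, not absolute: they rank the features within this one model, so they are best reported alongside the model's validation score.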

Updated Fri Aug. 01 2025, 13:18 by ggbaker.
