Data modeling project
Your work in this class will include a project in which you use the ML programming language
to model real-world data or phenomena.
You will pick some domain of information, write a set of ML datatypes that capture
that information, and write a set of ML functions that operate on that information.
As a quick example, in Sections 1.10 and 2.3 of the textbook,
a series of datatypes are used to model meals at a restaurant:
datatype bread = White | MultiGrain | Rye | Kaiser;
datatype spread = Mayo | Mustard;
datatype vegetable = Cucumber | Lettuce | Tomato;
datatype deliMeat = Ham | Turkey | RoastBeef | ExtraVeg of vegetable;
datatype noodle = Spaghetti | Penne | Fusilli | Gemelli | Farfalle;
datatype sauce = Pesto | Marinara | Creamy;
datatype protein = MeatBalls | Sausage | Chicken | Tofu;
datatype entree = Sandwich of bread * spread * vegetable * deliMeat
                | Pasta of noodle * sauce * protein;
datatype salad = Caesar | Garden;
datatype side = Fries | Chips | CarrotSticks | GarlicBread | Salad of salad;
datatype beverage = Water | Coffee | Pop | Lemonade | IceTea;
datatype meal = Meal of entree * side * beverage;
Operations on datatypes like this could include
calculating something from them (say, the calories or price of a meal),
comparing two values of a datatype (if a sandwich could have an indefinite
number of layers, then we could check which of two sandwiches had more stuff),
or modifying a value of a datatype (for example, substituting
meat with something plant-based to make a meal vegetarian).
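To make these three kinds of operations concrete, here is a minimal sketch written against the restaurant datatypes above. It is only an illustration, not code from the textbook: the function names (entreePrice, sidePrice, beveragePrice, mealPrice, vegetarianEntree, vegetarian, moreStuff), the layer and tallSandwich datatypes, and all of the prices are made up for this example.

```sml
(* Datatypes repeated from above so that this sketch loads on its own. *)
datatype bread = White | MultiGrain | Rye | Kaiser;
datatype spread = Mayo | Mustard;
datatype vegetable = Cucumber | Lettuce | Tomato;
datatype deliMeat = Ham | Turkey | RoastBeef | ExtraVeg of vegetable;
datatype noodle = Spaghetti | Penne | Fusilli | Gemelli | Farfalle;
datatype sauce = Pesto | Marinara | Creamy;
datatype protein = MeatBalls | Sausage | Chicken | Tofu;
datatype entree = Sandwich of bread * spread * vegetable * deliMeat
                | Pasta of noodle * sauce * protein;
datatype salad = Caesar | Garden;
datatype side = Fries | Chips | CarrotSticks | GarlicBread | Salad of salad;
datatype beverage = Water | Coffee | Pop | Lemonade | IceTea;
datatype meal = Meal of entree * side * beverage;

(* Calculating: the price of a meal in cents.
   Every price here is an invented number, not real data. *)
fun entreePrice (Sandwich (_, _, _, ExtraVeg _)) = 550
  | entreePrice (Sandwich _) = 650
  | entreePrice (Pasta (_, _, Tofu)) = 700
  | entreePrice (Pasta _) = 800;

fun sidePrice (Salad _) = 300
  | sidePrice _ = 200;

fun beveragePrice Water = 0
  | beveragePrice _ = 150;

fun mealPrice (Meal (e, s, b)) =
    entreePrice e + sidePrice s + beveragePrice b;

(* Modifying: swap meat for a plant-based option.
   A sandwich that already uses ExtraVeg is returned unchanged;
   otherwise the meat slot repeats the sandwich's own vegetable. *)
fun vegetarianEntree (Sandwich (b, s, v, ExtraVeg v2)) =
      Sandwich (b, s, v, ExtraVeg v2)
  | vegetarianEntree (Sandwich (b, s, v, _)) =
      Sandwich (b, s, v, ExtraVeg v)
  | vegetarianEntree (Pasta (n, s, _)) = Pasta (n, s, Tofu);

fun vegetarian (Meal (e, s, b)) = Meal (vegetarianEntree e, s, b);

(* Comparing: a hypothetical sandwich with any number of layers,
   modeled with a list. This datatype is an assumption for the
   example and is not part of the model above. *)
datatype layer = Meat of deliMeat | Veg of vegetable | Spread of spread;
datatype tallSandwich = Tall of bread * layer list;

(* True when the first sandwich has more layers than the second. *)
fun moreStuff (Tall (_, xs), Tall (_, ys)) = length xs > length ys;
```

With these invented prices, mealPrice (Meal (Sandwich (Rye, Mustard, Lettuce, Turkey), Salad Garden, Coffee)) evaluates to 1100, and applying vegetarian to the same meal replaces Turkey with ExtraVeg Lettuce.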
The textbook contains many and diverse examples of datatypes modeling real-world
data and phenomena, including
- Plots of land as used on a farm, Sections 1.(10, 11, 13), 2.3, 3.7
- Library items, Section 1.10 exercises
- Chemical formulas, Section 1.10 exercises
- Structured song lyrics, Section 1.14
- Sentences in (a subset of) English, Sections 2.6, 3.7, 6.3
- Game configurations, Section 4.11
- Geographic entities, Sections 5.(2, 6, 7, 11)
- Academic courses, Section 5.9 (no code given)
There are others in Chapters 8 and 10, not covered in class.
You are permitted to expand on one of these (as long as you take it
in a new, interesting direction), but you are
encouraged to come up with your own domain of information, something that interests
you and that you already have knowledge about.
To get you started thinking, here are some
projects that have been done in previous semesters:
- Taekwondo kick combinations
- Nucleotide and protein sequences
- Class schedules and degree requirements
- Airline flight schedules and itineraries
- Soil types and uses
- The game Quarto
- Cliches in song genres
- Musical notation and other music-theory information
- Customizable options for (science fiction) space ships
- Baseball box scores and player stats
- Routes in a map
- Military equipment
- The characters in Super Smash Bros Melee
- Players and weapons in a duel game
- Baked goods (ingredients etc)
- Ultimate Frisbee teams
- Quarterback stats
- Menu items at a coffee shop
- Swim team meet results
- Weather data
- Organic chemicals and reactions
- NFL team rankings
- Personal scheduling
- Matchmaking
- Theatrical productions
- Orchestras (instruments, players, musical pieces)
- Options and prices in Tesla cars
- Celestial bodies
- Song play-lists
- Ancient Greek philosophers
(For what it is worth, the best projects I have
gotten have been the music theory ones and the chemistry ones.)
Ideally you should work on this with one partner (team of two), but I will
also allow solo projects and teams of three.
Work on this project will be spread throughout the semester
so that you can improve your model and make it more sophisticated as you learn more ML.
This shouldn't be a reason to put it off, however.
For most projects, nearly everything you need to know will be covered in the first few weeks of
the class, so most teams will be able to do the bulk of the work early in the semester.
Your final submission will consist of four parts:
- The ML datatypes modeling this data
- The ML functions operating on this data
- Sample data, that is, values of the datatypes you designed
- A write-up (approximately one page), consisting of
  - A description of the real-world information or phenomenon you are modeling.
  - A description of what you did to model/capture that information and the decisions you made
    in designing the datatypes and functions.
  - The liabilities and weaknesses of your approach (for example, some things that could
    exist in the real world that your model can't capture, or things that your model allows that
    wouldn't occur in the real world).
  - What you learned from this project.
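As a rough illustration of what sample data and a discussion of weaknesses might look like, here is a sketch using the entree part of the restaurant example. The value names, the onMenu function, and the rule that the kitchen never pairs Pesto with MeatBalls are all invented for this example; your own sample data and weaknesses will depend on your domain.

```sml
(* Excerpt of the restaurant datatypes, repeated so this loads on its own. *)
datatype bread = White | MultiGrain | Rye | Kaiser;
datatype spread = Mayo | Mustard;
datatype vegetable = Cucumber | Lettuce | Tomato;
datatype deliMeat = Ham | Turkey | RoastBeef | ExtraVeg of vegetable;
datatype noodle = Spaghetti | Penne | Fusilli | Gemelli | Farfalle;
datatype sauce = Pesto | Marinara | Creamy;
datatype protein = MeatBalls | Sausage | Chicken | Tofu;
datatype entree = Sandwich of bread * spread * vegetable * deliMeat
                | Pasta of noodle * sauce * protein;

(* Sample data: values of the datatypes. *)
val turkeyOnRye = Sandwich (Rye, Mustard, Lettuce, Turkey);
val veggieSub = Sandwich (MultiGrain, Mayo, Tomato, ExtraVeg Cucumber);
val tofuPenne = Pasta (Penne, Pesto, Tofu);
val sampleEntrees = [turkeyOnRye, veggieSub, tofuPenne];

(* Weakness of the first kind (the model cannot capture something real):
   every Sandwich carries exactly one spread and one vegetable, so a
   sandwich with no spread, or with both lettuce and tomato, has no
   representation in this model. *)

(* Weakness of the second kind (the model allows something unreal):
   ASSUMED rule, invented for this example: the kitchen never serves
   Pesto with MeatBalls. The type system still accepts this value. *)
val oddPasta = Pasta (Spaghetti, Pesto, MeatBalls);

(* One way to compensate is a check performed at run time. *)
fun onMenu (Pasta (_, Pesto, MeatBalls)) = false
  | onMenu _ = true;

(* Keep only the entrees that the assumed rule permits. *)
val offered = List.filter onMenu (oddPasta :: sampleEntrees);
```

In the write-up, each weakness like these would be described in a sentence or two, along with whether you chose to live with it or to compensate with a function such as onMenu.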
The schedule for this work will be
- Team sends me a proposal by email no later than Fri, Feb 16.
I will make sure your proposal is at the right level of difficulty and suggest
which ML features and discrete math concepts from throughout the semester will
be useful to you in this project.
It is likely that some proposals will need some modification or further development,
so sending a proposal earlier, even weeks earlier, wouldn't be a bad idea; Feb 16
should be the absolute latest.
I can provide some assistance in finding a project topic and "matchmaking"
team partners, but I don't want people coming to me on Feb 15 and saying
"I don't have a partner or a topic."
- Team sends me a prototype of the ML code by email no later than
Monday, April 2.
This should be everything except the write-up, complete as far as you can tell.
I will respond with improvements or corrections I would like you to make.
If you want more time to work on those improvements, then send me the prototype earlier.
- Team sends the final version of the ML code and the write-up (PDF is preferred) by
email no later than Fri, Apr 27, which is the last day of classes.
If you don't want this project to interfere with the other things you'll need to finish
up that week, then send me the final version earlier.
Thomas VanDrunen
Last modified: Wed Mar 21 11:45:40 CDT 2018