Unit 5. Data Manipulation
Revision Date: Jan 12, 2020 (Version 3.0)Summary
Big Data has been defined in many different ways. Easy access to large sets of data and the ability to analyze large data sets changes how people make decisions. Students will explore how Big Data can be used to solve real-world problems in their community. After watching a video that explains how Big Data is different from how we have analyzed and used data in the past, students will explore Big Data techniques in online simulations. Students will identify appropriate data source(s) and formulate solvable questions.
Students are introduced to parallel and distributed programming and to their use with big data.
Journal: How can a computer gather data from people? (Think-Pair-Share)
Remind students of the mind guessing game: http://en.akinator.com/ or 20 questions http://www.20q.net/
Discuss: How can the computer learn from people when playing one of these games? How many different answers do you think it could possibly know?
Teacher note: students are not expected to actually play this game during class.
Read The Rise of Big Data in chunks: An Introduction to “Big Data” (20 mins) Reading can be found at: http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data
Show the first 3-5 minutes of this clip. (It becomes a bit dry, so just show the amount that is appropriate for your students to get the idea): https://www.youtube.com/watch?v=7D1CQ_LOizA
Some examples of how big data is used appropriately:
Some examples of how big data was inappropriately used:
Students are to pick three topics they want to research that use big data. It is preferred that these topics relate to something learned this year in the course (e.g., the need for IPv6). Tomorrow, as the students enter class, they will sign up on a list with their chosen topic. Since the students will have three options, it is likely they will get one of their selected topics to research.
Journal: Think about your daily and weekly activities. What types of data are being stored about you?
Remind students to think about what they do online, in stores, while in a car, etc.
Review the steps to processing Big Data:
As a class, walk through these steps using the two files in the lesson resources folder (FailedBanklist.csv & Consumer_Complaints.csv)
Step 1.
Demonstrate how files such as these can be obtained at http://catalog.data.gov/dataset
Formulate questions such as:
Step 2.
Extract data source into a format supported by underlying tools
Open one of these files in Notepad (or some simple editing program such as Notepad++) and demonstrate how the actual data itself is separated by commas, thus the file name “csv” for comma-separated value.
Open both files in Microsoft Excel. Complete a find for the bank name “Banco Popular de Puerto Rico” on both lists. You may want to first sort the data by bank name to find this bank or you can use CTRL + F to find the bank name (see screenshots below).
Steps 3 & 4.
Normalize data (remove redundancies, irrelevant details)
In this step, there is technically no need to remove redundancies or irrelevant details but you can show the students how you could remove data or limit the data to a particular data set. For example, if were to want to look at only the banks from Maryland, you can use the filter tool to only view those banks from MD.
Make data uniform (user-entered data may include abbreviations, spelling errors, or inconsistent capitalization without changing the meaning of the data)
Cleaning data: Depending on how the data was collected, it may not be uniform. Therefore the data may need to be cleaned before it can be processed. Cleaning the data is the process that makes the data uniform without changing its meaning. For example, replacing all equivalent abbreviations, spellings, and capitalizations of the same word.
Step 5.
Import data into the tool
Right now the file type is as a csv file. By resaving the file as a .xlsx file it becomes a true spreadsheet file.
Step 6.
Perform analysis
We have determined that the bank “Banco Popular de Puerto Rico” is on both lists. Now ask the students “Why is this bank on both lists?” Note: On the Failed Bank list the Banco Popular de Puerto Rico is actually an acquiring institution. By looking more closely at the dates of the acquisition of the failed bank “Westernbank Puerto Rico” one can formulate some possible deductions that maybe the reason “Banco Popular de Puerto Rico” is on the complaint list is because they had recently taken over a failed bank. It could be possible that some of these complaints were related to this recent acquisition.
Step 7.
Visualize Results
Explain to students that they will learn more about visualizing their results in Unit 6. They can complete graph visualization in excel. Show them the website: http://www.gapminder.org/. Explain that even though a visualization in excel is not interactive like http://www.gapminder.org/, they can complete some form of visualizing their data by using a spreadsheet. Note: http://www.gapminder.org/ is VERY attention-grabbing. Only briefly show the students what they can do with it (see how data changes over time, look at many different data sets and download data in different forms - including csv and xlsx formats).
Students should research their selected topics from homework. Some possible websites for finding data are listed above under “Possible good resource(s) for data collection.”
Students are to get your approval for a topic and then use the Big Data Sets Worksheet in the Lesson Resource Folder to find big data sets that are related to the approved topic.
Students are to review using http://www.gapminder.org/ looking specifically at life expectancy. Students will write one question after “playing” the timeline of life expectancy using gapminder on an exit slip before leaving class. For example, one may write “Why is the life expectancy of countries such as Denmark, Sweden, & Norway typically higher than other countries throughout most of the timeline?”
Challenge students to work in teams of four to find:
After 2 min stop students and ask them to share how they approached the problem.
Say: You split the task into pieces, and each person worked at the same time to get the job done more quickly than would be possible by yourself. This is parallelism. In computing, parallelism can be defined as the use of multiple processing units working together to complete some task. There are many different kinds of hardware that can serve as a “processing unit”, but the principle is the same: a task is broken into pieces in some way, and the processing units cooperate on those pieces to get the task done. Basic of Processes with Python
Students launch a Python 2 IDE such repl.it or JDoodle and paste the following code adapted from the Basics of Processes with Python web page.
from multiprocessing import *
def sayHi():
print "Hi from process", current_process().pid
def procEx():
print "Hi from process", current_process().pid, "(parent process)"
otherProcess1 = Process(target=sayHi, args=())
otherProcess1.start()
otherProcess2 = Process(target=sayHi, args=())
otherProcess2.start()
otherProcess3 = Process(target=sayHi, args=())
otherProcess3.start()
### execute
procEx()
Students execute the code - and debug if needed.
Review the definition of scalability. Ask how the scalability of systems is an important consideration when working with data sets, as the computational capacity of a system affects how data sets can be processed and stored.
Ask: What do the numbers mean in each line of output?
Explain:
Discuss:
Use the questions below as prompts. Students share until the idea that parallel solutions scale more effectively is discussed.
Say: Computer scientists use the term scaling to mean how a process responds to increases in the size of the project.
Say: In the first activity we saw an example of a program where more than one computer processor on a computer could be used at a time. Sharing a task among the processors on many computers is an example of distributed problem-solving. Distributed computing shares components of a software system among multiple computers so a large task can be done in less time using the resources available to each of the computers - including both processing and storage resources. Students read the first 5 paragraphs of the OpenScientist.
Discuss: With elbow partners, discuss what are the advantages of distributed computing and choose the top three projects that interest you from the list on the OpenScientist.
Say: Parallel and distributed computing are powerful tools but they have their limits in terms of increasing efficiency.
Show this graph of Amdahl’s Law by Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551.
Ask:
Explain: The “speedup” of a parallel solution is measured in the time it took to complete the task sequentially divided by the time it took to complete the task when done in parallel. The 4 different colored lines represent four different processes with increasing portions that can be done in parallel. At first, they all speed up as more processors are used. After a time, adding more processors no longer speeds up each process. The sequential parts of processing limit the overall time.
Discuss: With elbow partners, students discuss why they think this is the case. Students share until these key points are verbalized:
Wrap up:
Students return to the OpenScientist and choose the project that most interests them.
Say: Parallel computing consists of a parallel portion and a sequential portion.
Students identify a part of their chosen project that is done in serial and a part that is done in parallel.
Students are to submit a document stating their topic for research using Big Data. This document should answer the questions:
Topic:
How is Big Data used to solve or remedy the topic?
Link(s) used to find Big Data? (i.e. data.gov, etc)
How has the transformation of data storage affected how data itself is used?
Answer: Storage and processing of large digital data enables us to analyze large data sets quickly rather than small sampling sizes as used before.
How can a computer use Big Data to make predictions?
Answer: Computers can use smart algorithms, powerful processors, and clever software to make inferences and predictions for solvable questions.
How can parallel processing help scale a solution to a Big Data problem?
Larger problems often have more parallelizable segments. The more processes that can be done in parallel the faster the overall process will become.
Unit 5. Data Manipulation
Revision Date: Jun 24, 2020 (Version 3.0)Summary
Big Data has been defined in many different ways. Easy access to large sets of data and the ability to analyze large data sets changes how people make decisions. Students will explore how Big Data can be used to solve real-world problems in their community. After watching a video that explains how Big Data is different from how we have analyzed and used data in the past, students will explore Big Data techniques in online simulations. Students will identify appropriate data source(s) and formulate solvable questions.
Overview
Session 1- What is Big Data?
Session 2 – Where can big data be used?
Student computer usage for this lesson is: required
Reading assignment and video clips for Session 1:
Possibly useful resource(s) for data collection:
Big Data Concepts:
Sample data sets (both acquired from http://catalog.data.gov/dataset) :
Journal: How can a computer gather data from people ? (Think-Pair-Share)
Remind students of the mind guessing game: http://en.akinator.com/ or 20 questions http://www.20q.net/
Discuss: How can the computer learn from people when playing one of these games? How many different answers do you think it could possibly know?
Teacher note: students are not expected to actually play this game during class.
Read The Rise of Big Data in chunks: An Introduction to “Big Data” (20 mins) Reading can be found at: http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data
Break students into groups or pairs and jigsaw the seven units of the reading. Each group is to summarize their section in a tweet sized comment (not more than 140 characters).
Share tweets with the class.
Explain to students that big data is impacting every area of life. By using more data and processing power we can make better decisions. As an illustration, show a clip from the movie Moneyball: (3 mins) https://www.youtube.com/watch?v=rMObWsKaIls
After students watch, they create a journal entries explaining at least two ways data was used to better manage the baseball team. Partners discuss journal entries. Share at least one observation with table groups and then share at least one observation from each group with the class.
Show the first 3-5 minutes of this clip. (It becomes a bit dry, so just show the amount that is appropriate for your students to get the idea): https://www.youtube.com/watch?v=7D1CQ_LOizA
Some other concepts to point out to students if there is time:
Some examples of how big data is used:
Some examples of how big data was inappropriately used:
Students are to pick three topics they want to research that use big data. It is preferred that these topics relate to something learned this year in the course (e.g., the need for IPv6). Tomorrow, as the students enter class, they will sign up on a list with their chosen topic. Since the students will have three options, it is likely they will get one of their selected topics to research.
Journal: Think about you daily and weekly activities. What types of data are being stored about you? What information (a collection of facts and patterns) is extracted from the data?
Remind students to think about what they do online, in stores, while in a car, etc.
Guided Activities (10 min) – Processing Big Data
Review the steps to processing Big Data:
As a class, walk through these steps using the two files in the lesson resources folder (FailedBanklist.csv & Consumer_Complaints.csv)
Step 1.
Demonstrate how files such as these can be obtained at http://catalog.data.gov/dataset
Formulate questions such as:
Are there any banks that are on both the complaint list and the failed bank list?
Can we make some deductions about banks that may be on both lists? If so, what deductions can we make?
Step 2.
Extract data source into a format supported by underlying tools
Open one of these files in Notepad (or some simple editing program such as Notepad++) and demonstrate how the actual data itself is separated by commas, thus the file name “csv” for comma separated value.
Open both files in Microsoft Excel. Complete a find for the bank name “Banco Popular de Puerto Rico” on both lists. You may want to first sort the data by bank name to find this bank or you can use CTRL + F to find the bank name (see screenshots below).
Step 3.
Normalize data (remove redundancies, irrelevant details)
In this step, there is technically no need to remove redundancies or irrelevant details but you can show the students how you could remove data or limit the data to a particular data set. For example, if were to want to look at only the banks from Maryland, you can use the filter tool to only view those banks from MD.
Step 4.
Import data into tool
Right now the file type is as a csv file. By resaving the file as a .xlsx file it becomes a true spreadsheet file.
Step 5.
Perform analysis
We have determined that the bank “Banco Popular de Puerto Rico” is on both lists. Now ask the students “Why is this bank on both lists?” Note: On the Failed Bank list the Banco Popular de Puerto Rico is actually an acquiring institution. By looking more closely at the dates of the acquisition of the failed bank “Westernbank Puerto Rico” one can formulate some possible deductions that maybe the reason “Banco Popular de Puerto Rico” is on the complaint list is because they had recently taken over a failed bank. It could be possible that some of these complaints were related to this recent acquisition. Is there any other information that can be gathered? What opportunities for identifying trends, making connections, and addressing problems does the data provide?
Step 6.
Visualize Results
Explain to students that they will learn more about visualizing their results in Unit 6. They can complete graph visualization in excel. Show them the website: http://www.gapminder.org/. Explain that even though a visualization in excel is not interactive like http://www.gapminder.org/, they can complete some form of visualizing their data by using a spreadsheet. Note: http://www.gapminder.org/ is VERY attention-grabbing. Only briefly show the students what they can do with it (see how data changes over time, look at many different data sets, and download data in different forms - including csv and xlsx formats).
Students should research their selected topics from homework. Some possible websites for finding data are listed above under “Possible good resource(s) for data collection.”
Students are to get your approval for a topic and then use the Big Data Sets Worksheet in the Lesson Resource Folder to find big data sets that are related to the approved topic.
Students are to review using http://www.gapminder.org/ looking specifically at life expectancy. Students will write one question after “playing” the timeline of life expectancy using gapminder on an exit slip before leaving class. For example, one may write “Why is the life expectancy of countries such as Denmark, Sweden, & Norway typically higher than other countries throughout most of the timeline?”
Students are to submit a document stating their topic for research using Big Data. This document should answer the questions:
Topic:
How is Big Data used to solve or remedy the topic?
Link(s) used to find Big Data? (i.e. data.gov, etc)
How has the transformation of data storage affected how data itself is used?
Answer: Storage and processing of large digital data enables us to analyze large data sets quickly rather than small sampling sizes as used before.
How can a computer use Big Data to make predictions?
Answer: Computers can use smart algorithms, powerful processors, and clever software to make inferences and predictions for solvable questions.
Unit 5. Data Manipulation
Revision Date: Jun 11, 2020 (Version 3.0)Pre-Lesson Preparation
Summary
Students investigate data organization, simulate linear and binary searches, and write pseudocode and Python for linear and binary search methods.
Outcomes
Overview
Session 1
Session 2
Source
Phone book presentation adapted from a lesson taught by Dr. Rheingans in CMSC 201 at the University of Maryland, Baltimore County
Students will:
Student computer usage for this lesson is: required
In Lesson Resources Folder:
Journal: What would be the best way to organize a collection of DVDs so that you could find the one you want very quickly? Would you need a different method for a radio station with thousands of DVDs? Discuss.
As a class, discuss the following questions:
What if you are looking for a specific song. What is the most effective way to look for something if it is in an unordered, unsorted collection?
What are some challenges to organizing large sets of information so they can be processed?
Why does it help to classify data when you are trying to sort it or search through it?
Possible Answers and suggestions for discussion:
Take out a dictionary (or phone book). Ask the class, how would you search for a particular word/name?
Steps for Binary Search in a book of items to demonstrate to the class:
Why would this not work for an unordered list?
Note: Skeleton code (SearchCode.py) is provided in the Lesson Resources Folder.
Think-Pair-Share
Questions in the AP Classroom Question Bank may be used for summative purposes.
Sixty of the 80 questions are restricted to teacher access. The remaining 20 questions are from public resources.
Questions are identified by their initial phrases.
A code segment will be used to swap the values of
A programmer is deciding between using a linear...
A programmer wrote the code segment below to di...
An algorithm will be used to identify the maximum
In the procedure Mystery below, the parameter n...
The program segment below is intended to move a...
The question below uses a robot in a grid of sq...
There are 32 students standing in a classroom. ...
For students that have difficulty understanding the concepts of searching for items in a set of data, pair those students with a student who has a firm grasp of the concept for the activities. Have the pair work together for 1A and then have them keep their own paper secure using the extra game sheet (1A'). Simiilarly for 1B - 1B' and 1C - 1C'.
Correctness of Python functions for linear search and binary search
"Searching Assessment Items.docx" in lesson folder
"Search Comparison Worksheet" in the lesson folder
Unit 5. Data Manipulation
Revision Date: Jul 22, 2019 (Version 3.0)Summary
In this three-session lesson, students explore and confront the difficulties of the problem of sorting data and the difficulties involved in expressing a clear and efficient algorithm for sorting.
Outcomes
Overview
Session 1:
Session 2:
Session 3:
The algorithmic techniques and analysis involved in sorting data are seen in a wide variety of contexts and applications. Sorting numbers in a list is challenging but foundational to many algorithms in computer science.
What makes a "good" algorithm?
What should be taken into consideration when comparing algorithms that complete the same task?
Student computer usage for this lesson is: optional
Journal: Have students respond to the following questions:
Teacher note: Having just finished the lessons on searching, students should recall that searching an ordered list allows using the binary search which is faster than searching an unordered list with a linear or random search.
Teacher note: The focus should be directed more toward the problem-solving technique than nitpicking about the language used. Although students are writing instructions for a human to manipulate a set of playing cards, they still need to be precise, because the assumption is that the person doesn't know what they are doing. This problem is challenging and will require creativity.
Suggest that it could be helpful to break the task down into parts. Using abstraction and collaboration can decrease the size and complexity of the task that each programmer has to solve. Abstraction allows you to build upon existing processes, collaboration allows you to break up the task once you agree on what the strategy is and what the separate tasks are that need to be solved. Example, cards need to be comared and swapped, one person could write the specific steps for how that should be done.
Journal:
Homework: Any pairs that did not finish the activity should complete it as a homework assignment before the next session.
In this session, students review the sorting algorithms they wrote in session 1. Students will follow the algorithms created by their classmates and discover a variety of sorting strategies. By analyzing the various algorithms, students will attempt to find the "best" sorting strategy.
Teacher note: There are two main difficulties in algorithm design to highlight: (1) It is very difficult to be precise with language without some agreement about what terms mean. (2) Solving the problem by determining the strategies and steps required to sort objects correctly, as well as efficiently, presents a second level of difficulty.
Journal: How do you think a sorting algorithm should be "measured" to determine if it is the "best"? Ask for ideas and discuss this during the group activity.
Create groups of four by joining the pairs who previously exchanged algorithms.
Teacher note: This "swapping algorithm" activity works especially well when students exchange algorithms with a group that has a fundamentally different approach. However, as a practical matter, this can be hard to arrange. From your observations during Session 1, you might have a sense of groups with different approaches that you can assign to swap algorithms.
Teacher note: It is possible that the nominated algorithm won't work perfectly. If you encounter any problems with the directions, give them the benefit of the doubt and simulate it as best you can to enable the class to understand the intent.
Teacher note: The students will perform an actual analysis in a later lesson, so it's okay at this point to simply guide the discussion to see how students are thinking. They will re-examine these ideas later.
Remind students of the two main issues in writing effective algorithms:
Homework: Assign students to write a final version of their algorithm, working out any ambiguities or other problems revealed during the activities.
In this session, we end the set of sorting activities by relating sorting to algorithms in the real world. A further exploration of algorithm analysis with some new algorithms will sharpen their intuition about what should and shouldn't be "counted" when analyzing algorithms, what is "hard" for a computer to do, or what takes a "long time."
Journal: discuss ideas and elements from the previous lesson:
Teacher should clarify that in general, we want these things from an algorithm:
Suggestion: If you have a mix of new and advanced students, challenge the advanced students to sort twice as many cards with a parallel processing algorithm of their own design. Each student on the team can perform one action at the same time.
Evaluation of algorithms
Convert actions into an algorithm
Unit 5. Data Manipulation
Revision Date: Jun 11, 2020 (Version 3.0)Pre-lesson Preparation: You should familiarize yourself with www.sorting-algorithms.com paying particular attention to the variety of algorithms and settings along the top of the page. For session 2, you should have the timedsorts.py code and data files (in the lesson folder) readily available for your students.
Summary
In this two-session lesson, students will explore algorithmic efficiency. They will understand the idea through discussion, manual analysis of simple algorithms, and data collection for implemented algorithms.
Outcomes
Students will be able to:
Overview
Session 1:
Session 2:
Student computer usage for this lesson is: required
Sorting:
Think-Pair-Share: Alternate Routes
Briefly discuss with your class the topic: what properties make for a good algorithm? What makes one algorithm better than another? Properties you may want to discuss if your students do not volunteer them:
A good analogy is purchasing a car, where people are concerned about:
Today's session will address the topic of efficiency.
Introduce the concept of algorithmic efficiency to your students by asking them if any can describe what algorithmic efficiency is, or what it means for an algorithm to be efficient. Briefly describe efficiency as how well an algorithm uses two resources, time and space (stored memory), to solve a problem. Some topics you may wish to discuss include:
A central idea of analysis of algorithms is that many algorithms will take more and more resources (time or space) as the size of their input increases. Algorithms can be different yet produce the same result. Algorithm's that require polynomial timesare considered to run in a reasonable amount of time. An algorithm that requires exponential or factorial times are considered unreasonably slow.
When comparing algorithms, time is not measured in seconds but rather in the number of computational steps needed for the algorithm to finish operation on a given input. Great algorithms grow linearly, at the same rate as their input, meaning the time it takes to finish is directly proportional to the size of the problem they are solving (amount of input data). For instance, an algorithm that takes 10 steps for an input of size 10 and 1000 steps for an input of size 1000 is said to be linear in its input. However, most algorithms take longer as their input gets larger. For instance, an algorithm that takes only 25 steps for an input of size 5 may take 100 steps for an input of size 10, 10000 steps for an input of size 100, and one million steps for a size of only 1000 (it is taking quadratically more time as the input gets larger).
When we analyze algorithms, we often talk about the algorithm's computational complexity, which is the order of magnitude of the algorithm's running time. An algorithms efficiency and be estimated by counting the nunber of times the code is executed hoever algorithmic efficiency is calcualted through formal reasoning rather than empirical measures.
Efficiency is typically expressed as a function of the size of the input. For example, if an algorithm finishes with the same number of steps regardless of the size of its input, it is called constant time, which is O(1) in mathematical form (read aloud as "big-oh one"). Constant time algorithms are the fastest in terms of computational efficiency, and any algorithm that takes a constant number of steps is considered O(1). An algorithm that takes 10 steps for an input of size 10 and also takes 10 steps for an input of size 1000 is likely O(1). However, very, very few algorithms are constant time because most algorithms necessarily take longer as the size of their input increases.
An algorithm that can finish by looking at each piece of its input only once is called linear time or linear order, and is written mathematically as O(n), where n stands for "the size of the input." An algorithm that takes 10 steps for an input of size 10 and also takes 1000 steps for an input of size 1000 is likely O(n). Very few algorithms are linear order, especially if they must compare pieces in their input, such as sorting algorithms. The best sorting algorithms are somewhere between linear time and quadratic polynomial time, written as O(n2), where n2 stands for "the size of the input, squared." Any algorithm that is O(n2) typically must compare each piece of its input with every other piece of input at least once. An algorithm that takes 100 steps for input of size 10 and a million steps for input of size 1000 is likely O(n2).
Algorithm's that require polynomial - not exponential or factorial times - are considered to run in a reasonable amount of time.
Most sorting algorithms are of an order between O(n) and O(n2) known as linearithmic time, written as O(n log n), where log is the logarithmic function. In fact, O(n log n) is the fastest possible order for a comparison-based sorting algorithm. It is impossible for such algorithms to be O(n) since they must make at least some comparisons of their input data.
Using the simulation tools at http://www.sorting-algorithms.com/, students will investigate, compare, and contrast sorting algorithms. Notice the grid in the center of the page. Each column is a particular sorting algorithm, and each row is an ordering of horizontal bars (either random, nearly sorted, reversed order, or few unique). Each algorithm will sort the bars in a given cell from top to bottom in increasing order by length.
Ask your students to interact with the website by clicking the green start icons and observing how long it takes each algorithm to sort its bars relative to the other algorithms.
Some questions to have them discuss or record in their journal could include:
Make sure your students understand that the size and order of input data can affect how long an algorithm takes. You should direct or help your students discover that bubble sort is a slow sorting algorithm that can be fairly fast for nearly sorted data. You may wish to discuss that bubble sort is O(n2) in the worst case, explaining why it takes so long for large input, but is O(n) in the best case, which is when input is already (or nearly) sorted. In contrast, selection sort is O(n2) in both the worst and best cases, and merge sort is O(n log n) in both the worst and best cases. In general, most sorting algorithms that we would want to use are O(n log n), since O(n2) is usually too slow. You may also want to mention that bubble sort is considered one of the most inefficient sorting algorithms and that quick sort’s worst performance is on already sorted data, so some quick sort implementations shuffle the inputs before sorting to avoid that situation.
Watch one or more of the available movie clips that compare the performance of sorting algorithms:
Suggested list of videos (Many more are available):
Introduction [5 min]
Journal: Remind your students about the sorting algorithms from the previous session and have them answer the following questions:
The students will measure and analyze the effect of sorting set size on execution time for a given sorting algorithm using Python code. Using the timedsorts.py file in the lesson resources folder as a basis, the students will perform an experimental analysis to compare sorting algorithms by timing them on input data of different sizees. They will hypothesize, design and code their experiment, collect results, and write a report for homework.
Sorting functions available in the Python code include: quick sort, merge sort, selection sort, insertion sort, and bubble sort. For advanced students or classes may, you may wish to have them implement additional sorting algorithms.
The sample code includes helper functions to generate random data, to load data from a file, and to time sorting functions on the data. Example code for invoking these functions is included at the end of the file. You can remove this example code before sharing it with your students if you wish to emphasize the programming and critical thinking required to do this project.
Each student (or pair or group) needs their own copy of the Python code to modify for their experiments.
Students will compare sorting algorithms by timing them with Python code on input data of various sizes. Have your students (individually or in pairs) make a hypothesis about what will happen as the size of data input increases, answering the following questions:
Have your students write out a description of the steps they will take to perform the experiment.
Have students modify their Python sorting code to implement the experimental steps they outlined. Students must:
Discuss the results with another student or group. What patterns can be seen in the relationship between the amount of data and the time to run the program?
Questions in the AP Classroom Question Bank may be used for summative purposes.
Sixty of the 80 questions are restricted to teacher access. The remaining 20 questions are from public resources.
Questions are identified by their initial phrases.
A programmer is writing a program that is inten...
The table below shows the time a computer syste...
Assign as homework, explain the difference between algorithms that run in reasonable time and those that do not. Address how reasonable and unreasonable times are identified and what imapct that difference has on program designmers.
Students complete a short research report on their sorting algorithm research procedure, results, and analysis of the results.
The teacher may decide to have the students choose how they want to organize the empirical analysis effort. Alternatively, scaffolding with a worksheet or checklist could be used to guide the students through the data collection and analysis tasks.
The following "Checks for Understanding" could be used to guide the students towards the three learning objectives:
Objective: SWBAT identify families of correct algorithms that have different efficiencies in their problem solving approach.
Objective: SWBAT demonstrate logical reasoning and metrics is used to describe an algorithm’s efficiency.
Objective: SWBAT to perform empirical analysis of sorting algorithms by running the algorithms on different inputs.
Students will complete a short research report on their sorting algorithm research procedure, results, and analysis of the results.
Unit 5. Data Manipulation
Revision Date: Jun 11, 2020 (Version 3.0)Summary: Students are introduced to the theory of computation, computability, the halting problem, and advanced algorithms. In particular, they will learn about heuristic search used by artificial intelligence (AI) programs to play games.
Objective:
Students will be able to:
Overview:
Session 1
Session 2
Student computer usage for this lesson is: required
Links to videos and online tools as indicated in the lesson plan.
Alternative instruction could include the Towers of Hanoi problem and discuss the algorithm for solving it. Some demonstrations are available here:
Think-Pair-Share: In pairs, think about and try to answer each of the following questions:
Note: just give them a few minutes to try the factoring, but round them up to continue and discuss: which operations were much harder to perform than their inverse? Can you just invert the steps, and why or why not?
Make a connection to the previous lesson by comparing these to sorting algorithms, where some are speedy and efficient like Merge sort and Quick sort, and others are unusably slow, like Bubble sort. Highlight the difference that different problems have different lower bounds on optimal solutions, and that some problems like integer factorization have solutions but take too long to be solved in a practical way.
Discuss the definition of computation (in a theoretical sense) with your students. Computation is input plus processing to get output. A computer is one system that is a "model of computation" since it takes input, processes it, and produces output.
Another model of computation is called a Turing machine, named after Alan Turing (one of the most famous computer scientists). A Turing machine is a theoretical entity that has a tape of symbols (a line of 0s and 1s), a head that can read only one symbol at a time, and an internal state that can change based on instructions as the head reads symbols. Turing and a mathematician called Alonzo Church are responsible for the "Church-Turing" thesis, which says that a Turing machine can compute anything that a digital computer can. This is a fundamental idea of the theory of computation, and has the implication that anything one computer is capable of doing is possible to be computed by another, given enough resources (time and memory).
Parallel computing is a computational model where the program is broken into multiple smaller sequential computing operations, some of which are performed simultaneously. Distributed computing is a computational model in which multiple devices are used to run a program. Comparing efficiency of solutions can be done by comparing the time it takes them to perform the same task.
Now discuss the idea of computability with your students. Ask your students to answer or think-pair-share: are there things it is impossible for a computer to compute? A decision problem that can be solved by an algorithm that halts on all inputs in a finite number of steps is decidable. The most classic "undecidable" (non-computable) question is called the Halting Problem. The Halting Problem is: make a program that can tell if any other program will halt (terminate at some point eventually) or will loop forever and never end. This can be done for specific algorithms (instances of the Halting Problem) but not for the general problem regarding all possible algorithms.
The Halting Problem is impossible for a computer to compute, which you can prove (informally) by paradox. Suppose you did have a program that solved the Halting Problem, called HALT(X), which takes the code for some program X as input and says "yes" if X terminates or "no" if X loops forever. Then you could write a new program that uses HALT inside it, which we will call PARADOX(X). First PARADOX(X) will run HALT(X) and if the result is no, PARADOX will halt, but if the result is yes, then PARADOX will loop forever. But here is the problem: what if we use the code for PARADOX as the input to PARADOX, running PARADOX(PARADOX)? If it says that PARADOX halts, then PARADOX runs forever, and if it says PARADOX runs forever, then PARADOX halts. This problem is a paradox and does not make sense because the premise, that a program called HALT could exist, must be wrong! Therefore, the Halting Problem is impossible for a computer to solve.
Video explanation with optional student simulation
Think-Pair-Share:
This session concerns advanced algorithms, in particular heuristic search, which is commonly used in artificial intelligence. Refresh your students' minds on the definitions of computation, computability, and undecidable problems. Additionally, mention the properties we consider when we compare algorithms:
Introduce the idea of heuristic search, which is a class of algorithms used in many artificial intelligence programs. A heuristic is something that is used to find a good solution in a reasonable time, and a heuristic search algorithm is an algorithm that uses heuristics to determine how to search through some space.
A great way to introduce heuristic search is first to discuss game trees. A game tree is a structure that is used to represent the "space" of a game that an algorithm wants to search through.
Think of a game like chess: you make a move, the opponent makes a move, and the process continues until the ending conditions have been met (one player in checkmate or stalemate). A game tree is a mathematical structure used by AI and heuristic search algorithms to model the moves made in a chess game. At any turn, we can make a "tree" by drawing the root node as representing the current state of the board and drawing one branch under it for every possible move. In Tic-Tac-Toe, if you are the starting player, then the root node represents a blank board, and there will be nine branches, one for each possible move (each space where you could place your mark). Following a branch in the game tree takes you to a new node that represents the configuration of the game that results from having taken that move. In Tic-Tac-Toe, if I am the first player and place my X in the center space, I have "followed" that branch down the tree to a new node that represents the board with an X in the center space. The opponent then uses this node as the root of their game tree, and has a branch for each of their possible moves.
Think-Pair-Share: Have your students pair off and play a game of Tic-Tac-Toe and try to draw the game tree as they play it, drawing the nodes for each move they made and every potential branch from those nodes. Bring them back into discussion and ask them what if they had to draw out every node followed down every branch? Now ask them to imagine the game tree for chess, which has 20 possible moves on the first turn, 400 on the second, and many, many more as the game goes on. How can an artificially intelligent program learn to play chess when there are so many (too many) options? Chess actually has around 35100 nodes in its tree and 1040 legal states.
Heuristic search on game trees is one way AI programs are able to play games like chess. How good are computer game players?
Typically games modelled with game trees are 2-person games, players alternate moves, and they are zero-sum (meaning one player's loss is the other's gain). More complicated elements in such games may have include: hidden information (like other players' hands), chance (dice), or multiple players.
How does an AI program use heuristic search to play a game? Typically in these steps:
The key problems are:
For evaluation, some function is typically coded or learned over time.
Refer to the "Advanced Algorithms" slides in the lesson resources folder for examples of uninformed search. For an activity, you may want to create a game tree for Tic-Tac-Toe and have your students walk through how each of the following algorithms would operate over it.
Uninformed Search are algorithms that work without a heuristic, using no information about the likely "direction" of the goal node. Algorithms include:
For any games with variety and complexity, certainly for chess and even checkers, uninformed search is simply too slow because it is exhaustive. This problem is another example, like with sorting, where the efficiency of our solution matters a great deal. To get programs to play games, we need them to be efficient and intelligent about the number and quality of moves they consider.
Informed Search algorithms each follow some heuristic that uses information about the game to determine smart directions to explore. Examples include:
Advanced classes may wish to discuss local search algorithms, such as hill-climbing and genetic algorithms (in the "Advanced Algorithms" slides in the lesson folder).
Thinking about game trees again, we want to select the branch that takes us to a node with the maximum evaluated state. But there is a catch: the opponent gets to make moves, too. That is, every other branch in our game tree is the opponent's turn. How does the AI program account for the other player?
Perhaps most logically, the way AI programs do so is to assume the other player will play optimally. Just as the AI will take the branch that leads to the state with the greatest evaluation, it assumes the other player takes the branch leading to the state that will maximize their position. In other words, the AI searches through their game tree by following the branch with the maximum value on their turn, and following the branch with the minimum value on the opponent's turn. This algorithm is called minimax and is the basis of nearly all AI that play 2-person zero-sum games.
Questions in the AP Classroom Question Bank may be used for summative purposes.
Sixty of the 80 questions are restricted to teacher access. The remaining 20 questions are from public resources.
Questions are identified by their initial phrases.
A certain computer game is played between a human
Under which of the following conditions is it m...
Which of the following programs is most likely to
Which of the following statements is true?
For the Halting Problem proof, it is important that students can translate the solution that is on the video into a representation that makes (some) sense to them. Acting out the inputs and outputs of the set of machines is an approach worth trying.
The following "Checks for Understanding" could be used to guide the students towards the three learning objectives.
Objective: Students will identify some Advanced Algorithms that Exploit Inverse Operations Efficiency.
Objective: Students will identify some Advanced Algorithmic Techniques.
Objective: SWBAT discuss at least one example of a computing problems that is unsolvable
Students will be able to summarize -- in their own words or with simple models -- the proof of the Halting Problem.
Students will be able to identify the sensitivity of cryptography to the difficulty of factoring large numbers.
Unit 5. Data Manipulation
Revision Date: Jan 12, 2020 (Version 3.0)Summary
Students collaboratively define, design and implement a programming project: a miniature version of the Create Performance Task. They create and share presentations and share with groups (of about six students).
Students present. the project goals, the development process they used, functional and dta abstractions they considered and used.
Outcomes
Overview
Session 1:
Sessions 2 and 3:
Session 4:
Students working in pairs practice choosing a project and planning how to implement it in a fixed time frame.
Students have just two days to plan and implement a project that uses a user created function and makes use of a list or other data structure. Since this is a practice task, teachers may provide help with algorithms, functions, data abstraction or other programming concepts. Remind students that during the actual task teachers may not provide this kind of support so as much as possible students should rely on the collaborative partners for assistance when needed.
Students may receive most or all of the credit from an incomplete project if the project demonstrates the required components.
For this practice task, teachers may want to provide program stubs. Stubs could include a suggested function and data abstraction. Functional abstraction should include a parameter that is used to make a selection in the procedure.
Briefly present an overview of the Create Performance Task. Explain that students will have 12 hours to complete the Create Task later in the course and they will four sessions for the practice version we are beginning today. Explain that the Full Create Performance Task Guidelines are provided as a resource and a guide for students however students are to complete the guidelines for the practice task. Requirementst from the full task that we are not doing are struck though in the lis below. For instance, in the practice Create Task we will not be producing a video or writing responses about the video.
General Requirements
You will be provided with a minimum of 12 hours of class time to complete and submit the following:
Submission Requirements
Your program must demonstrate:
Include comments or acknowledgments for any part of the submitted program code that has been written by someone other than you and/or your collaborative partner(s) including the origin of the code. The acknowledgement should include the origin or original author’s name.
Create a PDF file that contains all of your program code (including comments).
Your video must demonstrate your program running, including:
Your video:
Collaboration is not allowed during the development of your video. Your video must not contain any distinguishing information about yourself. Your video must not be narrated, but text captions are encouraged.
Submit one PDF file that includes your responses to each prompt below. Clearly label your responses 3a–3d in order. Your response to all prompts combined must not exceed 750 words, exclusive of the program code. Collaboration is not allowed when answering the written responses.
3a. Provide a written response that:
3b. Capture and paste two program code segments you developed during the administration of this task which contain a list (or other collection type) being used in your program. The first program code segment must show how data has been stored in the list1. The second program code segment must show the data in the same list being processed, such as creating new data from the existing data. Then, provide a written response that:
3c. Capture and paste a procedure from your program that you developed during the administration of this task which implements an algorithm used in your program. This procedure must:
Then, provide a written response that:
3d. Provide a written response that:
For this practice task, students will complete simpler project and a two to three minute presentation about it, rather than a video and a report. Students work in partners to select and develop projects.
After completing the project, students will create a presentation about it. The presentation should address:.
Projects are chosen by the students. If they wish, projects may be based on the following labs from How to Think Like a Computer Scientist. Before selecting a project students should consider how they might use a data structure (list) in their project.
How to Think Like a Computer Scientist Labs
Students choose and plan a project with partners and complete the project development table.
Project Development Plan |
Team Members: |
Project Name |
|
Project Goals |
|
Key algorithm Describe a selection the algorithm makes and describe how the algorithm uses iteration. |
|
Functional abstraction Describe at least two different tasks the function may perform. |
|
Data abstraction Describe the values that will be stored and how they will be used by the program. |
Students submit the Project Development Plan.
Students reflect on their project and making journal entry of goals for tomorrow.
Students complete a brief journal entry planning their team and individual work for today.
Students work to implement and test projects. Teachers may evaluate student performance based on student journal entries and their observations of student effort in implementing the project. In the Create Performance task, students should direct their questions with their partners and or consult their class notes and examples for help. Teachers may only help resolve technical issues (hardware or software) or provide clarification of the project requirements. In this practice task, students may receive limited help in developing the project.
Closing (2 min)
Students submit the key algorithm code developed so far.
Students make a journal entry about the progress they made and their goals for tomorrow.
Each student prepares a presentation. Partners may collaborate in the report presentation but each partner must prepare their own report.
Getting Started:
Explain the presentation requirements.
Format: A two to three minute presentation with partners taking turns in the presentation,
Required Contents:
Partner Names
Purpose
Data Abstraction
Function in the project
Code showing creation and use.
Functional Abstraction
Function in the project
Code showing definition and calls
Parameter it uses and alternative paths through the procedure
Key Algorithm
Purpose of the algorithm
Code and how the algorithm works (pseudocode can be used)
Use of selection and iteration
Students present their project to table groups. Time the presentations so that they do not exceed 3 minutes. Students discuss with table groups what they like about the project, what they learned and any questions or comments they have.
Closing (2 min)
Students submit their presentations and program code. Students create exit slips with any questions or comments they have about the Create Task.
For the practice task, if student projects are too big or too small in scope, teachers should provide guidance.
For the practice task, project descriptions and pseudocode/algorithm description for each proposed project should be assessed. Assessment can be done by collaborative partners first.
If partners have concerns, they should be brought to the teacher.
The project should be assessed using the latest rubric provided by the College Board.
The latest rubric (updated as of November 2018) is in the lesson folder. The practice project does not ask students to create a video so row 1 should net be used. The assessment can be the basis of a summative grade using the suggested scale below.
Complete presentation submitted 25%
Code with abstractions submitted 25%
Rubric Points (5) 5 * number of rubric points up to 50%
Unit 5. Data Manipulation
Revision Date: Jun 24, 2020 (Version 3.0)Summary
Students will define, design and implement a programming project: a miniature version of the Create Performance Task. They will create presentations and share with groups the projects they developed and how their project used abstractions.
Outcomes
Overview
Session 1:
Presentation (5 min) Students are introduced to the Create Task.
Activity (15 min) Students select a small programming project to model the Create Task.
Activity (10 min) Students identify an algorithm to use in their program and share within tabe groups.
Activity (10 min) Students identify an an abstraction to use in their program.
Wrap Up (5 min) Students share the algorithms and abstractions they will use with the class.
Session 2:
1. Journal (2 min) - What is your plan for today for the development of your project? What abstractions do you plan on using in your project?
2. Activity (43 min) - Students complete implementing their projects.
3. Journal (5 min) - Reflection. What abstractions did you use in your project?
Session 3:
1. Getting Started (5 min) - Students individually respond to three prompts about their projects.
2. Activity (20 min) - Students prepare one minute presentations of their projects.
3. Presentations (15 min) - Students present their project to table groups.
4. Wrap Up (5 min) - Students create exit slips with any questions about the Create Task.
Students working in pairs practice choosing a project and planning how to implement it in a fixed time frame.
Students have just two days to plan and implement a project that uses a user created function and makes use of a list or other data strucutre. Since this is a practice task, teachers may provide help with algorithms, functions, data abstraction or other programming concepts. Remind students that during the actual task teachers may not provide this kind of support so as much as possible studtns should rly on the collaborative partners for assistance when needed.Since an algorithm is a list of steps that comes to a conclusion, if students develop pseudocode for their projects they can refer to the pseudocode as their algorithm.
Students may receive most of the credit from an incomplete project if the project demonstrates the required components.
For this practice task, teachers may want to provide program stubs. Stubs could include suggested functions.
Present an overview of the Create Task.
Explain that students will have 12 hours to complete the Create Task later in the course and they will three 50-minute sessions for this practice. The actual Create Task will have a suggested collaborative component and be larger in scope.
Discuss the following guidelines for the full project and the practice project we will be doing.
Three components to create:
General:
One project - individual with collaboration in stages
12 hours of classroom time
Project must use functional and data abstraction.
Report: Written responses (response to all prompts combined must not exceed 750 words, exclusive of the Program Code.):
a. Provide a written response or audio narration in your video that: Identifies the programming language and identifies the purpose of your program.
b. Describe the incremental and iterative development process of your program
c. Describe how a selected algorithm functions.
d. Explain how an abstraction you developed helped manage complexity
For this practice task, students will complete simpler project and a one-minute presentation about it, rather than a video and a report.
Students work individually to select projects.
After completing the project, students will create a one-minute presentation about it. Presentation should address the following.
a. Identify the programming language and the purpose of your program.
b. Describe the incremental and iterative development process of your program
c. Describe how a selected algorithm functions.
d. Explain how an abstraction you developed helped manage complexity
The presentation must address at least points a and b of the above and c or d.
Projects are chosen by the student. If they wish, their projects may be based on the following labs from How to Think Like a Computer Scientist.
Labs
Students select a project and share their ideas with partners.
After collaborating with partners, students submit to their teacher a brief description of the project describing its most important features and how it will work.
Students identify an algorithm and an abstraction to use in their projects. The algorithm should be written as pseudocode and then shared with their partners.
Students should discuss in groups what abstraction they chose and how they think it would be helpful.
Students complete a brief journal entry describing:
Students work to implement and test projects. Teachers may evaluate student performance based on student journal entries and their observations of their effort in implementing the project.
Students reflect on their project and making journal entry of how they used abstraction in the project.
Students begin by individually responding to these prompts about their project:
b. describe the purpose, how your program code works and the most important features and algorithms
d. describe the development process
e. explain an abstraction and how it helped manage complexity
Students prepare one-minute presentation about their projects including their responses to prompts b, d and e.
Students present their project to table groups. Time the presentations so that they do not exceed 1 minute. Students share with table groups what they like about the project, what they learned and any questions they have.
Students create exit slips with any questions they have about the Create Task after viewing and discussing the presentations.
For the practice task, project descriptions and pseudocode for each proposed project should be assessed. Assessment can be done by collaborative partners first. If partners have concerns, they should be brought to the teacher. If student projects are too big or too small in scope, teachers should provide feedback.
The project should be scored using the latest rubric provided by the College Board.
The latest rubric (updated as of June 2016) is in the lesson folder.