Unit 5. Data ManipulationRevision Date: Jun 24, 2020 (Version 3.0)
Big Data has been defined in many different ways. Easy access to large sets of data and the ability to analyze large data sets changes how people make decisions. Students will explore how Big Data can be used to solve real-world problems in their community. After watching a video that explains how Big Data is different from how we have analyzed and used data in the past, students will explore Big Data techniques in online simulations. Students will identify appropriate data source(s) and formulate solvable questions.
Session 1- What is Big Data?
Session 2 – Where can big data be used?
Student computer usage for this lesson is: required
Reading assignment and video clips for Session 1:
Possibly useful resource(s) for data collection:
Big Data Concepts:
Sample data sets (both acquired from http://catalog.data.gov/dataset) :
Journal: How can a computer gather data from people ? (Think-Pair-Share)
Discuss: How can the computer learn from people when playing one of these games? How many different answers do you think it could possibly know?
Teacher note: students are not expected to actually play this game during class.
Read The Rise of Big Data in chunks: An Introduction to “Big Data” (20 mins) Reading can be found at: http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data
Break students into groups or pairs and jigsaw the seven units of the reading. Each group is to summarize their section in a tweet sized comment (not more than 140 characters).
Share tweets with the class.
Explain to students that big data is impacting every area of life. By using more data and processing power we can make better decisions. As an illustration, show a clip from the movie Moneyball: (3 mins) https://www.youtube.com/watch?v=rMObWsKaIls
After students watch, they create a journal entries explaining at least two ways data was used to better manage the baseball team. Partners discuss journal entries. Share at least one observation with table groups and then share at least one observation from each group with the class.
Show the first 3-5 minutes of this clip. (It becomes a bit dry, so just show the amount that is appropriate for your students to get the idea): https://www.youtube.com/watch?v=7D1CQ_LOizA
Some other concepts to point out to students if there is time:
Some examples of how big data is used:
Some examples of how big data was inappropriately used:
Students are to pick three topics they want to research that use big data. It is preferred that these topics relate to something learned this year in the course (e.g., the need for IPv6). Tomorrow, as the students enter class, they will sign up on a list with their chosen topic. Since the students will have three options, it is likely they will get one of their selected topics to research.
Journal: Think about you daily and weekly activities. What types of data are being stored about you? What information (a collection of facts and patterns) is extracted from the data?
Remind students to think about what they do online, in stores, while in a car, etc.
Guided Activities (10 min) – Processing Big Data
Review the steps to processing Big Data:
Demonstrate how files such as these can be obtained at http://catalog.data.gov/dataset
Formulate questions such as:
Are there any banks that are on both the complaint list and the failed bank list?
Can we make some deductions about banks that may be on both lists? If so, what deductions can we make?
Extract data source into a format supported by underlying tools
Open one of these files in Notepad (or some simple editing program such as Notepad++) and demonstrate how the actual data itself is separated by commas, thus the file name “csv” for comma separated value.
Open both files in Microsoft Excel. Complete a find for the bank name “Banco Popular de Puerto Rico” on both lists. You may want to first sort the data by bank name to find this bank or you can use CTRL + F to find the bank name (see screenshots below).
Normalize data (remove redundancies, irrelevant details)
In this step, there is technically no need to remove redundancies or irrelevant details but you can show the students how you could remove data or limit the data to a particular data set. For example, if were to want to look at only the banks from Maryland, you can use the filter tool to only view those banks from MD.
Import data into tool
Right now the file type is as a csv file. By resaving the file as a .xlsx file it becomes a true spreadsheet file.
We have determined that the bank “Banco Popular de Puerto Rico” is on both lists. Now ask the students “Why is this bank on both lists?” Note: On the Failed Bank list the Banco Popular de Puerto Rico is actually an acquiring institution. By looking more closely at the dates of the acquisition of the failed bank “Westernbank Puerto Rico” one can formulate some possible deductions that maybe the reason “Banco Popular de Puerto Rico” is on the complaint list is because they had recently taken over a failed bank. It could be possible that some of these complaints were related to this recent acquisition. Is there any other information that can be gathered? What opportunities for identifying trends, making connections, and addressing problems does the data provide?
Explain to students that they will learn more about visualizing their results in Unit 6. They can complete graph visualization in excel. Show them the website: http://www.gapminder.org/. Explain that even though a visualization in excel is not interactive like http://www.gapminder.org/, they can complete some form of visualizing their data by using a spreadsheet. Note: http://www.gapminder.org/ is VERY attention-grabbing. Only briefly show the students what they can do with it (see how data changes over time, look at many different data sets, and download data in different forms - including csv and xlsx formats).
Students should research their selected topics from homework. Some possible websites for finding data are listed above under “Possible good resource(s) for data collection.”
Students are to review using http://www.gapminder.org/ looking specifically at life expectancy. Students will write one question after “playing” the timeline of life expectancy using gapminder on an exit slip before leaving class. For example, one may write “Why is the life expectancy of countries such as Denmark, Sweden, & Norway typically higher than other countries throughout most of the timeline?”
Students are to submit a document stating their topic for research using Big Data. This document should answer the questions:
How is Big Data used to solve or remedy the topic?
Link(s) used to find Big Data? (i.e. data.gov, etc)
How has the transformation of data storage affected how data itself is used?
Answer: Storage and processing of large digital data enables us to analyze large data sets quickly rather than small sampling sizes as used before.
How can a computer use Big Data to make predictions?
Answer: Computers can use smart algorithms, powerful processors, and clever software to make inferences and predictions for solvable questions.