Techniques and Concepts of Big Data
Note: The following are the notes from a class called Techniques and Concepts of Big Data on April 12, 2016.
Big data is an ambiguous and relative term. It may be best to define it by what it is not. It's not regular data. It's not business as usual. It's not something that an experienced data analyst may be ready to deal with. To put it another way, big data is data that doesn't fit well into a familiar analytic paradigm. It won't fit into the rows and columns of an Excel spreadsheet. It can't be analyzed with conventional multiple regression, and it probably won't fit on your normal computer's hard drive.
Three Vs
One way of describing big data is by looking at the three Vs: volume, velocity, and variety. These come from an article written by Doug Laney in 2001 (see 3D Data Management: Controlling Data Volume, Velocity, and Variety), and they're taken as the most common characteristics of big data, but they're certainly not the only ones. We'll talk about some other possible Vs to consider later in this course.
Volume
In its simplest possible definition, big data is data that's just too big to work on your computer. Obviously this is a relative definition: what's big for one system at one time is commonplace for another system at another time. That's the general point of Moore's Law, the well-known observation in computer science that the capacity and performance of computers double roughly every two years. So, for example, my Mac Classic II, which got me through graduate school, had two megabytes of RAM and an 80-megabyte hard drive, and as far as it was concerned, big data was anything that would now fit onto a one-dollar flash drive.
In Excel, on the other hand, the maximum number of rows you can have in a single spreadsheet has changed over time. It used to be about 65,000; now it's over a million, which seems like a lot, but if you're logging internet activity where something can occur hundreds or thousands of times per second, you'll reach your million rows very quickly. And if you're looking at photos or video and you need to have all of the information in memory at once, you have an entirely different issue.
Even my iPhone takes photos at two or three megabytes per photo and video at about 18 megabytes per minute, or roughly one gigabyte per hour. That's just a phone. If you have a RED Epic video camera, you can record up to 18 gigabytes per minute, and suddenly you have very big data. Now, some people call this "lots of data," meaning it's the same kind of data we're generally used to, there's just a lot more of it.
Velocity
Velocity refers to data coming in very fast. In conventional scientific research, it could take months to gather data from 100 cases, weeks to analyze the data, and years to get the research published. Not only is this kind of data time-consuming to gather, it's generally static once it's entered; that is, it doesn't change. As an example, perhaps the most familiar data set for teaching the statistical procedure of cluster analysis is the Iris data collected by Edgar Anderson and analyzed by Ronald Fisher, both of whom published papers on it in 1936.
This data set contains four measurements, the widths and lengths of the petals and sepals, for three species of Iris. It has 150 cases, and it is used every day: it's one of the built-in data sets in the statistical programming language R, and it hasn't changed in nearly 80 years. At the other end of the scale, if you're interested in using data from a social media platform like Twitter, you may have to deal with the so-called "fire hose." Right now, Twitter is processing about 6,000 tweets globally per second.
That works out to roughly 500,000,000 tweets per day and about 200,000,000,000 tweets per year. A neat way to see this is with a live counter on the web: at Internet Live Stats, it shows that about 341,000,000 tweets have been sent so far today, and the number is updating extremely quickly.
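As a quick sanity check on those figures, here's a back-of-the-envelope calculation in Python (the 6,000-tweets-per-second rate is just the approximate figure quoted above):

```python
# Back-of-the-envelope check of the tweet volumes quoted above.
tweets_per_second = 6_000                      # approximate global rate
seconds_per_day = 60 * 60 * 24                 # 86,400

tweets_per_day = tweets_per_second * seconds_per_day
tweets_per_year = tweets_per_day * 365

print(f"Tweets per day:  {tweets_per_day:,}")   # 518,400,000 (about 500 million)
print(f"Tweets per year: {tweets_per_year:,}")  # 189,216,000,000 (about 200 billion)
```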
Even a simple temperature sensor hooked up to an Arduino microcontroller through a serial connection, sending just one reading at a time, can eventually overwhelm a computer if left running long enough. This kind of constant influx of data, better known as streaming data, presents special challenges for analysis, because the data set itself is a moving target. If you're accustomed to working with static data sets in a program like SPSS or R, the demands and complexities of streaming data can be very daunting, to say the least.
Variety
And now we get to the third aspect of big data: variety. What we mean here is that big data is not just the rows and columns of a nicely formatted data set in a spreadsheet. Instead, you can have many data sets in many different formats. You can have unstructured text, like books, blog posts, comments on news articles, and tweets. One researcher has estimated that 80 percent of enterprise data may be unstructured, so unstructured data is the common case, not the exception.
This can also include photos, videos, and audio, as well as networked graph data, that is, data about social connections. And if you're dealing with data in what are called NoSQL databases, you may have graphs of social connections, hierarchical structures and documents, or any number of formats that don't fit well into the rows and columns of a conventional relational database or a spreadsheet, and that can create some very serious analytical challenges.
A recent study by Forrester Research shows that variety is the biggest factor leading companies to big data solutions; in fact, variety was mentioned over four times as often as data volume.
Final Questions
Now, the final question here is, "Do you have to have all three Vs, volume, velocity, and variety, at once to have Big Data, or just one?" It may be true that if you have all three Vs at once, then you have Big Data, but any one of them can be too much for your standard approach to data. And really, what Big Data means is that you can't use your standard approach with it. As a result, Big Data can present a number of special challenges. We'll be discussing those later, but first, let's take a look at how Big Data is used and some of the amazing things that are already being accomplished with Big Data in research, in business, and even for the casual consumer.
How is Big Data Used
Big Data for Consumers
Most of the time when you hear people talk about big data, they're talking about it in a commercial setting, about how businesses can use big data in advertising or marketing strategies. But one really important place big data is also used is for consumers, and what's striking is that while the data, the algorithms, and some incredibly sophisticated processing are all there, they're nearly invisible. The results are so clean that you get just a little piece of information, but exactly what you need.
What I want to do is show you some common applications of big data for consumers that you may be using already without being aware of the sophistication of the big data analysis going on behind them. The first, if you have an Apple iPhone or iPad, is what Siri can do. For instance, aside from asking what the weather is like, where Siri actually knows what you mean, where you are, and what time you're talking about, it can do things like look for restaurants serving a particular kind of food and see if they have reservations available.
It can do an enormous number of things that require recommendations from other people, awareness of your location, and awareness of how people's preferences change over time. Another one is Yelp. A lot of people use this to find a restaurant, and again, it draws on millions and millions of reviews from users and other sources to make a very small recommendation. Here I'm searching for Thai food in Carpinteria, California, which is where Lynda.com is located.
I've got Siam Elephant and Your Place Restaurant as my first two hits. Next, you might be familiar with recommendation engines: software that is able to make specific suggestions to you. Yelp is an example, but people are more familiar with recommendations for things like movies, books, and music. Here's my Spotify account. Spotify knows what I listen to when I'm on Spotify, what I listen to all the way through, what I add to my list, and what I skip, and it's able to make specific suggestions to help me find new artists I wouldn't otherwise know about.
I love some of the stuff that Spotify comes up with. Similarly, Amazon.com makes recommendations for books. For instance, here's a book, Principles of Big Data, one of my favorites, by Jules Berman, and if you scroll down you'll see that they have a list of other books that are recommended. This is generated by Amazon's recommendation engine, and you see several other books on big data, and in fact, it's a great list, I own about half of them. It's the same general principle here.
A lot of people use Netflix to get movies, and Netflix makes specific suggestions for other movies you might like. What's interesting is that a few years ago they held a major contest called the Netflix Prize, where they wanted to see if anybody could improve the accuracy of their predictions, meaning whether you would actually like a recommended movie. Improving those predictions by 10% was worth a million-dollar prize, and incredibly sophisticated analysis went into it, but the end result is again a very simple thing: you get recommended a handful of movies, and usually you pick one and you like it.
In another context there's the app called Neighborland, which is designed to help you collaborate with people to make your city work a little bit better. That's a simple goal, but Neighborland uses photos, data, and APIs from Twitter, Google Maps, Instagram, and agencies that report on real estate parcels, along with transit systems and 311 complaints: an enormous set of data that really highlights the variety of big data. The other examples, like Spotify and Yelp, show the volume, but this one shows the variety of integrating data from so many different places and so many different formats to help people collaborate on something as simple as working together to improve their neighborhood.
Finally, the last one I want to show you is Google Now, and what Google Now does is it actually makes recommendations before you ask for them, especially when it's linked up to your calendar and it's linked up to the location sensing on your phone. It knows where you are, it knows where you need to be, and it can tell you about things like traffic or the weather before you even ask for it, and this is based on, again, an enormous amount of information about the kinds of information people search for, and it provides it in a sort of preemptive manner.
So for consumers, big data plays an enormous role in providing valuable services, but again, with the irony that it operates invisibly by taking a huge amount of information from several different sources and distilling it into just two or three things that give you what you need.
Big Data for Business
We saw in the last movie that big data can provide important conveniences and functionality for consumers, but for the business world, big data is revolutionizing the way people do commerce. In this movie, I want to look at a few places where big data has proven to be particularly useful, or unusual and interesting. The first thing we're going to do is look at the place where most people have encountered big data in commerce, and that's in the ads that appear alongside Google search results.
Whenever you search for something on Google or any other search engine, you type in your term. You're going to get the results that you want, but you're also going to get ads. You can see here, for instance, that I'm searching for big data. I have three ads on the top and a series of ads down the right side. Those ads are not placed at random. They're placed there based first on the thing I'm currently searching for, but also on what Google knows about me. You can see in the top right that I'm logged in to my own account, so Google is drawing on everything I've searched for, and the other information it has about me, to try to place the ads it thinks I'd be most likely to respond to.
That's something it can do only by having a very large amount of data available to tailor things to the individual consumer. Another interesting area is what's called predictive marketing. This is when big data is used to help decide who the audience for something will be before they actually get there. The idea is to predict major life events, like graduating, getting married, getting a new job, or having a child, events that are often associated with a whole series of commercial transactions.
To do this, companies can look at consumer behavior. They can look at how often you log onto their website, what credit cards you use, how often you look at particular items before moving on to something else, and whether you've applied for an account with their organization. They can make use of a huge amount of information they already have available to them. Similarly, they can use demographic information. This can include things like your age, your marital status, the number of children you have, your home address, how far you live from their store, your estimated salary, whether you've moved recently, what credit cards you have, and what websites you visit.
All of this information is potentially available in one form or another, to the company that's trying to make these predictions. Similarly, they can rely on additional purchased data. It's possible for the company to get information about your ethnicity, your job history, your magazine subscriptions, whether you've declared bankruptcy or been divorced, whether you've attended college, the kind of things you talk about online, and so on. There's an enormous amount of information here that again is potentially available. Now, this is going to lead into an important discussion.
In fact, we have an entire series of movies on ethics and big data, so we'll return to it. But using this kind of information, it is possible for a company to predict that you're about to buy a new house, and because new homeowners make an enormous number of purchases, the company can reach you before you start making commitments to those purchases. Another place big data shows up is in predicting trends. One of the really fascinating examples of this is in fashion.
The company Editd has actually received awards for their use of big data in predicting fashion trends. So they can actually tell retailers what the most popular colors and styles and brands are going to be, when they're going to be popular, and they can help them price them. Obviously, this kind of information is enormously important to the companies that are going to be selling these products, and Editd is able to do this through their reliance on big data.
A final thing I want to mention about the use of big data in commerce is fraud detection. It turns out that fraud is an enormous problem: online retailers lose about $3.5 billion each year to online fraud, and insurance fraud, not counting health insurance, is estimated at more than $40 billion per year. So fraud is a big issue. It also turns out that there are a number of things companies can do to lessen the prevalence of fraud, especially in online transactions.
They can look at the point of sale. That specifically means: how are you making the purchase? Are you online, and what website are you using? They can use geolocation: where are you physically located in the world? They can look at the IP address: what computer are you using to access the website? They can look at the login time: are you somehow making a purchase at 4:00 a.m. when you've never before done anything after 11:00 p.m.? Interestingly, they can also look at things like biometrics. I was talking with a colleague who works in computer security and who says that, for instance, the way people move their mouse, or the time they take between pressing keys on the keyboard, are distinctive measurements of individual people.
When you hold your cell phone and look at it, people of different heights hold their phones at different angles, as measured by the accelerometer in the phone. All of this can be used to determine whether the person making the purchase is who they say they are. I've been saved by this. I remember a few years ago, American Express called me and asked if I had just booked $4,000 of hotel rooms in the Middle East. No, I had not. It turned out that there was a series of other small purchases, not in the Middle East, that showed my account had been compromised.
Fortunately, American Express was able to stop these charges beforehand, and they helped solve the problem and get things taken care of. A lot of that came down to these particular details: the patterns in their extraordinarily large data set let them recognize these anomalies as potential fraud, which in this case they actually were.
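To make the screening idea concrete, here's a minimal sketch of the kind of rule-based checks just described. The profile fields, thresholds, and transaction values are hypothetical illustrations, not any card issuer's actual system, and a real system would combine thousands of signals statistically rather than three hand-written rules.

```python
from datetime import datetime

# Hypothetical customer profile built from past purchase history.
usual_profile = {
    "countries": {"US"},            # where purchases normally occur
    "latest_hour_seen": 23,         # has never purchased after 11:00 p.m.
    "typical_max_amount": 500.00,   # largest routine purchase
}

def flag_transaction(amount, country, timestamp, profile):
    """Return the reasons a transaction looks anomalous (empty list = looks normal)."""
    reasons = []
    if country not in profile["countries"]:
        reasons.append("unusual location")
    if timestamp.hour >= profile["latest_hour_seen"] or timestamp.hour < 5:
        reasons.append("unusual time of day")
    if amount > 4 * profile["typical_max_amount"]:
        reasons.append("unusually large amount")
    return reasons

# A $4,000 booking abroad at 4:00 a.m. trips all three checks.
print(flag_transaction(4000.00, "AE", datetime(2016, 4, 12, 4, 0), usual_profile))
```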
Big Data for Research
In the previous movies we looked at the role that big data can play in individual people's lives as well as in business. We also want to look very quickly at how big data has been revolutionizing aspects of scholarship and research. I want to show you a few interesting examples of where big data has influenced scientific progress. The first is Google Flu Trends, where researchers found that search patterns for flu-related words could identify outbreaks of the flu in the United States much faster than the Centers for Disease Control and Prevention could.
Similarly, a more recent project found that Wikipedia searches could identify flu outbreaks with even greater accuracy. The National Institutes of Health created the BRAIN Initiative as a way of taking enormous numbers of brain scans to create a full map of brain functioning. Additionally, NASA's Kepler space telescope has been on a mission to find exoplanets, or planets outside of our solar system. As you can see here, so far it has identified nearly 1,000 confirmed planets, with over 4,000 candidates.
Closer to home, psychological research has also been influenced by big data. Just last year, a paper about personalities in the United States was able to identify regional clusters of personality: a friendly and conventional region in the Midwest, a relaxed and creative region in the West, and a temperamental and uninhibited region in the Northeast and parts of the South. Now, this is not based simply on how people feel about those places; it was published in a journal of the American Psychological Association, so it's very high-quality psychological research.
Similarly, another group of researchers created an application on Facebook that used a scientifically valid measure of personality. They got data from several hundred thousand respondents, and by combining those responses with the patterns of likes that each of those people had on Facebook, they were able to create a single-question app, one that really just asks for access to your likes, that gives a surprisingly accurate estimate of what your personality scores would be if you took the entire questionnaire.
Finally, there's the Google Books project. For the last few years, Google has been scanning books published over the last few hundred years. They currently have over 30 million scanned books, they make them digitally accessible, and that allows people in the Digital Humanities to look at changes in word usage over time. There are some interesting findings. For instance, here we have the last 208 years of the prevalence of the words 'math', 'arithmetic', and 'algebra', where 'arithmetic' shows a strong spike in the 1920s and 30s and then declines, whereas 'math' has increased over the last 50 to 60 years with a peak right around 2000.
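Here's a hedged sketch of that kind of word-usage-over-time analysis using pandas and matplotlib. The file word_counts.csv (with columns year, word, count, total_words) is a hypothetical extract; the real Google Books Ngram data would have to be downloaded and prepared separately, but the plotting logic would look much the same.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical extract with columns: year, word, count, total_words
df = pd.read_csv("word_counts.csv")

# Use relative frequency, not raw counts, because far more books
# were published in recent years than in the early 1800s.
df["frequency"] = df["count"] / df["total_words"]

for word in ["math", "arithmetic", "algebra"]:
    subset = df[df["word"] == word]
    plt.plot(subset["year"], subset["frequency"], label=word)

plt.xlabel("Year")
plt.ylabel("Relative frequency")
plt.legend()
plt.title("Word usage over time (Google Books style)")
plt.show()
```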
Now, this is just one possible example of what can be done, but the idea here is that big data, with the volume of information that's available, the variety of information that can be combined, and the velocity, especially in cases like flu trends where things are changing constantly, can be put to good use for scientific research and advancement. It's an exciting time to see what's happening, and to see what will happen in the near future.
Big Data and Data Science
Big data can be characterized by more than the three Vs. Those were volume, velocity, and variety. There are several practical differences as well. Jules Berman has a book called Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. He lists 10 ways that big data's different from small data, and I want to go through some of those points here. The first is goals. Small data is usually gathered for a specific goal. Big data on the other hand may have a goal in mind when it's first started, but things can evolve or take unexpected directions.
The second is location. Small data is usually in one place, and often in a single computer file. Big data, on the other hand, can be in multiple files on multiple servers on computers in different geographic locations. Third is data structure and content. Small data is usually highly structured, like an Excel spreadsheet with rows and columns of data. Big data, on the other hand, can be unstructured; it can involve many formats and files across disciplines, and it may link to other resources.
Fourth, data preparation. Small data is usually prepared by the end user for their own purposes, but with big data the data is often prepared by one group of people, analyzed by a second group of people, and then used by a third group of people, and they may have different purposes, and they may have different disciplines. Fifth, longevity. Small data is usually kept for a specific amount of time after the project is over because there's a clear ending point. In the academic world it's maybe five or seven years and then you can throw it away, but with big data each data project, because it often comes at a great cost, gets continued into others, and so you have data in perpetuity, and things are going to stay there for a very long time.
They may be added on to in terms of new data at the front, or contextual data of things that occurred beforehand, or additional variables, or linking up with different files. So it has a much longer and really uncertain lifespan compared to a small data set. The sixth is measurements. Small data is typically measured with a single protocol using set units and it's usually done at the same time. With big data on the other hand, because you can have people in very different places, in very different times, different organizations, and countries, you may be measuring things using different protocols, and you may have to do a fair amount of conversion to get things consistent.
Number seven is reproducibility. Small data sets can usually be reproduced in their entirety if something goes wrong in the process. Big data sets on the other hand, because they come in so many forms and from different directions, it may not be possible to start over again if something's gone wrong. Usually the best you can hope to do is to at least identify which parts of the data project are problematic and keep those in mind as you work around them. Number eight is stakes.
With small data, if things go wrong the costs are limited; it's not an enormous problem. But big data projects can cost hundreds of millions of dollars, and losing or corrupting the data can doom the project, and possibly even the researcher's career or the organization's existence. The ninth is what's called introspection, and what this means is that the data describes itself in an important way. With small data, the ideal is what's called a triple, a form that's used in several programming languages, where you state, first, the object that is being measured.
Here, I say Salt Lake City, Utah, USA; that's where I'm from. Second, you say what is being measured, a descriptor for the data value; in this case, average elevation in feet. Then third, you give the data value itself: 4,226 feet above sea level. In a small data set, things tend to be well organized, individual data points can be identified, and it's usually clear what things mean.
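Here's a minimal sketch of what such a triple might look like in code. The exact representation varies by system (RDF stores, key-value records, and so on); this is just one illustrative way to write it.

```python
# A triple: (object being measured, descriptor, data value)
triple = ("Salt Lake City, Utah, USA", "average elevation in feet", 4226)

subject, descriptor, value = triple
print(f"{subject} | {descriptor} | {value}")
# Salt Lake City, Utah, USA | average elevation in feet | 4226
```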
In a big data set, however, because things can be so complex, with many files and many formats, you may end up with information that is unidentifiable, unlocatable, or meaningless. Obviously, that compromises the utility of big data in those situations. The final characteristic is analysis. With small data it's usually possible to analyze all of the data at once in a single procedure from a single computer file. With big data, however, because things are so enormous and spread across lots of different files and servers, you may have to go through extraction, reviewing, reduction, normalization, transformation, and other steps, dealing with one part of the data at a time to make it more manageable, and then eventually aggregate your results.
So it becomes clear from this that there's more than just volume, and velocity, and variety. There are a number of practical issues that can make things more complex with big data than with small data. On the other hand, as we go through this course we're going to talk about some of the general ways of dealing with these issues to get the added benefit of big data and avoiding some of the headaches.
Data Science
When people talk about big data, they nearly always talk about data science, and data scientists, as well. Now, just as the definition of big data is still debated, the same is true for the definition of data science. To some people, the term's just a fancier way of saying statistics and statisticians. On the other hand, other people argue that data science is a distinct field. It has different training, techniques, tools and goals than statistics typically has. Now, that's what we're going to talk about in this movie.
The first thing we want to do is look at what's called the data science Venn diagram. This is a chart that was created by Drew Conway in 2010, and what he's arguing is that data science involves a combination of three different skills. The first is statistics; that's on the top right. The second, on the bottom, is domain knowledge, meaning that you actually know about, for instance, management, advertising, or sports recruiting. And the one on the top left is coding, or being able to program computers. He's arguing that to really do data science, a person needs all three of these, so we're going to talk about each of these facets one at a time and in combination with each other.
The first component of data science is statistics, which shouldn't be surprising, because we're talking about data science. The trick here is that a lot of what goes into statistics and mathematics can be really counterintuitive, and if you don't have the specific formal training, you can make some really big mistakes. An easy example of this is what's called the birthday problem in probability. That's simply asking: what's the probability that two people in a room share the same birthday, month and day? Intuition suggests that to have a 50% chance of a match you would need over 180 people in the group, because that's about half as many people as there are days in the year.
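The probability is easy to compute directly. Here's a short calculation using the standard textbook formula (365 equally likely birthdays, leap years ignored):

```python
# Probability that at least two people in a group of n share a birthday.
def p_shared_birthday(n):
    p_all_different = 1.0
    for i in range(n):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

for n in (10, 23, 30, 50):
    print(f"{n:>2} people: {p_shared_birthday(n):.3f}")
# 10 people: 0.117
# 23 people: 0.507
# 30 people: 0.706
# 50 people: 0.970
```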
The correct answer, though, is a lot smaller than that. It's just 23 people, and that's all you need to have a 50% chance of a match. Because data scientists are often going to be looking for matches and associations, being able to get these probabilities right really matters, and that's why mathematical training is an important part of data science. The second element of data science is domain knowledge, and the idea here is that a researcher should know about the topic area they're working in. If you're working in, say, marketing, you need to understand how marketing works, so you can bring more insight and better direct your analyses and procedures to match the questions you have.
For instance, there's a wonderful blog post by Svetlana Sicular of Gartner, Inc., where she writes, "Organizations already have people who know their own data better than mystical data scientists - this is a key. The internal people already gained experience and ability to model, research, and analyze. Learning Hadoop" (a common software framework for dealing with big data) "is easier than learning a company's business." And that really underscores the importance of domain knowledge in data science.
The third element of Drew Conway's data science diagram is coding, and this refers to computer programming ability. Now, it doesn't need to be complicated; you don't have to have a PhD in computer science. A little bit of Python programming can go a very long way, because it allows for the creative exploration and manipulation of data sets, especially when you consider the variety of data that's part of big data. The ability to combine data that comes in different formats can be a really important thing, and that often requires some coding ability.
It also helps to develop algorithmic thinking, or thinking in linear, step-by-step terms to get through a problem. Fortunately, Lynda.com has a great set of tutorials on both Python and working with the command line, which are listed in the last video of this course. Next, we'll talk about combinations of two of these elements at a time. The first is statistics and domain knowledge without coding. This is what Conway calls traditional research, and it's where a researcher works within their field of expertise, using common tools and working with familiar data formats.
It's extremely productive, and nearly all existing research has been conducted this way. The American Psychological Association, that's in my field, specifically directs researchers, for example, to use the simplest methods possible that will adequately address their research questions. That's what they call a minimally sufficient analysis. So, these traditional methods are extremely important, but they aren't sufficient for working with big data, and we'll talk more about that. The second combination is statistics and coding without substantive expertise.
This is what Conway refers to as machine learning, and now, that's not to be confused with data mining. Machine learning is where an algorithm or a program updates itself and evolves to perform a specific analytical task. The most familiar example of this is spam filters in email, in which the user or a whole large group of users identify messages as spam or not spam, and the formula that the program uses to determine whether something is spam updates with each new piece of information to have increased accuracy the more you use it.
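As a minimal sketch of that spam-filter idea, here's a toy classifier built with scikit-learn. The handful of messages and labels are invented for illustration; a real filter learns from millions of labeled emails and keeps updating as new labels come in.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, invented training set: 1 = spam, 0 = not spam.
messages = [
    "win a free prize now", "cheap loans click here", "limited offer act now",
    "meeting moved to 3pm", "lunch tomorrow?", "draft report attached",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count features

model = MultinomialNB()
model.fit(X, labels)                     # learn what spam tends to look like

new = vectorizer.transform(["free prize inside", "see the attached report"])
print(model.predict(new))                # expected to print something like [1 0]
```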
Now, a risk with machine learning is that you get the idea of a mystical black box. You don't actually know how the program is doing what it's doing. On the other hand, if what you're looking for is prediction only, this can be a very effective method. On the other hand, with Conway's model, it's not enough to constitute data science without an important element of substantive or domain knowledge. The third combination here is domain knowledge and coding without statistics. Now, Conway labels this a danger zone, with the idea being that you have enough knowledge to be dangerous.
While there are problems with this, I'll mention two things. First, Conway himself notes that it seems very unlikely that a person could develop both programming expertise and substantive knowledge without also learning some math and statistics, so he says it would be a sparsely populated category, and I believe that's true. On the other hand, there are some really important data science contributions that come out of this combination, including, for instance, what are called word counts, which we'll talk about later. It's simple stuff. These are procedures that do not require sophisticated statistics.
You're just counting how often things occur, and you can get important insights out of that, so I wouldn't write this one off completely. I would say, though, that like Conway, it's not likely that a person could develop expertise in both coding and their domain without getting the math and statistics as well. And finally, of course, there's all three of these things, statistics, domain knowledge, and coding, at once, and that is the most common definition of data science.
Types and Skills in Data Science
When people talk about big data in newspaper articles or at professional conferences, it's easy to get the idea that data scientists are not just people who have domain expertise, understand statistics, and can program. Instead, they often start to sound like omniscient and omnipotent superhumans who can do anything, instantly and effortlessly. Of course, that's not the real picture. As in any other domain, there is a large range of skills involved in data science beyond the three mentioned earlier. A great report called "Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work" goes over this in detail.
It's a short 40-page book by Harlan Harris, Sean Murphy, and Marck Vaisman that was published by O'Reilly Media in 2013. It's available in print or as a free e-book from O'Reilly or Amazon. In it, the authors surveyed about 250 data science practitioners, asking how they identified themselves and how they would cluster the skills relevant to data science. Each of these classifications was then subjected to cluster analysis and then to a cross-classification. What they found, as you might expect, is that there's a high level of heterogeneity among people in big data; not everyone is the same.
The respondents rated themselves on 11 possible professional identities. In alphabetical order, they were: artist, business person, developer, engineer, entrepreneur, hacker, jack of all trades, leader, researcher, scientist, and statistician. This table shows how those 11 identities clustered based on the individuals' responses. They fell into four basic categories: the data developer (developer and engineer), the data researcher (researcher, scientist, and statistician), the data creative (jack of all trades, artist, and hacker), and the data business person (leader, business person, and entrepreneur).
Next, the respondents ranked themselves on a list of 22 possible skills, such as algorithms, visualization, product development, and systems administration. These skills sorted into five general categories relevant to data science: business; machine learning or big data (ML); math or operations research; programming; and statistics. What's important here is that the skills are not the same across categories. For example, in the business category on the far left we have product development and business, whereas in programming, near the right, we have systems administration and back-end programming.
Not everybody needs to be able to do all of the same things. In fact, when the researchers crossed the self-identification categories with the skills they got rough profiles of the skills associated with each category of data science practitioner. Not surprisingly, the data business person is a person who has skill in working with data but views themselves primarily as a business person or a leader and entrepreneur. The data creative is interesting for being the one where the skills are most evenly distributed.
On the far right, for example, the data researcher sees their skills as lying primarily with statistical analysis. The most obvious thing here is again, that not everybody's the same. Each group had at least some skill in each area but the distributions differed dramatically. What this makes clear is that there's room for substantial variation in personal interest and skill sets within data science and, by extension, within big data. It's helpful for everybody to know at least a little bit about each of the five skill categories that the researchers mentioned, but diversity is really the name of the game here.
I encourage you to download the free report and take a closer look at what they found because it helps reduce some of the perceived barriers and self-imposed limitations to engaging in data science and working with big data.
Data Science w/o Big Data
If you argue that big data requires all three Vs (volume, velocity, and variety) at once to count as big data proper, then it's entirely possible to be a data scientist, a person with domain expertise, statistical knowledge, and coding skills, without ever touching big data. We'll look at a few of these possibilities. First, let's review our Venn diagram of data science. This again shows statistics on the top right, domain knowledge on the bottom, and coding on the top left.
Taken together, those add up to data science. We also have a Venn diagram for big data, with volume, velocity, and variety, and again, depending on who you ask, you need all three of them at once to have big data. So let's take a look at data science with just one V at a time: statistics, domain knowledge, and coding, but with just velocity, or variety, or volume. The first example is volume of data without any remarkable velocity or variety.
This would mean a very large and static data set with a consistent format. The data would generally be structured as well, so we're not going to have free text. A good example of this is genetics data, as I'm showing right here from Nature Reviews. Genetics data is huge. It's enormous, and there's an enormous amount of work involved in processing it, but it follows a well-understood, consistent structure. Another good example would be many instances of data mining or predictive analytics.
In the latter case you may be trying to predict a single outcome, like whether a person will click on an ad on a web page. Here you might have a data set with thousands of variables and maybe even billions of cases, but all of the data is in a consistent format. The size of the data makes many common approaches impossible, so the coding skills of a data scientist may be essential, along with statistical knowledge and domain knowledge.
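As a hedged sketch of how coding skill helps with sheer volume, here's one common pattern: processing a file that's too large for memory in chunks with pandas. The file name ad_log.csv and the 0/1 column clicked are hypothetical.

```python
import pandas as pd

# Process a file too large to load at once, one million rows at a time.
clicks = 0
rows = 0
for chunk in pd.read_csv("ad_log.csv", chunksize=1_000_000):
    clicks += chunk["clicked"].sum()   # hypothetical 0/1 outcome column
    rows += len(chunk)

print(f"Overall click-through rate: {clicks / rows:.4f}")
```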
Next is data science for velocity without volume or variety. This refers primarily to streaming data with a consistent structure. By streaming data we mean that data is coming in constantly, and very often you're not holding on to it; you're just keeping a small window of it open. One interesting example is the earthquake detection system of the United States Geological Survey, the Advanced National Seismic System, which is simply watching to see whether earthquakes are happening or are about to happen. You don't necessarily need to hold on to all the data if what you're trying to do is trigger a response, so that if an earthquake is imminent or just starting, people may have enough time to respond to it.
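Here's a minimal sketch of that streaming pattern: keep only a small sliding window of recent readings and fire a trigger when a new reading departs from the recent baseline. The read_sensor() function and the threshold are placeholders, not any part of the USGS system.

```python
from collections import deque
import random

def read_sensor():
    # Placeholder for a real instrument feed (seismometer, thermometer, etc.).
    return random.gauss(0.0, 1.0)

window = deque(maxlen=50)   # keep only the 50 most recent readings
THRESHOLD = 3.0             # arbitrary alert level for this sketch

for _ in range(100_000):    # in practice this loop runs indefinitely
    reading = read_sensor()
    window.append(reading)  # older readings fall off automatically
    baseline = sum(window) / len(window)
    if abs(reading - baseline) > THRESHOLD:
        print(f"Trigger: reading {reading:.2f} far from recent baseline {baseline:.2f}")
        break
# Memory use stays constant no matter how long the stream runs,
# because nothing outside the window is ever stored.
```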
When data is coming in very fast but you're not necessarily keeping it, so it's relatively low in volume, and it has a very consistent structure, so it's low on variety, this is also called data stream mining. One possible example is real-time classification of streaming sensor data. Finally, for one V at a time, let's talk about data science for data that has variety, a lot of different formats, without velocity or volume. This is where you have a complex but small and static, or relatively static, data set.
A couple of examples could include facial recognition in a personal photo collection: you don't have an enormous number of photos, but you do have a lot of variety (visual data is almost always very high in variety), and it may be static because you're not adding to it constantly. Or consider the data visualization of complex data sets. One of my favorite examples is from the site Visually, where there's an exercise that shows the 892 unique ways to partition a 3 x 4 grid. This is something you would not want to create by hand, but it takes a fair amount of coding to create the diagram.
Now, these are examples of data science, which again means statistics, domain expertise, and coding, where you're dealing with just one of the Vs of big data at a time. You can also do two Vs at a time. For instance, you may have data science for data with volume and velocity but not a lot of variety: a lot of data is coming in very fast, but it's all in the same format. Examples include stock market data, or here's an interesting one that has to do with jet engines. A surprising statistic here comes from this chart.
Now, there's a little bit of math here, and I think they may be multiplying more than they need to, but they say that a jet engine has sensors on it that generate 20 terabytes of information each hour. That's an enormous amount of information. So that's 20 terabytes per engine per hour, times two engines, times a six-hour cross-country flight, times 28,000 flights, although I don't think they're all cross-country flights, times 365 days a year. What they're saying is that they have over two and a half billion terabytes of data per year just from jet engines, if all that math is correct.
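For what it's worth, here's that multiplication spelled out, taking the chart's figures at face value (including the assumption, which the narration questions, that every flight is a six-hour cross-country flight):

```python
# Reproducing the jet-engine arithmetic quoted above.
tb_per_engine_per_hour = 20
engines_per_plane = 2
hours_per_flight = 6          # assumes every flight is a six-hour flight
flights_per_day = 28_000
days_per_year = 365

tb_per_year = (tb_per_engine_per_hour * engines_per_plane *
               hours_per_flight * flights_per_day * days_per_year)
print(f"{tb_per_year:,} TB per year")   # 2,452,800,000 TB, roughly 2.5 billion terabytes
```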
But the point here is it's a lot of data and you would want to hold on to that because the failure of a jet engine is an extraordinarily important thing, and so you want to be able to find the patterns in it fully. Another possibility is data science applied to data that has velocity. So it's coming in fast and there's a lot of variety but not a lot of volume. So again, this is streaming data where you're not necessarily holding on to everything. One interesting example of this is surveillance video. Again, we could go ahead to the next chapter where we talk about ethics, but there is a lot of surveillance video.
In fact, if you look at the end of the second paragraph here, it says that according to the International Data Corporation's recent report, The Digital Universe in 2020, half of global big data, the valuable matter for analysis in the digital universe, was surveillance video in 2012, and the percentage is set to increase to 65%, about two-thirds of it, by 2015. That's because surveillance video is moving into high definition, as opposed to the really low-resolution footage you often see. Now, if you're saving all the data, then that's an enormous volume.
On the other hand, if you're not saving it but you're streaming it in, it's very fast, because the information comes in very quickly, maybe 20 or 30 frames per second, and it has a lot of variety because it's visual information. But if you simply want to see whether, for instance, a person carrying a weapon comes through, or whether a particular event occurs, you use the stream and you're just trying to trigger a response when something happens. Finally, let's talk about data science for volume and variety without velocity. This can be any large historical data set that uses multiple formats or includes visual data.
A really good example of this is Google Books, which we've looked at before. They have 30 million books that they've scanned and digitized, and you're dealing with really complex information here. This is one of my favorites: a book called The Anatomy of Melancholy. I actually have the hard copy of this as well, but I love seeing it online. Similar examples include the Twitter archives, where every single tweet that's ever been written has been saved. That's an enormous amount of information, and the text is complex, but because it's not being updated constantly, it doesn't have the velocity.
What these examples show is that despite the strong association between big data and data science, the skills of data science (statistical knowledge, domain expertise, and coding skills) apply even when the three major aspects of big data aren't all present at the same time. In the next movie we'll look at the flip side of all this: how to work with big data without requiring the full data science skill set.
Big Data w/o Data Science
In the last movie we talked about doing data science without actually having Big Data. In this movie, we'll take the complement of that: we'll look at scenarios in which a person works with Big Data but doesn't require the full data science skill set. As a reminder, Big Data usually involves unusual volume, velocity, and variety in the data. Data Science also has its little diagram: it involves statistical skills, domain knowledge, and coding ability.
And the three of them together give you data science. Now let's take a look at whether you can do Big Data with just two of the data science skills. How about with just statistics and coding? The answer, of course, is yes, because that's where we have machine learning. Machine learning is a very important area of data science, and it's where a computer program learns to adapt to new information as it comes in. The two most familiar examples are spam filters, where the program learns whether a particular kind of email is spam or not based on your own responses and those of millions of other people who use the same email service, like Gmail.
Or facial recognition in photographs, where a computer program learns which face belongs to whom. Here's an article from Nature that talks about artificial intelligence and machine learning in general, and if we scroll down a little, it talks specifically about the problem of facial recognition and how computers can learn to identify faces. It's funny: it's something that's very easy for humans to do, but much harder for a computer. So machine learning is a good example of working with Big Data, because it can have volume, velocity, and variety, but you don't necessarily need domain knowledge, because the computer is working without any such knowledge whatsoever; it simply gets it right or it doesn't.
Another possibility is what Drew Conway, who created this Venn diagram, originally called the danger zone: the combination of coding and domain knowledge without statistics. Now, he says "danger" here, but there are some very good examples of data science in this space that do not involve statistical knowledge, the most common of which are word counts and the parsing of natural language. The most common tool used for this is what's called the Natural Language Toolkit (NLTK), a package used in Python, the programming language.
And it lets people do all sorts of amazing things. You can do things like word counts. The best-known word count is the one that was used a few decades ago to identify the authorship of the disputed Federalist Papers in American political history, and the technique extends all the way up to things like comparing the vocabulary sizes of various hip-hop artists. So there are amazing things that can be done with natural language simply by counting how often words occur, without requiring any statistical knowledge per se, because there's no inferential procedure that goes into it.
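Here's a minimal word-count sketch. NLTK offers much richer tokenizers and corpus tools; this version uses only the Python standard library so it runs as-is, and the sample sentence is invented.

```python
import re
from collections import Counter

text = ("Big data is data that does not fit the tools we already have, "
        "and big data keeps growing.")

# A simple regex tokenizer is enough for a sketch; NLTK does this far better.
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

print(counts.most_common(3))
# [('data', 3), ('big', 2), ('is', 1)]
```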
Now, those are two of the two-at-a-time combinations of data science skills that can still work with Big Data. That leaves a third one: statistics and substantive or domain expertise, which is traditional research. Unfortunately, as much as I personally value traditional research (I'm trained as an experimental social psychologist), with this combination you're not able to work with Big Data. Without the coding skills, you're not going to be able to deal with the volume, the velocity, or the variety of data that characterizes Big Data. Tremendous things are accomplished in traditional research, but Big Data is not one of them.
It also goes almost without saying that, as far as I can tell, unless you have at least two of these skills, it's just a non-starter. You simply cannot work with Big Data if you have only statistical knowledge, or only coding skills, or only domain knowledge. Instead, what you would have to do in that situation is collaborate with people who have the other skills. In fact, collaboration really is the rule in data science rather than the exception, because such a broad range of skills is necessary that nobody is usually able to bring all of them, so people have to work together.
And that actually is one of the wonderful things because I think that most of the interesting developments come about through collaboration. And that's something that data science encourages strongly. And so, when you look at the relationship between data science and Big Data, the situation isn't exactly even. It's possible to do data science with an incomplete version of Big Data, but it's much more difficult to do Big Data work without the triumvirate of data science skills. We'll get more into the specifics of working with Big Data in later movies, but first we need to talk about ethics and Big Data.
Ethics
Challenges with Anonymity
We've discussed some amazing things that can be done with big data, especially when an individual's information is compared to a massive data set. Some of those examples, however, can feel like they've crossed the line that separates the impressive from the creepy. That's because privacy is an important issue, and a lot of these examples feel like they may have reached into private information that people did not expect to have divulged. People don't want their personal information to be public, and over the last several years there have been severe consequences to breaches of privacy, whether accidental or intentional.
On the other hand, people do want quality services, and we need good research; there's the rub. One possible solution is to anonymize the data, to make it anonymous by removing identifiers like names, addresses, and other obvious bits of personally identifiable information. But many such attempts have failed dramatically. One of the major challenges of working with big data is that even when someone attempts to make the data anonymous by removing obvious identifiers, it is often still possible to de-anonymize the data.
A really telling example of this came a few years ago during the Netflix Prize, when Netflix provided people with an anonymized data set of user ratings on movies. Two researchers, Arvind Narayanan and Vitaly Shmatikov, were then able to take that information and compare it with identified user ratings on the Internet Movie Database, IMDb. They were able to match people who were identified by name on IMDb with anonymized ratings on Netflix, and so connect the two data sets.
A bigger challenge came from another contest that looked at social networks. The contest data said, in effect, that this person, identified only by a random number, is connected to this person, and this person, and this person. From that you get a social network graph, the kind of picture we've shown before, where each circle is a node representing an individual and the lines are edges connecting them to other individuals. What the researchers were able to do in this case was send out a computer crawler to go through several publicly available social network sites, and simply by matching shapes, with no information aside from the shape of the graph, they found a match.
The matching network turned out to come from Flickr, the photo-sharing site. They were able to identify shapes in the Flickr social network that matched shapes in the contest data, and in that way they were able to de-anonymize that data set as well, simply by matching the geometry. In an even more dramatic example, researcher Latanya Sweeney showed that she was able to purchase voter records for Cambridge, Massachusetts.
What this included, on the left, is the name and address, date registered, party affiliation, and date last voted, along with the zip code, birthdate, and sex for each person on that voter list. On its own that's not terribly sensitive information, although you do find out people's party affiliation and when they voted. But what she also found is that with just three of those pieces of information, the zip code, the birthdate, and the person's sex, she could go to publicly available medical records, match them up, and find the person's ethnicity, visit date, diagnosis, procedure, medication, and total charge.
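A hedged sketch of that linkage idea: joining two "anonymized" tables on zip code, birthdate, and sex with pandas. The handful of records here are invented, but the mechanics are the same as in the study described above.

```python
import pandas as pd

# Invented voter list: names attached to zip, birthdate, and sex.
voters = pd.DataFrame({
    "name":      ["A. Smith", "B. Jones"],
    "zip":       ["02138", "02139"],
    "birthdate": ["1945-07-31", "1972-03-02"],
    "sex":       ["M", "F"],
})

# Invented "anonymized" medical records: no names, but the same three fields.
medical = pd.DataFrame({
    "zip":       ["02138", "02139"],
    "birthdate": ["1945-07-31", "1972-03-02"],
    "sex":       ["M", "F"],
    "diagnosis": ["hypertension", "asthma"],
})

# The join re-attaches names to diagnoses using only these quasi-identifiers.
linked = voters.merge(medical, on=["zip", "birthdate", "sex"])
print(linked[["name", "diagnosis"]])
```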
In fact, she was able to identify the governor of Massachusetts in that data set and get his medical information through this roundabout procedure. Now, she wasn't trying to dig up dirt on anybody; what she was trying to show is that with small pieces of readily available information (zip code, birthdate, and sex) she was able to correctly identify 97% of the individuals in the medical data. This was done about 20 years ago, and the nice thing is that regulations have changed since then about the kind of information that's available.
Now we have the Health Insurance Portability and Accountability Act, better known as HIPAA, which has a lot to do with privacy regulations in medicine. HIPAA requires that a lot of different information, about 17 major variables, be anonymized or aggregated. For instance, you can't give a person's exact birthdate; you can only say how many years old they are, and if they're over 89, you just have to say they're over 89. You can't give the zip code, only the state, so people fall into much larger groups and become much harder to identify.
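Here's a minimal sketch of that kind of generalization. The actual HIPAA rules are more detailed than this, so treat the fields and cutoffs as rough illustrations of the idea rather than a compliant implementation.

```python
from datetime import date

def generalize(record, today=date(2016, 4, 12)):
    """Replace exact quasi-identifiers with coarser, safer values."""
    age = today.year - record["birthdate"].year   # rough age; ignores month and day
    return {
        "age":   "90+" if age > 89 else age,      # exact birthdate dropped, ages over 89 capped
        "state": record["state"],                 # zip code dropped, state kept
        "sex":   record["sex"],
    }

raw = {"birthdate": date(1945, 7, 31), "zip": "02138", "state": "MA", "sex": "M"}
print(generalize(raw))   # {'age': 71, 'state': 'MA', 'sex': 'M'}
```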
Other research by Latanya Sweeney has shown that when you remove the HIPAA-protected information, the re-identification rate drops to only about four one-hundredths of a percent of people. For comparison, the probability of getting struck by lightning is about one in 10,000, so the risk is in the same vicinity. Similarly, law professor Paul Ohm has shown that re-identifying people in an anonymized data set is enormously difficult; it requires what he calls massive statistical and data management skills.
The point here is while it can be done, it's very hard to do it. As Professor Sweeney has shown, if the information is properly anonymized, it's nearly impossible to identify people. The point of these examples is not to encourage paranoia but rather to point out that care needs to be taken when we're working with big data, especially when personally identifiable information is present, in order to ensure privacy. Anonymization is a start but it requires some thought and care to be done well, and if it is done well, it is still possible to provide services and conduct research without crossing the line into creepiness or into legal trouble.
Challenges with Confidentiality
Anonymity means that individuals cannot be identified. Another important element of privacy, however, is confidentiality for what's called nonpublic information. In its simplest form, confidentiality means that regardless of whether individuals can be identified within the data, their data will not be shared with people they did not specifically allow to see it.
Confidentiality's an issue of trust that makes interactions possible. For instance, I have given my credit card information to several online companies because one, it makes interactions with them much easier if that information is already stored, and two, I feel confident that they'll keep that information private. There are several exceptions and limits to confidentiality, however, that need to be discussed. The first is that in conducting transactions, companies do share limited amounts of information with third parties. For example, if a person goes to make a large purchase, it's not uncommon for the vendor to call the bank and ask whether the person has enough money in the bank to make that purchase.
They're just checking for what's called sufficient funds. And all the bank does is tell them yes or no. And so there's a sharing of a very small amount of information, but the bank does not pass along the person's account numbers, their ID information, doesn't give them their actual balance; just says whether they have enough to cover a particular transaction. On the other hand, it is also sometimes the case that information is stolen from companies. Several companies have had their data stolen, including credit card information, addresses, and other important personal information.
And in some cases, companies have lost hundreds of millions of dollars in business because consumers were no longer confident that their information would be safe. Similarly, another limitation to confidentiality is that companies sometimes have to give their information to courts or government regulators as part of lawsuits, disclosures that could put them out of business completely if they did not comply. It's a very awkward situation, but it does come up occasionally. The trick with that one is that while it is a legal process, it is not something that the users originally agreed to, and so there is a violation of trust even if what's happening is technically legal.
Now, these exceptions don't call for a complete lockdown on data, because the data does provide very important services, but they do call for more care and attention. So, for example, the National Science Foundation and the National Institutes of Health have instituted policies that require researchers to present a plan for providing access to the data they use in their studies. Additionally, there are insurance companies that now provide insurance for the cost of data breaches, and companies should consider whether they want to purchase that kind of insurance.
It's still relatively uncommon, but it is available. And finally, companies should very carefully consider whether they even need to have the confidential information in the first place. The point here is, you can't lose something if you don't have it to begin with. And so a company should consider the actual services they provide and whether that information is important. I can imagine a scenario, for instance, in which a new life insurance company might want access to medical records. But it's much harder to imagine a scenario in which it would be appropriate for a credit card company to have that kind of information.
And so it's clear that things can, and occasionally do, go very wrong when data is supposed to be confidential. But again, this doesn't call for a complete avoidance of data with non-public information in it. After all, that's what makes things like restaurant and movie recommendations work so well, and it's certainly important for using big data to make medical discoveries. Some of these benefits are personal conveniences and could theoretically be forgone, but others have life-and-death consequences and deserve to be maintained. For these reasons, it's important for companies to consider how they're going to deal with non-public information.
That way, the trust of regular people, the consumers and the people who benefit from big data research, can be maintained, and big data can provide its full benefits.
Sources and Structures of Big Data
Human-Generated Data
Big data can come from several different sources. Now, one way to think about it is whether the data was produced by a human or whether it was produced by a machine. We humans generate a lot of data whether we mean to or not. The first thing I want to talk about is intentional data. This is data that you know you are creating, so for instance, if you take photos or videos, record audio, or put text on a social network, you know you're doing it. The same goes for clicking "like" on Facebook.
So do your web searches, the records of web pages that you have viewed or bookmarked, your emails and your text messages, your cell phone calls, the highlights, notes, and bookmarks you make when you read an eBook, and your online purchases. All of these are kinds of data that do not exist until a person deliberately makes them happen, so these are records of human actions. What's interesting is that, in addition to these intentional pieces of information, there's also metadata. Now, metadata is data about data.
You might call this second order human generated data. Now, that's my own term. I made that one up. But the idea here is this is data that accompanies the things that you do, and you may not be aware of it. What's funny is that the meta data, first off, can be enormous, sometimes larger than the actual piece of data you created, and most significantly for the big data world, meta data, because it's computer generated, is already machine readable and searchable. I want to show you some of the things that show up in meta data and what can be done with it.
A really simple one is in photographs. So, for instance, if you take a picture with your phone, you not only get the picture, you also get what's called the EXIF data. That stands for exchangeable image file format, and this is the meta data that comes from a picture on an iPhone. Now, aside from the name of the file which you see at the top left and that it's 3.1 megabytes, and the time that it was taken, near the bottom on the right side, you see the GPS altitude, latitude, longitude and position, or partway up from that, you'll even see the GPS image direction.
It knows which way you're holding the camera. This is an enormous amount of information that accompanies the photo you get. Another interesting thing is cell phone metadata. Now, this is information that is not normally publicly available, so this is more of an academic interest, but the idea is that there's a lot of information that accompanies your phone call. Without even knowing who you're calling or what you said on the call, just two pieces of information, the time of the phone call and the location, can reveal a lot.
What's interesting is that with four of these pieces of data, or rather, four calls where you know the time and the position, in an anonymized data set, that's enough information to identify 95% of individuals. Again, I'd like to emphasize that this is information that is not publicly available, but the point here is that it's possible to tell a lot about people from the metadata. Another one is email metadata. Now, there's a lot of information that accompanies each email, but four very common pieces are: who it's from, who it's going to, whether anybody was CC'd (carbon copied), and the time that it was sent.
Now, one really interesting thing you can do for yourself is that MIT has created a web app called Immersion that allows you to do a quick analysis of your own social network via your email account. I'm going to show you the Immersion website. It's immersion.media.mit.edu, and what it does is take those four pieces of information about your emails and put together an image of the network. I've done this for myself, and it's absolutely fascinating.
It takes a few minutes and you have to refresh it to get it up to date, but what I'm going to do for this example is come down here to the Immersion demo. This is fictional data, but it's what yours would look like if you did it; Tony Stark is not at MIT. What we have here is a fictionalized analysis of what his information would look like with six years of emails, 20,000 emails to 159 collaborators, and what you can see is who is in his group. He emails more with Travis than with anybody else.
It looks like Travis is connected with Tabitha, and you can change, for instance, how things are distributed over here: the number of nodes that show, so now we have more and more people showing up, and we can make that a little bit smaller; and the strength of the links between them, so now nobody's connected, and now everybody is connected. What's interesting about it is, when I do it, I have a distinct group that shows up for my family, another distinct group that shows up for my job, another distinct group for the people I work with here at Lynda.com, and another group for a nonprofit I worked with, and it's funny to see that there are sometimes people who connect several of those groups at once. But the interesting thing, again, is that this is just based on who a message was sent to, who it was from, the CCs, and the time and the date, and that's how it's able to reconstruct your own personal social network.
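As a rough illustration of how a tool like Immersion can rebuild a network from nothing but headers, here's a minimal sketch in Python. The email records are made up for this example; only the from, to, cc, and date fields are used, just as described above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical email metadata: only sender, recipients, CCs, and a timestamp.
emails = [
    {"from": "tony", "to": ["travis"], "cc": ["tabitha"], "date": "2016-04-01"},
    {"from": "tony", "to": ["travis", "tabitha"], "cc": [], "date": "2016-04-02"},
    {"from": "pepper", "to": ["tony"], "cc": ["travis"], "date": "2016-04-03"},
]

edges = Counter()
for msg in emails:
    # Everyone who appears on a message is treated as connected to everyone else on it.
    people = {msg["from"], *msg["to"], *msg["cc"]}
    for a, b in combinations(sorted(people), 2):
        edges[(a, b)] += 1

# The strongest edges reveal the clusters (family, work, and so on) without any message content.
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight} shared messages")
```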
The last thing I want to show you is about Twitter. Now, one of the interesting things about Twitter, aside from the fact that it's a very popular social network, is that tweets are very small; they're limited to 140 characters. What's interesting about this from a research point of view is that there's an enormous amount of metadata that accompanies each tweet, so I'm going to go to an article here called "What's (technically) in your tweets?", and scroll down a little bit, and I'm going to zoom in on this image. Now, this was put together by an employee at Twitter, Raffi Krikorian.
I'm going to make that a little bigger here. Let's go up to the top. This information in red is the content of the tweet; that's the stuff that you meant to put there. But look at all this other information that accompanies it: whether there was a reply, a truncated version, the author's screen name and URL, the location, I like this rendering information, the creation date, the account, the number of followers, the time zone, their language, whether they have a verified badge, the place ID, the URL, the country, the bounding box, and the application it was sent from.
It's an enormous amount of information. The metadata in this case is several times greater than the actual content that it's about. This is one of the most interesting things, and this is why Twitter in particular is a very rich data set for people who are doing marketing research or social connection research. Anyhow, the point of this is that these are all sources of big data, and one of the interesting things about it is that the metadata in particular does not have to be processed. It's already computer readable.
It's searchable. It's minable, and you can start to get information out of it immediately for your big data analysis purposes.
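To give a feel for how much of a tweet is metadata, here's a small sketch that parses a simplified, made-up tweet payload with Python's standard json module. The field names are loosely modeled on what the article describes and are purely illustrative, not an exact copy of Twitter's API.

```python
import json

# A simplified, hypothetical tweet payload: "text" is the part the user wrote;
# everything else is metadata generated by the platform.
raw = """
{
  "text": "Heading to the game tonight!",
  "created_at": "2016-04-12T19:30:00Z",
  "lang": "en",
  "user": {"screen_name": "example_fan", "followers_count": 212,
           "time_zone": "Eastern Time (US & Canada)", "verified": false},
  "place": {"country": "United States", "full_name": "Nashville, TN"},
  "source": "Example Mobile App"
}
"""

tweet = json.loads(raw)
content = tweet["text"]
metadata = {k: v for k, v in tweet.items() if k != "text"}

# The metadata is immediately machine readable and searchable, no text mining required.
print("Content:", content)
print("Metadata size:", len(json.dumps(metadata)), "characters vs",
      len(content), "characters of content")
```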
Machine-Generated Data
In T.S. Eliot's poem The Love Song of J. Alfred Prufrock, he has the lines, "I have heard the mermaids singing, each to each. I do not think that they will sing to me." I bring this up because it has been estimated that as much as 95 percent of the world's data will never be seen by human eyes. Much of this unseen data is called M2M data, or machine to machine. The machines are talking to each other, not to be heard by humans, just like Prufrock's mermaids. Now, let me talk about some of the sources of machine-generated data.
There's a very long list, this is not meant to be comprehensive, but things like when the cell phones ping to the cell towers to check where they are, when the satellite radio and the GPS connect to locate the car or your phone, the RFID or radio frequency identification tag readings on billions of small objects, readings from medical devices and web crawlers and spam bots. It's especially fun to think about the electronic spam bots being stopped by the electronic spam filters, each working against another.
Perhaps the most interesting part of machine-to-machine communication falls under the rubric of the Internet of Things, sometimes just called IoT. Now, it's estimated that by 2020, which is only a few years from now, as many as 30 billion uniquely identifiable devices may be connected to the internet. This actually requires a major change in how the internet handles its addressing system for all of these devices to fit, but basically, everything will have a chip, it will be connected to the internet, and these things will be talking to each other, sharing information.
So when you hear people talk about smart sensors in your home, in your city, or on your air conditioning; the smart home, which knows when to turn lights on or change the temperature; the smart grid, where the city generates and sends out electricity; or the smart city itself, which coordinates all of the traffic and utilities as a way of being more efficient and more economical in providing better service, all of this would be enabled by small objects communicating with one another in the internet of things. They communicate directly, not through a human intermediary.
Some of the uses for this can include putting sensors on production lines to monitor systems for when they need maintenance, or smart meters on utility systems to shut them off at peak times if they can do it without interrupting service. My dog has a little chip under its skin as a way of identifying pets, and the same is done for farm animals, tracking them through systems. There are smart thermostats and light bulbs, you can actually get iPhone-controlled light bulbs now, or just environmental monitoring, which can include things like air and water quality, atmospheric or soil conditions, movements of wildlife, and earthquake and tsunami early warning systems.
Also things like infrastructure management, which covers the control and monitoring of bridges, railways, wind farms, and traffic systems. Industrial applications like manufacturing process controls, supply chain networks, predictive maintenance, or integration with a smart grid to optimize energy consumption. There's energy management: the switches and outlets and bulbs and TVs and screens and heating systems, controlling ovens, changing lighting conditions. And then medical and healthcare systems.
Building and home automation, and transport systems. Basically anything that's mechanical can eventually be hooked up and communicating through the internet of things, all of which generates an enormous amount of data as these devices talk to one another and coordinate their own activities. So, in looking at the differences between machine-generated data and human-generated data, which we talked about in the last movie, the most obvious difference is that machines don't generally post selfies on Facebook, they don't make silly videos on YouTube, and they don't write job applications and put them on LinkedIn. So the content is different, but what is interesting is that content may not be the most important difference.
Instead, the most important distinguishing feature of machine generated data may be that all of the machine generated data is machine readable. It can be immediately searched and read and mined. It's high on volume and velocity, two of the characteristics of big data, but low on variety, which makes it easier for machines to deal with it. And that brings up the important distinction between what's called structured, unstructured and semi-structured data, and that's what we'll discuss in the next set of videos.
Structured Data
Data is said to be structured when it's placed in a file with fixed fields or variables. The most familiar example of this kind of structured database is a spreadsheet, where every column is a variable and every row is a case or observation. In the business world, however, large data sets are usually stored in databases, relational databases to be specific, which share some characteristics with spreadsheets, such as rows and columns, but allow for much larger data sets, more flexibility, and more consistency. A recent survey of database users found that nearly 80 percent use some form of relational database, with Microsoft SQL Server, MySQL, and Oracle as the most common options.
Now, before I say any more, I need to give you a very brief history of SQL databases. In the early 70s, researchers at IBM wrote a paper describing the Structured English Query Language, originally written out as the word SEQUEL. The name was later shortened to SQL because of a trademark issue, but it's still generally pronounced "sequel". IBM, however, did not commercially launch SQL. That happened in the late 70s through Relational Software, Inc., which later became Oracle, which is still one of the biggest providers of database software in the world.
Oracle is also well known for making one of its co-founders, Larry Ellison, one of the richest people in the world. Now, what SQL does is make it easy to extract, count, and sort data, and to create unions and intersections between sets. It's also used to add, update, and delete data. And it does all this in a language that's much easier to manage than the mouse-driven clicks and selections of a spreadsheet. For example, if a university has a database of student information, then a SQL command to get the gender, college, department, and major of the students might be written this way: SELECT, that's a keyword, and then you list the four fields or variables that you want: gender, college, department, and major.
Then FROM, another keyword, says the data is coming from a table called students. If you want to make it slightly more elaborate, you can specify students over the age of 25 using the WHERE clause. So again, we have a single statement here: SELECT these four fields, gender, college, department, major, FROM the table students, WHERE the value in the field age is greater than 25, and finish with a semicolon. Now, there's a whole lot more that can be said about structured data and SQL databases, but it'd be better to direct you towards some of the excellent courses that lynda.com has on these topics.
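First, though, to make the query concrete, here's roughly what the statement just described looks like in practice. This is a minimal sketch using Python's built-in sqlite3 module; the table and column names follow the example, and the sample rows are made up.

```python
import sqlite3

# Build a tiny in-memory students table so the query has something to run against.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students
                (gender TEXT, college TEXT, department TEXT, major TEXT, age INTEGER)""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    ("F", "Science", "Biology", "Genetics", 27),
    ("M", "Arts", "Music", "Composition", 22),
    ("F", "Engineering", "Computer Science", "Data Science", 31),
])

# The SELECT ... FROM ... WHERE statement described above, ending with a semicolon.
query = """SELECT gender, college, department, major
           FROM students
           WHERE age > 25;"""

for row in conn.execute(query):
    print(row)
```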
Those lynda.com courses include Foundations of Programming: Databases, Foundations of Programming: Data Structures, SQL Essential Training, and MySQL Essential Training. And with that, we'll turn to the complements of structured data and SQL: unstructured and semi-structured data, and NoSQL databases.
Unstructured Data
If you were to write this data in a report, you might say something like, "Two prominent soccer teams in Europe are Manchester United and FC Barcelona," and "Two NFL football teams in America are the Oakland Raiders and the Tennessee Titans." Now, while this data is easy for a person to understand, it's much more difficult for a machine to understand. It's not easy to sort this kind of data, it's hard to rearrange it, it's hard to count the values, and it's hard to add more observations. So this is an example of unstructured data, that is, data that's not in fixed fields; text documents, presentations, images, video, audio, PDFs, what have you, all go into this category.
And it may be the majority of data in business settings. I've seen estimates anywhere from about 45 percent up to as much as 80 percent of business data may be unstructured. It's a little hard to deal with. You may have to convert it to text and then use a text-mining program to try to get structure out of the sentences of data, but that's difficult, and it's time consuming to do. On the other hand, data doesn’t have to be either structured in a spreadsheet or unstructured in text. A third option is available, and that's semi-structured data.
Now, semi-structured data is data that's not in fixed fields, so it's not the rows and columns of a spreadsheet, but the fields are still marked and the data are still identifiable. Two common formats for semi-structured data, they're not the only ones, but the most common, are XML, which stands for Extensible Markup Language, and JSON, for JavaScript Object Notation. Let me show you what the sports data would look like in XML. What we need to do is have a series of tags in angle brackets that indicate what information is being shown at the moment.
The first tag indicates that we're going to talk about sports. Then the value in that field is soccer, and we break it down into teams, each with an opening tag that says team and a closing tag with a slash to indicate we're done talking about that team. If you've seen HTML, it looks similar, and it proceeds in a sort of nested, winding format. What this shows is that a semi-structured format like XML is really good for nested data or hierarchical structures, whereas trying to show the same thing in a spreadsheet means you end up having to repeat a lot of the information.
This avoids the repetition, and so it's actually more efficient in that sense. On the other hand, XML is a slightly older format, and it's a little wordy. A more recent development is JSON, or the JavaScript Object Notation, and JSON has a similar feel to it, but there's not as much text. We still have the brackets. We have curly brackets to start our thing, and then we put the names of the fields and the values in quotes, but you can see the same kind of in-and-out structure that we had before.
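To make that concrete, here's a small sketch of what the sports data might look like in each format, parsed with Python's standard library. The tag and field names are my own illustration of the nesting described above.

```python
import json
import xml.etree.ElementTree as ET

# Semi-structured data in XML: nested tags instead of rows and columns.
xml_data = """
<sports>
  <sport name="soccer">
    <team>Manchester United</team>
    <team>FC Barcelona</team>
  </sport>
  <sport name="football">
    <team>Oakland Raiders</team>
    <team>Tennessee Titans</team>
  </sport>
</sports>
"""

# The same nested structure in JSON: field names and values in quotes, inside curly brackets.
json_data = """
{
  "sports": [
    {"sport": "soccer",   "teams": ["Manchester United", "FC Barcelona"]},
    {"sport": "football", "teams": ["Oakland Raiders", "Tennessee Titans"]}
  ]
}
"""

root = ET.fromstring(xml_data)
for sport in root.findall("sport"):
    print(sport.get("name"), [team.text for team in sport.findall("team")])

for entry in json.loads(json_data)["sports"]:
    print(entry["sport"], entry["teams"])
```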
Now, a nice thing about JSON is that you can also write it as a single line, which is more compact and still indicates the hierarchical or nested nature of the data without having to repeat information the way that we did in the original spreadsheet. The next thing we want to talk about, once we've discussed unstructured and semi-structured data, is the databases that you can store this information in, as opposed to the SQL databases that are used for structured data with rows and columns. Semi-structured and unstructured data usually go into what are called NoSQL databases.
That used to mean "not SQL," but now it's taken to mean "not only SQL," because NoSQL databases are extremely flexible and can handle a wide range of data formats. Most of them use a semi-structured format. For instance, the most common NoSQL database, MongoDB, uses a JSON-like document format. It's nice because it's a flexible structure, and for certain tasks a NoSQL database can be much faster than a SQL database. On the other hand, they haven't been adopted as widely as relational databases.
For instance, the survey that I mentioned in the last movie said that about 79 percent of companies had used relational databases, whereas only about 16 percent had adopted NoSQL databases. On the other hand, if you're looking at sheer volume, the picture changes, because Hadoop, which we'll mention a little bit later, takes a NoSQL approach, and a huge share of big data lives in Hadoop or in MongoDB. Even though there are fewer adopters by head count, an enormous amount of data is stored in those formats. Now, one of the big problems is that whereas SQL databases all use at least relatively standardized versions of the SQL query language, there is no standardized query language for NoSQL databases, which means that as you switch from one to another, you may have to learn all over again how to work with it.
That's a problem, but on the other hand, NoSQL databases are an area of huge development, and I imagine that will get resolved pretty quickly. In the meantime, if you'd like to learn more about unstructured or semi-structured data or NoSQL databases, lynda.com has a great set of courses you can work with: Up and Running with NoSQL Databases, XML Essential Training, Working with Data on the Web, Real-World XML, and JavaScript and JSON. Those will give you a good feel for what's possible with these more flexible and more recently developed formats, especially as you apply them to web data and big data.
Storing Big Data
Distributed Storage and the Cloud
Big data is defined by volume, velocity, and variety. The first of these characteristics, volume, makes it difficult to store data on a single drive. As a result, distributed storage, which means storing data across more than one computer, has become a necessity. In a traditional storage system, data on a disk is stored in blocks. Each block contains the 0's and 1's that make up the bytes of letters and numbers. Stored data also includes error correction code, which is used to verify the integrity of the stored data. If your data sets become too large for your computer's drive, then adding an extra drive is an easy solution.
For example, I have four external drives routinely hooked up to my laptop, each of which gives me extra space, and each of which serves a particular function. While this approach can work to a certain extent, it has limitations. A) They're only attached to your computer. B) They're all in the same physical space, so if your house burns down you lose everything at once. C) There's usually only one copy, that is, the information is not necessarily redundant. And D) it doesn't do anything about sharing the documents with other people. Also, if a drive goes bad, like my main backup drive did last month, then it can stop everything for a while while you sort it out.
As a result, simply adding external drives to your own computer is not the ideal solution. For many years, the best solution to the one computer problem, was to create a storage area network, or S-A-N, or SAN. SANs are large and expensive collections of disk drives on racks. They have a lot of nice features, though. They can be spread out across a large geographical area. They include redundancy, so information is always written onto more than one drive, and they're relatively easy to service, by allowing you to replace a faulty drive without shutting the whole thing off.
On the other hand, SANs can be very expensive to set up, and can take a lot of skilled labor to maintain. As a result, many companies have looked for workable alternatives. Over the last few years, cloud storage, or storage over the internet, has become not just a viable option, but the preferred method. Cloud storage is attractive, especially for big data, because the storage space is scalable. That is, it can be rented on an as-needed basis. There's no real up-front cost for installation or maintenance, and the level of redundancy that cloud providers can offer makes data nearly impossible to lose or damage.
You can think of it as an industrial-strength version of services such as Box or Dropbox or Google Drive. Now, perhaps the most popular cloud storage vendor, but far from the only one, is Amazon, and they have the Simple Storage Service, better known as Amazon S3. With services like Amazon's, it's possible to store essentially an unlimited amount of data. For instance, that's where Netflix stores all of their movies. It's also possible to tailor cloud storage for things like scalability, how much storage do you need; redundancy, how many backup copies do you want to have in case something goes wrong; or speed, how fast you want to be able to pull the information out.
That can include, for instance, whether it's stored on hard disk drives, or whether it's stored on solid state drives. Obviously, faster, more secure, means more money. On the other hand, research has also shown that most data, once it's written, is never accessed again, and this makes some of the more economical strategies attractive. For instance, there are deep freeze solutions. Amazon has something called Glacier. Other people have similar programs that are much cheaper, but the data needs a few hours notice before it can be pulled out. Now, the point of this is that as data grows from conventional formats to become big data, the storage problems multiply.
It becomes necessary to store the data across multiple drives, multiple computers, as well as to work to make the data secure. The costs and physical trouble can become overwhelming. So big data companies have turned to cloud storage providers to solve these problems. What's interesting is that cloud providers are able to do more than just provide storage, and that's what we'll look at in the next movie.
Cloud Computing: IaaS, PaaS, SaaS, and DaaS
Cloud computing means computer services that are delivered over the internet instead of running on your own computer. The most common services in cloud computing are infrastructure as a service, or IaaS; platform as a service, or PaaS; and software as a service, or SaaS. And because this course is about big data, I will also mention data as a service, or DaaS.
Infrastructure as a service, or IaaS, is an online version of the physical hardware of a computer, which is why it's also sometimes called hardware as a service. It can be thought of as the hosting layer.
It includes the disk drives, servers, memory, and network connections. For example, in the last movie we talked about online storage, and that would be one element of IaaS. On the other hand, that discussion of storage didn't include the availability of computing chips or RAM. If you need access to hundreds of terabytes of storage space, hundreds of gigabytes of RAM, and super-high-speed network connections, then an IaaS service would be a great way to go. This can save an organization an immense amount of money and time on purchasing and maintaining its own machines.
What makes IaaS possible is virtualization, that is software that allows one computer to run multiple operating systems at the same time, similar to how software like Parallels Desktop makes it possible for me to run Windows and Linux on my Mac. These are called virtual machines. Amazon, Microsoft, VMWare, Rackspace, Red Hat, those are the biggest players in infrastructure as a service.
The next step up from this is PaaS, or platform as a service. This can be thought of as the building layer, because this is something that developers use to build applications that run on the web. PaaS includes the middle-layer software, like the operating system and other components such as the Java runtime and the middleware, that allow higher-level software to run, like web-based applications built on .NET, that's the Microsoft framework, or on Java. PaaS also gives access to databases like Oracle and Hadoop, and application servers like WebLogic, Microsoft IIS, and Tomcat. Force.com, which supports Salesforce, a very common sales application, and Google's App Engine are well-known examples of PaaS. PaaS is a low-cost way to get started, plus it provides for nearly limitless growth for a new company, or an existing company, as it ramps its applications up without having to buy its own hardware or pay software vendor licenses. It is, however, a level of computing that most users will never touch.
The next step up from that is SaaS, or software as a service. This is the top layer for most people. It can be thought of as the consumer layer of cloud computing. It includes web applications that run entirely through the browser, like Gmail, Google Drive and Docs, Office 365, Salesforce, Quicken.com, Mint.com, and so on. The point with this market is that it's easier to use SaaS, which is ready to go, than to install your own software, which could take hours, days, or even longer. It also makes it possible to use cheap netbooks like Chromebooks, where everything's done over the web. I have a couple of Chromebooks. I love them.
The next thing I want to talk about is DaaS, or data as a service. This is the final layer of cloud computing that we'll discuss in this movie. Please note, it's not to be confused with desktop as a service, which is another common use of the DaaS acronym. Data as a service is an online service like the others, except instead of providing online access to computing hardware, operating systems, or applications, it provides access to data.
For this reason, DaaS providers are similar to what are called data market places or data marts. And the idea is that DaaS providers can provide important services by allowing customers to get access to data quickly and affordably, while assuring data quality. For instance, in the previous movie I talked about how market researchers can buy demographic information about their clientele and match it up with what they have in their own records. That's an example of DaaS. Or for instance, there are companies that have a record of every tweet ever sent and you can buy access to that.
That's another example of DaaS. Other companies, for instance Factual.com and Infochimps, are in the same market, although it still looks like the market is fairly wide open in terms of what is available and what can be provided. The collection of cloud computing services that we discussed in this movie, infrastructure as a service, platform as a service, software as a service, and data as a service, can all play important roles in big data projects, either by providing the physical resources necessary to store and process the data, the software to interact with the data, or even the data itself.
What they all have in common is the ability to shift the load off the consumer business and allow them instead to dedicate their own resources, their money, space, time and energy, to working with the data and getting the insight they need for their own projects and progress.
A Brief Introduction to Hadoop
Any discussion of big data will invariably lead to a mention of Hadoop. Hadoop's a very common, very powerful platform for working with data, but it can be a little hard to get a grip on exactly what it is and what it does. This movie is designed to give the briefest possible introduction to Hadoop, which could benefit from several courses all on its own, and with that, here is the bare minimum on Hadoop. The very first question is, what is Hadoop? It sounds like it should be an untranslatable word for big data or for transformative business practice.
Instead, Hadoop was the name of a stuffed animal that belonged to the son of one of the developers. It was a stuffed elephant, which explains the logo as well. But what is Hadoop, and what does it do? Most significantly, Hadoop is not a single thing. It's a collection of software applications that are used to work with big data. It's a framework or platform that consists of several different modules. Perhaps the most important part of Hadoop is the Hadoop Distributed File System, or HDFS, and what this does is take a collection of information and spread it across a bunch of computers.
It can be dozens or hundreds or, in certain cases, tens of thousands of computers. So it's not a database, because a database usually implies a single file, especially if you're talking about a relational database: a single file with rows and columns. Hadoop can have hundreds or millions of separate files that are spread across these computers and all connected to each other through the software. MapReduce is another critical part of Hadoop. It's a process consisting of mapping and reducing, and it's a little counterintuitive, but here's how it works.
Map means to take a task and its data and split it into many pieces. You do that because you want to send it out to various computers, and each one can only handle so much information. So let's say you have 100 gigabytes of information and each of your computers has 16 gigabytes of RAM; you're going to need to split it up into at least six or seven pieces and send those out to the different computers that you're renting from Amazon Web Services or wherever. Map splits the work up and sends it out to run in parallel on these different computers.
The reduce process takes the results of the analyses that you've run on each of these dozens of different computers and combines the output to give a single result. Now, the original MapReduce engine has been replaced by Apache Hadoop YARN, which stands for Yet Another Resource Negotiator. Sometimes people still just call it MapReduce, and YARN allows a lot of things that the original MapReduce couldn't do. The original MapReduce did batch processing, which meant you had to get everything together at once, you split it out at once, you waited until it was done, and then you got your result.
YARN can do batch processing, but it also can do stream processing, which means things are coming in as fast as possible and going out simultaneously, and it can also do graph processing, which is social network connections. That's a special kind of data.
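To make the map and reduce steps concrete, here's a minimal sketch of the idea in plain Python, counting words across several chunks of text. It only illustrates the concept; in a real Hadoop cluster the mapped chunks would be processed in parallel on different machines rather than in a simple loop.

```python
from collections import Counter
from functools import reduce

# Pretend these chunks live on different machines in the cluster.
chunks = [
    "big data is data that does not fit",
    "data that is big needs distributed storage",
    "map splits the work and reduce combines the results",
]

def map_chunk(chunk):
    # Map step: each machine counts the words in its own chunk.
    return Counter(chunk.split())

def combine(total, partial):
    # Reduce step: merge the partial counts into a single result.
    total.update(partial)
    return total

partial_counts = [map_chunk(c) for c in chunks]        # runs in parallel in real Hadoop
word_counts = reduce(combine, partial_counts, Counter())
print(word_counts.most_common(5))
```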
Next is Pig. Pig is a platform in Hadoop that's used to write MapReduce programs, the process by which you split things up and then gather back the results and combine them. It uses its own language, called Pig Latin.
Probably the fourth major component of Hadoop that is most frequently used is called Hive. Hive summarizes queries and analyzes the data that's in Hadoop. It uses a SQL-like language called HiveQL, for Hive Query Language, and this is the one that most people are going to use to actually work with the data. So between the Hadoop Distributed File System, MapReduce or YARN, Pig, and Hive, you've covered most of what people use when they're using Hadoop. On the other hand, there are other components available. For instance, HBase is a NoSQL database, a nonrelational, "not only SQL" database for Hadoop.
Storm allows the processing of streaming data in Hadoop. Spark allows in-memory processing. This is actually a big deal, because it means you're taking things off of the hard drive and putting them into the RAM of your computer, which is much, much faster. In fact, in-memory processing can be a hundred times faster than on-disk processing, although you do have to get through the process of putting the information into the RAM, which usually isn't counted when people quote those statistics. Spark is often used with Shark, a SQL query engine that runs on top of Spark's in-memory processing.
And then there's Giraph, spelled like graph with an i, which is used for analyzing the graph for the social network data. Now, there are maybe 150 different projects that can all relate to Hadoop. These are just some of the major players. So the question also is, where does Hadoop go? It can be installed in any computer. You can put it on your laptop if you want. In fact, a lot of people do so they can sort of practice with it and get things set up, and then send it out to the cloud computing platform, and in fact, that's where it usually is.
Cloud computing providers, Amazon Web Services is the most common, but Microsoft Azure has a form of Hadoop that they use, and there are a lot of other providers that allow you to install Hadoop and run it on their computer systems. Who uses Hadoop? Basically anybody with big data. Yahoo!, not surprisingly, because they developed it, is the single biggest user of Hadoop. They have over 42,000 nodes running Hadoop, which is sort of mind-bogglingly huge.
LinkedIn uses a huge amount. Facebook uses a bunch, and Quantcast, which is an online marketing analysis company, has a huge installation as well, and there's a lot of others. Finally, it's worth pointing out that Hadoop is open source. While it was developed by engineers at Yahoo!, it's now an open source project from Apache. So you'll often hear it called Apache Hadoop, or Apache Hive, or Apache Pig. One of the things about open source projects like this is it's free, which explains in part its popularity.
Also, anyone can download the source code and modify it, which explains the many modifications, extensions, and programs that work along with Hadoop to make the most of its capabilities. The takeaway message of this presentation is that Hadoop is not just one thing, but a collection of things that collectively make it much easier to work with big data, especially when it's used on a cloud computing setup. Hadoop is extremely popular in the big data world, and there's very, very active development for Hadoop, but there's also very stiff competition for the market.
Not everybody is just willing to stand by and let Hadoop have everything. This should make it a very exciting situation for companies and consumers who want the best tools for working with their big data projects.
Preparing Data for Analysis
Challenges with Data Quality
In any data project, the quality of the data plays a critical role. In big data, however, quality is even harder to deal with, because the data often originated outside of the immediate project and will have a life of its own beyond it. The researcher doesn't have complete control over the data in the same way that they would with a standard small data project. Consider for a moment that much of the data that goes into a big data project may have once come from spreadsheets. Some researchers have found that nearly 95% of the spreadsheets they examined had errors in them.
And this falls back on the familiar expression from computer science, GIGO: garbage in, garbage out. Data can have these problems, and even if you have Hadoop, even if you've got a tremendously sophisticated analysis, if you're starting with bad data, you're going to end up with bad results. Now, this gets back to our central concern: we want to make sure we have good data to start with so we can have a defensible and informative result. Let me go over very quickly some of the things that can be challenges to the quality of the data.
The first is incomplete or corrupted data records. This can lead to what's called NULL pointers where the computer is pointing or looking in a space where there's nothing there. This can lead to attempts to divide by zero, which can actually cause the computer to crash. You can have duplicate records, which means the same person, for example, is appearing in more than one place and that explains why I get three postcards from the same organization each time they do a mailing.
You can have typographical errors in text and numbers. So for instance, you may accidentally enter a number with the wrong number of zeros, too many or too few, and it can throw things off dramatically. You can have data that's missing context or missing measurement information, such as the units of measurement. You may recall the Mars Climate Orbiter that NASA launched back in 1998 at a cost of nearly $300 million; it crashed when it approached Mars a year later because some of the flight programming data had not been converted to metric units.
It's a very embarrassing mistake, and accurately specifying the units you're dealing with can avoid very costly and time-consuming errors. And there's also the issue of incomplete transformations of data. For instance, I know that in psychology we sometimes reverse-code items when people answer them manually; you flip the values around, but it's not always indicated whether they got flipped, and that's a horrible mess, because then you usually have to throw out the data, which is something you usually cannot afford to do in a big data project. Now, all of these issues are problems in a small data project, but they become even bigger in a big data project.
They become qualitatively different animals. Part of the problem with big data is that the normal methods for checking the accuracy of the data may not be present, because, for instance, the person who gathered the data isn't the same person who's analyzing or presenting it. Also, if the data stays in Hadoop the whole time, it may never need to be pulled out and converted; we're going to talk about that in the next movie. And if it's not converted, it may not be examined, and so there's the potential for problems to be missed.
And what all of this leads us to is the importance of carefully examining the data at each step, especially if you're bringing any of it out and transforming it, which leads us to the extract, transform, and load step that we'll talk about in the next movie.
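Before moving on, here's a minimal sketch of the kind of quality checks just described, run against a few made-up records in plain Python: missing values, duplicates, and inconsistent units are all flagged before any analysis happens.

```python
# Hypothetical incoming records: some have problems of the kinds described above.
records = [
    {"id": 1, "name": "Acme Corp", "revenue": 1200000, "unit": "USD"},
    {"id": 2, "name": "Acme Corp", "revenue": 1200000, "unit": "USD"},   # duplicate
    {"id": 3, "name": "Globex",    "revenue": None,    "unit": "USD"},   # missing value
    {"id": 4, "name": "Initech",   "revenue": 950,     "unit": "kUSD"},  # different unit
]

seen, issues = set(), []
for rec in records:
    key = (rec["name"], rec["revenue"])
    if key in seen:
        issues.append(f"duplicate record: {rec}")
    seen.add(key)
    if rec["revenue"] is None:
        issues.append(f"missing revenue: {rec}")
    if rec["unit"] != "USD":
        issues.append(f"unexpected unit {rec['unit']!r}: {rec}")

print(f"{len(issues)} issue(s) found")
for issue in issues:
    print(" -", issue)
```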
ETL: Extract, Transform, and Load
ETL stands for extract, transform, and load. This is a term that developed from data warehousing, where data typically resided in one or more large storage systems, or data warehouses, but wasn't analyzed there. Instead, the data had to be pulled out of storage, that's the extract stage; then it had to be converted to the appropriate format for analysis, especially if you're pulling data from several different sources that may be in different formats, so you transform it; and once it's ready, you have to load it, as a separate step, into the analysis software. Each of these steps involves significant time. In fact, there are several software vendors who made most of their money from selling software that facilitated the ETL process for data warehouses. And just think for a second of documents that may be stored in several different formats. For instance, if you're looking at text documents, they can be Microsoft Word files, HTML webpages, emails, PDFs, or any number of other formats. What you might have to do, if you want to deal with the text, is transform all of them into a single common format so you can do all your work on them simultaneously. One common choice with text is to convert everything to plain text, a .txt file. That's wonderful because basically anything in the world can read a text file. But the problem is that you lose all the formatting, and formatting sometimes carries important information, let alone the fact that you lose pictures and things like that. So you may want to use a different format. Markdown, which is based on plain text but has codes or indicators for formatting, may be an acceptable alternative. But it is something you have to consider: how are you going to combine things, what information is vital to keep, and what information can you afford to lose in the process?
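As a small illustration of the three steps in a traditional setting, here's a sketch of an ETL pass in Python: extract rows from a CSV file, transform them into a common format, and load them into a database. The file name, column names, and currency conversion are all made up for the example.

```python
import csv
import sqlite3

# Extract: read raw rows from a (hypothetical) source file.
with open("sales_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: normalize the fields into one common format.
clean_rows = []
for row in raw_rows:
    clean_rows.append((
        row["customer"].strip().title(),      # tidy up the text
        float(row["amount_eur"]) * 1.13,      # convert to a single currency (assumed rate)
        row["date"][:10],                     # keep just YYYY-MM-DD
    ))

# Load: write the transformed rows into the analysis database.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount_usd REAL, sale_date TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
conn.commit()
```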
Now, the funny thing about big data is that the extract, transform, and load process just works differently. And this has to do with the effect of Hadoop, because the data, for instance, starts and ends in Hadoop. It is not taken from one system to another; it's always in the same system. Now, it's true that you're moving it from one part of the system to another, but the fact that it is staying in the same general system really changes the extracting and loading process. Moreover, Hadoop can handle different data formats. It can handle unstructured data from a lot of different sources, and so the transformation process is also very different when dealing with Hadoop. Now, a funny thing about this, aside from the fact that it makes life a lot easier, is that when you're dealing with Hadoop, you don't have to be so aware of the extract, transform, load process, because it doesn't really happen in quite the same way. So there's not so much inspection of the data; you don't have to think about it so deliberately, and it doesn't force you to solve these problems. On the other hand, what's funny about that is that you really miss some opportunities to better understand your data and to check for errors along the way. And so Hadoop creates the ironic situation of making your life much easier, when sometimes you need to make your life a little harder and deliberately choose to inspect the data for quality and make sure you understand what's going into it. Hopefully, big data users will take that opportunity and audit their data for quality.
Additional Vs of Big Data
At the beginning of this course, I mentioned that big data is typically defined by three Vs. And those are volume, velocity, and variety. So volume meaning a very large amount of data, velocity meaning it's often coming in quickly, might be streaming data, and variety means a lot of different formats and especially not in the regularly structured rows and columns of a spreadsheet or a relational database. On the other hand, there have been a series of other Vs proposed.
Now, the thing is, we kind of got the Vs rolling and people are throwing everything they can at the list, like volcanic and verisimilitude and who knows what else. But the point here is that the Vs I'm going to list all have a legitimate reason for being discussed, even if the V word itself is a bit of a stretch; these are all factors to consider in big data research. Number four is veracity. Veracity means, will it give you insights into the truth about your research questions? It really has to do with whether the data you have contains enough information, at a sufficiently micro level, for you to make accurate conclusions about larger groups of people. Number five is validity: is the data clean, is it well managed, does it meet the requirements, is it up to the standards of the discipline? Number six is value: does the data have value? In a business setting, that's going to translate specifically to ROI, or return on investment.
Is it worth your time to engage in a big data project? Because you know what? It's not always going to be. Big data is still an expensive, time-consuming, major undertaking in most situations. It's getting better, but for right now you need to think about whether that particular analysis really is going to further your organizational goals. Number seven is variability. The idea here is that the data can change over time, and you can usually analyze that, but it can also change from place to place, and there are a lot of uncontrolled factors that may introduce noise into your data unless you specifically measure and account for them.
A few more Vs: number eight is venue. This means, where is the data located, and how does that affect access to the data and its formatting? Number nine is vocabulary. This refers to the metadata that's used to describe the data, especially when you're combining data from very different sources. When two sources are talking about the same variable, the same kind of information, they may be using very different terms to describe it. It may not be clear that what's going on is the same thing, and so that becomes one of the challenges in combining data sources in a big data project.
And number ten, the last one we're going to talk about, is kind of a funny one: vagueness. That really is, do you know what in fact you're talking about? What do you mean by big data? What are your goals? What are you trying to accomplish? Because you have to have clarity of purpose, or you have the potential of wasting an enormous amount of time and energy chasing the wrong thing. Big data is there to serve a purpose. It's there to give you insight that you cannot get otherwise, insight that gives you the ability to function much more efficiently and intelligently, but you need to be clear about what you're doing or your time might be wasted.
So having big data does not automatically solve organizational questions or overcome research challenges. You can have Hadoop, but that doesn't mean you're going to understand what's going into it. You may have the most sophisticated predictive analytics equation, but if it's based on something that's irrelevant, you've wasted your time. If anything, big data introduces more things to be aware of, because there are more places for things to go wrong or to be confused. And responsibilities tend to be spread out in a big data project.
That is, you usually have different groups of people who prepare the data, who analyze it, who visualize it, who apply it, who form the new sets of questions, and who find other information to bring in, because you can have hundreds or even thousands of people involved in a single big data project. No one person is overseeing everything. And so the responsibility to answer these questions about things like value and vagueness becomes incumbent on everybody understanding what's going on. There's a greater need to think about quality, because there are so many more opportunities for things to slip through the cracks.
And there's a greater need to think about the meaning of the project as you go through it. If you do that, then you're going to be in a much better situation. And if we assume, also, that the quality of the data in the project has been verified, then our next step is the actual analysis of big data.
Big Data Analysis
Monitoring and Anomaly Detection
Big data can be helpful for letting people know when unusual things happen or, possibly, when they're about to happen. These kinds of notifications fall into two general categories, although there are other systems for describing them: monitoring and anomaly detection. At the risk of making the difference between these two procedures sound bigger than it is, here's how I describe each one. Monitoring can be very helpful when you know what you're looking for and you need a notification when that thing occurs.
It detects when a specific event occurs, so you need to be able to specify the criterion in advance. For example, a manufacturer needs to know when one of their machines needs maintenance, so they may look at temperatures, vibration levels, or a number of other factors that let them know a breakdown is imminent: take care of it now. A doctor or nurse needs to know when one of their patients is sick; they may be monitoring, for instance, temperature and pulse, and, if possible, white-cell counts to indicate infection. And a credit-card company needs to know when a charge is potentially fraudulent.
In these cases, it may be possible for a user to specify the particular criteria they need in order to trigger the notification, and what's interesting is that with monitoring, because you can be very specific, it may even be possible in certain situations to set up an automatic response that says, "If X occurs, then Y results," and the system takes care of it automatically. And so monitoring is a specific thing: you know what you're looking for, you're waiting for it to happen, and you possibly even have an automatic response ready.
Anomaly detection, on the other hand, describes a situation in which the user wants to know when something unusual happens. They're looking for a notification of unusual activity without necessarily knowing in advance what that something might be. As a result, it needs to be based on flexible criteria; it says, "Let me know when something out of the ordinary happens, maybe not just on one factor, but on a combination of several different factors." And the flexible criteria usually exist to draw a person's attention to something.
So, for instance, with security cameras, the system might be able to say, "Look, we don't know what's going on here, but we do know that it's out of the ordinary." Or in a stock-trading situation, it might say, "We don't know what's going on here, but it needs to be examined." And so it doesn't usually trigger an automatic response; instead, it invites inspection. But anomaly detection can notice patterns that, for instance, may be too spread out in big data or too fine-grained for humans to notice on their own.
Now, both of these approaches, monitoring and anomaly detection, are common practices that predate not just big data but computers as well. What big data adds to them is the possibility of watching for extremely rare events or combinations of factors. So, for instance, if you have an event that's one in a million, it only occurs one time out of a million observations, and that can be really hard to spot if you're doing it by hand or even, say, a hundred cases at a time. But if you have ten billion cases that you're sorting through, this one-in-a-million event is going to occur about 10,000 times.
And suddenly, that's not so small a number. It's not so rare. In fact, 10,000 is a pretty large number, and it allows you to do statistical modeling, to break things down by subcategories, and to figure out exactly what's associated with the event and what's causing it. As for anomaly detection, the big data advantage is similar, especially when you look for rare combinations of events. That is, it may be possible to measure a thousand different things at once instead of just 10 or 12. This allows the machine learning algorithm that identifies anomalous cases to become much more specific and have a much better chance at avoiding both false positives and false negatives.
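Here's a minimal sketch of the two ideas side by side in plain Python: a monitor with a fixed, pre-specified threshold, and a very simple anomaly detector that flags readings far from the rest. The sensor values and thresholds are invented for the example; real systems would use far richer models.

```python
import statistics

readings = [71.2, 70.8, 71.5, 70.9, 71.1, 88.4, 71.0, 71.3]   # e.g., machine temperatures

# Monitoring: the criterion is known in advance, so we can even automate the response.
TEMP_LIMIT = 85.0
for i, value in enumerate(readings):
    if value > TEMP_LIMIT:
        print(f"monitor: reading {i} = {value} exceeds {TEMP_LIMIT}, schedule maintenance")

# Anomaly detection: flag anything unusually far from the rest, without a fixed rule.
mean = statistics.mean(readings)
stdev = statistics.stdev(readings)
for i, value in enumerate(readings):
    z = (value - mean) / stdev
    if abs(z) > 2:
        print(f"anomaly: reading {i} = {value} is {z:.1f} standard deviations from the mean")
```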
To take a relatively trivial example of this, let's look at email spam filters. Now spam is a tricky situation because spam is a very fast-evolving sort of virus-like mechanism. It's never the same. It has to change all the time because there's this little arms race between spam and spam filter. So you can't just write a single rule that says, "This is spam," because the spam will adapt to circumvent that rule, and so we have this very quick evolution, and so you can't give clear rules for spam, or it's very hard to do.
What you find is that a spam filter that looks only at your own email, and only at what you mark as spam or not spam, produces a lot of false results, both false positives and false negatives. I remember when I first started with email, that was the case: I used an email client where I had to flag what was spam myself, and honestly it was close to useless. On the other hand, when you hook into a big data collection, so you're not categorizing spam on your own but combining the data from millions or hundreds of millions of users, as Gmail, Hotmail, or Yahoo do, then the collective wisdom of the crowd determines what is spam and what isn't.
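Here's a minimal sketch of that learned, data-driven approach: a Naive Bayes classifier trained on labeled messages with scikit-learn. The tiny hand-made corpus stands in for the millions of user-labeled emails a large provider can pool together.

```python
# A minimal sketch of a learned spam filter: a Naive Bayes text classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now", "cheap meds limited offer",
    "meeting moved to 3pm", "lunch tomorrow?",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free prize offer", "agenda for the 3pm meeting"]))
# With far more (and more diverse) labeled mail, the same pipeline becomes far more accurate.
```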
I have to say, my Gmail filter is excellent at identifying spam, and that's because it draws on big data: the millions, or really billions, of emails sent every day. So big data makes it possible to perform these two kinds of watching, the more specific monitoring and the more flexible anomaly detection, with much greater power, by sorting through much larger data sets and looking for more diagnostic signs at each point.
Data Mining and Text Analytics
One of the most powerful and common applications of big data is in Data Mining and its close cousin, Text Analytics. Data Mining covers a large and diverse field of activities, but the most basic idea is this: use statistical procedures to find unexpected patterns in data. Those patterns might include unexpected associations between variables, or people who cluster together in unanticipated ways. For example, a supermarket's management team might find that people who visit their stores in a particular region on a particular night of the week are generally different from people who come at other times and places.
The store can then change where coupons are displayed, or whether they're displayed at all, and can change where certain items are placed from day to day to build on those differences. Or an investment company may find that when certain stocks move up together while certain others go down, a particular stock will generally follow, which lets them invest in that one and, hopefully, make a profit. Or a medical researcher may find that patients who exhibit a very particular pattern of symptoms at one time, even if they don't meet the criteria for a diagnosed illness, are more likely to check into the hospital in the next six weeks.
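To make the supermarket example a little more concrete, here's a minimal sketch of one common data-mining step, clustering shoppers by when they visit and how much they buy. The features and numbers are invented purely for illustration.

```python
# A minimal sketch of one data-mining task: clustering shoppers into groups
# that behave differently. Synthetic data, invented features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# columns: day of week (0-6), basket size in dollars
weekday_small = np.column_stack([rng.integers(0, 5, 200), rng.normal(25, 5, 200)])
weekend_large = np.column_stack([rng.integers(5, 7, 200), rng.normal(90, 15, 200)])
shoppers = np.vstack([weekday_small, weekend_large])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(shoppers)
print(kmeans.cluster_centers_)   # two distinct shopper profiles emerge
```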
Perhaps the most common application of this kind of Data Mining is online advertising, because the database is so large and because it's so easy to adapt the results for each specific viewer. In fact, that's one of the biggest promises of Data Mining: the ability to tailor services to the preferences and behaviors of each individual person once enough data has been gathered. Text Analytics is closely related to the standard kind of Data Mining that deals exclusively with numbers; however, it is sufficiently distinct to be its own field. The goal here is to take the actual content of text data, such as tweets or customer reviews, and find meaning and patterns in the words.
This is different from the metadata research we discussed earlier, because that research, which can be shockingly informative, dealt only with the numerical information the computers created on their own and didn't need to touch the content of the messages at all. When researchers look at the text itself, the interpretive and computational problems become enormous, because human language is so flexible and subtle. Phrases that sound very similar can have very different meanings, as in the familiar joke sometimes attributed to Groucho Marx: "Time flies like an arrow; fruit flies like a banana." That's a phrase humans have to pause over, and it's nearly impossible for computers to parse, which explains why the field called Natural Language Processing has had so many challenges to overcome and why it's such an active area of research. Perhaps the most common task in Text Analytics, though, is what's called sentiment analysis: determining how people feel about something. That makes sense from an advertising or marketing point of view; you definitely want to know whether people feel good or bad about your particular product.
The most basic task in sentiment analysis is determining whether a person's feelings are positive or negative. This is referred to as polarity in the Text Analytics world; in my own field, social psychology, we call it valence. Fortunately, because this is such a common task, many programs and packages have been developed to help with it in familiar languages like Python and R. Of course, sentiment analysis and Text Analytics in general are much more sophisticated than a simple good/bad judgment, but this gives you the basic idea.
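As a small illustration of polarity scoring, here's a minimal sketch using NLTK's VADER analyzer in Python. It's just one of the many packages mentioned above, and the example reviews are made up.

```python
# A minimal sketch of sentiment polarity scoring with NLTK's VADER analyzer.
# Requires the vader_lexicon resource, downloaded on first run.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

for review in ["I love this product!", "Worst purchase I have ever made."]:
    scores = analyzer.polarity_scores(review)
    print(review, "->", scores["compound"])  # compound > 0 is positive, < 0 negative
```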
There's much more that could be said about these topics, but I hope it's clear that Data Mining and Text Analytics, the abilities to find patterns in numerical data and to derive meaning from textual data, work best when they have very large and diverse data sets to work with, and that's exactly what big data provides. As researchers continue to develop and refine these methods, they'll become faster, simpler, and more nuanced.
Predictive Analytics
Predictive analytics is the crystal ball of big data. That is, it represents a range of techniques that are adapted to work with big data to try to predict future events based on past observations. And while people have been trying to predict the future ever since there have been people, the raw resources of big data and the sophistication of modern predictive modeling have fundamentally changed the way that we look into the future. In the popular world, there are a few well-known examples of predictive analytics.
The first is in baseball, as shown in the book and movie Moneyball, where statistical analysis is used to help identify an offensive player's scoring ability. The standard criteria that people had used for a hundred years in baseball are things like batting average, RBIs (runs batted in), and stolen bases. But baseball has an enormous data set, because it's very easy to count the discrete events that occur, so you can go back and work with an extraordinarily large data set for a sport.
And researchers found that no, batting average and RBIs are not the best predictors. On-base percentage, because you can get on base by getting a hit, drawing a walk, being hit by a pitch, or any number of other ways, and slugging percentage, which has to do with how many bases you gain per at-bat, turn out to be better predictors. The second example is Nate Silver's remarkable accuracy in predicting the results for every single state in the 2012 U.S. Presidential election. What Nate did here: he has a blog called FiveThirtyEight, named for the number of electors in the Electoral College.
He took data from a wide range of polls, combined them, weighted them by their reliability, and was able to come up with an accurate prediction for every single state in the election. It was remarkable. I just want to show you his website, FiveThirtyEight. The other thing Nate Silver does is sports statistics; in fact, to many people he's better known for his baseball statistics, in the spirit of Moneyball. His website, which used to be primarily political, has been purchased by ESPN, so here's the current site, with Nate down in the bottom right corner doing all sorts of predictions. The site is also very well known for its college basketball brackets when tournament time comes around.
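To make the poll-combining idea concrete, here's a minimal sketch of a reliability-weighted average. This is only the general principle, not Nate Silver's actual model, and the shares and weights are invented.

```python
# A minimal sketch of combining polls weighted by reliability.
polls = [
    # (candidate's share in the poll, weight reflecting the poll's reliability)
    (0.52, 0.9),
    (0.48, 0.4),
    (0.51, 0.7),
]

weighted_share = sum(share * w for share, w in polls) / sum(w for _, w in polls)
print(round(weighted_share, 3))  # a single combined estimate for the state
```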
Netflix Prize
The next example is one I've mentioned before: the Netflix Prize. A few years ago, Netflix offered a million-dollar prize to anybody who could improve the quality of its recommendations by 10% using an anonymized data set the company had provided. Some really remarkable statistical analyses came out of that competition.
Perhaps the biggest thing that came out of the Netflix Prize was a demonstration of the efficacy of what are called ensemble models. The idea is that you don't build a single predictive model; you don't settle on one regression equation or one random forest. Instead, you build as many different predictive models as you reasonably can and then average their results, because it turns out that when it comes to predictions, the average prediction is usually more accurate than any one individual prediction.
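Here's a minimal sketch of that ensemble idea using scikit-learn: fit a few different models on the same data and average their predictions. The data is synthetic, and the particular models are arbitrary choices, not the ones used by the Netflix Prize teams.

```python
# A minimal sketch of an ensemble: fit several different models and average
# their predictions, then compare out-of-sample errors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [LinearRegression(), DecisionTreeRegressor(max_depth=4), KNeighborsRegressor()]
test_preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])
ensemble = test_preds.mean(axis=1)          # average the models' predictions

for m, p in zip(models, test_preds.T):
    print(type(m).__name__, round(np.mean((p - y_test) ** 2), 3))
print("Averaged ensemble", round(np.mean((ensemble - y_test) ** 2), 3))
```

The printout compares each model's squared error with the error of the averaged prediction; in practice the average is usually competitive with, and often better than, any single model.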
The idea echoes the classic demonstration of guessing the number of jelly beans in a large jar: if you take everybody's guesses and average them, the result is usually closer to the real number than any one individual's guess. Now, the Netflix Prize was a few years ago, but a website called kaggle.com now hosts similar kinds of predictive analytics competitions. Let me show you their site for a moment. On Kaggle.com, if we scroll down, you can see that right now they're running a machine learning challenge on identifying the Higgs boson.
It's an amazing thing. Going back up to the competitions, you can see they're hosting a number of contests from companies that have data and are looking for good predictive models, with prizes of up to $25,000. In the past they've had prizes of half a million dollars, and they also offer a number of free competitions designed to teach you how to do predictive analytics. For instance, the Titanic competition down here at the bottom is an educational one about doing machine learning in Python, R, and other tools.
So Kaggle is a fabulous showcase for what's possible. Predictive analytics is an enormous area of interest because, especially in the business world, being able to predict what's going to happen, having even a little bit of foreknowledge, can give you a huge competitive advantage. It's an area of incredible growth, and it's one of the most fascinating parts of statistics because there's always a very clear criterion of success, something that's often lacking elsewhere: if you wait just a little while, you can tell whether your model was good or not. Progress in the field makes it possible to learn more and more, and with the raw material from big data there's far more to work with, so you can build newer, more refined models and get better predictive ability, and more competitive advantage, out of them.
Big Data Visualization
Up to this point we've been talking about big data and the things that computers are able to do for humans. It turns out, though, that there are certain things humans still do better than computers, and visualization is one of them. Humans are visual animals; we work by sight, and we take in a huge amount of information that way. Computers are very good at spotting certain kinds of patterns, and they're very good at calculating predictive models and doing data mining in ways that humans couldn't manage in a thousand lifetimes.
But humans perceive and interpret patterns much better than computers do, so human vision still plays an important role in big data. Humans can see patterns, and they can spot the exceptions to those patterns, the anomalies, very quickly. They can also see patterns across multiple variables and groups, and they're much better at interpreting the content of images than computers are. So, for instance, here are some familiar examples of what are called Gestalt patterns.
Gestalt is a German word meaning a pattern or a whole. In the top left, for instance, three circles with arcs cut out together imply a triangle in the middle. The triangle isn't actually there; it's created by the absence, suggested through negative space. It's very easy for humans to see and much harder for a computer. Similarly, the arrangement of circles and squares in the top right is easy for people to follow. On the bottom left, we first see four squares separately.
Then we see the squares arranged in pairs, and then all arranged in a single line: easy for humans to perceive and interpret, hard for computers to make sense of. In the bottom right, in panel D, it's easy to see rows of dots and then columns of dots, because humans are built for this kind of visual processing, and it's very hard to describe to a computer how to do it. Now I want to show you an interesting example from the National Science Foundation, which sponsors the Vizzies awards. Even though they're billed as the most beautiful visualizations, these are also very informative ones.
I'm going to scroll down just a little bit here and look at the section on the right, which is about video. I'll look at the 2013 winners and click on the first one. This is a still frame from a video visualization showing satellite weather data for the earth. What you see are circulation patterns of the ocean and the wind, and it's really easy for humans to see the swirling shapes that form a continuous flow, as well as the circles.
Humans can see these patterns very easily, and the image is based on an enormous data set; it's absolutely big data, yet it's very hard for computers to see what we see. Now, I do need to bring something up. Just because visualization is important doesn't mean that anything goes; there are a lot of things that don't work well. When I look at visualizations on the web, most of the examples I find are very pretty, visually arresting pictures, but to me they're not very informative. The important point is that a prettier graph is not always a better graph.
It gets back to the basic charting rules from Excel: never use a false third dimension, don't separate elements from their axis, and make sure everything can be read clearly. Also, animated or interactive graphs can be more informative in many situations, but they can also just be distracting; a person starts playing with the controls instead of absorbing the message, so think very carefully about whether to include them. The goal of data visualization, and of any kind of graphics, is insight. You want to get to the insight as quickly and clearly as possible, and anything that distracts from that, or, heaven forbid, gives a person the wrong impression, is a mistake and should be eliminated.
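As a small illustration of that "plain and readable" principle, here's a minimal matplotlib sketch of a flat, directly labeled bar chart; the numbers are invented.

```python
# A minimal sketch of a plain 2-D bar chart: no fake depth, labeled axes,
# and values written directly on the bars.
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
units = [120, 95, 143, 88]
positions = range(len(regions))

fig, ax = plt.subplots()
ax.bar(positions, units, color="steelblue")
ax.set_xticks(list(positions))
ax.set_xticklabels(regions)
ax.set_ylabel("Units sold")
ax.set_title("Quarterly sales by region")
for i, v in enumerate(units):
    ax.text(i, v + 2, str(v), ha="center")   # label each bar so the values read directly
plt.show()
```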
So data visualization is still an area where humans make an important contribution to big data analysis, alongside all the computational models we've talked about so far. It's important to remember this human element when planning a big data project: there is still a need for human perception and interpretation to make sense of the data, in addition to what the computer can provide.
The Role of Excel in Big Data
Before we leave our discussion of big data analytics, I want to talk about the role of Excel. This is important because a lot of people think that to do big data you have to use rocket-science equipment all the way through, and that Excel, because it's installed on practically every computer in the world, isn't special enough to qualify. That's not true at all. There are a few things to say about Excel. First, as a general principle, you want to go where the people are: the analysis is there to serve a purpose, to inform other people.
Even if you have to store the data in Hadoop and use other programs to access it there, Excel is still going to be a really good way to share it, because it's what people know how to work with. It's far and away the most common data tool: there are hundreds of millions, perhaps billions, of copies of Excel in the world, and millions of people use it on a daily basis. Even professional data miners use it; in a recent survey of data mining software, Excel was the third most common application that data miners reported using in their professional projects. And big data and data science have an interesting connection with Excel.
For one thing, Excel, entirely on its own, just the application, can do real data science. The best presentation of this is the book Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman, which walks through the advanced capabilities of Excel that make it possible to explore and manipulate data in ways you probably never thought possible. More interestingly, using what are called Open Database Connectivity (ODBC) interfaces, you can hook Excel directly to Hadoop and run queries and analyses from the Excel interface.
Let me go to Excel for a moment. If you come over to the Data tab, bring up its menu, and go to From Other Sources, you'll see that the very first option is From SQL Server, a relational database where a lot of information lives. But you can also go down to the Windows Azure Marketplace, which can connect you to Hadoop, and there are the Data Connection Wizard, the query options, and the OData data feed, all of which are methods for connecting to big data. Microsoft has its own solutions, and other vendors have other ways of hooking Excel up to big data and to Hadoop, making it possible to control the analysis, or at least run the queries and sorting, right from the single most familiar interface for working with data.
Finally, I want to mention that Excel is also a great way to share the results of an analysis. You can make interactive PivotTables, which are a great way to explore the complexities of the data, and people know how to work with them. Sortable worksheets and familiar graphics and charts communicate clearly; it's a great way to go. In fact, I would say that putting the final results into Excel, which gives your viewers a degree of exploration and manipulation, is probably the most democratic way of sharing the results of a big data analysis.
And again, the point of any analysis is to provide insight that people can work on, that is actionable, that they can use to improve their own businesses and their own projects.
Next Steps
And so we've come all the way around to the final video of Techniques and Concepts of Big Data. Before we leave, I wanted to give you some ideas for further learning, based on the data science Venn diagram we talked about earlier, which covers coding, statistics, and domain knowledge. At Lynda.com you can find a really wide range of coding courses. If you're new to coding, a nice way to start might be either with Bash scripting on the command line or with Python.
Similarly for statistics, there's a range of courses available at Lynda.com, including my own courses on the statistical programming language R and on Processing for data visualization. I also have courses on SPSS, another statistical package. Although databases aren't explicitly part of the data science Venn diagram, they're absolutely central to the intersection of coding and statistics, and there's a great selection of database courses at Lynda.com, such as the foundations courses, the SQL and MySQL courses, and Up and Running with NoSQL Databases.
As far as domain knowledge goes, that of course depends on your specific interests. For that, I'll just go to the Lynda.com website. If you go to Browse the Library, you'll see there are hundreds and hundreds of choices. For instance, scrolling down a little, you can see that for business in general there are 555 courses of interest on Lynda.com. You can find very specific marketing and advertising segments, as well as app development, web design, and education.
There's a huge range of choices here that can give you the specific domain knowledge that you need to combine with your statistical knowledge and your coding skills. And then you can start implementing your own big data solutions in your field.