February 2017
Fishing for Actionable Intelligence in our Rapidly Growing Data Lakes
James McHale, Managing Director, Memoori
With the advent of the Internet of Things (IoT), the amount of data
being collected around the world is growing exponentially. We are
collecting data with a current need in mind, we are collecting it for
future purposes, and we are collecting it just to have it, in case we
find a way to use it later.
In a smart building, occupant-tracking data switches lighting and HVAC
systems on and off to reduce energy consumption. That data is then
stored to help artificial intelligence systems predict future use and
flow around the facility. We collect and store all manner of data from
every system around the building, believing that it will one day help us
answer a question we have yet to consider.
Each new data stream feeds into a data lake, and data that has already
been used remains stored, so the lake seems set to grow forever. Like a
natural lake, this digital lake needs space to exist, and each year we
hear about the latest, biggest data center in the world. The growing
problem is that it will become increasingly difficult to find, combine,
and use information in this endless sea of data we are creating.
“You need a mechanism that can be controlled, governed
as a process, to deliver exactly what you need into the data lake. And
not just dump information in there,” says Chuck Yarbrough, the senior
director of solutions marketing and management at Pentaho.
Without some smart management of the data going into the lake, you’re
going to end up with a “toxic dump,” suggests Yarbrough. However, we
cannot currently envisage every use we may have for this data, so any
smart management of incoming data will inevitably discard data that may
well be valuable one day. Until it is valuable, however, it is costly,
both in terms of physical storage and in how much it hampers the
ability to find information when it is needed.
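
As a rough illustration of the kind of governed mechanism Yarbrough
describes, the following minimal Python sketch admits a record into the
lake only if it passes a simple policy check. Everything in it is an
assumption made for illustration: the IngestionGate class, the
REQUIRED_FIELDS set, and the sample sources are hypothetical and are not
taken from Pentaho or any other product.

```python
from dataclasses import dataclass, field

# Hypothetical admission policy: these field names are illustrative only.
REQUIRED_FIELDS = {"source", "timestamp", "payload"}


@dataclass
class IngestionGate:
    """Admits records into the lake only if they pass a governance check."""
    approved_sources: set
    admitted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def submit(self, record: dict) -> bool:
        # Reject records that lack the metadata needed to find them later.
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            self.rejected.append((record, f"missing fields: {sorted(missing)}"))
            return False
        # Reject records from sources nobody has signed off on.
        if record["source"] not in self.approved_sources:
            self.rejected.append((record, "source not approved"))
            return False
        self.admitted.append(record)
        return True


gate = IngestionGate(approved_sources={"hvac", "lighting"})
gate.submit({"source": "hvac", "timestamp": "2017-02-01T08:00:00Z",
             "payload": {"temp_c": 21.5}})
gate.submit({"source": "unknown-sensor", "timestamp": "2017-02-01T08:00:05Z",
             "payload": {}})
gate.submit({"source": "lighting", "payload": {"on": True}})

print(len(gate.admitted), "admitted,", len(gate.rejected), "rejected")
```

Keeping the rejected records together with a reason, rather than
silently dropping them, is one way to soften the trade-off described
above: the decision to discard stays visible and can be revisited.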
There is another problem too: data collected from a plethora of smart
building and IoT systems arrives in different volumes, varieties,
velocities, and veracities. Data must be kept in its raw form so that
it can be used in all the as-yet-unimagined ways it may one day be
useful. This makes data storage less efficient and the categorization
of many data sets much more challenging.
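
One possible way to keep data raw while still making it findable is to
store each payload untouched and record a small catalog entry alongside
it. The Python sketch below assumes an in-memory dictionary standing in
for the lake; the describe_raw and ingest helpers, the catalog fields,
and the sample payloads are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone

lake = {}      # content hash -> raw bytes, exactly as received
catalog = []   # searchable descriptions of what the lake holds


def describe_raw(raw: bytes, source: str, content_type: str) -> dict:
    """Build a catalog entry for a raw payload without altering the payload."""
    return {
        "source": source,
        "content_type": content_type,
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }


def ingest(raw: bytes, source: str, content_type: str) -> dict:
    """Store the raw bytes under their hash and register them in the catalog."""
    entry = describe_raw(raw, source, content_type)
    lake[entry["sha256"]] = raw
    catalog.append(entry)
    return entry


# Two streams in different formats land in the same lake unchanged.
ingest(b'{"zone": "3F-east", "occupied": true}', "occupancy", "application/json")
ingest(b"2017-02-01T08:00:00Z,21.5,45\n", "hvac", "text/csv")

# The catalog, not the raw bytes, is what gets searched.
hvac_entries = [e for e in catalog if e["source"] == "hvac"]
hvac_raw = lake[hvac_entries[0]["sha256"]]
```

The raw bytes are never parsed or normalized here, so any future use
remains possible, while the catalog carries just enough description to
locate a data set again.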
A saying often attributed to Albert Einstein holds that “we can’t solve
problems by using the same kind of thinking we used when we created
them.” Having categorized, classified, and tabulated data in relational
databases for many years, we realized that we had created a problem
when faced with multifaceted IoT data. One answer was NoSQL (usually
read as “not only SQL”): non-relational databases that are searchable
and usable despite the diversity of data formats present.
While not as functional as SQL databases, NoSQL allows us to store all
the raw IoT data we collect more efficiently and in a way that we can
use. Dan Kogan, the director of product marketing at Tableau, says we
are shifting toward “faster databases like Exasol, MemSQL, and Kudu.”
He was speaking specifically about cloud platforms and innovations
like Amazon Athena, which are allowing us to turn giant S3 “data
lakes” into actionable analytics without investing in new
infrastructure or tools.
NoSQL solutions such as MongoDB and Redis grew significantly in
popularity in 2016, spelling the end for SQL, or so we thought. Lloyd
Tabb, the founder and chairman of analytics firm Looker, says that
Google BigQuery is “essentially infinitely scalable and fully ANSI
compliant,” suggesting it will reinvigorate the case for SQL.
“You can’t just plan your lake as a data repository. You also need to
plan the toolage around it,” says Philip Russom, the senior research
director of data management at TDWI.
Whichever way you look at it, SQL or NoSQL, keeping all data or only
some, we must accept that data lakes will continue to grow rapidly and
endlessly. We probably also need to accept that some data will have to
be discarded before entering the lake. We need to find a balance
between these factors, and we need to develop better ways to store and
access multifaceted IoT data.
In 2017, we can expect more data owners around the smart building and
IoT space to face the reality that the situation cannot continue along
this path. To justify the cost of keeping data, they must get rid of
some of it, find a potential use for it, or find a better way to
store it.
It’s like fishing, but the final catch you want (the actionable
intelligence) requires an unknown number of different types of fish
from different parts of a rapidly growing lake, to make a perfect meal
that you are yet to imagine. So how do you best manage your lake?