Monday, October 06, 2014

Five Trends in Big Data Analytics

Big data are usually described as large in volume, complex in structure and frequently updated. Analytics, the technology that extracts meaningful information from the "raw" data to support decisions, is the ultimate driver of the value of big data. After extensive research, deep-data-mining.com has identified the following five trends in big data analytics.

1. SQL-Based In-Database Analytics

Data analytics functions are built into the relational database engine. Users take advantage of SQL to perform data mining and predictive analytics tasks. All processes, including data extraction, data preparation, predictive model building and validation, and model deployment, are done within the database. SQL-based in-database analytics will become a trend, particularly in the enterprise environment. This observation is based on the following facts. Most core enterprise data are stored in relational databases. Much of the business logic that supports the daily operation of an enterprise is written in SQL. There is a huge number of SQL developers around the world, many of whom will be able to perform data analytics tasks without learning another scripting language.
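
One way to picture model deployment inside the database is to translate a fitted model into a plain SQL expression that the database engine evaluates row by row. The following is only a rough sketch, written in R (the language discussed under trend 3). The data are simulated, and the table name customers and the columns id, a, b and c are hypothetical. Database products with built-in analytics functions do this differently.

```r
# Simulated training data. The column names a, b, c are hypothetical.
set.seed(1)
n <- 500
trainset <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
trainset$y <- rbinom(n, 1, plogis(0.5 + 1.2 * trainset$a - 0.8 * trainset$b))

# Fit a logistic regression model.
fit <- glm(y ~ a + b + c, data = trainset, family = binomial(link = "logit"))

# Translate the coefficients into a SQL scoring expression.
coefs <- coef(fit)
terms <- sprintf("(%.6f) * %s", coefs[-1], names(coefs)[-1])
linear <- paste(c(sprintf("(%.6f)", coefs[1]), terms), collapse = " + ")

# "customers" and "id" are hypothetical table and column names.
score_sql <- sprintf(
  "SELECT id, 1.0 / (1.0 + EXP(-(%s))) AS score FROM customers",
  linear
)
cat(score_sql, "\n")
```

The generated statement computes the same logistic formula that R uses for prediction, up to the rounding of the coefficients to six decimal places, so the scoring step can run where the data already live.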

2. Apache Spark

According to The Apache Software Foundation, Apache Spark is a fast and general engine for large-scale data processing. The project claims that Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Its machine learning library, MLlib, includes SVM, logistic regression, decision trees and k-means clustering, among other algorithms.

3. The Proliferation of the R Language

R is a statistical analysis language that is much more "natural" to data scientists than other programming languages such as Java, C or SQL. For example, the R scripts that manipulate vectors and matrices or build predictive models are very similar to the mathematical equations found in textbooks. The following R script, for instance, builds a logistic regression model that predicts y based on a, b and c.
  glm(y ~ a + b + c, data = trainset, family = binomial(link = "logit"))
There is a big community of R users who develop R algorithms and share them in the form of R packages. As a result, almost any data analytics algorithm can be found in an R package.
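
To illustrate how closely R code can follow a textbook equation, here is a small sketch that computes ordinary least squares coefficients directly from the formula beta = (X'X)^(-1) X'y and compares them with R's built-in lm function. The data are simulated and the variable names are made up for illustration only.

```r
# Simulated data: y depends linearly on a and b plus noise.
set.seed(7)
n <- 100
a <- rnorm(n)
b <- rnorm(n)
y <- 2 + 3 * a - 1.5 * b + rnorm(n, sd = 0.5)

# Design matrix X with an intercept column of ones.
X <- cbind(1, a, b)

# Textbook formula: beta = (X'X)^(-1) X'y
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# The built-in linear model fit should give the same coefficients.
fit <- lm(y ~ a + b)
print(cbind(by_hand = drop(beta), lm = coef(fit)))
```

The line that computes beta reads almost exactly like the equation. In practice lm is preferable, since forming the inverse explicitly is less numerically stable than the decomposition lm uses internally.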

4. Real-Time, In-Memory Data Analytics

Traditionally, raw data are collected in real time in computer memory but are transformed and loaded into a disk-based data warehouse only periodically. The data are then analyzed offline, after a delay. For example, it is not unusual for a large enterprise to take weeks, if not months, to build, test and eventually deploy a predictive model. Because new useful patterns in the data are identified so slowly, opportunities are lost in the case of new sales, and risks are realized in the case of fraud prevention. Thus, there will be a trend to shift traditional offline, disk-based data analytics to an online, in-memory, real-time environment.
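
To make the contrast between offline and online analysis concrete, the sketch below keeps a running mean and variance in memory and updates them as each record arrives, so an unusual value can be flagged immediately instead of after the next warehouse load. It uses Welford's online algorithm, which is my own choice for illustration. The function names, the threshold of three standard deviations and the simulated stream are all assumptions.

```r
# Welford's online algorithm: update mean and variance one record at a time.
new_state <- function() list(n = 0, mean = 0, m2 = 0)

update_state <- function(s, x) {
  n <- s$n + 1
  delta <- x - s$mean
  mean <- s$mean + delta / n
  m2 <- s$m2 + delta * (x - mean)
  list(n = n, mean = mean, m2 = m2)
}

# Flag a value more than k standard deviations away from the running mean.
is_unusual <- function(s, x, k = 3) {
  if (s$n < 2) return(FALSE)
  sd_now <- sqrt(s$m2 / (s$n - 1))
  sd_now > 0 && abs(x - s$mean) > k * sd_now
}

# Simulated stream of transaction amounts; the last value is an injected outlier.
set.seed(42)
stream <- c(rnorm(200, mean = 100, sd = 10), 500)

state <- new_state()
flags <- logical(length(stream))
for (i in seq_along(stream)) {
  flags[i] <- is_unusual(state, stream[i])  # score the record before learning from it
  state <- update_state(state, stream[i])
}
which(flags)
```

Each record is scored against the statistics of everything seen before it, and only three numbers are kept in memory, so nothing has to be reloaded from disk to keep the statistics current.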

5. Innovative Data Analytics Applications

As we know, the ultimate purpose of big data is to provide data-driven decision support to solve problems. There will be more and more innovative applications of big data analytics. For example, police departments will use models to predict potential repeat offenders. Colleges will predict in advance whether a student is likely to drop out, based on the student's background and current situation. The human resources department of a large company can use models to design the best career paths for its employees. The applications of data analytics are unlimited.