Preparing Big Data for Analytics: Best Practice from Opti-Num Solutions

The technical team at Opti-Num Solutions has experience working with all types of data, using a range of techniques to deliver insight through analytics so that our clients reach their goals. Every business has data that can be leveraged to drive performance, and usually that data is big! We asked our technical team how they prepare to tackle big data and collected the tips and tricks they admitted they wished they had known before climbing into big data projects. Their tips are presented here to uncover the complexities behind big data: how it can be simplified, how you can start, and what best practices ensure it delivers real value to the business. Through this post, we hope to help you better prepare your data and show you how to find the low-hanging fruit, so you know where to dive deeper into your system and find the underlying insights.

While there are plenty of great courses on big data analytics to help you with the analytics themselves, little is said about how to prepare your data for that analysis. As one of our software engineers put it about typical course data sets, “… the data set is small, already cleaned and the patterns you are looking for are obvious by design. Real life analytics does not go that way.” At Opti-Num Solutions, we’ve been approached by clients with data sets of all kinds, from clean and consistent data sets with well-defined analytics goals that allow our consultants to dive straight in, to those that require much more time to wrangle before the analyses can begin.

In this article we will unpack the following tips to help you start working with big data:

  • How to approach big data
  • How to manage data security
  • How to manage your data to get the most value
  • Workflows to define insights from large data
  • Other game-changing tips from the team

Question 1: How do you approach big data?

  1. Know what your goal is before you begin. Before you start your analyses, know what your end goal is. With large sets of data especially, it is easy to get lost and go down a rabbit hole of analysis within your larger data set.
  2. Know what the data is. Before beginning, it is important to know what type of data you are working with. Is it time series or categorical? Does it contain the complex structures the analyses need? A good understanding here will guide you towards the techniques best suited to your data. Alongside this understanding, keep a reference document or data dictionary with your data, describing what each folder, file and/or table column contains, to remove any confusion that might arise, especially if those collecting the data are different from those performing the analyses.
  3. Store the data appropriately. How you store your data is a vital aspect of data analytics and determines how analysts will interact with it. The data types used to store your data, while generally pre-set by most storage systems, need to be checked to ensure they behave as you expect. Be wary of dates, integers, floats, strings, characters, and especially Boolean values; choosing the correct data type not only ensures consistency between data sets but, done well, can also significantly reduce the storage space your data needs. A relational database will generally cover most data storage needs, but depending on what is being stored, other storage options might be necessary. Be wary, though: some tools (such as MS Excel) are better suited to smaller data sets that can be loaded into memory all at once, so the right choice depends heavily on the data you are working with. At Opti-Num we have developed the Data Management Framework, which allows you to create custom databases for your pre-processed data for quick use in analysis.
  4. Start small and scale. When working with a large data set, begin with a subset that encompasses all your requirements while you build and test your analytics model, before moving to the full data set. Be sure your initial sample includes enough data to cover everything you expect to see: include as many categories as you can for categorical data, and make sure any seasonality patterns appear in time series data.
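To make the last two tips concrete, here is a minimal sketch in Python with pandas (the column names and generated data are hypothetical; the same ideas apply to MATLAB tables or any other tool): appropriate column types shrink the stored size, and a per-category sample gives a small but representative starting set.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data: a site label, a reading, and an alarm flag.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "site": rng.choice(["north", "south", "east"], size=n),
    "reading": rng.normal(20.0, 5.0, size=n),
    "is_alarm": rng.integers(0, 2, size=n),
})

# Tip 3: store each column with an appropriate type. A category and a
# Boolean take far less space than generic object and 64-bit columns.
before = df.memory_usage(deep=True).sum()
df["site"] = df["site"].astype("category")
df["is_alarm"] = df["is_alarm"].astype("bool")
df["reading"] = df["reading"].astype("float32")
after = df.memory_usage(deep=True).sum()

# Tip 4: start small. Sample within each category so the subset still
# covers all the behaviour the model must eventually handle at scale.
subset = df.groupby("site", observed=True).sample(frac=0.01, random_state=0)
```

Sampling per group, rather than taking the first rows of the file, is what keeps the small set representative of the whole.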

Question 2: How do you ensure the data is secure?

  1. Access control. Who has the data? Big data often contains sensitive information, so it must be made clear that the data can only be accessed by those who need it. Simply put, data is your most important commodity: the bigger it is, the more insight it holds to drive business decisions, and only those who should be using it should be able to use it.
  2. Keep it impersonal. When dealing with any data (big or small) that contains sensitive information regarding identity, an important first step is to obfuscate it. Removing any form of data that relates back to identity is an essential security measure, but the question remains: what if that data is important to the analysis? During the pre-processing phase, extract what is needed from that data and then remove the identifying fields from the data stream.
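A minimal sketch of that pre-processing step, using hypothetical customer records in Python with pandas: extract the feature the analysis needs (here, an age), replace each identity with a one-way salted hash, then drop the identifying fields.

```python
import hashlib

import pandas as pd

# Hypothetical customer records: name and date of birth are identifying.
df = pd.DataFrame({
    "name": ["Thandi M.", "Pieter V.", "Aisha K."],
    "birth_date": pd.to_datetime(["1985-03-02", "1990-11-15", "1978-07-30"]),
    "spend": [1200.0, 430.0, 980.0],
})

SALT = "replace-with-a-secret-salt"  # store securely, outside the data set

def pseudonymise(value: str) -> str:
    """One-way salted hash: the same person always maps to the same
    token, but the token cannot be reversed back to the name."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

# Extract what the analysis needs, then remove the raw identifiers.
df["customer_id"] = df["name"].map(pseudonymise)
df["age"] = (pd.Timestamp("2021-01-01") - df["birth_date"]).dt.days // 365
df = df.drop(columns=["name", "birth_date"])
```

The hashed `customer_id` still lets you link records belonging to the same person across data sets without ever exposing who that person is.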

Question 3: How do you set up the data for effective use?

  1. Know what is important for your analysis. Your data set may contain fields which, while important elsewhere, are not useful for your analysis. Filter these out as an initial step, and collect all the necessary data together for faster access and effective use. Rename fields where it helps, and store the result after large processing steps are complete for faster analysis and repeatability in future.
  2. Use time stamps and time series to better define your data. Make use of time stamps, especially once large chunks of processing have been completed, so you know when a result needs to be recompiled or for how much longer it can be reused. This is especially useful for categorical data.
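These two tips combine naturally into one small caching pattern, sketched here in Python with pandas under hypothetical file and column names: keep only the needed columns, store the processed result, and use its time stamp to decide whether it can be reused or must be recompiled.

```python
import time
from pathlib import Path

import pandas as pd

RAW = Path("raw_readings.csv")             # hypothetical source file
CACHE = Path("readings_clean.pkl")         # processed, analysis-ready copy
MAX_AGE_S = 24 * 3600                      # recompile after a day
NEEDED = ["timestamp", "site", "reading"]  # only what the analysis uses

def load_clean() -> pd.DataFrame:
    # Reuse the stored result while its time stamp says it is still fresh...
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < MAX_AGE_S:
        return pd.read_pickle(CACHE)
    # ...otherwise redo the heavy pre-processing and store it for next time.
    df = pd.read_csv(RAW, usecols=NEEDED, parse_dates=["timestamp"])
    df = df.dropna(subset=["reading"])  # keep only complete readings
    df.to_pickle(CACHE)
    return df
```

Every analysis script then calls `load_clean()` and the expensive step runs at most once per day, rather than once per run.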

Question 4: What helps you define insights from large data?

  1. Visualise the data. Get a feel for what is in the data set by breaking it into subsets and creating high-level plots. From this understanding of the data you are working with, you can begin to build a base from which insights can be extracted.
  2. Find the relationships. One thing that will hold true is that there are relationships within your data, and those high-level relationships will raise further questions that help you answer your original one. With visualisations you can find areas for exploration within the data, and those will lead to insights. Leveraging the natural behaviour in your data through various visualisations will help you achieve your goal.
  3. Segment and cluster. Various techniques allow you to find characteristic groups within your data. Once these groups are defined, based on a chosen classifier, your big data can be broken into smaller chunks of workable data, giving you areas of targeted focus in your analysis. Opti-Num has successfully used these techniques to drive customer segmentation on a regional level.
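As an illustration of segmentation, here is a minimal k-means sketch in Python with NumPy on hypothetical customer features (clustering functions in MATLAB or scikit-learn offer production-ready equivalents):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical customer features: [monthly spend, visits per month].
# Two behavioural groups are baked in so the clusters are recoverable.
low = rng.normal([200.0, 2.0], [30.0, 0.5], size=(100, 2))
high = rng.normal([900.0, 8.0], [50.0, 1.0], size=(100, 2))
X = np.vstack([low, high])

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: alternate assigning each point to its nearest
    centre and moving each centre to the mean of its points."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centres

labels, centres = kmeans(X, k=2)
# Each labelled segment is now a smaller, workable chunk for focused analysis.
```

In practice you would standardise features to a common scale before clustering; here the spend column dominates the distance, which still separates these two groups cleanly.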

Question 5: Other useful tips?

  1. Analysis paralysis is real. Everyone who deals with large data sets runs into it at some point or another: the feeling that there is so much data to work through that you do not even know where to begin. When this happens, take a step back, acknowledge that it is happening, and look back at the end goal of your analysis to see where to go next.
  2. Communicate with those generating the data. While a data dictionary is invaluable to those performing analysis on the data, talking to those who set up the data will almost certainly provide new insight into your investigation.
  3. Ensure consistency. While working on the data, be sure not to change it; any change to the system could invalidate some of the work already done. Note changes that need to be made and communicate them thoroughly to anyone using the data, so that the transition occurs as smoothly as possible.
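One lightweight way to guard that consistency, sketched in Python: record a fingerprint (hash) of each source file when the analysis starts, and re-check it before trusting results, so any change to the data is detected immediately.

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    """SHA-256 of a file's bytes. If the fingerprint recorded when the
    analysis started no longer matches, the source data has changed
    and earlier results may need to be redone."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MB chunks so even very large files hash safely.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Storing these fingerprints alongside the data dictionary gives everyone on the project a cheap, unambiguous check that they are all analysing the same data.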

Setting up your data for effective analysis can be challenging and nuanced, but we hope these tips provide some clarity to help you along your way. With MATLAB's datastore functionality and the Processing Big Data with MATLAB course, along with the answers to the questions above, you can achieve effective and worthwhile analysis to help drive insightful decision making within your business.
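The idea behind a datastore, touching one manageable chunk of an out-of-memory file at a time and combining partial results at the end, can be sketched in plain Python with pandas (file and column names here are hypothetical):

```python
import pandas as pd

def chunked_site_means(path, chunksize=100_000):
    """Mean reading per site for a CSV too large to load at once:
    accumulate running sums and counts one chunk at a time, then
    combine them into the final answer."""
    totals, counts = {}, {}
    for chunk in pd.read_csv(path, chunksize=chunksize):
        grouped = chunk.groupby("site")["reading"].agg(["sum", "count"])
        for site, row in grouped.iterrows():
            totals[site] = totals.get(site, 0.0) + row["sum"]
            counts[site] = counts.get(site, 0) + int(row["count"])
    return {site: totals[site] / counts[site] for site in totals}
```

Because only one chunk is ever in memory, this scales to files far larger than RAM; the same pattern (partial results per chunk, combined afterwards) underlies MATLAB's datastore and tall-array workflows.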

If you would like Opti-Num to provide more in-depth guidance and advanced analytics on your data, get in contact with us and we’ll help you find both the low-hanging fruit and the deeper insights in your data sources.

What Can I Do Next?