Thursday, July 12, 2012

You can be a data scientist for big data just using Excel and Access

Just google 'Big Data' and watch the explosion of results. Some herald it as the next big thing; others peddle their consulting services/software/systems as the magic recipe for companies to mine gold from their (till now) ignored legacy data sets. However, as I've realized from a customer service project over the past few weeks, spreadsheets can do the trick equally well, provided you follow a few basic ground rules.

  1. Understand the underlying data, its fields/formats and its limitations. This may require visiting the field/POS/scanners/call centres to get a feel for the source data, but it can do wonders for understanding what can be relied upon and what can't. After all, all the number crunching in the world, even across 500 columns, is useless if the underlying data is not reliable.
  2. Frame hypotheses (even though they may seem 'obvious') and look for data to validate or disprove them. For example, if you think that customers recently added on the network will face more issues with their bill delivery, then choose data fields that can help you test that hypothesis. A few well-chosen fields can really reduce the time spent importing/exporting data.
  3. Use subject matter experts to review the findings and offer their views. Often, if a finding challenges their conventional wisdom, they can offer alternative explanations for it.
  4. Creatively think of dimensions like time, space and temperature when analyzing data sets. Often, these physical variables are available in the system but ignored. Using existing data is far cheaper than trying to collect it from scratch.
  5. Data mining for social networks, stock trading data and other massive data volumes would obviously need some automation via neural networks, crawlers, AI etc. But for everything else, one can use spreadsheets to squeeze out that last drop of insight from existing data sets.
  6. Understand the import/export features of spreadsheet/database software like Excel/MS Access etc. If your data set, with some tweaking, can fall within their limits (say, below a million-odd rows), then using those standardized tools may be faster and more efficient. There's also no need to call in expensive data conversion experts; you can do the job yourself (for example, importing .txt files into Excel).
  7. Don't be afraid to google any trouble areas. Often, excellent solutions are available online (for instance, how to export more than 65,536 rows from MS Access to Excel; the online solutions include exporting without formatting, or using ODBC).
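The hypothesis test in point 2 — that recently added customers face more bill-delivery issues — boils down to a grouped count, which you'd do in Excel with a pivot table. Here's a minimal sketch of the same logic in Python; the records and the six-month cutoff are made up for illustration:

```python
from collections import defaultdict

# Hypothetical records: (months_on_network, had_delivery_issue)
records = [(1, True), (2, True), (3, False), (14, False),
           (20, True), (26, False), (30, False), (2, True)]

# Bucket customers into 'new' (< 6 months) vs 'established'.
issues = defaultdict(lambda: [0, 0])   # bucket -> [issue count, customer count]
for months, had_issue in records:
    bucket = "new" if months < 6 else "established"
    issues[bucket][0] += had_issue
    issues[bucket][1] += 1

for bucket, (bad, total) in sorted(issues.items()):
    print("%s: %d/%d customers had issues" % (bucket, bad, total))
```

If the 'new' bucket shows a clearly higher issue rate, the hypothesis survives; if not, you've disproved an 'obvious' belief with a dozen lines of counting.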
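The row limits in points 6 and 7 are worth knowing cold: older .xls sheets stop at 65,536 rows, modern .xlsx at 1,048,576. One common workaround when exporting a larger table is simply to split it into sheet-sized chunks. A sketch, using a made-up 100,000-row data set:

```python
# Excel sheet row limits (one row reserved for the header).
XLS_ROW_LIMIT = 65536       # Excel 97-2003 (.xls)
XLSX_ROW_LIMIT = 1048576    # Excel 2007+ (.xlsx)

def split_for_excel(rows, limit=XLS_ROW_LIMIT):
    """Split data rows into chunks that each fit one Excel sheet,
    leaving room on every sheet for a header row."""
    chunk_size = limit - 1
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

# Hypothetical export: 100,000 rows won't fit a single .xls sheet.
data = [["id-%d" % i, i % 7] for i in range(100000)]
chunks = split_for_excel(data)
print(len(chunks))      # number of sheets needed
print(len(chunks[0]))   # rows on the first sheet
```

Each chunk can then be pasted (or written via CSV) into its own sheet — the same idea as the 'export without formatting' trick, just made explicit.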
Bottom line: Don't be awed by big data sets. Use pivot tables etc. to summarize the data and get a feel for it, then slice and dice it the way you can using Excel/Access. Often, those quick wins may earn you the justification for a bigger data mining project.
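For readers who haven't used one, a pivot table is just a count (or sum) broken out by two fields at once. The sketch below reproduces that in plain Python on an invented complaint log; in Excel you'd get the same grid by dragging region into Rows and issue type into Columns:

```python
from collections import Counter

# Hypothetical complaint log: (region, issue_type)
log = [("North", "billing"), ("North", "network"), ("South", "billing"),
       ("South", "billing"), ("North", "billing"), ("South", "network")]

pivot = Counter(log)  # (region, issue_type) -> count
regions = sorted({r for r, _ in log})
issue_types = sorted({t for _, t in log})

# Print a simple cross-tab: one row per region, one column per issue type.
print("region    " + "  ".join(issue_types))
for r in regions:
    print(r, [pivot[(r, t)] for t in issue_types])
```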
