Art of Data Recon

By Vineeta Kommineni posted 12-20-2020 11:32 PM


I see a number of articles on how to master data querying and manipulation, but little by way of developing data smarts. Surprising, given that a big part of discerning meaning from data comes from sleuthing (and a gentle flexing of “pester power”). This piece will help you briskly gather sufficient intel on new datasets, and put your data to work in hours or days, rather than in months.     

The Dossier

As part of standard onboarding you should expect to receive the list of columns in a dataset, a vague description and associated data type, and a few sample queries. If you are lucky, you will get a business description of the fields, and the sample queries will come with detailed comments to explain why a specific filter is being used, indexed fields, key metrics and corresponding logic for calculation. 

 If your team prioritizes documentation (unfortunately, this is not common), you’ll also find these handy guides:   

  • “Data pipeline” diagram that shows the systems that shows the flow of the data through the system. This diagram is a first step in understanding the data chain, and finding out the processing the data undergoes at each stage (example, excluding bot traffic) 
  • If you are working with an RDBMS, an Entity-relationship diagram (ERD): diagram showing linkages between datasets and logic and business rules

The Informants

To gather more specifics, you need to navigate the organization; identify and introduce yourself to the following folks: 

  • Who owns this data? This is the individual that has to sign off on structural changes to the data. Ask them about the process for making changes. How can you ensure you’re in the loop when these changes are made? 
  • Who maintains this data? What times does the data get refreshed? How to ensure you get notified of downtimes? What level of granularity or aggregation is available - daily/monthly/quarterly…
  • Is there a data governance or audit team? Is there detailed documentation around metric definitions. Has consent been gathered when collecting data; General Data Protection Regulation (GDPR)/California Consumer Privacy Act (CCPA)? What parts of the data are sensitive, what parts are confidential. Has PII been hashed? What is the retention period for the data?
  • Which other teams use this data? Get acquainted with the heaviest users, so you can exchange nuances of the data. 
  • Who is the point person in the Finance team? Management tends to trust the financial data more, and it is important to reconcile numbers from financial systems and digital analytics and be armed with an explanation for any differences. 

Antennae Up

Here are a few parameters for sizing-up a dataset:

  • What does each row of the data signify? If it is web traffic data for example, is each row a hit or a visit level? 
  • What are the “hot fields” - the columns that are either most used or most nuanced in the dataset
  • For transactional tables you can legitimately have multiple rows of data for a primary key. 
  • Are there “unique” measures? Example, Visitors metric is unique and needs to be de-duped. Also, expect to encounter pre-aggregated unique measures, which cannot be summed and the granularity for de-duping is not stored.
  • What are the nuances around date columns, ‘Reporting Date’, ‘Transaction Date’...Which time zones are these dates in? Do they account for daylight savings time?
  • How are corrections handled for returns/cancellations/fraud? Negative amount/quantity? Order returns - data changes by day.
  • If financial transactions are in multiple currencies, what are the rules for exchange rates?
  • Are there dependencies between fields? If field type is A, then status may only be X, Y or Z and so on. 
  • Are labels localized? A status can be labeled ‘Approved’ in the US, but ‘Aprobado’ in Spain. Where can you find the lookup tables (30 - Approved, 30 - Aprobado)?
  • What is the significance of null values, if allowed?

Finally, understand the volume of data, so when querying you don't generate monster queries which the system owner will call you up for, for degrading system performance.


There are several other angles to consider in reducing the time to deliver value; nail business needs as they relate to data, appreciate subtleties in data instrumentation, visual data exploration and so on. These will be covered in future articles. 

What else do you watch out for from the perspective of understanding a dataset? Please share in the comments. 

Image Credit: Stocksy


Most Recent Blogs

  • In this post,   Tom Arundel   discusses five key areas of focus for a Digital Data Center of Excellence, and why one ...

  • Posted in: Women in Analytics

    Volunteering with the DAA’s Women in Analytics mentoring program has been an eye-opening experience for me. The DAA states ...

    1 person likes this.
  • Posted in: Women in Analytics

    I am delighted to bring you a guest blog from Tammy Kim-Newman is Business Developer and Tech Talent Specialist at the University ...

  • I was recently asked to interview a candidate who we are entertaining to join first PRO . Business continues to be brisk, ...

  • Posted in: Member Open Forum

    “Are you insane?” This was the reaction I received when I moved jobs, from client-side to an agency, in the latter part ...

  • For nearly the past three years in Philly, it has almost never snowed. I like it when it snows. Watching a snowfall is a ...