A standard question I ask my clients, sometimes directly, sometimes subtly is “Who here, loves to talk, and knows what they are talking about”. As a Consultant, in particular, it is critical to minimize ramp-up time and demonstrate value sooner, often in a situation where there is no formal onboarding.
In an early role, a co-worker assured me it would take at least 2 months to get a lay of the land in terms of the nuances of datasets, the gotchas in playing with the data, the data infrastructure and pipeline and so on. Over the years, I’ve been able to reduce the ramp-up in less than a week, using sleuthing (and a gentle flexing of “pester power”).
This article is for beginners and experienced data professionals alike who are seeking to revamp their ramp-up time. It lays out my work process: the questions to ask, identifying the right people who can give reliable answers. The objective is to help you briskly gather sufficient intel on new datasets, and put your data to work in hours or days, rather than in months.
As part of standard onboarding you should expect to receive the list of columns in a dataset, a vague description and associated data type, and a few sample queries. If you are lucky, you will get a business description of the fields, and the sample queries will come with detailed comments to explain why a specific filter is being used, indexed fields, key metrics and corresponding logic for calculation.
If your team prioritizes documentation (unfortunately, this is not common), you’ll also find these handy guides:
- “Data pipeline” diagram that shows the systems that shows the flow of the data through the system. This diagram is a first step in understanding the data chain, and finding out the processing the data undergoes at each stage (example, excluding bot traffic)
- If you are working with an RDBMS, an Entity-relationship diagram (ERD): diagram showing linkages between datasets and logic and business rules
To gather more specifics, you need to navigate the organization; identify and introduce yourself to the following folks:
- Who owns this data? This is the individual that has to sign off on structural changes to the data. Ask them about the process for making changes. How can you ensure you’re in the loop when these changes are made?
- Who maintains this data? What times does the data get refreshed? How to ensure you get notified of downtimes? What level of granularity or aggregation is available - daily/monthly/quarterly…
- Is there a data governance or audit team? Is there detailed documentation around metric definitions. Has consent been gathered when collecting data; General Data Protection Regulation (GDPR)/California Consumer Privacy Act (CCPA)? What parts of the data are sensitive, what parts are confidential. Has PII been hashed? What is the retention period for the data?
- Which other teams use this data? Get acquainted with the heaviest users, so you can exchange nuances of the data.
- Who is the point person in the Finance team? Management tends to trust the financial data more, and it is important to reconcile numbers from financial systems and digital analytics and be armed with an explanation for any differences.
Here are a few parameters for sizing-up a dataset:
- What does each row of the data signify? If it is web traffic data for example, is each row a hit or a visit level?
- What are the “hot fields” - the columns that are either most used or most nuanced in the dataset
- For transactional tables you can legitimately have multiple rows of data for a primary key.
- Are there “unique” measures? Example, Visitors metric is unique and needs to be de-duped. Also, expect to encounter pre-aggregated unique measures, which cannot be summed and the granularity for de-duping is not stored.
- What are the nuances around date columns, ‘Reporting Date’, ‘Transaction Date’...Which time zones are these dates in? Do they account for daylight savings time?
- How are corrections handled for returns/cancellations/fraud? Negative amount/quantity? Order returns - data changes by day.
- If financial transactions are in multiple currencies, what are the rules for exchange rates?
- Are there dependencies between fields? If field type is A, then status may only be X, Y or Z and so on.
- Are labels localized? A status can be labeled ‘Approved’ in the US, but ‘Aprobado’ in Spain. Where can you find the lookup tables (30 - Approved, 30 - Aprobado)?
- What is the significance of null values, if allowed?
Finally, understand the volume of data, so when querying you don't generate monster queries which the system owner will call you up for, for degrading system performance.
There are several other angles to consider in reducing the time to deliver value; nail business needs as they relate to data, appreciate subtleties in data instrumentation, visual data exploration and so on. These will be covered in future articles.
What else do you watch out for from the perspective of understanding a dataset? Please share in the comments.
Image Credit: Stocksy