How to Avoid Common Data Analysis Errors in Tech

Explore top LinkedIn content from expert professionals.

  • Joseph M.

    Data Engineer, startdataengineering.com | Bringing software engineering best practices to data engineering.

    47,175 followers

    It took me 10 years to learn about the different types of data quality checks; I'll teach it to you in 5 minutes:

    1. Check table constraints
    The goal is to ensure your table's structure is what you expect:
    * Uniqueness
    * Not null
    * Enum check
    * Referential integrity
    Checking the table's constraints is an excellent way to cover your data quality bases.

    2. Check business criteria
    Work with the subject matter expert to understand what data users check for:
    * Min/max permitted value
    * Order-of-events check
    * Data format check, e.g., check for the presence of the '$' symbol
    Business criteria catch data quality issues specific to your data/business.

    3. Table schema checks
    Schema checks ensure that no inadvertent schema changes happened, e.g.:
    * Using an incorrect transformation function, leading to a different data type
    * Upstream schema changes

    4. Anomaly detection
    Metrics change over time; ensure it's not due to a bug.
    * Check the percentage change of metrics over time
    * Use simple percentage change across runs
    * Use standard deviation checks to ensure values are within the "normal" range
    Detecting value deviations over time is critical for business metrics (revenue, etc.).

    5. Data distribution checks
    Ensure your data size remains similar over time.
    * Ensure row counts remain similar across days
    * Ensure critical segments of data remain similar in size over time
    Distribution checks help you catch data lost to faulty joins/filters.

    6. Reconciliation checks
    Check that your output has the same number of entities as your input.
    * Check that your output didn't lose data due to buggy code

    7. Audit logs
    Log the number of rows input and output for each "transformation step" in your pipeline.
    * A log of the number of rows going in and coming out is crucial for debugging
    * Audit logs can also help you answer business questions
    Debugging data questions? Look at the audit log to see where data duplication/dropping happens.

    DQ warning levels: Make sure your data quality checks are tagged with appropriate warning levels (e.g., INFO, DEBUG, WARN, ERROR). Based on the criticality of the check, you can block the pipeline.

    Get started with the business and constraint checks, adding more only as needed (a minimal sketch of a few of these checks follows this post). Before you know it, your data quality will skyrocket! Good luck!

    Like this thread? Read about the types of data quality checks in detail here 👇 https://lnkd.in/eBdmNbKE Please let me know what you think in the comments below. Also, follow me for more actionable data content. #data #dataengineering #dataquality
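    A minimal sketch of a few of these checks in plain pandas. The table and column names (order_id, status) and the thresholds are hypothetical; this illustrates the pattern, not the author's implementation:

```python
# Sketch of constraint, anomaly, and reconciliation/audit checks in pandas.
# All names (order_id, status) and thresholds are hypothetical.
import pandas as pd

def check_constraints(df: pd.DataFrame) -> None:
    # 1. Table constraints: not null, uniqueness, enum membership
    assert df["order_id"].notna().all(), "order_id contains nulls"
    assert df["order_id"].is_unique, "order_id contains duplicates"
    allowed = {"placed", "shipped", "delivered"}
    assert df["status"].isin(allowed).all(), "status outside allowed enum"

def check_metric_anomaly(today: float, yesterday: float,
                         max_pct_change: float = 0.5) -> None:
    # 4. Anomaly detection: flag large run-over-run swings in a metric
    pct_change = abs(today - yesterday) / max(abs(yesterday), 1e-9)
    if pct_change > max_pct_change:
        raise ValueError(f"Metric moved {pct_change:.0%} in one run; investigate")

def check_reconciliation(inp: pd.DataFrame, out: pd.DataFrame) -> None:
    # 6 & 7. Reconciliation plus a simple audit log of rows in vs. out
    print(f"audit: rows_in={len(inp)} rows_out={len(out)}")
    assert out["order_id"].nunique() == inp["order_id"].nunique(), \
        "transformation dropped or duplicated entities"
```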

  • Taimur Sajid

    Asset-Backed Finance | Ex-Blackstone & JPM

    8,811 followers

    🚫 Modeling Mistake #2: Ignoring Statistical Fundamentals

    The most pervasive issue I encounter among data scientists and quants, particularly those early in their careers or without formal training, is the tendency to skip past essential statistical analysis and jump straight into model development. This predisposition towards modeling first and analyzing data later (if at all) is guaranteed to create problems down the line.

    These fundamentals aren't complex: I'm talking about examining distributions through density plots (you'd be surprised how often this reveals bimodal or mixed distributions), visualizing relationships between variables (rather than relying on correlation matrices alone), and testing basic statistical properties of your data (imbalanced data, non-stationarity, etc.).

    Take credit risk modeling: I frequently see teams jump into sophisticated ML techniques without first examining the distribution of their target variable, only to later discover they're dealing with a zero-inflated distribution. Or they'll use feature selection methods without even plotting their features against the target variable, missing obvious non-linear relationships.

    Why is this important? Well, I've seen projects where teams spent months developing sophisticated models, only to discover fundamental data quality issues that invalidated all of their work. This isn't just about following best practices. It's about building an understanding that will inform every subsequent modeling decision you make.

    #DataScience #Modeling #QuantitativeFinance #Analytics
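    A minimal sketch of the kind of first look this post recommends, assuming hypothetical column names (loss_amount, exposure) and a made-up file; the habit is the point, not the specific dataset:

```python
# First-pass statistical checks before any modeling.
# File and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("portfolio.csv")

# Density plot of the target: bimodal, heavily skewed, or zero-inflated
# shapes show up here long before any model complains.
df["loss_amount"].plot.density()
plt.title("Target distribution")
plt.show()

# Quantify zero inflation explicitly rather than discovering it later.
zero_share = (df["loss_amount"] == 0).mean()
print(f"{zero_share:.1%} of target values are exactly zero")

# Plot a feature against the target: correlation matrices summarize
# linear association only, so obvious non-linear shapes hide in them.
df.plot.scatter(x="exposure", y="loss_amount", alpha=0.3)
plt.show()
```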

  • Shikha Shah

    Founder and CEO

    4,623 followers

    Today, I would like to share a common problem I have encountered in my career: *Broken Data Pipelines*. They disrupt critical decision-making, leading to inaccurate insights, delays, and lost business opportunities.

    In my view, the major reasons for these failures are:

    1) Data Delays or Loss
    Incomplete data due to network failures, API downtime, or storage issues, leading to reports and dashboards showing incorrect insights.

    2) Data Quality Issues
    Inconsistent data formats, duplicates, or missing values, leading to compromised analysis.

    3) Version Mismatches
    Surprise updates to APIs, schema changes, or outdated code, leading to mismatched or incompatible data structures in the data lake or database.

    4) Lack of Monitoring
    No real-time monitoring or alerts, leading to delayed detection of issues.

    5) Scalability Challenges
    Pipelines that cannot handle increasing data volumes or complexity, leading to slower processing times and potential crashes.

    Over time, the Quilytics team and I have identified and implemented strategies to overcome this problem with simple yet effective techniques:

    1) Implement Robust Monitoring and Alerting
    We leverage tools like Apache Airflow, AWS CloudWatch, or Datadog to monitor pipeline health and set up automated alerts for anomalies or failures.

    2) Ensure Data Quality at Every Step
    We have implemented data validation rules to check data consistency and completeness. Using a tool like Great Expectations works wonders for automating data quality checks (a minimal sketch follows this post).

    3) Adopt Schema Management Practices
    We use schema evolution tools and version control for databases. Regularly testing pipelines against new APIs or schema changes in a staging environment helps in staying ahead of the game 😊

    4) Scale with Cloud-Native Solutions
    Leveraging cloud services like Amazon Web Services (AWS) Glue, Google Cloud Dataflow, or Microsoft Azure Data Factory to handle scaling is very worthwhile. We also use distributed processing frameworks like Apache Spark for handling large datasets.

    Key Takeaways
    Streamlining data pipelines involves proactive monitoring, robust data quality checks, and scalable designs. By implementing these strategies, businesses can minimize downtime, maintain reliable data flow, and ensure high-quality analytics for informed decision-making.

    Would you like to dive deeper into these techniques and examples we have implemented? If so, reach out to me at shikha.shah@quilytics.com
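    As one concrete illustration of point 2, here is a minimal validation sketch using Great Expectations' classic pandas API; the exact method names vary across GE versions, and the file and column names here are hypothetical:

```python
# Illustrative data quality gate with Great Expectations' classic
# pandas API (newer "fluent" versions differ; adapt as needed).
import great_expectations as ge

df = ge.read_csv("daily_orders.csv")  # hypothetical daily extract

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

result = df.validate()
if not result["success"]:
    # Fail loudly so the monitoring/alerting from point 1 can pick it up
    raise RuntimeError("Data quality checks failed; halting pipeline")
```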

  • Jonathan Hershaff

    Data Scientist @ Airbnb | ex-Stripe | Causal Inference | Economist | WhatsTheImpact.com

    6,671 followers

    A common mistake I’ve seen from analysts, junior data scientists, and business partners is mistaking statistical significance for causality. We’ve all heard the mantra “correlation is not causation,” yet stats such as the uncontrolled college wage gap are commonly described as causal. Stakeholders and analysts may see significant results from a regression estimate and not recognize that the results may simply reflect correlations.

    To help understand this important concept, I generated hypothetical data on product sales, where the quantity sold is positively driven by “quality” but negatively driven by price. Plots and regression estimates that ignore or cannot accurately measure quality suffer from “omitted variable bias,” which in some cases can show statistically significant relationships that aren’t even directionally accurate (that is, the estimated result is positive despite the true relationship being negative).

    I share code with an accompanying tutorial here: https://lnkd.in/eUz456Hi #datascience #datasciencetutorial #dataanalyst #dataanalytics #datascientist
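    The author's own code is at the link above; what follows is a separate minimal simulation in the same spirit, with a made-up data-generating process, showing how omitting "quality" flips the sign of the estimated price effect:

```python
# Omitted variable bias in miniature: quality drives both price and
# quantity, so a regression that omits quality gets the wrong sign.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

quality = rng.normal(size=n)
price = 2.0 * quality + rng.normal(size=n)             # better products cost more
quantity = 3.0 * quality - price + rng.normal(size=n)  # true price effect: -1

# Naive regression omitting quality: the price coefficient comes out
# around +0.2, statistically significant, and directionally wrong.
naive = sm.OLS(quantity, sm.add_constant(price)).fit()
print(naive.params)

# Controlling for quality recovers the true negative effect (about -1.0).
X = sm.add_constant(np.column_stack([price, quality]))
controlled = sm.OLS(quantity, X).fit()
print(controlled.params)
```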

  • 🎯 Ming "Tommy" Tang

    Director of Bioinformatics | Cure Diseases with Data | Author of From Cell Line to Command Line | Learn to understand | Educator YouTube @chatomics

    49,318 followers

    Beginner Mistakes in Genomics Data Analysis (And How to Avoid Them)

    1/ When I started with genomics, I made mistakes. Here are key lessons I learned the hard way. Don't repeat them. (I wrote it 10 years ago.)

    2/ Computers make mistakes too. They can produce nonsense results without errors. Always test your code extensively before running large analyses.

    3/ Share your code. Even if it works, share it. Others can review, spot errors, and improve it. Open science benefits everyone.

    4/ Make your scripts reusable. Don't hardcode file paths. Instead, use arguments so your script can run on different datasets easily (see the sketch after this post):

    python myscript.py --input data.bam --output results.txt

    5/ Modularize your code. Genomics data comes in different formats. Avoid one huge script. Instead, split it into logical steps.
    Example: ChIP-seq analysis
    • Module 1: Fastq → BAM
    • Module 2: BAM → Peaks
    If someone has BAM files, they can skip Module 1.

    6/ Comment your code heavily. It helps others understand your logic and helps you six months later when you forget what you did.

    7/ Make your analysis reproducible. Document every step in a markdown file:
    • Every command you run
    • Intermediate files
    • Where & when you downloaded data
    It will save you (and others) from frustration later.

    8/ Key Takeaways
    • Code fails silently; test it
    • Share & document your work
    • Avoid hardcoding & modularize scripts
    • Keep everything reproducible

    9/ Action Item: Start documenting every step today! Future-you will thank you.

    More tips: https://lnkd.in/eiJu7rR7 I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter https://lnkd.in/erw83Svn
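    A minimal sketch of the argument-driven script from point 4, so paths are never hardcoded; the flag names mirror the post's example, and the body is a placeholder for a real analysis step:

```python
# myscript.py: a reusable analysis step driven by arguments rather than
# hardcoded paths. The actual analysis is a placeholder here.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Toy step: input file in, results out")
    parser.add_argument("--input", required=True, help="input file, e.g. data.bam")
    parser.add_argument("--output", required=True, help="output file, e.g. results.txt")
    args = parser.parse_args()

    # Placeholder for the real work (alignment, peak calling, etc.)
    with open(args.output, "w") as out:
        out.write(f"processed {args.input}\n")

if __name__ == "__main__":
    main()
```

    Run it exactly as in the post: python myscript.py --input data.bam --output results.txt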

  • David Honigs

    Field Application Scientist at PerkinElmer Inc.

    2,466 followers

    #AskMrNIRPerson: The Worst Thing I've Seen in NIR Calibration

    Dear Mr. NIR Person,
    What’s the worst thing you’ve ever seen in #NIR calibration?
    Signed, Morbidly Curious

    Dear Morbidly Curious,

    Ah, the worst thing? The absolute nightmare? Easy. It’s when people don’t get their numbering or IDs straight, and we can’t match spectra to chemistry correctly. It doesn’t take many mix-ups to wreck everything. Honestly, it’s like trying to separate laundry after someone spilled coffee all over the labels. When there are just a couple of mismatches, we usually drop them, because bad data corrupts good data faster than a bad habit ruins your tennis backhand.

    But why does this keep happening? Well, let me introduce you to the greatest hits of data disasters:

    Boxes Marked with Just a Name: Someone labels a box "Box A," and inside are samples 1 through 20. Then someone else thinks, "Why not just use 1 to 20 for IDs and ignore the box name?" Fast forward, and you’re trying to figure out which "Sample 5" came from which box. Pure chaos.

    Shortcutting Repetitive Names: Another classic move: people decide to "clean up" by cutting repetitive parts of the name. "Box_A_Sample_1" suddenly becomes "SA1," and now you’ve got 15 different "Something1s" with no context. Great job.

    Non-Waterproof Ink: Oh, the joy of smudged labels! You think you’re working with "Gloop," only to discover it might actually be "Glop." Ink that runs, labels that blur: it’s like a tragic magic trick that makes your data disappear. The worst part? Nobody notices until it’s too late.

    Human Error in Data Entry: IDs get manually entered, someone fat-fingers a number, or worse, skips an entry altogether. You’d think double-checking would be a universal skill, but alas.

    Every time, my advice is the same: Make sure the IDs match. Match your spectra IDs to your chemistry IDs like your project’s future depends on it, because it does. Label everything clearly, redundantly, and preferably with ink that won’t wash away in the rain. If anyone scoffs at your meticulous labeling, remind them: bad data weighs a hundred times more than good data, and cleaning it up later is expensive.

    And hey, we’ve all seen someone’s data equivalent of "Jerry Springer drama." You don’t want that.

    So, yes, Morbidly Curious, this is why I double-check IDs, use waterproof markers, and sleep with a stack of correctly labeled sample bags under my pillow. Because you never know when someone’s going to mix up Box A and Box B, or worse, turn Gloop into Glop.

    Yours in Calibration,
    Mr. NIR Person
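    In the spirit of Mr. NIR Person's advice, a minimal ID reconciliation check, assuming two hypothetical CSVs keyed by a sample_id column, that surfaces mismatches before they poison a calibration:

```python
# Cross-check spectra IDs against chemistry IDs before calibrating.
# File and column names are hypothetical.
import pandas as pd

spectra = pd.read_csv("spectra.csv")      # one row per scanned sample
chemistry = pd.read_csv("chemistry.csv")  # lab reference values

spectra_ids = set(spectra["sample_id"])
chem_ids = set(chemistry["sample_id"])

only_spectra = spectra_ids - chem_ids   # scanned, but no chemistry
only_chem = chem_ids - spectra_ids      # chemistry, but never scanned

if only_spectra or only_chem:
    print("Spectra without chemistry:", sorted(only_spectra))
    print("Chemistry without spectra:", sorted(only_chem))
    raise SystemExit("Fix ID mismatches before building the calibration")
```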

  • Kavana Venkatesh

    Applied Scientist Intern at Amazon | CS PhD @Virginia Tech | Researcher | Generative AI | Large Language Models | Computer Vision | Diffusion Models

    12,242 followers

    🛑 Here are the top mistakes I made as a data scientist fresh out of college, and the lessons I learned from them:

    👉 Ignoring Business Context: One of the most significant mistakes I made was ignoring the business context of the organization I was working for. I was so focused on the technical aspects of the job that I forgot about the business goals and objectives. Approaching a problem with a value-delivery-focused mindset is a game-changer!

    👉 Overcomplicating Data: While data preparation is an integral part of the job, I realized that I was overcomplicating it and wasting time on things that didn't matter. I learned to simplify the process by prioritizing important features to save time and deliver insights quickly.

    👉 Not Communicating Effectively: Data scientists need to be able to communicate complex findings and models to non-technical stakeholders. Effective communication is key to gaining credibility and earning buy-in from decision-makers.

    👉 Using Complex Models without Justification: New data scientists often get excited about using the latest models, regardless of their complexity. However, it is important to justify the use of a model based on the business problem and the data available. If there is no clear reason to use a complex model, it is better to use a simpler one.

    👉 Not Testing Assumptions: I used to make assumptions about my data without testing them. This can lead to incorrect conclusions and incorrect solutions to business problems. It's important to test assumptions and make sound inferences based on the data.

    Here are some helpful tips that I learned:

    ✔ Define the Problem: Start by defining the problem and why it matters to the business. This will help you stay focused on the end goal and develop solutions that align with business objectives.

    ✔ Continuously Learn: The field of data science is continually evolving, so it's essential to stay up to date on the latest tools and techniques. Take online courses, attend conferences, and participate in local meetups to keep expanding your knowledge base.

    ✔ Collaborate: Data science is a collaborative field, so seek out opportunities to work with others. Collaborating with professionals who have different skill sets and perspectives can help you see problems from different angles and arrive at better solutions.

    ✔ Tell a Story with Data: Visualizing data can help tell a story, making complex data more accessible to stakeholders. Developing skills in data visualization can help you communicate your findings effectively.

    ✔ Focus on Impact: Always keep in mind the impact of your work on the business and end users. Understanding the impact of your work can help you make better decisions and prioritize your time and resources.

    Follow Kavana Venkatesh for more such content. Book a 1:1 call with me for any support in your AI journey using the link in my profile. #datascience #ai #nlp #deeplearning #computervision #datatips #communication #leadership

  • Nicholas Plotnicoff, MBA

    Used by Caesars & others to uncover missed savings in construction, data storage, and business decisions. No added headcount. Just results.

    4,143 followers

    I see bad data insight discovery practices daily. But they’re easily fixed. Here are 3 tips to fix bad data insights:

    Tip 1: Start with a clear question.
    What to stop doing: Diving into data without a specific goal.
    What to do instead: Begin with a well-defined business question.
    Why that’s the better way: A clear question focuses your analysis and saves time.
    Why it works: It ensures you’re solving the right problem, leading to actionable insights.
    Example: Instead of asking, “What’s our sales trend?” ask, “How did our sales trend change after the last campaign?”
    Example: Replace “What’s happening with our customers?” with “Which customer segments show the highest churn?”
    Example: Swap “How’s our product doing?” for “What’s driving product X’s recent growth?”
    Quick summary: Start with a clear question, and your insights will have direction and purpose.

    Tip 2: Clean your data before analysis.
    What to stop doing: Ignoring data quality issues and rushing into analysis.
    What to do instead: Dedicate time to clean, organize, and validate your data.
    Why that’s the better way: Clean data ensures accurate and reliable results.
    Why it works: Garbage in, garbage out. Quality data leads to quality insights.
    Example: Before analyzing, remove duplicates and correct errors in your dataset.
    Example: Standardize date formats and fix missing values to avoid skewed results.
    Example: Ensure consistency in categorical variables (e.g., “NY” vs. “New York”).
    Quick summary: Clean data is the foundation for meaningful analysis (a minimal cleaning pass is sketched after this post).

    Tip 3: Visualize your findings effectively.
    What to stop doing: Overloading stakeholders with complex charts and tables.
    What to do instead: Use simple, clear visuals that tell a story.
    Why that’s the better way: Visuals should highlight insights, not overwhelm the audience.
    Why it works: People grasp information faster through visuals, leading to better decision-making.
    Example: Use a bar chart to show sales growth across regions instead of a cluttered spreadsheet.
    Example: Replace a dense pie chart with a simple line graph to show trends over time.
    Example: Use color sparingly to emphasize key points, not to decorate.
    Quick summary: Effective visuals turn data into compelling narratives.

    Takeaway: My clients are always amazed by the level of detail I go into when fixing their data insight processes, thanks to the integration of advanced AI with powerful BI capabilities. Every question matters. Every data point, every chart, every analysis. Every discovery or insight.

    Remember: Quality insights come from clear questions, clean data, and effective visuals. Get it wrong, and you’ll waste time on irrelevant data. Get it right, and with AI-driven BI tools, you’ll uncover insights that drive meaningful decisions faster than ever.
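    A minimal cleaning pass illustrating Tip 2, with hypothetical column names (date, state, amount); the median fill is one illustrative choice among many:

```python
# Tip 2 in miniature: deduplicate, standardize, and validate before analysis.
# Column names and the fill strategy are illustrative.
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical input

df = df.drop_duplicates()                                  # remove exact duplicates
df["date"] = pd.to_datetime(df["date"], errors="coerce")   # standardize date formats
df = df.dropna(subset=["date"])                            # drop unparseable dates
df["amount"] = df["amount"].fillna(df["amount"].median())  # handle missing values
df["state"] = df["state"].replace({"NY": "New York"})      # unify category labels
```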