
Data Collection and Pre-processing

Abideen Bello
3 min read · Feb 9, 2023



Data collection and pre-processing are the unsung heroes of the data science world!

While they may not be as flashy as building models or creating stunning visualizations, they are the foundation upon which all great data-driven insights are built.

The success of any data analysis project largely depends on the quality and accuracy of the data that is collected and prepared for analysis.

Think about it — without clean and well-prepared data, your analysis is like trying to build a skyscraper on shaky ground.

It just won’t work!

However, with the correct data gathering and pre-processing techniques, you can convert a jumbled, unstructured mess of information into a polished, well-organized dataset suitable for analysis.

So, let’s dive in and discover the amazing world of data collection and pre-processing!

Data Collection

This is the process of obtaining information from a variety of sources, such as databases, surveys, and web scraping. The collected data may be structured, such as a spreadsheet, or unstructured, such as text or images.
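As a rough illustration of gathering data from more than one source, the sketch below reads rows from a CSV export and from a database table, then combines them. The column names, the `feedback` table, and the sample values are all hypothetical, and an in-memory SQLite database stands in for a real one.

```python
import csv
import io
import sqlite3

# Hypothetical survey export; in practice this would be a file on disk.
CSV_EXPORT = """respondent_id,rating,comment
1,4,Works well
2,5,Love it
"""

def rows_from_csv(text):
    """Read structured rows from CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_database(conn):
    """Read structured rows from the (hypothetical) feedback table."""
    conn.row_factory = sqlite3.Row
    cursor = conn.execute("SELECT respondent_id, rating, comment FROM feedback")
    return [dict(row) for row in cursor]

# In-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedback (respondent_id INTEGER, rating INTEGER, comment TEXT)"
)
conn.execute("INSERT INTO feedback VALUES (3, 2, 'Too slow')")

# Combine both sources into one collection of records.
collected = rows_from_csv(CSV_EXPORT) + rows_from_database(conn)
```

Note that the CSV rows carry every value as a string while the database rows carry integers. Mismatches like this are one reason collected data usually needs pre-processing before analysis.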

After data collection comes pre-processing: cleaning, transforming, and preparing the data for analysis. This step matters because inaccurate or incomplete data can substantially distort the results of any analysis.

Consider the following scenario: a company wants to collect data on customer satisfaction with its product. It runs an online survey and gathers customer feedback. The resulting survey data is structured, with each customer response represented by a row in a spreadsheet and each question represented by a column.
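To make the row-and-column layout concrete, here is a minimal sketch of such a survey table. The question names and answers are made up for illustration: each inner list is one customer response (a row), and each position corresponds to one question (a column).

```python
# Hypothetical question names, one per column.
COLUMNS = ["respondent_id", "satisfaction", "would_recommend"]

# Each inner list is one customer's response (one row).
responses = [
    [101, 4, "yes"],
    [102, 5, "yes"],
    [103, 2, "no"],
]

def column(rows, name):
    """Return every answer to a single question, i.e. one column."""
    index = COLUMNS.index(name)
    return [row[index] for row in rows]

# Because the data is structured, column-wise summaries are straightforward.
satisfaction = column(responses, "satisfaction")
average = sum(satisfaction) / len(satisfaction)
```

In practice a spreadsheet or a dataframe library would hold this table, but the idea is the same: one row per response, one column per question.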

Data Cleaning

This is the process of finding and removing errors, inconsistencies, and duplicates from the data. This step also covers handling missing values and dealing with outliers.

For example, suppose the survey data contains inaccuracies, inconsistencies, and missing information. To clean the data, the organization checks each response for correctness and eliminates any duplicate responses. They then impute missing values, either by…
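The cleaning steps described above can be sketched roughly as follows. The field names and sample rows are hypothetical. Two choices here are assumptions rather than the article's method: ratings outside a 1 to 5 scale are treated as errors, and missing values are filled with the median of the valid ratings, which is only one of several common imputation options.

```python
from statistics import median

# Hypothetical raw survey rows with a duplicate, a missing value, and an error.
raw = [
    {"respondent_id": 1, "rating": 4},
    {"respondent_id": 2, "rating": None},  # missing answer
    {"respondent_id": 2, "rating": None},  # duplicate submission
    {"respondent_id": 3, "rating": 5},
    {"respondent_id": 4, "rating": 42},    # outside the assumed 1-5 scale
    {"respondent_id": 5, "rating": 2},
]

def clean(rows, low=1, high=5):
    """Deduplicate, blank out invalid ratings, then impute with the median."""
    # 1. Keep only the first response per respondent.
    seen = set()
    unique = []
    for row in rows:
        if row["respondent_id"] not in seen:
            seen.add(row["respondent_id"])
            unique.append(dict(row))  # copy so the raw data stays untouched

    # 2. Treat out-of-range ratings as missing (assumed rule).
    for row in unique:
        rating = row["rating"]
        if rating is not None and not (low <= rating <= high):
            row["rating"] = None

    # 3. Fill missing ratings with the median of the valid ones (assumed method).
    valid = [row["rating"] for row in unique if row["rating"] is not None]
    fill = median(valid)
    for row in unique:
        if row["rating"] is None:
            row["rating"] = fill
    return unique

cleaned = clean(raw)
```

Copying each row before modifying it keeps the raw data intact, so the cleaned result can always be compared against what was originally collected.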

