On data collection: how to collect data, analyze it, and rob caravans


In the previous article we looked at data quality issues (“About data quality and common errors when collecting them” on Habr).
Today I want to continue the conversation about data quality and discuss data collection: how to prioritize when choosing a source, how to collect data and which data to collect, how to assess the value of data for the company, and more.

Collect everything


Have you decided to improve the design and the checkout on your site?
Great, but what about the way the buyer fills the basket? At what point do they make the final choice of goods: before adding them to the basket or before paying for the purchase?
It may differ from site to site, but how do customers behave on yours?
If you have data on the ordering process, you can analyze it and choose a direction for the update that is convenient not only for you but also for your users.



Collect all the data you can reach. You can never know with absolute certainty which of it you will need, and you may get only one chance to collect it.

The more data you collect, the more you will know about your users and, more importantly, the better you will be able to understand and predict the context of their actions.
Context helps you understand your customers, their wishes, and their intentions. The better you know your customers, the better you can meet their individual needs, which increases loyalty and the likelihood that they will return.

Today, collecting absolutely all data is not that rare, and it is especially common in online projects. In a company that collects as much data as possible and knows how to work with it, almost all activities are based on data: marketing, sales, staff management, updates and improvements, deliveries.
Each area has internal and external data sources in various formats and of varying quality.

This is good for analysts' work and for decision making, but it also creates the problem of storing and processing all this data. Every operation adds to the financial burden, and the benefit of owning data can turn into a headache.

To decide whether collecting and processing particular data is worthwhile, you need to understand its basic characteristics. Let's take a quick look at them:

Volume
Volume affects the financial cost of storing and modifying data and the time needed to process it. Although the unit cost of storage falls as data volume grows, the growing number of sources can still make the total financial burden unreasonable.

Variety
A varied set of data sources gives a more complete picture and helps you better assess the context of user actions. The flip side is the variety of formats and the cost of integrating them into your analytics system. It is not always possible to bring all the data together, and even when it is possible, it is not always necessary.

Velocity
How much data must be processed per unit of time?
Recall the recent US presidential election: fast processing of Twitter messages made it possible to gauge the mood of voters during the debates and adjust course accordingly.
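As a rough illustration of how volume and velocity turn into resource requirements, here is a back-of-the-envelope sketch. The function names and all figures (event rate, event size, processing time per event) are hypothetical assumptions chosen for the example, not numbers from the article.

```python
import math

def daily_storage_gb(events_per_second, bytes_per_event):
    """Raw storage added per day, in gigabytes (10**9 bytes)."""
    seconds_per_day = 24 * 60 * 60
    return events_per_second * bytes_per_event * seconds_per_day / 1e9

def workers_needed(events_per_second, seconds_per_event):
    """Minimum number of parallel workers that keeps up with the stream."""
    return max(1, math.ceil(events_per_second * seconds_per_event))

# Hypothetical stream: 2,000 events per second, 500 bytes each,
# 2 ms of processing time per event.
print(daily_storage_gb(2000, 500))
print(workers_needed(2000, 0.002))
```

Even such a crude estimate helps to see early whether a source fits the storage and processing budget.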

Data giants such as Facebook and Google needed a huge amount of time to achieve today's results, but thanks to that they now have data on every user and can predict users' actions.
A common problem for people who work with data is limited resources, primarily money and staff.
In most companies, analysts have to set strict priorities when choosing data sources and therefore give up some of them.
In addition, you must take the interests of the business into account, which means assessing the return on investment in data work and the possible impact of the data on the company.

Priorities and selection of data sources


With limited resources, data professionals have to set priorities and choose between sources.
What should guide that choice, and how do you determine the value of data for the company?

The main goal of analysts is to provide other departments with the information they need, with good quality and on time. This information directly affects the efficiency of the company and of its departments.

Each department or division has its own “main” type of data.
For the customer service department, customer contacts and data from customers' social networks are important, while for the marketing department it is the purchase history and the map of user actions.
As a result, each team has its own set of “very important data”, and in that team's view this data is definitely more important and more necessary than that of other divisions.

The importance of the data does not make the problem of limited resources go away, so priorities must be set and followed. The main factor in prioritizing data is ROI, but availability, completeness, and quality should not be forgotten.
Below are some factors that can help in setting priorities:

Parameters for prioritization
High
Reason: The data is needed immediately.
Explanation: If a department urgently needs data within a tight time frame, that data is provided first.

High
Reason: The data adds value.
Explanation: The data increases profits or reduces costs, providing a high ROI.

High
Reason: Different teams need the same data.
Explanation: By meeting the data needs of several teams at once, you increase ROI.

High
Reason: Short-lived or streaming data.
Explanation: Some interfaces and protocols offer only a time-limited “window” for collecting data, so you need to hurry.

Medium
Reason: The data complements an existing dataset and improves its quality.
Explanation: New data complements existing data and improves understanding of the context of actions.

Medium
Reason: The processing code can be reused.
Explanation: Reusing known code lowers costs, which raises ROI, and reduces possible errors.

Medium
Reason: The data is readily available.
Explanation: If the data is valuable and easy to get, go ahead.

Medium
Reason: A convenient API lets you collect data for past periods.
Explanation: If the data is not needed urgently and you can always go back and get it, do not give it too high a priority.

Low
Reason: Analysts already have access to the data or other ways to get it.
Explanation: If analysts already have access to the data, there are probably higher-priority tasks.

Low
Reason: Poor data quality.
Explanation: Poor data may be useless and sometimes harmful.

Low
Reason: The data has to be extracted from web pages.
Explanation: Processing such data can be quite complex and require excessive effort.

Low
Reason: Low probability that the data will be used.
Explanation: Data that would be nice to have, but it is no great loss to go without it.
But with this data in hand, you can rob caravans!
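The factors above can be turned into a rough ranking. The following is a minimal sketch, not a method from the article: the `DataSource` fields, the sort order (urgent first, then ROI, then number of interested teams), and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    expected_gain: float   # estimated yearly benefit (hypothetical units)
    cost: float            # estimated yearly cost of collection and processing
    urgent: bool = False   # some team needs the data immediately
    teams: int = 1         # number of teams that need the same data

def roi(source):
    """Return on investment: net gain divided by cost."""
    return (source.expected_gain - source.cost) / source.cost

def prioritize(sources):
    """Urgent sources first, then higher ROI, then more interested teams."""
    return sorted(sources, key=lambda s: (not s.urgent, -roi(s), -s.teams))

# Hypothetical candidates.
sources = [
    DataSource("web_scraped_prices", expected_gain=12_000, cost=10_000),
    DataSource("purchase_history", expected_gain=50_000, cost=10_000, teams=3),
    DataSource("support_tickets", expected_gain=15_000, cost=10_000, urgent=True),
]
for s in prioritize(sources):
    print(s.name, round(roi(s), 2))
```

In practice the estimates of gain and cost are the hard part; the sorting itself only makes the chosen priorities explicit.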

As we can see, not all data has to be provided “right now”, so it is necessary to set priorities and act on them.
It is important to keep a balance between acquiring new data and its value for the company.

Data interconnection


You receive important data from the sales department, marketing, logistics, and customer feedback, but data becomes most valuable once you establish connections between the different types of data.

For example, consider Diana and her order. She recently ordered a set of garden furniture. Comparing her order with analytics data, we see that she spent 30 minutes on the site and looked through 20 different sets. This means that she chose the furniture on the site itself, without knowing in advance what she would order.
Looking at where she came from, we see that it was search results.

If we had information about Diana's other purchases, we would learn that she has often bought household goods over the past month.
Frequent online purchases and the use of search engines to find online stores indicate low brand loyalty, which means it will be difficult to persuade her to buy from us again.

With each new layer of information we build up an individual portrait of the user, from which we can learn about her life, attachments, and habits and predict her behavior.
Adding information from the checkout, we learn that this is a woman, and from the delivery address we see that she lives in a private house.

Continuing the analysis, we can find information about her house and its plot, predict her needs, and make a proactive offer.
If the data is analyzed properly, the offer may work: we will persuade the customer to buy again and increase her loyalty through an individual approach.

Offering a discount for inviting a friend from a social network would give us access to her friends list and account information. We could then continue the individual marketing approach and target ads at her, but this is unlikely to be cost-effective.
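The linking described in this section, matching an order with web analytics data, can be sketched in a few lines. The record layout, the `customer_id` key, and the threshold of 10 viewed products are hypothetical; real systems will have their own identifiers and heuristics.

```python
# Hypothetical records from two separate systems.
orders = [
    {"order_id": 1, "customer_id": "c-17", "item": "garden furniture set"},
]
sessions = [
    {"customer_id": "c-17", "minutes_on_site": 30, "products_viewed": 20, "referrer": "search"},
    {"customer_id": "c-42", "minutes_on_site": 3, "products_viewed": 1, "referrer": "direct"},
]

def link_orders_to_sessions(orders, sessions):
    """Attach session data to each order by customer_id (a simple inner join)."""
    by_customer = {s["customer_id"]: s for s in sessions}
    linked = []
    for order in orders:
        session = by_customer.get(order["customer_id"])
        if session is not None:
            linked.append({**order, **session})
    return linked

def chose_on_site(row, min_products=10):
    """Heuristic: many viewed products suggests the choice was made on the site."""
    return row["products_viewed"] >= min_products

for row in link_orders_to_sessions(orders, sessions):
    print(row["item"], row["referrer"], chose_on_site(row))
```

Each additional source (purchase history, checkout data, delivery address) would be joined in the same way, on whatever shared key the systems provide.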

Collection and purchase of data


Today there are many ways to collect data, and one of the most common is an API. But besides being collected, data also needs to be kept up to date, and here everything depends on the volume.

Small datasets (up to 100 thousand rows) are best replaced entirely with fresh data, while for large datasets a partial update makes more sense: adding new values and removing obsolete ones.
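The two update strategies can be sketched as follows. This is a minimal illustration with in-memory dictionaries keyed by record id; the function names are hypothetical, and the 100,000-row threshold is the one mentioned above and should be tuned for your own system.

```python
FULL_REPLACE_LIMIT = 100_000  # rows; threshold from the text, adjust as needed

def choose_strategy(row_count):
    """Pick an update strategy based on dataset size."""
    return "full_replace" if row_count <= FULL_REPLACE_LIMIT else "partial_update"

def full_replace(snapshot):
    """Small dataset: discard the old copy and keep the fresh snapshot."""
    return dict(snapshot)

def partial_update(current, changed, obsolete_keys):
    """Large dataset: add new or changed records and remove obsolete ones."""
    updated = dict(current)
    updated.update(changed)
    for key in obsolete_keys:
        updated.pop(key, None)  # ignore keys that are already gone
    return updated

current = {1: "old price", 2: "discontinued item", 3: "unchanged"}
print(choose_strategy(len(current)))
print(partial_update(current, changed={1: "new price", 4: "new item"}, obsolete_keys=[2]))
```

Both functions return a new dictionary and leave the input untouched, which makes it easy to compare the dataset before and after the update.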

Some datasets are so huge that processing them in full would be too expensive for the company. In such cases a sample is taken and the analytics is based on it. Simple random sampling, where every record has the same chance of being selected, much like a coin toss, is often used, but you should check that the resulting sample is actually representative, because small or rare segments can easily be under-covered.
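A simple random sample and a quick representativeness check can be sketched like this. The dataset, the `channel` field, and the sample size are hypothetical; the check simply compares the share of each segment in the sample with its share in the full dataset.

```python
import random
from collections import Counter

def simple_random_sample(records, k, seed=None):
    """Draw k records without replacement; every record is equally likely."""
    rng = random.Random(seed)
    return rng.sample(records, k)

def segment_shares(records, key):
    """Share of each segment, for comparing a sample with the full dataset."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return {segment: n / total for segment, n in counts.items()}

# Hypothetical dataset: 10% of visits come from email, 90% from search.
population = [
    {"id": i, "channel": "email" if i % 10 == 0 else "search"}
    for i in range(10_000)
]
sample = simple_random_sample(population, k=500, seed=1)
print(segment_shares(population, "channel"))
print(segment_shares(sample, "channel"))
```

If a segment that matters to the business turns out to be badly under-covered in the sample, a larger sample or sampling within each segment separately may be worth considering.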

An important question: should you collect raw or aggregated data?
Some data providers offer ready-made aggregated datasets, but these have several drawbacks. For example, they may lack values that you need or would like to have and that would make analytics based on this data more valuable for the company, and you will not be able to collect or add them yourself. On the other hand, data collected by third-party aggregators is convenient to archive and store, and it saves a significant amount of time and staff effort.

But if you have the opportunity to collect raw data, it is better to choose it: raw data is more complete, and you can aggregate it yourself according to your own needs and those of the business, and then work with it as you see fit.
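To illustrate why raw data is more flexible, here is a small sketch that aggregates the same raw events in two different ways. The event fields (`user`, `date`, `amount`) and the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw purchase events.
raw_events = [
    {"user": "u1", "date": "2017-10-01", "amount": 20.0},
    {"user": "u2", "date": "2017-10-01", "amount": 5.0},
    {"user": "u1", "date": "2017-10-02", "amount": 12.5},
]

def aggregate(events, keys):
    """Count events and sum `amount` for each combination of the given keys."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for event in events:
        group = tuple(event[k] for k in keys)
        totals[group]["count"] += 1
        totals[group]["amount"] += event["amount"]
    return dict(totals)

print(aggregate(raw_events, ["date"]))  # daily totals
print(aggregate(raw_events, ["user"]))  # totals per user
```

A supplier's pre-aggregated daily totals would allow only the first view; the per-user view could not be reconstructed from them.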

Many companies collect data themselves and also use data available from open sources. In some cases, though, they have to pay a third party for the data they need. The choice of suppliers may or may not be limited, but either way several factors should be considered when choosing a data source and deciding whether to buy:

Price
Everyone loves free data, both management and analysts, but sometimes high-quality information is available only for money. In that case, consider whether the purchase is reasonable and compare the cost of the data with its value.

Quality
Is the data clean, and can you trust it?

Exclusivity
Is the data prepared specifically for you or available to everyone? Will you gain a competitive advantage by using it?

Sample
Is it possible to get a sample to assess the quality of the data before the acquisition?

Updates
What is the lifespan of the data, how quickly does it become obsolete, will it be updated, and how often?

Reliability
What are the limitations of data acquisition interfaces, what other limitations may be imposed on you?

Security
If the data is important, will it be encrypted, and how reliable are the protocols? Do not forget about security during transfer either.

Terms of Use
Licensing or other restrictions. What can prevent you from using the data in full?

Format
How convenient is the format of the purchased data to work with? Can the data be integrated into your system?

Documentation
If documentation is provided, good. If not, ask how the data was collected so that you can assess its value and reliability.

Volume
If there is a lot of data, can you store and process it? Valuable data will not always be voluminous, and vice versa.

Degree of detail
Is this data appropriate for the level of analytics you need?

These are not all of the questions, but they are the main ones and undoubtedly worth asking before buying data from suppliers.
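If a supplier does provide a sample, a quick quality check of the kind mentioned above can be automated. The following is a minimal sketch under stated assumptions: the field names and rows are hypothetical, a value counts as missing if it is `None` or an empty string, and a duplicate is a row identical to an earlier one in every field.

```python
def quality_report(rows, required_fields):
    """Share of missing values per required field, plus the number of duplicate rows."""
    total = len(rows)
    if total == 0:
        raise ValueError("the sample is empty")
    missing = {
        field: sum(1 for r in rows if r.get(field) in (None, "")) / total
        for field in required_fields
    }
    seen = set()
    duplicates = 0
    for r in rows:
        fingerprint = tuple(sorted(r.items()))  # order-independent row identity
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)
    return {"rows": total, "missing": missing, "duplicates": duplicates}

# Hypothetical sample received from a supplier.
sample_rows = [
    {"email": "a@example.com", "city": "Moscow"},
    {"email": "", "city": "Kazan"},
    {"email": "a@example.com", "city": "Moscow"},
    {"email": "b@example.com", "city": None},
]
print(quality_report(sample_rows, ["email", "city"]))
```

A report like this does not prove that the data is trustworthy, but a high share of gaps or duplicates in the sample is a reasonable argument in negotiations over price.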

This concludes the data collection article.
If the information was useful to you, I will be glad to hear your feedback.
Perhaps you disagree with something or want to share your methods and practices - I invite you to comment, and I hope for an exciting and useful discussion.
Thank you all for your attention and have a nice day!

Source of information
Author: Carl Anderson
Analytical culture. From data collection to business results
Creating a Data-Driven Organization
ISBN: 978-5-00100-781-4
Publisher: Mann, Ivanov and Ferber

Source: https://habr.com/ru/post/407977/

