Data Sources
SQL
Cloud Engineering Cheat Sheets
In most cases, this step can come before the first step: if the data is already available, you can define the questions (the problem) around it to make better use of the incoming data.
Based on the definition of your problem, you need to identify the data sources, which can be databases, data repositories, sensors, and so on. For an application deployed in production, this step should be automated with data pipelines that keep incoming data flowing into the system.
- List the sources and amount of data you need
- Check whether storage space is going to be an issue
- Check whether you’re authorized to use the data for your purpose
- Acquire the data and convert it into a workable format
- Check the type of data (textual, categorical, numerical, time series, images)
- Set aside a sample of it for final testing purposes
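The last three checklist items can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline: the inline CSV, the column names, the `infer_column_type` heuristic, and the 20% hold-out fraction are all assumptions made for the example. The heuristic distinguishes only numerical, categorical, and textual columns; time series and image data would need separate handling.

```python
import csv
import io
import random


def load_rows(csv_text):
    """Convert raw CSV text into a workable format: a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def infer_column_type(values):
    """Roughly classify a column (hypothetical heuristic, for illustration)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
        return "numerical"
    except ValueError:
        pass
    # Few distinct values relative to the row count suggests a category.
    if len(set(non_empty)) <= max(1, len(non_empty) // 2):
        return "categorical"
    return "textual"


def hold_out(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and set aside a test sample."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up sample data standing in for an acquired dataset.
SAMPLE_CSV = """id,city,temperature,comment
1,Oslo,3.5,cold and windy today
2,Oslo,4.1,light rain in the morning
3,Lima,19.0,overcast but warm
4,Lima,20.2,sunny with a breeze
5,Oslo,2.8,snow expected tonight
"""

rows = load_rows(SAMPLE_CSV)
column_types = {
    name: infer_column_type([row[name] for row in rows]) for name in rows[0]
}
train_rows, test_rows = hold_out(rows)

print(column_types)
print(len(train_rows), "rows for development,", len(test_rows), "held out")
```

Fixing the random seed keeps the held-out sample stable across runs, so the final test set is never accidentally mixed back into the data used for development.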
Data Description
- Document the data (see Google’s The Data Cards Playbook)
- Describe the extent of the data that is available.
- Describe data that is not available but is desirable.
- Describe the data that is available that you don’t need.
- Describe how the data was collected.
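One way to keep these descriptions next to the data is to record them in a small machine-readable summary. The sketch below is only an illustration: the `describe_dataset` helper and its field names are hypothetical, the sample rows are made up, and the output is not the Data Cards Playbook format.

```python
import json


def describe_dataset(rows, desired_columns, needed_columns, collection_notes):
    """Build a simple description of a dataset (hypothetical field names)."""
    available = sorted({key for row in rows for key in row})
    missing_counts = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in available
    }
    return {
        # Extent of the data that is available.
        "row_count": len(rows),
        "available_columns": available,
        "missing_values": missing_counts,
        # Data that is not available but is desirable.
        "desired_but_unavailable": sorted(set(desired_columns) - set(available)),
        # Data that is available but not needed.
        "available_but_unneeded": sorted(set(available) - set(needed_columns)),
        # How the data was collected (free text).
        "collection_notes": collection_notes,
    }


# Made-up rows standing in for a real dataset.
rows = [
    {"city": "Oslo", "temperature": "3.5", "station_id": "A1"},
    {"city": "Lima", "temperature": "", "station_id": "B7"},
]

card = describe_dataset(
    rows,
    desired_columns=["city", "temperature", "humidity"],
    needed_columns=["city", "temperature", "humidity"],
    collection_notes="Hypothetical: hourly readings exported from weather stations.",
)
print(json.dumps(card, indent=2))
```

Each key mirrors one item in the checklist above, so the summary can be regenerated whenever the data changes and reviewed alongside the written documentation.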