Long Bui consistent and discipline

Data Vocab

List of data dictionaries

This page serves as a comprehensive resource, providing definitions, explanations, and clarifications of essential terms, jargon, and concepts used in data engineering, analytics, and related fields.

Starting with discovery questions makes you curious about what the project actually is, and that is a good starting point for everything. Or, if you want to learn more about how we start this kind of work, go win the proposal to get started with a data project, whether freelance, outsourced, or an in-house product.

Watch this video: "Prepared and built Data Project as Data Engineer / Data Consultant"

| Component | Type | Discovery Question |
| --- | --- | --- |
| Data Platform | Pain Points | What are the objectives, key challenges, and pain points related to data management and accessibility that you want to address? |
| Data Platform | DevOps | What is the current DevOps process, including source code management, version control, and release management? Is CI/CD implemented? What Agile practices does the project follow? |
| Data Platform | Data Volume | What is the volume of data in the Data Platform? What is the expected monthly and yearly growth in data volume? |
| Data Platform | Data Source | Can you provide more details about source connectivity, format, and refresh cycle for Market Listener and Fraud Detector? |
| Data Platform | Data Consumers | How many concurrent and peak end users use the Data Platform (ad-hoc users/analysts, scheduled queries/reports, ML training/advanced analytics queries)? |
| Data Platform | DR | What backup and disaster recovery strategies need to be in place for the data lake? |
| Data Platform | Data Flow | Can you provide an overview of the data flow across the data pipelines? |
| Data Platform | Architecture | What steps are being taken to comply with personal data protection requirements? What percentage of data is stored in the local cloud or elsewhere for replication? |
| Data Platform | Architecture | Are you considering changes to the current architecture during migration, or do you only want to lift and shift? |
| Data Platform | Data Type | What are the types of source data and file formats in use, the data type requirements for the future state, and the key challenges faced in processing any particular type of data? |
| Data Platform | Data Volume | What are the current data set size, future data growth, storage and management requirements, and size and transfer constraints? |
| Data Platform | Data Frequency | What is the schedule of the data ingestion and processing pipelines? |
| Data Platform | Integration | Do you have any specific integration requirements or preferences? |
| Data Platform | Tools/Services | What are all the tools/services of the Data Platform? |
| Data Platform | Tools/Services | Have you installed Elasticsearch and Redis on Kubernetes on GCP, and MySQL on Compute Engine? What is Cloud SQL being used for other than PostgreSQL, as we notice that its cost is considerably high? |
| Data Platform | Tools/Services | What is the configuration of the Kafka cluster (number of brokers, hardware resources, retention settings, cluster management)? |
| Data Platform | Budget | What is your budget for data processing and analytics platforms? How important is cost optimization to your business? |
| Data Platform | Support | Are there any issues with the support currently being provided? Are you getting adequate support? |
| Data Platform | Users | What is the current team structure? |
| Data Platform | Users | How many users access the Data Platform? |
| Data Platform | Query Performance | What is the SLA for query response time? Are you still facing any performance issues in reporting or data processing? |
| Data Platform | Data Warehouse | Are you using flat-rate or on-demand pricing for service usage? Were any changes made after the 25% increase in rates? |
| Data Platform | Scalability | Are there any issues with the scalability of the platform? |
| Data Platform | Data Security | How do you manage access control and user permissions for your data? What types of security measures do you currently have in place (e.g. encryption, dynamic data masking)? Are there any specific security certifications or standards that you require? |
| Data Engineering | Data Pipelines | Are there any plans to unify the stream processing pipelines now that Pub/Sub supports schema evolution? Has the change to reduce Kafka partitions been implemented? Have there been any performance issues after the change? |
| Data Engineering | Data Pipelines | Can we get more details on the workload? |
| Data Engineering | Data Ingestion | How frequently do you want to update/ingest data (batch/real-time ingest intervals)? Are you considering any changes to the frequency of data processing? |
| Data Engineering | Alerts | How are you monitoring loads? What is the alert mechanism? |
| Data Engineering | Data Ingestion | Are there any issues related to Debezium CDC that need to be addressed? |
| Data Engineering | Data Processing | What types of data processing are being done? What programming language is used for Dataflow jobs: Python, Scala, or Java? |
| Data Engineering | Data Processing | What key challenges are being faced with respect to data processing? |
| Data Governance | Data Security | Are there any specific data privacy or compliance regulations? How is security and access control management implemented? How is PII data classification being done for data privacy? |
| Data Governance | Data Discovery | How do you currently handle data cataloging, metadata management, and data lineage tracking? Is OpenMetadata covering all aspects of data discovery and collaboration? What are the gaps? |
| Data Governance | Data Retention | What are the specific data retention and archiving requirements? |
| Data Governance | Data Quality | Is any data quality tool used in the current data warehouse? |
| Machine Learning | Scope | What are the main goals of your machine learning projects? Can you provide the number of projects (use cases) and models developed? Are you focused on batch processing, real-time predictions, or both? |
| Machine Learning | Tools | Are any of the tools used for machine learning tightly coupled with GCP? |
| Machine Learning | Tools | What tools are you using to pre-process the data for training? |
| Machine Learning | Frameworks | What common frameworks are being used for ML? |
| Machine Learning | AutoML | Are you using any GCP AutoML services, such as AutoML Natural Language or AutoML Translation? |
| Machine Learning | Managed ML | What are the plans for moving to managed ML services for experimentation, training, metrics logging, model deployment, and monitoring? |
| Reporting | Reporting | How many reports overall have been developed in Power BI? Is it self-service analytics, or is there a dedicated reporting team? What report embedding requirements do you have? |
| Reporting | Reporting | What are the data analytics and reporting requirements across different teams or departments within your organization? Are there any dedicated BSAs? |
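
The Data Volume questions above ask for both monthly and yearly growth. As a rough sketch of how the two answers relate, the Python below compounds a monthly growth rate into a yearly one and projects a future data size. The function names and the example figures (10 TB, 5% per month) are hypothetical and not taken from any real platform, and the sketch assumes growth stays constant month over month, which real platforms rarely do.

```python
def yearly_growth_rate(monthly_growth: float) -> float:
    """Convert a constant monthly growth rate (0.05 = 5%) into a yearly rate."""
    return (1 + monthly_growth) ** 12 - 1


def project_volume_tb(current_tb: float, monthly_growth: float, months: int) -> float:
    """Project data volume after `months`, assuming constant compound growth."""
    if current_tb < 0 or months < 0:
        raise ValueError("current_tb and months must be non-negative")
    return current_tb * (1 + monthly_growth) ** months


if __name__ == "__main__":
    # Hypothetical answers from a discovery call: 10 TB today, growing 5% per month.
    current_tb = 10.0
    monthly = 0.05
    print(f"Yearly growth rate: {yearly_growth_rate(monthly):.1%}")
    print(f"Volume in 12 months: {project_volume_tb(current_tb, monthly, 12):.2f} TB")
    print(f"Volume in 36 months: {project_volume_tb(current_tb, monthly, 36):.2f} TB")
```

A quick projection like this helps check whether the monthly and yearly numbers a client gives are consistent with each other before sizing storage or estimating cost.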

Other questions may be useful for other purposes; just ask them before starting the project.
