A data lake is a repository that collects data of all types from all sources across an organisation: files, images, documents, records, logs, sensor and IoT data, audio files, social media data, and more.

Data in a data lake may not be immediately needed by anyone in the organisation; its value rests on the premise that some day the data MIGHT be needed. Hence, any and all data is saved into the data lake.

Data in a data lake is stored in its basic, raw form so that users can find it as is and transform it for their particular needs. A schema is applied only at the moment the data is read for further analysis (Schema on Read); until then, the data simply lies dormant. This lets organisations be confident they can produce any data or analysis at any time, even when that data plays no part in day-to-day operations.
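To make Schema on Read concrete, here is a minimal sketch using PySpark. The file path, field names, and schema are hypothetical; the point is that the raw JSON files sit untouched in the lake, and structure is imposed only at read time, for this one analysis.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Hypothetical schema, defined at read time -- the raw files in the
# lake carry no schema of their own.
sensor_schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("recorded_at", TimestampType()),
])

# The raw JSON lines stay as-is in the lake; the schema is applied
# only now, as the data is read for this particular analysis.
readings = spark.read.schema(sensor_schema).json("/data-lake/raw/sensors/")
readings.groupBy("device_id").avg("reading").show()
```

A different team could read the very same files tomorrow with a different schema, which is exactly the flexibility raw storage buys.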

For organisations that already have a data warehouse, data that is frequently requested from the data lake can form the basis for improving the warehouse, so that users get the information they need more quickly. It also helps to elicit and study relationships between that data and what already exists in the warehouse, thereby improving analysis of organisational data.

Big tech companies like Google, Microsoft, and Amazon offer cloud services that can be used to build data lakes, but a good and popular open-source option is Apache Hadoop with its HDFS file system.
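As a rough sketch of what landing raw data in an HDFS-backed lake looks like with PySpark: the namenode address, paths, and log file below are assumptions for illustration, not a prescribed layout.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest-demo").getOrCreate()

# Hypothetical source file and HDFS layout -- adjust the host, port,
# and paths to match your own cluster.
raw_events = spark.read.text("/incoming/app-logs/2024-06-01.log")

# Land the data in the lake exactly as received: no parsing, no
# schema, just raw lines kept under an arrival-date directory.
raw_events.write.mode("append").text(
    "hdfs://namenode:8020/data-lake/raw/app-logs/dt=2024-06-01/"
)
```

Keeping ingestion this dumb is deliberate: because nothing is parsed or discarded on the way in, any future analysis can still start from the complete original data.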