In recent years, progress in the fields of machine learning and artificial intelligence has reached the point where extremely large models trained on internet-scale datasets achieve state-of-the-art performance on a diverse range of learning tasks. This success has spurred a paradigm shift in the way modern-day machine learning models are trained.
These pre-trained models have earned the name foundation models. They are adapted via fine-tuning or few-shot/zero-shot learning strategies and subsequently deployed across a wide range of domains. Foundation models allow knowledge to be transferred and shared across domains, reducing the need for task-specific training data.
Examples of foundation models are:
Large Language Models (LLMs)
Large vision foundation models
Large multimodal foundation models
Large reinforcement learning foundation models
Large Language Models
We will focus our discussion on LLMs. LLMs unify a variety of language understanding and generation tasks in a single model, and in recent years they have transformed natural language processing. In addition, approaches have been proposed for deploying LLMs on vision-related tasks.
For example, a modular approach called LENS (Large Language Models ENhanced to See) has recently been introduced that uses an LLM as a “reasoning module” operating over independent “vision modules.” The approach first extracts rich textual information from an image using pre-trained vision models. That text is then fed to the LLM, enabling it to perform object recognition and other vision-and-language tasks. Because it is modular, LENS lets one leverage the latest advancements in both computer vision and natural language processing out of the box.
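The LENS-style division of labor described above can be sketched in a few lines. The vision-module and LLM calls below are stubbed placeholders standing in for real pre-trained models (e.g. a tagger, a captioner, and any text-only LLM); the function names and the prompt layout are illustrative assumptions, not the actual LENS API:

```python
# Minimal sketch of a LENS-style pipeline: frozen vision modules emit
# text, and a text-only LLM reasons over that text. All models are
# stubbed here; names and prompt layout are illustrative assumptions.

def get_tags(image):
    # Placeholder for a pre-trained tagging module (e.g. CLIP-based).
    return ["dog", "frisbee", "grass"]

def get_captions(image):
    # Placeholder for a pre-trained captioning module (e.g. BLIP-style).
    return ["a dog catching a frisbee on a lawn"]

def build_prompt(image, question):
    """Collect textual evidence from the vision modules into one prompt."""
    tags = ", ".join(get_tags(image))
    captions = " ".join(get_captions(image))
    return (
        f"Tags: {tags}\n"
        f"Captions: {captions}\n"
        f"Question: {question}\n"
        "Short Answer:"
    )

def answer(image, question, llm=lambda prompt: "a dog"):
    # `llm` could be any text-only language model; here it is a stub
    # that ignores the prompt and returns a fixed string.
    return llm(build_prompt(image, question))

print(answer(None, "What animal is in the picture?"))
```

The key design point is that the LLM itself never sees pixels: all visual information reaches it as text, which is why no additional multimodal training is required to swap in stronger vision models or a stronger LLM.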
What is Databricks?
Databricks is a data and AI company founded by the creators of Apache Spark, Delta Lake, and MLflow. As a technology, Databricks is a simple, cloud-based platform that can handle all of your data needs. It is conveniently available on top of your existing cloud, whether that is Amazon Web Services, Microsoft Azure, Google Cloud, or a multi-cloud combination.
The Databricks platform simplifies the complex mix of data warehouses and data lakes. In other words, it combines your data warehouse and data lake into a “data lakehouse.” The platform also simplifies the variety of other tools for analytics, business intelligence, and data science. Databricks' depth and performance mean it can be used by all members of a data team, including data engineers, data analysts, business intelligence practitioners, data scientists, and machine learning engineers.
Dolly 2.0 is the first open-source instruction-following Large Language Model (LLM) trained on the Databricks machine learning platform that is licensed for both research and commercial use. To help you get started, the databricks-dolly-15k dataset can be downloaded from the Hugging Face website.
Authored by over 5,000 Databricks employees during March and April 2023, this dataset contains 15,000 high-quality, human-generated prompt/response pairs designed specifically for instruction tuning large language models. Fortunately, it is released under the Creative Commons Attribution-ShareAlike 3.0 Unported license, so anyone can use, modify, or extend this dataset for any purpose, including commercial applications.
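To make the dataset concrete, the sketch below formats records with the published databricks-dolly-15k fields (instruction, context, response, category) into instruction-tuning prompts. The template wording follows Dolly's "### Instruction / ### Response" instruct format but is reproduced from memory, so treat it as an approximation rather than the exact training template; the sample record contents are invented:

```python
# Sketch: turning databricks-dolly-15k-style records into training
# prompts. Field names match the published dataset schema; the template
# text approximates Dolly's instruct format and the record is invented.

PROMPT_NO_CONTEXT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

PROMPT_WITH_CONTEXT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
    "### Response:\n{response}"
)

def format_record(rec):
    """Choose the template based on whether the record has context."""
    template = PROMPT_WITH_CONTEXT if rec.get("context") else PROMPT_NO_CONTEXT
    return template.format(**rec)

record = {
    "instruction": "What is a data lakehouse?",
    "context": "",
    "response": "A platform combining a data lake and a data warehouse.",
    "category": "open_qa",
}
print(format_record(record))
```

To work with the real data, the dataset can be pulled from the Hugging Face Hub with `datasets.load_dataset("databricks/databricks-dolly-15k")` and the same formatting applied to each record.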
Contact us for more information on the Databricks Dolly LLM.
References
Li, Chunyuan, et al. “Multimodal foundation models: From specialists to general-purpose assistants.” arXiv preprint arXiv:2309.10020 (2023).
Mai, Gengchen, et al. “On the opportunities and challenges of foundation models for geospatial artificial intelligence.” arXiv preprint arXiv:2304.06798 (2023).
Berrios, William, et al. “Towards language models that can see: Computer vision through the lens of natural language.” arXiv preprint arXiv:2306.16410 (2023).