Datasets at your fingertips in Google Search

Access to datasets is critical to many of today’s endeavors across verticals and industries, whether scientific research, business analysis, or public policy. In the scientific community and throughout various levels of the public sector, reproducibility and transparency are essential for progress, so sharing data is vital. For example, a recent United States policy requires free and equitable access to the outcomes of all federally funded research, including data and statistical information along with publications.

To facilitate discovery of content with this level of statistical detail and better distill this information from across the web, Google now makes it easier to search for datasets. You can click on any of the top three results (see below) to get to the dataset page or you can explore further by clicking “More datasets.” Here is an example:

When users search for datasets in Google Search, they find a dedicated section highlighting pages with dataset descriptions. They can explore many more datasets by clicking on “More datasets” and going to Dataset Search.

Powered by Dataset Search

Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites. Datasets cover many disciplines and topics, including government, scientific, and commercial datasets. Dataset Search shows users essential metadata about datasets and previews of the data where available. Users can then follow the links to the data repositories that host the datasets.

Dataset Search primarily indexes dataset pages on the Web that contain schema.org structured data. The schema.org metadata allows Web page authors to describe the semantics of the page: the entities on the pages and their properties. For dataset pages, schema.org metadata describes key elements of the datasets, such as their description, license, temporal and spatial coverage, and available download formats. In addition to aggregating this metadata and providing easy access to it, Dataset Search normalizes and reconciles the metadata that comes directly from the Web pages.
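To make the metadata elements above concrete, here is a minimal sketch of a schema.org Dataset description serialized as JSON-LD. The property names (name, description, license, temporalCoverage, spatialCoverage, distribution, encodingFormat, contentUrl) come from the schema.org vocabulary. The helper function, the dataset, and all URLs are hypothetical and only for illustration.

```python
import json


def build_dataset_jsonld(name, description, license_url,
                         temporal_coverage, spatial_coverage, downloads):
    """Return a schema.org Dataset description as a JSON-LD string.

    `downloads` is a list of (encoding_format, content_url) pairs.
    This helper is illustrative and not part of any Google tooling.
    """
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        "temporalCoverage": temporal_coverage,
        "spatialCoverage": spatial_coverage,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": fmt,
                "contentUrl": url,
            }
            for fmt, url in downloads
        ],
    }
    return json.dumps(doc, indent=2)


# Hypothetical dataset; every value below is a placeholder.
example = build_dataset_jsonld(
    name="Example rainfall measurements",
    description="Daily rainfall totals from a hypothetical station network.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    temporal_coverage="2020-01-01/2020-12-31",
    spatial_coverage="Example Region",
    downloads=[
        ("text/csv", "https://example.org/rainfall.csv"),
        ("application/json", "https://example.org/rainfall.json"),
    ],
)
print(example)
```

A page author would typically embed the resulting string in the dataset page inside a script element of type application/ld+json.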

If you are a dataset author or provider and want others to find your datasets in Search, make sure that you publish your dataset in a way that makes it discoverable and specifies how others can reuse the data. Specifically, ensure that the Web page that describes the dataset has machine-readable metadata. The easiest way to ensure this is to publish your dataset in an established dataset repository. Some repositories cater to specific research communities, while others are “generalists” (figshare.com, zenodo.org, datadryad.org, kaggle.com, etc.). These repositories automatically include metadata in dataset pages for every dataset, which makes it easy for search engines to discover and include them in specialized result sections, as in the figure above.
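For providers who host their own pages, one rough way to check that a page carries machine-readable metadata is to look for JSON-LD blocks whose type is Dataset. The sketch below uses only the Python standard library. The class name, function name, and sample page are hypothetical, and the check is deliberately simple: it ignores other encodings such as Microdata or RDFa and does not unpack nested @graph structures.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def find_datasets(html):
    """Return the JSON-LD objects in `html` whose @type includes Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found = []
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks rather than failing the check.
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                found.append(item)
    return found


# Hypothetical page used only to exercise the check.
sample_page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example rainfall measurements"}
</script>
</head><body><h1>Example rainfall measurements</h1></body></html>
"""

datasets = find_datasets(sample_page)
print([d["name"] for d in datasets])
```

An empty result suggests the page lacks Dataset markup in JSON-LD form, though it may still describe the dataset in another supported encoding.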

As data sharing continues to grow and evolve, we will continue to make datasets as easy to find, access, and use as any other type of information on the web.

Acknowledgments

We are extremely grateful to the numerous Googlers who contributed to developing and launching this feature, including: Rachel Zax, Damian Biollo, Shiyu Chen, Jonathan Drake, Sunil Vemuri, Stephen Tseou, Amit Bapat, Will Leszczuk, Marc Najork, Sergei Vassilvitskii, Bruno Possas, and Corinna Cortes.
