Add best practices for statvars for Custom DC #683

Merged: kmoscoe merged 87 commits into datacommonsorg:master from kmoscoe:extend on Mar 3, 2026

@kmoscoe kmoscoe commented Feb 25, 2026

This PR makes the following changes:

  • Describes when to reuse existing statvars vs. creating new ones
  • Adds naming convention guidelines (based on go/dcguidebook)
  • Fixes description of groupStatVarsByProperty to cover actual behavior
  • Expands statvar groups section to include existing variables

Staged at http://bullie.svl.corp.google.com:4000

Next PR will cover similar content for custom entities and enums

kmoscoe and others added 30 commits June 11, 2025 19:07
@gemini-code-assist

Summary of Changes

Hello @kmoscoe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the documentation for Custom Data Commons, particularly focusing on best practices for statistical variable management. The changes aim to provide clearer, more detailed guidance for users on defining, naming, and organizing statistical variables, ensuring better data quality and compatibility for potential contributions to the main Data Commons knowledge graph. It also refines explanations of core concepts like entity types and variable grouping.

Highlights

  • Statistical Variable Best Practices: Added comprehensive guidelines on when to reuse existing statistical variables versus creating new ones, emphasizing the importance of reuse for data contribution and accurate linking.
  • Naming Conventions: Introduced a new section detailing standard naming conventions for statistical variables, crucial for those planning to contribute their data to base Data Commons.
  • Enhanced Entity Type Explanations: Clarified the distinction between populationType and aggregation levels within entity types, providing more detailed guidance for users.
  • MCF Variable Definition Guidance: Refined instructions for defining statistical variables in Metadata Content Format (MCF), including a note on when to skip this step and a warning against overriding existing variable definitions.
  • Statistical Variable Grouping: Updated the section on defining statistical variable groups, making it optional and adding instructions for assigning existing variables to new custom groups for better discoverability.
Changelog
  • custom_dc/config.md
    • Updated the description of groupStatVarsByProperty to provide a more detailed explanation and removed an associated screenshot.
  • custom_dc/custom_data.md
    • Modified the link text for defining custom entities from "new custom entities" to "new entities".
    • Expanded the explanation of "Entities and entity types" to differentiate between populationType and aggregation levels.
    • Revised the guidance on statistical variables, stressing the importance of reusing or extending existing variables and the implications for contributing data.
    • Introduced a new section detailing "Naming conventions" for statistical variables, including examples for basic, measurementQualifier, and measurementDenominator cases.
    • Clarified the process for defining statistical variables in MCF, including a note on when to skip this step and a warning against overriding existing variable definitions.
    • Updated the description for populationType to advise on creating new entity types if necessary.
    • Enhanced the statType description by providing more examples of calculated values and a link to the full list of allowable values.
    • Restructured the "Define a statistical variable group" section, making it optional and adding instructions for assigning existing variables to new groups.
    • Adjusted the numbering of subsequent steps in the data preparation guide.
  • custom_dc/quickstart.md
    • Removed an unnecessary blank line within a shell command example.
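
As a rough illustration of the MCF variable definitions mentioned in the changelog, the sketch below renders one statistical variable node as MCF text. The property names `populationType`, `measuredProperty`, and `statType` come from the summary above; the `StatVar` class, the `to_mcf` helper, the example DCID, and the `exampleProgramStatus` constraint are hypothetical and exist only for this example. The `dcid:` prefixes and overall node layout are assumptions about the MCF format, so treat `custom_dc/custom_data.md` as the authoritative reference.

```python
from dataclasses import dataclass, field


@dataclass
class StatVar:
    """Hypothetical container for the fields of one custom statistical variable."""
    dcid: str
    name: str
    population_type: str              # the thing being measured
    measured_property: str            # the property of that thing being measured
    stat_type: str = "measuredValue"  # assumed default for plain measured values
    constraints: dict = field(default_factory=dict)  # constraint property -> value


def to_mcf(sv: StatVar) -> str:
    """Render one StatisticalVariable node as MCF text (assumed layout)."""
    lines = [
        f"Node: dcid:{sv.dcid}",
        "typeOf: dcid:StatisticalVariable",
        f'name: "{sv.name}"',
        f"populationType: dcid:{sv.population_type}",
        f"measuredProperty: dcid:{sv.measured_property}",
        f"statType: dcid:{sv.stat_type}",
    ]
    # Each constraint becomes one extra property line on the node.
    for prop, value in sv.constraints.items():
        lines.append(f"{prop}: dcid:{value}")
    return "\n".join(lines)


# Every value below is made up for illustration; it is not an existing variable.
example = StatVar(
    dcid="Count_Student_EnrolledInExampleProgram",
    name="Students enrolled in an example program",
    population_type="Student",
    measured_property="count",
    constraints={"exampleProgramStatus": "Enrolled"},
)

print(to_mcf(example))
```

Per the guidance summarized above, a definition like this would only be written for a genuinely new variable. Redefining a DCID that already exists in base Data Commons is exactly what the PR warns against.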

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves the documentation for custom Data Commons, focusing on best practices for statistical variables (statvars). The changes clarify when to reuse existing statvars, add naming convention guidelines, and provide better examples. I've found a few areas in the documentation that could be further improved for clarity and correctness, including a typo, a broken link, and a couple of confusing or incorrect examples. My suggestions aim to make the documentation more accurate and easier for users to follow.

4 comment threads on custom_dc/custom_data.md (outdated)
kmoscoe and others added 3 commits February 25, 2026 15:34
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@kmoscoe kmoscoe requested review from ajaits and beets February 25, 2026 23:56

@beets beets left a comment

wow, this is a great update. thanks for breaking down all the requirements.

there is currently a limit to the length of dcid's in custom dc. i'll ping that group to see where we are with it

might want to wait for ajai to approve this pr too

Comment thread custom_dc/config.md
Comment thread custom_dc/custom_data.md

- Schema.org and the base Data Commons knowledge graph define entity types for just about everything in the world. An _entity type_ is a high-level concept, and is derived directly from a [`Class`](https://datacommons.org/browser/Class){: target="_blank"} type. The most common entity types in Data Commons are place types, such as `City`, `Country`, `AdministrativeArea1`, etc. Examples of other entity types are `Hospital`, `PublicSchool`, `Company`, `BusStation`, `Campground`, `Library` etc. It is rare that you would need to create a new entity type, unless you are working in a highly specialized domain.
+ Schema.org and the base Data Commons knowledge graph define entity types for just about everything in the world. An _entity type_ is a high-level concept, and is derived directly from a [`Class`](https://datacommons.org/browser/Class){: target="_blank"} type. Non-place entities are of two types:
+ - The thing you are measuring, known as the `populationType` in Data Commons. Often this is a `Person`, which is a commonly used population in Data Commons. But it could be something else entirely, like the beds in a hospital, the price of a commodity, Olympic medals won by a country, or the surface area of an ocean.
Contributor

would be nice if we linked to the pop types for these examples

Contributor Author

Not sure I understand what you mean here. Do you mean link to the browser pages for these entities?

Contributor

yes, we could link to the "Person" browser page. i'm also wondering if we have poptypes for the others in the examples (beds in a hospital, price of a commodity, etc). in other words, i thought the reader might be curious about how those examples would be modeled.

Contributor Author

The thing is it's a bit confusing: the incoming populationType arcs section shows statvars (https://screenshot.googleplex.com/ASWnrHqVefUfW5r) while it's the domainIncludes section that actually shows the population types: https://screenshot.googleplex.com/BrTds7ghnFBvekP. I think at this point in the doc it's too early to get into these details, which are actually quite confusing.

Contributor

sg to skip then

Comment thread custom_dc/config.md

kmoscoe commented Mar 2, 2026

> wow, this is a great update. thanks for breaking down all the requirements.
>
> there is currently a limit to the length of dcid's in custom dc. i'll ping that group to see where we are with it

any update on that?

> might want to wait for ajai to approve this pr too

@kmoscoe kmoscoe requested review from ajaits and beets March 2, 2026 23:28
@kmoscoe kmoscoe merged commit ad4dbc1 into datacommonsorg:master Mar 3, 2026
2 checks passed
@kmoscoe kmoscoe deleted the extend branch April 14, 2026 21:26
3 participants