PerDocData

Per Doc Data
Per-Document Signal

GoogleApi.ContentWarehouse.V1.Model.PerDocData

SEO Impact: 9 out of 10 (Critical)
Make sure you read the comments at the bottom before you add any new field.

NB: As noted in the comments, this protocol buffer is used in both indexing and serving. In mustang serving implementations we only decode perdocdata during the search phase, and so this protocol should only contain data used during search. See mustang/repos_www/attachments.proto:{MustangBasicInfo,MustangContentInfo} for protocols used during search and/or docinfo.

The "next available tag" note is deprecated; use this command instead (and look for commented-out fields): blaze-bin/net/proto_compiler/protocol-compiler --freetags indexer/perdocdata/perdocdata.proto

Next tag: 225

SEO Analysis

AI Generated

Stores per-document data signals that are maintained for each indexed page. These document-level signals are core inputs to Google's ranking algorithms and may include quality scores, topical classifications, and other per-page assessments. Changes to these signals can directly affect a page's ranking potential.

Actionable Insights for SEOs

  • Monitor for changes in rankings that may correlate with updates to this system
  • Consider how your content strategy aligns with what this signal evaluates

Attributes

132 attributes
scienceDoctype: integer()
Default: nil

Scholar/Science Document type:
  • <0: not a Science Document (default)
  • 0: Science doc, fully visible
  • >0: Science doc with limited visibility; the number is the count of visible terms

ScaledExptIndyRank2: integer()
Default: nil

experimental

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityVidyaVideoLanguageVideoLanguage.t

Audio-based language classified by Automatic Language Identification (only for watch pages).

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.PhilPerDocData.t

uacSpamScore: integer()
Default: nil

The uac spam score is represented in 7 bits, going from 0 to 127. Threshold is 64. Score >= 64 is considered as uac spam.
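A minimal sketch of the threshold rule described above, in Python. The function name and the handling of an unset (nil) field are assumptions made for illustration; only the 7-bit range and the threshold of 64 come from the field description.

```python
UAC_SPAM_THRESHOLD = 64  # threshold stated in the field description
MAX_7BIT_SCORE = 127     # largest value a 7-bit score can hold


def is_uac_spam(score):
    """Return True when a 7-bit uac spam score meets the documented threshold.

    Treating an unset score (None) as "not spam" is an assumption of this
    sketch, not documented behavior.
    """
    if score is None:
        return False
    if not 0 <= score <= MAX_7BIT_SCORE:
        raise ValueError(f"score {score} is outside the 7-bit range 0..127")
    return score >= UAC_SPAM_THRESHOLD
```

Under these assumptions, is_uac_spam(64) is true and is_uac_spam(63) is false.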

DEPRECATEDAuthorObfuscatedGaia: string
Default: nil
Full type: list(String.t)

The obfuscated google profile gaia id(s) of the author(s) of the document. This field is deprecated, use the string version.

spamtokensContentScore: number()
Default: nil

For SpamTokens content scores. Used in SiteBoostTwiddler to determine whether a page is UGC Spam. See go/spamtokens-dd for details.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.RepositoryWebrefWebrefMustangAttachment.t

WebRef entities associated to the document. See go/webref for details.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.PremiumPerDocData.t

Additional metadata for Premium document in the Google index.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.SpamMuppetjoinsMuppetSignals.t

Contains hacked site signals which will be used in query time joins. As of Oct'19, the field is stored in a separate corpus. It'll only be populated for in-flight requests between retrieve and full-score in perdocdata. So no extra storage is needed on muppet side.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.SocialPersonalizationKnexAnnotation.t

For indexing k'nex annotations for FreshDocs.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.SmartphonePerDocData.t

Additional metadata for smartphone documents in the Google index.

semanticDateConfidence: integer()
Default: nil

DEPRECATED: semantic_date_confidence replaced by semantic_date_info.

trendspamScore: integer()
Default: nil

For now, the count of matching trendspam queries.

ScaledSpamScoreYoram: integer()
Default: nil

Spamscores are represented as a 7-bit integer, going from 0 to 127.

numUrls: integer()
Default: nil

Total number of urls encoded in the url section = # of alternate urls + 1

datesInfo: string
Default: nil
Full type: String.t

Stores dates-related info (e.g. page is old based on its date annotations). Used in FreshnessTwiddler. Use encode/decode functions from quality/timebased/utils/dates-info-helper-inl.h

pagerank2: number()
Default: nil

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrData.t

Stripped site-level signals, not present in the explicit nsr_* fields, nor compressed_quality_signals.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityFringeFringeQueryPriorPerDocData.t

Contains encoded FringeQueryPrior information. Unlikely to be meaningful for anyone other than fringe-ranking team. Contact fringe-ranking team if any questions, but do NOT use directly without consulting them.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.KaltixPerDocData.t

ymylHealthScore: integer()
Default: nil

Stores scores of ymyl health classifier as defined at go/ymyl-classifier-dd. To use this field, you MUST join g/pq-classifiers-announce and add your use case at http://shortn/_nfg9oAldou.

authorObfuscatedGaiaStr: string
Default: nil
Full type: list(String.t)

lastSignificantUpdate: string
Default: nil
Full type: String.t

Last significant update of the document. This is sourced from the quality_timebased.LastSignificantUpdate proto as computed by the LSUSelector from various signals. The value is a UNIX timestamp in seconds.
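As an illustration of reading this value, the Python sketch below converts the string-typed field into a timezone-aware datetime. The function name is hypothetical, and the assumption that the string holds a plain base-10 integer is inferred from the "UNIX timestamp in seconds" wording rather than confirmed by the source.

```python
from datetime import datetime, timezone


def parse_last_significant_update(raw):
    """Convert a lastSignificantUpdate value to a UTC datetime.

    Assumes `raw` is a decimal string of UNIX seconds; returns None when
    the field is unset.
    """
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)
```

For example, parse_last_significant_update("86400") corresponds to 1970-01-02 00:00:00 UTC.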

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.SpamBrainData.t

Host-v1 sitechunk level scores coming from spambrain.

DEPRECATEDQuarantineWhitelist: boolean()
Default: nil

tundraClusterId: integer()
Default: nil

This field is propagated to shards. Stores clustering information on a site level for the Tundra project. This field is deprecated - use the equivalent field inside nsr_data_proto instead.

bodyWordsToTokensRatioTotal: number()
Default: nil

homepagePagerankNs: integer()
Default: nil

The page-rank of the homepage of the site. Copied from the cdoc.doc().pagerank_ns() of the homepage.

topPetacatTaxId: integer()
Default: nil

Top petacat of the site. Used in SiteboostTwiddler to determine result/query matching.

OriginalContentScore: integer()
Default: nil

The original content score is represented in 7 bits, going from 0 to 127. Only pages with little content have this field. The actual original content score ranges from 0 to 512. It is encoded with quality_q2::OriginalContentUtil::EncodeOriginalContentScore(). To decode the value, use quality_q2::OriginalContentUtil::DecodeOriginalContentScore().

contentAttributions: ContentAttributions
Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.ContentAttributions.t

webmirrorEcnFp: string
Default: nil
Full type: String.t

DocLevelSpamScore: integer()
Default: nil

The document spam score is represented in 7 bits, going from 0 to 127.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.UrlPoisoningData.t

Contains url poisoning data for suppressing spam documents.

Default: nil
Full type: list(GoogleApi.ContentWarehouse.V1.Model.PerDocDebugEvent.t)

Free form debug info. NB2: consider carefully what to save here. It's easy to eat lots of gfs space with debug info that nobody needs...

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.ImageQualitySensitiveMediaOrPeopleEntities.t

Contains the mids of the 5 most topical entities annotated with selected KG collections. This information is currently used on Image Search to detect cases where results converged to mostly a single person or media entity. More details: go/result-set-convergence.

scaledSelectionTierRank: integer()
Default: nil

Selection tier rank is a language normalized score ranging from 0-32767 over the serving tier (Base, Zeppelins, Landfills) for this document. This is converted back to fractional position within the index tier by scaled_selection_tier_rank/32767.
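The conversion described above amounts to a single division. The Python sketch below shows it under the assumption that the raw value always lies in the documented 0-32767 range; the function name is hypothetical.

```python
MAX_SELECTION_TIER_RANK = 32767  # upper bound stated in the field description


def selection_tier_fraction(scaled_rank):
    """Map scaled_selection_tier_rank back to a fractional position in 0.0..1.0."""
    if not 0 <= scaled_rank <= MAX_SELECTION_TIER_RANK:
        raise ValueError(f"rank {scaled_rank} is outside 0..32767")
    return scaled_rank / MAX_SELECTION_TIER_RANK
```

A rank of 0 maps to 0.0 and a rank of 32767 maps to 1.0. The field description does not say which end of the scale is the better position, so this sketch makes no claim about that.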

pageTags: list(integer())
Default: nil

smearingMaxTotalOffdomainAnchors: integer()
Default: nil

pagerank: number()
Default: nil

Experimental pageranks (DEPRECATED; only pagerank in MustangBasicInfo is used).

QuarantineInfo: integer()
Default: nil

bitmask of QuarantineBits (or'd together) used to store quarantine related information. For example: QUARANTINE_WHITELIST | QUARANTINE_URLINURL.
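To illustrate how an or'd bitmask like this is typically read, here is a Python sketch. The two flag names come from the description above, but their bit positions are NOT given in the source; the values below are placeholders chosen purely for illustration.

```python
from enum import IntFlag


class QuarantineBits(IntFlag):
    # Hypothetical bit positions: the real values are not documented here.
    QUARANTINE_WHITELIST = 1 << 0
    QUARANTINE_URLINURL = 1 << 1


def has_quarantine_bit(quarantine_info, bit):
    """Return True when `bit` is set in the integer QuarantineInfo bitmask."""
    return bool(quarantine_info & bit)


# Build a mask the way the description does: flags or'd together.
example_mask = QuarantineBits.QUARANTINE_WHITELIST | QuarantineBits.QUARANTINE_URLINURL
```

With these placeholder values, has_quarantine_bit(int(example_mask), QuarantineBits.QUARANTINE_URLINURL) is true.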

rosettaLanguages: string
Default: nil
Full type: list(String.t)

Top two document language BCP-47 codes as generated by the RosettaLanguageAnnotator in the decreasing order of probability.

freshnessEncodedSignals: string
Default: nil
Full type: String.t

Stores freshness and aging related data, such as time-related quality metrics predicted from url-pattern level signals. Use the encoding/decoding API in quality/freshness/docclassifier/aging/encoded-pattern-signals.h. This field is deprecated.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.ImagePerDocData.t

videoCorpusDocid: string
Default: nil
Full type: String.t

queriesForWhichOfficial: OfficialPagesQuerySet
Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.OfficialPagesQuerySet.t

The set of (query, country, language) triples for which this document is considered to be the official page. For example, www.britneyspears.com would be official for ("britney spears", "us", 0) and others (0 is English).

nsrIsCovidLocalAuthority: boolean()
Default: nil

This field is propagated to shards. In addition, it is populated at serving time by go/web-signal-joins. This field is deprecated - use the equivalent field inside nsr_data_proto instead.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.LogsProtoIndexingCrawlerIdCrawlerIdProto.t

For crawler-ID variations, the crawling context applied to the document. See go/url, and the description in google3/indexing/crawler_id

ScaledSpamScoreEric: integer()
Default: nil

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.BiasingPerDocData.t

ScaledExptSpamScoreEric: integer()
Default: nil

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualitySherlockKnexAnnotation.t

For indexing v2 k'nex, see go/knex-v2-doc-annotation for details.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.MobilePerDocData.t

Additional metadata for lowend mobile documents in the Google index.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.BookCitationPerDocData.t

The book citation data for each web page; the average size is about 10 bytes.

semanticDate: integer()
Default: nil

SemanticDate, estimated date of the content of a document based on the contents of the document (via parsing), anchors and related documents. Date is encoded as a 32-bits UNIX date (1970 Jan 1 epoch). Confidence is encoded using a SemanticDate specific format. For details of encoding, please refer to quality/freshness/docclassifier/semanticdate/public/semantic_date.proto

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.BiasingPerDocData2.t

A replacement for BiasingPerDocData that is more space efficient. Once this is live everywhere, biasingdata will be deprecated.

ymylNewsScore: integer()
Default: nil

Stores scores of ymyl news classifier as defined at go/ymyl-classifier-dd. To use this field, you MUST join g/pq-classifiers-announce and add your use case at http://shortn/_nfg9oAldou.

saftLanguageInt: list(integer())
Default: nil

Top document language as generated by SAFT LangID. For now we store bare minimum: just the top 1 language value, converted to the language enum, and only when different from the first value in 'languages'.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.RepositoryAnnotationsRdfaRdfaRichSnippetsApplication.t

Application information associated to the document.

domainAge: integer()
Default: nil

16-bit

lastSignificantUpdateInfo: string
Default: nil
Full type: String.t

Metadata about last significant update. Currently this only encodes the quality_timebased.LastSignificantUpdate.source field which contains the info on the source of the signal. NOTE: Please do not read the value directly. Use helpers from quality/timebased/lastsignificantupdate/lsu-helper.h instead.

pagerank1: number()
Default: nil

spamCookbookAction: SpamCookbookAction
Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.SpamCookbookAction.t

Actions based on Cookbook recipes that match the page.

compressedUrl: string
Default: nil
Full type: String.t

Compressed URL string used for SETI.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.Proto2BridgeMessageSet.t

This field is available only in the docjoins: it is cleared before building per-doc data in both Mustang and Teragoogle. (MessageSet is inefficient in space for serving data.) Use this for all new fields that aren't needed during serving. Currently this field contains:

  • UrlSignals for the document level spam classifier (when the doclevelspamscore is set).
  • PerDocLangidData and realtimespam::ClassifierResult for the document level fresh spam classifier (when the doc-level fresh spam score is generated).
  • MicroblogDocQualitySignals for document-level microblog spam classifier. This only exists in Firebird for now.
  • spam_buckets::BucketsData for a document-structure hash.

This field is non-personal since the personal fields in MessageSet are not populated in production.

socialgraphNodeNameFp: string
Default: nil
Full type: String.t

For Social Search we store the fingerprint of the SG node name. This is used in one of the superroot's PRE_DOC twiddlers as a lookup key for the full Social Search data. PRE_DOC = twiddlers firing before the DocInfo request is sent to the mustang backend.

urlAfterRedirectsFp: string
Default: nil
Full type: String.t

These two fingerprints are used for de-duping results in a twiddler. They should only be populated by freshdocs, and will only be present for documents that are chosen to be canonicals in a cluster whose previous canonical is also in the index. Additionally, url_after_redirects_fp is only present if it is different from a fingerprint of the URL.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.IndexingDupsLocalizedLocalizedCluster.t

Information on localized clusters, which is the relationship of translated and/or localized pages.

pageregions: string
Default: nil
Full type: String.t

String that encodes the position ranges for different regions of the document. See "indexer/pageregion.h" for an explanation, and how to decode the string.

KeywordStuffingScore: integer()
Default: nil

The keyword stuffing score is represented in 7 bits, going from 0 to 127.

spambrainTotalDocSpamScore: number()
Default: nil

The document total spam score identified by spambrain, going from 0 to 1.

noimageframeoverlayreason: integer()
Default: nil

If not 0, we should not show the image in overlay mode in image snippets

scienceHoldingsIds: string
Default: nil
Full type: list(String.t)

Deprecated 2016/01/14.

crawlPagerank: integer()
Default: nil

This field is used internally by the docjoiner to forward the crawl pageranks from original canonicals to canonicals we actually chose; outside sources should not set it, and it should not be present in actual docjoins or the index.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.BlogPerDocData.t

nsrIsVideoFocusedSite: boolean()
Default: nil

This field is propagated to shards. It will also be populated at serving time by go/web-signal-joins (see b/170607253). Bit indicating whether this site is video-focused, but not hosted on any major known video hosting domains. This field is deprecated - use the equivalent field inside nsr_data_proto instead.

ScaledExptSpamScoreYoram: integer()
Default: nil

spamrank: integer()
Default: nil

The spamrank measures the likelihood that this document links to known spammers. Its value is between 0 and 65535.

compressedQualitySignals: CompressedQualitySignals
Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.CompressedQualitySignals.t

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.VideoPerDocData.t

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.S3AudioLanguageS3AudioLanguage.t

Primary video's audio language classified by S3 based Automatic Language Identification (only for watch pages).

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.WatchpageLanguageWatchPageLanguageResult.t

Language classified by the WatchPageLanguage Model (go/watchpage-language). Only present for watch pages.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityCalypsoAppsLink.t

AppsLink contains Android application IDs in outlinks. It is used to improve results ranking within applications universal. See http://go/apps-universal for the project details.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.IndexingMobileInterstitialsProtoDesktopInterstitials.t

Contains desktop interstitials signal for VOLT ranking change.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.WeboftrustLiveResultsDocAttachments.t

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.CrowdingPerDocData.t

nsrSitechunk: string
Default: nil
Full type: String.t

SiteChunk computed for nsr. In some cases it can use more information than just url (e.g. youtube channels). See NsrAnnotator for details. If sitechunk is longer than --populate_nsr_sitechunk_max_length (default=100), it will not get populated. This field might be compressed and needs to be decoded with quality_nsr::util::DecodeNsrSitechunk. See go/nsr-chunks for more details. This field contains only nontrivial primary chunks.

originalTitleHardTokenCount: integer()
Default: nil

The number of hard tokens in the title.

hostAge: integer()
Default: nil

The earliest firstseen date of all pages in this host/domain. These data are used in a twiddler to sandbox fresh spam at serving time. The value is 16 bits wide and stores the day number after 2005-12-31; all earlier dates are set to 0. If this url's host_age == domain_age, then domain_age is omitted. Please use //spam/content/siteage-util.h to convert between day numbers and epoch seconds.

Regarding usage of sentinel values: we would like to check whether a value exists in the scoring bundle when it is used in Ranklab AST. A sentinel value lets us tell whether the field exists or holds the sentinel (in the case it does not exist).
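The day-number encoding can be sketched in Python as follows. This is not the siteage-util.h helper, whose exact behavior is not shown in the source. The sketch assumes that day 1 is 2006-01-01 (one day after 2005-12-31), and returning None for 0 is a choice made here to reflect that 0 only means "on or before the base date".

```python
from datetime import date, timedelta

HOST_AGE_BASE_DATE = date(2005, 12, 31)  # base date named in the description
MAX_16BIT = 65535


def host_age_to_date(day_number):
    """Convert a 16-bit host_age day number to a calendar date.

    Returns None for 0, because the description says every date up to the
    base date is collapsed to 0 and the exact day is not recoverable.
    """
    if not 0 <= day_number <= MAX_16BIT:
        raise ValueError(f"day number {day_number} does not fit in 16 bits")
    if day_number == 0:
        return None
    return HOST_AGE_BASE_DATE + timedelta(days=day_number)
```

Under these assumptions, host_age_to_date(1) is 2006-01-01 and host_age_to_date(366) is 2007-01-01.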

inNewsstand: boolean()
Default: nil

This field indicates whether the document is in the newsstand corpus.

origin: integer()
Default: nil

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityRichsnippetsAppsProtosLaunchAppInfoPerDocData.t

Info on how to launch a mobile app to consume this document's content, if applicable (see go/calypso).

eventsDate: string
Default: nil
Full type: list(String.t)

Date for Events. A web page might list multiple events with different dates. We only take one date (start date) per event.

homePageInfo: integer()
Default: nil

GibberishScore: integer()
Default: nil

The gibberish score is represented in 7 bits, going from 0 to 127.

toolbarPagerank: integer()
Default: nil

A copy of the value stored in /namespace/indexing/wwwglobal//fakepr/* for this document. A value of quality_bakery::FakeprUtils::kUnknownToolbarPagerank indicates that we don't have toolbar pagerank for this document. A value between 0 and 10 (inclusive) means that this is the toolbar pagerank of the page. Finally, if this value is not set it means that the toolbar pagerank is equivalent to: quality_bakery::FakeprUtils::EstimatePreDemotionFromPagerankNearestSeeds( basic_info.pagerank_ns()) called on the MustangBasicInfo attachment for the same document.

freshboxArticleScores: integer()
Default: nil

Stores scores of freshness-related classifiers: freshbox article score, live blog score and host-level article score. The encoding/decoding API is in quality/freshness/freshbox/goldmine/freshbox_annotation_encoder.h. To use this field, you MUST join g/pq-classifiers-announce and add your use case at http://shortn/_RYXS2lX2IV.

WhirlpoolDiscount: number()
Default: nil

ScaledExptIndyRank3: integer()
Default: nil

experimental

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.ToolBarPerDocData.t

nsrIsElectionAuthority: boolean()
Default: nil

This field is propagated to shards. It will also be populated at serving time by go/web-signal-joins (see b/168114815). This field is deprecated - use the equivalent field inside nsr_data_proto instead.

onsiteProminence: integer()
Default: nil

Onsite prominence measures the importance of the document within its site. It is computed by propagating simulated traffic from the homepage and high craps click pages. It is a 13-bit int.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityTravelGoodSitesData.t

This field stores information about good travel sites.

IsAnchorBayesSpam: boolean()
Default: nil

Is this document considered spam by the anchor bayes classifier?

isHotdoc: boolean()
Default: nil

Set by the FreshDocs instant doc joiner. See //indexing/instant/hotdocs/README and http://go/freshdocs-hotdocs.

commercialScore: number()
Default: nil

A measure of commerciality of the document. Score > 0 indicates the document is commercial (i.e. sells something). Computed by repository/pageclassifiers/parsehandler-commercial.cc.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityOrbitAsteroidBeltDocumentIntentScores.t

For indexing Asteroid Belt intent scores. See go/asteroid-belt for details.

TagPageScore: integer()
Default: nil

Tag-site-ness of a page, represented in 7 bits, ranging from 0 to 100. A smaller value means a worse tag page.

geodata: string
Default: nil
Full type: String.t

geo data; approx 24 bytes for 23M U.S. pages

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.OceanPerDocData.t

28 bytes per page, only in the Ocean index

pagerank0: number()
Default: nil

SpamWordScore: integer()
Default: nil

The spamword score is represented in 7-bits, going from 0 to 127.

ScaledIndyRank: integer()
Default: nil

The independence rank is represented as a 16-bit integer, which is multiplied by (max_indy_rank / 65536) to produce actual independence rank values. max_indy_rank is typically 0.84.
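The scaling rule above can be written out directly. In this Python sketch the function name is hypothetical, and 0.84 is used as the default only because the description calls it the typical value of max_indy_rank.

```python
INDY_RANK_SCALE = 65536        # divisor stated in the field description
TYPICAL_MAX_INDY_RANK = 0.84   # "typically 0.84" per the description


def decode_indy_rank(scaled_rank, max_indy_rank=TYPICAL_MAX_INDY_RANK):
    """Recover the independence rank from its 16-bit scaled representation."""
    if not 0 <= scaled_rank < INDY_RANK_SCALE:
        raise ValueError(f"scaled rank {scaled_rank} does not fit in 16 bits")
    return scaled_rank * (max_indy_rank / INDY_RANK_SCALE)
```

With the typical maximum, a scaled value of 32768 decodes to roughly 0.42, half of max_indy_rank.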

bodyWordsToTokensRatioBegin: number()
Default: nil

The body words over tokens ratios for the beginning part and whole doc. NB: To save space, field body_words_to_tokens_ratio_total is not set if it has the same value as body_words_to_tokens_ratio_begin (e.g., short docs).
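Because the total ratio is omitted when it equals the beginning ratio, a reader has to fall back to the beginning value. A minimal Python sketch of that fallback, with a hypothetical function name and with None standing in for an unset field:

```python
def effective_total_ratio(ratio_begin, ratio_total):
    """Return the whole-document ratio, falling back to the beginning ratio.

    Per the description, body_words_to_tokens_ratio_total is left unset when
    it has the same value as body_words_to_tokens_ratio_begin.
    """
    return ratio_begin if ratio_total is None else ratio_total
```

So effective_total_ratio(0.5, None) yields 0.5, while effective_total_ratio(0.5, 0.7) yields 0.7.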

topPetacatWeight: number()
Default: nil

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityCopiaFireflySiteSignal.t

Contains Site signal information for Firefly ranking change. See http://ariane/313938 for more details.

titleHardTokenCountWithoutStopwords: integer()
Default: nil

Number of hard tokens originally in title without counting the stopwords.

hostNsr: integer()
Default: nil

Site rank computed for host-level sitechunks. This value encodes nsr, site_pr and new_nsr. See quality_nsr::util::ConvertNsrDataToHostNsr and go/nsr. This field is deprecated - use the equivalent field inside nsr_data_proto instead.

semanticDateInfo: integer()
Default: nil

Info is encoded using a SemanticDate specific format. Contains confidence scores for day/month/year components as well as various meta data required by the freshness twiddlers.

languages: list(integer())
Default: nil

Plausible languages in order of decreasing plausibility. Language values are small, i.e. < 127, so this should compress to one byte each.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.GroupsPerDocData.t

16 bytes of groups2 data: used only in groups2 index

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.CountryCountryAttachment.t

This field stores the country information for the document in the form of CountryAttachment.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityGeoBrainlocBrainlocAttachment.t

Brainloc contains location information for the document. See ariane/273189 for details.

ScaledLinkAgeSpamScore: integer()
Default: nil

(End of the DEPRECATED block in the source proto.) Link age score is represented as a 7-bit integer, going from 0 to 127.

ScaledExptIndyRank: integer()
Default: nil

Start of the DEPRECATED block in the source proto: please do not use these fields in any new code. This field is experimental.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.ShingleInfoPerDocData.t

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.QualityProductProductSiteData.t

This field stores information about product sites.

spambrainDomainSitechunkData: SpamBrainData
Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.SpamBrainData.t

Domain sitechunk level scores coming from spambrain.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.IndexingMobileVoltVoltPerDocData.t

Contains page UX signals for VOLT ranking change. See http://ariane/4025970 for more details.

timeSensitivity: integer()
Default: nil

Encoded Document Time Sensitivity signal.

Default: nil
Full type: GoogleApi.ContentWarehouse.V1.Model.IndexingDocjoinerServingTimeClusterIds.t

A set of cluster ids which are generated in Alexandria and used to de-dup results at serving time.