Improve search index generation for PHP.net #154

Merged
kamil-tekiela merged 5 commits into php:master on Oct 6, 2024

lhsazevedo
Contributor

Note

This is a companion PR to php/web-php#1084, but it is not dependent on it and can be merged independently.

Intro

PHD index entries can have both short (sdesc) and long (ldesc) descriptions. Often, the short description is empty, leading to the current fallback mechanism in getShortDescription and getLongDescription:

// phpdotnet/phd/Format.php
final public function getLongDescription($id, &$isLDesc = null) {
    if ($this->indexes[$id]["ldesc"]) {
        $isLDesc = true;
        return $this->indexes[$id]["ldesc"];
    } else {
        $isLDesc = false;
        return $this->indexes[$id]["sdesc"];
    }
}
final public function getShortDescription($id, &$isSDesc = null) {
    if ($this->indexes[$id]["sdesc"]) {
        $isSDesc = true;
        return $this->indexes[$id]["sdesc"];
    } else {
        $isSDesc = false;
        return $this->indexes[$id]["ldesc"];
    }
}
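
To make the fallback concrete, here is a minimal standalone sketch of the same logic outside the `Format` class. The function name `shortDescription` and the sample rows are assumptions for illustration only; real PhD index rows carry more fields.

```php
<?php
// Hypothetical sample rows; only the sdesc/ldesc keys come from the code above.
$indexes = [
    'function.strlen'       => ['sdesc' => 'strlen', 'ldesc' => 'Get string length'],
    'language.types.string' => ['sdesc' => '',       'ldesc' => 'Strings'],
];

// Stand-in for Format::getShortDescription(): prefer sdesc, fall back to ldesc.
// $isSDesc reports whether the returned value really was the short description.
function shortDescription(array $indexes, string $id, ?bool &$isSDesc = null): string
{
    $entry = $indexes[$id];
    $isSDesc = (bool) $entry['sdesc']; // same truthiness check as the original
    return $isSDesc ? $entry['sdesc'] : $entry['ldesc'];
}

echo shortDescription($indexes, 'function.strlen', $isSDesc), PHP_EOL;       // strlen
echo shortDescription($indexes, 'language.types.string', $isSDesc), PHP_EOL; // Strings
```

The second call is the case this PR cares about: the entry has no `sdesc`, so the caller only gets a usable title if it goes through the fallback.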

Problem

The search index JSON generation currently bypasses this fallback
mechanism, so entries with an empty short description get an empty title
and do not show up in PHP.net search results.

Solution

This PR addresses the issue by:

  • Using the long description as the title when the short description is empty.
  • Using the parent book title as the long description in such cases.
  • Ignoring non-page entries (non-chunk entries) when generating the search index JSON.
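
The three rules above can be sketched roughly as follows. This is not the code from the PR: the function name `buildSearchIndexes` and the entry keys `chunk`, `element` and `bookTitle` are assumed names chosen for illustration, and the sample data is taken from the example further down.

```php
<?php
/**
 * Sketch only: builds the two structures written to search-index.json and
 * search-description.json from a flat list of index entries.
 * The keys 'chunk', 'element' and 'bookTitle' are hypothetical.
 */
function buildSearchIndexes(array $entries): array
{
    $index = [];
    $descriptions = [];

    foreach ($entries as $entry) {
        // Rule 3: non-chunk entries are page elements, not pages, so skip them.
        if (!$entry['chunk']) {
            continue;
        }

        if ($entry['sdesc'] !== '') {
            $title = $entry['sdesc'];
            $description = $entry['ldesc'];
        } else {
            // Rule 1: promote the long description to the title.
            $title = $entry['ldesc'];
            // Rule 2: use the parent book title as the long description.
            $description = $entry['bookTitle'];
        }

        $index[] = [$title, $entry['id'], $entry['element']];
        $descriptions[$entry['id']] = $description;
    }

    return [$index, $descriptions];
}

$entries = [
    ['id' => 'language.types.string', 'element' => 'sect1', 'chunk' => true,
     'sdesc' => '', 'ldesc' => 'Strings', 'bookTitle' => 'Language Reference'],
    ['id' => 'language.types.string', 'element' => 'sect2', 'chunk' => false,
     'sdesc' => '', 'ldesc' => '', 'bookTitle' => 'Language Reference'],
    ['id' => 'language.types.enumerations', 'element' => 'sect1', 'chunk' => true,
     'sdesc' => '', 'ldesc' => 'Enumerations', 'bookTitle' => 'Language Reference'],
];

[$index, $descriptions] = buildSearchIndexes($entries);
echo json_encode($index), PHP_EOL;
echo json_encode($descriptions), PHP_EOL;
```

Each row keeps the existing `[title, id, element]` layout, so consumers of the JSON see the same structure with fewer rows and non-empty titles.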

Impact

  • Improves search result completeness and accuracy.
  • Reduces search index size by excluding irrelevant entries.

Examples

Currently, "PHP Manual > Language Reference > Types > String" is missing from search results due to an empty short description. This change will use "Strings" (the long description) as the title and "Language Reference" (the parent book title) as the long description.

search-index.json

 [
-  "",
+  "Strings",
   "language.types.string",
   "sect1"
 ],
-[
-  "",
-  "language.types.string",
-  "sect2"
-],

...
  [
-   "",
+   "Enumerations",
    "language.types.enumerations",
    "sect1"
  ],

search-description.json

-  "language.types.string": "Strings",
+  "language.types.string": "Language Reference",
...
-  "language.types.enumerations": "Enumerations",
+  "language.types.enumerations": "Language Reference",

Statistics

| Stat | Before | After | Change |
|---|---|---|---|
| Entries count | 29,765 | 11,256 | -62% |
| Entries lacking title (sdesc) | 19,955 | 0 | -100% |
| Search index size | 1,383 KB | 693 KB | -50% |
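
Figures like these can be reproduced from a decoded index with a few lines of PHP. The sketch below assumes each row has the `[title, id, element]` layout shown in the diffs; the helper names `indexStats` and `percentChange` are hypothetical, and the sample rows come from the diff preview below.

```php
<?php
// Count all rows and the rows whose title (first element) is empty.
function indexStats(array $index): array
{
    $untitled = count(array_filter($index, fn(array $row): bool => $row[0] === ''));
    return ['entries' => count($index), 'untitled' => $untitled];
}

// Relative change from $before to $after, rounded to a whole percent.
function percentChange(int $before, int $after): int
{
    return (int) round(($after - $before) / $before * 100);
}

$before = [
    ['', 'copyright', 'legalnotice'],
    ['', 'index', 'info'],
    ['', 'index', 'book'],
];
$after = [
    ['Copyright', 'copyright', 'legalnotice'],
];

print_r(indexStats($before)); // entries: 3, untitled: 3
print_r(indexStats($after));  // entries: 1, untitled: 0

echo percentChange(29765, 11256), PHP_EOL; // -62
echo percentChange(19955, 0), PHP_EOL;     // -100
echo percentChange(1383, 693), PHP_EOL;    // -50
```

To run it against a real build, the sample arrays could be replaced with `json_decode(file_get_contents($path), true)`, where `$path` points at wherever the generated search-index.json ends up (that location is not stated here).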

Generated Search Index Diff Preview

Only the first 100 lines are shown for brevity.

 [
   [
-    "",
+    "Copyright",
     "copyright",
     "legalnotice"
   ],
   [
-    "",
-    "index",
-    "info"
-  ],
-  [
-    "",
-    "index",
-    "book"
-  ],
-  [
-    "",
-    "preface",
-    "section"
-  ],
-  [
-    "",
+    "Preface",
     "preface",
     "preface"
   ],
   [
-    "",
-    "intro-whatis",
-    "example"
-  ],
-  [
-    "",
+    "What is PHP?",
     "intro-whatis",
     "section"
   ],
   [
-    "",
+    "What can PHP do?",
     "intro-whatcando",
     "section"
   ],
   [
-    "",
+    "Introduction",
     "introduction",
     "chapter"
   ],
   [
-    "",
+    "What do I need?",
     "tutorial.requirements",
     "section"
   ],
   [
-    "",
-    "tutorial.firstpage",
-    "example"
-  ],
-  [
-    "",
-    "tutorial.firstpage",
-    "example"
-  ],
-  [
-    "",
+    "Your first PHP-enabled page",
     "tutorial.firstpage",
     "section"
   ],
   [
-    "",
-    "tutorial.useful",
-    "example"
-  ],
-  [
-    "",
-    "tutorial.useful",
-    "example"
-  ],
-  [
-    "",
-    "tutorial.useful",
-    "example"
-  ],
-  [
-    "",
+    "Something Useful",
     "tutorial.useful",
     "section"
   ],
   [
-    "",
-    "tutorial.forms",
-    "example"
-  ],

Improves the search indexes generated by the PHP-Web format by:
- Adding short descriptions to entries that lack them
- Skipping non-chunk entries (page elements)
@kamil-tekiela
Member

Will this break the search if we merge it now?

@lhsazevedo
Copy link
Contributor Author

No. It is safe to merge as it doesn't alter the JSON structure.

@kamil-tekiela kamil-tekiela merged commit 673b2da into php:master Oct 6, 2024
9 checks passed
@lhsazevedo lhsazevedo deleted the improve-search-index branch October 6, 2024 13:55
@lhsazevedo
Contributor Author

Thank you!

@kamil-tekiela
Member

Just curious. You said it would not affect the current functionality, right? Does that mean you changed something in the other PR that will avail of this change?

@lhsazevedo
Contributor Author

I said that it wouldn't break the current search, but you can already see the improved results on php.net.

Unfortunately, the current UI hides them under the last result group ("Other Matches"). You'll probably need to scroll down the menu to see it. You may also need to clear your local storage because the search index is cached for two weeks.

Try searching for "syntax", "types" or "operators". You should see some language reference results under "Other Matches". Those pages were missing from the index before because they don't have an sdesc.

2 participants