Duplicated titles and descriptions in search index for chunks without parent book #159

Open
lhsazevedo opened this issue Oct 8, 2024 · 0 comments · May be fixed by #160

Contributor

lhsazevedo commented Oct 8, 2024

Background

In #154, we addressed the issue of missing pages in the search index. The root cause was that some index entries lacked a title property (sdesc). We resolved this by applying the same solution used for the manual content: using the description (ldesc) as the title.

To avoid duplicating text in both the title and description fields, we now pull the description from the parent <book>. For example:

Types / Language Reference

In this example, the title "Types" was taken from the page description, while the new description ("Language Reference") comes from the parent <book>. You can see the implementation here:

if ($index["sdesc"] === "" && $index["ldesc"] !== "") {
    $index["sdesc"] = $index["ldesc"];
    $parentId = $index['parent_id'];
    // isset() to guard against undefined array keys, either for root
    // elements (no parent) or in case the index structure is broken.
    while (isset($this->indexes[$parentId])) {
        $parent = $this->indexes[$parentId];
        if ($parent['element'] === 'book') {
            $index["ldesc"] = Format::getLongDescription($parent['docbook_id']);
            break;
        }
        $parentId = $parent['parent_id'];
    }
}

Issue

Some entries, like extension main pages (e.g. book.strings, book.zip) and top-level pages (e.g. copyright, getting-started, security), don’t have a parent <book>. In these cases, the description is being reused as the title, resulting in duplicate content:

book.strings
book.zip
copyright
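
The duplication can be reproduced on a toy index. The sketch below is only an illustration: the array layout, the ids, the titles, and the helper name fillTitleAndDescription() are assumptions, and the parent's sdesc stands in for the real Format::getLongDescription() call.

```php
<?php
// Toy index: ids, titles and nesting are illustrative assumptions,
// not the real PhD index contents.
$indexes = [
    'index'          => ['element' => 'set',         'parent_id' => '',        'sdesc' => 'PHP Manual',         'ldesc' => ''],
    'langref'        => ['element' => 'book',        'parent_id' => 'index',   'sdesc' => 'Language Reference', 'ldesc' => ''],
    'language.types' => ['element' => 'chapter',     'parent_id' => 'langref', 'sdesc' => '',                   'ldesc' => 'Types'],
    'copyright'      => ['element' => 'legalnotice', 'parent_id' => 'index',   'sdesc' => '',                   'ldesc' => 'Copyright'],
];

// Hypothetical helper mirroring the loop quoted in the Background section:
// copy ldesc into sdesc, then replace ldesc only if a parent <book> exists.
function fillTitleAndDescription(array $indexes, string $id): array
{
    $index = $indexes[$id];
    if ($index['sdesc'] === '' && $index['ldesc'] !== '') {
        $index['sdesc'] = $index['ldesc'];
        $parentId = $index['parent_id'];
        while (isset($indexes[$parentId])) {
            $parent = $indexes[$parentId];
            if ($parent['element'] === 'book') {
                // Stand-in for Format::getLongDescription().
                $index['ldesc'] = $parent['sdesc'];
                break;
            }
            $parentId = $parent['parent_id'];
        }
    }
    return $index;
}

$types = fillTitleAndDescription($indexes, 'language.types');
$copyright = fillTitleAndDescription($indexes, 'copyright');
```

Here $types ends up with different title and description, while $copyright keeps "Copyright" in both fields because the walk never reaches a <book>.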

Proposed fix

While some entries lack a parent <book>, every entry has at least one parent <set>. The root entry itself is a set called "PHP Manual".

The proposed solution is to fall back to the first <set> in the hierarchy when no <book> is found:

book.strings
book.zip
copyright
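
One way the fallback could look is sketched below. This is not the implementation from the linked PR: the function name findAncestorDescription(), the toy index, and the 'title' field (used in place of Format::getLongDescription()) are all assumptions, and "first <set>" is read here as the nearest enclosing set.

```php
<?php
// Toy index: ids, titles and nesting are illustrative assumptions.
$indexes = [
    'index'          => ['element' => 'set',         'parent_id' => '',        'title' => 'PHP Manual'],
    'funcref'        => ['element' => 'set',         'parent_id' => 'index',   'title' => 'Function Reference'],
    'book.strings'   => ['element' => 'book',        'parent_id' => 'funcref', 'title' => 'Strings'],
    'langref'        => ['element' => 'book',        'parent_id' => 'index',   'title' => 'Language Reference'],
    'language.types' => ['element' => 'chapter',     'parent_id' => 'langref', 'title' => 'Types'],
    'copyright'      => ['element' => 'legalnotice', 'parent_id' => 'index',   'title' => 'Copyright'],
];

// Walk up the ancestors of $id. Prefer the nearest <book>; if none is
// found, fall back to the nearest <set> seen along the way.
function findAncestorDescription(array $indexes, string $id): ?string
{
    $nearestSetTitle = null;
    $parentId = $indexes[$id]['parent_id'];
    // isset() stops the walk at the root or on a broken parent link.
    while (isset($indexes[$parentId])) {
        $parent = $indexes[$parentId];
        if ($parent['element'] === 'book') {
            return $parent['title'];
        }
        if ($parent['element'] === 'set' && $nearestSetTitle === null) {
            $nearestSetTitle = $parent['title'];
        }
        $parentId = $parent['parent_id'];
    }
    return $nearestSetTitle;
}

$forTypes = findAncestorDescription($indexes, 'language.types');
$forStrings = findAncestorDescription($indexes, 'book.strings');
$forCopyright = findAncestorDescription($indexes, 'copyright');
$forRoot = findAncestorDescription($indexes, 'index');
```

With this toy data, pages under a book keep the book title as description, book.strings picks up its enclosing set, copyright falls back to the root set, and the root itself has no ancestor to borrow from.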

I have a working implementation and will submit a PR soon.

lhsazevedo added a commit to lhsazevedo/phd that referenced this issue Oct 8, 2024
This commit enhances the search index generation process by providing
more meaningful descriptions for entries that lack a parent <book>
element. Additionally, refactors writeJsonIndex() into smaller methods.

Fixes php#159
lhsazevedo linked a pull request on Oct 8, 2024 that will close this issue
lhsazevedo changed the title from "Duplicate titles and descriptions in search index for chunks without parent book" to "Duplicated titles and descriptions in search index for chunks without parent book" on Oct 9, 2024