Unable to forceLanguage via JS API #551

Closed
c-w opened this issue Feb 1, 2024 · 6 comments
Labels
bug (Something isn't working), Pagefind CLI (The CLI responsible for indexing content)


c-w commented Feb 1, 2024

Problem

Using the JS API to create an index, forceLanguage doesn't seem to have any effect.

Repro

Steps:

  1. Save the following file as test.mjs
  2. Run node test.mjs
  3. Open the browser at http://localhost:3000/index.html
  4. Search for "shit"

Actual behavior:

  • Document containing "shy" is returned (stemming is applied)

Expected behavior:

  • No document is returned (stemming isn't applied)

test.mjs:

import http from "node:http";
import { rm, mkdir, writeFile } from "node:fs/promises";
import { createReadStream, stat } from "fs";
import path from "node:path";
import * as pagefind from "pagefind";
import { fileURLToPath } from "url";
import mime from "mime";

void async function() {
  const domain = "http://foo.com";

  const { index } = await pagefind.createIndex({
    forceLanguage: "unknown"
  });

  await index.addCustomRecord({
    url: domain + "/shy.html",
    content: "not shy of using words",
    language: "en"
  });

  const outputPath = path.join(path.dirname(fileURLToPath(import.meta.url)), "demo");
  await rm(outputPath, { force: true, recursive: true });
  await mkdir(outputPath);
  await index.writeFiles({ outputPath });

  await writeFile(path.join(outputPath, "index.html"), `
<link href="/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener("DOMContentLoaded", () => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>
  `, "utf-8");

  const server = http.createServer((req, res) => {
    const url = new URL(domain + req.url);
    const filePath = path.join(outputPath, url.pathname);
    stat(filePath, (err, stats) => {
      // Serve only regular files; anything else is a 404.
      if (err || !stats.isFile()) {
        res.writeHead(404);
        res.end("not found");
      } else {
        res.writeHead(200, {
          "Content-Length": stats.size,
          // Fall back to a generic type for extensions mime doesn't know.
          "Content-Type": mime.getType(filePath) ?? "application/octet-stream"
        });
        createReadStream(filePath).pipe(res);
      }
    });
  });

  server.listen(3000);
}();

Work-around

Applying the following patch fixes the problem. However, according to the documentation, I'd expect to be able to set forceLanguage once in the top-level configuration rather than on every document. Perhaps the documentation should be updated, or the top-level configuration item should take precedence over the document-level value.

@@ -9,14 +9,12 @@ import mime from "mime";
 void async function() {
   const domain = "http://foo.com";
 
-  const { index } = await pagefind.createIndex({
-    forceLanguage: "unknown"
-  });
+  const { index } = await pagefind.createIndex({});
 
   await index.addCustomRecord({
     url: domain + "/shy.html",
     content: "not shy of using words",
-    language: "en"
+    language: "unknown",
   });
 
   const outputPath = path.join(path.dirname(fileURLToPath(import.meta.url)), "demo");

Context

Pagefind version: 1.0.4

Contributor

bglw commented Feb 1, 2024

Ah, good find!

The addCustomRecord() flow is stepping around the function that sets this — using addHTMLFile() would override the language as you're expecting.

Will fix so that it overrides both cases 👍
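
To illustrate the precedence being described, here is a minimal sketch. It is not Pagefind's actual code: resolveLanguage is a hypothetical helper, and the "unknown" fallback for records without a language is an assumption made only for this example.

```javascript
// Hypothetical helper, not part of Pagefind's API.
// Sketches the intended precedence: a top-level forceLanguage wins
// over the language supplied on an individual record.
function resolveLanguage(forceLanguage, recordLanguage) {
  if (typeof forceLanguage === "string" && forceLanguage.length > 0) {
    return forceLanguage;
  }
  // Assumption for this sketch: records without a language are "unknown".
  return recordLanguage ?? "unknown";
}

console.log(resolveLanguage("unknown", "en")); // forced value wins
console.log(resolveLanguage(undefined, "en")); // record value is used
```

Under this rule, the repro above would index the "shy" record as "unknown" even though the record itself says "en".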

bglw added the bug and Pagefind CLI labels Feb 1, 2024
Author

c-w commented Feb 1, 2024

There's another strange behavior I noticed related to stemming. If we add a few more tests to the example above:

  await index.addCustomRecord({
    url: domain + "/c.html",
    content: "industrialist and General Motors co-founder William C. Durant",
    language: "unknown"
  });

  await index.addCustomRecord({
    url: domain + "/p.html",
    content: "George P. Knapp",
    language: "unknown"
  });

Now searching for "poop" or "crap" will match the single-letter tokens P and C, which is quite unexpected to me.

Contributor

bglw commented Feb 1, 2024

PR created for the language fix + test case: #552

Re: the strange behavior, that's currently intentional, though admittedly it isn't the most useful here. Pagefind really likes giving some result over nothing. One way it does that is to trim the search term back until it finds a term that would match; the idea is that if you type generalx, it gets trimmed back to general. There's no escape hatch on this though, so it will trim the term back to one character if need be.

It's an open area for improvement — hopefully one day getting some better typo tolerance features in place will allow us to ease back on this one to something a little more intuitive 🙂
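
The trimming described above can be sketched roughly as follows. This is an illustration only, not Pagefind's implementation: trimUntilMatch is a hypothetical function, and matching indexed words by prefix is an assumption made for the example.

```javascript
// Hypothetical sketch of "trim the search term back until something matches".
// Assumes indexed words are matched by prefix; not Pagefind's real code.
function trimUntilMatch(term, indexedWords) {
  // Try the full term first, then drop one trailing character at a time.
  for (let end = term.length; end > 0; end--) {
    const prefix = term.slice(0, end);
    const matches = indexedWords.filter((word) => word.startsWith(prefix));
    if (matches.length > 0) {
      return { prefix, matches };
    }
  }
  // Nothing matched, even at a single character.
  return { prefix: "", matches: [] };
}

console.log(trimUntilMatch("generalx", ["general", "motors"]));
console.log(trimUntilMatch("poop", ["george", "p", "knapp"]));
```

In this sketch "generalx" falls back to "general", while "poop" keeps shrinking until only "p" is left, which is how a single-letter token ends up as a result.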

Author

c-w commented Feb 1, 2024

Thanks for the explanation. For now I'll hack around it by client-side parsing the excerpt and filtering out any matches where the mark is shorter than some threshold.
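
A rough sketch of that filter, assuming each loaded result exposes an excerpt string in which matched terms are wrapped in <mark> tags. The helper names and the threshold of 3 are hypothetical choices for this example.

```javascript
// Hypothetical client-side filter; helper names and threshold are illustrative.
// Assumes matched terms in the excerpt are wrapped in <mark>...</mark>.
function longestMarkLength(excerpt) {
  let longest = 0;
  for (const [, text] of excerpt.matchAll(/<mark>(.*?)<\/mark>/g)) {
    longest = Math.max(longest, text.length);
  }
  return longest;
}

// Keep only results whose longest highlighted match reaches minLength.
function filterShortMatches(results, minLength = 3) {
  return results.filter(
    (result) => longestMarkLength(result.excerpt) >= minLength
  );
}

const kept = filterShortMatches([
  { url: "/p.html", excerpt: "George <mark>P</mark>. Knapp" },
  { url: "/shy.html", excerpt: "not <mark>shy</mark> of using words" },
]);
console.log(kept.map((result) => result.url));
```

Results with no highlighted term at all are dropped as well, since their longest mark has length 0.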

Contributor

bglw commented Feb 1, 2024

v1.0.5-rc2 has been published with the fix for forceLanguage 🙂

Will leave this issue alive til it hits stable.

Contributor

bglw commented Apr 2, 2024

This has landed in the v1.1.0 release 🎉

bglw closed this as completed Apr 2, 2024