Skip to content

Latest commit

 

History

History
118 lines (98 loc) · 46.6 KB

README.md

File metadata and controls

118 lines (98 loc) · 46.6 KB

LangTrace - Trace Attributes

This repository hosts the JSON schema definitions and the generated model code for both Python and TypeScript. It's designed to streamline the development process across different programming languages, ensuring consistency in data structure and validation logic. The repository includes tools for automatically generating model code from JSON schema definitions, simplifying the task of keeping model implementations synchronized with schema changes.

Repository Structure

/
├── schemas/                      # JSON schema definitions
│   └── openai_span_attributes.json
├── scripts/                      # Shell scripts for model generation
│   └── generate_python.sh
├── generated/                    # Generated model code
│   ├── python/                   # Python models
│   └── typescript/               # TypeScript interfaces
├── package.json
├── requirements.txt
├── README.md
└── .gitignore

Prerequisites

Before you begin, make sure you have the following installed on your system:

  • Node.js and npm
  • Python and pip
  • ts-node for running TypeScript scripts directly (install globally via npm install -g ts-node)
  • datamodel-code-generator for Python model generation (install via pip install datamodel-code-generator)

Generating Models

Python Models

To generate Python models from a JSON schema, use the generate_python.sh script located in the scripts directory. This script takes the path to a JSON schema file as an argument and generates a Python model in the generated/python directory.

./scripts/generate_python.sh schemas/llm_span_attributes.json

TypeScript Interfaces

To generate TypeScript interfaces from a JSON schema, use the scripts/generate_typescript.sh script located in the scripts directory. This script also takes the path to a JSON schema file as an argument and generates a TypeScript interface in the src/typescript/models directory. t

(cd src/typescript && npm i)
./scripts/generate_typescript.sh schemas/llm_span_attributes.json

OpenTelemetry Semantic Attributes

Service Type Name Type/Schema Description
LLM llm.prompts [{role: string, content: string}] Captures the input messages given to the LLM. It includes the prompt with role "System" and any "user" and "assistant" messages along with the history.
Notes:
1. Prompts are standardized for every LLM vendor.
2. The "system" role will always represent the system prompt passed. Ex: The preamble parameter passed to the cohere API is appended to the system prompt and captured within llm.prompts.
LLM llm.responses [{role: string, content: string}] Captures the output messages given by the LLM.
Notes:
1. For image generation, content is an object which has, 'url' which is the url of the image and any other properties that gets attached with it based on the LLM vendor.
2. For tool calling, the list includes role, content and additional properties like tool_id depending on the LLM vendor.
LLM llm.token.counts llm.token.counts: {
input_tokens: number,
output_tokens: number,
total_tokens: number
}
Captures the token counts used with the request including input, output and total tokens.
Notes:
1. For streaming mode, some LLM vendors like OpenAI do not have the token counts. So, this metric calculates the token counts for each stream chunk using the tiktoken library. As a result, it may not be accurate.
2. For cohere, this captures the billed units. And also captures the search_units when search capabilities are used.
LLM llm.api string The endpoint being invoked. Ex: /chat/completions
LLM llm.model string The model used for the call. The model is captured from the response and not from the request. Response has the accurate model name. Ex: Passing "gpt-4" in the request can result in "gpt-4-0613" in the response depending on the version of gpt-4 being used. This is more accurate description of the model used for the call.
LLM llm.temprature number The temperature setting used
LLM llm.top_p number Top P setting
LLM llm.top_k number Top K setting
Note:
1. For LLMs that support top_n, the argument is captured in this attribute as both top_k and top_n represent the same thing.
LLM llm.user string This is an LLM request parama for identifying the user originating this request. Not to be confused with the user.id attribute passed to the langtrace SDK using with_additional_attributes option.
LLM llm.system.fingerprint string The system fingerprint parameter passed to the API.
LLM llm.stream boolean Whether or not streaming is used
LLM llm.encoding.formats [string] Mainly applies to Embedding models. List of encoding formats used for embedding.
LLM llm.dimensions string The number of dimensions the resulting output embeddings should have
LLM llm.generation_id string Captures the generation_id from a response if any.
LLM llm.response_id string Captures the response_id from a response if any.
LLM llm.citations [object] List of citations from cohere’s response. Serialized as is without any mutation to apply any standardization. Cohere Documentation on Documents and Citations
LLM llm.documents [object] Serialized list of documents passed to the rerank API of cohere. This primarily applies to retrieval models and serialized as is without any mutation to apply any standardization.
LLM llm.frequency_penalty string Frequency penalty if passed
LLM llm.presence_penalty string Presence penalty if passed
LLM llm.connectors [object] Applies mainly for cohere. Serialized directly without mutation.
LLM llm.tools [object] The list of tools or functions available for the LLM to take a decision on. There is no standardization applied for the schema and serialized as is for different LLM vendors.
LLM llm.tool_results [object] For LLM vendors that require tool_results passed as a separate parameter with the request. Ex: Cohere. For OpenAI, tool results are part of the messages parameter and are captured with llm.prompts.
LLM llm.embedding_inputs [string] Captures the input strings provided to the embedding model.
LLM llm.embedding_dataset_id string Applies only for cohere
LLM llm.embedding_input_type string Applies only for cohere
LLM llm.embedding_job_name string Applies only for the embed_job API for cohere.
LLM llm.retrieval.query string Query passed to the retrieval model. Ex: Cohere Rerank
LLM llm.retrieval.results [string] Serialized array of objects returned by a retrieval model that usually includes the score and the index of the documents passed.
VectorDB server.address string Captures the DB server address if found
VectorDB db.operation string Operations of a vectorDB - add, delete, query, peek etc.
VectorDB db.system string Captures the db - chromedb, pinecone etc.
VectorDB db.namespace string Namespace of the database
VectorDB db.index string Index passed to the database if any
VectorDB db.collection.name string Captures the collection name where vectors are stored that the operation is querying.
VectorDB db.pinecone.top_k string Captures the top_k value for KNN search
VectorDB db.chromadb.embedding_model string Captures the embedding model used with chromadb
Framework http://langchain.task.name/angchain.task.name string Short term that indicates what task the framework is performing. The names are framework specific. Currently it could be one of the following: load_pdf, vector_store, split_text, retriever, prompt, runnable, runnablepassthrough, jsonoutputparser, stroutputparser, listoutputparser, xmloutputparser.
Framework langchain.inputs string Serialized inputs to the function call
Framework langchain.outputs string Serialized outputs of the function call
Framework llamaindex.task.name string Short term that indicates what task the framework is performing. Currently it could be one of the following - query, retrieve, extract, aextract, load_data, chat, achat
Framework llamaindex.inputs string Serialized inputs to the function call
Framework llamaindex.outputs string Serialized outputs of the function call
Langtrace user.feedback.rating number This is useful for capturing the feedback provided by the user of the application for an LLM’s response. Ex: a user hitting a thumbs up or down for a chatbot’s response.
Langtrace user.id string This is application specific and can be optionally passed using the with_additional_attributes option from the SDK for tying users to requests. More details: Langtrace Trace User Feedback
Langtrace langtrace.testId string Unique id of the test generated within langtrace for capturing requests to a specific test bucket. Useful for evaluating a set of requests against a specific test. Ex: A test for measuring factual accuracy.
Langtrace langtrace.service.name string Captures the service name - Ex: openai, llamaindex etc.
Langtrace langtrace.service.type string Captures the service type - It can be one of the below 3
- LLM
- VectorDB
- Framework
Langtrace langtrace.service.version string Version of the library being used: Ex: 3.0.0 represents the 3.0.0 version of openai python library
Langtrace langtrace.sdk.name string Langtrace SDK that is generating this span. Currently its typescript or python.
Langtrace langtrace.version string Langtrace SDK version.

Contributing

Contributions are welcome! If you'd like to add a new schema or improve the existing model generation process, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or fix.
  3. Make your changes.
  4. Test your changes to ensure the generated models are correct.
  5. Submit a pull request with a clear description of your changes.

License

This project is licensed under the Apache 2.0. See the LICENSE file for more details.