``pds4_create_xml_index`` Program ================================= Introduction ------------ The RMS Node's PDS4 Index Creation Tool (``pds4_create_xml_index``) is designed to facilitate the extraction and indexing of information from `Planetary Data System (PDS) `_ `PDS4-format `_ label files. This tool automates the process of parsing specified directories for label files, extracting user-defined elements using XPath expressions, and generating organized output files. Users can customize the extraction process through various options, such as filtering content, sorting output, and integrating additional file metadata. The tool supports flexibility with configuration files and provides a straightforward interface for creating both CSV-based (variable or fixed-width) index files and text files listing available XPath headers. Whether for scientific research, data management, or archival purposes, the PDS4 Index Creation Tool offers a robust solution for efficiently managing and accessing structured data within PDS4-compliant datasets. XPath Syntax and Structure -------------------------- ``pds4_create_xml_index`` generates index file headers that represent the position of each element in the XML hierarchy in a manner similar to `the standard XPath format `_. While familiarity with XPath syntax is beneficial, it is not necessary to use this tool effectively. Note that for simplicity we call our headers XPaths throughout this document and within the command line arguments, despite some minor syntax differences. Here we describe the syntax we use for our headers. An XPath header is read from left to right, starting with the root element and moving through each subsequent child element. Elements are separated by a forward slash (``/``). If an element has multiple instances within a single parent, predicates (numbers within angle brackets) are used to specify the exact instance, such as ``../pds:Observation_Area<1>/pds:version_id<1>``, which selects the first ``version_id`` element in ``Observation_Area``. The predicate numbers count the instances of the child element, rather than its structural position in the file. For example, given the XML fragment:: ... ... ... ... ... ... the available XPath headers are:: Product_Observational<1>/Identification_Area<1>/logical_identifier<1> Product_Observational<1>/Identification_Area<1>/Citation_Information<1>/author_list<1> Product_Observational<1>/Identification_Area<1>/Citation_Information<1>/publication_year<1> Product_Observational<1>/Identification_Area<1>/Citation_Information<1>/keyword<1> Product_Observational<1>/Identification_Area<1>/Citation_Information<1>/keyword<2> Product_Observational<1>/Identification_Area<1>/Citation_Information<1>/description<1> Command Line Arguments ---------------------- Required arguments ^^^^^^^^^^^^^^^^^^ Two command line arguments are required in every run. The first is the top-level directory of the collection, bundle, or other directory structure where the label files are location. All file path strings included in the index file and/or label will be given relative to this directory. The second is one or more ``glob``-style patterns that specify the filenames of the labels to be scraped. The patterns should be given relative to the top-level directory. ``glob``-style patterns allow wildcard symbols similar to those used by most Unix shells: - ``?`` matches any single character within a directory or file name - ``*`` matches any series of characters within a directory or file name - ``**`` matches any filename or zero or more nested directories - ``[seq]`` matches any character in ``seq`` - ``[!seq]`` matches any character not in ``seq`` To avoid interpretation by the shell, all patterns must be surrounded by double quotes. If more than one pattern is specified, they should each be surrounded by double quotes and separated by spaces. Example:: pds4_create_xml_index /top/level/directory "data/planet/*.xml" "**/rings/*.xml" Optional arguments ^^^^^^^^^^^^^^^^^^ Index file generation """"""""""""""""""""" - ``--output-index-file INDEX_FILEPATH``: Specify the location and filename of the index file. This file will contain the extracted information organized in CSV format. It is recommended that the file have the suffix ``.csv``. If no directory is specified, the index file will be written into the current directory. If this option is omitted entirely, the default filename ``index.csv`` will be used. However, to prevent accidentally overwriting an existing index file, if ``index.csv`` already exists in the current directory, the index will be written into ``index1.csv``, ``index2.csv``, etc. as necessary. - ``--add-extra-file-info COMMA_SEPARATED_COLUMN_NAMES``: Generate additional information columns in the index file. One or more column names can be specified separated by commas. The available column names are: - ``filename``: The base filename of the label file - ``filepath``: The path of the label file relative to the top-level directory - ``bundle_lid``: The LID of the bundle containing the label file - ``bundle``: The name of the bundle containing the label file .. note:: **Adds utility columns before any header simplification/renaming. - ``--sort-by COMMA_SEPARATED_HEADER_NAME(s)``: Sort the resulting index file by the value(s) in one or more columns. The column names are those that appear in the final index file, as modified by ``--simplify-xpaths``, ``--limit-xpaths-file``, or ``--clean-header-field-names``, and include any additional columns added with ``--add-extra-file-info``. To see a list of available column names, use ``--output-headers-file``. More than one sort key can be specified by separating them by commas, in which case the sort proceeds hierarchically from left to right. As the XPath syntax includes special characters that may be interpreted by the shell, it may be necessary to surround the list of sort keys with double quotes. .. note:: Sorting uses the **current** header names after steps 1 – 5. Example:: pds4_create_xml_index <...> --sort-by "pds:Product_Observational/pds:Identification_Area<1>/pds:version_id<1>,pds:logical_identifier" - ``--fixed-width``: Format the index file using fixed-width columns. - ``--clean-header-field-names``: Rename column headers to use only characters permissible in variable names, making them more compatible with certain file readers. .. note:: Normalizes header strings after simplification but before sorting/rename. - ``--simplify-xpaths``: Where possible, rename column headers to use only the tag instead of the full XPath. If this would cause ambiguity, leave the name using the full XPath instead. This will usually produce an index file with simpler column names, potentially making the file easier to display or use. .. note:: Reduces headers to shortest unambiguous form prior to cleaning/sorting/rename. - ``--dont-number-unique-tags``: Whenever a tag is unique within its hierarchy, remove the predicates (``<#>``) from the full or simplified XPath. When used together with ``--simplify-xpaths``, this command generates the shortest possible headers. .. note:: Removes predicates from unique tags prior to simplify/clean/sort/rename Limiting results """""""""""""""" - ``--limit-xpaths-file XPATHS_FILEPATH``: Specify a text file containing a list of specific XPaths to extract from the label files. If this argument is not specified, all elements found in the label files will be included. This command uses only the whole versions of the XPath(s) -- simplified versions are not allowed. The given text file can specify XPaths using ``glob``-style syntax, where each XPath level is treated as if it were a directory in a filesystem. Available wildcards are: - ``?`` matches any single character within an XPath level - ``*`` matches any series of characters within an XPath level - ``**`` matches any tags and zero or more nested XPath levels - ``[seq]`` matches any character in ``seq`` - ``[!seq]`` matches any character not in ``seq`` For example, the XPath ``pds:Product_Observational/pds:Identification_Area<1>/pds:version_id<1>`` could be matched using: - ``pds:Product_Observational/pds:Identification_Area<1>/pds:version_id<1>`` - ``pds:Product_Observational/pds:Identification_Area<1>/*`` - ``pds:Product_Observational/**/*version*`` - ``pds:Product_Observational/**`` In addition, XPaths can be removed from the selected set by prefacing the pattern with ``!``. For example, the following set of patterns would select all XPaths except for any containing the string ``version`` somewhere in the name:: ** !**/*version* .. note:: Defines the working header set for all subsequent steps. - ``--output-headers-file HEADERS_FILEPATH``: Write a list of all column names included in the index file. The column names will precisely agree with those given in the first line of the index file, as modified by ``--simplify-xpaths``, ``--limit-xpaths-file``, ``--dont-number-unique-tags``, or ``--clean-header-field-names``, and include any additional columns added with ``--add-extra-file-info``. This file is useful to easily verify the contents of the index file and also to serve as a starting point for a file to be supplied to ``--limit-xpaths-file``. .. note:: Not in header-processing sequence. Writes the final headers that would appear in the index. Internally applies transforms in this order for the headers file: simplify → rename → clean. Label generation """""""""""""""" - ``--generate-label {ancillary,metadata}``: Generate a label file describing the index file. The label file will be placed in the same directory as the index file and will have the same name except that the suffix will be ``.xml``. The required argument specifies the type of metadata class to use in the label file, ``Product_Ancillary`` for ``ancillary`` or ``Product_Metadata_Supplemental`` for ``metadata``. Miscellaneous """"""""""""" - ``--verbose``: Display detailed information during the file scraping process that may be useful for debugging. - ``--rename-headers``: Change the headers of the output file from their XPath/simplified XPath counterparts to user-defined values via a given text file. Each line within the text file must have the format ``,``. - ``--config-file``: Specify one or more YAML-style configuration files for further customization of the extraction process. See the section below for details. Configuration Files ------------------- ``pds4_create_xml_index`` allows for the use of `YAML `_ configuration files to alter specific contents of index files and generated label files. Configuration files can specify information for "nillable elements" and details to be included in the generated label file. Because you can specify multiple configuration files, you have the option of separating different types of configuration data into separate files, or including it all in a single file. Nillable Elements ^^^^^^^^^^^^^^^^^ The first application of configuration files is to cover instances of nilled elements. Nilled elements are those intentionally omitted due to being inapplicable, missing, unknown, or anticipated. ``pds4_create_xml_index`` has a default configuration file that contains values for a common set of nilled elements. Below is the ``nillable`` section of the default configuration file that covers these values:: nillable: pds:ASCII_Date_YMD: inapplicable: '0001-01-01' missing: '0002-01-01' unknown: '0003-01-01' anticipated: '0004-01-01' pds:ASCII_Date_Time_YMD: inapplicable: '0001-01-01T12:00' missing: '0002-01-01T12:00' unknown: '0003-01-01T12:00' anticipated: '0004-01-01T12:00' pds:ASCII_Date_Time_YMD_UTC: inapplicable: '0001-01-01T12:00Z' missing: '0002-01-01T12:00Z' unknown: '0003-01-01T12:00Z' anticipated: '0004-01-01T12:00Z' pds:ASCII_Integer: inapplicable: -999 missing: -998 unknown: -997 anticipated: -996 pds:ASCII_Real: inapplicable: -999.0 missing: -998.0 unknown: -997.0 anticipated: -996.0 pds:ASCII_Short_String_Collapsed: inapplicable: inapplicable missing: missing unknown: unknown anticipated: anticipated **NOTE**: YAML considers the ``000X-0X-0X`` format as a datetime object. As such, assigned values for ``ASCII_Date_YMD`` and other data types that use this format need to be surrounded by quotes. You can support any additional nillable data types, or override the default values, by supplying additional configuration files with the ``--config-file`` option. Below is an example of a configuration file that overrides the default values for three of the six common data types:: nillable: pds:ASCII_Integer: inapplicable: -9999 missing: -9988 unknown: -9977 anticipated: -9966 pds:ASCII_Real: inapplicable: -9999.0 missing: -9988.0 unknown: -9977.0 anticipated: -9966.0 pds:ASCII_Short_String_Collapsed: inapplicable: inapplicable_alt missing: missing_alt unknown: unknown_alt anticipated: anticipated_alt Processing Order of Options """"""""""""""""""""""""""" Most options either transform the **set and names of headers** in the index or affect output/labels. During index generation, header-affecting options are applied in this order: 1. ``--limit-xpaths-file`` Restricts which headers (XPaths) are considered at all. 2. ``--add-extra-file-info`` Appends utility columns (e.g., ``filepath``) to the working set. 3. ``--dont-number-unique-tags`` Removes predicates (``<1>``) from headers where the tag is unique within its hierarchy. 4. ``--simplify-xpaths`` Rewrites headers to their simplest unambiguous form. 5. ``--clean-header-field-names`` Normalizes header strings to reader-friendly variable names. 6. ``--sort-by`` Sorts rows using the **current** (already-transformed) header names. 7. ``--rename-headers`` Applies user-supplied final header renames. Notes ^^^^^ - ``--output-headers-file`` writes the **final headers** that would appear in the index, but internally applies header transforms in the order: simplify → rename → clean. (This differs slightly from the main pipeline, but the resulting file still reflects what users will see when the same options are used.) - Formatting/packaging options (e.g., ``--fixed-width``, ``--output-index-file``, ``--generate-label``) occur at write time and do not participate in the header transformation pipeline. Label Contents ^^^^^^^^^^^^^^ Moreover, the configuration files can include content for label generation. This feature allows you to add optional classes to the generated label file, such as ``Citation_Information``, and ``Modification_History``. Additionally, you can override existing values within the generated label file. Below is the ``label-contents`` section of the default configuration file:: label-contents: version_id: 1.0 title: Index File Citation_Information: Modification_Detail: Internal_Reference: External_Reference: Source_Product_Internal: Source_Product_External: File_Area_Ancillary: File_Area_Metadata: Each listed value with an empty dictionary is an optional field the user can include in their generated label. If the user does decide to include one of these fields, **they must also include all elements within that field in their configuration file, even if the element will remain empty**. For reference, provided below are the full contents of the optional label classes:: Citation_Information: author_list: editor_list: publication_year: doi: keyword: description: Funding_Acknowledgement: funding_source: funding_year: funding_award: funding_acknowledgement_text: Modification_Detail: modification_date: version_id: description: Internal_Reference: lid_reference: reference_type: comment: External_Reference: doi: reference_text: description: Source_Product_Internal: lidvid_reference: reference_type: comment: Source_Product_External: external_source_product_identifier: reference_type: doi: curating_facility: description: File_Area_Ancillary / File_Area_Metadata: creation_date_time: If no new contents are specified for label generation, the label will contain the following classes:: ... ... ... ... ... Creative Common Public License CC0 1.0 (2024) Creative Commons Zero (CC0) license information. urn:nasa:pds:system_bundle:document_pds4_standards:creative_commons_1.0.0::1.0 product_to_license Depending on the chosen argument for ``--generate-label`` (``ancillary`` or ``metadata``), the label will be given either the ``File_Area_Ancillary`` or ``File_Area_Metadata`` class, which will then contain either ``Table_Character`` for fixed-width index files or ``Table_Delimited`` for variable-length files. These classes are populated by the label generation code and cannot be altered with configuration files (except in the case of nilled elements). Here is an example of a configuration file that overrides the default title and adds modification history:: label-contents: title: Index file for my occultation bundle Modification_Detail: - modification_date: '2024-01-01' version_id: 1.1 description: | This is a lengthy description of what this modification changed in the bundle. There were lots of changes. - modification_date: '2023-01-01' version_id: 1.0 description: Initial release.