Content API

pyPreservica provides an interface to the Preservica Content API, which supports searching the repository.

https://us.preservica.com/api/content/documentation.html

The Content API is a read-only interface that returns JSON documents rather than XML. It overlaps with the Entity API in places, but it adds search capabilities.

The Content API client is created using:

from pyPreservica import *

client = ContentAPI()

object-details

Get the details of an Asset or Folder as a Python dictionary containing CMIS attributes.

client = ContentAPI()

client.object_details("IO", "uuid")
client.object_details("SO", "uuid")

For example:

from pyPreservica import *

client = ContentAPI()

details = client.object_details("IO", "de1c32a3-bd9f-4843-a5f1-46df080f83d2")
print(details['name'])

The entity type can also be given as an EntityType enumeration value:

from pyPreservica import *

client = ContentAPI()

details = client.object_details(EntityType.ASSET, "de1c32a3-bd9f-4843-a5f1-46df080f83d2")
print(details['name'])

Indexed Fields

Get a list of all the indexed metadata fields within the Preservica search engine. This includes the default xip.* fields and any custom indexes which have been created through custom index files.

client = ContentAPI()

client.indexed_fields()
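
Because the returned names mix built-in and custom indexes, it can help to separate the two groups. The following is a minimal sketch that assumes indexed_fields() yields plain strings, as the filtering example later in this section suggests. It uses a hard-coded sample list in place of a live call, and split_indexed_fields is an illustrative helper, not part of pyPreservica.

```python
def split_indexed_fields(fields):
    """Separate built-in xip.* index names from custom index names."""
    builtin = sorted(f for f in fields if f.startswith("xip."))
    custom = sorted(f for f in fields if not f.startswith("xip."))
    return builtin, custom


# Sample data standing in for client.indexed_fields(); a real repository
# will return its own set of names.
sample_fields = ["xip.title", "dc.rights", "xip.full_text", "oai_dc.contributor", "xip.reference"]

builtin, custom = split_indexed_fields(sample_fields)
print("Built-in:", builtin)
print("Custom:", custom)
```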

Full Text Index

If a document contains text, such as a PDF or Word document, or it has been OCR’d, the full text index will contain the extracted text.

To extract the value of the full text index for an Asset use the following call:

from pyPreservica import *

content = ContentAPI()

text: str = content.full_text("48c79abd-01f3-4b77-8132-546a76e0d337")

The reference supplied must be a valid Asset reference.

This lets you copy the full text into an Asset's description field so that users can view the OCR text. For example:

from pyPreservica import *

content = ContentAPI()
entity = EntityAPI()

asset = entity.asset("48c79abd-01f3-4b77-8132-546a76e0d337")

asset.description = content.full_text(asset.reference)
entity.save(asset)

Search

Search the repository using a single expression which matches on any indexed field.

client = ContentAPI()

client.simple_search_csv()

Called with no arguments, this searches for everything and writes the results to a CSV file called “search.csv”. By default the CSV columns are reference, title, description, document_type, parent_ref and security_tag.

You can pass the query term as the first argument (% is the wildcard character) and the csv file name as the second argument.

client = ContentAPI()

client.simple_search_csv("", "everything.csv")

client.simple_search_csv("Oxford", "oxford.csv")

client.simple_search_csv("History of Oxford", "history.csv")

The last argument is an optional list of indexed fields to use as the CSV columns.

client = ContentAPI()

metadata_fields = ["xip.reference", "xip.title", "xip.description", "xip.document_type", "xip.parent_ref", "xip.security_descriptor"]
client.simple_search_csv("%", "results.csv", metadata_fields)

Or, to include every indexed field except the full text index value:

client = ContentAPI()

everything = list(filter(lambda x: x != "xip.full_text", client.indexed_fields()))
client.simple_search_csv("%", "results.csv", everything)

There is an equivalent call which does not write the output to CSV, but returns a generator of dictionary objects. This is useful if you want to process the results within the script and not generate a report directly.

client = ContentAPI()

for hit in client.simple_search_list("History of Oxford"):
    print(hit)

A list of indexed fields to return can also be supplied:

client = ContentAPI()

metadata_fields = ["xip.reference", "xip.title", "xip.description", "xip.document_type", "xip.parent_ref", "xip.security_descriptor"]
for hit in client.simple_search_list("History of Oxford", metadata_fields):
    print(hit['xip.title'])
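
Since each hit is a dictionary keyed by index name, the results can be summarised in the script itself. The sketch below counts hits per security tag. It assumes hits look like the dictionaries shown above, and it uses a small made-up generator (sample_hits) in place of client.simple_search_list so that it runs without a server. count_by_field is an illustrative helper, not a pyPreservica function.

```python
from collections import Counter


def count_by_field(hits, field):
    """Count how many hits share each value of the given index field."""
    counts = Counter()
    for hit in hits:
        # Hits that lack the field are grouped under a placeholder key.
        counts[hit.get(field, "(missing)")] += 1
    return counts


def sample_hits():
    """Stand-in for client.simple_search_list(...); the data is made up."""
    yield {"xip.title": "History of Oxford", "xip.security_descriptor": "open"}
    yield {"xip.title": "Oxford Maps", "xip.security_descriptor": "public"}
    yield {"xip.title": "Oxford Colleges", "xip.security_descriptor": "open"}


tag_counts = count_by_field(sample_hits(), "xip.security_descriptor")
for tag, n in tag_counts.most_common():
    print(f"{tag}: {n}")
```

Because the hits are consumed one at a time, the same loop works on large result sets without holding every hit in memory.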

If you want to do searches with advanced filter terms then the following calls can be used. These calls use a Python dictionary to allow the caller to specify filter values on the indexed terms.

client = ContentAPI()

filters = {"dc.rights": "Public Domain", "xip.security_descriptor": "public"}
for hit in client.search_index_filter_list(query="History of Oxford", filter_values=filters):
    print(hit)

If you want to generate a report which can be opened directly in Excel, then use the CSV version.

client = ContentAPI()

filters = {"oai_dc.contributor": "*", "xip.security_descriptor": "public"}
client.search_index_filter_csv(query="History of Oxford", csv_file="my-report.csv", filter_values=filters)

The special filter value “*” is used to filter indexes which have a value, i.e. values which are not empty or missing. The filter value “” is used to specify any value including empty values.

For example, to create a report on the security tags of all assets within a folder you can use:

client = ContentAPI()

filters = {"xip.title": "%", "xip.description": "%", "xip.security_descriptor": "*", "xip.parent_ref": "48c79abd-01f3-4b77-8132-546a76e0d337"}
client.search_index_filter_csv(query="%", csv_file="security.csv", filter_values=filters)

Filter values can also be provided as a list of values to match on:

client = ContentAPI()

filters = {"xip.title": "%", "xip.description": "%", "xip.security_descriptor": ["open", "public"], "xip.parent_ref": "48c79abd-01f3-4b77-8132-546a76e0d337"}
client.search_index_filter_csv(query="%", csv_file="security.csv", filter_values=filters)
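
The filter rules described above (a plain value, the special “*”, the empty string, and a list of values) can be summarised with a small local model. This is only an illustration of the documented behaviour. The real matching happens inside the Preservica search engine, which may tokenise or normalise values, and the `%` wildcard is not modelled here. matches_filter and matches_all are hypothetical helpers, and combining filters with AND is an assumption.

```python
def matches_filter(indexed_value, filter_value):
    """Simplified local model of one filter value applied to one indexed value."""
    if filter_value == "":
        # Empty string: accept anything, including empty or missing values.
        return True
    has_value = indexed_value not in (None, "")
    if filter_value == "*":
        # "*": accept only entries that actually have a value.
        return has_value
    if isinstance(filter_value, list):
        # List: accept if the indexed value equals any listed value.
        return indexed_value in filter_value
    # Plain string: modelled here as exact equality (an assumption).
    return indexed_value == filter_value


def matches_all(hit, filters):
    """Assume every filter must pass (logical AND) for a hit to be kept."""
    return all(matches_filter(hit.get(name), value) for name, value in filters.items())


filters = {"xip.description": "", "xip.security_descriptor": ["open", "public"]}
kept = matches_all({"xip.security_descriptor": "open"}, filters)
dropped = matches_all({"xip.security_descriptor": "closed"}, filters)
print(kept, dropped)
```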

Search Progress

Searching a large Preservica repository is quick, but returning very large result sets to the client can be slow. To avoid putting undue load on the server, pyPreservica requests a single page of results per server request.
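
The page-at-a-time pattern can be sketched with a plain generator. This is not pyPreservica's implementation: fetch_page is a hypothetical stand-in for a single server request, and the page size is arbitrary.

```python
def fetch_page(start, page_size, dataset):
    """Hypothetical stand-in for one server request returning a slice of results."""
    return dataset[start:start + page_size]


def paged_results(dataset, page_size=3):
    """Yield results one at a time, requesting a new page only when needed."""
    start = 0
    while True:
        page = fetch_page(start, page_size, dataset)
        if not page:
            # An empty page means there are no more results.
            break
        yield from page
        start += len(page)


all_hits = [f"hit-{i}" for i in range(8)]
collected = list(paged_results(all_hits, page_size=3))
print(collected)
```

The caller sees one continuous stream of hits, while each underlying request stays small.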

If you are using the simple_search_csv or search_index_filter_csv functions, which write directly to a CSV file, it can be difficult to monitor the progress of report generation.

To allow monitoring of search result downloads, you can add a callback to the search client. The callback is called for every page of search results returned to the client. The value passed to the callback is a string of the form current:total, giving the number of results processed so far and the total number of search hits for the query.

pyPreservica provides a default callback:

import sys
import threading


class ReportProgressCallBack:
    def __init__(self):
        self.current = 0
        self.total = 0
        self._lock = threading.Lock()

    def __call__(self, value):
        # value is a string of the form "current:total"
        with self._lock:
            values = value.split(":")
            self.total = int(values[1])
            self.current = int(values[0])
            percentage = (self.current / self.total) * 100
            # Overwrite the same console line with the latest progress
            sys.stdout.write("\r%s / %s  (%.2f%%)" % (self.current, self.total, percentage))
            sys.stdout.flush()

To use the default callback in your scripts, include the following line:

client.search_callback(client.ReportProgressCallBack())
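
A custom callback can follow the same shape. The sketch below assumes, based on the default callback shown above, that each call receives a string of the form current:total. The class name QuietProgressCallback is illustrative and not part of pyPreservica. It records progress instead of printing it, and guards against a total of zero.

```python
class QuietProgressCallback:
    """Record search progress without writing to stdout."""

    def __init__(self):
        self.current = 0
        self.total = 0
        self.pages_seen = 0

    def __call__(self, value):
        # Assumed format: "<current>:<total>", as parsed by the default callback.
        current, total = value.split(":")
        self.current = int(current)
        self.total = int(total)
        self.pages_seen += 1

    @property
    def percentage(self):
        # Avoid dividing by zero when a query has no hits.
        return (self.current / self.total) * 100 if self.total else 0.0


progress = QuietProgressCallback()
for update in ("100:250", "200:250", "250:250"):  # simulated page notifications
    progress(update)
print(f"{progress.current} / {progress.total} ({progress.percentage:.2f}%)")
```

In a real script the instance would be registered with client.search_callback(progress) before running the search, and inspected afterwards.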

Excluding results from Search

The search API now allows hits to be excluded from the results by applying an operator to a search term.

Note

This functionality is only available in Preservica 7.5 and later.

To use this functionality, pyPreservica provides a search call which takes a list of Field objects. A Field object holds the name of the index, the value to search for and a sort order. An optional operator determines whether the field value is included in or excluded from the search.

To include only results matching a field value use:

content = ContentAPI()

fields = [Field(name='xip.title', value='Blockchain')]

for hit in content.search_fields(query="", fields=fields):
    print(hit)

To exclude results matching a field value use:

fields = [Field(name='xip.title', value='Blockchain', operator=Operator.NOT, sort_order=SortOrder.desc)]

for hit in content.search_fields(query="", fields=fields):
    print(hit)

To use a list of possible values use:

term = Field(name='xip.security_descriptor', value=["open", "public"])

for hit in content.search_fields(query="", fields=[term]):
    print(hit)
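
The difference between including, excluding and matching a list of values can be illustrated with a small local model. This is a sketch only: LocalTerm is a hypothetical stand-in that is deliberately not pyPreservica's Field class, the matching is simple equality rather than real index search, and the sample hits are made up.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class LocalTerm:
    """Hypothetical local stand-in for a search term; not pyPreservica's Field."""
    name: str
    value: Union[str, List[str]]
    exclude: bool = False  # plays the role of the NOT operator


def term_accepts(hit, term):
    """Return True if the hit passes the term under this simplified model."""
    wanted = term.value if isinstance(term.value, list) else [term.value]
    found = hit.get(term.name) in wanted
    return not found if term.exclude else found


def apply_terms(hits, terms):
    """Keep only the hits that pass every term."""
    return [hit for hit in hits if all(term_accepts(hit, t) for t in terms)]


sample = [
    {"xip.title": "Blockchain", "xip.security_descriptor": "open"},
    {"xip.title": "Archives", "xip.security_descriptor": "public"},
    {"xip.title": "Ledgers", "xip.security_descriptor": "closed"},
]

included = apply_terms(sample, [LocalTerm("xip.title", "Blockchain")])
excluded = apply_terms(sample, [LocalTerm("xip.title", "Blockchain", exclude=True)])
either = apply_terms(sample, [LocalTerm("xip.security_descriptor", ["open", "public"])])
print(len(included), len(excluded), len(either))
```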

Reporting Examples

Create a spreadsheet containing all Assets within the repository

Generate a CSV report on all Assets within the system. The spreadsheet columns include the asset title, description, security tag, etc.

from pyPreservica import *

client = ContentAPI()


if __name__ == '__main__':
    metadata_fields = {
        "xip.reference": "*", "xip.title": "", "xip.description": "", "xip.document_type": "IO", "xip.parent_ref": "",
        "xip.security_descriptor": "*",
        "xip.identifier": "", "xip.bitstream_names_r_Preservation": ""}

    client.search_callback(client.ReportProgressCallBack())

    client.search_index_filter_csv("", "assets.csv", metadata_fields)

Create a spreadsheet containing all Assets and Folders within the repository

from pyPreservica import *

client = ContentAPI()

if __name__ == '__main__':
    metadata_fields = {
        "xip.reference": "*", "xip.title": "", "xip.description": "", "xip.document_type": "*", "xip.parent_ref": "",
        "xip.security_descriptor": "*",
        "xip.identifier": "", "xip.bitstream_names_r_Preservation": ""}

    client.search_callback(client.ReportProgressCallBack())

    client.search_index_filter_csv("", "all_objects.csv", metadata_fields)

Create a spreadsheet containing all Assets and Folders underneath a specific folder

import sys

from pyPreservica import *

content = ContentAPI()
entity = EntityAPI()

# The folder reference is passed as the first command-line argument
folder = entity.folder(sys.argv[1])

print(f"Searching inside folder {folder.title}")

if __name__ == '__main__':
    metadata_fields = {
        "xip.reference": "*", "xip.title": "", "xip.description": "", "xip.document_type": "*", "xip.parent_hierarchy": f"{folder.reference}",
        "xip.security_descriptor": "*",
        "xip.identifier": "", "xip.bitstream_names_r_Preservation": ""}


    content.search_callback(content.ReportProgressCallBack())

    content.search_index_filter_csv("", "assets.csv", metadata_fields)

User Security Tags

You can get a list of available security tags for the current user by calling:

client = ContentAPI()

client.user_security_tags()