
Nucleus API documentation

Welcome

Scroll down for code samples, example requests and responses. Select a language for code samples from the tabs above or the mobile navigation menu.

In this documentation, you will find comprehensive guides and examples to help you start working with the Nucleus API. The Nucleus API is a suite of high-performance text-analytics services developed by SumUp Analytics and geared toward large-scale and high-velocity text data.

Please contact our customer success team to discuss your use case. To learn more, please visit www.sumup.ai.

Install the SDK

The Nucleus API is available to integrate within your framework using either a Python or a JavaScript SDK, both running against a RESTful server.

Python GitHub Download

JavaScript GitHub Download

Authenticate

To get authorized, use this code:

import nucleus_api

# Instantiate APIs
configuration = nucleus_api.Configuration()

configuration.host = 'your-api-server-hostname'

configuration.api_key['x-api-key'] = 'your-api-key'

api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

Authorization through an API key is required to get access to Nucleus services. If you don't have a key or if your key has expired, please contact our Customer Success team.

Tutorial Initialization

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.

Copyright (c) 2018-2019 SumUp Analytics, Inc. All Rights Reserved.

Configure API host and key, and create a new API instance

import os
import csv
import json
import datetime
import time
import nucleus_api
from nucleus_api.rest import ApiException
import nucleus_api.api.nucleus_api as nucleus_helper
from pprint import pprint
import numpy as np
from pathlib import Path

# Determine if in Jupyter notebook or not
try:
    ip = get_ipython()
    running_notebook = True
except NameError:
    running_notebook = False

if running_notebook:
    print('Running example in Jupyter Notebook')
else:
    print('Running example in script mode')

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

Manage Datasets

In this section, we walk you through our dataset ingestion and management APIs: inserting documents from local files, URLs, JSON records, and CSV files; connecting to embedded datafeeds; and listing, inspecting, and deleting datasets.

Insert a specific file from a local drive

dataset = "dataset_test"
file = 'quarles20181109a.pdf'         
metadata = {"time": "1/2/2018",
            "author": "Test Author"}  # Optional json containing additional document metadata
try:
    api_response = api_instance.post_upload_file(file, dataset, metadata=metadata)
    fp = api_response.result
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset,)    
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Insert all PDFs from a folder using parallel ingestion

folder = 'fomc-minutes'
dataset = 'dataset_test'  # str | Destination dataset where the files will be inserted.

# Build the file iterable. Each item in the iterable has the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        if Path(file).suffix == '.pdf':
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'field1': 'financial'}}
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=1)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)
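
The key-naming rule in the comment above can be checked before uploading. Here is a minimal sketch; the `valid_metadata` helper is ours for illustration and is not part of the Nucleus SDK:

```python
import re

# Metadata keys may only contain alphanumerics (0-9, a-z, A-Z) and
# underscores, per the upload notes above. This helper is illustrative.
def valid_metadata(metadata):
    return all(re.fullmatch(r'[0-9A-Za-z_]+', key) for key in metadata)

print(valid_metadata({'field1': 'financial', 'time_stamp': '2018-01-02'}))  # True
print(valid_metadata({'bad-key': 'x'}))  # False: hyphens are not allowed
```

Validating keys up front avoids a round trip to the server for a rejected upload.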

Insert a file from a URL

dataset = 'dataset_test'
file_url = 'https://s3-us-west-2.amazonaws.com/sumup-public/nucleus-sdk/quarles20181109a.docx'
# Optional filename saved on the server for the URL. If not specified, Nucleus will make an intelligent guess from the file URL
filename = 'quarles20181109a-newname.docx'
payload = nucleus_api.UploadURLModel(dataset=dataset,
                                     file_url=file_url,
                                     filename=filename)
try:
    api_response = api_instance.post_upload_url(payload)
    url_prop = api_response.result
    print(url_prop.file_url, '(', url_prop.size, ' bytes) has been added to dataset', dataset)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Insert multiple URLs using parallel ingestion

dataset = 'dataset_test'
file_urls = ['https://s3-us-west-2.amazonaws.com/sumup-public/nucleus-sdk/quarles20181109a.docx',
             'https://s3-us-west-2.amazonaws.com/sumup-public/nucleus-sdk/quarles20181109b.docx',
             'https://s3-us-west-2.amazonaws.com/sumup-public/nucleus-sdk/quarles20181109c.docx',
             'https://s3-us-west-2.amazonaws.com/sumup-public/nucleus-sdk/quarles20181109d.docx']

url_props = nucleus_helper.upload_urls(api_instance, dataset, file_urls, processes=1)

for up in url_props:
    print(up.file_url, '(', up.size, ' bytes) has been added to dataset', dataset)

Insert a JSON

dataset = 'dataset_test'

# The fields "title", "time", and "content" are mandatory in the JSON record.
# Users can add any custom fields to the JSON record and all the information will be saved as metadata for the document.
document = {"title": "This is a test json title field",
            "time": "2019-01-01",
            "content": "This is a test json content field"}

payload = nucleus_api.Appendjsonparams(dataset=dataset,
                                       document=document)
try:
    api_response = api_instance.post_append_json_to_dataset(payload)
    print(api_response.result)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

The JSON record must have "title", "time", and "content" fields.

You can add custom fields to the JSON record; all of this information is saved as metadata for the document.

This metadata can subsequently be used in the analytics APIs to apply custom selections of documents in your dataset.
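As a sketch of what such a selection looks like (the field names below are illustrative and assume documents ingested with matching custom metadata), a metadata selection is a plain dict mapping a metadata field to a single value, or to a tuple of accepted values:

```python
# Illustrative only: these field names must match metadata attached at ingestion.
single_value_selection = {"author": "Test Author"}
multi_value_selection = {"document_category": ("speech", "press release")}

# Either dict is passed unchanged into an analytics payload, e.g.:
# payload = nucleus_api.Topics(dataset='dataset_test',
#                              metadata_selection=single_value_selection)
print(single_value_selection["author"])  # Test Author
```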

Insert JSONs from a CSV file using parallel ingestion

csv_file = 'trump-tweets-100.csv'
dataset = 'trump_tweets'

with open(csv_file, encoding='utf-8-sig') as csvfile:
    reader = csv.DictReader(csvfile)
    json_props = nucleus_helper.upload_jsons(api_instance, dataset, reader, processes=1)

    total_size = 0
    total_jsons = 0
    for jp in json_props:
        total_size += jp.size
        total_jsons += 1

    print(total_jsons, 'JSON records (', total_size, 'bytes) appended to', dataset)

The CSV file must have "title", "time", and "content" columns.

Users can add any column to their CSV file and all the information will be saved as metadata for the dataset.

This metadata can subsequently be used in the analytics APIs to apply custom selections of documents in your dataset.

Create a dataset using embedded datafeeds

The Nucleus platform contains a collection of datafeeds available in read-only mode to users.

Central Banks

dataset_central_bank = 'sumup/central_banks_chinese'
metadata_selection_central_bank = {'bank': 'people_bank_of_china',
                                   'document_category': ('speech', 'press release', 'publication')}

Connect to these feeds by language using the following naming structure for the dataset name: 'sumup/central_banks_LANGUAGE'

with LANGUAGE in {english, chinese, japanese, german, portuguese, spanish, russian, french, italian}

You can then define a custom metadata selection off this feed by specifying a set of banks and a set of document categories.

When passing these parameters to any of the analytics APIs, you can also specify a time period selection, using either a time_period value or a period_start/period_end date pair.

Examples of such calls are detailed in this tutorial, within the sections discussing the analytics APIs.
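The 'sumup/central_banks_LANGUAGE' naming convention above can be captured in a small helper. This is a sketch of ours, not an SDK function; only the naming pattern and language list come from the documentation:

```python
# Languages listed in the documentation for the central-bank feeds.
FEED_LANGUAGES = {'english', 'chinese', 'japanese', 'german', 'portuguese',
                  'spanish', 'russian', 'french', 'italian'}

def central_bank_dataset(language):
    """Return the read-only dataset name for a central-bank feed language."""
    if language not in FEED_LANGUAGES:
        raise ValueError('No central-bank feed for language: {}'.format(language))
    return 'sumup/central_banks_{}'.format(language)

print(central_bank_dataset('chinese'))  # sumup/central_banks_chinese
```

The returned name is then used wherever a dataset name is expected, together with an optional metadata selection such as the one shown above.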

News RSS

dataset = 'sumup/rss_feed_ai'

Connect to these feeds by field using the following naming structure for the dataset name: 'sumup/rss_feed_FIELD'

with FIELD in {ai, finance, economics, news, crypto, culture}

When passing these parameters to any of the analytics APIs, you can also specify a time period selection, using either a time_period value or a period_start/period_end date pair.

Examples of such calls are detailed in this tutorial, within the sections discussing the analytics APIs.
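The 'sumup/rss_feed_FIELD' convention admits the same treatment; again, the helper is ours for illustration, while the pattern and field list come from the documentation:

```python
# Fields listed in the documentation for the RSS feeds.
FEED_FIELDS = {'ai', 'finance', 'economics', 'news', 'crypto', 'culture'}

def rss_feed_dataset(field):
    """Return the read-only dataset name for an RSS feed field."""
    if field not in FEED_FIELDS:
        raise ValueError('No RSS feed for field: {}'.format(field))
    return 'sumup/rss_feed_{}'.format(field)

print(rss_feed_dataset('ai'))  # sumup/rss_feed_ai
```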

SEC Filings

# GET THE LIST OF ALL THE COMPANIES AVAILABLE IN THE FEED

payload = nucleus_api.EdgarFields(tickers=[],
                                  filing_types=[],
                                  sections=[])
try:
    api_response = api_instance.post_available_sec_filings(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('SEC filings selected:')
    print('    Company count:', len(api_response.result['tickers']))
    print('    Date range:', api_response.result['date_range'])

# GET THE LIST OF AVAILABLE FILING TYPES FOR A COMPANY

payload = nucleus_api.EdgarFields(tickers=["IBM"], # Select IBM company
                                  filing_types=[],
                                  sections=[])
try:
    api_response = api_instance.post_available_sec_filings(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('SEC filings for:', api_response.result['tickers'])
    print('    Types:', api_response.result['filing_types'])
    print('    Count:', api_response.result['count'])
    print('    Date ranges:', api_response.result['date_range'])

# GET THE LIST OF AVAILABLE SECTIONS IN A GIVEN FILING TYPE FOR A GIVEN COMPANY

payload = nucleus_api.EdgarFields(tickers=["IBM"], # Select IBM company
                                  filing_types=["10-K"], # Get list of sections available in 10-Ks
                                  sections=[])
try:
    api_response = api_instance.post_available_sec_filings(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Sections in {} filings for {}'.format(api_response.result['filing_types'], api_response.result['tickers']))
    for section in api_response.result['sections']:
        print('    {}'.format(section))

# BUILD A DATASET FROM A CUSTOM SELECTION OF SEC FILINGS

dataset = "dataset_sec1"

# Dataset from a particular section for a ticker
payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                 tickers=["BABA"],
                                 filing_types=["20-F"],
                                 sections=["Quantitative and Qualitative Disclosures about Market Risk"])
try:
    api_response = api_instance.post_create_dataset_from_sec_filings(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Dataset {} created successfully from SEC filings'.format(api_response.result['destination_dataset']))

# BUILD A DATASET FROM A CUSTOM SELECTION OF SEC FILINGS

dataset = "dataset_sec2"
period_start = "2018-01-01"
period_end = "2019-06-01"

# Dataset of all 8-Ks filed over the last 18 months
payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                 tickers=["NFLX"],
                                 filing_types=["8-K"],
                                 sections=[],
                                 period_start=period_start,
                                 period_end=period_end)
try:
    api_response = api_instance.post_create_dataset_from_sec_filings(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Dataset {} created successfully from SEC filings'.format(api_response.result['destination_dataset']))

SEC filings have a more complex structure, so Nucleus provides a specific set of APIs to interact with this data and to let you create tailored datasets.

There are two payloads available to you: EdgarFields, to query which companies, filing types, and sections are available in the feed, and EdgarQuery, to build a dataset from a custom selection of filings.

These payloads expose 5 optional arguments: tickers, filing_types, sections, period_start, and period_end.

List all the datasets available to a user

try:
    api_response = api_instance.get_list_datasets()
    list_datasets = api_response.result
    print(len(list_datasets), 'datasets in the database:')
    for ds in list_datasets:
        print('    ', ds.name)
except ApiException as e:
    print("Exception when calling DatasetsApi->get_list_datasets: %s\n" % e)

Retrieve summary information for a dataset

dataset = 'dataset_sec2' # str | Dataset name.
query = '' # str | Fulltext query, using mysql MATCH boolean query format. (optional)
metadata_selection = '' # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)
time_period = '' # str | Time period selection (optional)
period_start = "" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_end = "" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"

try:
    payload = nucleus_api.DatasetInfo(dataset=dataset,
                                      query=query,
                                      metadata_selection=metadata_selection,
                                      time_period=time_period)
    api_response = api_instance.post_dataset_info(payload)
    print('Information about dataset', dataset)
    print('    Language:', api_response.result.detected_language)
    print('    Number of documents:', api_response.result.num_documents)
    print('    Time range:', datetime.datetime.fromtimestamp(float(api_response.result.time_range[0])),
             'to', datetime.datetime.fromtimestamp(float(api_response.result.time_range[1])))
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Delete documents from a dataset

dataset = 'dataset_test'

docid = '1'
payload = nucleus_api.Deletedocumentmodel(dataset=dataset,
                                          docid=docid)
try:
    api_response = api_instance.post_delete_document(payload)
    print('Document', docid, 'from dataset', dataset, 'has been deleted.')
except ApiException as e:
    print("Exception when calling DatasetsApi->post_delete_document: %s\n" % e)

Delete a dataset

dataset = 'dataset_test'
payload = nucleus_api.Deletedatasetmodel(dataset=dataset) # Deletedatasetmodel |

try:
    api_response = api_instance.post_delete_dataset(payload)
    print(api_response.result['result'])
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

# List datasets again to check if the specified dataset has been deleted
try:
    api_response = api_instance.get_list_datasets()
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Analyze topics

This section goes over all APIs that enable users to identify, extract and analyze topics found in a dataset.

Identify and extract a list of topics

dataset = 'trump_tweets'
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
time_period = ""     # str | Time period selection. Choices: ["1M","3M","6M","12M","3Y","5Y",""] (optional)

try:
    payload = nucleus_api.Topics(dataset=dataset,                                
                                query=query,                   
                                custom_stop_words=custom_stop_words,     
                                num_topics=num_topics,
                                metadata_selection=metadata_selection,
                                time_period=time_period)
    api_response = api_instance.post_topic_api(payload)        
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

doc_ids = api_response.result.doc_ids
topics = api_response.result.topics
for i, res in enumerate(topics):
    print('Topic', i, 'keywords:')
    print('    Keywords:', res.keywords)
    keywords_weight_str = ";".join(str(x) for x in res.keywords_weight)
    print('    Keyword weights:', keywords_weight_str)
    print('    Strength:', res.strength)
    doc_topic_exposure_sel = []  # list of non-zero doc_topic_exposures
    doc_id_sel = []        # list of doc ids matching doc_topic_exposure_sel
    for j in range(len(res.doc_topic_exposures)):
        doc_topic_exp = float(res.doc_topic_exposures[j])
        if doc_topic_exp != 0:
            doc_topic_exposure_sel.append(doc_topic_exp)
            doc_id_sel.append(doc_ids[j])

    doc_id_sel_str = ' '.join(str(x) for x in doc_id_sel)
    doc_topic_exposure_sel_str = ' '.join(str(x) for x in doc_topic_exposure_sel)
    print('    Document IDs:', doc_id_sel_str)
    print('    Document exposures:', doc_topic_exposure_sel_str)
    print('---------------')

Identify and extract a list of topics with a time range selection

dataset = 'trump_tweets'
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
period_start = "2016-10-15" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD"
period_end = "2019-01-01" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD"

try:
    payload = nucleus_api.Topics(dataset=dataset,                                
                                 query=query,                   
                                 custom_stop_words=custom_stop_words,     
                                 num_topics=num_topics,
                                 metadata_selection=metadata_selection,
                                 period_start=period_start,
                                 period_end=period_end)
    api_response = api_instance.post_topic_api(payload)        
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

doc_ids = api_response.result.doc_ids
topics = api_response.result.topics
for i, res in enumerate(topics):
    print('Topic', i, 'keywords:')
    print('    Keywords:', res.keywords)
    keywords_weight_str = ";".join(str(x) for x in res.keywords_weight)
    print('    Keyword weights:', keywords_weight_str)
    print('    Strength:', res.strength)
    doc_topic_exposure_sel = []  # list of non-zero doc_topic_exposure
    doc_id_sel = []        # list of doc ids matching doc_topic_exposure_sel
    for j in range(len(res.doc_topic_exposures)):
        doc_topic_exp = float(res.doc_topic_exposures[j])
        if doc_topic_exp != 0:
            doc_topic_exposure_sel.append(doc_topic_exp)
            doc_id_sel.append(doc_ids[j])

    doc_id_sel_str = ' '.join(str(x) for x in doc_id_sel)
    doc_topic_exposure_sel_str = ' '.join(str(x) for x in doc_topic_exposure_sel)
    print('    Document IDs:', doc_id_sel_str)
    print('    Document exposures:', doc_topic_exposure_sel_str)
    print('---------------')

Identify and extract a list of topics with a metadata selection

dataset = 'trump_tweets'
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
metadata_selection = {"author": "D_Trump16"} # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)

try:
    payload = nucleus_api.Topics(dataset=dataset,                                
                                 query=query,                   
                                 custom_stop_words=custom_stop_words,     
                                 num_topics=num_topics,
                                 metadata_selection=metadata_selection)
    api_response = api_instance.post_topic_api(payload)        
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

doc_ids = api_response.result.doc_ids
topics = api_response.result.topics
for i, res in enumerate(topics):
    print('Topic', i, 'keywords:')
    print('    Keywords:', res.keywords)
    keywords_weight_str = ";".join(str(x) for x in res.keywords_weight)
    print('    Keyword weights:', keywords_weight_str)
    print('    Strength:', res.strength)
    doc_topic_exposure_sel = []  # list of non-zero doc_topic_exposure
    doc_id_sel = []        # list of doc ids matching doc_topic_exposure_sel
    for j in range(len(res.doc_topic_exposures)):
        doc_topic_exp = float(res.doc_topic_exposures[j])
        if doc_topic_exp != 0:
            doc_topic_exposure_sel.append(doc_topic_exp)
            doc_id_sel.append(doc_ids[j])

    doc_id_sel_str = ' '.join(str(x) for x in doc_id_sel)
    doc_topic_exposure_sel_str = ' '.join(str(x) for x in doc_topic_exposure_sel)
    print('    Document IDs:', doc_id_sel_str)
    print('    Document exposures:', doc_topic_exposure_sel_str)
    print('---------------')

Identify and extract a list of topics without removing redundant content

dataset = 'trump_tweets'
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

try:
    payload = nucleus_api.Topics(dataset=dataset,                                
                                 query=query,                   
                                 custom_stop_words=custom_stop_words,     
                                 num_topics=num_topics,
                                 metadata_selection=metadata_selection,
                                 remove_redundancies=remove_redundancies)
    api_response = api_instance.post_topic_api(payload)        
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

doc_ids = api_response.result.doc_ids
topics = api_response.result.topics
for i, res in enumerate(topics):
    print('Topic', i, 'keywords:')
    print('    Keywords:', res.keywords)
    keywords_weight_str = ";".join(str(x) for x in res.keywords_weight)
    print('    Keyword weights:', keywords_weight_str)
    print('    Strength:', res.strength)
    doc_topic_exposure_sel = []  # list of non-zero doc_topic_exposure
    doc_id_sel = []        # list of doc ids matching doc_topic_exposure_sel
    for j in range(len(res.doc_topic_exposures)):
        doc_topic_exp = float(res.doc_topic_exposures[j])
        if doc_topic_exp != 0:
            doc_topic_exposure_sel.append(doc_topic_exp)
            doc_id_sel.append(doc_ids[j])

    doc_id_sel_str = ' '.join(str(x) for x in doc_id_sel)
    doc_topic_exposure_sel_str = ' '.join(str(x) for x in doc_topic_exposure_sel)
    print('    Document IDs:', doc_id_sel_str)
    print('    Document exposures:', doc_topic_exposure_sel_str)
    print('---------------')

Generate a summary for each topic

dataset = 'trump_tweets'
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in each topic summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
num_docs = 20 # int | The maximum number of key documents to use for summarization. (optional) (default to 20)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

metadata_selection ="" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
time_period = ""     # str | Time period selection. Choices: ["1M","3M","6M","12M","3Y","5Y",""]  (optional)
period_start = "" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_end = "" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
api_response = None

try:
    payload = nucleus_api.TopicSummaryModel(dataset=dataset,
                                            query=query,
                                            custom_stop_words=custom_stop_words,
                                            num_topics=num_topics,
                                            num_keywords=num_keywords,
                                            metadata_selection=metadata_selection,
                                            summary_length=summary_length,
                                            context_amount=context_amount,
                                            num_docs=num_docs)
    api_response = api_instance.post_topic_summary_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    for i,res in enumerate(api_response.result):
        print('Topic', i, 'summary:')
        print('    Keywords:', res.keywords)
        for j in range(len(res.summary)):
            print(res.summary[j])
            print('    Document ID:', res.summary[j].sourceid)
            print('        Title:', res.summary[j].title)
            print('        Sentences:', res.summary[j].sentences)
            print('        Author:', res.summary[j].attribute['author'])
            print('        Time:', datetime.datetime.fromtimestamp(float(res.summary[j].attribute['time'])))
        print('---------------')

Measure the sentiment on each topic

dataset = 'trump_tweets' # str | Dataset name
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
custom_dict_file = {"great": 1.0, "awful": -1.0, "clinton":-1.0, "trump":1.0} # dict | Custom sentiment dictionary. Example, {"word1": weight1, ..., "wordN": weightN} (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

metadata_selection ="" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
time_period = ""     # str | Time period selection. Choices: ["1M","3M","6M","12M","3Y","5Y",""] (optional)
period_start = "" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_end = "" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"

try:
    payload = nucleus_api.TopicSentimentModel(dataset=dataset,
                                              query=query,
                                              custom_stop_words=custom_stop_words,
                                              num_topics=num_topics,
                                              num_keywords=num_keywords,
                                              custom_dict_file=custom_dict_file)
    api_response = api_instance.post_topic_sentiment_api(payload)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for i,res in enumerate(api_response.result):
    print('Topic', i, 'sentiment:')
    print('    Keywords:', res.keywords)
    print('    Sentiment:', res.sentiment)
    print('    Strength:', res.strength)

    doc_id_str = ' '.join(str(x) for x in res.doc_ids)
    doc_sentiment_str = ' '.join(str(x) for x in res.doc_sentiments)
    doc_score_str = ' '.join(str(x) for x in res.doc_topic_exposures)
    print('    Document IDs:', doc_id_str)
    print('    Document Sentiments:', doc_sentiment_str)
    print('    Document Exposures:', doc_score_str)
    print('---------------')

Measure the consensus on each topic

dataset = 'trump_tweets' # str | Dataset name.
query = '' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
excluded_docs = [''] # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
custom_dict_file = {"great": 1.0, "awful": -1.0, "clinton":-1.0, "trump":1.0} # dict | Custom sentiment dictionary. Example, {"word1": weight1, ..., "wordN": weightN} (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

metadata_selection ="" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
time_period = ""     # str | Time period selection. Choices: ["1M","3M","6M","12M","3Y","5Y",""] (optional)
period_start = "" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_end = "" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"

try:
    payload = nucleus_api.TopicConsensusModel(dataset=dataset,
                                              query=query,
                                              custom_stop_words=custom_stop_words,
                                              num_topics=num_topics,
                                              num_keywords=num_keywords,
                                              custom_dict_file=custom_dict_file)
    api_response = api_instance.post_topic_consensus_api(payload)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for i, res in enumerate(api_response.result):
    print('Topic', i, 'consensus:')
    print('    Keywords:', res.keywords)
    print('    Consensus:', res.consensus)
    print('    Strength:', res.strength)
    print('---------------')

Perform a historical analysis of topics' strength, sentiment, and consensus

dataset = 'trump_tweets'   # str | Dataset name.
update_period = 'm' # str | Frequency at which the historical analysis is performed. Choices: ["d","m","H","M"] (default to d)
query = '' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
custom_stop_words = ["real","hillary"] # str | List of stop words (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
inc_step = 1 # int | Number of increments of the update period in between two historical computations. (optional) (default to 1)
excluded_docs = [''] # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
custom_dict_file = {} # file | Custom sentiment dictionary JSON file. Example, {"field1": value1, ..., "fieldN": valueN} (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

metadata_selection ="" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
time_period = "12M"     # str | Time period selection. Choices: ["1M","3M","6M","12M","3Y","5Y",""] (optional)
period_start = "" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_end = "" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
api_response = None
try:
    payload = nucleus_api.TopicHistoryModel(dataset=dataset,
                                            time_period=time_period,
                                            update_period=update_period,
                                            query=query,
                                            custom_stop_words=custom_stop_words,
                                            num_topics=num_topics,
                                            num_keywords=num_keywords,
                                            metadata_selection=metadata_selection,
                                            inc_step=inc_step,
                                            excluded_docs=excluded_docs,
                                            custom_dict_file=custom_dict_file)
    api_response = api_instance.post_topic_historical_analysis_api(payload)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

print('Printing historical metrics data...')
print('NOTE: historical metrics data can be plotted when running the example in Jupyter Notebook')

for i,res in enumerate(api_response.result):
    print('Topic', i, res.keywords)
    print('    Timestamps:', res.time_stamps)
    print('    Strengths:', res.strengths)
    print('    Consensuses:', res.consensuses)
    print('    Sentiments:', res.sentiments)
    print('----------------')


# chart the historical metrics when running in Jupyter Notebook
if running_notebook:
    print('Plotting historical metrics data...')
    historical_metrics = []
    for res in api_response.result:
        # construct a list of historical metrics dictionaries for charting
        historical_metrics.append({
            'topic'    : res.keywords,
            'time_stamps' : np.array(res.time_stamps),
            'strength' : np.array(res.strengths, dtype=np.float32),
            'consensus': np.array(res.consensuses, dtype=np.float32),
            'sentiment': np.array(res.sentiments, dtype=np.float32)})

    selected_topics = range(len(historical_metrics))
    #nucleus_helper.topic_charts_historical(historical_metrics, selected_topics, True)
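If the `nucleus_helper` charting utility is not available, the `historical_metrics` list built above can still be summarized numerically. The sketch below computes a mean strength and net change per topic from mock entries shaped like those dictionaries; it is an illustration, not an SDK function.

```python
import numpy as np

# Summarize each topic's historical strength trajectory with simple
# statistics; mock entries are shaped like the historical_metrics dicts above.
mock_metrics = [
    {"topic": "economy jobs",
     "strength": np.array([0.1, 0.2, 0.4, 0.3], dtype=np.float32)},
    {"topic": "border wall",
     "strength": np.array([0.5, 0.4, 0.2, 0.1], dtype=np.float32)},
]

def strength_trend(strengths):
    """Return (mean strength, net change over the period) for one topic."""
    s = np.asarray(strengths, dtype=np.float64)
    return float(s.mean()), float(s[-1] - s[0])

for m in mock_metrics:
    mean, change = strength_trend(m["strength"])
    direction = "rising" if change > 0 else "falling"
    print(f"{m['topic']}: mean={mean:.2f}, net change={change:+.2f} ({direction})")
```

The same computation applies unchanged to the `strength`, `consensus`, or `sentiment` arrays of the real `historical_metrics` entries.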

Determine the network of authors similar to a chosen contributor

dataset = 'trump_tweets' # str | Dataset name.
target_author = 'D_Trump16' # str | Name of the author to be analyzed.
query = '' # str | Fulltext query, using mysql MATCH boolean query format. Subject covered by the author, on which to focus the analysis of connectivity. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
custom_stop_words = ["real","hillary"] # str | List of words possibly used by the target author that are considered not information-bearing. (optional)
excluded_docs = [''] # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)

metadata_selection ="" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
time_period = "12M"     # str | Time period selection. Choices: ["1M","3M","6M","12M","3Y","5Y",""] (optional)
period_start = "" # str | Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_end = "" # str | End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS"

try:
    payload = nucleus_api.AuthorConnection(dataset=dataset,
                                           target_author=target_author,
                                           query=query,
                                           custom_stop_words=custom_stop_words,
                                           time_period=time_period,
                                           metadata_selection=metadata_selection,
                                           excluded_docs=excluded_docs)
    api_response = api_instance.post_author_connectivity_api(payload)    
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

res = api_response.result
print('Mainstream connections:')
for mc in res.mainstream_connections:
    print('    Keywords:', mc.keywords)
    print('    Authors:', " ".join(str(x) for x in mc.authors))

print('Niche connections:')
for nc in res.niche_connections:
    print('    Keywords:', nc.keywords)
    print('    Authors:', " ".join(str(x) for x in nc.authors))
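The mainstream and niche connections returned above can be flattened into a simple edge list if you want to build a network around the target author. A minimal sketch with mock connection data (the field names mirror the response attributes printed above):

```python
# Flatten mainstream/niche connections into an edge list keyed by the
# target author; mock dicts mirror the fields printed above.
mock_result = {
    "mainstream_connections": [
        {"keywords": "economy jobs", "authors": ["A_Smith", "B_Jones"]},
    ],
    "niche_connections": [
        {"keywords": "golf courses", "authors": ["C_Lee"]},
    ],
}

def build_edges(target_author, result):
    """Map the target author to every connected author, tagged by tier."""
    edges = []
    for tier in ("mainstream_connections", "niche_connections"):
        for conn in result[tier]:
            for author in conn["authors"]:
                edges.append((target_author, author, tier))
    return edges

print(build_edges("D_Trump16", mock_result))
```

With the real response, pass `api_response.result` (accessing attributes rather than dict keys) and feed the edge list to any graph library.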

Transfer learning of topics identified in one dataset onto another dataset

dataset0 = 'trump_tweets'
dataset1 = None # str | Validation dataset (optional if period_0 and period_1 dates provided)
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = [""] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
period_0_start = '2018-08-12' # Not needed if you provide a validation dataset in the "dataset1" variable
period_0_end = '2018-08-16' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_start = '2018-08-14' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_end = '2018-08-18' # Not needed if you provide a validation dataset in the "dataset1" variable
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

try:
    payload = nucleus_api.TopicTransferModel(dataset0=dataset0,
                                             dataset1=dataset1,
                                             query=query,
                                             custom_stop_words=custom_stop_words,
                                             num_topics=num_topics,
                                             num_keywords=num_keywords,
                                             period_0_start=period_0_start,
                                             period_0_end=period_0_end,
                                             period_1_start=period_1_start,
                                             period_1_end=period_1_end,
                                             metadata_selection=metadata_selection)
    api_response = api_instance.post_topic_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    doc_ids_t1 = api_response.result.doc_ids_t1
    topics = api_response.result.topics
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Strength:', res.strength)
        print('    Document IDs:', doc_ids_t1)
        print('    Exposure per Doc in Validation Dataset:', res.doc_topic_exposures_t1)
        print('---------------')
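The per-document exposures returned for the validation dataset can be paired with `doc_ids_t1` and ranked to surface the documents most aligned with a transferred topic. A minimal sketch with mock values standing in for `doc_ids_t1` and `doc_topic_exposures_t1`:

```python
# Rank validation documents by their exposure to a transferred topic;
# mock IDs and exposures stand in for doc_ids_t1 / doc_topic_exposures_t1.
mock_doc_ids_t1 = ["doc1", "doc2", "doc3"]
mock_exposures_t1 = [0.1, 0.7, 0.4]

def rank_by_exposure(doc_ids, exposures, top_n=2):
    """Return the top_n (doc_id, exposure) pairs, highest exposure first."""
    pairs = sorted(zip(doc_ids, exposures), key=lambda p: p[1], reverse=True)
    return pairs[:top_n]

print(rank_by_exposure(mock_doc_ids_t1, mock_exposures_t1))
```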

Transfer learning of topics, exogenously chosen, onto a dataset

dataset0 = 'trump_tweets'
dataset1 = None # str | Validation dataset (optional if period_0 and period_1 dates provided)
fixed_topics = [{"keywords": ["north korea", "nuclear weapons", "real estate"], "weights": [0.5, 0.3, 0.2]},
                {"keywords": ["America", "jobs", "stock market"], "weights": [0.3, 0.3, 0.3]}] # The weights are optional
query = ''
custom_stop_words = [""] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
period_0_start = '2018-08-12' # Not needed if you provide a validation dataset in the "dataset1" variable
period_0_end = '2018-08-16' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_start = '2018-08-14' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_end = '2018-08-18' # Not needed if you provide a validation dataset in the "dataset1" variable

excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

try:
    payload = nucleus_api.TopicTransferModel(dataset0=dataset0,
                                             dataset1=dataset1,
                                             fixed_topics=fixed_topics,
                                             query=query,
                                             custom_stop_words=custom_stop_words,
                                             num_topics=num_topics,
                                             num_keywords=num_keywords,
                                             period_0_start=period_0_start,
                                             period_0_end=period_0_end,
                                             period_1_start=period_1_start,
                                             period_1_end=period_1_end,
                                             metadata_selection=metadata_selection)
    api_response = api_instance.post_topic_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    doc_ids_t1 = api_response.result.doc_ids_t1
    topics = api_response.result.topics
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Strength:', res.strength)
        print('    Document IDs:', doc_ids_t1)
        print('    Exposure per Doc in Validation Dataset:', res.doc_topic_exposures_t1)
        print('---------------')

Transfer learning of topics identified in one dataset onto another dataset for sentiment analysis

dataset0 = 'trump_tweets'
dataset1 = None
#dataset1 = dataset # str | Validation dataset (optional if period_0 and period_1 dates provided)
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
#fixed_topics is also an available input argument
query = ''
custom_stop_words = [""] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
period_0_start = '2018-08-12' # Not needed if you provide a validation dataset in the "dataset1" variable
period_0_end = '2018-08-16' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_start = '2018-08-14' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_end = '2018-08-18' # Not needed if you provide a validation dataset in the "dataset1" variable
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
custom_dict_file = {"great": 1.0, "awful": -1.0, "clinton":-1.0, "trump":1.0} # file | Custom sentiment dictionary JSON file. Example, {"field1": value1, ..., "fieldN": valueN} (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

try:
    payload = nucleus_api.TopicSentimentTransferModel(dataset0=dataset0,
                                                      dataset1=dataset1,
                                                      query=query,
                                                      custom_stop_words=custom_stop_words,
                                                      num_topics=num_topics,
                                                      num_keywords=num_keywords,
                                                      period_0_start=period_0_start,
                                                      period_0_end=period_0_end,
                                                      period_1_start=period_1_start,
                                                      period_1_end=period_1_end,
                                                      metadata_selection=metadata_selection,
                                                      custom_dict_file=custom_dict_file)

    api_response = api_instance.post_topic_sentiment_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    topics = api_response.result
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Strength:', res.strength)
        print('    Sentiment:', res.sentiment)
        print('    Document IDs:', res.doc_ids_t1)
        print('    Sentiment per Doc in Validation Dataset:', res.doc_sentiments_t1)
        print('---------------')

Transfer learning of topics identified in one dataset onto another dataset for consensus analysis

dataset0 = 'trump_tweets'
dataset1 = None # str | Validation dataset (optional if period_0 and period_1 dates provided)
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
#fixed_topics is also an available input argument
query = ''
custom_stop_words = [""] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
metadata_selection = "" # dict | JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} (optional)
period_0_start = '2018-08-12' # Not needed if you provide a validation dataset in the "dataset1" variable
period_0_end = '2018-08-16' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_start = '2018-08-14' # Not needed if you provide a validation dataset in the "dataset1" variable
period_1_end = '2018-08-18' # Not needed if you provide a validation dataset in the "dataset1" variable
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
custom_dict_file = {"great": 1.0, "awful": -1.0, "clinton":-1.0, "trump":1.0} # file | Custom sentiment dictionary JSON file. Example, {"field1": value1, ..., "fieldN": valueN} (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

try:
    payload = nucleus_api.TopicConsensusTransferModel(dataset0=dataset0,
                                                      dataset1=dataset1,
                                                      query=query,
                                                      custom_stop_words=custom_stop_words,
                                                      num_topics=num_topics,
                                                      num_keywords=num_keywords,
                                                      period_0_start=period_0_start,
                                                      period_0_end=period_0_end,
                                                      period_1_start=period_1_start,
                                                      period_1_end=period_1_end,
                                                      metadata_selection=metadata_selection,
                                                      custom_dict_file=custom_dict_file)

    api_response = api_instance.post_topic_consensus_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    topics = api_response.result
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Consensus:', res.consensus)
        print('---------------')

Extract a topic contrasting two content categories

dataset = 'trump_tweets' # str | Dataset name.
metadata_selection = {"content": "Trump"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other
query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["real","hillary"] # List of stop words. (optional)
time_period = "1M" # str | Alternative 1: time period counting back from today over which the analysis is conducted (optional)
period_start = '2018-08-12' # str | Alternative 2: start of period over which the analysis is conducted (optional)
period_end = '2018-08-15' # str | Alternative 2: end of period over which the analysis is conducted (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted topic. Bounded between 0 and 1; the smaller it is, the more contrasting terms are captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retain only one copy of it. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

try:
    payload = nucleus_api.TopicContrastModel(dataset=dataset,
                                             metadata_selection=metadata_selection)
    api_response = api_instance.post_topic_contrast_api(payload)

    print('Contrasted Topic')
    print('    Keywords:', api_response.result.keywords)
    print('    Keywords Weight:', api_response.result.keywords_weight)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Analyze documents

This section goes over all APIs that enable users to analyze documents in a dataset on a standalone basis.

Retrieve the metadata information of documents

dataset = 'trump_tweets'
# doc_titles, doc_ids, and metadata_selection below are filters that narrow down
# the documents to be retrieved.
# The information of all documents will be retrieved when no filters are provided.

# doc_titles: list of strings
# The titles of the documents to retrieve. Example: ["title1", "title2", ..., "titleN"]  (optional)
# doc_titles = ['D_Trump2018_8_18_1_47']   
doc_titles = []
# doc_ids: list of strings
# The docid of the documents to retrieve. Example: ["docid1", "docid2", ..., "docidN"]  (optional)
# doc_ids = ['3397215194896514820', '776902852041351634']
doc_ids = []

# metadata_selection = {"author": "D_Trump16"} # dict | A selector on metadata. Example: {"field": "value"}  (optional)
metadata_selection = ''

try:
    payload = nucleus_api.DocInfo(dataset=dataset,
                                  doc_titles=doc_titles,
                                  doc_ids=doc_ids,
                                  metadata_selection=metadata_selection)
    api_response = api_instance.post_doc_info(payload)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for res in api_response.result:
    print('Document ID:', res.sourceid)
    print('    title:', res.title)
    for attr in res.attribute.keys():
        if attr == 'time':
            print('   ', attr, ':', datetime.datetime.fromtimestamp(float(res.attribute[attr])))
        else:
            print('   ', attr, ':', res.attribute[attr])
    print('---------------')
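Retrieved metadata is often easiest to inspect as a flat file. The sketch below serializes mock records shaped like the `DocInfo` results above to CSV with the standard library; the field names (`sourceid`, `title`, `author`, `time`) are assumptions based on the attributes printed in the loop above.

```python
import csv
import io

# Write mock DocInfo-style records to CSV; fields mirror the attributes
# printed above (sourceid, title, plus per-document metadata).
mock_docs = [
    {"sourceid": "776902852041351634", "title": "D_Trump2018_8_18_1_47",
     "author": "D_Trump16", "time": "1534556820"},
]

def docs_to_csv(docs, fieldnames=("sourceid", "title", "author", "time")):
    """Serialize document metadata dicts to a CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()

print(docs_to_csv(mock_docs))
```

Write the returned string to a file, or swap `io.StringIO` for an open file handle, to persist the export.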

Retrieve the metadata information of documents with a metadata filter

dataset = 'trump_tweets' # str | Dataset name.
metadata_selection = {"author": "D_Trump16"}      # dict | A selector on metadata. Example: {"field": "value"}  (optional)

try:
    payload = nucleus_api.DocInfo(dataset=dataset, metadata_selection=metadata_selection)
    api_response = api_instance.post_doc_info(payload)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for res in api_response.result:
    print('Document ID:', res.sourceid)
    print('    title:', res.title)
    for attr in res.attribute.keys():
        if attr == 'time':
            print('   ', attr, ':', datetime.datetime.fromtimestamp(float(res.attribute[attr])))
        else:
            print('   ', attr, ':', res.attribute[attr])
    print('---------------')

Display chosen documents, content and metadata included

dataset = 'trump_tweets' # str | Dataset name.
#doc_titles = ['D_Trump2018_8_18_1_47']   # str | The title of the documents to retrieve. Example: ["title1", "title2", ..., "titleN"]  (optional)
doc_ids = ['776902852041351634']      # str | The docid of the documents to retrieve. Example: ["docid1", "docid2", ..., "docidN"]  (optional)

try:
    payload = nucleus_api.DocDisplay(dataset=dataset, doc_ids=doc_ids)
    api_response = api_instance.post_doc_display(payload)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for res in api_response.result:
    print('Document ID:', res.sourceid)
    print('    Title:', res.title)
    print('    Author:', res.attribute['author'])
    print('    Time:', datetime.datetime.fromtimestamp(float(res.attribute['time'])))
    print('    Content:', res.content)
    print('---------------')

Display chosen documents, content and metadata included, with a metadata filter

dataset = 'trump_tweets' # str | Dataset name.
metadata_selection = {"author": "D_Trump16"}      # dict | A selector on metadata. Example: {"field": "value"}  (optional)

try:
    payload = nucleus_api.DocDisplay(dataset=dataset, metadata_selection=metadata_selection)
    api_response = api_instance.post_doc_display(payload)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for res in api_response.result:
    print('Document ID:', res.sourceid)
    print('    Title:', res.title)
    print('    Author:', res.attribute['author'])
    print('    Time:', datetime.datetime.fromtimestamp(float(res.attribute['time'])))
    print('    Content:', res.content)
    print('---------------')

Generate document recommendations on topics

dataset = 'trump_tweets' # str | Dataset name.
#query = '("Trump" OR "president")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)
query = ''
custom_stop_words = ["real","hillary"] # str | List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the dataset. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the dataset. (optional) (default to 8)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

try:
    payload = nucleus_api.DocumentRecommendModel(dataset=dataset,
                                                 query=query,
                                                 custom_stop_words=custom_stop_words,
                                                 num_topics=num_topics,
                                                 num_keywords=num_keywords)
    api_response = api_instance.post_doc_recommend_api(payload)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

for i, res in enumerate(api_response.result):
    print('Document recommendations for topic', i, ':')
    print('    Keywords:', res.keywords)

    for j, doc in enumerate(res.recommendations):
        print('    Recommendation', j, ':')
        print('        Document ID:', doc.sourceid)
        print('        Title:', doc.title)
        print('        Attribute:', doc.attribute)
        print('        Author:', doc.attribute['author'])
        print('        Time:', datetime.datetime.fromtimestamp(float(doc.attribute['time'])))
    print('---------------')
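When the same document is recommended under several topics, a deduplicated reading list is often more useful. A minimal sketch over mock results (the `keywords` and `recommendations` field names mirror the loop above):

```python
# Deduplicate recommended documents across topics, keeping first-seen order;
# mock results mirror the (keywords, recommendations) fields printed above.
mock_results = [
    {"keywords": "economy jobs",
     "recommendations": [{"sourceid": "a", "title": "T1"},
                         {"sourceid": "b", "title": "T2"}]},
    {"keywords": "border wall",
     "recommendations": [{"sourceid": "b", "title": "T2"},
                         {"sourceid": "c", "title": "T3"}]},
]

def unique_recommendations(results):
    """Collect recommendations across topics, dropping repeated sourceids."""
    seen, unique = set(), []
    for topic in results:
        for doc in topic["recommendations"]:
            if doc["sourceid"] not in seen:
                seen.add(doc["sourceid"])
                unique.append(doc)
    return unique

print([d["sourceid"] for d in unique_recommendations(mock_results)])
```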

Summarize a document

dataset = 'trump_tweets' # str | Dataset name.
doc_title = 'D_Trump2018_8_17_14_10' # str | The title of the document to be summarized.
custom_stop_words = ["real","hillary"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the document summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)

try:
    payload = nucleus_api.DocumentSummaryModel(dataset=dataset,
                                               doc_title=doc_title,
                                               custom_stop_words=custom_stop_words,
                                               summary_length=summary_length,
                                               context_amount=context_amount,
                                               short_sentence_length=short_sentence_length,
                                               long_sentence_length=long_sentence_length)
    api_response = api_instance.post_doc_summary_api(payload)

    print('Summary for', api_response.result.doc_title)
    for sent in api_response.result.summary.sentences:
        print('    *', sent)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Summarize what makes a document stand out from the background

dataset = 'trump_tweets' # str | Dataset name.
metadata_selection = {"content": "Trump"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other
query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["real","hillary"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the contrasted summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)
time_period = "1M" # str | Alternative 1: time period counting back from today over which the analysis is conducted (optional)
period_start = '2018-08-12' # str | Alternative 2: start of period over which the analysis is conducted (optional)
period_end = '2018-08-15' # str | Alternative 2: end of period over which the analysis is conducted (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example: ["docid1", "docid2", ..., "docidN"] (optional)
syntax_variables = True # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted summary. Between 0 and 1; the smaller it is, the more contrasting terms are captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate has the same NLP representation, but not necessarily the exact same text. (optional) (default to False)

try:
    payload = nucleus_api.DocumentContrastSummaryModel(dataset=dataset,
                                                       metadata_selection=metadata_selection)
    api_response = api_instance.post_document_contrast_summary_api(payload)

    print('Summary for', list(metadata_selection.values()))
    for sent in api_response.result.class_1_content.sentences:
        print('    *', sent)
    print('======')
    for sent in api_response.result.class_2_content.sentences:
        print('    *', sent)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Measure the sentiment of a document

dataset = 'trump_tweets' # str | Dataset name.
doc_title = 'D_Trump2018_8_17_14_10' # str | The title of the document to be analyzed.
custom_stop_words = ["real","hillary"] # List of stop words. (optional)
num_topics = 8 # int | Number of topics to be extracted from the document. (optional) (default to 8)
num_keywords = 8 # int | Number of keywords per topic that is extracted from the document. (optional) (default to 8)

try:
    payload = nucleus_api.DocumentSentimentModel(dataset=dataset,
                                                 doc_title=doc_title,
                                                 custom_stop_words=custom_stop_words,
                                                 num_topics=num_topics,
                                                 num_keywords=num_keywords)
    api_response = api_instance.post_doc_sentiment_api(payload)

    print('Sentiment for', api_response.result.doc_title)
    print(api_response.result.sentiment)

except ValueError as e:
    print('ERROR:', e)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Classify documents based on a topic contrasting two content categories

dataset = 'trump_tweets' # str | Dataset name.
fixed_topics = {"keywords": ["america", "jobs", "economy"], "weights": [0.5, 0.25, 0.25]} # dict | The contrasting topic used to separate the two categories of documents

# Here we want to classify documents that talk about Trump vs documents that don't talk about Trump based on their exposure to the topic [america, jobs, economy]
# A more natural classification task for the algo is to define metadata-based categories such as metadata_selection = {"document_category": ["speech", "press release"]}
metadata_selection = {"content": "Trump"} # dict | The metadata selection defining the two categories of documents that a document can be classified into
query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["real","hillary"] # List of stop words. (optional)
time_period = "1M" # str | Alternative 1: time period counting back from today over which the analysis is conducted (optional)
period_start = '2018-08-12' # str | Alternative 2: start of period over which the analysis is conducted (optional)
period_end = '2018-08-15' # str | Alternative 2: start of period over which the analysis is conducted (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example: ["docid1", "docid2", ..., "docidN"] (optional)
syntax_variables = True # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
validation_phase = False # bool | If True, the classifier assumes that the dataset provided is labeled with the 2 classes and will use that to compute accuracy/precision/recall (optional) (default to False)
threshold = 0 # float | Threshold value for a document's exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = False # bool | If True, removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate has the same NLP representation, but not necessarily the exact same text. (optional) (default to False)

try:
    payload = nucleus_api.DocClassifyModel(dataset=dataset,
                                            fixed_topics=fixed_topics,
                                            metadata_selection=metadata_selection)
    api_response = api_instance.post_doc_classify_api(payload)

    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
    print('    Actual Category:', api_response.result.detailed_results.true_class)
    print('\n')
    if validation_phase:
        print('Perf Metrics')
        print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
        print('    Recall:', api_response.result.perf_metrics.recall)
        print('    Precision:', api_response.result.perf_metrics.precision)

except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Tag documents based on pre-determined named-entity recognition

dataset = 'trump_tweets' # str | Dataset name.
try:
    payload = nucleus_api.DatasetTagging(dataset=dataset,
                                         query='new york city OR big apple OR NYC OR New York',
                                         metadata_selection='',
                                         time_period='',
                                         period_start='2010-01-01',
                                         period_end='2019-04-30')

    api_response = api_instance.post_dataset_tagging(payload)
    print(api_response)
    print('    Entities tagged:', api_response.result.entities_tagged)
    print('    Docids tagged with the entities:', api_response.result.doc_ids)
    print('    Entities count:', api_response.result.entities_count)
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])

Summarize files from a URL

######################################################################################
# file_params field descriptions:
#   file_url              : string, the URL at which the file is stored (an S3 bucket address, for instance)
#   filename              : OPTIONAL string, filename saved on the server; also serves as the doc_title for summarization
#   custom_stop_words     : OPTIONAL a string list, user-provided list of stopwords to be excluded from the content analysis leading to document summarization
#                            ["word1", "word2", ...]. DEFAULT: empty
#   summary_length        : OPTIONAL an integer, the maximum number of bullet points a user wants to see in the document summary. DEFAULT: 6
#   context_amount        : OPTIONAL an integer, the number of sentences surrounding key summary sentences in the original document that a user wants to see in the document summary. DEFAULT: 0
#   short_sentence_length : OPTIONAL an integer, the sentence length below which a sentence is excluded from summarization. DEFAULT: 4 words
#   long_sentence_length  : OPTIONAL an integer, the sentence length beyond which a sentence is excluded from summarization. DEFAULT: 40 words

file_params = {
    'file_url': 'https://s3-us-west-2.amazonaws.com/sumup-public/nucleus-sdk/quarles20181109a.docx',
    'filename': 'quarles20181109a-newname.docx',
    'custom_stop_words': ["document", "sometimes"],
    'summary_length': 6,
    'context_amount': 0,
    'short_sentence_length': 4,
    'long_sentence_length': 40}

result = nucleus_helper.summarize_file_url(api_instance, file_params)

print('Summary for', result.doc_title, ':')
for sent in result.summary.sentences:
    print('    *', sent)

Use Cases

The following sections contain notebooks that illustrate possible use cases of the Nucleus APIs.

They are divided into:

  1. Equity Trading
  1. Fixed Income Trading
  1. Media Intelligence / Compliance
  1. Summarization
  1. Entity Tagging
  1. Sentiment Dictionaries & Data Labeling
  1. Transfer Learning
  1. Contrast Analysis



SumUp Analytics, Proprietary & Confidential (Disclaimers and Terms of Service available at www.sumup.ai).

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.


Equity Trading

Single Name Screen

Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import os
import csv
import json
from pathlib import Path
import numpy as np
import datetime
import regex as re
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

print('--------- Append all files from local folder to dataset in parallel -----------')
folder = 'Corporate_documents'         
dataset = 'Corporate_docs'  # str | Destination dataset where the files will be inserted.

# build file iterable from a folder recursively.
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'ticker': 'AAPL',
                                      'company': 'Apple',
                                      'category': 'Earning Calls',
                                      'date': '2019-01-01'}}
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)


dataset = "Corporate_docs"
period_start = "2010-01-01"
period_end= "2019-06-01"

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=["FB", "AMZN", "INTL", "IBM", "NFLX", "GOOG"],
                                filing_types=["10-K", "10-K/A", "10-Q", "10-Q/A"],
                                sections=["Quantitative and Qualitative Disclosures about Market Risk",
                                          "Management's Discussion and Analysis of Financial Condition and Results of Operations",
                                          "Risk Factors"],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

You can subsequently work on specific time periods within your dataset directly in the APIs, as illustrated below.
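For instance, here is a minimal sketch of building the period_start / period_end strings for a 90-day window, using only the standard-library datetime module; the date format matches the payloads used throughout this tutorial, and the end date below is just an example:

```python
import datetime

# Example: a 90-day analysis window ending 2019-06-01, formatted as the
# period_start / period_end strings accepted by the analysis payloads
end_date = datetime.datetime(2019, 6, 1)
start_date = end_date - datetime.timedelta(days=90)

period_start = start_date.strftime("%Y-%m-%d 00:00:00")
period_end = end_date.strftime("%Y-%m-%d 00:00:00")

print(period_start, '->', period_end)  # 2019-03-03 00:00:00 -> 2019-06-01 00:00:00
```

These strings can then be passed as the period_start and period_end arguments of an analysis payload such as nucleus_api.TopicSentimentModel.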

2. Sentiment and Topic Contribution = Screen Analysis

# Determine which companies are associated with the documents contributing to the topics
payload = nucleus_api.DocInfo(dataset='Corporate_docs')
api_response = api_instance.post_doc_info(payload)

company_sources = []
for res in api_response.result:        
    company_sources.append(res.attribute['ticker'])

company_list = np.unique(company_sources)


print('-------- Get topic sentiment and exposure per firm ----------------')

payload = nucleus_api.TopicSentimentModel(dataset='Corporate_docs',          
                                query='interest rates',                   
                                num_topics=20,
                                num_keywords=8)
try:
    api_response = api_instance.post_topic_sentiment_api(payload)    
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for i, res in enumerate(api_response.result):
        print('Topic', i, 'sentiment:')
        print('    Keywords:', res.keywords)

        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids = res.doc_ids)
        try:
            api_response1 = api_instance.post_doc_info(payload)
            api_ok = True
        except ApiException as e:
            api_error = json.loads(e.body)
            print('ERROR:', api_error['message'])
            api_ok = False

        if api_ok:
            company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(res1.attribute['ticker'])

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, i] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]  

            print('---------------')


    # Add up the ranking of companies per topic into the final company screen
    Corporate_screen = np.mean(company_rankings, axis=1)

print('------------ Retrieve all companies found in the dataset ----------')

payload = nucleus_api.DocInfo(dataset='Corporate_docs')
api_response = api_instance.post_doc_info(payload)

company_sources = []
for res in api_response.result:        
    company_sources.append(res.attribute['ticker'])

company_list = np.unique(company_sources)


print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='Corporate_docs', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the ranking of companies
T = 90 # The look-back period in days

Corporate_screen = []
for i in range(delta.days):  
    if i == 0:
        end_date = first_date + datetime.timedelta(days=T)

    # first and last date used for the lookback period of T days
    start_date = end_date - datetime.timedelta(days=T)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")

    # We want a daily indicator
    end_date = end_date + datetime.timedelta(days=1)
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")

    payload = nucleus_api.TopicSentimentModel(dataset="Corporate_docs",      
                                query='',                   
                                num_topics=20,
                                num_keywords=8,
                                period_start=start_date_str,
                                period_end=end_date_str)
    api_response = api_instance.post_topic_sentiment_api(payload)

    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for l, res in enumerate(api_response.result):
        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_ids)
        api_response1 = api_instance.post_doc_info(payload)

        company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
        for res1 in api_response1.result:        
            company_sources.append(res1.attribute['ticker'])

        company_contributions = np.zeros([len(company_list), 1])
        for j in range(len(company_list)):
            for k in range(len(company_sources)):
                if company_sources[k] == company_list[j]:
                    company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

        company_rankings[:, l] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]       

    # Add up the ranking of companies per topic into the final company screen
    Corporate_screen.append(np.mean(company_rankings, axis=1))

3. Results Interpretation

4. Fine Tuning

4.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                       
                                query='',                       
                                num_topics=20,
                                num_keywords=8,
                                period_start="2018-11-01 00:00:00",
                                period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

custom_stop_words = ["call","report"] # list | List of stop words. (optional)

You can then tailor the screen analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

4.b Focusing the screen analysis on certain subjects

query = '(earnings OR debt OR competition OR lawsuit OR restructuring)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

If you decide to focus the screen analysis, for instance on financial health and corporate action subjects, simply substitute the query variable in the main code of section 2 with:

4.c Exploring the impact of the type of documents, the lookback period, the number of topics being extracted

num_topics: You can compute the single names' screen using different breadths of topics by changing the num_topics variable in the payload of the main code of section 2. A larger value provides more breadth in establishing rankings, while a smaller value provides a shallower measure. If num_topics is too large, some very marginal topics may introduce noise into the company rankings.

T: You can compute the single names' screen with different speeds of propagation by changing the lookback variable T in the main code of section 2. A larger value produces a slowly changing ranking, while a smaller value leads to a very responsive ranking. If T is too small, too few documents may be used, adding noise to the company rankings. If T is too long, the rankings won't reflect important new information quickly enough.

metadata_selection = {"category": "Report"}   # dict | JSON object of {"metadata_field": ["selected_values"]} (optional)

Document types: You can investigate how the single names' screen changes if it is measured using only one type of document (company reports, press releases, or earning call transcripts) rather than the whole content, by leveraging the metadata selector provided during the construction of the dataset. Rerun the main code of section 2 on a subset of the whole corpus: create a metadata_selection variable and pass it into the payload:
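Putting the fine-tuning options of sections 4.a, 4.b and 4.c together, the section 2 payload can be tailored in a single call. A minimal sketch, shown as a plain keyword dictionary for clarity; the values are examples only:

```python
# Fine-tuning options gathered as keyword arguments for the section 2 payload
tailored_args = dict(
    dataset='Corporate_docs',
    custom_stop_words=["call", "report"],                                   # 4.a
    query='(earnings OR debt OR competition OR lawsuit OR restructuring)',  # 4.b
    num_topics=20,                                                          # 4.c
    metadata_selection={"category": "Report"},                              # 4.c
)

# payload = nucleus_api.TopicSentimentModel(**tailored_args)
print(sorted(tailored_args))
```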

5. Next Steps

Stocks' Sentiment

Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import os
import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

print('--------- Append all files from local folder to dataset in parallel -----------')
folder = 'Sellside_research'         
dataset = 'Sellside_research'  # str | Destination dataset where the files will be inserted.

# build file iterable from a folder recursively.
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'ticker': 'AAPL',
                                      'company': 'Apple',
                                      'bank': 'Credit Suisse',
                                      'category': 'sell side research',
                                      'date': '2019-01-01'}}
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

dataset = "Corporate_docs"
period_start = "2010-01-01"
period_end= "2019-06-01"

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=["FB", "AMZN", "INTL", "IBM", "NFLX", "GOOG"],
                                filing_types=["8-K", "8-K/A"],
                                sections=[],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

You can subsequently work on specific time periods within your dataset directly in the APIs, as illustrated below.

2. Measuring the Sentiment of one Stock

# Extract all the documents that relate to a chosen company
import numpy as np

payload = nucleus_api.DocInfo(dataset='Sellside_research',
                             metadata_selection={'ticker': 'AAPL'})
api_response = api_instance.post_doc_info(payload)

doc_list = []
for res in api_response.result:        
    doc_list.append(res.title)

doc_list = np.unique(doc_list)

print('-------- Get the sentiment of each document ----------------')
reports_sentiment = []
for i in range(len(doc_list)):
    payload = nucleus_api.DocumentSentimentModel(dataset='Sellside_research',
                                                doc_title=doc_list[i],
                                                custom_stop_words=[],
                                                num_topics=10,
                                                num_keywords=10)
    try:
        api_response = api_instance.post_doc_sentiment_api(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:    
        reports_sentiment.append(api_response.result.sentiment)

# Add up the sentiment from each report into a sentiment for the company
company_sentiment = np.mean(reports_sentiment)

import numpy as np

# List of companies you are interested in
company_list = ['AAPL', 'GOOG', 'FB', 'BABA', 'NFLX']

# Go through each of them and get the sentiment
company_sentiment = []
for i in range(len(company_list)):  
    # Get all docs discussing a given company
    payload = nucleus_api.DocInfo(dataset='Sellside_research',
                                 metadata_selection={'ticker': company_list[i]})
    try:
        api_response = api_instance.post_doc_info(payload)
        api0_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api0_ok = False

    if api0_ok:
        doc_list = []
        for res in api_response.result:        
            doc_list.append(res.title)

        doc_list = np.unique(doc_list)

        # Get the sentiment of each document
        reports_sentiment = []
        for j in range(len(doc_list)):
            payload = nucleus_api.DocumentSentimentModel(dataset='Sellside_research',
                                                        doc_title=doc_list[j],
                                                        custom_stop_words=[],
                                                        num_topics=10,
                                                        num_keywords=10)
            try:
                api_response = api_instance.post_doc_sentiment_api(payload)
                api_ok = True
            except ApiException as e:
                api_error = json.loads(e.body)
                print('ERROR:', api_error['message'])
                api_ok = False

            if api_ok:
                reports_sentiment.append(api_response.result.sentiment)

        # Add up the sentiment from each report into a sentiment for the company
        company_sentiment.append(np.mean(reports_sentiment))

3. Results Interpretation

4. Fine Tuning

4.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                       
                            query='',                       
                            num_topics=20,
                            num_keywords=5,
                            metadata_selection={'ticker': 'AAPL'})
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

custom_stop_words = ["call","report"] # list | List of stop words. (optional)

You can then tailor the company's sentiment by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

4.b Exploring the impact of the type of documents, the lookback period, the number of topics being extracted

metadata_selection = {"category": "News"}   # dict | JSON object of {"metadata_field": ["selected_values"]} (optional)

num_topics: You can compute the company's sentiment using different breadths of topics by changing the num_topics variable in the payload of the main code of section 2. A larger value provides more breadth in establishing sentiment, while a smaller value provides a shallower measure. If num_topics is too large, some very marginal topics may introduce noise into the company sentiment.

Document types: You can investigate how the company's sentiment changes if it is measured using sell-side research vs. news vs. company publications. Rerun the main code of section 2 on those different datasets. You could also construct a dataset with all the content across providers and then select only certain types of documents using a metadata_selection:

5. Next Steps

Consensus in Sell-Side Research

Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import os
import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

print('--------- Append all files from local folder to dataset in parallel -----------')
folder = 'Sellside_research'         
dataset = 'Sellside_research'  # str | Destination dataset where the files will be inserted.

# build file iterable from a folder recursively.
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'ticker': 'AAPL',
                                      'company': 'Apple',
                                      'bank': 'Credit Suisse',
                                      'category': 'sell side research',
                                      'date': '2019-01-01'}}
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

This can be done directly in the APIs that perform content analysis, as shown below.

2. Measuring the Consensus on the sell-side for one Stock

# Extract all the documents that relate to a chosen company
import numpy as np

payload = nucleus_api.DocInfo(dataset='Sellside_research',
                             metadata_selection={'ticker': 'AAPL'})
api_response = api_instance.post_doc_info(payload)

doc_list = []
for res in api_response.result:        
    doc_list.append(res.title)

doc_list = np.unique(doc_list)

print('-------- Get the sentiment of each document ----------------')
reports_sentiment = []
for i in range(len(doc_list)):
    payload = nucleus_api.DocumentSentimentModel(dataset='Sellside_research',
                                                doc_title=doc_list[i],
                                                custom_stop_words="",
                                                num_topics=10,
                                                num_keywords=10)
    try:
        api_response = api_instance.post_doc_sentiment_api(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:       
        reports_sentiment.append(api_response.result.sentiment)

# Deduce the consensus among sell-side reports for the company:
# the share of reports agreeing with the majority sentiment sign
sentiments = np.array(reports_sentiment)
company_consensus = max((sentiments > 0).sum(), (sentiments < 0).sum(), (sentiments == 0).sum()) / len(sentiments)
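On a hypothetical list of report sentiment values, the majority-share consensus works out as follows:

```python
import numpy as np

# Hypothetical sentiment scores for five reports on the same company
reports_sentiment = [0.4, 0.1, -0.2, 0.3, 0.0]

sentiments = np.array(reports_sentiment)
# Count positive, negative and neutral reports, and take the largest camp
majority_count = max((sentiments > 0).sum(), (sentiments < 0).sum(), (sentiments == 0).sum())
company_consensus = majority_count / len(sentiments)
print(company_consensus)  # 3 of the 5 reports are positive -> 0.6
```

A consensus of 1.0 would mean every report agrees on the sentiment sign; values near 1/3 indicate a split coverage.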






















import numpy as np

# List of companies you are interested in
company_list = ['AAPL', 'GOOG', 'FB', 'BABA', 'NFLX']

# Go through each of them and get the consensus
company_consensus = []
for i in range(len(company_list)):  
    # Get all docs discussing a given company
    payload = nucleus_api.DocInfo(dataset='Sellside_research',
                                 metadata_selection={'ticker': company_list[i]})
    api_response = api_instance.post_doc_info(payload)

    doc_list = []
    for res in api_response.result:        
        doc_list.append(res.title)

    doc_list = np.unique(doc_list)

    # Get the sentiment of each document
    reports_sentiment = []
    for j in range(len(doc_list)):
        payload = nucleus_api.DocumentSentimentModel(dataset='Sellside_research',
                                                    doc_title=doc_list[j],
                                                    custom_stop_words="",
                                                    num_topics=10,
                                                    num_keywords=10)
        try:
            api_response = api_instance.post_doc_sentiment_api(payload)
            api_ok = True
        except ApiException as e:
            api_error = json.loads(e.body)
            print('ERROR:', api_error['message'])
            api_ok = False

        if api_ok:   
            reports_sentiment.append(api_response.result.sentiment)

    # Deduce the consensus among sell-side reports for the company
    sentiments = np.array(reports_sentiment)
    company_consensus.append(float(max((sentiments > 0).sum(), (sentiments < 0).sum(), (sentiments == 0).sum()) / len(sentiments)))





























3. Results Interpretation

4. Fine Tuning

4.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                       
                            query='',                       
                            num_topics=20,
                            num_keywords=5,
                            metadata_selection={'ticker': 'AAPL'})
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')











custom_stop_words = ["call","report"] # str | List of stop words. (optional)

You can then tailor the company's consensus by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

4.b Exploring the impact of the type of documents, the lookback period, the number of topics being extracted

metadata_selection = {"category": "News"}   # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)

num_topics: You can compute the company's consensus using a different breadth of topics by changing the num_topics variable in the payload of the main code of section 2. A larger value provides more breadth in establishing consensus, while a smaller value provides a shallower measure. If num_topics is too large, some very marginal topics may introduce noise into the consensus measure.

Document types: You can investigate how the company's consensus changes when it is measured on sell-side research versus news versus company publications. Rerun the main code of section 2 on those different datasets. You could also construct a dataset with all the content across providers and then select only certain types of documents using a metadata_selection:

5. Next Steps

Contrasting Sell-side Analysts

Objective:

In its current version, SumUp's contrast analysis compares two categories of documents against each other, where the user defines what the two categories are.

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

folder = 'Sellside_research'         
dataset = 'Sellside_research' # str | Destination dataset where the files will be inserted.

# build file iterable from a folder recursively.
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'company': 'Apple',
                                  'research_analyst': 'MS', # Modify the logic here to extract the name of the bank for a given report
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)


























2. Document Contrasted Summarization

print('---------------- Get doc contrasted summaries ------------------------')
metadata_selection_contrast = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the contrasted summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
num_keywords = 20 # integer | Number of keywords for the contrasted topic that is extracted from the dataset. (optional) (default to 50)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate has the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.DocumentContrastSummaryModel(dataset="Sellside_research",
                                                    metadata_selection_contrast=metadata_selection_contrast,
                                                    custom_stop_words=custom_stop_words,
                                                    period_start='2018-01-01',
                                                    period_end='2019-01-01')
try:
    api_response = api_instance.post_document_contrast_summary_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Summary for', [x for x in  metadata_selection_contrast.values()])
    for sent in api_response.result.class_1_content.sentences:
        print('    *', sent)
    print('======')
    for sent in api_response.result.class_2_content.sentences:
        print('    *', sent)   
















bank_list = ["MS", "GS", "JPM", "Citi", "BofA", "Barcap", "CS"]
custom_stop_words = ["morgan stanley", "goldman sachs", "jp morgan", "citigroup", "bank of america", "barclays", "credit suisse", "disclaimer", "disclosures"] # List of stop words. (optional)

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)

contrasts = []
for i in range(len(bank_list)):
    metadata_selection_contrast = {"research_analyst": bank_list[i]} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

    payload = nucleus_api.DocumentContrastSummaryModel(dataset="Sellside_research",
                                                        metadata_selection_contrast=metadata_selection_contrast,
                                                        custom_stop_words=custom_stop_words,
                                                        period_start='2019-01-01',
                                                        period_end='2019-06-01')
    try:
        api_response = api_instance.post_document_contrast_summary_api(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:   
        contrasts.append({"bank": bank_list[i], "contrast":api_response.result.class_1_content.sentences})

Extend the analysis to a list of sell-side coverages

















3. Fine Tuning

3.a Specifying the metadata_selection_contrast for your contrasted topic


metadata_selection_contrast = {"research_analyst": ["MS", "JPM"]}


metadata_selection_contrast = {"bank": ["federal_reserve", "ECB"]}
metadata_selection_contrast = {"document_category": ["speech", "press release"]}

metadata_selection_contrast = {"content": "fundamentals"}

3.b Excluding certain content from the contrasted summaries

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                         
                            query='',                       
                            num_topics=8,
                            num_keywords=8,
                            metadata_selection=metadata_selection_contrast)
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')











custom_stop_words = ["disclaimer","disclosure"] # str | List of stop words. (optional)

Using your domain expertise, client input, or advisor input, you can determine whether some of those topics or keywords are insufficiently differentiated to contribute to the document contrasted summaries.

You can then tailor the document contrasted summaries by creating a custom_stop_words variable that contains those words. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

3.c Focusing the contrasted summary on specific subjects potentially discussed in your corpus

query = '(earnings OR cash flows)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

query: You can refine the contrasted summary by leveraging the query variable of the Doc Contrasted Summary API.

Rerun any of the 3 Contrast Analysis APIs on the content from your corpus that mentions a specific theme. Create a variable query and pass it in to the payload:

Single Names' ESG Scoring

Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import datetime
import regex as re
import numpy as np
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'
# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

dataset = "Corporate_docs"
period_start = "2015-01-01"
period_end= "2019-06-01"
tickers = ['AAPL','MSFT','INTC','CSCO','MA','ORCL','IBM','CRM','PYPL','ACN US','ADBE','TXN','NVDA','INTU','QCOM','ADSK','CTSH','XLNX','HPQ','SPLK','TEL US','HPE','FISV','AMD','LRCX','MCHP','DXC','NOW','SYMC','ON','CDW','AKAM','FIS','NTAP','MXIM','DELL','ADS','VRSN','JNPR','LDOS','ANET','TER','GPN','TSS','IT','GDDY','CTXS','FTNT','DATA','ZBRA','WU','TYL','PAYC','CGNX','DOX']

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=tickers,
                                filing_types=["10-K", "10-K/A", "10-Q", "10-Q/A", "8-K", "8-K/A"],
                                sections=[],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

You can subsequently work on specific time periods within your dataset directly in the APIs, as illustrated below
















2. Define ESG queries to focus the content analysis

query_E = "Biodiversity OR Carbon OR Cleantech OR Clean OR Climate OR Coal OR Conservation OR Ecosystem OR Emission OR Energy OR Fuel OR Green OR Land OR Natural OR Pollution OR (Raw AND materials) OR Renewable OR Resources OR Sustainability OR Sustainable OR Toxic OR Waste OR Water"

query_S = "Accident OR (Adult AND entertainment) OR Alcohol OR Anti-personnel OR Behavior OR Charity OR (Child AND Labor) OR Community OR Controversial OR Controversy OR Discrimination OR Gambling OR Health OR Human capital OR Human rights OR Inclusion OR Injury OR Labor OR Munitions OR Opposition OR Pay OR Philanthropic OR Quality OR Responsible"

query_G = "Advocacy OR Bribery OR Compensation OR Competitive OR Corruption OR (Data AND breach) OR Divestment OR Fraud OR (Global AND Compact) OR GRI OR (Global AND Reporting AND Initiative) OR Independent OR Justice OR Stability OR Stewardship OR Transparency"
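The queries above are plain strings in the MySQL MATCH boolean format. As a convenience (a sketch, not part of the Nucleus SDK), simple OR queries can be assembled from keyword lists:

```python
def build_or_query(keywords):
    """Join keywords into a MySQL-MATCH-style boolean OR query string."""
    return " OR ".join(keywords)

# Hypothetical shortened version of the Environment query
query_E_short = build_or_query(["Biodiversity", "Carbon", "Climate", "Emission", "Renewable"])
print(query_E_short)  # Biodiversity OR Carbon OR Climate OR Emission OR Renewable
```

Grouped terms such as "(Raw AND materials)" still need their parentheses added by hand.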





3. Rank companies on each ESG subject

# Determine which companies are associated to the documents contributing to the topics
company_list = tickers


print('-------- Get topic sentiment and exposure per firm ----------------')

payload = nucleus_api.TopicSentimentModel(dataset='Corporate_docs',          
                                query=query_E,                   
                                num_topics=8,
                                num_keywords=8,
                                period_start = "2017-01-01",
                                period_end= "2017-03-01")
try:
    api_response = api_instance.post_topic_sentiment_api(payload)    
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for i, res in enumerate(api_response.result):
        print('Topic', i, 'sentiment:')
        print('    Keywords:', res.keywords)

        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids = res.doc_ids)
        try:
            api_response1 = api_instance.post_doc_info(payload)
            api_ok = True
        except ApiException as e:
            api_error = json.loads(e.body)
            print('ERROR:', api_error['message'])
            api_ok = False

        if api_ok:
            company_sources = [] # This list might be shorter than the whole dataset because not all companies necessarily contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(re.split(r"\s", res1.attribute['filename'])[0])

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, i] = (float(res.strength) * float(res.sentiment) * company_contributions).flatten()

            print('---------------')


    # Add up the ranking of companies per topic into the final ESG score on the subject (E, S, G) currently analyzed
    ESG_score = np.mean(company_rankings, axis=1)
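To make the final aggregation concrete, here is a hypothetical illustration: the per-topic company rankings are averaged across topics into one score per company.

```python
import numpy as np

# Hypothetical rankings: 3 companies x 2 topics
company_rankings = np.array([[0.2, 0.4],
                             [-0.1, 0.3],
                             [0.0, -0.2]])

# Average across topics (axis=1) to obtain one score per company
ESG_score = np.mean(company_rankings, axis=1)
print(ESG_score)  # one value per company: [0.3, 0.1, -0.1]
```

A company ranks highly only if it contributes positively across several topics, which dampens the effect of any single noisy topic.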
































print('------------ Retrieve all companies found in the dataset ----------')
import datetime
company_list = tickers


print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='Corporate_docs', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the ranking of companies
T = 180 # The look-back period in days

ESG_score2 = []
for i in range(delta.days):  
    if i == 0:
        end_date = first_date + datetime.timedelta(days=T)

    # first and last date used for the lookback period of T days
    start_date = end_date - datetime.timedelta(days=T)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")

    # We want a daily indicator
    end_date = end_date + datetime.timedelta(days=1)
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")
    try:
        payload = nucleus_api.TopicSentimentModel(dataset="Corporate_docs",      
                                    query=query_E,                   
                                    num_topics=8,
                                    num_keywords=8,
                                    period_start=start_date_str,
                                    period_end=end_date_str)
        api_response = api_instance.post_topic_sentiment_api(payload)

        company_rankings = np.zeros([len(company_list), len(api_response.result)])
        for l, res in enumerate(api_response.result):
            # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
            payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_ids)
            api_response1 = api_instance.post_doc_info(payload)

            company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(re.split(r'\s', res1.attribute['filename'])[0])

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, l] = (float(res.strength) * float(res.sentiment) * company_contributions).flatten()

        # Add up the ranking of companies per topic into the final ESG score on the subject (E, S, G) currently analyzed
        ESG_score2.append(np.mean(company_rankings, axis=1))
    except ApiException:
        # If the API errors on this window (e.g. no documents in the period), carry the previous score forward
        ESG_score2.append(ESG_score2[-1] if ESG_score2 else np.zeros(len(company_list)))









































4. Results Interpretation

import datetime

# ESG_score, ESG_score1 and ESG_score2 below are the time series obtained by running
# the loop of section 3 with query_E, query_S and query_G respectively
time_stamps = [first_date + i * datetime.timedelta(days=1) for i in range(delta.days)]

import matplotlib.pyplot as plt

plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_score)
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Environment Pillar", fontsize=14, fontweight="bold")
plt.show()



plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_score1)
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Social Pillar", fontsize=14, fontweight="bold")
plt.show()



plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_score2)
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Governance Pillar", fontsize=14, fontweight="bold")
plt.show()



ESG_combined = []
for i in range(len(ESG_score1)):
    ESG_combined.append(ESG_score[i] + ESG_score1[i] + ESG_score2[i])

plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_combined)
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Combined ESG Score", fontsize=14, fontweight="bold")
plt.show()







































5. Fine Tuning

5.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                       
                                query=query_E,                       
                                num_topics=20,
                                num_keywords=8,
                                period_start="2015-01-01",
                                period_end="2019-06-01")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')












custom_stop_words = ["call","report"] # str | List of stop words. (optional)

You can then tailor the scoring analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2:

5.b Exploring the impact of the type of documents, the lookback period, the number of topics being extracted

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=tickers,
                                filing_types=["10-Q", "10-Q/A"],
                                sections=[],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

num_topics: You can compute the companies' ESG score using a different breadth of topics by changing the num_topics variable in the payload of the main code of section 2. A larger value provides more breadth in establishing scores, while a smaller value provides a shallower measure. If num_topics is too large, some very marginal topics may introduce noise into the ESG scores.

T: You can compute the companies' ESG score with different speeds of propagation by changing the lookback variable T in the main code of section 2. A larger value produces a slowly changing ESG score, while a smaller value leads to a very responsive one. If T is too small, too few documents may be used, adding noise to the scores. If T is too long, the ESG scores won't reflect important new information quickly enough.

Document types: You can investigate how the companies' ESG score changes when it is measured using only one type of company filing, by rebuilding a dataset from the feed and selecting fewer filing types:

6. Next Steps

Fixed Income Trading

Directional Rate Trading

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

The Nucleus Datafeed can be leveraged for all content from major Central Banks

Nucleus APIs used:

Approach:

1. Dataset Preparation

print('---- Use the embedded Nucleus datafeed of Chinese central bank content ----')
dataset = 'sumup/central_banks_chinese' # str | One of the embedded datafeeds in Nucleus.
metadata_selection = {'bank': 'people_bank_of_china', 'document_category': ('speech', 'press release', 'publication')}

This selection can also be applied directly in the APIs that perform content analysis, as shown below.

2. Sentiment Analysis

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

import numpy as np

print('---------------- Get topic sentiment ------------------------')

payload = nucleus_api.TopicSentimentModel(dataset='sumup/central_banks_chinese',           
                                query='earnings OR ESG',                   
                                num_topics=8,
                                num_keywords=8,
                                metadata_selection=metadata_selection,
                                period_start="2018-11-01 00:00:00",
                                period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_sentiment_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('    Sentiment:', res.sentiment)
        print('---------------')

    # Aggregate all topic sentiments, weighted by topic strength
    sentiment_list = [res.sentiment for res in api_response.result]
    strength_list = [res.strength for res in api_response.result]
    PBOC_sent = np.dot(sentiment_list, strength_list)
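As a hypothetical illustration of that aggregation, topic sentiments are weighted by topic strengths through a dot product:

```python
import numpy as np

# Hypothetical per-topic outputs
sentiments = [0.5, -0.2, 0.1]   # sentiment of each topic
strengths = [0.6, 0.3, 0.1]     # prevalence of each topic in the dataset

# Strength-weighted sum: 0.5*0.6 - 0.2*0.3 + 0.1*0.1
indicator = np.dot(sentiments, strengths)
print(round(indicator, 2))  # 0.25
```

Dominant topics therefore drive the indicator, while weak topics contribute only marginally.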


























import datetime
import numpy as np

print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='sumup/central_banks_chinese', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the sentiment indicator for PBOC
T = 90 # The look-back period in days

PBOC_sentiments = []
for i in range(delta.days):  
    if i == 0:
        end_date = first_date + datetime.timedelta(days=T)
 
    # first and last date used for the lookback period of T days
    start_date = end_date - datetime.timedelta(days=T)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")

    # We want a daily indicator
    end_date = end_date + datetime.timedelta(days=1) 
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")

    payload = nucleus_api.TopicSentimentModel(dataset="sumup/central_banks_chinese",        
                                            query='',                   
                                            num_topics=8,
                                            num_keywords=8,
                                            metadata_selection=metadata_selection,
                                            period_start= start_date_str,
                                            period_end= end_date_str)
    try:
        api_response = api_instance.post_topic_sentiment_api(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:   
        # Aggregate all topic sentiments, weighted by topic strength
        sentiment_list = [res.sentiment for res in api_response.result]
        strength_list = [res.strength for res in api_response.result]
        PBOC_sentiments.append(np.dot(sentiment_list, strength_list))

































3. Results Interpretation
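The daily indicator can be inspected visually, in the same way as the ESG scores earlier in this documentation. A minimal sketch, assuming the PBOC_sentiments series and first_date from section 2 (replaced here with hypothetical stand-ins so the snippet is self-contained):

```python
import datetime
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the outputs of section 2
first_date = datetime.datetime(2018, 1, 1)
PBOC_sentiments = np.random.uniform(-1, 1, 90)

# One time stamp per daily indicator value
time_stamps = [first_date + datetime.timedelta(days=i) for i in range(len(PBOC_sentiments))]

plt.figure(figsize=(20, 10))
plt.plot(time_stamps, PBOC_sentiments)
plt.ylabel('Sentiment indicator', fontsize=14, fontweight="bold")
plt.title("PBOC Sentiment Indicator", fontsize=14, fontweight="bold")
plt.show()
```

A persistent drift in the indicator, rather than day-to-day noise, is the signal of interest for directional rate trading.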

4. Fine Tuning

4.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='sumup/central_banks_chinese',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection,
                            period_start="2018-11-01 00:00:00",
                            period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')














custom_stop_words = ["conference","government"] # str | List of stop words. (optional)

You can then tailor the sentiment analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2:

4.b Focusing the sentiment analysis on certain subjects

query = '(inflation OR growth OR unemployment OR stability OR regulation)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

If you decide to focus the sentiment analysis, for instance on policy and macro-economic subjects, simply substitute the query variable in the main code of section 2 with:

4.c Exploring the impact of the type of documents, the lookback period, the number of topics being extracted

metadata_selection = {"document_category": "speech"}   # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)

num_topics: You can compute the sentiment indicator using a different breadth of topics by changing the num_topics variable in the payload of the main code of section 2. A larger value provides more breadth in establishing the indicator, while a smaller value provides a shallower measure. If num_topics is too large, some very marginal topics may introduce noise into the sentiment measure.

T: You can compute the sentiment indicator with different speeds of propagation by changing the variable T (lookback) in the main code of section 2. A larger value will provide a slowly changing measure of sentiment while a smaller value will lead to a very responsive sentiment measure. If T is too small, too few documents may be used and this may lead to a lot of noise in measuring sentiment. If T is too long, the sentiment indicator won’t reflect quickly enough important new information.

Document types: You can investigate how the sentiment indicator changes if it is measured using only one type of document among speech, press release, publications compared to capturing the whole content by leveraging the metadata selector provided during the construction of the dataset. Rerun the main code of section 2. on a subset of the whole corpus. Create a variable metadata_selection and pass it in to the payload:

5. Next Steps

# Sketch, assuming SentimentIndicator and rates are aligned 1-D numpy arrays
import numpy as np
from sklearn import linear_model

# Create a regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets:
# change in SentimentIndicator is your x
# change in rate, or in any other tradable asset, is your y.
# There may be a lag p between the indicator and the market response; test several values
p = 1
x = np.diff(SentimentIndicator)[:-p].reshape(-1, 1)
y = np.diff(rates)[p:]

# 1. Predict the direction of rates
regr.fit(x, np.sign(y))

# 2. Predict both the direction and the size of the move
regr.fit(x, y)
fitted_score = regr.score(x, y)

# Forecast the next move using the latest value of the sentiment indicator
y_predicted = regr.predict(x[-1].reshape(1, -1))

Corporate Credit Screen

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import os
import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

print('--------- Append all files from local folder to dataset in parallel -----------')
folder = 'Corporate_documents'         
dataset = 'Corporate_docs'  # str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'ticker': 'AAPL',
                                  'company': 'Apple',
                                  'category': 'Press Release',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)





























dataset = "Corporate_docs" 
period_start = "2010-01-01" 
period_end = "2019-06-01"

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=["FB", "AMZN", "INTL", "IBM", "NFLX", "GOOG"], 
                                filing_types=["10-K", "10-K/A", "10-Q", "10-Q/A"], 
                                sections=["Quantitative and Qualitative Disclosures about Market Risk",
                                          "Management's Discussion and Analysis of Financial Condition and Results of Operations",
                                          "Risk Factors"],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

You can subsequently work on specific time periods within your dataset directly in the APIs, as illustrated below.








2. Sentiment and Topic Contributions: Screen Analysis

# Determine which companies are associated with the documents contributing to the topics
import numpy as np

payload = nucleus_api.DocInfo(dataset='Corporate_docs')
api_response = api_instance.post_doc_info(payload)

company_sources = []
for res in api_response.result:        
    company_sources.append(res.attribute['ticker']) 

company_list = np.unique(company_sources)


print('-------- Get topic sentiment and exposure per firm ----------------')

payload = nucleus_api.TopicSentimentModel(dataset='Corporate_docs',          
                                        query='',                   
                                        num_topics=20,
                                        num_keywords=8,
                                        period_start="2018-11-01 00:00:00",
                                        period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_sentiment_api(payload)    
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for i, res in enumerate(api_response.result):
        print('Topic', i, 'sentiment:')
        print('    Keywords:', res.keywords)

        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids = res.doc_ids)
        api_response1 = api_instance.post_doc_info(payload)

        company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
        for res1 in api_response1.result:        
            company_sources.append(res1.attribute['ticker']) 

        company_contributions = np.zeros([len(company_list), 1])
        for j in range(len(company_list)):
            for k in range(len(company_sources)):
                if company_sources[k] == company_list[j]:
                    company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

        company_rankings[:, i] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]  

        print('---------------')


    # Add up the ranking of companies per topic into the final credit screen
    Corporate_screen = np.mean(company_rankings, axis=1)
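The aggregate screen can then be turned into an explicit ranking of companies. A minimal sketch, using illustrative stand-in values for company_list and Corporate_screen (in your run, use the arrays computed above):

```python
import numpy as np

# Illustrative stand-ins for the arrays computed above
company_list = np.array(['FB', 'AMZN', 'IBM', 'NFLX'])
Corporate_screen = np.array([0.12, -0.05, 0.30, 0.08])

# Rank companies from strongest to weakest aggregate score
order = np.argsort(Corporate_screen)[::-1]
for ticker, score in zip(company_list[order], Corporate_screen[order]):
    print(ticker, round(float(score), 2))
```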

































import datetime
import numpy as np

print('------------ Retrieve all companies found in the dataset ----------')

payload = nucleus_api.DocInfo(dataset='Corporate_docs')
api_response = api_instance.post_doc_info(payload)

company_sources = []
for res in api_response.result:        
    company_sources.append(res.attribute['ticker']) 

company_list = np.unique(company_sources)


print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='Corporate_docs', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the ranking of companies
T = 90 # The look-back period in days

Corporate_screen = []
for i in range(delta.days):  
    if i == 0:
        end_date = first_date + datetime.timedelta(days=T)
 
    # first and last date used for the lookback period of T days
    start_date = end_date - datetime.timedelta(days=T)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")

    # We want a daily indicator
    end_date = end_date + datetime.timedelta(days=1) 
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")

    payload = nucleus_api.TopicSentimentModel(dataset="Corporate_docs",      
                                            query='',                   
                                            num_topics=20,
                                            num_keywords=8,
                                            period_start=start_date_str,
                                            period_end=end_date_str)
    try:
        api_response = api_instance.post_topic_sentiment_api(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:   
        company_rankings = np.zeros([len(company_list), len(api_response.result)])
        for l, res in enumerate(api_response.result):
            # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
            payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_ids)
            api_response1 = api_instance.post_doc_info(payload)

            company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(res1.attribute['ticker']) 

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, l] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]      

        # Add up the ranking of companies per topic into the final credit screen
        Corporate_screen.append(np.mean(company_rankings, axis=1))
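Each loop iteration appends one vector of company scores, so the result is naturally viewed as a (days x companies) matrix. A minimal sketch with illustrative stand-in values for the loop output:

```python
import datetime
import numpy as np

# Illustrative stand-ins for the loop output above
first_date = datetime.datetime(2018, 1, 1)
company_list = ['FB', 'AMZN']
Corporate_screen = [np.array([0.10, -0.20]),
                    np.array([0.15, -0.10]),
                    np.array([0.05, 0.00])]

# Stack the per-day score vectors into a (days x companies) matrix
scores = np.vstack(Corporate_screen)
dates = [first_date + datetime.timedelta(days=i) for i in range(scores.shape[0])]

# Time series of the screen for one company
fb_series = scores[:, company_list.index('FB')]
```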
























































3. Results Interpretation

4. Fine Tuning

4.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                       
                            query='',                       
                            num_topics=20, 
                            num_keywords=8,
                            period_start="2018-11-01 00:00:00",
                            period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')













custom_stop_words = ["call","report"] # str | List of stop words. (optional)

You can then tailor the screen analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2:

4.b Focusing the screen analysis on certain subjects

query = '(earnings OR debt OR competition OR lawsuit OR restructuring)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

In case you decide to focus the screen analysis, for instance on financial health and corporate actions, simply substitute the query variable in the main code of section 2 with:

4.c Exploring the impact of the type of documents, the lookback period, and the number of topics being extracted

metadata_selection = {"category": "Report"}   # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)

num_topics: You can compute the corporate screen using different breadths of topics by changing the num_topics variable in the payload of the main code of section 2. A larger value provides more breadth in establishing the rankings, while a smaller value provides a shallower measure. If num_topics is too large, very marginal topics may introduce noise into the rankings.

T: You can compute the corporate screen with different speeds of propagation by changing the lookback variable T in the main code of section 2. A larger value produces a slowly changing ranking, while a smaller value leads to a very responsive one. If T is too small, too few documents may be used, which also introduces noise; if T is too large, the rankings won't reflect important new information quickly enough.

Document types: You can investigate how the corporate screen changes when it is measured on only one type of document (company reports, press releases, or earnings call transcripts) rather than the whole corpus, by leveraging the metadata selector provided during the construction of the dataset. Rerun the main code of section 2 on a subset of the whole corpus: create a metadata_selection variable and pass it in the payload:

5. Next Steps

Single Names' ESG Scoring

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import datetime
import regex as re
import numpy as np
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'
# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

dataset = "Corporate_docs" 
period_start = "2015-01-01" 
period_end = "2019-06-01"
tickers = ['AAPL','MSFT','INTC','CSCO','MA','ORCL','IBM','CRM','PYPL','ACN US','ADBE','TXN','NVDA','INTU','QCOM','ADSK','CTSH','XLNX','HPQ','SPLK','TEL US','HPE','FISV','AMD','LRCX','MCHP','DXC','NOW','SYMC','ON','CDW','AKAM','FIS','NTAP','MXIM','DELL','ADS','VRSN','JNPR','LDOS','ANET','TER','GPN','TSS','IT','GDDY','CTXS','FTNT','DATA','ZBRA','WU','TYL','PAYC','CGNX','DOX']

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=tickers, 
                                filing_types=["10-K", "10-K/A", "10-Q", "10-Q/A", "8-K", "8-K/A"], 
                                sections=[],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

You can subsequently work on specific time periods within your dataset directly in the APIs, as illustrated below.
















2. Define ESG queries to focus the content analysis

query_E = "Biodiversity OR Carbon OR Cleantech OR Clean OR Climate OR Coal OR Conservation OR Ecosystem OR Emission OR Energy OR Fuel OR Green OR Land OR Natural OR Pollution OR (Raw AND materials) OR Renewable OR Resources OR Sustainability OR Sustainable OR Toxic OR Waste OR Water"

query_S = "Accident OR (Adult AND entertainment) OR Alcohol OR Anti-personnel OR Behavior OR Charity OR (Child AND Labor) OR Community OR Controversial OR Controversy OR Discrimination OR Gambling OR Health OR Human capital OR Human rights OR Inclusion OR Injury OR Labor OR Munitions OR Opposition OR Pay OR Philanthropic OR Quality OR Responsible"

query_G = "Advocacy OR Bribery OR Compensation OR Competitive OR Corruption OR (Data AND breach) OR Divestment OR Fraud OR (Global AND Compact) OR GRI OR (Global AND Reporting AND Initiative) OR Independent OR Justice OR Stability OR Stewardship OR Transparency"
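The queries above use the MySQL MATCH boolean format: terms are OR'ed together, and parenthesized AND groups require all of their terms. As a rough mental model of how a document matches a clause such as "(Raw AND materials) OR Renewable" (this toy matcher is an illustration, not the server's implementation):

```python
# Toy matcher illustrating the boolean semantics: OR across groups, AND within a group
def matches(text, groups):
    words = set(text.lower().split())
    return any(all(term.lower() in words for term in group) for group in groups)

query_groups = [("Raw", "materials"), ("Renewable",)]
print(matches("We buy raw materials abroad", query_groups))    # True
print(matches("Investing in renewable energy", query_groups))  # True
print(matches("Quarterly earnings grew", query_groups))        # False
```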






3. Rank companies on each ESG subject

# Determine which companies are associated with the documents contributing to the topics
company_list = tickers


print('-------- Get topic sentiment and exposure per firm ----------------')

payload = nucleus_api.TopicSentimentModel(dataset='Corporate_docs',          
                                query=query_E,                   
                                num_topics=8,
                                num_keywords=8,
                                period_start = "2017-01-01", 
                                period_end= "2017-03-01")
try:
    api_response = api_instance.post_topic_sentiment_api(payload)    
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for i, res in enumerate(api_response.result):
        print('Topic', i, 'sentiment:')
        print('    Keywords:', res.keywords)

        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids = res.doc_ids)
        try:
            api_response1 = api_instance.post_doc_info(payload)
            api_ok = True
        except ApiException as e:
            api_error = json.loads(e.body)
            print('ERROR:', api_error['message'])
            api_ok = False

        if api_ok:
            company_sources = [] # This list might be shorter than the whole dataset because not all companies necessarily contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(re.split(r"\s", res1.attribute['filename'])[0])

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, i] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]   

            print('---------------')


    # Add up the ranking of companies per topic into the final ESG score on the subject (E, S, G) currently analyzed
    ESG_score = np.mean(company_rankings, axis=1)
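Since each pillar is computed independently, the raw scores are not on a common scale. One option, shown here as a sketch with illustrative stand-in values, is to standardize each pillar cross-sectionally before comparing or combining pillars:

```python
import numpy as np

# Illustrative stand-in for the ESG_score vector computed above
ESG_score = np.array([0.40, -0.10, 0.25, 0.05])

# Cross-sectional z-scores: mean 0 and unit standard deviation across companies
ESG_z = (ESG_score - ESG_score.mean()) / ESG_score.std()
```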
































print('------------ Retrieve all companies found in the dataset ----------')
import datetime
company_list = tickers


print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='Corporate_docs', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the ranking of companies
T = 180 # The look-back period in days

ESG_score2 = []
for i in range(delta.days):  
    if i == 0:
        end_date = first_date + datetime.timedelta(days=T)
 
    # first and last date used for the lookback period of T days
    start_date = end_date - datetime.timedelta(days=T)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")

    # We want a daily indicator
    end_date = end_date + datetime.timedelta(days=1) 
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")
    try: 
        payload = nucleus_api.TopicSentimentModel(dataset="Corporate_docs",      
                                    query=query_E,                   
                                    num_topics=8,
                                    num_keywords=8,
                                    period_start=start_date_str,
                                    period_end=end_date_str)
        api_response = api_instance.post_topic_sentiment_api(payload)

        company_rankings = np.zeros([len(company_list), len(api_response.result)])
        for l, res in enumerate(api_response.result):
            # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
            payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_ids)
            api_response1 = api_instance.post_doc_info(payload)

            company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(re.split(r'\s', res1.attribute['filename'])[0])

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, l] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]      

        # Add up the ranking of companies per topic into the final ESG score on the subject (E, S, G) currently analyzed
        ESG_score2.append(np.mean(company_rankings, axis=1))
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        # Carry the last available score forward when a window fails (e.g. no documents in the period)
        ESG_score2.append(ESG_score2[-1] if ESG_score2 else np.zeros(len(company_list)))










































4. Results Interpretation

import datetime

# Rerun the time loop of section 3 once per pillar (query_E, query_S, query_G),
# storing the results in ESG_score, ESG_score1 and ESG_score2 respectively
time_stamps = [first_date + i * datetime.timedelta(days=1) for i in range(delta.days)]

import matplotlib.pyplot as plt

plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_score) 
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Environment Pillar", fontsize=14, fontweight="bold")
plt.show() 



plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_score1) 
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Social Pillar", fontsize=14, fontweight="bold")
plt.show() 



plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_score2) 
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Governance Pillar", fontsize=14, fontweight="bold")
plt.show() 



ESG_combined = []
for i in range(len(ESG_score1)):
    ESG_combined.append(ESG_score[i] + ESG_score1[i] + ESG_score2[i])
    
plt.figure(figsize=(20,10))
plt.plot(time_stamps, ESG_combined) 
#plt.xticks(time_stamps, rotation='vertical')

labels = tickers
plt.legend(labels)
plt.ylabel('Raw score', fontsize=14, fontweight="bold")        
plt.title("Info Tech sector, Combined ESG Score", fontsize=14, fontweight="bold")
plt.show() 








































5. Fine Tuning

5.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                       
                                query=query_E,                       
                                num_topics=20, 
                                num_keywords=8,
                                period_start="2015-01-01",
                                period_end="2019-06-01")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')













custom_stop_words = ["call","report"] # str | List of stop words. (optional)

You can then tailor the scoring analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 3:

5.b Exploring the impact of the type of documents, the lookback period, and the number of topics being extracted

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=tickers, 
                                filing_types=["10-Q", "10-Q/A"], 
                                sections=[],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

num_topics: You can compute the companies' ESG scores using different breadths of topics by changing the num_topics variable in the payload of the main code of section 3. A larger value provides more breadth in establishing the scores, while a smaller value provides a shallower measure. If num_topics is too large, very marginal topics may introduce noise into the ESG scores.

T: You can compute the companies' ESG scores with different speeds of propagation by changing the lookback variable T in the main code of section 3. A larger value produces slowly changing scores, while a smaller value leads to very responsive ones. If T is too small, too few documents may be used, which also introduces noise; if T is too large, the scores won't reflect important new information quickly enough.

Document types: You can investigate how the companies' ESG scores change when they are measured on only one type of company filing, by rebuilding a dataset off the feed and picking fewer types of filings:

6. Next Steps

Media Intelligence / Compliance

News Tracking & Analysis

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

The Nucleus Datafeed can be leveraged for content from 200 News Media RSS feeds.

Nucleus APIs used:

Approach:

1. Dataset Preparation

import os
import csv
import json
import datetime
import numpy as np
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

# Leverage your own corpus
print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'Twitter_feed'         
dataset = 'Twitter_feed'  # str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively.
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'source': 'Tech Crunch',
                                  'author': 'Sarah Moore',
                                  'category': 'Media',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)


# Leverage a Nucleus embedded feed    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/rss_feed_ai'  # embedded datafeeds in Nucleus.

This can be done directly in the APIs that perform content analysis, as shown below.





























2. Content Analysis & Tracking

print('------------ Get topics + historical analysis ----------------')

payload = nucleus_api.TopicHistoryModel(dataset='Twitter_feed',
                                    update_period='d',
                                    query='',
                                    num_topics=20,
                                    num_keywords=8,
                                    inc_step=1,
                                    period_start="2018-12-01 00:00:00",
                                    period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_historical_analysis_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Plotting historical metrics data...')
    historical_metrics = []
    for res in api_response.result:
        # construct a list of historical metrics dictionaries for charting
        historical_metrics.append({
            'topic'    : res.keywords,
            'time_stamps' : np.array(res.time_stamps),
            'strength' : np.array(res.strengths, dtype=np.float32),
            'consensus': np.array(res.consensuses, dtype=np.float32),
            'sentiment': np.array(res.sentiments, dtype=np.float32)})

    selected_topics = range(len(historical_metrics))
    nucleus_helper.topic_charts_historical(historical_metrics, selected_topics, True)
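Beyond charting, the historical_metrics list can be post-processed directly, for example to rank topics by their average sentiment over the analysis window. A minimal sketch with illustrative stand-in data in place of the API response:

```python
import numpy as np

# Illustrative stand-ins for the historical_metrics entries built above
historical_metrics = [
    {'topic': 'ai machine learning', 'sentiment': np.array([0.2, 0.4, 0.3])},
    {'topic': 'privacy regulation',  'sentiment': np.array([-0.1, -0.3, 0.0])},
]

# Rank topics by average sentiment over the analysis window
ranked = sorted(historical_metrics,
                key=lambda m: float(np.mean(m['sentiment'])),
                reverse=True)
```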
























print('------------- Get the summaries of recent topics in your feed --------------')

payload = nucleus_api.TopicSummaryModel(dataset='Twitter_feed',                         
                            query='',                       
                            num_topics=20,
                            num_keywords=8,
                            period_start="2018-12-31 00:00:00",
                            period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_summary_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    for i, res in enumerate(api_response.result):
        print('Topic', i, 'summary:')
        print('    Keywords:', res.topic)
        for j in range(len(res.summary)):
            print('    Document ID:', res.summary[j].sourceid)
            print('        Title:', res.summary[j].title)
            print('        Sentences:', res.summary[j].sentences)
            print('        Author:', res.summary[j].attribute['author'])
            print('        Time:', datetime.datetime.fromtimestamp(float(res.summary[j].attribute['time'])))





















print('----------------- Get author connectivity -------------------')

payload = nucleus_api.AuthorConnection(dataset='Twitter_feed',
                                        target_author='Yann LeCun',
                                        query='',
                                        period_start="2018-12-31 00:00:00",
                                        period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_author_connectivity_api(payload)    
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Mainstream connections:')
    for mc in api_response.result.mainstream_connections:
        print('    Topic:', mc.keywords)
        print('    Authors:', " ".join(str(x) for x in mc.authors))

    print('Niche connections:')
    for nc in api_response.result.niche_connections:
        print('    Topic:', nc.keywords)
        print('    Authors:', " ".join(str(x) for x in nc.authors))  













3. Fine Tuning

3.a Tailoring the topics

print('------------- Get the recent topics in your feed --------------')

payload = nucleus_api.Topics(dataset='Twitter_feed',                         
                            query='',                       
                            num_topics=20,
                            num_keywords=8,
                            period_start="2018-12-31 00:00:00",
                            period_end="2019-01-01 00:00:00")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')












custom_stop_words = ["supervised learning", "training"] # str | List of stop words. (optional)

You can then tailor the content analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

3.b Focusing the content analysis on certain subjects

query = '(deep-learning OR LSTM OR RNN OR Neural network)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

If you decide to focus the content analysis, for instance on deep-learning subjects, simply substitute the query variable in the main code of section 2 with:

3.c Exploring the impact of the type of documents, the lookback period, the number of topics being extracted

metadata_selection = {"category": "Academia"}   # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)

period_start and period_end: You can perform content analysis over any lookback period you want, with granularities ranging from intraday to monthly. Depending on your objectives, these options give you the flexibility to slice the data over the most relevant time horizon.
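For instance, a small helper like this hypothetical lookback_window (not part of the SDK) can generate the period_start / period_end strings for any lookback before you pass them into the Topics payload:

```python
from datetime import datetime, timedelta

def lookback_window(end, days):
    """Return (period_start, period_end) strings for a lookback of `days` days."""
    fmt = '%Y-%m-%d %H:%M:%S'
    return (end - timedelta(days=days)).strftime(fmt), end.strftime(fmt)

# A one-day slice ending on 2019-01-01, matching the payloads used in section 2
period_start, period_end = lookback_window(datetime(2019, 1, 1), days=1)
# period_start == '2018-12-31 00:00:00', period_end == '2019-01-01 00:00:00'
```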

Document types: You can investigate how topics and their evolution over time change based on the types of sources contributing to your content. Whether the sources are in different languages, or the contributors come from academia, the private sector, or independent individuals, slicing and dicing the data is straightforward as long as your corpus carries that information, thanks to the metadata selector provided during the construction of the dataset. Rerun the main code of section 2 on a subset of the whole corpus: create a metadata_selection variable and pass it into the payload (this works if you are using your own documents or the Central Bank feed; the News Media RSS feed doesn't have selectable metadata):

4. Next Steps

Low Quality Content Detection

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

In its current version, SumUp contrast analysis works by comparing two categories of documents against each other, where the user defines what the two categories are.

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

csv_file = 'social-media.csv'
dataset = 'social-media'# str | Destination dataset where the file will be inserted.

with open(csv_file, encoding='utf-8-sig') as csvfile:
    reader = csv.DictReader(csvfile)
    json_props = nucleus_helper.upload_jsons(api_instance, dataset, reader, processes=4)

    total_size = 0
    total_jsons = 0
    for jp in json_props:
        total_size += jp.size
        total_jsons += 1

    print(total_jsons, 'JSON records (', total_size, 'bytes) appended to', dataset)
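Each row of the CSV becomes one dict that upload_jsons inserts as a JSON record. As a rough illustration (the column names below are hypothetical; your file may differ, though a label column matching the metadata_selection used later is needed), the reader produces records like this:

```python
import csv
import io

# In-memory stand-in for social-media.csv; column names are illustrative only
sample = io.StringIO(
    "title,content,label,author,time\n"
    "post 1,some message,Violence,@user1,1546300800\n"
    "post 2,all good here,All clear,@user2,1546304400\n"
)
records = list(csv.DictReader(sample))
# Each record is a dict mapping column names to values,
# e.g. records[0]['label'] == 'Violence'
```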














2. Contrasted Topic Modeling

metadata_selection = {"label": ["Violence", "All clear"]} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other
query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.0002 # float | Parameter controlling the breadth of the contrasted topic. Contained between 0 and 1, the smaller it is, the more contrasting terms will be captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset=dataset,
                                         metadata_selection=metadata_selection,
                                         compression=compression,
                                         syntax_variables=syntax_variables,
                                         period_start='2018-01-01',
                                         period_end='2018-01-01')
try:
    api_response = api_instance.post_topic_contrast_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    topic_contrast_result = api_response.result
    print('Contrasted Topic')
    print('    Keywords:', topic_contrast_result.keywords)
    print('    Keywords Weight:', topic_contrast_result.keywords_weight)

    print('In-Sample Perf Metrics')
    print('    Accuracy:', topic_contrast_result.perf_metrics.hit_rate)
    print('    Recall:', topic_contrast_result.perf_metrics.recall)
    print('    Precision:', topic_contrast_result.perf_metrics.precision)






















3. Documents Classification

# Here we re-use the contrasted topic from section 2
fixed_topics = {"keywords": topic_contrast_result.keywords,
                "weights": topic_contrast_result.keywords_weight} # dict | The contrasting topic used to separate the two categories of documents. Weights optional

metadata_selection = {"label": ["Violence", "All clear"]} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasted topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset=dataset,
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        validation_phase=True,
                                        syntax_variables = syntax_variables,
                                        period_start='2019-01-01',
                                        period_end='2019-01-01')

try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
    print('    Actual Category:', api_response.result.detailed_results.true_class)
    print('\n')

    print('Out-Sample Perf Metrics')
    print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
    print('    Recall:', api_response.result.perf_metrics.recall)
    print('    Precision:', api_response.result.perf_metrics.precision)

This task requires 3 steps:























payload = nucleus_api.DocClassifyModel(dataset=dataset,
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=False,
                                        period_start='2019-01-01',
                                        period_end='2019-01-01')

try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)

Then we can move to the testing phase.
















4. Fine Tuning

4.a Specifying the metadata_selection for your contrasted topic


The category of interest can be defined directly on the content of the documents:

metadata_selection = {"content": "kill hate torture"}

or on a metadata field, such as the author:

metadata_selection = {"author": "@suspicious_author"}

4.b Reducing noise in your low-quality content detection

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset=dataset,                         
                            query='',                       
                            num_topics=20,
                            num_keywords=8,
                            metadata_selection=metadata_selection)

try:
    api_response = api_instance.post_topic_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')













custom_stop_words = ["tough dude","bad boy"] # str | List of stop words. (optional)

Using your domain expertise / client input / advisor input, you can determine whether certain of those topics or keywords are not differentiated enough to contribute to low-quality content detection.

You can then tailor the low-quality content detection by creating a custom_stop_words variable that contains those words. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

4.c Focusing the content detection on specific subjects potentially discussed in your corpus

query = '(LOL OR league of legends OR WOW OR world of warcraft)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

query: You can refine the content detection by leveraging the query variable of the Contrasted Topic and Document Classify APIs.

Rerun either of these two APIs on the content from your corpus that mentions a specific theme. Create a variable query and pass it into the payload:

Summarization

Document Summarization

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

The Nucleus Datafeed can be leveraged for all content from major Central Banks.

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))


print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'Corporate_documents'         
dataset = 'Corporate_docs'# str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'company': 'Apple',
                                      'category': 'Press Release',
                                      'date': '2019-01-01'}}
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/central_banks_chinese'# embedded datafeeds in Nucleus.
metadata_selection = {'bank': 'people_bank_of_china', 'document_category': ('speech', 'press release', 'publication')}

































2. Document Summarization

print('---------------- Get doc summaries ------------------------')
# These are all possible input arguments to the summarization API
custom_stop_words = ["decree","motion"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the document summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)

payload = nucleus_api.DocumentSummaryModel(dataset='Corporate_docs', 
                                        doc_title='my_title', 
                                        summary_length=summary_length)
try:
    api_response = api_instance.post_doc_summary_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Summary for', api_response.result.doc_title)
    for sent in api_response.result.summary.sentences:
        print('    *', sent)












3. Fine Tuning

3.a Extracting topics found across documents of your corpus

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection)
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')










custom_stop_words = ["decree","motion"] # str | List of stop words. (optional)

Using your domain expertise / client input / advisor input, you can determine whether certain of those topics or keywords are not differentiated enough to contribute to document summaries.

You can then tailor the document summaries by creating a custom_stop_words variable that contains those words. Initialize the variable as follows, for instance, and pass it into the payload of the main code of section 2:

3.b Isolating specific subsets of documents within your corpus

# If you created a dataset where one metadata is the category of the document, 
# and one possible value for this category is 'speech'
# you could focus the topic analysis and the creation of a customized stopword list for all speech documents 
# within your corpus and later on in production
metadata_selection = {"document_category": "speech"}   # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)

Document types: You can refine the extraction of topics and isolation of non-information-bearing topics by leveraging the metadata selector provided during the construction of the dataset, to get any level of granularity you are interested in.

Rerun the code from two blocks above on a subset of the whole corpus. Create a variable metadata_selection and pass it into the payload:

3.c Creating custom stopword lists on certain themes within your corpus

query = '(veto rights OR jury decision OR verdict)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

query: You can refine the extraction of topics and isolation of non-information-bearing topics by leveraging the query variable of the Topic API.

Rerun the code from three blocks above on the content in your corpus that mentions a specific theme. Create a variable query and pass it into the payload:

Entity Tagging

Generate metadata with Entity Tagging

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

print('---- Upload documents from a local folder into a new Nucleus dataset ----')
folder = 'Corporate_documents'         
dataset = 'Corporate_docs'# str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'category': 'News'}} # You don't have the tickers from each news, let's tag them
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)



























2. Dataset Tagging

print('---------------- Tag dataset ------------------------')

payload = nucleus_api.DatasetTaggingModel(dataset='Corporate_docs', 
                                    query='AAPL OR Apple', 
                                    metadata_selection='', 
                                    time_period='')
try:
    api_response = api_instance.post_dataset_tagging(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Information about dataset', dataset)
    print('    Entity Tagged:', api_response.result.entities_tagged)
    print('    Docids tagged with Entity:', api_response.result.doc_ids)
    print('    Entity occurrences in each docid:', api_response.result.entities_count)












entities = [['AAPL', 'Apple'], ['GOOG', 'Google', 'Alphabet']]

docs_tagged = []
entities_tagged = []
for i in range(len(entities)):
    query = " OR ".join(entities[i])
    payload = nucleus_api.DatasetTaggingModel(dataset='Corporate_docs', 
                                    query=query, 
                                    metadata_selection='', 
                                    time_period='')
    try:
        api_response = api_instance.post_dataset_tagging(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:
        for docid in api_response.result.doc_ids:
            docs_tagged.append(docid)
            entities_tagged.append(api_response.result.entities_tagged[0]) # Retain the first naming of an entity as the label

# Regroup the entities tagged per document so we have a unique list of docids
# and all entities tagged in them

# This table will be useful to generate an updated dataset with tickers provided as metadata,
# so what we really care about are filenames rather than docids
from collections import defaultdict
d = defaultdict(list)
for i, entity in enumerate(entities_tagged):
    payload = nucleus_api.DocInfo(
        dataset='Corporate_docs', 
        doc_ids=docs_tagged[i],
        metadata_selection='')
    api_response = api_instance.post_doc_info(payload)
    key = api_response.result[0].attribute['filename']
    d[key].append(entity)
d = dict(d)

Now we create our list of entities and loop through it
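The regrouping step can be illustrated with dummy data (the filenames and entity labels below are hypothetical), showing the shape of the resulting table d:

```python
from collections import defaultdict

# One entry per (document, tagged entity) pair, as produced by the tagging loop
entities_tagged = ['AAPL', 'GOOG', 'AAPL']
filenames = ['news1.txt', 'news2.txt', 'news2.txt']  # filename looked up for each tagged docid

# Regroup so each filename maps to the list of entities tagged in that document
d = defaultdict(list)
for fname, entity in zip(filenames, entities_tagged):
    d[fname].append(entity)
d = dict(d)
# d == {'news1.txt': ['AAPL'], 'news2.txt': ['GOOG', 'AAPL']}
```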































dataset = 'Corporate_docs_2'# str | Destination dataset where the file will be inserted.

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        
        # We know the filename of the file currently being injected, we can match it against the 
        # table of tagged documents
        if d.get(os.path.join(root, file), []) != []: # Only build the new dataset with the documents that have tagged entities
            tickers = d[os.path.join(root, file)]
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'companies': tickers,
                                      'category': 'News'}}
            
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

Using these tags, we can construct a second dataset enriched with this extra metadata, which will be particularly convenient in signal research and compliance analytics.

We can use the filename to match raw documents with documents that have been tagged












3. Fine Tuning

3.a Expanding the list of synonyms for a given entity

entities = [['AAPL', 'Apple', 'iPhone'], ['GOOG', 'Google', 'Alphabet', 'Android'], ['NTDOY', 'Nintendo', '任天堂株式会社']]

docs_tagged = []
entities_tagged = []
for i in range(len(entities)):
    query = " OR ".join(entities[i])
    payload = nucleus_api.DatasetTaggingModel(dataset='Corporate_docs', 
                                    query=query, 
                                    metadata_selection='', 
                                    time_period='')
    try:
        api_response = api_instance.post_dataset_tagging(payload)
        api_ok = True
    except ApiException as e:
        api_error = json.loads(e.body)
        print('ERROR:', api_error['message'])
        api_ok = False

    if api_ok:
        for docid in api_response.result.doc_ids:
            docs_tagged.append(docid)
            entities_tagged.append(api_response.result.entities_tagged[0]) # Retain the first naming of an entity as the label

# Regroup the entities tagged per document so we have a unique list of docids
# and all entities tagged in them

# This table will be useful to generate an updated dataset with tickers provided as metadata,
# so what we really care about are filenames rather than docids
d = defaultdict(list)
for i, entity in enumerate(entities_tagged):
    payload = nucleus_api.DocInfo(
        dataset='Corporate_docs', 
        doc_ids=docs_tagged[i],
        metadata_selection='')
    api_response = api_instance.post_doc_info(payload)
    key = api_response.result[0].attribute['filename']
    d[key].append(entity)
d = dict(d)

query: You can refine the dataset tagging and expand your list of tickers (or other entity of relevance) to contain as many alternatives as you want.

You can also create a conservative superset list of tickers once, keep that list saved, and reuse it for each of the datasets you want to tag.

Finally, you can also do the same with foreign companies. For instance, you could define an entry of your list as ['Nintendo', 'NTDOY', '任天堂株式会社']

Pass that expanded list, looping through all distinct tickers, to the query argument in the main code of section 2. and rerun that code:
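For instance, looping through the expanded entities list defined above, one boolean query per entity can be built and passed to the query argument:

```python
# Expanded synonym lists, one entry per entity (same as in section 3.a)
entities = [['AAPL', 'Apple', 'iPhone'],
            ['GOOG', 'Google', 'Alphabet', 'Android'],
            ['NTDOY', 'Nintendo', '任天堂株式会社']]

# One MySQL-MATCH-style boolean query per entity, for the query argument
queries = [' OR '.join(synonyms) for synonyms in entities]
# queries[0] == 'AAPL OR Apple OR iPhone'
```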

Sentiment Dictionaries & Data Labeling

Constructing a Sentiment Dictionary

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

In its current version, SumUp contrast analysis works on the premise of two distinct categories of documents within a corpus, however those two categories are defined ex ante by the user based on metadata or content.

Data:

Nucleus APIs used:

Approach:

1. Training, Validation, Testing Dataset Preparation

print('---- Train / Validate / Test dataset ----')
folder = 'Sellside_research'         
dataset = 'Sellside_research'# str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }

# If the documents are not in a CSV or JSON, then you must specify sentiment labels in the file_iter object
# as an extra metadata field.

# If you are reading from a file where the sentiment label is already provided, 
# no need to pass the 'metadata' in the file_dict

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'sentiment': 'positive' # Here build some logic to decide how to assign POS / NEU / NEG
                                }}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)
















2. Unlabeled Dataset Preparation

print('---- Dataset to label ----')
folder = 'Sellside_research_unlabeled'         
dataset = 'Sellside_research_unlabeled'# str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED }

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file)}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)











3. Accelerating the Labeling of Data

metadata_selection = {"sentiment": ["positive", "negative"]} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted topic. Contained between 0 and 1, the smaller it is, the more contrasting terms will be captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset='Sellside_research', 
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        period_start='2017-01-01',
                                        period_end='2018-01-01')
api_response = api_instance.post_topic_contrast_api(payload)

print('Contrasted Topic')
print('    Keywords:', api_response.result.keywords)
print('    Keywords Weight:', api_response.result.keywords_weight)











fixed_topics = {"keywords": api_response.result.keywords, "weights": api_response.result.keywords_weight} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
metadata_selection = {"sentiment": ["positive", "negative"]} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasted topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=True, # This argument tells the API that data is labeled, so produces perf metrics
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
api_response = api_instance.post_doc_classify_api(payload)

print('Detailed Results')
print('    Docids:', api_response.result.detailed_results.docids)
print('    Exposure:', api_response.result.detailed_results.exposures)
print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
print('    Actual Category:', api_response.result.detailed_results.true_class)
print('\n')

print('Perf Metrics')
print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
print('    Recall:', api_response.result.perf_metrics.recall)
print('    Precision:', api_response.result.perf_metrics.precision)






















fixed_topics = {"keywords": api_response.result.keywords, "weights": api_response.result.keywords_weight} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
metadata_selection = {"sentiment": ["positive", "negative"]} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=True, # This argument tells the API that data is labeled, so produces perf metrics
                                        period_start='2019-01-01',
                                        period_end='2019-07-01')
api_response = api_instance.post_doc_classify_api(payload)

print('Detailed Results')
print('    Docids:', api_response.result.detailed_results.docids)
print('    Exposure:', api_response.result.detailed_results.exposures)
print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
print('    Actual Category:', api_response.result.detailed_results.true_class)
print('\n')

print('Perf Metrics')
print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
print('    Recall:', api_response.result.perf_metrics.recall)
print('    Precision:', api_response.result.perf_metrics.precision)
























fixed_topics = {"keywords": api_response.result.keywords, "weights": api_response.result.keywords_weight} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
metadata_selection = {"sentiment": ["positive", "negative"]} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset="Sellside_research_unlabeled",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=False)
api_response = api_instance.post_doc_classify_api(payload)

print('Detailed Results')
print('    Docids:', api_response.result.detailed_results.docids)
print('    Estimated Category:', api_response.result.detailed_results.estimated_class)












4. Generating a Sentiment Dictionary

print('---- Complete dataset ----')
folder = 'Sellside_research_combined'         
dataset = 'Sellside_research_combined' # str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }

# If the documents are not in a CSV or JSON, then you must specify sentiment labels in the file_iter object
# as an extra metadata field.

# If you are reading from a file where the sentiment label is already provided,
# there is no need to pass 'metadata' in the file_dict

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'sentiment': 'positive' # Here pass in the labels obtained in the previous step
                                }}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)
    

metadata_selection = {"sentiment": ["positive", "negative"]} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted topic. Bounded between 0 and 1; the smaller it is, the more contrasting terms are captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset='Sellside_research_combined', 
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words)
api_response = api_instance.post_topic_contrast_api(payload)

print('Contrasted Topic')
print('    Keywords:', api_response.result.keywords)
print('    Keywords Weight:', api_response.result.keywords_weight)


































5. Fine Tuning

5.a Excluding certain content from the contrast analysis

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection)
api_response = api_instance.post_topic_api(payload)        
    
for i, res in enumerate(api_response.result.topics):
    print('Topic', i, ' keywords: ', res.keywords)    
    print('---------------')





custom_stop_words = ["disclaimer","disclosure"] # list | List of stop words. (optional)

Using your domain expertise, client input, or advisor input, you can determine whether some of these topics or keywords are not differentiated enough to contribute to the contrast analysis.

You can then tailor the contrast analysis by creating a custom_stop_words variable that contains those words. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2:

5.b Specifying the metadata_selection for your contrasted topic


metadata_selection = {"research_analyst": ["MS", "JPM"]}

metadata_selection = {"content": "fundamentals"}

You can define the two document categories either from metadata (for instance, research from two different analysts) or from content. Initialize the metadata_selection variable accordingly, as in the examples above:

5.c Fine-tuning the contrasting topic

compression: The smaller the value, the more words are retained in the contrasting topic, each with progressively less impact on separating the two categories of sentiment you work with.

syntax_variables: If True, certain Part-of-Speech features are automatically included in the contrasting topic model. This may help when authors have vastly different writing styles, which is frequent in social media data and news but less likely in institutional publications.

threshold: The minimum exposure a document must have to the contrasting topic to be assigned to the category_1 you defined. A perfect model would have a threshold of 0, the default value. You may observe higher performance metrics in validation by choosing a different value, particularly with smaller training and validation samples, or when generic words appear as keywords in the contrasting topic.
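To illustrate the threshold point above, here is a small helper (purely illustrative, not part of the Nucleus SDK) that sweeps candidate thresholds against labeled validation output and keeps the one with the best accuracy. In practice, the exposures and true classes would come from the detailed results returned by post_doc_classify_api.

```python
def best_threshold(exposures, true_classes, candidates):
    """Return the exposure threshold with the highest validation accuracy.

    exposures: per-document exposure to the contrasting topic
    true_classes: 1 if a document truly belongs to class 1, else 0
    candidates: threshold values to try
    """
    def accuracy(t):
        # A document is predicted in class 1 when its exposure exceeds t
        hits = sum((exp > t) == bool(c) for exp, c in zip(exposures, true_classes))
        return hits / len(exposures)
    return max(candidates, key=accuracy)

# Toy validation data: class-1 documents cluster at higher exposures.
exposures = [0.05, 0.12, 0.31, 0.45, 0.62, 0.80]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(exposures, labels, [0, 0.1, 0.2, 0.4]))  # → 0.4
```

The chosen value can then be passed as the threshold argument in the classification payload.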

Transfer Learning

Topics Transfer Learning

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

Data:

The Nucleus Datafeed can be leveraged for all content from major Central Banks

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

# Leverage your own corpus
print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'News_feed'         
dataset = 'News_feed' # str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'source': 'Tech Crunch',
                                  'author': 'Sarah Moore',
                                  'category': 'Media',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
# Leverage a Nucleus embedded feed    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/rss_feed_finance' # embedded datafeed in Nucleus.

This can be done directly in the APIs that perform content analysis; see below.
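As a sketch (plain dict fields standing in for the nucleus_api model arguments; feed and dataset names as used in this documentation), an embedded feed is referenced exactly like a user-created dataset:

```python
# The embedded feed name is passed wherever a dataset name is expected;
# no upload step is needed for embedded feeds.
payload_fields = {
    "dataset0": "sumup/rss_feed_finance",  # embedded Nucleus datafeed (reference)
    "dataset1": "test_feed",               # user-created validation dataset
    "num_topics": 8,
    "num_keywords": 8,
}
# These fields would then be passed as
# nucleus_api.TopicTransferModel(**payload_fields).
assert payload_fields["dataset0"].startswith("sumup/")
```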





























2. Transfer Learning

print('------------------- Get topic transfer -----------------------')

payload = nucleus_api.TopicTransferModel(dataset0='News_feed', 
                                         dataset1="test_feed",
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        metadata_selection='')
try:
    api_response = api_instance.post_topic_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    doc_ids_t1 = api_response.result.doc_ids_t1
    topics = api_response.result.topics
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Strength:', res.strength)
        print('    Document IDs:', doc_ids_t1)
        print('    Exposure per Doc in Validation Dataset:', res.doc_topic_exposures_t1)
        print('---------------')
    
print('-------------------------------------------------------------')

















print('------------------- Get topic sentiment transfer -----------------------')

payload = nucleus_api.TopicSentimentTransferModel(dataset0='News_feed', 
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        period_0_start='2018-08-12',
                                        period_0_end='2018-08-15',
                                        period_1_start='2018-08-16',
                                        period_1_end='2018-08-19',
                                        metadata_selection='')
try:
    api_response = api_instance.post_topic_sentiment_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    topics = api_response.result
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Strength:', res.strength)
        print('    Sentiment:', res.sentiment)
        print('    Document IDs:', res.doc_ids_t1)
        print('    Sentiment per Doc in Validation Dataset:', res.doc_sentiments_t1)
        print('---------------')
    
print('-------------------------------------------------------------')

print('------------------- Get topic consensus transfer -----------------------')

payload = nucleus_api.TopicConsensusTransferModel(dataset0='News_feed', 
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        period_0_start='2018-08-12',
                                        period_0_end='2018-08-15',
                                        period_1_start='2018-08-16',
                                        period_1_end='2018-08-19',
                                        metadata_selection='')
try:
    api_response = api_instance.post_topic_consensus_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    topics = api_response.result
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Consensus:', res.consensus)
        print('---------------')
    
print('-------------------------------------------------------------')












































3. Results Interpretation

4. Fine Tuning

4.a Tailoring the topics

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='News_feed',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection,
                            period_start='2018-08-12',
                            period_end='2018-08-15')
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')













custom_stop_words = ["conference","interview"] # list | List of stop words. (optional)

You can then tailor the transfer learning by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2:

4.b Focusing the transfer learning on certain subjects

query = '(inflation OR growth OR unemployment OR stability OR regulation)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

In case you decide to focus the transfer learning, for instance on policy and macro-economic subjects, simply substitute the query variable in the main code of section 2. with:

4.c Alternative specifications for the validation dataset

validation dataset: Two approaches are possible.

  1. The reference and validation datasets are time-ordered. In that case, simply append the documents belonging to the validation dataset to the reference dataset, and use time selectors to define which time period is the reference and which is the validation.

  2. The reference and validation datasets are not necessarily time-ordered. In that case, pass two different datasets to the Topic Transfer APIs: dataset0 will be your reference dataset and dataset1 will be the validation dataset.

Note that Topic Transfer may not lead to any result if the topics extracted from the reference dataset aren't present in the validation dataset.
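The two approaches can be sketched as payload fields (plain dicts standing in for the TopicTransferModel arguments; the period selector names follow the sentiment-transfer sample above):

```python
# Approach 1: one time-ordered dataset, split into reference and
# validation periods with time selectors.
time_split_fields = {
    "dataset0": "News_feed",
    "period_0_start": "2018-08-12",  # reference period
    "period_0_end": "2018-08-15",
    "period_1_start": "2018-08-16",  # validation period
    "period_1_end": "2018-08-19",
}

# Approach 2: two separate datasets; no time ordering assumed.
two_dataset_fields = {
    "dataset0": "News_feed",  # reference dataset
    "dataset1": "test_feed",  # validation dataset
}

# Exactly one of the two mechanisms defines the validation set.
assert "dataset1" not in time_split_fields
assert not any(k.startswith("period_") for k in two_dataset_fields)
```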

4.d Specifying topics exogenously

# Example 1: in English and you decide of weights
fixed_topics = [{"keywords":["inflation expectations", "forward rates", "board projections"], "weights":[0.7, 0.2, 0.1]}]

# Example 2: in English and you don't provide weights. Equal weights will then be used
fixed_topics = [{"keywords":["inflation expectations", "forward rates", "board projections"]}]

# Example 3: in Chinese (if your dataset is in Chinese) and you don't provide weights
fixed_topics = [{"keywords":["操作", "流动性", "基点", "元", "点", "央行", "进一步", "投资"]},
                {"keywords":["认为", "价格", "数据", "调查", "全国", "统计", "金融市场", "要求"]}]


payload = nucleus_api.TopicTransferModel(dataset0='News_feed', 
                                        dataset1="test_feed",
                                        fixed_topics=fixed_topics,
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        metadata_selection='')
try:
    api_response = api_instance.post_topic_transfer_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    doc_ids_t1 = api_response.result.doc_ids_t1
    topics = api_response.result.topics
    for i,res in enumerate(topics):
        print('Topic', i, 'exposure within validation dataset:')
        print('    Keywords:', res.keywords)
        print('    Strength:', res.strength)
        print('    Document IDs:', doc_ids_t1)
        print('    Exposure per Doc in Validation Dataset:', res.doc_topic_exposures_t1)
        print('---------------')

You can impose topics to be transferred to your validation dataset. These fixed topics can be chosen through whichever approach you prefer. To pass them to any Transfer Learning API, use the fixed_topics optional input argument in the payload.

Contrast Analysis

Contrasted Topic Extraction, Summarization and Content Classification

The Python Notebook containing the use case can be directly downloaded at this link.


Objective:

In its current version, SumUp contrast analysis works on the premise of two distinct categories of documents within a corpus, however those two categories are defined ex-ante by the user based on metadata or content.

Data:

The Nucleus Datafeed can be leveraged for all content from major Central Banks and SEC filings

Nucleus APIs used:

Approach:

1. Dataset Preparation

import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'Sellside_research'         
dataset = 'Sellside_research' # str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'company': 'Apple',
                                  'research_analyst': 'MS',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/central_banks_chinese' # embedded datafeed in Nucleus.
metadata_selection = {'bank': 'people_bank_of_china', 'document_category': ('speech', 'press release')}
































2. Contrasted Topic Modeling

metadata_selection = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted topic. Bounded between 0 and 1; the smaller it is, the more contrasting terms are captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset='Sellside_research', 
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
try:
    api_response = api_instance.post_topic_contrast_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Contrasted Topic')
    print('    Keywords:', api_response.result.keywords)
    print('    Keywords Weight:', api_response.result.keywords_weight)

















3. Document Contrasted Summarization

print('---------------- Get doc contrasted summaries ------------------------')
metadata_selection = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the contrasted summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted summary. Bounded between 0 and 1; the smaller it is, the more contrasting terms are captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.DocumentContrastSummaryModel(dataset="Sellside_research", 
                                                    metadata_selection=metadata_selection,
                                                    custom_stop_words=custom_stop_words,
                                                    period_start='2018-01-01',
                                                    period_end='2019-01-01')
try:
    api_response = api_instance.post_document_contrast_summary_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Summary for', list(metadata_selection.values()))
    for sent in api_response.result.class_1_content.sentences:
        print('    *', sent)
    print('======')
    for sent in api_response.result.class_2_content.sentences:
        print('    *', sent)   














4. Documents Classification

fixed_topics = {"keywords": ["price target", "projected revenue", "economy"], "weights": [0.5, 0.25, 0.25]} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
metadata_selection = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=True,
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
    print('    Actual Category:', api_response.result.detailed_results.true_class)
    print('\n')

    print('Perf Metrics')
    print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
    print('    Recall:', api_response.result.perf_metrics.recall)
    print('    Precision:', api_response.result.perf_metrics.precision)

This task requires 3 steps: building the contrasting topic that separates the two categories, validating the classifier on labeled documents, and classifying new, unlabeled documents.



















fixed_topics = {"keywords": ["price target", "projected revenue", "economy"], "weights": [0.5, 0.25, 0.25]} # dict | The contrasting topic used to separate the two categories of documents
metadata_selection = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents that a document can be classified into

payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=False,
                                        period_start='2019-01-02',
                                        period_end='2019-06-01')
try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)

Then, we can move to the testing phase.

5. Fine Tuning

5.a Excluding certain content from the contrast analysis

print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection)
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

custom_stop_words = ["disclaimer","disclosure"] # list[str] | List of stop words. (optional)

Using your domain expertise, client input, or advisor input, you can determine whether some of those topics or keywords are not differentiated enough to contribute to the contrast analysis.

You can then tailor the contrast analysis by creating a custom_stop_words variable that contains those words. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2:

5.b Focusing the contrasted summary on specific subjects potentially discussed in your corpus

query = '(earnings OR cash flows)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

query: You can refine the contrast analysis by leveraging the query variable of the Doc Contrasted Summary API.

Rerun any of the 3 Contrast Analysis APIs on the content from your corpus that mentions a specific theme. Create a variable query and pass it into the payload:
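Queries follow the same mysql MATCH boolean format everywhere in the API. If you build them programmatically, a small helper (illustrative, not part of the SDK) can assemble the documented form:

```python
def boolean_query(*alternative_groups):
    """Build a mysql MATCH boolean query string: alternatives within a
    group are OR'ed, and groups are AND'ed together, e.g.
    "(word1 OR word2) AND (word3 OR word4)"."""
    return ' AND '.join('(' + ' OR '.join(group) + ')'
                        for group in alternative_groups)
```

For example, boolean_query(['earnings', 'cash flows']) produces the query string used above.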

5.c Specifying the metadata_selection for your contrasted topic

metadata_selection = {"research_analyst": ["MS", "JPM"]}

metadata_selection = {"bank": ["federal_reserve", "ECB"]}

metadata_selection = {"document_category": ["speech", "press release"]}

metadata_selection = {"content": "fundamentals"}
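The examples above fall into the two forms documented for the classification endpoints: a selection on a metadata field (one value, or a list of values) and a content-based selection. A tiny illustrative check of which form a given dict uses:

```python
def selection_kind(metadata_selection):
    """Distinguish the two documented metadata_selection forms:
    {"content": "word1 word2 ... wordN"} is content-based; any dict
    keyed on a metadata field is field-based."""
    return "content-based" if set(metadata_selection) == {"content"} else "field-based"
```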

API's Call Structure

Dashboard

post_custom_tracker_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.CustomTrackerModel() # CustomTrackerModel |

try:
    api_response = api_instance.post_custom_tracker_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_custom_tracker_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "consensuses": ["str"],
            "query": "str",
            "sentiments": ["str"],
            "strengths": ["str"],
            "time_stamps": ["str"]
        }
    ]
}

Http method: POST /dashboard/custom_tracker/post_custom_tracker_api

Get custom tracker on chosen dataset and queries.

Parameters

Name Type Description Notes
payload CustomTrackerModel

CustomTrackerModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset per query to aggregate back into a tracker. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset per query. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
n_steps int Number of steps in the historical analysis over the requested period. Steps are sized so that each contains an equal number of documents. [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
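The n_steps parameter splits the requested period so that each step holds an equal number of documents. A client-side sketch of that bucketing (the server's rule is only documented as "equal number of documents"; this is one interpretation):

```python
def equal_document_steps(doc_ids, n_steps):
    """Split a chronologically ordered list of document ids into n_steps
    consecutive buckets of (near-)equal size."""
    base, extra = divmod(len(doc_ids), n_steps)
    steps, start = [], 0
    for i in range(n_steps):
        end = start + base + (1 if i < extra else 0)
        steps.append(doc_ids[start:end])
        start = end
    return steps
```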

Return type:

CustomTrackerRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[CustomTrackerL1RespModel] [optional]
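When only job_id comes back, poll GET /jobs until the result is ready. A generic polling sketch, assuming a fetch_job callable that wraps the GET /jobs endpoint and returns a dict with a "result" key (both the callable and the response shape are placeholders, not SDK names):

```python
import time

def wait_for_result(fetch_job, job_id, interval=2.0, max_tries=30):
    """Poll a long-running job until its result is available,
    sleeping `interval` seconds between polls."""
    for _ in range(max_tries):
        response = fetch_job(job_id)
        if response.get("result") is not None:
            return response["result"]
        time.sleep(interval)
    raise TimeoutError("job %s still pending after %d polls" % (job_id, max_tries))
```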

CustomTrackerL1RespModel

Name Type Description Notes
query str Query analyzed. If no query was provided, then these are the hot topics. [optional]
strengths list[str] [optional]
sentiments list[str] [optional]
consensuses list[str] [optional]
time_stamps list[str] [optional]

[Back to top] [Back to API list]

post_key_authors_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.KeyAuthorsModel() # KeyAuthorsModel |

try:
    api_response = api_instance.post_key_authors_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_key_authors_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "query": "str",
            "top_authors": [
                {
                    "author_docs": ["str"],
                    "author_names": ["str"]
                }
            ]
        }
    ]
}

Http method: POST /dashboard/key_authors/post_key_authors_api

Get key authors on chosen dataset and queries.

Parameters

Name Type Description Notes
payload KeyAuthorsModel

KeyAuthorsModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset per query to aggregate back into the key authors analysis. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset per query. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
num_authors int Max number of key contributors that the user wants to see returned by the analysis. [optional]
num_keydocs int Max number of key contributions from key contributors that the user wants to see returned by the analysis. [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]

Return type:

KeyAuthorsRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[KeyAuthorsL1RespModel] [optional]

KeyAuthorsL1RespModel

Name Type Description Notes
query str Query analyzed. If no query was provided, then these are the hot topics. [optional]
top_authors list[KeyAuthorsL2RespModel] [optional]

[Back to top] [Back to API list]

post_smart_alerts_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.SmartAlertsModel() # SmartAlertsModel |

try:
    api_response = api_instance.post_smart_alerts_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_smart_alerts_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "new_sents": ["str"],
            "new_words": ["str"],
            "novel_docs": [
                {
                    "contrasted_summary": ["str"],
                    "title": "str"
                }
            ],
            "query": "str"
        }
    ]
}

Http method: POST /dashboard/smart_alerts/post_smart_alerts_api

Get smart alerts on chosen dataset and queries.

Parameters

Name Type Description Notes
payload SmartAlertsModel

SmartAlertsModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset per query to aggregate back into a tracker. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset per query. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
num_days_back int Number of days back from the latest date of the dataset that are treated as foreground for the novelty analysis. [optional]
novelty_threshold float Value of novelty of a document above which a document is considered novel. This value should be between 0 and 1 and reflects the percentage of the document information unexplained by the background. [optional]
num_new_words int Maximum number of new words returned from the new words detection [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

Return type:

SmartAlertsRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[SmartAlertsL1RespModel] [optional]

SmartAlertsL1RespModel

Name Type Description Notes
query str Query analyzed. If no query was provided, then these are the hot topics. [optional]
novel_docs list[SmartAlertsL2RespModel] [optional]
new_words list[str] [optional]
new_sents list[str] [optional]

[Back to top] [Back to API list]

Datasets

get_list_datasets

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

try:
    api_response = api_instance.get_list_datasets()
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->get_list_datasets: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "date_modified": "datetime",
            "name": "str"
        }
    ]
}

Http method: GET /datasets/get_list_datasets

List the datasets owned by the user.

Parameters

This endpoint does not need any parameter.

Return type:

ListDatasetsModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DatasetModel] [optional]

DatasetModel

Name Type Description Notes
name str Unique identifier of the dataset [optional]
date_modified datetime Datetime of last insertion or deletion of documents [optional]

[Back to top] [Back to API list]

post_append_json_to_dataset

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.Appendjsonparams() # Appendjsonparams |

try:
    api_response = api_instance.post_append_json_to_dataset(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_append_json_to_dataset: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "n_documents": "int",
        "size": "int"
    }
}

Http method: POST /datasets/append_json_to_dataset/post_append_json_to_dataset

Add a document to a dataset, in JSON form.

Parameters

Name Type Description Notes
payload Appendjsonparams

Appendjsonparams

Name Type Description Notes
dataset str Name of the dataset.
language str If you want to override language detection. [optional]
document Document [optional]

Return type:

AppendJsonRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result JsonPropertyModel [optional]

JsonPropertyModel

Name Type Description Notes
n_documents int Number of documents in the dataset [optional]
size int Size in bytes of the JSON record [optional]

[Back to top] [Back to API list]

post_bulk_insert_json

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.BulkInsertParams() # BulkInsertParams |

try:
    api_response = api_instance.post_bulk_insert_json(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_bulk_insert_json: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "n_documents": "int",
        "size": "int"
    }
}

Http method: POST /datasets/bulk_insert_json/post_bulk_insert_json

Add many documents to a dataset, in JSON form. Bulk insertion is much faster than making one API call per document.

Parameters

Name Type Description Notes
payload BulkInsertParams

BulkInsertParams

Name Type Description Notes
dataset str Name of the dataset.
language str If you want to override language detection. [optional]
documents list[Document] [optional]

Return type:

BulkInsertRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result JsonPropertyModel [optional]

JsonPropertyModel

Name Type Description Notes
n_documents int Number of documents in the dataset [optional]
size int Size in bytes of the JSON record [optional]
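Since bulk insertion is much faster than per-document calls, batch your documents before posting them. A sketch of the batching (the 500-document batch size is illustrative, not a documented limit):

```python
def document_batches(documents, batch_size=500):
    """Yield successive slices of a document list, so each slice can be
    sent in one bulk_insert_json call instead of one call per document."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]
```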

[Back to top] [Back to API list]

post_dataset_info

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DatasetInfo() # DatasetInfo |

try:
    api_response = api_instance.post_dataset_info(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_dataset_info: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "dataset": "str",
        "detected_language": "str",
        "metadata": "object",
        "num_documents": "str",
        "time_range": ["str"]
    }
}

Http method: POST /datasets/dataset_info/post_dataset_info

Get information about a dataset.

Parameters

Name Type Description Notes
payload DatasetInfo

DatasetInfo

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
owner_email str Has to be specified to access a shared dataset that is owned by another user [optional]

Return type:

DatasetInfoRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DatasetInfoModel [optional]

DatasetInfoModel

Name Type Description Notes
dataset str Dataset name [optional]
num_documents str Number of documents in the dataset [optional]
time_range list[str] [optional]
detected_language str Language of the dataset [optional]
metadata object Metadata information [optional]

[Back to top] [Back to API list]

post_dataset_tagging

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DatasetTagging() # DatasetTagging |

try:
    api_response = api_instance.post_dataset_tagging(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_dataset_tagging: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "dataset": "str",
        "doc_ids": ["str"],
        "entities_count": ["str"],
        "entities_tagged": ["str"]
    }
}

Http method: POST /datasets/dataset_tagging/post_dataset_tagging

Tag documents containing specified entities within a dataset.

Parameters

Name Type Description Notes
payload DatasetTagging

DatasetTagging

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query specifying an entity to tag in the dataset, with possibly several alternative namings, using mysql MATCH boolean query format. Example: "word1 OR word2 OR word3 OR word4"
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]

Return type:

DatasetTaggingRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DatasetTaggingL1RespModel [optional]

DatasetTaggingL1RespModel

Name Type Description Notes
dataset str Dataset name [optional]
doc_ids list[str] [optional]
entities_tagged list[str] [optional]
entities_count list[str] [optional]

[Back to top] [Back to API list]

post_delete_dataset

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.Deletedatasetmodel() # Deletedatasetmodel |

try:
    api_response = api_instance.post_delete_dataset(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_delete_dataset: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": "object"
}

Http method: POST /datasets/delete_dataset/post_delete_dataset

Delete an existing dataset from the user storage.

Parameters

Name Type Description Notes
payload Deletedatasetmodel

Deletedatasetmodel

Name Type Description Notes
dataset str Name of the dataset.

Return type:

DeleteDatasetRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result object Dataset deleted [optional]

[Back to top] [Back to API list]

post_delete_document

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.Deletedocumentmodel() # Deletedocumentmodel |

try:
    api_response = api_instance.post_delete_document(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_delete_document: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": "str"
}

Http method: POST /datasets/delete_document/post_delete_document

Delete documents from a dataset.

Parameters

Name Type Description Notes
payload Deletedocumentmodel

Deletedocumentmodel

Name Type Description Notes
dataset str Name of the dataset.
doc_ids list[str] [optional]

Return type:

DeleteDocumentRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result str Documents deleted from dataset [optional]

[Back to top] [Back to API list]

post_rename_dataset

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.Renamedatasetmodel() # Renamedatasetmodel |

try:
    api_response = api_instance.post_rename_dataset(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_rename_dataset: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": "object"
}

Http method: POST /datasets/rename_dataset/post_rename_dataset

Rename an existing dataset.

Parameters

Name Type Description Notes
payload Renamedatasetmodel

Renamedatasetmodel

Name Type Description Notes
dataset str Old name of the dataset.
new_name str New name of the dataset.

Return type:

RenameDatasetRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result object New dataset name [optional]

[Back to top] [Back to API list]

post_upload_file

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
file = '/path/to/file.txt' # file |
dataset = 'dataset_example' # str | Destination dataset where the file will be inserted.
metadata = 'metadata_example' # str | Optional json containing additional document metadata. Eg: {\"time\":\"01/01/2001\",\"author\":\"me\"} (optional)
filename = 'filename_example' # str | Specify the filename if you want to override the original filename (Nucleus guesses the file type from the file name extension) (optional)

try:
    api_response = api_instance.post_upload_file(file, dataset, metadata=metadata, filename=filename)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_upload_file: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "filename": "str",
        "size": "int"
    }
}

Http method: POST /datasets/upload_file/post_upload_file

Upload a file into a dataset.

Parameters

Name Type Description Notes
file file
dataset str Destination dataset where the file will be inserted.
metadata str Optional json containing additional document metadata. Eg: {"time":"01/01/2001","author":"me"} [optional]
filename str Specify the filename if you want to override the original filename (Nucleus guesses the file type from the file name extension) [optional]
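The metadata argument is a JSON string rather than a Python dict; building it with json.dumps avoids hand-escaping:

```python
import json

# Serialize document metadata into the JSON string form that
# post_upload_file expects for its `metadata` argument.
metadata = json.dumps({"time": "01/01/2001", "author": "me"})
```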


Return type:

UploadFileRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result FilePropertyModel [optional]

FilePropertyModel

Name Type Description Notes
filename str Filename of the file uploaded [optional]
size int File size in bytes of the file uploaded [optional]

[Back to top] [Back to API list]

post_upload_url

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.UploadURLModel() # UploadURLModel |

try:
    api_response = api_instance.post_upload_url(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_upload_url: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "file_url": "str",
        "size": "int"
    }
}

Http method: POST /datasets/import_file_from_url/post_upload_url

Import a file into a dataset from a public URL.

Parameters

Name Type Description Notes
payload UploadURLModel

UploadURLModel

Name Type Description Notes
dataset str Name of the dataset where the file will be inserted.
file_url str Public URL pointing to the file (pdf/txt/docx...)
filename str Specify the filename if different from the URL (Nucleus guesses the file type from the extension at the end of the URL or file name) [optional]
metadata object JSON containing document metadata key:value pairs (eg. author, time..) [optional]
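Nucleus guesses the file type from the extension at the end of the URL, or from the filename override when one is given. A client-side approximation of that rule (illustrative only):

```python
import os
from urllib.parse import urlparse

def guessed_extension(file_url, filename=None):
    """Approximate the documented rule: the override filename wins when
    given; otherwise the extension comes from the tail of the URL path."""
    name = filename if filename else os.path.basename(urlparse(file_url).path)
    return os.path.splitext(name)[1].lower()
```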

Return type:

UploadUrlRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result UrlPropertyModel [optional]

UrlPropertyModel

Name Type Description Notes
file_url str URL to the file uploaded [optional]
size int File size in bytes of the URL uploaded [optional]

[Back to top] [Back to API list]

Documents

post_doc_classify_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to setup prefix (e.g. Bearer) for API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocClassifyModel() # DocClassifyModel |

try:
    api_response = api_instance.post_doc_classify_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_classify_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "detailed_results": {
            "docids": ["int"],
            "estimated_class": ["str"],
            "true_class": ["str"]
        },
        "perf_metrics": {
            "accuracy": "float",
            "balanced_accuracy": "float",
            "f1": "float",
            "precision": "float",
            "recall": "float"
        }
    }
}

Http method: POST /documents/document_classify/post_doc_classify_api

Run a one-layer document classifier on a chosen dataset.

Parameters

Name Type Description Notes
payload DocClassifyModel

DocClassifyModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
fixed_topics object JSON object specifying the contrasting topic that is exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]}
classifier_config object JSON object specifying the configuration of the classifier from the training phase, of type {"coefs": [coef_1, coef_2, coef_3], "intercept": intercept}
metadata_selection object JSON object specifying the 2 classes subject to classification, of type: if based on content selection, {"content": "word1 word2 ... wordN"}; if based on another metadata field, {"metadata_field": ["values_class1", "values_class2"]}
custom_stop_words list[str] [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
syntax_variables bool If True, the classifier will include syntax-related variables on top of content variables [optional] [default to False]
validation_phase bool If True, the classifier assumes that the dataset provided is labeled with the 2 classes and will use that to compute accuracy/precision/recall [optional] [default to False]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate has the same NLP representation, but not necessarily the exact same text. [optional] [default to False]
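The nested JSON parameters above can be built as plain Python dicts before being assigned to the payload. A minimal sketch, with a hypothetical dataset name, metadata field, and values:

```python
# All values below are hypothetical, for illustration only.
fields = {
    "dataset": "company_filings",                       # assumed dataset name
    "fixed_topics": {
        "keywords": ["merger", "acquisition", "buyout"],
        "weights": [0.5, 0.3, 0.2],                     # one weight per keyword
    },
    "classifier_config": {
        "coefs": [1.2, -0.4, 0.7],                      # from the training phase
        "intercept": 0.1,
    },
    # Two classes defined by a metadata field:
    "metadata_selection": {"category": ["press_release", "earnings_call"]},
    "validation_phase": True,
}

# The keyword and weight lists must stay aligned.
assert len(fields["fixed_topics"]["keywords"]) == len(fields["fixed_topics"]["weights"])
```

These dicts correspond field-for-field to the DocClassifyModel parameters listed above.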

Return type:

DocClassifyRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocClassifyL1RespModel [optional]

DocClassifyL1RespModel

Name Type Description Notes
detailed_results DocClassifyL2DRRespModel [optional]
perf_metrics DocClassifyL2PMRespModel [optional]

[Back to top] [Back to API list]

post_doc_display

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocDisplay() # DocDisplay |

try:
    api_response = api_instance.post_doc_display(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_display: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "attribute": "object",
            "content": "str",
            "sourceid": "str",
            "title": "str"
        }
    ]
}

Http method: POST /documents/document_display/post_doc_display

Document display.

Parameters

Name Type Description Notes
payload DocDisplay

DocDisplay

Name Type Description Notes
dataset str Dataset name.
doc_titles list[str] [optional]
doc_ids list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset. If titles or docids are also provided, then this selection is ignored. Format: {"metadata_field": "selected_values"} [optional]
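The precedence rule above (titles or IDs override a metadata query) can be sketched as a small helper; the function and the field values are hypothetical, not part of the SDK:

```python
def build_doc_selection(doc_titles=None, doc_ids=None, metadata_selection=None):
    """Return only the selection fields the endpoint will actually honor:
    metadata_selection is ignored when titles or IDs are supplied."""
    selection = {}
    if doc_titles:
        selection["doc_titles"] = doc_titles
    if doc_ids:
        selection["doc_ids"] = doc_ids
    if metadata_selection and not selection:
        selection["metadata_selection"] = metadata_selection
    return selection

# Titles win over the metadata query:
build_doc_selection(doc_titles=["Q3 report"], metadata_selection={"author": "J. Doe"})
# -> {'doc_titles': ['Q3 report']}
```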

Return type:

DocDisplayRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DocDisplayL1RespModel] [optional]

DocDisplayL1RespModel

Name Type Description Notes
title str Document title [optional]
sourceid str Document ID [optional]
content str Document content [optional]
attribute object JSON containing document metadata key:value pairs (e.g. author, time) [optional]

[Back to top] [Back to API list]

post_doc_info

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocInfo() # DocInfo |

try:
    api_response = api_instance.post_doc_info(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_info: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "attribute": "object",
            "sourceid": "str",
            "title": "str"
        }
    ]
}

Http method: POST /documents/document_info/post_doc_info

Retrieve metadata of documents matching the provided filter (limited to 10000 documents).

Parameters

Name Type Description Notes
payload DocInfo

DocInfo

Name Type Description Notes
dataset str Dataset name.
doc_titles list[str] [optional]
doc_ids list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset. If titles or docids are also provided, then this selection is ignored. Format: {"metadata_field": "selected_values"} [optional]

Return type:

DocInfoRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[NestedDocInfoModel] [optional]

NestedDocInfoModel

Name Type Description Notes
title str Document title [optional]
sourceid str Document ID [optional]
attribute object JSON containing document metadata key:value pairs (e.g. author, time) [optional]

[Back to top] [Back to API list]

post_doc_new_words_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocumentNewWordsModel() # DocumentNewWordsModel |

try:
    api_response = api_instance.post_doc_new_words_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_new_words_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "new_sents": ["str"],
            "new_words": ["str"],
            "query": ["str"]
        }
    ]
}

Http method: POST /documents/document_new_words/post_doc_new_words_api

Document new words.

Parameters

Name Type Description Notes
payload DocumentNewWordsModel

DocumentNewWordsModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using the MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_days_back int Number of days back from the latest date in the dataset that are treated as the foreground for the new-words analysis. [optional]
num_new_words int Maximum number of new words returned from the new words detection [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection for the period considered as background for the novelty analysis [optional]
period_start str Alternative 2: Start date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

Return type:

DocumentNewWordsRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DocumentNewWordsL1Model] [optional]

DocumentNewWordsL1Model

Name Type Description Notes
query list[str] [optional]
new_words list[str] [optional]
new_sents list[str] [optional]

[Back to top] [Back to API list]

post_doc_novelty_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocumentNoveltyModel() # DocumentNoveltyModel |

try:
    api_response = api_instance.post_doc_novelty_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_novelty_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "novel_docs": ["str"],
        "novelty_scores": ["str"],
        "query": ["str"]
    }
}

Http method: POST /documents/document_novelty/post_doc_novelty_api

Document novelty.

Parameters

Name Type Description Notes
payload DocumentNoveltyModel

DocumentNoveltyModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using the MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_days_back int Number of days back from the latest date in the dataset that are treated as the foreground for the novelty analysis. [optional]
novelty_threshold float Novelty score above which a document is considered novel. This value must be between 0 and 1 and reflects the fraction of the document's information unexplained by the background. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection for the period considered as background for the novelty analysis [optional]
period_start str Alternative 2: Start date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]
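A sketch of a novelty request built as a plain dict; the dataset name and period values are hypothetical:

```python
# Hypothetical example values, for illustration only.
novelty_fields = {
    "dataset": "news_feed",              # assumed dataset name
    "num_days_back": 7,                  # foreground: the last 7 days of the dataset
    "novelty_threshold": 0.35,           # flag docs with >35% unexplained information
    "period_start": "2019-01-01",        # background window (Alternative 2)
    "period_end": "2019-06-23 00:00:00",
}

# The threshold is a fraction of unexplained information, so it must lie in [0, 1].
assert 0.0 <= novelty_fields["novelty_threshold"] <= 1.0
```

These fields correspond to the DocumentNoveltyModel parameters listed above.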

Return type:

DocumentNoveltyRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentNoveltyL1Model [optional]

DocumentNoveltyL1Model

Name Type Description Notes
query list[str] [optional]
novel_docs list[str] [optional]
novelty_scores list[str] [optional]

[Back to top] [Back to API list]

post_doc_recommend_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocumentRecommendModel() # DocumentRecommendModel |

try:
    api_response = api_instance.post_doc_recommend_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_recommend_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "keywords": "str",
            "recommendations": [
                {
                    "attribute": "object",
                    "sourceid": "str",
                    "title": "str"
                }
            ]
        }
    ]
}

Http method: POST /documents/document_recommend/post_doc_recommend_api

Recommend documents for topics extracted from a chosen dataset.

Parameters

Name Type Description Notes
payload DocumentRecommendModel

DocumentRecommendModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using the MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
num_docs int Number of desired recommended docs per topic. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

Return type:

DocumentRecommendRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DocumentRecommendL1RespModel] [optional]

DocumentRecommendL1RespModel

Name Type Description Notes
keywords str Topic keywords [optional]
recommendations list[DocumentRecommendL2RespModel] [optional]

[Back to top] [Back to API list]

post_doc_sentiment_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocumentSentimentModel() # DocumentSentimentModel |

try:
    api_response = api_instance.post_doc_sentiment_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_sentiment_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "doc_title": "str",
        "sentiment": "str",
        "sourceid": "str"
    }
}

Http method: POST /documents/document_sentiment/post_doc_sentiment_api

Document sentiment.

Parameters

Name Type Description Notes
payload DocumentSentimentModel

DocumentSentimentModel

Name Type Description Notes
dataset str Dataset name.
doc_title str The title of the document to be analyzed.
query str Dataset-language-specific fulltext query, using the MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
num_topics int Number of topics to be extracted from the document to estimate the document's sentiment. [optional]
num_keywords int Number of keywords per topic that is extracted from the document. [optional]
custom_stop_words list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]

Return type:

DocumentSentimentRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentSentimentL1Model [optional]

DocumentSentimentL1Model

Name Type Description Notes
doc_title str Document title [optional]
sentiment str Document sentiment [optional]
sourceid str Document ID [optional]

[Back to top] [Back to API list]

post_doc_summary_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocumentSummaryModel() # DocumentSummaryModel |

try:
    api_response = api_instance.post_doc_summary_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_doc_summary_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "doc_title": "str",
        "summary": {
            "sentences": ["str"],
            "sourceid": "str"
        }
    }
}

Http method: POST /documents/document_summary/post_doc_summary_api

Document summarization.

Parameters

Name Type Description Notes
payload DocumentSummaryModel

DocumentSummaryModel

Name Type Description Notes
dataset str Dataset name.
doc_title str The title of the document to be summarized.
query str Dataset-language-specific fulltext query, using the MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
summary_length int The maximum number of bullet points a user wants to see in each topic summary. [optional]
context_amount int The number of sentences surrounding key summary sentences in the documents that they come from. [optional]
short_sentence_length int The sentence length (in number of words) below which a sentence is excluded from summarization. [optional]
long_sentence_length int The sentence length (in number of words) beyond which a sentence is excluded from summarization. [optional]
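short_sentence_length and long_sentence_length act as a word-count filter applied before summarization; a minimal sketch of that rule (the helper and its default bounds are hypothetical):

```python
def eligible_for_summary(sentence, short_len=4, long_len=40):
    """Keep sentences whose word count lies within the two bounds,
    mirroring short_sentence_length / long_sentence_length."""
    n_words = len(sentence.split())
    return short_len <= n_words <= long_len

eligible_for_summary("Quarterly revenue grew twelve percent year over year.")  # True (8 words)
eligible_for_summary("Noted.")                                                 # False (too short)
```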

Return type:

DocumentSummaryRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentSummaryL1Model [optional]

DocumentSummaryL1Model

Name Type Description Notes
doc_title str Document title [optional]
summary DocumentSummaryL2Model [optional]

[Back to top] [Back to API list]

post_document_contrast_summary_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DocumentContrastSummaryModel() # DocumentContrastSummaryModel |

try:
    api_response = api_instance.post_document_contrast_summary_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_document_contrast_summary_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "class_1_content": {
            "sentences": ["str"],
            "sourceid": "str"
        },
        "class_2_content": {
            "sentences": ["str"],
            "sourceid": "str"
        }
    }
}

Http method: POST /documents/document_contrasted_summary/post_document_contrast_summary_api

Document contrasted summarization.

Parameters

Name Type Description Notes
payload DocumentContrastSummaryModel

DocumentContrastSummaryModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using the MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
metadata_selection object JSON object specifying the 2 classes subject to contrasted summarization, of type: if based on content selection, {"content": "word1 word2 ... wordN"}; if based on another metadata field, {"metadata_field": ["values_class1", "values_class2"]}
custom_stop_words list[str] [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
summary_length int The maximum number of bullet points a user wants to see in each topic summary. [optional]
context_amount int The number of sentences surrounding key summary sentences in the documents that they come from. [optional]
short_sentence_length int The sentence length (in number of words) below which a sentence is excluded from summarization. [optional]
long_sentence_length int The sentence length (in number of words) beyond which a sentence is excluded from summarization. [optional]
excluded_docs list[str] [optional]
syntax_variables bool If True, the contrasted summary will leverage syntax-related variables on top of content variables to better separate the document from the rest [optional] [default to False]
num_keywords int Number of keywords of the contrasted topic that is extracted from the dataset. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate has the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

Return type:

DocumentContrastSummaryRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentContrastSummaryL1Model [optional]

DocumentContrastSummaryL1Model

Name Type Description Notes
class_1_content DocumentContrastSummaryL2Model [optional]
class_2_content DocumentContrastSummaryL2Model [optional]

[Back to top] [Back to API list]

Feeds

post_available_sec_filings

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.EdgarFields() # EdgarFields |

try:
    api_response = api_instance.post_available_sec_filings(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_available_sec_filings: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": "object"
}

Http method: POST /feeds/available_sec_filings/post_available_sec_filings

Get information about the available SEC filings. If no input is passed, returns the list of all available tickers. If tickers are passed, returns the list of available document types for those tickers. If document types are also passed, returns the list of available sections for the selected tickers/filing types.

Parameters

Name Type Description Notes
payload EdgarFields

EdgarFields

Name Type Description Notes
tickers list[str] List of tickers to be scraped (e.g. ["GOOG"]) [optional]
filing_types list[str] List of form types to be scraped (e.g. ["10K"]) [optional]
sections list[str] List of document sections to be scraped. If empty all sections will be scraped [optional]
period_start date Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end date End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]

Return type:

AvailableFilingsResponseModel

Name Type Description Notes
result object The list of available tickers, filing_types and sections matching the query [optional]
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]

[Back to top] [Back to API list]

post_create_dataset_from_sec_filings

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.EdgarQuery() # EdgarQuery |

try:
    api_response = api_instance.post_create_dataset_from_sec_filings(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_create_dataset_from_sec_filings: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": "object"
}

Http method: POST /feeds/create_dataset_from_sec_filings/post_create_dataset_from_sec_filings

Creates a new dataset and populates it with SEC filings matching the specified tickers/form types.

Parameters

Name Type Description Notes
payload EdgarQuery

EdgarQuery

Name Type Description Notes
destination_dataset str Name of the new dataset where the scraped documents will be inserted.
tickers list[str] List of tickers to be scraped (e.g. ["GOOG"])
filing_types list[str] List of form types to be scraped (e.g. ["10K"])
sections list[str] List of document sections to be scraped. If empty all sections will be scraped [optional]
period_start date Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end date End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
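A sketch of the EdgarQuery fields as a plain dict; the tickers, dates, and dataset name are hypothetical examples:

```python
from datetime import datetime

# Hypothetical example values, for illustration only.
edgar_fields = {
    "destination_dataset": "sec_10k_sample",  # assumed new dataset name
    "tickers": ["GOOG", "AAPL"],
    "filing_types": ["10K"],
    "sections": [],                           # empty list: scrape all sections
    "period_start": "2017-01-01",
    "period_end": "2018-12-31",
}

# Both dates must parse as YYYY-MM-DD.
for key in ("period_start", "period_end"):
    datetime.strptime(edgar_fields[key], "%Y-%m-%d")
```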

Return type:

CreateSecDatasetResponseModel

Name Type Description Notes
result object The JSON containing results [optional]
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]

[Back to top] [Back to API list]

Filters

get_list_filters

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

try:
    api_response = api_instance.get_list_filters()
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->get_list_filters: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": ["object"]
}

Http method: GET /filters/get_list_filters

List the filters owned by the user.

Parameters

This endpoint does not need any parameter.

Return type:

ListFiltersModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[object] List of user filters [optional]

[Back to top] [Back to API list]

post_delete_filter

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.DeleteFilterModel() # DeleteFilterModel |

try:
    api_response = api_instance.post_delete_filter(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_delete_filter: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": "object"
}

Http method: POST /filters/delete_filter/post_delete_filter

Delete a filter.

Parameters

Name Type Description Notes
payload DeleteFilterModel

DeleteFilterModel

Name Type Description Notes
filter_id int ID of the filter to be deleted

Return type:

DeleteFilterRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result object {"deleted":filter_id} [optional]

[Back to top] [Back to API list]

post_save_filter

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.SaveFilterModel() # SaveFilterModel |

try:
    api_response = api_instance.post_save_filter(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_save_filter: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "custom_stop_words": ["str"],
        "metadata_selection": "object",
        "query": "str",
        "time_period": "str"
    }
}

Http method: POST /filters/save_filter/post_save_filter

Save a filter representing a subset of a dataset (time range, query, metadata, etc.).

Parameters

Name Type Description Notes
payload SaveFilterModel

SaveFilterModel

Name Type Description Notes
dataset str Dataset name.
filter FilterModel [optional]

Return type:

SaveFilterRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result FilterModel [optional]

FilterModel

Name Type Description Notes
query str Fulltext query, using the MySQL MATCH boolean query format. Example: ("word1" OR "word2") AND ("word3" OR "word4") [optional]
custom_stop_words list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
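A filter bundles the reusable selection criteria listed above into one object; a sketch with hypothetical values (all FilterModel fields are optional):

```python
# Hypothetical filter values, for illustration only.
my_filter = {
    "query": '("earnings" OR "guidance") AND ("Q3" OR "Q4")',  # MATCH boolean syntax
    "custom_stop_words": ["inc", "llc"],
    "metadata_selection": {"source": "press_release"},
}

# Only FilterModel fields should appear.
allowed = {"query", "custom_stop_words", "metadata_selection", "time_period"}
assert set(my_filter) <= allowed
```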

[Back to top] [Back to API list]

Jobs

get_job

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
id = 'id_example' # str | ID of the job

try:
    api_response = api_instance.get_job(id)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->get_job: %s\n" % e)

The above command returns JSON structured like this:

{
    "error": "str",
    "job_id": "str",
    "last_update": "int",
    "progress": "str",
    "result": "str"
}

Http method: GET /jobs/get_job

Use this endpoint to check the progress of a job and retrieve its results. Poll it repeatedly until result is not null.

Parameters

Name Type Description Notes
id str ID of the job

Return type:

JobRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
progress str JSON encoded progress of the job. [optional]
result str JSON encoded result of the job. [optional]
error str error messages [optional]
last_update int Last time the job progress was updated (UNIX seconds from 1970) [optional]
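The job_id polling pattern described above can be sketched as a small loop; `fetch` stands in for a wrapper around `api_instance.get_job` and is an assumption, not SDK code:

```python
import time

def poll_job(fetch, job_id, interval=2.0, timeout=120.0):
    """Call fetch(job_id) until the result field is set, then return it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = fetch(job_id)                 # expected to return a dict-like response
        if resp.get("error"):
            raise RuntimeError(resp["error"])
        if resp.get("result") is not None:
            return resp["result"]
        time.sleep(interval)
    raise TimeoutError("job %s did not complete within %ss" % (job_id, timeout))
```

In practice `fetch` would wrap `api_instance.get_job` and decode the JSON-encoded progress and result fields.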

[Back to top] [Back to API list]

post_example_job

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
color = 'color_example' # str | A color
wait_time = 0 # int | Seconds to wait before returning the result (default to 0)

try:
    api_response = api_instance.post_example_job(color, wait_time)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_example_job: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "color": "str",
        "user_email": "str"
    }
}

HTTP method: POST /jobs/start_example_job/post_example_job

Start an example background job

Parameters

Name Type Description Notes
color str A color
wait_time int Seconds to wait before returning the result [default to 0]

Return type:

ExampleJobResponse

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result ExampleJobInnerResponse [optional]

ExampleJobInnerResponse

Name Type Description Notes
color str The color you chose [optional]
user_email str Email of the user who started the job [optional]

[Back to top] [Back to API list]

Legacy

post_legacy

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.ApiCall() # ApiCall |

try:
    api_response = api_instance.post_legacy(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_legacy: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "results": "str"
}

HTTP method: POST /legacy/post_legacy

Invoke a legacy API call on the Nucleus server.

Parameters

Name Type Description Notes
payload ApiCall

ApiCall

Name Type Description Notes
api_call str Path to the legacy API call (e.g. 'analyze.get_dataset_info')
params str JSON string containing the arguments (e.g. JSON.stringify({'dataset':'my_dataset'})) [optional]
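Note that params must be a JSON string, not a nested object. A minimal sketch of assembling the two fields (plain dicts for illustration; in the SDK these are passed to nucleus_api.ApiCall):

```python
import json

# Arguments for the legacy call must be JSON-encoded into a string first.
legacy_args = json.dumps({"dataset": "my_dataset"})

payload_fields = {
    "api_call": "analyze.get_dataset_info",  # path to the legacy call
    "params": legacy_args,                   # JSON string, not a dict
}
```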

Return type:

LegacyResponseModel

Name Type Description Notes
results str The JSON-stringified results [optional]
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]

[Back to top] [Back to API list]

Topics

post_author_connectivity_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.AuthorConnection() # AuthorConnection |

try:
    api_response = api_instance.post_author_connectivity_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_author_connectivity_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "mainstream_connections": [
            {
                "authors": ["str"],
                "keywords": "str"
            }
        ],
        "niche_connections": [
            {
                "authors": ["str"],
                "keywords": "str"
            }
        ]
    }
}

HTTP method: POST /topics/author_connectivity/post_author_connectivity_api

Get the network of authors similar to a reference author.

Parameters

Name Type Description Notes
payload AuthorConnection

AuthorConnection

Name Type Description Notes
dataset str Dataset name.
target_author str Name of the author to be analyzed.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
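The time filters are alternatives: pass either time_period or the period_start/period_end pair, never both. A sketch of the two styles plus a metadata query (the field name author and the period code are assumptions for illustration):

```python
# Alternative 1: a relative time period (the accepted codes are defined by the API)
by_period = {"time_period": "3M"}  # assumed code for the trailing 3 months

# Alternative 2: an explicit date range
by_dates = {"period_start": "2019-01-01", "period_end": "2019-06-30"}

# Metadata-based selection, of type {"metadata_field": "selected_values"}
by_metadata = {"author": "author_1"}

# The two time styles are mutually exclusive: no shared keys.
assert not (set(by_period) & set(by_dates))
```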

Return type:

AuthorConnectRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result AuthorConnectL1RespModel [optional]

AuthorConnectL1RespModel

Name Type Description Notes
mainstream_connections list[AuthorConnectL2RespModel] [optional]
niche_connections list[AuthorConnectL2RespModel] [optional]

[Back to top] [Back to API list]

post_topic_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.Topics() # Topics |

try:
    api_response = api_instance.post_topic_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "doc_ids": ["str"],
        "topics": [
            {
                "doc_topic_exposures": ["str"],
                "keywords": "str",
                "keywords_weight": ["str"],
                "strength": "str"
            }
        ]
    }
}

HTTP method: POST /topics/topics/post_topic_api

Get key topics from a given dataset.

Parameters

Name Type Description Notes
payload Topics

Topics

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]
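Putting the common fields together, a topic-extraction request body might look like this (a plain dict for illustration; in the SDK these are keyword arguments to nucleus_api.Topics, and the query value is a placeholder):

```python
topics_fields = {
    "dataset": "my_dataset",           # required
    "query": '("word1" OR "word2")',   # MySQL MATCH boolean syntax
    "num_topics": 8,
    "num_keywords": 8,
    "custom_stop_words": ["the", "and"],
    "remove_redundancies": True,       # matches the documented default
}
```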

Return type:

TopicRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result TopicL1RespModel [optional]

TopicL1RespModel

Name Type Description Notes
doc_ids list[str] [optional]
topics list[TopicL2RespModel] [optional]

[Back to top] [Back to API list]

post_topic_consensus_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicConsensusModel() # TopicConsensusModel |

try:
    api_response = api_instance.post_topic_consensus_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_consensus_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "consensus": "str",
            "keywords": "str",
            "strength": "str"
        }
    ]
}

HTTP method: POST /topics/topic_consensus/post_topic_consensus_api

Get topic consensus for topics extracted from a given dataset.

Parameters

Name Type Description Notes
payload TopicConsensusModel

TopicConsensusModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

Return type:

TopicConsensusRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result list[NestedTopicConsensusModel] [optional]

NestedTopicConsensusModel

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength [optional]
consensus str Topic consensus [optional]

[Back to top] [Back to API list]

post_topic_consensus_transfer_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicConsensusTransferModel() # TopicConsensusTransferModel |

try:
    api_response = api_instance.post_topic_consensus_transfer_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_consensus_transfer_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "consensus": "str",
            "keywords": "str",
            "strength": "str"
        }
    ]
}

HTTP method: POST /topics/topic_consensus_transfer/post_topic_consensus_transfer_api

Get exposures of documents in a validation dataset to topics extracted from a reference dataset.

Parameters

Name Type Description Notes
payload TopicConsensusTransferModel

TopicConsensusTransferModel

Name Type Description Notes
dataset0 str Name of reference dataset, on which topics are extracted.
dataset1 str Alternative 1: Name of validation dataset, on which topics are applied. Only pass in this argument if the validation dataset has been separately created. [optional]
fixed_topics object JSON object specifying the topics that are exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]} [optional]
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Alternative 2: Start date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_0_end str Alternative 2: End date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_start str Alternative 2: Start date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_end str Alternative 2: End date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]
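The fixed_topics object pins topics exogenously; its keywords and weights lists must line up one-to-one. A minimal sketch with placeholder values:

```python
# Topics fixed ahead of extraction, following the documented shape:
# {"keywords": [...], "weights": [...]}
fixed_topics = {
    "keywords": ["keyword_1", "keyword_2", "keyword_3"],
    "weights": [0.5, 0.3, 0.2],
}

# Each keyword needs a matching weight.
assert len(fixed_topics["keywords"]) == len(fixed_topics["weights"])
```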

Return type:

TopicConsensusTransferRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result list[NestedTopicConsensusTransferModel] [optional]

NestedTopicConsensusTransferModel

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength on the validation dataset [optional]
consensus str Topic consensus on the validation dataset [optional]

[Back to top] [Back to API list]

post_topic_contrast_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicContrastModel() # TopicContrastModel |

try:
    api_response = api_instance.post_topic_contrast_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_contrast_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "classifier_config": {
            "coef_": ["float"],
            "intercept_": ["float"]
        },
        "keywords": ["str"],
        "keywords_weight": ["float"],
        "perf_metrics": {
            "accuracy": "float",
            "balanced_accuracy": "float",
            "f1": "float",
            "precision": "float",
            "recall": "float"
        }
    }
}

HTTP method: POST /topics/topic_contrast/post_topic_contrast_api

Extract a topic contrasting two classes of documents within a given dataset.

Parameters

Name Type Description Notes
payload TopicContrastModel

TopicContrastModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
metadata_selection object JSON object specifying the two classes subject to classification: if based on content selection, {"content": "word1 word2 ... wordN"}; if based on another field, {"metadata_field": ["values_class1", "values_class2"]}
custom_stop_words list[str] [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
syntax_variables bool If True, the classifier will include syntax-related variables on top of content variables [optional] [default to False]
num_keywords int Number of keywords for the contrasted topic that is extracted from the dataset. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]
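For topic_contrast, metadata_selection defines the two classes to separate, in either of the shapes described above (the field name and values here are placeholders):

```python
# Class split based on document content
by_content = {"content": "word1 word2 word3"}

# Class split based on a metadata field with two groups of values
by_field = {"metadata_field": ["values_class1", "values_class2"]}

# Exactly two classes are expected in the field-based form.
assert len(by_field["metadata_field"]) == 2
```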

Return type:

TopicContrastRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result TopicContrastL1RespModel [optional]

TopicContrastL1RespModel

Name Type Description Notes
keywords list[str] [optional]
keywords_weight list[float] [optional]
perf_metrics TopicContrastL21RespModel [optional]
classifier_config TopicContrastL22RespModel [optional]

[Back to top] [Back to API list]

post_topic_delta_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicDeltaModel() # TopicDeltaModel |

try:
    api_response = api_instance.post_topic_delta_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_delta_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "doc_ids_t0": ["str"],
        "doc_ids_t1": ["str"],
        "topics": [
            {
                "doc_topic_exposure_deltas": ["str"],
                "keywords": "str",
                "keywords_weight": "str",
                "strength": "str"
            }
        ]
    }
}

HTTP method: POST /topics/topic_delta/post_topic_delta_api

Get changes in documents' exposure to key topics between two dates within a dataset.

Parameters

Name Type Description Notes
payload TopicDeltaModel

TopicDeltaModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Start date for the initial-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_0_end str End date for the initial-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_1_start str Start date for the final-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_1_end str End date for the final-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]
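Unlike most endpoints, the four period bounds are required here and use the full timestamp format. A quick sanity check of the expected shape (dates are placeholders):

```python
from datetime import datetime

delta_fields = {
    "dataset": "my_dataset",
    "period_0_start": "2019-01-01 00:00:00",
    "period_0_end": "2019-03-31 23:59:59",
    "period_1_start": "2019-04-01 00:00:00",
    "period_1_end": "2019-06-30 23:59:59",
}

# All four bounds must parse as "YYYY-MM-DD HH:MM:SS".
fmt = "%Y-%m-%d %H:%M:%S"
for key in ("period_0_start", "period_0_end", "period_1_start", "period_1_end"):
    datetime.strptime(delta_fields[key], fmt)
```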

Return type:

TopicDeltaRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result TopicDeltaL1RespModel [optional]

TopicDeltaL1RespModel

Name Type Description Notes
doc_ids_t0 list[str] [optional]
doc_ids_t1 list[str] [optional]
topics list[TopicDeltaL2RespModel] [optional]

[Back to top] [Back to API list]

post_topic_historical_analysis_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicHistoryModel() # TopicHistoryModel |

try:
    api_response = api_instance.post_topic_historical_analysis_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_historical_analysis_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "consensuses": ["str"],
            "keywords": "str",
            "sentiments": ["str"],
            "strengths": ["str"],
            "time_stamps": ["str"]
        }
    ]
}

HTTP method: POST /topics/topic_historical/post_topic_historical_analysis_api

Get a historical analysis of topics extracted from a dataset.

Parameters

Name Type Description Notes
payload TopicHistoryModel

TopicHistoryModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
n_steps int Number of steps in the historical analysis over the requested period. Steps are sized so that each contains an equal number of documents. [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
recompute_topics bool If True, this option will trigger a recomputation of the topics at each past point in time. Especially helpful if conducting historical analysis of a query. [optional] [default to False]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

Return type:

TopicHistoryRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result list[TopicHistoryL1RespModel] [optional]

TopicHistoryL1RespModel

Name Type Description Notes
keywords str Topic keywords [optional]
strengths list[str] [optional]
sentiments list[str] [optional]
consensuses list[str] [optional]
time_stamps list[str] [optional]

[Back to top] [Back to API list]

post_topic_sentiment_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicSentimentModel() # TopicSentimentModel |

try:
    api_response = api_instance.post_topic_sentiment_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_sentiment_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "doc_ids": ["str"],
            "doc_sentiments": ["str"],
            "doc_topic_exposures": ["str"],
            "keywords": "str",
            "sentiment": "str",
            "strength": "str"
        }
    ]
}

HTTP method: POST /topics/topic_sentiment/post_topic_sentiment_api

Get topic sentiment for topics extracted from a given dataset.

Parameters

Name Type Description Notes
payload TopicSentimentModel

TopicSentimentModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

Return type:

TopicSentimentRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result list[TopicSentimentL1RespModel] [optional]

TopicSentimentL1RespModel

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength [optional]
sentiment str Topic sentiment [optional]
doc_topic_exposures list[str] [optional]
doc_sentiments list[str] [optional]
doc_ids list[str] [optional]

[Back to top] [Back to API list]

post_topic_sentiment_transfer_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicSentimentTransferModel() # TopicSentimentTransferModel |

try:
    api_response = api_instance.post_topic_sentiment_transfer_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_sentiment_transfer_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "doc_ids_t1": ["str"],
            "doc_sentiments_t1": ["str"],
            "doc_topic_exposures_t1": ["str"],
            "keywords": "str",
            "sentiment": "str",
            "strength": "str"
        }
    ]
}

HTTP method: POST /topics/topic_sentiment_transfer/post_topic_sentiment_transfer_api

Get sentiment exposures of documents in a validation dataset to topics extracted from a reference dataset.

Parameters

Name Type Description Notes
payload TopicSentimentTransferModel

TopicSentimentTransferModel

Name Type Description Notes
dataset0 str Name of reference dataset, on which topics are extracted.
dataset1 str Alternative 1: Name of validation dataset, on which topics are applied. Only pass in this argument if the validation dataset has been separately created. [optional]
fixed_topics object JSON object specifying the topics that are exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]} [optional]
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Alternative 2: Start date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_0_end str Alternative 2: End date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_start str Alternative 2: Start date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_end str Alternative 2: End date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

Return type:

TopicSentimentTransferRespModel

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result list[NestedTopicSentimentTransferModel] [optional]

NestedTopicSentimentTransferModel

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength on the validation dataset [optional]
sentiment str Topic sentiment on the validation dataset [optional]
doc_topic_exposures_t1 list[str] [optional]
doc_sentiments_t1 list[str] [optional]
doc_ids_t1 list[str] [optional]

[Back to top] [Back to API list]

post_topic_summary_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicSummaryModel() # TopicSummaryModel |

try:
    api_response = api_instance.post_topic_summary_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_summary_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": [
        {
            "keywords": "str",
            "summary": [
                {
                    "attribute": "object",
                    "sentences": "str",
                    "sourceid": "str",
                    "title": "str"
                }
            ]
        }
    ]
}

Http method: POST /topics/topic_summary/post_topic_summary_api

Get summaries of topics that have been extracted from a dataset.

Parameters

Name Type Description Notes
payload TopicSummaryModel

TopicSummaryModel

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
summary_length int The maximum number of bullet points a user wants to see in each topic summary. [optional]
context_amount int The number of sentences surrounding key summary sentences in the documents that they come from. [optional]
num_docs int The maximum number of key documents to use for summarization. [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]
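As a sketch, the fields above map onto a request payload like the following plain dict; the dataset name and all values are illustrative placeholders, and only dataset is required:

```python
# Illustrative TopicSummaryModel payload; only "dataset" is required.
payload = {
    "dataset": "my_news_dataset",                 # required
    "query": "(inflation OR rates) AND policy",   # MySQL MATCH boolean format
    "custom_stop_words": ["reuters", "said"],
    "num_topics": 8,
    "num_keywords": 8,
    "summary_length": 6,    # max bullet points per topic summary
    "context_amount": 0,    # surrounding sentences kept per key sentence
    "period_start": "2018-01-01",
    "period_end": "2018-12-31",
}
```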

Return type:

TopicSummaryRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[TopicSummaryL1RespModel] [optional]

TopicSummaryL1RespModel

Name Type Description Notes
keywords str Topic keywords [optional]
summary list[TopicSummaryL2RespModel] [optional]

[Back to top] [Back to API list]

post_topic_transfer_api

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.TopicTransferModel() # TopicTransferModel |

try:
    api_response = api_instance.post_topic_transfer_api(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_topic_transfer_api: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": {
        "doc_ids_t1": ["str"],
        "topics": [
            {
                "doc_topic_exposures_t1": ["str"],
                "keywords": "str",
                "keywords_weight": ["str"],
                "strength": "str"
            }
        ]
    },
    "status": {
        "code": "int",
        "message": "str"
    }
}

Http method: POST /topics/topic_transfer/post_topic_transfer_api

Get exposures of documents in a validation dataset to topics extracted from a reference dataset.

Parameters

Name Type Description Notes
payload TopicTransferModel

TopicTransferModel

Name Type Description Notes
dataset0 str Name of reference dataset, on which topics are extracted.
dataset1 str Alternative 1: Name of validation dataset, on which topics are applied. Only pass in this argument if the validation dataset has been separately created. [optional]
fixed_topics object JSON object specifying the topics that are exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]} [optional]
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the reference dataset. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Alternative 2: Start date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_0_end str Alternative 2: End date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_start str Alternative 2: Start date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_end str Alternative 2: End date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]
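For example, under Alternative 2 the reference and validation sets are two time slices of dataset0, so dataset1 is omitted and the period_* fields carve out the two windows; fixed_topics follows the documented keywords/weights shape. All values below are illustrative:

```python
# Illustrative TopicTransferModel payload using "Alternative 2" time slicing.
payload = {
    "dataset0": "news_archive",                   # reference dataset
    "period_0_start": "2018-01-01 00:00:00",      # reference slice
    "period_0_end":   "2018-06-30 23:59:59",
    "period_1_start": "2018-07-01 00:00:00",      # validation slice
    "period_1_end":   "2018-12-31 23:59:59",
    # Exogenously fixed topic, in the documented {"keywords", "weights"} shape:
    "fixed_topics": {"keywords": ["tariff", "trade", "china"],
                     "weights": [0.5, 0.3, 0.2]},
    "num_topics": 10,
}
```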

Return type:

TopicTransferRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
status JobStatusRespModel [optional]
result TopicTransferL1RespModel [optional]

TopicTransferL1RespModel

Name Type Description Notes
doc_ids_t1 list[str] [optional]
topics list[TopicTransferL2RespModel] [optional]

[Back to top] [Back to API list]

Users

get_user

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# create an instance of the API class
api_instance = nucleus_api.NucleusApi()
user_email = 'user_email_example' # str | Email of the user to authenticate.
password = 'password_example' # str | Plaintext password of the user to authenticate.

try:
    api_response = api_instance.get_user(user_email, password)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->get_user: %s\n" % e)

The above command returns JSON structured like this:

{
    "api_key": "str",
    "company": "str",
    "expiry": "str",
    "first_name": "str",
    "job_id": "str",
    "last_name": "str",
    "license_id": "str",
    "license_type": "str",
    "phone": "str",
    "reg_time": "str",
    "settings": "str",
    "title": "str",
    "user_email": "str"
}

Http method: GET /users/get_user

Use this API to authenticate. If the password is correct, the user details are returned, including the user's API key.

Parameters

Name Type Description Notes
user_email str Email of the user to authenticate.
password str Plaintext password of the user to authenticate.
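The returned api_key can be wired straight into the Configuration used by every other call. A sketch with a mocked response in the documented UserModel shape (all values are placeholders):

```python
# Mocked GET /users/get_user response, matching the documented UserModel.
mock_response = {
    "user_email": "jane@example.com",
    "first_name": "Jane",
    "api_key": "abc123",
    "expiry": "2020-12-31",
}

# Equivalent of: configuration.api_key['x-api-key'] = mock_response['api_key']
api_key_header = {"x-api-key": mock_response["api_key"]}
```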

Return type:

UserModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
user_email str Email [optional]
first_name str First name [optional]
last_name str Last name [optional]
api_key str API key [optional]
phone str Phone number [optional]
company str Company [optional]
title str Title [optional]
settings str User settings [optional]
reg_time str Registration time [optional]
license_id str License ID [optional]
license_type str License type [optional]
expiry str License expiration date [optional]

[Back to top] [Back to API list]

post_user

from __future__ import print_function
import time
import nucleus_api
from nucleus_api.rest import ApiException

# Uncomment below to set up a prefix (e.g. Bearer) for the API key, if needed
# configuration.api_key_prefix['x-api-key'] = 'Bearer'

# Create an instance of the API class, using the configuration as shown in the Authentication section
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))
payload = nucleus_api.User() # User |

try:
    api_response = api_instance.post_user(payload)
    print(api_response)
except ApiException as e:
    print("Exception when calling NucleusApi->post_user: %s\n" % e)

The above command returns JSON structured like this:

{
    "job_id": "str",
    "result": ["str"]
}

Http method: POST /users/post_user

Use this API to register a new user. Email and password are required; all other fields are optional.

Parameters

Name Type Description Notes
payload User

User

Name Type Description Notes
user_email str Email of the user to register.
password str Password of the user to register.
first_name str First name of the user to register. [optional]
last_name str Last name of the user to register. [optional]
phone int Phone number (int) of the user to register. [optional]
company str Name of the Company of the user. [optional]
title str Job title. [optional]
country str Country of origin. [optional]
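A client-side sketch of the documented contract: only user_email and password are required, and any other field must be one of the optional ones listed above. build_user_payload is an illustrative helper, not part of the SDK:

```python
def build_user_payload(user_email, password, **optional):
    """Build a post_user payload; rejects fields not in the User model."""
    allowed = {"first_name", "last_name", "phone", "company", "title", "country"}
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError("unsupported fields: %s" % sorted(unknown))
    return {"user_email": user_email, "password": password, **optional}

user = build_user_payload("jane@example.com", "s3cret",
                          company="Acme", phone=15551234567)
```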

Return type:

PostUserRespModel

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[str] [optional]

[Back to top] [Back to API list]

Response Models' Schema

Appendjsonparams

Properties

Name Type Description Notes
dataset str Name of the dataset.
language str If you want to override language detection. [optional]
document Document [optional]

[Back to top] [Back to Schemas]

AppendJsonRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result JsonPropertyModel [optional]

[Back to top] [Back to Schemas]

AuthorConnection

Properties

Name Type Description Notes
dataset str Dataset name.
target_author str Name of the author to be analyzed.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]

[Back to top] [Back to Schemas]

AuthorConnectL1RespModel

Properties

Name Type Description Notes
mainstream_connections list[AuthorConnectL2RespModel] [optional]
niche_connections list[AuthorConnectL2RespModel] [optional]

[Back to top] [Back to Schemas]

AuthorConnectL2RespModel

Properties

Name Type Description Notes
keywords str Topic discussed by target author and connected authors [optional]
authors list[str] [optional]

[Back to top] [Back to Schemas]

AuthorConnectRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result AuthorConnectL1RespModel [optional]

[Back to top] [Back to Schemas]

AvailableFilingsResponseModel

Properties

Name Type Description Notes
result object The list of available tickers, filing_types and sections matching the query [optional]
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]

[Back to top] [Back to Schemas]

BulkInsertParams

Properties

Name Type Description Notes
dataset str Name of the dataset.
language str If you want to override language detection. [optional]
documents list[Document] [optional]

[Back to top] [Back to Schemas]

BulkInsertRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result JsonPropertyModel [optional]

[Back to top] [Back to Schemas]

CreateSecDatasetResponseModel

Properties

Name Type Description Notes
result object The JSON containing results [optional]
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]

[Back to top] [Back to Schemas]

CustomTrackerL1RespModel

Properties

Name Type Description Notes
query str Query analyzed. If no query was provided, then these are the hot topics. [optional]
strengths list[str] [optional]
sentiments list[str] [optional]
consensuses list[str] [optional]
time_stamps list[str] [optional]

[Back to top] [Back to Schemas]

CustomTrackerModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset per query to aggregate back into a tracker. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset per query. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
n_steps int Number of steps in the historical analysis over the requested period. Steps are sized so that each contains an equal number of documents. [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]

[Back to top] [Back to Schemas]

CustomTrackerRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[CustomTrackerL1RespModel] [optional]

[Back to top] [Back to Schemas]

DatasetInfo

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
owner_email str Must be specified to access a shared dataset owned by another user. [optional]

[Back to top] [Back to Schemas]

DatasetInfoModel

Properties

Name Type Description Notes
dataset str Dataset name [optional]
num_documents str Number of documents in the dataset [optional]
time_range list[str] [optional]
detected_language str Language of the dataset [optional]
metadata object Metadata information [optional]

[Back to top] [Back to Schemas]

DatasetInfoRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DatasetInfoModel [optional]

[Back to top] [Back to Schemas]

DatasetModel

Properties

Name Type Description Notes
name str Unique identifier of the dataset [optional]
date_modified datetime Datetime of last insertion or deletion of documents [optional]

[Back to top] [Back to Schemas]

DatasetTaggingL1RespModel

Properties

Name Type Description Notes
dataset str Dataset name [optional]
doc_ids list[str] [optional]
entities_tagged list[str] [optional]
entities_count list[str] [optional]

[Back to top] [Back to Schemas]

DatasetTagging

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query specifying an entity to tag in the dataset, with possibly several alternative namings, using mysql MATCH boolean query format. Example: "word1 OR word2 OR word3 OR word4"
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]

[Back to top] [Back to Schemas]

DatasetTaggingRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DatasetTaggingL1RespModel [optional]

[Back to top] [Back to Schemas]

Deletedatasetmodel

Properties

Name Type Description Notes
dataset str Name of the dataset.

[Back to top] [Back to Schemas]

DeleteDatasetRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result object Dataset deleted [optional]

[Back to top] [Back to Schemas]

Deletedocumentmodel

Properties

Name Type Description Notes
dataset str Name of the dataset.
doc_ids list[str] [optional]

[Back to top] [Back to Schemas]

DeleteDocumentRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result str Documents deleted from dataset [optional]

[Back to top] [Back to Schemas]

DeleteFilterModel

Properties

Name Type Description Notes
filter_id int ID of the filter to be deleted

[Back to top] [Back to Schemas]

DeleteFilterRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result object {"deleted":filter_id} [optional]

[Back to top] [Back to Schemas]

DocClassifyL1RespModel

Properties

Name Type Description Notes
detailed_results DocClassifyL2DRRespModel [optional]
perf_metrics DocClassifyL2PMRespModel [optional]

[Back to top] [Back to Schemas]

DocClassifyL2DRRespModel

Properties

Name Type Description Notes
docids list[int] [optional]
true_class list[str] [optional]
estimated_class list[str] [optional]

[Back to top] [Back to Schemas]

DocClassifyL2PMRespModel

Properties

Name Type Description Notes
accuracy float Accuracy of the classifier [optional]
recall float Recall of the classifier [optional]
precision float Precision of the classifier [optional]
f1 float F1 of the classifier [optional]
balanced_accuracy float Balanced Accuracy of the classifier [optional]

[Back to top] [Back to Schemas]
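The fields above relate in the standard way; for instance, f1 is the harmonic mean of precision and recall. A quick check with made-up metric values:

```python
# Made-up classifier metrics; f1 is the harmonic mean of precision and recall.
precision, recall = 0.8, 0.6
f1 = 2 * precision * recall / (precision + recall)
```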

DocClassifyModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
fixed_topics object JSON object specifying the contrasting topic that is exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]}
classifier_config object JSON object specifying the configuration of the classifier from the training phase, of type {"coefs": [coef_1, coef_2, coef_3], "intercept": intercept}
metadata_selection object JSON object specifying the 2 classes subject to classification: if based on content selection, {"content": "word1 word2 ... wordN"}; if based on another metadata field, {"metadata_field": ["values_class1", "values_class2"]}
custom_stop_words list[str] [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
syntax_variables bool If True, the classifier will include syntax-related variables on top of content variables [optional] [default to False]
validation_phase bool If True, the classifier assumes that the dataset provided is labeled with the 2 classes and will use that to compute accuracy/precision/recall [optional] [default to False]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

[Back to top] [Back to Schemas]

DocClassifyRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocClassifyL1RespModel [optional]

[Back to top] [Back to Schemas]

DocDisplayL1RespModel

Properties

Name Type Description Notes
title str Document title [optional]
sourceid str Document ID [optional]
content str Document content [optional]
attribute object JSON containing document metadata key:value pairs (e.g. author, time) [optional]

[Back to top] [Back to Schemas]

DocDisplay

Properties

Name Type Description Notes
dataset str Dataset name.
doc_titles list[str] [optional]
doc_ids list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset. If doc_titles or doc_ids are also provided, then this selection is ignored. Format: {"metadata_field": "selected_values"} [optional]

[Back to top] [Back to Schemas]

DocDisplayRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DocDisplayL1RespModel] [optional]

[Back to top] [Back to Schemas]

DocInfo

Properties

Name Type Description Notes
dataset str Dataset name.
doc_titles list[str] [optional]
doc_ids list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset. If doc_titles or doc_ids are also provided, then this selection is ignored. Format: {"metadata_field": "selected_values"} [optional]

[Back to top] [Back to Schemas]

DocInfoRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[NestedDocInfoModel] [optional]

[Back to top] [Back to Schemas]

DocumentContrastSummaryL1Model

Properties

Name Type Description Notes
class_1_content DocumentContrastSummaryL2Model [optional]
class_2_content DocumentContrastSummaryL2Model [optional]

[Back to top] [Back to Schemas]

DocumentContrastSummaryL2Model

Properties

Name Type Description Notes
sentences list[str] [optional]
sourceid str The ID of the document the sentence comes from [optional]

[Back to top] [Back to Schemas]

DocumentContrastSummaryModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
metadata_selection object JSON object specifying the 2 classes subject to contrasted summarization: if based on content selection, {"content": "word1 word2 ... wordN"}; if based on another metadata field, {"metadata_field": ["values_class1", "values_class2"]}
custom_stop_words list[str] [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
summary_length int The maximum number of bullet points a user wants to see in each topic summary. [optional]
context_amount int The number of sentences surrounding key summary sentences in the documents that they come from. [optional]
short_sentence_length int The sentence length (in number of words) below which a sentence is excluded from summarization. [optional]
long_sentence_length int The sentence length (in number of words) beyond which a sentence is excluded from summarization. [optional]
excluded_docs list[str] [optional]
syntax_variables bool If True, the contrasted summary will leverage syntax-related variables on top of content variables to better separate the document from the rest [optional] [default to False]
num_keywords int Number of keywords of the contrasted topic that is extracted from the dataset. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

[Back to top] [Back to Schemas]
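A sketch of what the short/long sentence bounds do: sentences whose word count falls below short_sentence_length or beyond long_sentence_length are dropped before summarization. Bounds and sentences below are illustrative:

```python
short_sentence_length, long_sentence_length = 4, 40

sentences = [
    "Too short.",                                             # 2 words: dropped
    "This sentence has enough words to survive the filter.",  # 9 words: kept
]
# Keep only sentences whose word count lies within the allowed band.
kept = [s for s in sentences
        if short_sentence_length <= len(s.split()) <= long_sentence_length]
```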

DocumentContrastSummaryRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentContrastSummaryL1Model [optional]

[Back to top] [Back to Schemas]

Document

Properties

Name Type Description Notes
time str Time of publication.
title str Title of the document.
content str Plaintext content of the document.

[Back to top] [Back to Schemas]

DocumentNewWordsL1Model

Properties

Name Type Description Notes
query list[str] [optional]
new_words list[str] [optional]
new_sents list[str] [optional]

[Back to top] [Back to Schemas]

DocumentNewWordsModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_days_back int Number of days back from the latest date in the dataset to treat as the foreground for the new-words analysis. [optional]
num_new_words int Maximum number of new words returned from the new words detection [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection for the period considered as background for the novelty analysis [optional]
period_start str Alternative 2: Start date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]

DocumentNewWordsRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DocumentNewWordsL1Model] [optional]

[Back to top] [Back to Schemas]

DocumentNoveltyL1Model

Properties

Name Type Description Notes
query list[str] [optional]
novel_docs list[str] [optional]
novelty_scores list[str] [optional]

[Back to top] [Back to Schemas]

DocumentNoveltyModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_days_back int Number of days back from the latest date in the dataset to treat as the foreground for the novelty analysis. [optional]
novelty_threshold float Novelty score above which a document is considered novel. This value should be between 0 and 1 and reflects the fraction of the document's information unexplained by the background. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection for the period considered as background for the novelty analysis [optional]
period_start str Alternative 2: Start date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period considered as background for the novelty analysis. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

[Back to top] [Back to Schemas]
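A sketch of how novelty_threshold partitions the output: documents whose score (the fraction of their information unexplained by the background, in [0, 1]) exceeds the threshold are reported as novel. The scores here are made up:

```python
novelty_threshold = 0.5

# Made-up per-document novelty scores in [0, 1].
scores = {"doc_a": 0.15, "doc_b": 0.62, "doc_c": 0.40}

# Documents above the threshold count as novel.
novel_docs = sorted(d for d, s in scores.items() if s > novelty_threshold)
```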

DocumentNoveltyRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentNoveltyL1Model [optional]

[Back to top] [Back to Schemas]

DocumentRecommendL1RespModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
recommendations list[DocumentRecommendL2RespModel] [optional]

[Back to top] [Back to Schemas]

DocumentRecommendL2RespModel

Properties

Name Type Description Notes
title str Document title [optional]
sourceid str Document ID [optional]
attribute object JSON containing document metadata key:value pairs (e.g. author, time) [optional]

[Back to top] [Back to Schemas]

DocumentRecommendModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset, for which documents are recommended. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
num_docs int Number of desired recommended docs per topic. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]
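
The time selectors above are alternatives: pass either `time_period` or an explicit `period_start`/`period_end` pair, never both. A minimal sketch of a payload builder enforcing that rule (the helper name `build_recommend_payload` is ours, not part of the SDK; the resulting dict maps onto the keyword arguments of the request model):

```python
def build_recommend_payload(dataset, query=None, time_period=None,
                            period_start=None, period_end=None, **extra):
    """Assemble a DocumentRecommendModel-style payload dict.

    Enforces that the two time-selection alternatives are not mixed.
    """
    if time_period is not None and (period_start or period_end):
        raise ValueError("use either time_period or period_start/period_end, not both")
    payload = {"dataset": dataset}
    if query is not None:
        payload["query"] = query
    if time_period is not None:
        payload["time_period"] = time_period      # Alternative 1
    if period_start is not None:
        payload["period_start"] = period_start    # Alternative 2, "YYYY-MM-DD"
    if period_end is not None:
        payload["period_end"] = period_end
    payload.update(extra)                         # e.g. num_topics, num_docs
    return payload
```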

DocumentRecommendRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DocumentRecommendL1RespModel] [optional]

[Back to top] [Back to Schemas]

DocumentSentimentL1Model

Properties

Name Type Description Notes
doc_title str Document title [optional]
sentiment str Document sentiment [optional]
sourceid str Document ID [optional]

[Back to top] [Back to Schemas]

DocumentSentimentModel

Properties

Name Type Description Notes
dataset str Dataset name.
doc_title str The title of the document to be analyzed.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
num_topics int Number of topics to be extracted from the document to estimate the document's sentiment. [optional]
num_keywords int Number of keywords per topic that is extracted from the document. [optional]
custom_stop_words list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]

[Back to top] [Back to Schemas]

DocumentSentimentRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentSentimentL1Model [optional]

[Back to top] [Back to Schemas]

DocumentSummaryL1Model

Properties

Name Type Description Notes
doc_title str Document title [optional]
summary DocumentSummaryL2Model [optional]

[Back to top] [Back to Schemas]

DocumentSummaryL2Model

Properties

Name Type Description Notes
sentences list[str] [optional]
sourceid str The ID of the document the sentence comes from [optional]

[Back to top] [Back to Schemas]

DocumentSummaryModel

Properties

Name Type Description Notes
dataset str Dataset name.
doc_title str The title of the document to be summarized.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
summary_length int The maximum number of bullet points a user wants to see in each topic summary. [optional]
context_amount int The number of sentences surrounding key summary sentences in the documents that they come from. [optional]
short_sentence_length int The sentence length (in number of words) below which a sentence is excluded from summarization. [optional]
long_sentence_length int The sentence length (in number of words) beyond which a sentence is excluded from summarization. [optional]

[Back to top] [Back to Schemas]
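
`short_sentence_length` and `long_sentence_length` bound, in words, which sentences are eligible for summarization. A sketch of that pre-filter (our illustration of the selection rule, not the server's implementation):

```python
def eligible_sentences(sentences, short_sentence_length=4, long_sentence_length=40):
    """Keep only sentences whose word count lies within the allowed band.

    Sentences shorter than short_sentence_length or longer than
    long_sentence_length are excluded from summarization.
    """
    keep = []
    for sentence in sentences:
        n_words = len(sentence.split())
        if short_sentence_length <= n_words <= long_sentence_length:
            keep.append(sentence)
    return keep
```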

DocumentSummaryRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result DocumentSummaryL1Model [optional]

[Back to top] [Back to Schemas]

EdgarAvailableFields

Properties

Name Type Description Notes
tickers object Mapping of tickers and company names (e.g. {"GOOG":"Google"}) [optional]
filing_types list[str] List of form types to be scraped (e.g. ["10K"]) [optional]
sections list[str] List of document sections to be scraped. If empty, all sections will be scraped [optional]
count int Number of available filings matching the input filter (if sections is specified, count is the number of matching sections) [optional]
date_range list[date] Available date range matching the input filter [optional]

[Back to top] [Back to Schemas]

EdgarFields

Properties

Name Type Description Notes
tickers list[str] List of tickers to be scraped (e.g. ["GOOG"]) [optional]
filing_types list[str] List of form types to be scraped (e.g. ["10K"]) [optional]
sections list[str] List of document sections to be scraped. If empty, all sections will be scraped [optional]
period_start date Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end date End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]

[Back to top] [Back to Schemas]

EdgarQuery

Properties

Name Type Description Notes
destination_dataset str Name of the new dataset where the scraped documents will be inserted.
tickers list[str] List of tickers to be scraped (e.g. ["GOOG"])
filing_types list[str] List of form types to be scraped (e.g. ["10K"])
sections list[str] List of document sections to be scraped. If empty, all sections will be scraped [optional]
period_start date Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end date End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]

[Back to top] [Back to Schemas]
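
Putting the EdgarQuery fields together, a scraping request can be sketched as a plain payload (the dataset name and ticker values are illustrative, and `build_edgar_query` is our helper, not an SDK function):

```python
def build_edgar_query(destination_dataset, tickers, filing_types,
                      sections=None, period_start=None, period_end=None):
    """Assemble an EdgarQuery payload; the three leading fields are required."""
    if not (destination_dataset and tickers and filing_types):
        raise ValueError("destination_dataset, tickers and filing_types are required")
    payload = {
        "destination_dataset": destination_dataset,
        "tickers": tickers,
        "filing_types": filing_types,
    }
    if sections:                                  # empty/omitted: scrape all sections
        payload["sections"] = sections
    if period_start:
        payload["period_start"] = period_start    # "YYYY-MM-DD"
    if period_end:
        payload["period_end"] = period_end
    return payload

query = build_edgar_query("sec_filings", ["GOOG"], ["10K"],
                          period_start="2018-01-01")
```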

ExampleJobInnerResponse

Properties

Name Type Description Notes
color str The color you chose [optional]
user_email str Email of the user who started the job [optional]

[Back to top] [Back to Schemas]

ExampleJobResponse

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result ExampleJobInnerResponse [optional]

[Back to top] [Back to Schemas]

FilePropertyModel

Properties

Name Type Description Notes
filename str Filename of the file uploaded [optional]
size int File size in bytes of the file uploaded [optional]

[Back to top] [Back to Schemas]

FilterModel

Properties

Name Type Description Notes
query str Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]

[Back to top] [Back to Schemas]
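
A saved filter combines a fulltext `query` with a `metadata_selection` object of the form {"metadata_field": "selected_values"}. A minimal sketch applying such a selection client-side to a document's metadata (purely illustrative; in practice the server evaluates filters against the dataset):

```python
def matches_metadata(doc_attributes, metadata_selection):
    """True if a document's metadata satisfies every field of the selection.

    A selection value may be a single value or a list of accepted values.
    """
    for field, selected in metadata_selection.items():
        accepted = selected if isinstance(selected, (list, tuple)) else [selected]
        if doc_attributes.get(field) not in accepted:
            return False
    return True
```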

JobRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
progress str JSON-encoded progress of the job. [optional]
result str JSON-encoded result of the job. [optional]
error str Error messages, if any. [optional]
last_update int Last time the job progress was updated (UNIX seconds since 1970) [optional]

[Back to top] [Back to Schemas]
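
Every response model above may carry a `job_id` instead of a `result` when a job runs long; `GET /jobs` is then polled until the result arrives. The loop can be sketched as follows, with `fetch_job` standing in for the SDK's job endpoint (a placeholder here; substitute the real call):

```python
import time

def poll_job(job_id, fetch_job, interval=1.0, max_polls=60):
    """Poll a long-running job until its result is ready.

    fetch_job(job_id) is expected to return a JobRespModel-like dict with
    optional "result" and "error" keys, mirroring the schema above.
    """
    for _ in range(max_polls):
        resp = fetch_job(job_id)
        if resp.get("error"):
            raise RuntimeError(resp["error"])
        if resp.get("result") is not None:
            return resp["result"]
        time.sleep(interval)
    raise TimeoutError("job %s did not finish in time" % job_id)
```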

JobStatusRespModel

Properties

Name Type Description Notes
code int Return code from API [optional]
message str Return message from API [optional]

[Back to top] [Back to Schemas]

JsonPropertyModel

Properties

Name Type Description Notes
n_documents int Number of documents in the dataset [optional]
size int Size in bytes of the JSON record [optional]

[Back to top] [Back to Schemas]

KeyAuthorsL1RespModel

Properties

Name Type Description Notes
query str Query analyzed. If no query was provided, then these are the hot topics. [optional]
top_authors list[KeyAuthorsL2RespModel] [optional]

[Back to top] [Back to Schemas]

KeyAuthorsL2RespModel

Properties

Name Type Description Notes
author_names list[str] [optional]
author_docs list[str] [optional]

[Back to top] [Back to Schemas]

KeyAuthorsModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset per query to aggregate back into the key authors analysis. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset per query. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
num_authors int Max number of key contributors that the user wants to see returned by the analysis. [optional]
num_keydocs int Max number of key contributions from key contributors that the user wants to see returned by the analysis. [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]

[Back to top] [Back to Schemas]

KeyAuthorsRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[KeyAuthorsL1RespModel] [optional]

[Back to top] [Back to Schemas]

LegacyResponseModel

Properties

Name Type Description Notes
results str The JSON stringified results [optional]
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]

[Back to top] [Back to Schemas]

ListDatasetsModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[DatasetModel] [optional]

[Back to top] [Back to Schemas]

ListFiltersModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[object] List of user filters [optional]

[Back to top] [Back to Schemas]

NestedDocInfoModel

Properties

Name Type Description Notes
title str Document title [optional]
sourceid str Document ID [optional]
attribute object JSON containing document metadata key:value pairs (e.g. author, time) [optional]

[Back to top] [Back to Schemas]

NestedTopicConsensusModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength [optional]
consensus str Topic consensus [optional]

[Back to top] [Back to Schemas]

NestedTopicConsensusTransferModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength on the validation dataset [optional]
consensus str Topic consensus on the validation dataset [optional]

[Back to top] [Back to Schemas]

NestedTopicSentimentTransferModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength on the validation dataset [optional]
sentiment str Topic sentiment on the validation dataset [optional]
doc_topic_exposures_t1 list[str] [optional]
doc_sentiments_t1 list[str] [optional]
doc_ids_t1 list[str] [optional]

[Back to top] [Back to Schemas]

PostUserRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[str] [optional]

[Back to top] [Back to Schemas]

Renamedatasetmodel

Properties

Name Type Description Notes
dataset str Old name of the dataset.
new_name str New name of the dataset.

[Back to top] [Back to Schemas]

RenameDatasetRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result object New dataset name [optional]

[Back to top] [Back to Schemas]

SaveFilterModel

Properties

Name Type Description Notes
dataset str Dataset name.
filter FilterModel [optional]

[Back to top] [Back to Schemas]

SaveFilterRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result FilterModel [optional]

[Back to top] [Back to Schemas]

SmartAlertsL1RespModel

Properties

Name Type Description Notes
query str Query analyzed. If no query was provided, then these are the hot topics. [optional]
novel_docs list[SmartAlertsL2RespModel] [optional]
new_words list[str] [optional]
new_sents list[str] [optional]

[Back to top] [Back to Schemas]

SmartAlertsL2RespModel

Properties

Name Type Description Notes
title str Document title [optional]
contrasted_summary list[str] [optional]

[Back to top] [Back to Schemas]

SmartAlertsModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset per query to aggregate back into a tracker. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset per query. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
num_days_back int Number of days from the latest date of the dataset, which are considered as foreground for the novelty analysis. [optional]
novelty_threshold float Novelty score above which a document is considered novel. Must be between 0 and 1; it reflects the share of the document's information left unexplained by the background. [optional]
num_new_words int Maximum number of new words returned from the new words detection [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

[Back to top] [Back to Schemas]
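
`novelty_threshold` marks a document as novel when the share of its information unexplained by the background exceeds the threshold. That selection step can be sketched as (our illustration of the rule, not the server's implementation):

```python
def select_novel_docs(doc_novelties, novelty_threshold=0.3):
    """Return IDs of documents whose novelty score exceeds the threshold.

    doc_novelties maps document ID -> novelty score in [0, 1].
    """
    if not 0 <= novelty_threshold <= 1:
        raise ValueError("novelty_threshold must be between 0 and 1")
    return [doc_id for doc_id, score in doc_novelties.items()
            if score > novelty_threshold]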

SmartAlertsRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[SmartAlertsL1RespModel] [optional]

[Back to top] [Back to Schemas]

TopicConsensusModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]

TopicConsensusRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[NestedTopicConsensusModel] [optional]

[Back to top] [Back to Schemas]

TopicConsensusTransferModel

Properties

Name Type Description Notes
dataset0 str Name of reference dataset, on which topics are extracted.
dataset1 str Alternative 1: Name of validation dataset, on which topics are applied. Only pass in this argument if the validation dataset has been separately created. [optional]
fixed_topics object JSON object specifying the topics that are exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]} [optional]
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Alternative 2: Start date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_0_end str Alternative 2: End date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_start str Alternative 2: Start date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_end str Alternative 2: End date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]
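
The `fixed_topics` object pins exogenously fixed topics as parallel keyword and weight lists. A small validator for that shape (our helper, not an SDK function; we assume weights, when present, must pair one-to-one with keywords):

```python
def validate_fixed_topics(fixed_topics):
    """Check a fixed_topics object of the form
    {"keywords": [...], "weights": [...]} and return it normalized.

    Weights are optional; when present they must match keywords in length.
    """
    keywords = fixed_topics.get("keywords", [])
    weights = fixed_topics.get("weights")
    if not keywords:
        raise ValueError("fixed_topics requires a non-empty 'keywords' list")
    if weights is not None and len(weights) != len(keywords):
        raise ValueError("'weights' must match 'keywords' in length")
    return {"keywords": list(keywords),
            "weights": list(weights) if weights is not None else None}
```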

TopicConsensusTransferRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[NestedTopicConsensusTransferModel] [optional]

[Back to top] [Back to Schemas]

TopicContrastL1RespModel

Properties

Name Type Description Notes
keywords list[str] [optional]
keywords_weight list[float] [optional]
perf_metrics TopicContrastL21RespModel [optional]
classifier_config TopicContrastL22RespModel [optional]

[Back to top] [Back to Schemas]

TopicContrastL21RespModel

Properties

Name Type Description Notes
accuracy float Accuracy of the contrasting topic [optional]
recall float Recall of the contrasting topic [optional]
precision float Precision of the contrasting topic [optional]
f1 float F1 of the contrasting topic [optional]
balanced_accuracy float Balanced accuracy of the contrasting topic [optional]

[Back to top] [Back to Schemas]
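
The performance metrics returned for a contrasting topic are the standard binary-classification scores. For reference, they can be computed from the confusion counts as follows (a generic sketch; the server's exact evaluation procedure is not exposed):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, F1 and balanced accuracy
    from true/false positive/negative counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0          # true-positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true-negative rate
    balanced_accuracy = (recall + specificity) / 2
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "balanced_accuracy": balanced_accuracy}
```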

TopicContrastL22RespModel

Properties

Name Type Description Notes
coef_ list[float] [optional]
intercept_ list[float] [optional]

[Back to top] [Back to Schemas]
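
`coef_` and `intercept_` follow the scikit-learn naming convention for a linear classifier: applying them to a document's feature vector is a dot product plus intercept, with the sign of the score giving the predicted class. A sketch of that decision rule (feature extraction itself happens server-side):

```python
def linear_decision(features, coef_, intercept_):
    """Decision value of a linear classifier: features . coef_ + intercept_.
    Positive values map to one class, negative to the other."""
    if len(features) != len(coef_):
        raise ValueError("feature vector and coef_ lengths differ")
    return sum(x * w for x, w in zip(features, coef_)) + intercept_[0]

def predict_class(features, coef_, intercept_):
    """Map the decision value to a binary class label."""
    return 1 if linear_decision(features, coef_, intercept_) > 0 else 0
```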

TopicContrastModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
metadata_selection object JSON object specifying the two classes subject to classification: if based on content selection, of type {"content": "word1 word2 ... wordN"}; if based on another metadata field, of type {"metadata_field": ["values_class1", "values_class2"]}
custom_stop_words list[str] [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
syntax_variables bool If True, the classifier will include syntax-related variables on top of content variables [optional] [default to False]
num_keywords int Number of keywords for the contrasted topic that is extracted from the dataset. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis, retaining only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to False]

[Back to top] [Back to Schemas]

TopicContrastRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result TopicContrastL1RespModel [optional]

[Back to top] [Back to Schemas]

TopicDeltaL1RespModel

Properties

Name Type Description Notes
doc_ids_t0 list[str] [optional]
doc_ids_t1 list[str] [optional]
topics list[TopicDeltaL2RespModel] [optional]

[Back to top] [Back to Schemas]

TopicDeltaL2RespModel

Properties

Name Type Description Notes
keywords str Start-of-period topics [optional]
keywords_weight str Weight of keywords in each topic [optional]
strength str Prevalence of each topic in the dataset [optional]
doc_topic_exposure_deltas list[str] [optional]

[Back to top] [Back to Schemas]

TopicDeltaModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Start date for the initial-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_0_end str End date for the initial-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_1_start str Start date for the final-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
period_1_end str End date for the final-period dataset. Format: "YYYY-MM-DD HH:MM:SS"
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]

TopicDeltaRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result TopicDeltaL1RespModel [optional]

[Back to top] [Back to Schemas]

TopicHistoryL1RespModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
strengths list[str] [optional]
sentiments list[str] [optional]
consensuses list[str] [optional]
time_stamps list[str] [optional]

[Back to top] [Back to Schemas]

TopicHistoryModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
n_steps int Number of steps in the historical analysis over the requested period. Steps are sized so that each contains an equal number of documents. [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
recompute_topics bool If True, this option will trigger a recomputation of the topics at each past point in time. Especially helpful if conducting historical analysis of a query. [optional] [default to False]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]
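
`n_steps` slices the requested period so that each step holds an equal number of documents, rather than equal spans of time. A sketch of that partitioning over a chronologically sorted document list (our illustration of the slicing rule):

```python
def equal_count_steps(sorted_docs, n_steps):
    """Split a chronologically sorted document list into n_steps chunks
    of (as near as possible) equal size."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    size, remainder = divmod(len(sorted_docs), n_steps)
    steps, start = [], 0
    for i in range(n_steps):
        end = start + size + (1 if i < remainder else 0)
        steps.append(sorted_docs[start:end])
        start = end
    return steps
```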

TopicHistoryRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[TopicHistoryL1RespModel] [optional]

[Back to top] [Back to Schemas]

TopicL1RespModel

Properties

Name Type Description Notes
doc_ids list[str] [optional]
topics list[TopicL2RespModel] [optional]

[Back to top] [Back to Schemas]

TopicL2RespModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
keywords_weight list[str] [optional]
strength str Prevalence of each topic in the dataset [optional]
doc_topic_exposures list[str] [optional]

[Back to top] [Back to Schemas]

TopicRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result TopicL1RespModel [optional]

[Back to top] [Back to Schemas]

TopicSentimentL1RespModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
strength str Topic strength [optional]
sentiment str Topic sentiment [optional]
doc_topic_exposures list[str] [optional]
doc_sentiments list[str] [optional]
doc_ids list[str] [optional]

[Back to top] [Back to Schemas]

TopicSentimentModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection from now [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]

TopicSentimentRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[TopicSentimentL1RespModel] [optional]

[Back to top] [Back to Schemas]

TopicSentimentTransferModel

Properties

Name Type Description Notes
dataset0 str Name of reference dataset, on which topics are extracted.
dataset1 str Alternative 1: Name of validation dataset, on which topics are applied. Only pass in this argument if the validation dataset has been separately created. [optional]
fixed_topics object JSON object specifying the topics that are exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]} [optional]
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Alternative 2: Start date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_0_end str Alternative 2: End date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_start str Alternative 2: Start date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_end str Alternative 2: End date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
custom_dict_file object Custom sentiment dictionary JSON file. [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]

TopicSentimentTransferRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, job_id is returned, GET /jobs can then be used to poll for results [optional]
result list[NestedTopicSentimentTransferModel] [optional]

[Back to top] [Back to Schemas]

Topics

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that is extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]
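
`remove_redundancies` drops quasi-duplicates: documents that share the same NLP representation even when their raw text differs. Client-side, the same idea reduces to keeping one document per representation key; in the sketch below a naive normalized-token key stands in for the actual NLP representation:

```python
def remove_quasi_duplicates(docs, represent=None):
    """Keep the first document per representation key.

    represent(doc) maps a document to its NLP representation; the default
    is a crude stand-in: the set of lowercased words.
    """
    if represent is None:
        represent = lambda text: frozenset(text.lower().split())
    seen, kept = set(), []
    for doc in docs:
        key = represent(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```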

TopicSummaryL1RespModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
summary list[TopicSummaryL2RespModel] [optional]

[Back to top] [Back to Schemas]

TopicSummaryL2RespModel

Properties

Name Type Description Notes
title str Document Title [optional]
sentences str Sentences [optional]
sourceid str Document ID [optional]
attribute object JSON containing document metadata key:value pairs (e.g. author, time) [optional]

[Back to top] [Back to Schemas]

TopicSummaryModel

Properties

Name Type Description Notes
dataset str Dataset name.
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that are extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
time_period str Alternative 1: Time period selection [optional]
period_start str Alternative 2: Start date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
period_end str Alternative 2: End date for the period to analyze within the dataset. Format: "YYYY-MM-DD" [optional]
summary_length int The maximum number of bullet points a user wants to see in each topic summary. [optional]
context_amount int The number of sentences surrounding key summary sentences in the documents that they come from. [optional]
num_docs int The maximum number of key documents to use for summarization. [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]
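The request body described by the TopicSummaryModel table can be assembled as a plain dict before being sent through the SDK. A minimal sketch, using a hypothetical helper and dataset name; per the table, `dataset` is the only required field and everything else is optional:

```python
def topic_summary_payload(dataset, **optional):
    """Build a topic-summary request body.

    `dataset` is the only required field in TopicSummaryModel;
    all other fields from the table are optional keyword arguments.
    """
    allowed = {
        "query", "custom_stop_words", "num_topics", "num_keywords",
        "metadata_selection", "time_period", "period_start", "period_end",
        "summary_length", "context_amount", "num_docs",
        "excluded_docs", "remove_redundancies",
    }
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {"dataset": dataset, **optional}

# Hypothetical dataset name; dates use the "YYYY-MM-DD" format from the table.
body = topic_summary_payload(
    "fomc-minutes",
    num_topics=8,
    summary_length=6,
    period_start="2018-01-01",
    period_end="2019-01-01",
)
```

The helper only validates field names against the schema; value types and formats are still the server's to check.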

TopicSummaryRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result list[TopicSummaryL1RespModel] [optional]

[Back to top] [Back to Schemas]
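TopicSummaryRespModel, like the other response models here, returns a job_id when a job runs long, and GET /jobs is then polled for the result. A minimal polling sketch; `poll_job` and the stubbed fetch function are illustrative helpers, not part of the SDK:

```python
import time

def poll_job(fetch_status, job_id, interval=1.0, max_tries=10):
    """Poll a long-running job until its result is ready.

    fetch_status: callable taking a job_id and returning a dict shaped like
    TopicSummaryRespModel -- no "result" while running, "result" when done.
    """
    for _ in range(max_tries):
        resp = fetch_status(job_id)
        if resp.get("result") is not None:
            return resp["result"]
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish in time")

# Stub standing in for GET /jobs: reports done on the third poll.
calls = {"n": 0}
def fake_fetch(job_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"job_id": job_id}
    return {"result": [{"keywords": "rates policy", "summary": []}]}

result = poll_job(fake_fetch, "job-123", interval=0.0)
```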

TopicTransferL1RespModel

Properties

Name Type Description Notes
doc_ids_t1 list[str] [optional]
topics list[TopicTransferL2RespModel] [optional]

[Back to top] [Back to Schemas]

TopicTransferL2RespModel

Properties

Name Type Description Notes
keywords str Topic keywords [optional]
keywords_weight list[str] [optional]
strength str Prevalence of each topic in the validation dataset [optional]
doc_topic_exposures_t1 list[str] [optional]

[Back to top] [Back to Schemas]

TopicTransferModel

Properties

Name Type Description Notes
dataset0 str Name of reference dataset, on which topics are extracted.
dataset1 str Alternative 1: Name of validation dataset, on which topics are applied. Only pass in this argument if the validation dataset has been separately created. [optional]
fixed_topics object JSON object specifying the topics that are exogenously fixed, of type {"keywords": ["keyword_1", "keyword_2", "keyword_3"], "weights": [weight_1, weight_2, weight_3]} [optional]
query str Dataset-language-specific fulltext query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" [optional]
custom_stop_words list[str] [optional]
num_topics int Number of topics to be extracted from the dataset and summarized. [optional]
num_keywords int Number of keywords per topic that are extracted from the dataset. [optional]
metadata_selection object JSON object specifying metadata-based queries on the dataset, of type {"metadata_field": "selected_values"} [optional]
period_0_start str Alternative 2: Start date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_0_end str Alternative 2: End date for the reference dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_start str Alternative 2: Start date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
period_1_end str Alternative 2: End date for the validation dataset. Use this approach if reference and validation datasets are different time slices of a superset. Format: "YYYY-MM-DD HH:MM:SS" [optional]
excluded_docs list[str] [optional]
remove_redundancies bool If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. [optional] [default to True]

[Back to top] [Back to Schemas]
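TopicTransferModel names the validation data in one of two ways: pass a separately created dataset1 (Alternative 1), or slice a single superset with the period_0_*/period_1_* dates (Alternative 2). A sketch of both request bodies, with hypothetical dataset names; the fixed_topics value follows the keywords/weights shape given in the table:

```python
# Alternative 1: reference dataset plus a separately created validation dataset.
transfer_alt1 = {
    "dataset0": "fomc-minutes-2018",   # hypothetical dataset names
    "dataset1": "fomc-minutes-2019",
    "fixed_topics": {
        "keywords": ["inflation", "employment", "rates"],
        "weights": [0.5, 0.3, 0.2],
    },
}

# Alternative 2: one superset dataset, sliced into two time periods.
transfer_alt2 = {
    "dataset0": "fomc-minutes",
    "period_0_start": "2018-01-01 00:00:00",
    "period_0_end": "2018-06-30 23:59:59",
    "period_1_start": "2018-07-01 00:00:00",
    "period_1_end": "2018-12-31 23:59:59",
}
```

With Alternative 2, dataset1 is omitted entirely; the two period windows should not need to overlap.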

TopicTransferRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
status JobStatusRespModel [optional]
result TopicTransferL1RespModel [optional]

[Back to top] [Back to Schemas]

UploadFileRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result FilePropertyModel [optional]

[Back to top] [Back to Schemas]

UploadURLModel

Properties

Name Type Description Notes
dataset str Name of the dataset where the file will be inserted.
file_url str Public URL pointing to the file (pdf/txt/docx...)
filename str Specify the filename if different from the URL (Nucleus guesses the file type from the extension at the end of the URL or file name) [optional]
metadata object JSON containing document metadata key:value pairs (e.g., author, time) [optional]

[Back to top] [Back to Schemas]
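Because Nucleus guesses the file type from the extension at the end of the URL or filename, it is worth checking that one of the two carries an extension before posting an UploadURLModel body. A sketch with a hypothetical dataset name; the example URL is illustrative and points at the quarles20181109a.pdf file listed under Dataset Resources:

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def upload_url_payload(dataset, file_url, filename=None, metadata=None):
    """Build an UploadURLModel body, requiring an extension on the URL
    or an explicit `filename` so the file type can be inferred."""
    ext = PurePosixPath(urlparse(file_url).path).suffix
    if filename is None and not ext:
        raise ValueError("URL has no extension; supply `filename` explicitly")
    body = {"dataset": dataset, "file_url": file_url}
    if filename:
        body["filename"] = filename
    if metadata:
        body["metadata"] = metadata
    return body

body = upload_url_payload(
    "my-speeches",  # hypothetical dataset
    "https://www.federalreserve.gov/newsevents/speech/files/quarles20181109a.pdf",
    metadata={"author": "Randal Quarles", "time": "2018-11-09"},
)
```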

UploadUrlRespModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
result UrlPropertyModel [optional]

[Back to top] [Back to Schemas]

UrlPropertyModel

Properties

Name Type Description Notes
file_url str URL to the file uploaded [optional]
size int Size in bytes of the uploaded file [optional]

[Back to top] [Back to Schemas]

User

Properties

Name Type Description Notes
user_email str Email of the user to register.
password str Password of the user to register.
first_name str First name of the user to register. [optional]
last_name str Last name of the user to register. [optional]
phone int Phone number (int) of the user to register. [optional]
company str Name of the Company of the user. [optional]
title str Job title. [optional]
country str Country of origin. [optional]

[Back to top] [Back to Schemas]
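A registration body matching the User table; all values are illustrative. Note that the schema types phone as an int rather than a string, and that only user_email and password are required:

```python
registration = {
    "user_email": "jane.doe@example.com",   # required
    "password": "s3cure-example",           # required, illustrative only
    "first_name": "Jane",
    "last_name": "Doe",
    "phone": 14155550100,                   # int per the User schema
    "company": "Example Corp",
    "title": "Data Scientist",
    "country": "USA",
}
missing = [k for k in ("user_email", "password") if k not in registration]
```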

UserModel

Properties

Name Type Description Notes
job_id str If the job is taking too long, a job_id is returned; GET /jobs can then be used to poll for results [optional]
user_email str Email [optional]
first_name str First name [optional]
last_name str Last name [optional]
api_key str API key [optional]
phone str Phone number [optional]
company str Company [optional]
title str Title [optional]
settings str User settings [optional]
reg_time str Registration time [optional]
license_id str License ID [optional]
license_type str License type [optional]
expiry str License expiration date [optional]

[Back to top] [Back to Schemas]

Dataset Resources

The following datasets are provided for use in the tutorials.

custom-sentiment-dict.json

fomc-minutes.zip

quarles20181109a.pdf

trump-tweets-100.csv