
# Trying Out Deepsparse

I’ve been keeping an eye out for language models that can run locally so that I can use them on personal data sets for tasks like summarization and knowledge retrieval without sending all my data up to someone else’s cloud. Anthony sent me a link to a Twitter thread about a product called DeepSparse by Neural Magic that claims to offer

> [a]n inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application

## Experimenting

Neural Magic provides a Docker container with a few options for interacting with their inference runtime. You can start it up like this:

```shell
docker run -it -p 5543:5543 ghcr.io/neuralmagic/deepsparse:1.4.0-debian11 \
    deepsparse.server \
    --task question_answering \
    --model_path "zoo:nlp/question_answering/distilbert-none/pytorch/huggingface/squad/base-none"
```


This starts a model server for the specific task and model passed in the CLI args. Once started, you can play with the available API at http://localhost:5543/docs.
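As a sketch of what talking to that server might look like (the exact endpoint and field names are whatever the generated docs at `/docs` describe for this task — the `/predict` path and the payload shape below are my assumptions):

```python
import json

# Hypothetical request payload for the question_answering task;
# the field names mirror what the pipeline takes in Python.
payload = {
    "question": "What is SVB?",
    "context": "...",  # the pasted article text
}

# With the server running, something like this should work:
# import requests
# response = requests.post("http://localhost:5543/predict", json=payload)
# print(response.json())

print(json.dumps(payload))
```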

You can also start the container and drop into a Python REPL with the deepsparse libraries installed. This approach proved a bit easier for testing things out:

```shell
docker run -it ghcr.io/neuralmagic/deepsparse:1.4.0-debian11
```


Neural Magic has some docs on Hugging Face about building a question and answer pipeline, so I figured I would try that out on a recent Matt Levine article about Silicon Valley Bank.

```python
from deepsparse import Pipeline

pipeline = Pipeline.create(
    task="question_answering",
    model_path="zoo:nlp/question_answering/distilbert-none/pytorch/huggingface/squad/base-none",
)

# the pasted contents of the article: https://www.bloomberg.com/opinion/articles/2023-03-10/startup-bank-had-a-startup-bank-run
context = "..."

# I plugged each of my questions below in here
question = "..."

inference = pipeline(question=question, context=context)

# the resulting QuestionAnsweringOutput
print(inference)
```


Here are some of the questions I asked, the model’s answers, and my commentary on the responses.

```python
question = "What did SVB's customers keep doing?"
```



Sort of, yeah.

```python
question = "How much of the deposits are insured?"
```



Wrong, this is the total amount of deposits.

```python
question = "What is SVB?"
```



Yes, the article does say that. I think you could probably argue this is the best answer given the context of the article.

```python
question = "Who are SVB's customers?"
```



Funnily enough, yes, but more specifically, founders of companies funded with venture capital investors’ money.

```python
question = "When did SVB collapse?"
```



Right.

```python
question = "Who could buy SVB?"
```



Probably not? I don’t think the article suggests this might happen at least.

## Takeaways

To start, the software performs as advertised. It is an inference engine I can download and run on my local CPU, and it produces responses of reasonable accuracy. It’s possible, having used GPT-3.5 and ChatGPT extensively, that my expectations were too high for running models on such underpowered hardware. This model, after all, does a pretty good job with the context it was given (the article text), and it appears only to extract spans of the article’s prose in response to a question – it doesn’t seem to have generative capabilities.

Having limited model experience before beginning to play with the OpenAI playground, I assumed models would have baked-in training context, but in retrospect it seems somewhat obvious that this may not always be the case. Models need to be trained on your datasets to be useful for your application. If I wanted “better” answers about what recently happened to SVB, I would probably need more training context, a different model or both.
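That span-extraction behavior shows up in the output itself: the result carries `start` and `end` offsets, and as far as I can tell the answer is literally a slice of the context rather than newly generated text. A toy illustration of that relationship (the context and offsets here are made up, not real model output):

```python
# An extractive QA model selects a span of the provided context;
# the answer is context[start:end], never text from outside it.
context = "SVB collapsed on Friday after a run on its deposits."

# Hypothetical values, shaped like the score/answer/start/end
# fields deepsparse prints for a question-answering result.
start, end = 0, 3
answer = context[start:end]

print(answer)  # SVB
assert answer in context  # the model never leaves the source text
```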

This prompted me to research other types of models and training approaches a bit more.

ChatGPT told me the following regarding the `model_path` I pass to deepsparse:

- `zoo`: This refers to the zoo of pre-trained models available in the Hugging Face Model Hub. It is the base URL for accessing pre-trained models.
- `nlp`: This refers to the Natural Language Processing domain.
- `question_answering`: This indicates the specific task for which the pre-trained model has been fine-tuned, i.e., answering questions given a context.
- `distilbert-none`: This is the architecture of the pre-trained model. In this case, it is a distilled version of the BERT model, which has fewer parameters but achieves similar performance.
- `pytorch`: This indicates the deep learning framework used to train and load the pre-trained model.
- `huggingface`: This indicates the library that provides access to the pre-trained model.
- `squad`: This indicates the specific dataset on which the pre-trained model has been fine-tuned. In this case, it is the Stanford Question Answering Dataset (SQuAD).
- `base-none`: This refers to the specific version of the pre-trained model. In this case, it is the base version with no further modifications or enhancements.

That helps a bit. There are a lot of options that I need to understand better.
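To keep those components straight, the stub can be split apart mechanically. This is just an illustration of the naming scheme, not an official parser from the deepsparse library:

```python
# Split a zoo-style model stub into its named components.
stub = "zoo:nlp/question_answering/distilbert-none/pytorch/huggingface/squad/base-none"

scheme, path = stub.split(":", 1)
domain, task, architecture, framework, repo, dataset, version = path.split("/")

print(scheme)        # zoo
print(task)          # question_answering
print(architecture)  # distilbert-none
print(dataset)       # squad
```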

Let’s try the same set of questions with a different model just for fun:

```python
from deepsparse import Pipeline

# a different model from the zoo (path elided)
pipeline = Pipeline.create(task="question_answering", model_path="...")

questions = [
    "What did SVB's customers keep doing?",
    "How much of the deposits are insured?",
    "What is SVB?",
    "Who are SVB's customers?",
    "When did SVB collapse?",
]

# pasted article
context = "..."

for question in questions:
    inference = pipeline(question=question, context=context)
    print(f"{question}\n{inference}\n")
```


Here are the results:

What did SVB's customers keep doing?

How much of the deposits are insured?

What is SVB?
score=15.948863983154297 answer='a liability-sensitive outlier in a generally asset-sensitive world' start=6262 end=6328

Who are SVB's customers?