© Copyright 2021
Rafal Kocielnik

Designing Engaging Conversational Interactions for Health & Behavior Change

Rafal Kocielnik

A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

University of Washington
2021

Reading Committee:
Gary Hsieh, Chair
Daniel Avrahami
James Fogarty
Hannaneh Hajishirzi

Program Authorized to Offer Degree: Human Centered Design & Engineering

University of Washington

Abstract

Designing Engaging Conversational Interactions for Health & Behavior Change

Rafal Kocielnik

Chair of the Supervisory Committee:
Associate Professor Gary Hsieh
Human Centered Design and Engineering

The recent popularity of chat- and voice-based conversational interactions, fueled by advances in natural language processing (NLP), has opened up opportunities for re-imagining user interactions in health & behavior change as conversational experiences. Prior work has indicated that a well-designed conversational approach can be more engaging, motivating, natural, personal, and understandable. It can also mimic the properties of some of the most successful human-led interventions, such as coaching and motivational interviewing. However, designing conversational interactions poses numerous challenges. Efficiently creating conversational content that is diverse, relevant to the context, and natural-sounding is difficult. Furthermore, balancing still-limited AI capabilities with user expectations requires careful problem scoping and other design considerations. Finally, the mechanisms by which a successful conversational interaction can help improve user engagement are still not well explored. In this dissertation I propose four different conversational systems that address some of the fundamental health & behavior change challenges.
In Chapter 3, to address the intrinsic challenge of user boredom and engagement loss with repeated interactions, I propose a conversational system with value-based conversation topic personalization and diversification. In Chapter 4, to address the challenge of engaging users in mindful self-learning from their behavioral data, I propose conversational systems supporting structured reflection on physical activity and on professional development at work. In Chapter 5, to support health data collection, especially to improve user comfort with sensitive topics and understandability among low-literacy populations, I propose a system for conversational survey administration. Finally, in Chapter 6, to lower the effort involved in designing good-quality conversational systems, I propose a tool for automated conversion of form-based surveys to a more engaging conversational format.

My work identifies and provides evidence for several benefits of the use of conversational interactions in health & behavior change. Among others, I demonstrate the benefits of increased engagement in interaction, improved motivation for performing activities, accessibility benefits related to familiarity, ease of use, comfort with sharing, and an ability to guide users through the behavior change process via dialogue. I also identify several important challenges: perceptions of artificiality; managing high expectations of contextual knowledge and social intelligence; and lower efficiency that could negatively affect the experience for some user groups. I further investigate the concrete links between conversational design elements and these benefits and challenges. My thesis demonstrates various design processes and automation techniques that can lower the effort of designing conversational experiences. As technology progresses, conversational interactions can offer valuable support complementing existing automated tracking and the efforts of human health coaches.
My work offers an important contribution to our understanding of how conversational interactions can play such a beneficial role.

TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction
1.1 The Promise of Conversational Agents
1.2 Engagement Challenges of Current Technology for Behavior Change
1.3 Value of Conversational Approach in Behavior Change
1.4 Challenges of Conversational Agent Design for Behavior Change
1.5 Thesis Statement
1.6 Research Overview
1.7 Summary of Key Contributions
1.8 Summary of Key Findings

Chapter 2: Background
2.1 Current Technology Support for Behavior Change
2.1.1 Challenges of Sustaining Actions over Time
2.1.2 Challenges of Reflection and Learning from Data
2.1.3 Challenges of Data Collection
2.2 Potential Value of Conversational Approach

Chapter 3: Conversational Activity Promotion: Designing Diversified and Tailored Prompts
3.1 Background
3.1.1 Conversational Inspirations
3.1.2 Design of Motivational Triggers
3.2 Design Approach
3.2.1 Target-diverse Strategy
3.2.2 Self-diverse Strategy
3.2.3 Generating Messages
3.3 User Study
3.3.1 Study 1 - Controlled Lab Experiment
3.3.2 Study 2 - Controlled Field Deployment
3.4 Results
3.4.1 Quantitative Results
3.4.2 Qualitative Results
3.5 Discussion
3.5.1 Conversational Message Triggers Design
3.5.2 Diverse Message Generation Process
3.6 Summary of Contribution

Chapter 4: Conversational Reflection: Designing for Physical Activity & Workplace Productivity
4.1 Physical Activity Setting
4.1.1 Background
4.1.2 Design Approach
4.1.3 User Study
4.1.4 Results
4.1.5 Discussion
4.1.6 Conclusion
4.2 Workplace Productivity Setting
4.2.1 Background
4.2.2 Design Approach
4.2.3 User Study
4.2.4 Results
4.2.5 Discussion
4.2.6 Conclusion
4.3 Discussion on Supporting Conversational Reflection in Both Settings
4.3.1 Comparison of Impact
4.3.2 Comparison of Conversational Reflection Design Approaches
4.3.3 Impact of Modality & Interaction Channel
4.3.4 Private vs. Semi-public Space
4.4 Summary of Contribution

Chapter 5: Conversational Data Collection: Designing Health & Social Needs Conversational Survey
5.1 Background
5.1.1 Challenges of Collecting Data From Vulnerable Populations
5.1.2 Technology-based Data Collection vs. Human Interviewing
5.1.3 Potential of Conversational Approach
5.2 Design Approach
5.2.1 Design Process
5.2.2 User Interface & Response Options
5.2.3 Persona
5.2.4 Dialogue-Based Interaction
5.3 User Study
5.4 Results
5.4.1 Preferences
5.4.2 Time to Completion
5.4.3 Equivalence of Responses
5.4.4 Reasons for Response Discrepancies
5.4.5 Workload (NASA TLX)
5.4.6 Engagement, Understandability, and Comfort with Sharing
5.4.7 Interview Feedback
5.5 Discussion
5.5.1 Positive Design Aspects
5.5.2 Challenging Design Aspects
5.5.3 Future Design Directions
5.6 Summary of Contribution

Chapter 6: Automating the Design of Engaging Conversational Data Collection
6.1 Background
6.1.1 Engagement Benefits of Conversational Survey Administration
6.1.2 Linguistic Elements of Engaging Conversational Design
6.2 Making Survey Conversational - Design & Automation
6.2.1 General Design Principles
6.2.2 Building a Repository of Augmentation Phrases
6.2.3 Design of Augmentation Tasks
6.2.4 Automation of Augmentation Tasks
6.3 Evaluation
6.3.1 ML Performance
6.3.2 Correction Effort
6.3.3 User Study
6.3.4 Measures
6.3.5 Analysis
6.4 Results
6.4.1 ML Performance
6.4.2 Correction Effort
6.4.3 User Study: Quantitative Results
6.4.4 User Study: Qualitative Feedback
6.5 Discussion
6.5.1 Design Definition Improvement Opportunities
6.5.2 Correction Effort Reduction
6.5.3 Automation Performance and Capability Improvements
6.5.4 Intrinsic Challenges of Conversational Survey Adaptation
6.5.5 Additional Augmentation Tasks; Prototyping & Tailoring Support
6.6 Summary of Contribution

Chapter 7: Discussion
7.1 Benefits of Conversational Design in Health & Behavior Change
7.1.1 Engagement in Interaction
7.1.2 Motivation to Perform Activity
7.1.3 Accessibility: Familiarity & Understandability
7.1.4 Comfort & Sharing
7.1.5 Guidance
7.2 Challenges of Explored Conversational Design for Health & Behavior Change
7.2.1 Efficiency
7.2.2 Artificial Feel
7.2.3 High Expectations, Contextual & Social Intelligence
7.2.4 Effort of Creating Engaging Content

Chapter 8: Limitations
Chapter 9: Future Work
Chapter 10: Conclusion

Bibliography

Appendix A: Examples of reactions matched to question-answer context
Appendix B: Phrasing categories used for question rephrasing in automation
Appendix C: Hold-out surveys used in the user study evaluation
Appendix D: Surveys used for ML development
Appendix E: ML performance on the full dataset
Appendix F: Manual question corrections in Chapter 6
Appendix G: Manual reaction corrections in Chapter 6

LIST OF FIGURES

3.1 A non-diverse (baseline) and two diverse strategies as depicted in the cognitive space: A) Non-diverse - messages connect self and exercising, B) Target-diverse - messages connect concepts cognitively close to the target (e.g. different types of exercising) and self, C) Self-diverse - messages connect concepts cognitively close to self (e.g. motivations) with exercising.
3.2 The exercise completion webpage used in the study and an example conversational SMS prompt delivered on a mobile phone.
3.3 Average self-reported exercise completion per study day. Big drops represent weekends.
4.1 Reflection depicted as a process with stages of levels synthesized based on multiple structured reflection models.
4.2 Example of actual user exchanges with our system's mini-dialogues on the left. On the right, a block diagram of an example dynamic mini-dialogue with: actual user replies, user intents recognized based on free-text replies, and the system-tailored follow-ups. The red boxes represent a path where the user reply was not recognized and has been handled by a "generic" (non-tailored) follow-up.
4.3 Response rates to initial and follow-up questions, and average response length in characters, for the 14 days of the core study.
4.4 System architecture of the Robota conversational agent. A common backend supports chat interaction as a Slack bot and voice interaction as a custom Amazon Alexa Skill using an Amazon Dash Wand.
4.5 An example of interaction with Robota using the chat module, in this case a mid-day journaling prompt.
5.1 HarborBot GUI elements. On the left, "Question response types" shows the different types of responses available to users. On the right, "Control buttons" shows the 4 controls associated with each question. "Other elements" shows the HarborBot icon and an ellipsis icon HarborBot used for mimicking writing by a person in chat interaction.
6.1 Distribution of 6 question phrasing categories in general and across the 16 development surveys.
6.2 Distribution of 3 empathy question framing categories in general and across the 16 development surveys.
6.3 Distribution of 3 empathy answer framing categories in general (this includes the 138 answer examples extracted from common Likert scales [228]) and across the 16 development surveys.
6.4 Distribution of labels in the 6-survey hold-out dataset. From the left: A) Distribution of question phrasing classes among surveys, B) Distribution of Empathy Question Framing classes, C) Distribution of Empathy Answer Framing classes.
6.5 The 3-step evaluation process: 1. ML performance - evaluation via accuracy and F1 score in leave-one-out and 5-fold cross-validation setups. 2. Correction effort - manual editor effort needed to correct basic issues (e.g., grammatical errors). 3. User study - impact of the adapted conversational surveys on engagement, usability, and the quality of the conversational elements.
6.6 Last page of the AMT user study asking for feedback on particular conversational augmentation design elements. On the left, participants were shown the log of their exchange with red-highlighted phrases of interest. On the right, they were asked to evaluate the overall quality of the phrases as well as to give detailed free-form feedback. Pressing "Continue" would ask them to evaluate another aspect (red highlights in the conversation would change accordingly).
6.7 User-rated quality of conversational elements in the AMT study on a 5-point Likert scale.
6.8 Correcting reactions - Left: manual correction of a question misclassification results in the need to rewrite all the reactions (a cost of 66 character edits). Right: with GUI support from an editing tool, the correction could involve just re-labeling the question (a cost of 2 mouse clicks).

LIST OF TABLES

1.1 Thesis claims, research questions, and chapter organization.
1.2 Summary of Chapters and Findings.
3.1 Examples of the changing part of the conversational prompts used in the field.
3.2 Summary of the results from Study 1 - differences between the conditions based on the post-study measures.
3.3 Summary of the results from Study 2 - differences between the conditions based on the post-study measures.
3.4 Mixed-effects regression models for predicting exercise completion.
3.5 Summary of the proposed diverse message generation process based on the approaches explored in both studies.
4.1 Examples of reflective questions generated during the workshop sessions. Questions are grouped by the main prompted categories (rows) and categories identified through affinity diagramming (columns). Only the 6 most frequent categories are shown. The five white cells represent intersections for which the workshop participants generated no questions. For creating diverse and novel questions, I suggested questions for these intersections.
4.2 Summary of pre- and post-study measures. The levels from Kember's survey are mapped to the stages of reflection in the structured reflection process.
4.3 Summary of the positive/negative aspects of the system design choices based on feedback from participants.
4.4 Work activity journaling prompts for different journaling schedules (schedule selected by the user).
6.1 Introduction & Closing template examples and their instantiations for specific surveys. Phrases in between square brackets are survey-specific slots that are filled in dynamically.
6.2 Examples of original survey items and the rephrasing resulting from the augmentation process. Phrases in between square brackets have been added or modified.
6.3 Template examples of progress communication and topic switching phrases and their instantiations for specific surveys. Phrases in between square brackets are survey-specific slots that are filled in dynamically.
6.4 Summary of the setup of the different classifiers supporting the augmentation tasks. The best setups were determined in a limited parameter exploration on the development set (however, no exhaustive grid search was performed).
6.5 Meta-parameters controlling the automated conversion.
6.6 Classification performance for the 4 text classification tasks (+1 derived) used in automated conversational survey adaptation. Question Empathy Framing and Answer Empathy Framing classifications are part of the empathetic addition - the results of these two classifications taken together are used to decide on the reaction class.
6.7 Classification performance on the 6 hold-out surveys used for correction effort estimation and in the user study.
6.8 Correction effort quantified as character edits per hold-out survey. The corrections represent minimal changes to the grammar and empathetic reactions needed from a survey administrator to present the conversational survey to end users.
6.9 Mixed-effects model predicting engagement by conversational element quality rating.
6.10 Comparison of accuracy for the classical ML model used and a pre-trained deep learning model fine-tuned on the task dataset.
A.1 Examples of empathetic reactions matched to local question-answer context. Phrases in between square brackets have been added or modified.
B.1 Phrasing categories for survey questions derived empirically from survey data. Each category is composed of a prefix that is prepended to the original text of survey questions and a set of modification rules which change the text of the question to fit the 3rd-person & question form.
E.1 Classification performance for the 4 text classification tasks (+1 derived) on the full dataset of 22 surveys (the combined 16 development and 6 hold-out surveys). Question Empathy Framing and Answer Empathy Framing classifications are part of the empathetic addition - the results of these two classifications taken together are used to decide on the reaction class.

ACKNOWLEDGMENTS

There are numerous people who greatly contributed to this dissertation and to my growth as a researcher in various ways. I will try to express my gratitude to all of them, hoping I can be forgiven if I missed anybody's name.

I would first like to thank my advisor and mentor, Gary Hsieh, for his guidance, understanding, and continued support through various challenges. I will always be grateful for the numerous valuable skills I have learned from him and for his support in helping me understand the important first principles of research. I also want to thank him for allowing me to entertain my curiosity and ideas in other research areas, even if they sometimes expanded beyond his numerous areas of expertise.
I am very thankful for his patience and tolerance of my, sometimes, slow progress and for never giving up on me.

I would also like to thank all my dissertation committee members for their support and for sharing their expertise with me. I would like to thank Daniel Avrahami for helping me think about the implications of my research for practice beyond academia and for our collaborations on several exciting projects. I would like to thank Hannaneh Hajishirzi for sharing her expertise on modern AI, which helped me understand the important trade-offs in applying such techniques in the HCI context. I would like to thank James Fogarty for helping me understand technical HCI work and for his guidance on communicating my research in a succinct, yet precise manner. This is a very valuable skill when having to pitch my research to new audiences. Finally, I would like to thank my GSR, Dan Weld, for asking insightful and thought-provoking questions which helped me understand new facets of my work.

I would like to thank the fellow department colleagues I had an opportunity to collaborate with on various research projects, also outside of my dissertation work: Nan-Chen Chen, Ray Hong, Meg Drouhard, Jina Suh, Elena Agapie, Michael Brooks, Raina Langevin, and Amelia Wang. These collaborations allowed me to explore different areas, fueled my interests, and expanded my perspectives on broader research. I would like to especially thank Nan-Chen Chen and Ray Hong for our extended collaborations on several published projects.

I am also very grateful to all the members of the Prosocial Computing lab and my cohort: Ahmer Arif, John J Robinson, Mia Suh, Lucas Colusso, Kristin N. Dew, Arpita Bhattacharya, Christina Chung, Jenna Frens, Keri Mallari, Kerem Özcan, Kiley Sobel, Hyewon Suh, Amelia Wang, Spencer Williams, Himanshu Zade, and Andrew Neang. Thank you for numerous potlucks, happy hours, and sharing in my complaints about the hardships of graduate life.
I would like to thank the faculty at HCDE who were always very helpful in supporting me and in providing the resources I needed: Jennifer Turns, Sean Munson, Julie Kientz, Cecilia Aragon, and David McDonald. Jennifer Turns especially provided invaluable guidance and resources in helping me understand the important topic of reflection. I would also like to thank my numerous external faculty and industry research collaborators who helped with various projects in and outside of my thesis: Andrea Hartzler, Jonathan Morgan, Dario Taraborelli, Dennis Hsieh, and Herbert Duber. I also appreciate all the rich perspectives and guidance from my various internship mentors: Andrés Monroy-Hernández, Justin Cranshaw, Daniel Avrahami, Saleema Amershi, Jonathan Bragg, and Doug Downey.

Outside of research, I had great experiences learning how to guide students and support teaching during my long-term collaboration with Andy Davidson on teaching HCDE 539. I look fondly at all the fun physical computing projects we were able to foster among our students and our numerous conversations on how to improve the course further. I am grateful for the invaluable knowledge about the teaching process I gained through my work with Andy. I am also really thankful for his understanding of my grading delays during the final quarter when I was busy wrapping up this dissertation. I also want to thank my other teaching collaborators: Brock Craft and Rafael Silva.

Last, but not least, I am extremely grateful to my family for their support: my Mom for always being there to listen to my challenges and support me with good advice, my grandparents for their perpetual concern about my eating habits, and my girlfriend Ada for always trying to align our busy research schedules to find some time to spend together.
DEDICATION

to my Mom for her continued support

Chapter 1

INTRODUCTION

1.1 The Promise of Conversational Agents

Conversational interaction, once a sci-fi dream, is now becoming increasingly common in everyday use of various computer systems. Voice Assistants (VAs) such as Alexa, Siri, and Google Home provide voice-based interactions for accessing news [69], providing weather information [217], supporting scheduling [56], and controlling devices at home [137]. Similarly, text-based service chatbots support vacation planning, financial advice, and pizza ordering [112]. The principal value of conversational interaction in most of these scenarios is efficiency and claimed convenience for the user (e.g. hands-free interaction for voice [162], reusing a familiar messaging interface for chatbots [127]). Another, somewhat less commercially explored, benefit of conversational interaction lies in its potential for improving user engagement. User engagement is important in both task-oriented and non-task-oriented applications [112]. Primary examples of such, arguably less task-oriented, applications are social chatbots, with XiaoIce [249], Mitsuku [183], and Replika.ai [171] being the most successful. The engagement benefits of conversational interaction have been commonly attributed to the human-mimicking aspects of such interfaces [41].

1.2 Engagement Challenges of Current Technology for Behavior Change

In the domains of health & behavior change, keeping users engaged, particularly over a longer period of time and across multiple interactions, has always been a challenge [7, 30, 182]. Technology has offered valuable support in automating many aspects of behavior change, but has been arguably less successful in improving user engagement [93, 107, 117].
The typical user journey in personal informatics (a technology-based approach for supporting behavior change) involves, among others: 1) execution of actions that would help the user change behavior, 2) reasoning based on past behavior and collected data, and 3) collection of data about one's behavior [73]. Each of these is associated with challenges. Wearable sensor-based technologies (e.g., Fitbit, Apple Watch) aid users in the automated collection of behavior data and can be seen as one of the greatest benefits of the use of technology [117]. Yet certain data can't be measured easily and needs to be collected through self-reports [52]. Surveys are the primary method of collecting such data, but users often find them tedious and resort to various response satisficing behaviors [104, 199]. Particularly in health & medical domains, where sensitive data and vulnerable populations may be involved, this can lead to trust, engagement, and understandability issues, which lower response quality and rates [96]. Even if the data can be collected automatically, a crucial purpose of this data is to support reasoning and reflection on it [23]. For this purpose, technology currently supports users predominantly with dashboards and graphs. While useful, these often encourage only surface-level habitual "glancing" at the data, without engaging users in deeper reflection on the meaning, interpretation, and mechanisms behind the information [22]. This is, in part, why human health coaches can be more effective in keeping users engaged [139]. Finally, sustaining behavior change actions via technology often relies on repeated reminding strategies [173] and planning support [4]. Yet commonly used automated reminders are repetitive and monotonous; over time they tend to cause boredom and are often eventually ignored by users [35, 101, 223].
1.3 Value of Conversational Approach in Behavior Change

Conversational approaches, with their ability to mimic some human-human interaction aspects, have the potential to improve on a number of these challenges. Human-administered surveys, especially with vulnerable populations, have been reported to lead to higher quality responses, yet are more costly, less scalable, and not always possible [104]. Conversational interfaces have the ability to combine the best of both worlds by limiting costs through automation while providing some of the valuable human-human interaction aspects. Similarly, better learning from automatically collected behavior data can be aided with well-crafted reflection dialogues around such data. Such dialogues could support user reflection and self-learning in a manner similar to what human health coaches do [198]. Finally, the repetitiveness and artificiality of automated activity reminders can be reduced by mimicking some aspects present in human use of messaging platforms, such as rich diversification of topics and phrasing, as well as tailoring based on an understanding of the conversational partner's interests and values. Such aspects come naturally in interaction with friends and professional human coaches [4].

1.4 Challenges of Conversational Agent Design for Behavior Change

However, designing conversational interactions for health & behavior change is challenging in a number of ways. Depending on the aspects being supported, the conversational agent might have a clearly defined task (e.g., collect specific health data) or be fairly open-ended (e.g., help users reflect on their own data). Usually, however, a certain mix of the two is desired. Prior work indicates that users want to be efficient in their interaction with conversational interfaces, but at the same time expect some level of socialization even in more task-oriented applications [112].
Yet, previous work also warns against so-called ‘mission creep’, where the conversational interface introduces distracting interactions for the sake of being more social [97]. Combined with the still profound limitations of AI technologies, this can easily lead to inflated expectations of intelligence that the agent can't sustain [156] and introduce the risk of interaction breakdowns [12]. On top of that, there are indications that not all users expect and enjoy the same level of socialization, even in the same context of use [147]. Furthermore, obtaining rich, diverse, and personalized content for conversational interaction is another design and technical challenge [76]. Existing datasets contain social media exchanges or tech support dialogues [5], which are not directly suitable for use in behavior change applications. Furthermore, the design process involved in collecting or generating domain-specific data from users in a form amenable for use in dialogue design is not yet well explored [49, 140, 192]. Diversification of content, to create novel and engaging interactions each time, is yet another challenge [242]. Due to all these difficulties, designing good-quality conversational interfaces can be exceptionally challenging, and existing tools fall short of supporting the numerous nuances of the whole process well [97].

1.5 Thesis Statement

My thesis claim is summarized in the following statement: Conversational interactions leveraging content diversification and tailoring can (T1) increase activity adherence, (T2) facilitate reflection, (T3) support collection of sensitive data.
The effort of designing such interactions can be (T4) lowered with automation.

1.6 Research Overview

To demonstrate the fulfillment of this statement I explored the design of conversational interfaces for supporting practical health & behavior change applications related to some of the key aspects in this domain: 1) motivating & sustaining actions (Chapter 3), 2) supporting learning from behavior via reflection (Chapter 4), and 3) collecting personal data (Chapter 5). Through leveraging conversational design to support these applications, I provide practical approaches for addressing the common challenges involved in conversational design in the health & behavior change context. I also explore the mechanisms by which conversational interaction can provide benefits to user engagement and overall experience. Finally, I distill the common best-practice design principles I discovered across these settings to propose automated support for lowering the conversational design effort (Chapter 6). Table 1.1 describes the research questions I examined, aligned with the thesis claims.

T1, increase activity adherence
• RQ1: How can a conversational approach improve repetitive activity triggers?
• RQ2: What is the impact of conversational content diversification on user boredom & adherence?
Addressed in Chapter 3, through the creation of two systematic content diversification strategies informed by cognitive space theory, and further through lab and field studies with the Fitness Challenges system.

T2, facilitate reflection
• RQ3: How to generate engaging personalized content for reflection?
• RQ4: How to leverage dialogue structure to benefit engagement?
• RQ5: What is the impact of conversational approach & interaction modality on ease and depth of reflection?
Addressed in Chapter 4, through proposing workshop-based and structured-reflection-informed content generation, further through development of the Reflection Companion and Robota systems, which inject work tasks, fitness tracker data, and health goals into the dialogues, and finally through evaluation in field studies.

T3, support collection of sensitive data
• RQ6: How to design conversational data collection to improve engagement, comfort, and understandability?
• RQ7: What is the impact of a conversational approach to data collection on vulnerable populations?
Addressed in Chapter 5, through development of the HarborBot conversational social needs screening system with empathy and understandability features, and further through evaluation with high and low health literacy patients in a hospital emergency room setting.

T4, lower the conversational design effort
• RQ8: How can designing conversational data collection be automated?
• RQ9: Which design aspects are easy and which are hard to automate?
Addressed in Chapter 6, through proposing 4 tasks for conversational survey adaptation and creation of a repository of augmentations, and further through a 3-step evaluation of automation performance, remaining correction effort, and a user study.

Table 1.1: Thesis claims, research questions, and chapter organization.

The structure of the rest of my dissertation follows below. Chapter 2 discusses the challenges of current technology support for health & behavior change and how a conversational approach could help address these challenges. I present key aspects of health & behavior change which require support according to prevalent behavior change [191] and personal informatics [73, 142] models. I discuss the specific challenges identified in prior work and related to these aspects. I then discuss the characteristics and potential benefits of conversational interfaces based on theoretical indications and past lab studies. Finally, I discuss how these potential benefits of conversational design align with the challenges of current technology support for health & behavior change.

Chapter 3 explores the potential for conversational design to motivate & sustain activity.
In this work I redesign the activity triggers for physical activity promotion to follow a conversational style. I specifically introduce natural diversity of language & topics and personalize the interaction. For topic-based content diversification I specifically propose two systematic strategies informed by cognitive space theory. I use a simplified format of conversational interaction focusing on single-turn exchanges over a period of time on a mobile device. I test the proposed strategies in a lab experiment and a controlled field study.

Chapter 4 explores the potential of a conversational approach to engage users in meaningful reflection. I design two conversational reflection agents for physical activity (Reflection Companion) & workspace productivity (Robota) settings. Both settings have been indicated in prior work as in need of support for reflection [139, 134]. In the physical activity setting, with the use of a mini-dialogue structure informed by a structured reflection theoretical model, I address the challenges of lowering the effort and deepening the reflection by splitting the challenge into smaller manageable guided reflection turns. In the workspace productivity setting I compare the use of voice and text-based chat for triggering reflection and explore the challenge of personal reflection design for a semi-public space. In both settings I also personalize and diversify the interaction to make it more engaging. I test both approaches in field studies.

Chapter 5 investigates the opportunities conversational interaction offers for increasing comfort with sharing sensitive personal data & improving understandability among low health-literacy users. I design HarborBot, a mixed-modality (voice and text) chatbot for conversational administration of a social needs screening survey in a hospital emergency department (ED) setting.
To specifically address the challenge of comfort with sharing sensitive information, I explore design for conversational empathy via social phrases, interface social cues, and empathetic reactions. Furthermore, to address the challenge of understandability among low health literacy populations (common in this setting), I design conversational question rephrasing and the use of voice-based question readout. I evaluate HarborBot in a controlled study performed in an ED setting with low and high health literacy patients, comparing conversational and form-based social needs screening approaches.

Chapter 6 explores the opportunities for the use of automation to support the design of engaging conversational interactions. I design & develop an automated process for adapting survey-based data collection to a conversational form. In this work I turn the manual design process I applied in Chapter 5 into a semi-automated process that lowers design effort and systematizes some of the engaging conversational design principles I developed in prior chapters. I perform a 3-step evaluation. I evaluate the data-driven machine-learning aspect of adaptation in leave-one-out and cross-validation setups. I also evaluate and quantify the remaining survey administrators' manual correction effort (caused by automation imperfections) and finally evaluate the impact of the generated conversational surveys in a crowd-sourced study.

Finally, Chapter 7 discusses how the conversational application projects explored in my work provide evidence in support of my thesis claim. I also specifically discuss the benefits and challenges of the conversational approach identified throughout my work. I conclude by briefly describing a few directions I plan to pursue in the future.

1.7 Summary of Key Contributions

Findings from my thesis provide a better understanding of how to design conversational experiences that address the key challenges of successful health & behavior change.
Furthermore, my findings also indicate how the design process of engaging conversational experiences in these settings can be supported with automation. The contributions of my thesis include:

Conversational activity promotion (Chapter 3)
• Design: Proposed two systematic content diversification strategies informed by cognitive space theory: target-diverse and self-diverse.
• Design Process: Proposed a crowd-sourced process for generating motivational conversational prompts which are both tailored to individuals' values and diversified.
• System Artifact: Implemented Fitness Challenges, a mobile conversational system for activity promotion.
• Understanding: Provided insights into user perception of tailored & diversified conversational prompts vs. repetitive non-conversational messages.
• Evidence: Demonstrated an ability of the self-diverse conversational prompts to significantly increase exercise completion.

Conversational reflection (Chapter 4)
• Design Process: Proposed a workshop-based process for generating diversified conversational reflection questions informed by a structured reflection theoretical model.
• Design: Proposed a 2-step mini-dialogue informed by the structured reflection model for lowering reflection effort and guiding users towards deeper reflection.
• Design: Proposed a design for reflection at work which combines a work-related benefit (i.e., journaling & reporting) via a work-based channel with a personal benefit (e.g., organization, career goals) via a dedicated separate channel.
• System Artifacts: Implemented the Reflection Companion & Robota conversational systems for physical activity & workplace productivity respectively.
• Evidence: Demonstrated the ability of conversationally supported reflection to engage users in interaction as well as meaningful reflection.
• Understanding: Provided detailed understanding of how specific conversational elements (e.g., two-step dialogue, typing & sending responses) affect user engagement and the quality of reflection.
• Understanding: Provided insights into user perception of reflection via different modalities (voice & text) and the related differences in interaction and engagement.

Conversational data collection (Chapter 5)
• System Artifact: A mixed-modality (voice and text) chatbot called HarborBot for administering a social needs screening survey in a conversational manner.
• Evidence: Demonstrated the benefits of conversational data collection (for social needs screening) with vulnerable populations.
• Understanding: Insights into how low & high literacy ED patients perceive different aspects of conversational social needs screening.

Automating conversational design for data collection (Chapter 6)
• Design: Proposed automated conversational adaptation of any survey in 4 steps: 1) addition of an introduction & closing, 2) addition of contextual empathetic reactions, 3) addition of progress communication handling, 4) adaptation of question language to a conversational style.
• Implementation Artifact: Implementation of the proposed 4-step conversion approach using ML techniques and a reusable repository of conversational augmentation phrases.
• Evidence: Demonstrated that the proposed automation approach can produce engaging conversational surveys (comparable to manual design) with only limited additional manual correction effort for grammar and misclassifications.
• Understanding: Provided insight into what it means to make survey-based data collection conversational & identified the trade-offs between survey administration requirements (e.g., dictated by validity) and an engaging conversational experience.

1.8 Summary of Key Findings

The summary of my findings from each chapter is presented in Table 1.2.
Conversational Activity Promotion (Chapter 3)
• The self-diverse strategy significantly increased user activity performance in a 2-week-long field study, making users 3.7 times more likely to exercise.
• Topic-based diversification of prompts can attract user attention (perceptually), provide informational value (more opportunities for new information), and increase personal relevance (more opportunities for cognitive elaboration & a human-like feel).
• Non-diversified repetitive prompts are perceived more like reminders, in which users ignore the repetitive aspects (content blindness).
• Conversational design of prompts can trigger higher expectations of intelligence & contextual meaningfulness and also invite higher scrutiny of content quality.

Conversational Reflection (Chapter 4)

Both Settings
• A conversational approach can trigger different types of reflection (increased awareness, alternatives and future actions, and new insights) and offer tangible benefits (increased motivation, new behaviors, mindfulness, more realistic plans).
• The use of personalized aspects in the dialogues (name, activity graphs, tasks) is useful for grounding responses in personal experiences and promotes engagement & motivation.

Physical Activity Setting Specific
• The two-step mini-dialogue structure for reflection can offer benefits: 1) extending thinking time for reflection, 2) encouraging deeper thinking and more meaningful answers, 3) lowering reflection effort, but also runs the risk of disappointing users if the second step feels ‘generic’.
• Typing and sending responses in chat has the benefits of promoting deeper thinking, as well as seriousness & precision in making plans, but also incurred typing effort.
• Once-a-day reflection supports reflection on a continual basis and enables devoting the whole day to deepening reflection on one aspect.
Workspace Productivity Setting Specific
• Too broad or out-of-context reflection questions at work can be perceived as meaningless to reflect on and an unnecessary distraction.
• The Slack-based text modality for conversational reflection was perceived as 1) easier for reading questions and thinking about responses in one's own time, 2) easier for replying in one's own time and describing details, 3) easier for reviewing and changing responses, but at the same time 4) more time consuming due to typing, and 5) less personal than voice.
• The voice modality for conversational reflection was considered 1) valuable as a separate channel just for reflection due to a more personal feel and the ability to quickly capture ‘quick’ thoughts, 2) faster for answering questions, as well as 3) more interactive, fun, and engaging, but at the same time caused 4) pressure to respond immediately, and 5) listening to one's own responses was inconvenient and uncomfortable.

Conversational Data Collection (Chapter 5)
• Conversational social needs screening applied in an ED setting can be more engaging (due to the feeling of talking to somebody), perceived as more caring (due to personality & empathy), and more understandable (due to audio & question rephrasing), especially for low health literacy populations.
• Efficiency of interaction is much more important for high health literacy users than for low health literacy users.
• Perception of inefficiency can be triggered by 1) the fixed & sequential pace of interaction, 2) the need to wait before a question shows up (e.g., due to the typing indicator ‘ellipses’), 3) the ability to read faster than the voice readout, 4) the presence of additional conversational utterances, 5) the inability to concentrate on reading with audio on.
• Conversational social needs screening can feel ‘pushy’ due to: 1) direct questions, like from a teacher, 2) lack of lead-in interaction between very sensitive questions, 3) the perception of the chat trying to repeatedly get information that was declined, 4) feeling rushed to respond due to short delays.

Automating Conversational Design for Survey-based Data Collection (Chapter 6)
• A simple 4-task conversion composed of 1) addition of an introduction & closing, 2) addition of contextual empathetic reactions, 3) addition of progress communication & topic handling, 4) adaptation of question language to a conversational style can offer engaging conversions while needing only relatively minor correction effort.
• The same empathetic reactions to user answers in conversational surveys can be very polarizing depending on context: perceived as engaging, natural, pleasant, and even ‘cute’ in one context and as judgmental and patronizing in another.
• The proposed approach involving 3 types of empathetic reactions suffers from: 1) lack of an appropriate reaction class for specific scenarios, 2) insufficient use of broader context, and 3) lack of specificity to survey contents.
• Several conversion challenges relate to the trade-offs between survey requirements and conversational experience (e.g., rephrasing 2nd- and 1st-person survey items, addressing intrinsic survey question repetition), as well as the availability of data matching socialization and empathy to the user.

Table 1.2: Summary of Chapters and Findings

Chapter 2

BACKGROUND

Many people seek to change their behaviors to better themselves in various aspects, such as eating healthier [72], exercising [93], being more productive [107], or better managing their finances [119].
Changing lifestyles leading to widespread obesity in developed countries, aging populations, and disparities in access to health services further emphasize the particular importance of behavior change in the domain of health [48]. Numerous existing behavior change frameworks (e.g., the Theory of Planned Behavior [6], the Transtheoretical Model of Behavior Change [191]) identify various factors important in influencing one's motivation to change behavior. These involve individual characteristics, such as intrinsic or extrinsic motivation [230] or self-efficacy (e.g., how much the person believes in successfully changing their behavior [17]). People may have different reasons for wanting to change behavior, such as specific one-time goals, maintenance of existing positive habits, wanting to increase a particular behavior (e.g., increasing physical activity levels), or even wanting to eliminate or decrease unwanted behaviors (e.g., smoking). Behavior change, to be effective, requires a fundamental ability to collect information about one's behavior, an ability to effectively use this information to introduce measurable changes to one's behavior, and an ability to sustain such improved behavior over time [73, 142]. The combination of all these factors makes behavior change on individual and societal levels very challenging [120]. Indeed, six months after making a New Year's resolution, only 46% of people were still on track with their behavior change goals [177]. Similarly, in the health domain, only 22% of Americans follow the national aerobic and muscle strengthening guidelines [103], and less than 23% of the world population meets recommended guidelines [237]. Recently some of the more practical challenges in behavior change, such as tracking activities and measuring the effectiveness of interventions, have been increasingly supported by emerging technology-based tools.
Such support was made possible by advances in wearable sensors, widespread internet connectivity, and the prevalence of mobile and IoT devices, which led to the creation of technology-supported behavior change in the form of personal informatics [73, 142].

2.1 Current Technology Support for Behavior Change

Personal informatics relates to the use of technology for collecting and reflecting on personal information [143]. Li et al. proposed a five-stage model of personal informatics that characterizes how technology can support people in behavior change, proposing phases of preparation, collection, integration, reflection, and action [142]. Epstein et al. further built on this work with a lived informatics model, which adds, among others, aspects of lapsing and resuming [73]. Similarly, beyond the individual (the main focus of personal informatics), the popular stages-of-change model [191] identifies several similar stages: the Precontemplation stage, in which the user may be unaware of problematic behavior and may need evidence and support (also from others) in realizing that change is needed; the Contemplation and Preparation stages, in which the user wants to take action, but may need support in deciding what is the best action to take; and finally the Action and Maintenance stages, in which the user takes action, but may need support in maintaining positive momentum over time. These models characterize how technology has been and can be used for supporting behavior change. While different models propose different steps or stages, they all identify a number of common crucial behavior change aspects and further characterize how technology has struggled to support these aspects.

2.1.1 Challenges of Sustaining Actions over Time

The maintenance of positive activities is crucial in behavior change and a major challenge. The lived informatics model includes lapses and difficulties in resuming activities as temporal aspects of the challenge [73].
Identifying important factors for predicting and affecting people's intentions for action is the main focus of many theoretical behavior change and persuasion models [57]. Successful change in behavior usually requires consistent and sustained execution of actions considered desirable for an extended period of time (e.g., running, going to sleep at proper times, eating healthy). Reminders and message-based triggers have been some of the most commonly used technical solutions for supporting such sustained user involvement in behavior change efforts [84, 167]. Yet despite their prominence, the design of effective message-based triggers is challenging [54, 84]. One of the main problems stems from the need for repeated user exposure to such triggers, which can lead to annoyance, boredom, content blindness, or purposeful avoidance [35]. This “alert fatigue” has been linked to the use of the same or similar contents and a lack of personal relevance of the message for users. Both issues call for designing motivational triggers that are diversely phrased, novel, and personalized in their contents [65, 100, 169, 208]. Yet important challenges remain: What aspects of behavior change triggers need to be diversified to make them appear novel, engaging, and natural for users? How can we design diverse and personalized triggers in a way suitable for use at scale? Which aspects of human-human communication can be mimicked, and how can mimicking them improve the efficacy of triggers?

2.1.2 Challenges of Reflection and Learning from Data

Helping users make sense of and learn from behavioral data is yet another challenge. Technology has been used for combining user activity data collected from different sources (e.g., wearable activity trackers and self-reports) into a format intended to support exploration and self-learning (reflection) on past behavior patterns [22]. This has often been supported via dashboards [195], visual analytics tools [132], and glanceable displays [92].
Reflection is considered a crucial step that translates observations into actions [23], which can help users increase their self-knowledge [22], formulate realistic behavior change goals [139], and increase self-control while promoting positive behaviors [144]. Despite the importance of reflection, personal informatics models reveal little about how reflection can, or should, be triggered via technology [22]. At the same time, several personal counseling techniques (such as motivational interviewing [198]), as well as commercial behavior change programs (e.g., Weight Watchers [113]), rely on engaging and insightful conversations with the goal of triggering reflection on one's own activity. Unfortunately, technology has struggled to successfully support reflection in practice at the same level as human counselors [218], and design for reflection is still in its infancy [23, 82]. As noted in [22], “prior work carries an implicit assumption that by providing access to data that has been ‘prepared, combined, and transformed’ for the purpose of reflection, reflection will occur.” Current best practices rely on visualizations of self-tracking data [132, 55] or on journaling [149]. Both of these approaches assume that reflection will occur naturally when the data is presented. However, reflection is time consuming and not necessarily something that comes naturally to people [82]. In many cases people need a reason to reflect, or at least an encouragement to do so [168]. Therefore an important question remains: How can we design technology that mimics some of the best practices of human counseling and coaching to better support reflection in behavior change?

2.1.3 Challenges of Data Collection

Technology has been particularly successful in supporting easy collection of measurable behavioral and physiological data (e.g., steps, physical activity, heart rate).
Unfortunately, not all data relevant for behavior can be easily collected or inferred with sensor-based tracking. Important aspects such as personality traits, individual motivations, as well as social determinants of health still need to be largely collected via self-reports [52]. Technology can still help, by means of electronic surveys, which can easily and cost-effectively scale, but numerous studies have identified challenges in how technology is currently used for survey administration. Electronic surveys can be easily ignored and can collect lower quality responses as compared to in-person interviews (in one study, a 92.8% face-to-face response rate compared to a 52.2% web-survey response rate) [104]. Furthermore, traditional surveys have been found to implicitly bias against non-whites [151], low income individuals, the homeless, or those disenfranchised with mental health and/or substance use issues [44]. Such disenfranchised populations are likely to suffer disproportionately more from health, financial, and legal issues and are in greater need of behavior change support [157]. These challenges have been partially attributed to difficulties with understandability, trust issues, as well as to the rigid and impersonal way in which technology is employed to collect often sensitive and personal data [32]. Therefore important questions remain: How can we support survey-based data collection that is understandable, flexible, and empathetic for sensitive settings? How can we employ the positive aspects of face-to-face interviewing to improve user engagement with automated survey administration?

In summary, the majority of technology support for behavior change relies on providing tools that can prove helpful for already motivated and engaged users, but technology support may fall short of effectively engaging less motivated ones.
The often impersonal, rigid, non-empathetic, and less thought-provoking use of current technical tools can limit the level of support technology could provide for users in need. Therefore it is an interesting question whether it is possible to effectively redesign important behavior change interactions by mimicking aspects of natural human-human interaction to make these interactions more engaging for users. Furthermore, how can we successfully design technology that mimics such human-human interaction aspects without falling into the trap of overpromising intelligent behavior given still profound technological limitations? In the next section I look at how a conversational approach could potentially improve on some of the issues I have identified in currently existing technology-based support for behavior change.

2.2 Potential Value of Conversational Approach

Conversational agents (CAs), with their ability to mimic aspects of human-human interaction, have the capacity to improve on different behavior change challenges. Past work found that polite interruptions used by CAs [27], as well as the use of social dialogue, empathy, and expressions of friendliness [26], can improve users' long-term engagement. Similarly, uses of persuasion [30, 187] and behavior change techniques [90] in conversational contexts have been shown effective for motivating users [206]. Furthermore, some of the intrinsic properties of natural conversations, such as novelty of topics, natural phrasing, content diversification, and personal relevance, also have the potential to help sustain long-term engagement and alleviate some of the alert fatigue experienced when supporting behavior change actions repeatedly. To address the challenges of reflection and learning from behavior data, a conversational approach can offer a number of advantages.
One of the main methods by which human coaches engage people in learning is by repeatedly asking open questions that trigger deeper thinking [139]. Such questions can help people understand their own needs and motivations. Unfortunately, simple prompting approaches such as “tell me more” or restating what the user said in the form of a question (e.g., Eliza [235]) only have short-term value [28, 176]. A dialogue that can support a structured progression of reflection and build on user answers can help elicit contemplative [114] and metacognitive [81] thinking, encouraging people to think about their needs and wants beyond the first answers that come to mind when responding to singular prompts. A conversational approach can also be helpful in supporting data collection. Past work demonstrated that conversational survey administration can increase user attention to survey questions, leading to better quality responses and higher engagement [124]. These qualities have been specifically attributed to mimicking human-like interactions in chatbots [239]. Mimicking such aspects has also been shown to improve trustworthiness [194], which can be very valuable when personal or sensitive data is collected. Furthermore, the use of voice in interaction can help mitigate understandability issues among low literacy participants [96]. This is particularly valuable as low literacy can be correlated with certain systematic health conditions that would particularly benefit from behavior change interventions. In summary, a number of aspects related to the interaction and appearance of CAs can be valuable in improving user engagement and other crucial aspects related to various behavior change challenges. While prior work has focused on the embodiment aspects of CAs, the details of language use (utterance phrasing, diversification, and novelty), the dialogues themselves (topics, progression, and social dialogue), as well as the design process involved in creating these, have been less explored.
In this work I improve our understanding of these aspects.

Chapter 3
CONVERSATIONAL ACTIVITY PROMOTION: DESIGNING DIVERSIFIED AND TAILORED PROMPTS

In this chapter I focus on the challenge of motivating & supporting action in health behavior change. This goal aligns with Li et al.'s personal informatics [142] challenge of promoting repeated actions and with Prochaska et al.'s stages-of-change [191] activity maintenance. I investigate this challenge in the context of physical activity promotion via repeated message-based triggers, which are widely used by existing technology to sustain activities over time. I aim to demonstrate the value of conversational redesign of such triggers by means of conversational language style [124], diversification [78] & personalization [139] to address the challenges of repetitiveness [101] and content blindness [105]. I use a simplified format of conversational interaction focusing on single-turn exchanges over a period of time on a mobile device, which is the predominant platform used by users wanting to change their physical activity related behaviors.

3.1 Background

Repeated triggers or reminders are one of the popular forms of motivating behavior change in various domains [167], including exercising [54, 210], sustainable living [3], or civic engagement [178]. Such message-based triggers can serve to promote, remind, or even motivate action [84]. However, designing effective triggers is challenging [84, 54]. The challenge largely lies in the need for frequent repetition of the triggers to sustain behavior over a period of time, which can lead to user annoyance and boredom [35, 223], purposeful avoidance [101], content blindness [105], and in extreme cases even lower motivation [208].

3.1.1 Conversational Inspirations

Social support has been shown to have a large positive impact on behavior change.
Positive encouragements from friends and family, in the form of social platforms or personal messages, have been shown to have a big impact on user engagement [121]. Hence a proposed solution to the adverse effects of repetition could be to make triggers feel more like they are coming from a person, by making them sound more conversational and by personalizing and diversifying their contents. Human health coaches commonly personalize their communication with clients based on knowledge about an individual [202]. Content-based and linguistic diversification is a natural ‘side effect’ of how people communicate [185, 219] and has also been suggested in the domain of advertising [41]. Furthermore, prior work has shown that short and similar messages coupled with high repetition accelerate the appearance of tedium (measured by annoyance and boredom), and some controlled experiments demonstrated the positive impact of diversification in constrained settings [100, 208].

3.1.2 Design of Motivational Triggers

While conversational feel, personalization, and diversification of communication seem like good candidates for addressing the adverse effects of behavior change trigger repetition, prior work has applied these strategies inconsistently and often on a study-by-study manual basis, without a clear design process with predictable results [67]. Hence systematic reviews on SMS mobile messages, as well as on health-related text messaging specifically, pointed to a need for closer investigation of the relationship between design characteristics and user engagement and retention [79, 102]. In this chapter I therefore investigate: How might designers diversify, personalize, and make behavior change triggers more conversational in a systematic manner? How can a systematic design process support such improvements?

3.2 Design Approach

To support systematic message diversification I employ a cognitive space modeling approach called Galileo Theory [1].
The theory operates with the notion of semantic similarity between different concepts. It is quite similar to the general notion of, e.g., semantic relatedness of words based on Wikipedia links [71] or embedding-based word similarity in neural space [123], with the important difference that the Galileo space is personal to an individual or a group of individuals rather than universal. Using this conceptual framework allows me to diversify messages in a personalized manner [25]. The theory defines two distinct terms in this personal cognitive space: 1) the self-referent term (e.g., “self”) and 2) the target term (e.g., “exercising” for physical activity). Concepts cognitively close to the “self” term are conceptually important for an individual, while concepts close to the target term are semantically similar to it. I used these two terms to propose two systematic content diversification strategies.

Figure 3.1: A non-diverse (baseline) and two diverse strategies as depicted in the cognitive space: A) Non-diverse - messages connect self and exercising, B) Target-diverse - messages connect concepts cognitively close to the target (e.g., different types of exercising) and self, C) Self-diverse - messages connect concepts cognitively close to self (e.g., motivations) with exercising.

3.2.1 Target-diverse Strategy

The concepts close to the target concept of “exercising”, being semantically close to it but still somewhat different linguistically, offer an opportunity to create diverse formulations of the messages while maintaining relation to the selected topic. While the Galileo space is personal, the semantic relatedness of concepts not around the “self” is likely to be universal (e.g., “strength training” will be perceived as similar to “exercising” irrespective of the person). The related concepts can therefore be obtained using word use similarity.
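As a toy illustration of obtaining concepts related to a target via word-use similarity, the sketch below ranks candidate concepts by cosine similarity over made-up embedding vectors. This is an assumption-laden stand-in, not the Galileo space or the WikiBrain "ensemble" measure actually used in this work; the vectors and concept list are invented for illustration.

```python
import math

# Hypothetical pre-computed embedding vectors (made up for illustration only)
EMBEDDINGS = {
    "exercising":        [0.9, 0.8, 0.1],
    "jogging":           [0.8, 0.7, 0.2],
    "strength training": [0.7, 0.9, 0.1],
    "cooking":           [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_concepts(target, k=2):
    """Return the k candidate concepts most related to `target` (excluding itself)."""
    scores = {c: cosine(EMBEDDINGS[target], v)
              for c, v in EMBEDDINGS.items() if c != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(closest_concepts("exercising"))  # → ['jogging', 'strength training']
```

Any relatedness measure with the same interface (concept in, ranked neighbors out) could be substituted here, which is what makes the target-diverse strategy largely independent of the specific similarity backend.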
I used the “ensemble” semantic relatedness (SR) measure available in WikiBrain [210] to find an initial list of 8 concepts closest to “exercising” in terms of the SR measure: “weight training”, “jogging”, “stretching”, “strength training”, “running”, “walking”, “aerobics”, “body building”. Based on the properties of the cognitive space, by connecting concepts related to exercise (e.g., strength training) to “self” in a message, I can indirectly affect attitude towards “exercising” while providing surface-level diversification of the message phrasing (Figure 3.1.B).

3.2.2 Self-diverse Strategy

The second strategy takes a similar approach, but centers around the self-referent term (“self”) (Figure 3.1.C). In this case, however, I cannot rely on simple word similarity, as the terms need to be “close” to the notion of “self”, which in Galileo space means they are the concepts that the message recipients care about (e.g., stress reduction). In that case the notion of similarity is likely not universal, but personal. Such personally relevant concepts can be obtained in various ways; I used two approaches in my work. In study 1, I used prior literature on exercising motivations [135], which informed 8 key motivations people have for exercising: “stress reduction”, “physical appearance”, “increased vigor”, “relaxation”, “health”, “fitness”, “pleasure”, “self-esteem”. In study 2, I relied on Schwartz's values framework [209] to rate individuals' closeness to 10 basic universal values: “achievement”, “benevolence”, “conformity”, “hedonism”, “power”, “security”, “self-direction”, “stimulation”, “tradition”, “universalism”.

3.2.3 Generating Messages

While the above strategies provide a set of key concepts or terms that the message should revolve around (e.g., “fitness” and “aerobics” or “self-esteem” and “jogging”), the actual text of the message (syntax) also needs to be designed. I incorporated two different approaches to do that.
In study 1, I used a template-based approach, where I keep the syntactic components intact and just swap in different terms (e.g., “[Exercising] can help with improving [self-esteem]. Latest research has confirmed many of the anticipated benefits of improved [self-esteem]”). In study 2, I used crowd-sourced generation, where crowd-workers were asked to write a motivational message that connects the given notions (e.g., an instruction to connect “power” and “exercising” could result in the message: “People respect someone who makes a commitment to exercise!”). Examples are presented in Table 3.1.

3.3 User Study

I conducted two studies to investigate the effects of diversification. The first study compared the self-diverse, target-diverse, and baseline strategies in a controlled lab study with 150 Amazon Mechanical Turk (AMT) workers using template-based message generation. The second study further tested the more personalized strategy (self-diverse) in a two-week-long field deployment with 28 participants receiving messages along with exercise challenges on their mobile phones.

3.3.1 Study 1 - Controlled Lab Experiment

The AMT workers (45% male, median age group 25-34, 44% exercised regularly) participated in a between-subjects study with 3 conditions: target-diverse, self-diverse, and the baseline non-diverse. Participants were told that they would be asked to evaluate a draft of an informational website about health and nutrition. They were then shown a series of 4 web pages. Each web page contained health and nutrition information (100-150 words) and an associated image (e.g., presenting people running or eating healthy). I used actual content from a university's student health services' webpage. On each of the 4 pages the participants were shown 1 health tip consisting of an image (always the same for each message and condition) and a pop-up text message trigger (participants saw four different messages, one on every page).
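The template-based generation used in study 1 can be sketched as simple slot-filling: the syntax stays fixed and only the bracketed concepts are swapped. The template wording below paraphrases the example given earlier; the function and slot names are my own illustrative choices, not the study's exact implementation.

```python
# Fixed-syntax template with two concept slots (wording is illustrative)
TEMPLATE = ("{target} can help with improving {concept}. Latest research has "
            "confirmed many of the anticipated benefits of improved {concept}.")

def generate_messages(target, concepts, template=TEMPLATE):
    """Produce one message per self-diverse concept, keeping the target fixed."""
    return [template.format(target=target.capitalize(), concept=c)
            for c in concepts]

for msg in generate_messages("exercising", ["self-esteem", "stress reduction"]):
    print(msg)
```

The appeal of this approach is its near-zero per-message cost; its weakness, noted later in the chapter, is that the fixed syntax limits lexical diversity and can make messages feel artificial.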
To protect against possible ordering effects, I counterbalanced the message order for the manipulation conditions. Participants received $2.20 for study participation. Based on the design goals I had 4 hypotheses:

• H1: Annoyance towards the message-based triggers will be lower when using diverse strategies.
• H2: Boredom with the message-based triggers will be reduced when using diverse strategies.
• H3: Informativeness of the message-based triggers will be rated higher when using diverse strategies.
• H4: Helpfulness of the message-based triggers will be rated higher when using diverse strategies.

To test these I measured annoyance (H1), boredom (H2), informativeness (H3), and helpfulness (H4) by asking the participants to estimate the experienced level of each towards the message contents on a 5-point Likert scale. I measured behavior intention and attitude using TPB [6]. The reliability of both TPB measures was high (attitude: α=0.88, intention: α=0.77). I also measured reactance (negative impact of persuasion) following [66].

3.3.2 Study 2 - Controlled Field Deployment

I also conducted a second study that involved a field deployment. It was meant to address the two main limitations of the first study. The first is the lack of realism: study 1 involved only four messages, and participants were not sent these messages in the context of a behavior to perform. The second major limitation is that I collected only perceived measures. I aimed to address these in the field deployment, where users were asked to perform actual exercises and report their performance with a reply text message on their mobile phones (Figure 3.2). I used a between-subjects design, comparing the stronger of the two strategies (self-diverse) to the baseline, non-diverse. A stratified randomization based on the level of physical activity was used to assign participants into the two experimental groups (self-diverse and non-diverse).
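The stratified randomization mentioned above can be sketched as follows: group participants by baseline activity level, shuffle within each stratum, then alternate condition assignment so both conditions receive a balanced share of each activity level. The function names, data shape, and exact procedure are my assumptions for illustration, not the study's actual assignment code.

```python
import random

def stratified_assign(participants, strata_key,
                      groups=("self-diverse", "non-diverse"), seed=0):
    """Balanced within-stratum random assignment to experimental groups."""
    rng = random.Random(seed)
    # Bucket participants by their stratum value (e.g., activity level)
    strata = {}
    for p in participants:
        strata.setdefault(p[strata_key], []).append(p)
    # Shuffle each stratum, then alternate assignment across the groups
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        for i, p in enumerate(members):
            assignment[p["id"]] = groups[i % len(groups)]
    return assignment

# Hypothetical sample: 8 participants split evenly across two activity levels
people = [{"id": i, "activity": "high" if i % 2 else "low"} for i in range(8)]
print(stratified_assign(people, "activity"))
```

Because the alternation happens inside each stratum, neither condition can end up over-represented among, say, highly active participants, which is the whole point of stratifying.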
I recruited 28 participants online and through fliers distributed at a university campus (18% male, median age of 31, 32% claimed to exercise regularly). Participants were invited to use a daily-challenges application that I built for the study. The application presents 4 daily exercises that the participants were asked to complete: 2-4 push-ups, 12-15 crunches, 12-15 lunges, and 12-15 jumping-jacks (these numbers were chosen using feedback from pilot studies). Participants were notified via a message trigger to perform these activities four times a day (the order of activities was randomized daily). Participants were asked to perform the activities at 9:30 am, 11:30 am, 1:30 pm, and 3:30 pm. They also received a daily summary message at 6:00 pm, which linked them to a webpage showing a completion status dashboard (Figure 3.2). Participants could mark their activity completion either directly through the communication channel (e.g., texting “done” back), or, if they forgot, they could manually enter their completion on the website. In addition to the 4 hypotheses from study 1, I also tested the messages' effect on adherence:

• H5: Diversification will increase exercise completion.

Each participant was awarded $35 for participation and an additional $20 if they agreed to participate in a follow-up interview. On top of the measures from study 1, I collected the self-reported exercise completion rating and conducted one-hour semi-structured interviews with 14 participants who expressed interest in being interviewed.

Figure 3.2: The exercise completion webpage used in the study and an example conversational SMS prompt delivered on a mobile phone.

Table 3.1: Example changing parts of the conversational prompts used in the field.
3.4 Results

3.4.1 Quantitative Results

In study 1 I found that both strategies were considered significantly more informative and helpful compared to the baseline (Table 3.2); however, only the self-diverse strategy offered significant reductions in annoyance and boredom. Hypotheses H1 and H2 were thus only partially supported (not supported for target-diverse), while H3 and H4 were fully supported. The results indicate that the self-diverse strategy performed better than the target-diverse one. It could be that the self-diverse strategy addresses more personally relevant issues, and such personal relevance may render it less annoying and boring. In study 2, on top of the measures from study 1, I collected the self-reported exercise completion rating, which I focus on first as the most direct behavioral measure of message effectiveness. The main hypothesis for the field deployment was that self-diverse messages will increase exercise completion (Table 3.3).

Table 3.2: Summary of the results from study 1 - differences between the conditions based on the post-study measures

Table 3.3: Summary of the results from study 2 - differences between the conditions based on the post-study measures

I found that this hypothesis was supported. First, through visual inspection of the data, those in the self-diverse condition completed more exercises on 12 out of the 14 days, and on the other 2 days the differences were negligible (Figure 3.3). Second, I constructed a mixed-effects logistic regression model for predicting the completion of each prompted exercise (Model 1 in Table 3.4). I found that those in the self-diverse condition were 3.7 times more likely to exercise, but this result is only weakly significant (p=0.09). Using the second model I analyzed exercise completion per day.
I binned the number of daily exercises completed into two levels using a median split (0-2 completed as fewer and 3-4 completed as more) and used a similar mixed-effects logistic regression to predict the likelihood of completing more exercises (Model 2 in Table 3.4). This model also shows that participants in the self-diverse condition exercised more (about 6.5 times more likely to complete 3-4 exercises daily; p=0.04). Finally, model 3 leveraged the fact that there were message repetitions in the self-diverse condition (I did not have the 4*14 messages needed for the full duration of the study), which allowed me to test more specifically whether message repetition affects exercise completion.

Figure 3.3: Average self-reported exercise completion per study day. Big drops represent weekends.

Table 3.4: Mixed-effects regression models for predicting exercise completion

Using the self-diverse-only dataset, I coded how many times a specific message had been sent to an individual participant and used that as a predictor variable in the model (Model 3, repetition count in Table 3.4). I found that, controlling for the time progression of the study (exercise day), the number of repetitions was indeed a significant factor influencing exercise completion (p<0.001), with participants being only 0.5 times as likely to complete the exercise with each message repetition. Interestingly, in this model, the effect of exercise day was not significant, suggesting that the decline in the self-diverse condition is mostly due to the message repetition rather than potential novelty effects associated with study participation. On the other hand, post-study self-report measures (Table 3.3) indicated that none of the H1-H4 hypotheses were supported. To help explain these potentially conflicting results, and to explore whether and how these triggers helped, I turn to the qualitative results from the study 2 interviews.
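The effect sizes reported in this section can be read directly off a logistic model's fixed-effect coefficients: a coefficient b is a change in log-odds, so exp(b) is the corresponding odds ratio. A minimal sketch follows; the coefficient values are illustrative stand-ins chosen only to reproduce the reported ratios, not the actual fitted model parameters.

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)

# Hypothetical coefficients, chosen only to mirror the reported effect sizes
b_condition = 1.31    # being in the self-diverse condition
b_repetition = -0.69  # each additional repetition of the same message

print(round(odds_ratio(b_condition), 1))   # → 3.7 (more likely to exercise)
print(round(odds_ratio(b_repetition), 1))  # → 0.5 (odds roughly halved per repeat)
```

This is also why a "0.5 odds ratio per repetition" compounds: after two repeats of the same message, the odds of completion are roughly a quarter of the original, all else equal.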
3.4.2 Qualitative Results

Based on interview feedback I identified a number of themes related to the effectiveness and perception of conversational triggers. First of all, across both conditions, all the participants appreciated that the triggers reminded them to regularly perform exercises during the day. Some even pointed out that they would not have done any exercises during the day if they were not reminded of them. Even if the participants felt that the triggers were pushing them a little, they still considered it to be a positive push: “I liked that it was annoying me to do exercise. Kept nagging at me to do the few workouts needed (. . . )” (P5, non-diverse). Many participants appreciated that the messages felt positive, encouraging, but also to the point. This shows that, regardless of trigger type, having triggers was helpful in general.

Diversification helps

The diversification used in conversational messages in the experimental condition was generally appreciated. In-depth analysis of user feedback allowed me to identify the three most common ways in which diversification was considered helpful: 1) attracting attention, 2) providing information, and 3) personal relevance.

Attracting attention: All the participants noticed that the contents of the messages were changing. These constant changes built an expectation of novelty each time a message arrived, which in turn increased attention, sustained engagement, and interest in reading the messages: “I would definitely skip over them if they were all the same message.” (P26, self-diverse). Furthermore, the diversity of the messages led to increased curiosity and introduced a certain element of “fun”: “With all different messages it's fun to read what they say. (...) That's a fun element to it I guess. I remember a few of them” (P22, self-diverse).
Providing information: About half of the participants in the self-diverse condition liked the fact that the messages delivered small informational pieces about the different benefits of exercising: “I liked that they talked about all the different benefits of exercises. One of them was learn about yourself, which I thought was cool. Something about endorphins.” (P2, self-diverse). Also, some of the information in the messages was not necessarily obvious to the participants and was therefore somewhat revealing. On top of that, the participants perceived such informational pieces as inspirational: “I like that it seemed informative. Some of the things weren't really as obvious. It's kind of inspirational (...). Yeah, so like, there were some that were kind of more factual.” (P15, self-diverse).

Personal relevance: Most of the participants also resonated with specific messages or specific keywords in them. These participants specifically remembered selected messages that had some sort of personal value to them, relating either to their past or current experiences: “I remember one, ‘Being healthy yourself helps your aging parents better, as they get older.’ (...) I think that one was the one that I picked up most, because my parents are getting old.” (P24, self-diverse). In that sense the message diversity was very valuable for helping the participants cognitively elaborate on the personal value of exercising. It is worth noting that I have also proposed the ease of making the diverse messages personally relevant as an explanation of the higher informativeness and helpfulness ratings in the lab study. Additionally, what also seems to have contributed to the sense of personal relevance was the fact that the messages felt as if a person wrote them: “I did think that they were written by a person. Certainly, there's a sense of someone writing them for you at some point” (P3, self-diverse).
Diverse conversational triggers - reminders & motivators; non-diverse - reminders only

Further analysis of participants' comments revealed that the triggers were perceived and evaluated differently between conditions. In the non-diverse condition, the participants almost immediately noticed that the motivational part of the message was fixed and would subsequently focus on just the part that changed - the exercise to complete: “My first reaction was ‘Yes, exercise is good for me’. Then because they never changed my brain just looped past entirely. (...) I stopped really paying attention to it so much and was going straight to what is the exercise.” (P4, non-diverse). Consequently, they perceived the messages as simple reminders about exercising, which they considered helpful: “The message is very straightforward. It's short. ‘Exercising is good for you and then do something now’.” (P25, non-diverse).

On the other hand, those in the diverse condition employed a more critical evaluation of the contents. The changes from message to message solicited more attention from the participants. They scrutinized the messages more than those in the non-diverse condition, to see whether the particular contents they received this time were actually appropriate and helpful to them given their context. They expected the messages to be somewhat intelligent and meaningful motivators rather than just simple automated reminders: “It would be like ‘Oh, nature is the same as exercise.’ It's like what does that mean? (...) I love the self-help stuff. I love motivation, but it needs to actually make sense to me or seem like somewhat logical I guess” (P22, self-diverse). This effect, coupled with the fact that people tend to remember the negative more than the positive [21], resulted in the participants exposed to the diversification strategy recalling and focusing on incidents in which the messages were less helpful.
They evaluated the messages not just as a reminder of which exercise they needed to complete, but also for their motivational component.

Challenges in designing diversity

Despite the many benefits, I also identified a number of challenges that designers need to consider to improve on the use of diverse conversational messages. These are: 1) quality of diversification, 2) the challenge of the need for perpetual novelty, and 3) contextual relevance.

Opportunities to increase diversity: Despite the diversification, a number of participants still felt that the messages were not that different. These participants commented that they indeed noticed that the messages were technically different, but felt that they were also very similar in terms of tone and framing: “(...) They seemed pretty similar in terms of being encouraging of exercise and talking about the different benefits. It seemed like they were the same in tone but certainly, each one was different.” (P7, self-diverse). Many participants also commented on the practical value of the information in the messages. Although the messages were generally perceived as presenting diverse information about the benefits of exercising, which was appreciated, a number of participants felt that the provided information was not necessarily very revealing to them. They generally felt that they already knew most of the information: “I think very few of the things were really novel to me, once they're saying, ‘Exercise to look and feel better,’ you kind of know that.” (P16, self-diverse).

Repetition is a problem: Making sure that the messages stay novel is important. For this 2-week study I did not have enough different messages to ensure that the participants would not receive the same message twice. Unfortunately, almost all the participants noticed this fact, and it led to disappointment in each single case. The fact that the messages were repeating led the participants to lose interest in reading them.
The majority reported that, once they noticed that the messages were not novel, they started skipping the motivational part and focused on the exercise they had to complete: “At first, you know ... I think I would have liked it a lot more if you guys had a whole new set of different messages. Make the reading more enjoyable (...) Later when the messages were ... Seems to be repeating I stopped reading them.” – (P11, self-diverse). It is worth noting that this qualitative feedback is consistent with my quantitative analysis showing that the message repetition had a significant negative impact on exercise completion.

3.5 Discussion

In this research, I sought to evaluate and understand the feasibility of the two proposed strategies for diversification of behavior change messages. Through both studies, I found several benefits of using the proposed approaches. In a controlled lab study setting, the strategies resulted in messages that were perceived to be more informative and helpful, and the self-diverse strategy also reduced annoyance and boredom from repeated exposure. Applied in the field, in a more conversational context of SMS exchanges, the self-diverse strategy also led to an increase in behavior change adherence. This application represented the basic use of conversational design (single turn, agent initiative only). At the same time, the exchanges were sustained and repeated over a longer period of time, making the setting closer to the long-term relational agents employed in [28]. Even this simplified application showed the benefits of a diversified, personalized, and more conversational approach to motivating behavior change on an everyday basis, which could be further improved with a more elaborate conversational approach.

Table 3.5: Summary of the proposed diverse message generation process based on the approaches explored in both studies.
3.5.1 Conversational Message Triggers Design

Through this research, I proposed two strategies and a general process to generate diverse trigger messages. These strategies provide different benefits, making them useful under different circumstances. The self-diverse strategy seemed more effective in mitigating the negative effects of repetition. But the target-diverse strategy may also be useful in settings where the target concepts should not be addressed directly; e.g., in anti-smoking campaigns, talking about smoking directly may actually induce more smoking from smokers [95].

Tone & Framing Diversification: In relation to the message contents itself, the qualitative feedback from the field deployment helped me identify that, despite the measurable effectiveness of the diversification strategy, there are opportunities for improvement. One such direction relates to the possible use of different framings or tones for the messages. Currently all the messages are generally positively motivational. I could imagine prompts in my crowd-sourced message generation process that are more focused on challenging the recipient, pointing out negative consequences of inaction, or that employ social comparison for the purpose of triggering competitiveness or cooperation. This would increase the syntactic diversity of the messages within the framing of the prompted concepts.

Addressing Long-Term Repetitiveness: Another aspect of the diversification that could be further improved relates to the repetition of the messages. Both quantitative and qualitative data indicated that when the messages started repeating, this had a measurable negative impact on exercise completion and participants' overall experience. Unfortunately, it may be impossible (and too costly) to generate an infinite number of diverse messages. There are, however, a number of options I could explore.
One, I could increase the perceived novelty of the messages following some of the techniques discussed in the previous paragraph. Another possibility is that, given a sufficiently large set of messages, people might start forgetting past exchanges. There might be an optimal threshold for total message count dependent on use frequency. This could require a more costly generation process, but the effectiveness of varying exposure has already been indicated in [102]. Yet another strategy could be to expand the current single-turn, agent-initiative-only interaction into more of a mini-dialogue with different interaction paths dependent on user answers. Such mini-dialogues could render slightly different exchanges each time, contributing to the perception of diversity. On the other hand, a more elaborate interaction could involve more user effort.

Context matching: Finally, the current diversification and message delivery did not take into account the context in which the interaction takes place. Many participants pointed out that they expected the messages to “match” the activity they were expected to perform, or to change with respect to the time of day or the social setting they were in. Lack of such matching negatively affected the perception of personal relevance and introduced a sense of artificiality. I could address this mismatch by prompting message generation for specific contexts in advance and then trying to match these messages to the appropriate context; this would unfortunately increase generation costs. Another approach would be to automatically modify the already existing messages (e.g., via providing templates with slots) to make them more appropriate for the specific context [220].

3.5.2 Diverse Message Generation Process

Aside from demonstrating effectiveness, the practical execution of the designs also provides design insights into the processes and workflows that can be used for generating diverse conversational messaging content.
Based on my experiences through studies 1 and 2, I propose a four-stage process of systematic conversational message trigger generation (Table 3.5): 1) Concept generation, 2) Concept selection, 3) Message generation, and 4) Message selection. I summarize the value and importance of each generation step below.

Concept Generation: The goal of the first step is to generate a diverse set of concepts related to the target concept. I proposed two ways of generating diverse concepts: target- and self-diverse. For the target-diverse approach, I used a semantic relatedness measure (based on WikiBrain [71]) to help assess concepts that are related to the target concept (in my studies, “exercising”). Other approaches that assess relatedness can also be employed (e.g., embeddings [141]). For the self-diverse approach, I used prior literature (study 1) and a values framework (study 2) for generating personally relevant concepts. For settings where the motivations are broad or unclear, the values framework offers an alternative strategy. It provides a manageable set of universal values that people across cultures care about, just to varying degrees [49]. Generating the contents using these values can then result in a number of personally relevant conversational triggers.

Concept Selection: The goal of this step is to narrow down the size of the concept space. This is optional and mostly important for reducing the costs involved in executing the next stages. The set of concepts can be focused around those most closely related to the target concept or self. This was done in study 1, where I selected the 3 most relevant concepts out of the 8 initial ones based on the crowd-generated cognitive space. However, generating the cognitive space required laborious comparisons of pairs of concepts. In the near future, this process may be done algorithmically.

Message Generation: This step is where the concepts are turned into concrete text.
One appropriate approach is to use fixed sentence templates, as I did in study 1. This, however, can produce messages that feel artificial (due to limited lexical diversity). Existing fully automated methods of natural language generation could be used, but the exact outcome can be hard to control [33]. In study 2 I explored crowd-based generation, which results in messages that feel much more natural. The messages are also likely to be more creative as well as lexically and semantically diverse, as crowd-workers have the freedom to weave in other concepts aside from the ones prompted. This benefit has already been observed in previous work, where crowd-workers introduced topics from personal experience [51]. Message Selection: This step focuses on selecting, from the generated message corpus, the messages that are most relevant for a particular participant or particular context of delivery. The goal of this step is to further increase the “natural feel” and personal relevance of the messages. In this work, I focused on personalization based on individuals’ values. I asked participants to fill out a short survey to assess their value orientations. Then I sent them a set of the messages that were more personally relevant. To reduce cost, future versions may be able to utilize recent NLP advancements in social-media-based personality profiling [42], although user privacy and permission for data use need careful consideration. Context matching can also be important, as I learned in the field deployment, where the pre-generated text did not always go well with some activities or a specific social or time-based context. Such a mismatch was picked up by participants and affected the perceived quality of the conversational triggers. It might also be valuable to include context information already in the “message generation” step to generate text for the set of expected contexts.
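The value-based message selection step can be illustrated with a small sketch. The message texts, value labels, and scoring scheme below are all hypothetical: each corpus message is tagged with the value it appeals to, and messages are ranked by the user's self-reported value orientation scores from the survey.

```python
# Hypothetical corpus: each message is tagged with the (Schwartz-style) value
# it appeals to. Texts and tags are illustrative, not from the actual study.
MESSAGES = [
    ("A quick jog clears your mind for creative work.", "self-direction"),
    ("Exercising with friends strengthens your bonds.", "benevolence"),
    ("Regular workouts help you stay in control of your health.", "security"),
]

def select_for_user(value_profile, messages, k=2):
    """Rank messages by the user's value-orientation scores and keep the top k.

    value_profile maps a value name to the user's survey score for it."""
    ranked = sorted(messages,
                    key=lambda m: value_profile.get(m[1], 0),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

A participant whose survey ranks benevolence highest would thus receive the socially framed triggers first; the same mechanism extends to context tags for the context-matching discussed above.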
Another approach would be to use automated natural language generation techniques to slightly alter the messages on the fly to make them fit the context better. 3.6 Summary of Contribution In this chapter I examined a more conversational approach to the design of motivational triggers, which are commonly used in behavior change for activity promotion. I identified repetition and limited tailoring to the individual as the major challenges to current technology support for activity triggering. I consequently redesigned these triggers using aspects of a natural conversation: lexical & topic diversity [78], personalization & tailoring used by human coaches [139], as well as conversational language style [124]. I specifically proposed two systematic content diversification strategies informed by cognitive space theory: target-diverse and self-diverse. The target-diverse strategy uses concepts related to the target concept (e.g., “exercising”), while the self-diverse strategy uses concepts related to an individual (e.g., motivations for exercising) to inform diversification. I paired these strategies with topic-based tailoring informed by an individual’s values profile [209]. I evaluated both strategies in a lab study as well as in a 2-week long field deployment. I demonstrated that the conversational design of triggers based on these diversification strategies results in higher perceptions of informativeness and helpfulness, as well as reduced annoyance and boredom (Study 1), and most importantly can lead to higher real-world exercise completion (Study 2). Designers and practitioners in health & behavior change could use the proposed strategies to improve the effectiveness of their motivational approaches. Designers of conversational systems can also leverage these strategies to inform diversification in conversational design.
Finally, aside from the designs themselves, this work proposed a systematic process employing crowdsourcing and computational semantic relatedness for effectively reproducing the designs for use in different settings. Chapter 4 CONVERSATIONAL REFLECTION: DESIGNING FOR PHYSICAL ACTIVITY & WORKPLACE PRODUCTIVITY This chapter aims to address the challenge of helping users reflect and learn from their activities, which relates to the stages of ‘integration’ and ‘reflection’ from Li’s five-stage personal informatics model [142]. In terms of stages-of-change [191], this goal aligns with challenges users experience predominantly in the ‘Contemplation’ and ‘Preparation’ stages. Existing technology for supporting this goal in behavior change uses visual analytics dashboards or journaling, which oftentimes rely on substantial prior user motivation, effort, and graph literacy [87] to be effective. The goal of the work presented in this chapter is to engage users in ‘reflection’ on their activities and their data using a conversational approach inspired, among others, by approaches employed by human coaches. I design and evaluate the use of a conversational approach for reflection in two settings: physical activity and work productivity. Each setting presents a unique set of challenges, and both have been indicated in prior work as in need of support for reflection [139, 134]. Physical activity relies on user self-defined actions and goals, is largely personal, and is likely performed in a private setting. In the productivity setting, work tasks are to some extent assigned by others and need to be reported. This setting also involves a semi-public space for interaction (i.e., the office). Through the course of these works, I explore two approaches to domain-specific dialogue content generation: 1) workshop-based and 2) literature-based, and also investigate the differences between two modalities of conversational interaction: 1) voice-based and 2) text-based.
4.1 Physical Activity Setting 4.1.1 Background In the physical activity tracking context, mobile and wearable consumer devices allow people to collect and examine large amounts of data about their activities, behavior, and wellbeing. However, a gap remains between our ability to collect and visualize data, and our ability to learn from, and act upon, this data in meaningful ways [143]. A key component for bridging this gap is to facilitate reflection [23, 45, 144]. The value of engaging users in reflection has been identified as a key element of successful health behavior change [144, 158]. Through the process of reflection, users can increase their self-knowledge [22], formulate realistic behavior change goals [139], and increase self-control while promoting positive behaviors [144]. Reflection has been considered an impetus that moves the individual from examinations of his or her data to action [23]. Technology Support for Reflection Despite the importance of reflection, behavior change models reveal little about how reflection can, or should, be triggered [22]. Consequently, technology has struggled to successfully support reflection in practice [82, 196]. As noted in [22], “prior work carries an implicit assumption that by providing access to data that has been ‘prepared, combined, and transformed’ for the purpose of reflection, reflection will occur.” Indeed, one of the main means of facilitating reflection using technology relies on visualizations of self-tracking data, such as Fish’n’Steps [148] and UbiFitGarden [53] for physical activity, Affect Aura [161] for affective states, and LifelogExplorer [132] for stress. The other approach relies on journaling [188], such as SleepTight [45] for sleep and Affective Diary [149] for manual journaling of emotions. Both of these approaches assume that reflection will occur naturally when data is presented. However, reflection is time consuming and not necessarily something that comes naturally to people [82].
In many cases people need a reason to reflect, or at least an encouragement to do so [98, 168]. Human Health-Coaches Taking inspiration from personal counseling, supporting reflection through conversation seems like a promising approach. Several personal counseling techniques, such as motivational interviewing [198] and commercial behavior change programs (e.g., Weight Watchers [113]), rely on engaging and insightful conversations with the goal of triggering reflection on one’s own activity. Personal coaches “repeatedly ask questions to get at hidden motivations,” and asking reflection questions can help people understand and articulate their underlying needs and goals [139]. Such conversations can elicit contemplative [114] and metacognitive [81] thinking, encouraging people to think about needs and wants beyond the first answers that come to mind. In this chapter I therefore investigate: How should a conversational system facilitate reflection on physical activity? Further, can a conversational system support reflection that is engaging rather than burdensome? 4.1.2 Design Approach The design process for the Reflection Companion conversational agent involved two parts: 1) a workshop with activity tracker users to generate reflection questions, and 2) modification of the questions to fit a dialogue context and formulation of two-step reflection dialogues. Workshop-based Content Creation I organized workshops with activity tracker users to prompt them to write questions about physical activity structured by a provided reflection framing. Structured reflection models provide insights for designing reflection-centered interactions and offer support for how reflection can be supported to evolve with time [82, 109]. Such models see reflection as a process with stages or levels.
Atkins and Murphy [14], in their review of the literature on reflection, identified three commonly-shared stages: 1) awareness of an uncomfortable feeling and thought, 2) critical analysis, and 3) development of new perspectives. My approach for structuring the reflection dialogue aligns with these three stages, which for simplicity I refer to as the stages of Noticing, Understanding, and Future actions (Figure 4.1). Figure 4.1: Reflection depicted as a process with stages or levels, synthesized based on multiple structured reflection models. Another critical component of a conversational system for reflection are the questions that trigger users to reflect. In this work, I employed a workshop-based approach to generate a set of reflective questions that could be used in the dialogue to trigger reflection. Working with 12 existing users of activity trackers (8 female, 4 male) with an average age of 27.3 (SD=2.9), the workshop approach helped me generate a diverse set of reflection prompts (Table 4.1). Workshop participants generated a total of 275 questions in the 3 categories I prompted for: Noticing (n=76), Understanding (n=116), and Future actions (n=83). Following analysis of the generated questions, I found the questions within one reflection stage were not all the same and, in fact, could be further sub-categorized by topical aspects of interest. I decided to perform this categorization to be able to later select the most diverse representatives for each discovered category. To do that, I performed affinity diagramming among 3 researchers. The most frequent categories are presented in Figure 4.1. These categories represented different specific aspects of behavior change the participants wanted to reflect on.
Conversational Agent Design Based on the outcomes of the workshops, I set out to design a system with the following three goals: 1) to guide users towards deeper reflection on physical activity through dialogue progression, 2) to provide engaging, novel, and diverse conversations around reflection, and 3) to enable interaction on personal mobile devices (the predominant platform used among workshop participants). Table 4.1: Examples of reflective questions generated during the workshop sessions. Questions are grouped by the main prompted categories (rows) and categories identified through affinity diagramming (columns). Only the 6 most frequent categories are shown. The five white cells represent intersections for which the workshop participants generated no questions. For creating diverse and novel questions, I suggested questions for these intersections. I designed a conversational system - Reflection Companion - that engages users in reflection on aspects of physical activity through reflection prompts. Reflection Companion uses SMS/MMS for the conversational exchanges. It initiates a short conversational exchange with an opening question sent once a day at a random time within a time range specified by the user. I implemented this system as a PHP server using the Twilio API1 for managing the SMS/MMS exchanges (Figure 4.2). To generate graphs of users’ physical activity, I used the Fitbit API2 to download the latest synchronized user activity data periodically throughout the day. 1https://www.twilio.com/docs/usage/api 2https://dev.fitbit.com/build/reference/web-api/ Figure 4.2: Example of actual user exchanges with the system’s mini-dialogues on the left. On the right, a block diagram of an example dynamic mini-dialogue with: actual user replies, user intents recognized based on free-text replies, and the system’s tailored follow-ups. The red boxes represent a path where the user reply was not recognized and has been handled by a “generic” (non-tailored) follow-up.
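The daily scheduling logic described above - one opening question per day at a random time within the user's specified window - can be sketched as follows. This is a minimal Python illustration of the scheduling step only (the deployed system was a PHP server using the Twilio and Fitbit APIs); the function name and hour-based window are my own assumptions.

```python
import random
from datetime import datetime, timedelta

def next_prompt_time(day, window_start_hour, window_end_hour, rng=random):
    """Pick a random send time for the daily opening question within the
    user-specified window [window_start_hour, window_end_hour) on `day`."""
    start = day.replace(hour=window_start_hour, minute=0,
                        second=0, microsecond=0)
    span_minutes = (window_end_hour - window_start_hour) * 60
    # Uniformly random minute offset inside the window.
    return start + timedelta(minutes=rng.randrange(span_minutes))
```

For example, a user who chooses a 9:00-17:00 window would receive each day's prompt at a uniformly random minute in that range, with the actual SMS/MMS delivery then handed off to the messaging API.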
To make the reflection conversation engaging and to encourage a deeper level of reflection, I employed three strategies: the use of a two-step mini-dialogue structure, everyday short reflection sessions, and personalization. Here I further specify the details of these strategies. Guiding Towards Deeper Reflection through Mini-Dialogues: To support deeper reflection, I used a question-follow-up-question design, or what I will refer to as a mini-dialogues design. The mini-dialogues have an opportunity to direct user reflection towards deeper levels by bringing users’ attention to different aspects of the reflection process based on the user response to the initial question. To build such mini-dialogues, I created follow-up questions to most of the initial reflection prompts. I followed the progression of the reflection process: questions about awareness would be followed by questions about understanding, whereas questions about understanding would be followed by questions about future actions. The follow-up is asked only after the user provides a response to the initial question. I designed 25 different mini-dialogues; 10 of them have the same follow-up question regardless of what the user writes in their initial response, while the remaining 13 mini-dialogues feature a dynamically tailored follow-up question. In such dialogues, a different follow-up question is delivered depending on the user’s initial response. The tailored follow-ups are designed in such a way as to build upon the user’s initial response and encourage a deeper level of reflection on the shared information. For example, if the initial question asked: “What are some of the ways that your work has impacted your physical activity this week?” and the user replied with “Work impacted my exercise because I sit at a desk most of the day”, then the follow-up question would be “What could you do to prevent your work from impacting your physical activity?” (Figure 4.2).
On the other hand, if a user replied to the same question with: “I walked a lot this week at work because we were changing offices”, then the follow-up would be: “How could you set up your work to help you be more active in the future?” In this example the mini-dialogue is trying to guide the user from understanding how the work impacted her activities to future actions that can help with being more active. This follows the progression suggested by the structured reflection models depicted in Figure 4.1. Everyday Reflection Session: An important aspect in the design of the system was the frequency of prompting users to engage in reflective conversations. Too frequent requests for reflection can potentially make the topics to reflect on repetitive and can lead to boredom or frustration, given similar activity data and the finite diversity of the mini-dialogues. On the other hand, too infrequent reflection can cause people to forget previous revelations, preventing them from building on past observations and disrupting support for reflection as a process [218]. Human-provided counseling sessions happen infrequently, no more than once or twice a week, similar to the frequency of meetings observed in programs such as Weight Watchers [113]. These sessions are, however, much deeper and more extensive than what the Reflection Companion can currently support. The mini-dialogues are designed to provide brief moments of reflection, rather than support full motivational interviewing sessions. Given indications from past work that users of mobile activity trackers frequently engage in short awareness interaction sessions with their data within one day [93], along with further feedback from the workshops, where active tracker users indicated checking their data on their mobile phone at least once a day, I decided to prompt users daily.
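The tailored-versus-generic follow-up mechanism, including the fallback path for unrecognized replies, can be sketched in a few lines. The deployed NLU was more involved; the keyword lists, intent names, and generic follow-up text below are assumptions for illustration only, using the work-impact example above.

```python
# Hypothetical follow-ups keyed by recognized intent (texts from the example above).
FOLLOW_UPS = {
    "negative_impact": "What could you do to prevent your work from "
                       "impacting your physical activity?",
    "positive_impact": "How could you set up your work to help you be "
                       "more active in the future?",
}
# Fallback used when no intent is recognized (the "generic" path).
GENERIC_FOLLOW_UP = "What does this tell you about your physical activity?"

# Illustrative keyword lists standing in for the real intent classifier.
INTENT_KEYWORDS = {
    "negative_impact": ["sit", "desk", "busy", "meetings"],
    "positive_impact": ["walked", "active", "moved", "stairs"],
}

def pick_follow_up(reply):
    """Match the free-text reply to an intent; fall back to a generic
    (non-tailored) follow-up when no intent is recognized."""
    words = reply.lower().split()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in words for k in keywords):
            return FOLLOW_UPS[intent]
    return GENERIC_FOLLOW_UP
```

The fallback keeps the dialogue coherent even on classification failure, which is consistent with the later finding that most non-tailored follow-ups still offered an acceptable match.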
Providing Personally Relevant and Diverse Conversations around Reflection: To make the reflection dialogues engaging, I personalized the experience by introducing questions that referenced users’ own behavior change goals using an introductory phrase such as: “Hi Jake, you listed as one of your goals: ‘taking regular breaks daily’...”, after which a reflection question would be presented. The introductory phrases changed each time to provide for a more natural experience. Five mini-dialogues referenced users’ behavior change goals. These mini-dialogues were template-based and automatically used the user-reported daily, weekly, or long-term goal. Each dialogue also addressed users by name and employed a friendly conversational tone following indications from [131, 239]. Furthermore, in order to make the reflection focused and personally relevant, 17 mini-dialogues were delivered with a graph showing the user’s physical activity metrics (15 plotting steps, one calories burned, and one sleep). 14 of these graphs showed a week’s worth of data, and 3 showed a comparison of two weeks of steps (see Figure 4.2). To provide an explicit link between the data shown in the graph and the reflection questions, these mini-dialogues would open with phrases such as: “Hi Kate, please take a look at your graph...”. Such introductory phrases again varied each time to provide a more natural experience. Finally, to diversify the dialogues, to keep users engaged for longer, and to avoid boredom, following indications from [30], I made the dialogues differ in terms of the behavior change aspect (reflection topics) they addressed. Following the categorization from the workshop presented in Figure 4.1, 8 dialogues were related to observations/patterns, 6 to goals, 4 to plans/schedule, 3 to tracking and general/context, and 1 to motivations.
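The goal-referencing, template-based personalization with varied introductory phrases can be sketched as follows. This is an illustrative reconstruction, not the system's actual templates: the intro phrasings and the appended question are assumptions, with the first intro modeled on the "Hi Jake" example above.

```python
import random

# Assumed intro templates with slots for the user's name and self-reported goal;
# varied each time to provide a more natural experience.
INTROS = [
    "Hi {name}, you listed as one of your goals: '{goal}'...",
    "Hello {name}! Thinking about your goal of '{goal}'...",
]
# Hypothetical reflection question appended after the intro.
QUESTION = "What is one small step you could take toward it today?"

def build_prompt(name, goal, rng=random):
    """Fill a randomly chosen intro template with the user's name and goal,
    then append the reflection question."""
    intro = rng.choice(INTROS).format(name=name, goal=goal)
    return intro + " " + QUESTION
```

Each call draws a different intro, so repeated daily prompts referencing the same goal still vary in surface form.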
I also diversified them in terms of the starting reflection level - 11 started with noticing, 8 with understanding, and 6 with future actions - and question format - 15 were closed questions and 10 were open questions. This is on top of delivering some of the mini-dialogues with associated activity graphs. 4.1.3 User Study To evaluate Reflection Companion’s performance, conversational design choices, and ability to trigger reflection and encourage participation, I conducted a 2-week field study approved by the university’s Institutional Review Board. Participants: A total of 33 active Fitbit users (29 female, 4 male) between the ages of 21 and 60 (M=36.5, SD=11.2) were recruited through social media. They had used Fitbit for at least 2 weeks, were willing to provide access to their Fitbit data, and were willing to receive up to 4 SMS/MMS messages per day on their mobile phone for a period of 2 weeks. Participants logged 10,133 steps per day on average (SD=6,521, range: 1,768-36,757) during the week before the study. Five participants logged fewer than 5k and 13 more than 10k steps per day. 19 of the 33 participants were interviewed after the study. Procedure: At the start of the study, participants provided access to their Fitbit data. Then they completed a survey, in which they shared their daily, weekly, and long-term behavior change goals and indicated the time frame during which they would like to receive the reflection mini-dialogues. During the study, participants received one mini-dialogue per day over the course of 2 weeks, delivered to their mobile phones via SMS/MMS. At the end of the 2 weeks, participants completed a post-study survey. Finally, they were able to choose to use the system for 2 more weeks without additional compensation (I clarified their decision would not affect payment). Measures: To assess the impact and success of Reflection Companion, I looked at measures of engagement.
I looked especially at participants’ willingness to use the system for an additional 2 weeks without compensation. Prior work indicates that continuous engagement intention is strongly related to perceived value and satisfaction with the system [125]. Participant interactions with the system were logged and analyzed. This includes the number of dialogues responded to, the time until a response was made (and whether a reminder was used), as well as the length and content of responses. These measures, along with continued participation, were used to assess engagement with the system. I further collected self-reported health awareness (a 9-item questionnaire adapted from [106]), level of reflection around self-tracking (Kember’s 12 items [122]), and general mindfulness (13 items [229]). Changes in pre- and post-study scale ratings were analyzed using paired t-tests. Further, user replies to mini-dialogues over the two weeks were analyzed and categorized. Semi-structured interviews (40 minutes on average) were conducted and audio-recorded following the study. Interviews were first transcribed, and quotes related to each of the categories covered in the interview were extracted using a closed, selective coding approach following a general procedure for analysis of qualitative data described in [138]. 4.1.4 Results I present the results of the field deployment in the physical activity setting by looking at engagement measured by system use behavior, impact on user self-reported reflection, and feedback from the interviews. Given that the system relied on limited NLU for user intent recognition, I also report system performance. Engagement: During the 2-week main study deployment, Reflection Companion sent a total of 462 prompts and 429 follow-ups, receiving 829 responses from participants. Participants responded to 96% of all initial questions and to 90% of the follow-up questions.
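For reference, the paired t-test used for the pre/post scale comparisons computes the t statistic over the per-participant differences. The sketch below is a standard textbook implementation with stdlib only, not the actual analysis script (which could equally use an off-the-shelf statistics package).

```python
import math

def paired_t(pre, post):
    """Paired t statistic for pre/post ratings: t = mean(d) / (sd(d)/sqrt(n)),
    where d are the per-participant differences post - pre. The two-tailed p
    value would then be looked up against t with n-1 degrees of freedom."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n-1 denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

In the study, pre and post would each hold the 33 participants' scale scores (hence t(32) in the results below).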
While 11 participants responded to all questions, the lowest participant response rates to initial and follow-up questions were 23% and 64%, respectively. The overall response rate stayed fairly consistent, indicating generally high engagement throughout the study. However, Figure 4.3 shows a decline in the length of responses as the study progressed, decreasing from an average of 170.1 characters in the first week (SD=31.8) to 138.1 characters in the second week (SD=17.0). Participants took 50 minutes on average to respond to the first question and 13 minutes to respond to the follow-up. Reminders were sent in 39% of cases. Encouragingly, 16 out of the 33 participants elected to continue using the system for 2 additional weeks without reward. Furthermore, these participants continued to engage with the system at a high rate, responding to 83% of the initial questions and 76% of the follow-up questions during the additional 2 weeks. The average response length during the additional 2 weeks was 98.4 characters (SD=74.9). Figure 4.3: Response rates to initial and follow-up questions, and average response length in characters, for the 14 days of the core study. Table 4.2: Summary of pre- and post-study measures. The levels from Kember’s survey are mapped to the stages of reflection in the structured reflection process. System Performance: Reflection Companion relied on NLU classification to categorize free-text user responses to select an appropriate follow-up. For the 224 replies logged for these dialogues in the core two weeks of the study, more than 72% were automatically matched with a known intent and resulted in presentation of a tailored follow-up.
We coded the quality of the follow-up question into: Good match (the follow-up question provided a good continuation of the dialogue), Acceptable match (the follow-up question only partially built upon the user response or required users to repeat some of the initial response), and Poor match (the follow-up question made no sense in the context of the user reply). Out of the automatically recognized intents, 92% of the presented follow-up questions were a good (69%) or acceptable (23%) match. This means that the system made very few “hard” mistakes, such as recognizing that the user expressed a negative impact of work on physical activity where in fact the user described a positive impact. For the 62 (22.68%) cases where the system was not able to recognize any intent from the user response, and for which a non-tailored follow-up was presented, 92% of the presented follow-ups offered a good (58%) or acceptable (34%) match, with only 8% “hard” mistakes. Impact on Reflection: Analysis of user responses to the reflective mini-dialogues provides numerous examples where dialogues were successful in supporting discussions around awareness related to goal accomplishment, self-tracking data, and trends in behavior: “I like to be active on the weekend and it catches up to me on Mondays so I take it easy, then it’s back to working out on Tuesdays and Wed.” Mini-dialogues also appear to have helped participants to better understand their behaviors and helped users draw connections between the step count and their context. Additionally, participants reflected on multiple higher-level aspects such as the value of physical activity, the meaning of a healthy lifestyle, and the value of comparing oneself to others: “My best friend is a doctor and has 3 kids and exercises way more than I do. (...)
So sometimes I feel lazy when I compare myself to a friend, but most of the time I realize this is my life and comparing myself to someone else is not a mentally healthy practice, so I give myself grace.” They also often reflected upon things that worked for them: “Jogging helps me towards the goal of jogging a half marathon. Writing out my training plan on a calendar has been helpful.” as well as things they could possibly change: “Short runs before or after work. I enjoy running but I don’t often make the time anymore. Standing at my desk more. Taking breaks not just at lunch. Getting a dog.” Aside from reflection, the dialogues provided additional benefits. For example, the prompts enabled users to vent: “Annoyed that some of them are thin without even putting in that much effort. Sometimes annoyed that I can try so hard for less rewards” and also often served as additional reminders: “Today is my first day back at work so I have not done it yet - will do it if I go to a diff floor”. At the same time, self-reported ratings (Table 4.2) indicate a significant difference in Habitual Action (HA) from pre (M=3.16, SD=1.06) to post (M=3.53, SD=0.89) study measurements, t(32)=-2.0386, p<0.05, and a weakly significant increase in Understanding (U) from pre (M=3.60, SD=0.98) to post (M=3.92, SD=0.84), t(32)=-1.8994, p=0.07. The increase in the Understanding level indicates an increase in users’ analysis of the situation from different perspectives, formulating explanations and observations about the reasons for the things noticed. On the other hand, the increase in HA is somewhat surprising, because it relates to activities performed habitually. One likely explanation for the increase in HA is that the system enabled a decoupling of the activity (here, physical activity) from reflecting on the activity (here, taking place when engaging with the system).
Interview Feedback Types of Reflection Triggered: The 19 interviews confirmed and expanded on the results of the analysis of user responses to the mini-dialogues, showing that the system was successful in triggering reflection on past activity patterns, on possible future actions, and on new, previously not considered aspects. Increased awareness: Ten participants reported that the system increased their awareness of past physical activity. It specifically helped them realize how much they were recently doing and notice repeatable patterns in their own physical activity: “It made me more aware that I am doing more steps when I’m at home and on the weekends. It just made me much more aware of how little and how much I’m doing on certain days.” (P8). Four claimed that the system helped them think about how they currently plan and allocate time to their activities: “Got me to go back through my data and my calendar, and really stop and spend time thinking about, ‘Okay, am I really prioritizing this or not?”’ (P14). Another four reported that it led to them thinking about the relationship between activities, data, and health outcomes: “It opened my eyes to a few things. . . how my steps were affected by what sleep I had. . . and tracking my patterns on what days I did what.” (P10). Alternatives and Future Actions: Eight interviewees reported that interacting with the system led to reflection on the actions they were currently taking to achieve their goals and made them critically re-evaluate these actions to think about possible alternatives: “I definitely thought about whether I was doing as much as I could to be able to reach those goals. More about what were the barriers that were making it where I wasn’t reaching those goals.” (P13).
The prompts also triggered thinking about planning possible strategies to achieve enough physical activity based on what they had learned from the past: “Partially, it’s about reflection, but it’s more of planning ahead, like what I should do and what I will do... by reflecting on the past behavior.” (P20). Such reflection was for many participants a prerequisite for trying out new behaviors. New Insights: Four participants indicated that interacting with the system led them to reflect on aspects they had not thought of before, such as considering possible alternative metrics: “It got me thinking about what other interesting metrics are there? I had never really thought about what I track or pay attention to that carefully. I just kind of use whatever the given dashboard is.” (P14). In other cases, it triggered critical thinking about how they currently use the tracked metrics, and what they can learn from them. The system also introduced new ways to evaluate data by presenting them in a different timeframe (e.g., two weeks): “It was my first time to see an overview of my weekly activity... I had never done it before. Thinking in a way of a week cycle was interesting... Thinking of two weeks in parallel, is there any seasonality or any cycle.” (P20) Benefits of Reflection: Reflection was beneficial in many ways: it increased motivation towards physical activity, introduced changes to participants’ actual behavior, increased mindfulness, and encouraged formulation of more realistic strategies for increasing activity. Increased Motivation: Participants found the reflective dialogues to be motivating. Five reported that the mere presence of the prompting mechanism provided focus, kept them in check, and consequently led to increased motivation.
In some cases, the daily presence of the dialogues created a sense of accountability, which provided additional motivation: “They were a form of encouragement to me, because it’s like I knew that there was accountability on my part, that if I had a poor day that I had to explain why, reflect on that on, what would I do the next day.” (P22). Eight further reported that the dialogues helped them realize their barriers, formulate clear action plans, and define small, concrete, and attainable steps for achieving their goals. Interviewees considered these aspects to be motivating: “It was like ‘What little changes could I do?’ And that was helpful ’cause like making the time for an hour workout every day seems daunting, but going for a walk on my lunch is doable. Going for a walk after work is doable.” (P25). Leading to New Behaviors: For many participants, engaging in reflection resulted in the adoption of new behaviors. These behaviors were usually small changes to daily routines, such as parking further away from the office or a parking meter to walk more, walking to a grocery store instead of taking a car, or using stairs instead of an elevator: “I actually did little things to make myself more active during the day. The prompts got me like, one day I’m talking about walking more during break, and so since then I’ve made a point to get out of the office and walk during my lunch. Just doing little things.” (P25). In some cases, the dialogues served as an additional push on top of a request from a family member, e.g., a request from a participant’s daughter to go for a walk, or an evening walk with a wife in the case of another participant. In some cases, the prompts also triggered a return to past behaviors that had been abandoned: “It actually got me to get back into running, which is what I had gotten out of for a little while so that was kind of nice.” (P24).
In a number of cases, the mini-dialogues led to behaviors that facilitate physical activity, such as wearing a Fitbit more often, downloading an additional app for tracking running progress, or scheduling a class at the gym: “After I would get the message, if I hadn’t already scheduled class at the gym for that day, it would usually be a good reminder.” (P14).

Increased Mindfulness and More Realistic Plans: Six of the interviewees said that the mini-dialogues helped them better assess their progress and become more mindful of their own tendencies and inclinations: “I realized something about myself that I like to work out...[by doing] another activity. For example, going to the museum.” (P14). In many cases, this led to an increased understanding of factors that help participants meet their goals, or barriers that prevent them from doing so: “I guess just becoming more aware of the barriers to some of the stuff keeping me from my goals.” (P26). This helped interviewees realize the need for specific and realistic actions to achieve their goals: “I think it helped me be more realistic. A lot of times where you’re like ‘Oh I can do this in a month or something like that.’ But in reality, it’s a lot tougher so it’s nice to have that reflection” (P24).

Impact of System Features: Additionally, I explored the impact of key elements of the system on user experience: the two-step mini-dialogue structure, continuous reflection through daily conversations, the need for typing & sending a response, and personalization using the activity graph of personal Fitbit data, as summarized in Table 4.3.

4.1.5 Discussion

In this work, I argue that a conversational approach, using what I refer to as a “mini-dialogues” design, can be effective in eliciting reflection. Indeed, in the deployment, the Reflection Companion conversational agent successfully led to reflection at three levels: awareness, understanding, and new insights for the future.
I show that such reflection can help users become more motivated and can lead to defining action plans better aligned with users’ long-term goals and actual abilities. Here I further discuss some aspects of the approach.

Table 4.3: Summary of the positive/negative aspects of the system design choices based on feedback from participants.

Benefits and drawbacks of reflection on physical activity: I have shown that reflection helps increase awareness and mindfulness, and triggers consideration of new aspects. This is supported both through the interview data and through the pre-post study increase in the “understanding” rating. I also found that reflection activities can serve as a prerequisite to better goal setting and more feasible future actions. While most of the participants did not revise their physical-activity goals during the study, many reported that the 2-week period was too short to compel such a revision. I found, however, that reflection serves as a preparation for considering new goals and feasible future actions. Further, I found that reflection provides a non-judgmental, neutral interaction that was appreciated by many participants. The reflection activities offer participants a break from the often judgmental and persuasive nudges built into current behavior change systems. Nevertheless, for some, a concern was that reflection activities are not necessarily actionable. Finally, I should note that reflection might potentially lead to discouraging revelations (e.g., less activity than expected), as noticed in the exploratory workshop. Encouragingly, I did not notice any indications of the mini-dialogues having such negative effects during the field deployment, but this still remains a remote possibility.

Insights About Designing a Conversational Agent for Reflection: Through the study, I uncovered three key benefits of the conversational approach to reflection. One is its ability to actively shape the direction of user thinking.
I found that the mini-dialogues, by building up on user responses, have an ability to guide user thinking in a specific direction. I also found that having multi-step conversational exchanges extends the time a user spends reflecting, and that everyday conversations can help users learn over time. Last but not least, the conversational approach provides an engagement boost through perceived accountability and commitment (even if the user is aware of talking to a computer system). The act of typing and committing to an answer brings benefits of precision in planning, deeper thinking, and accountability. However, there are also drawbacks to using a conversational approach. First, doing so runs the risk of building up and disappointing user expectations. Second, conversational interfaces are, at least currently, harder to design for; more effort and resources are required. One key challenge in building a conversational system for reflection (or a conversational system in general) is to generate a set of sufficiently diverse and topic-appropriate dialogues. This is especially important for the purpose of continuous, everyday coach-like interactions.

Extending the Long-Term Use of Reflection Companion: In order to make the dialogues even more engaging, especially for longer-term use, a number of potential approaches, such as diversification, tailoring, and memory & adaptation, can be explored. Diversification focuses on making the dialogues novel each time. It can be applied at the syntactic (sentence composition), semantic (topics), and dialogue-structure levels. Diversification, however, does not build up on past exchanges or on the increasing knowledge collected about the user to make the conversation more engaging. Another approach involves improved personalization & tailoring. Reflection Companion used personalization by addressing the user by name, presenting a plot of personal data, and weaving user goals into selected mini-dialogues.
The topics introduced by the dialogues were, however, not tailored to the user’s interests in any way. Future work could explore tailoring at the level of topics of interest using, e.g., Schwartz’s 10 basic values, representing universal motivational constructs [209], which I have used in my work described in Chapter 3. Yet another option could be tailoring the dialogue structure itself, which has been explored for cultures [62]. Arguably most valuable for long-term use, but also most technically challenging, would be to remember aspects users shared and adapt the mini-dialogues to include those aspects. Currently, the user’s response to the initial prompt is classified and “remembered” only to decide on the follow-up to present. Unfortunately, no long-term memory or common ground is retained. This requires asking each time, e.g., what the user’s barrier is for a specific goal or activity, or having to switch to a new topic to avoid repetition. Remembering information from users’ past responses has obvious long-term benefits: it allows deepening the reflection on relevant topics over time, it communicates to the user that the shared information is appreciated, and it partially addresses the issue of topic exhaustion, as dialogues can also go in depth on one topic over time.

4.1.6 Conclusion

In this work I introduced a mobile-phone-based conversational system for supporting reflection on everyday physical activity - Reflection Companion. The system prompted users daily to engage them in reflection on various physical-activity-related topics. Interaction was in the form of mini-dialogues incorporating the user’s personal goals and activity graphs (from a Fitbit tracker). The conversation questions were generated via workshops with activity tracker users and informed by a structured reflection model to help fuel content diversity.
In a 2-week deployment with 33 users, I found that Reflection Companion offered an engaging interaction, with half of the users electing to actively use the system for an additional 2 weeks outside of the study without any compensation. The system was also successful in increasing user awareness, supporting reflection on activity alternatives and future actions, and prompting new physical activity insights. Users reported feeling more motivated, mindful, and encouraged to try new behaviors. On top of that, the users linked the majority of the reported benefits to specific conversational design elements. They attributed deeper and longer reflection and more truthful sharing to the two-step mini-dialogue design. They linked the ability to build up on prior reflection and the lower cost of reflecting to the continual daily interactions. Further, they connected the need for deeper thinking and precision, as well as the sense of commitment and accountability, to having to type & send responses knowing that “someone” is reading them. Finally, they attributed the sense of progress, focus, attention, and personal relevance to the weaving of personal activity graphs and individual goals into the conversation. The system offers empirical evidence for the value of conversational reflection, proposes a design process for feasibly realizing such a design, and further offers insights into promising future directions for the most impactful improvements.

4.2 Workplace Productivity Setting

4.2.1 Background

For knowledge workers in companies, keeping track of work activities and accomplishments can be a useful practice, but one that can be hard to sustain. Awareness of one’s own activities and reflection on aspects of learning at work are important for professional development [205] and can lead to tangible performance improvements [64].
It builds worker confidence in the ability to achieve goals [64], improves the depth and relevance of individual learning [168], supports the emergence of self-insight and growth [166], and consequently leads to performance increases [115, 250]. Performance increases are said to come from understanding the causal mechanisms behind actions and outcomes [250] and from learning from accumulated past experience [64].

Challenge to Reflection at Work

Yet increasing time pressures in the modern workplace make taking time to step back and engage in efforts to learn from one’s prior experience seem like a luxurious pursuit [63]. Employees would rather gain additional experience doing the task than take time to articulate and codify what they learned from prior experiences. In fact, this kind of ‘doing more’ behavior is still encouraged in many workplaces [64]. Finally, reflection itself is time consuming and not necessarily something that comes naturally to people; they usually need a reason to reflect, or at least an encouragement to do so [98, 168]. Supporting reflection through computerized systems has been identified as a vital field of research [22, 153], with computer-supported reflective learning specifically in work settings being identified as crucial [134]. Still, few systems exist for supporting reflection in the workplace.

Potential for Conversational Support

To help with professional development and learning from work activities, institutions of career counseling and development exist in larger companies [19] as well as outside of company structures [189]. Conversational agents, whose use is growing in popularity, stand to play an important role in supporting behavior change and well-being.
While chat bots and other “virtual assistants” have been motivated by, developed, and tested in a variety of contexts, from customer service [62, 181, 240] to health-related behavior change [27, 186] to simulated job interviewing [145], our focus is on the role of conversational agents for organization, productivity, and self-learning in the workplace. In such settings, user needs may be different, and avoiding disruption of work and improving efficiency are important. Prior work also identified potential benefits of talking to an agent instead of a human in contexts where people are less afraid of being judged and more willing to disclose [155]. In this chapter I therefore explore: How can we design a conversational experience to support reflection in the work context?

Interaction Modalities of Conversational Agents

Furthermore, there are indications of differences in the impact of modalities on user behavior and perceptions. In a movie recommendation context, spoken queries were longer and more conversational, with more subjective features, than typed queries [116]. A study on using voice for providing edits and comments in writing tasks showed that voice-based comments may be easier and more natural to leave (as opposed to text) from the point of view of an editor, and also that people leave different types of comments using the two modalities [174]. This, combined with a recent poll [94] showing increased adoption of voice interfaces (63% of Americans surveyed use voice assistants such as Apple Siri, Google Assistant, or Amazon Alexa), makes it interesting to explore the use of the voice modality for reflection as well. I therefore also explore: What are the differences between voice and text-based conversational reflection?
4.2.2 Design Approach

The design process for the Robota conversational agent involved three parts: 1) generation of meaningful reflection questions for the workplace context based on existing knowledge on reflection in learning [168], education settings [8], behavioral questions from job interviews [222], and career development sources [226]; 2) designing dialogues combining work activity reporting and reflection for voice and chat modalities; and 3) designing supporting personal informatics elements: a dashboard and reminders.

Literature-based Content Creation

For conversational reflection in the productivity setting, I generated a collection of work-related reflection questions inspired by structured-reflection theoretical frameworks such as Moon’s reflection in learning [168], Gibbs’ reflective cycle [89], and Bain’s 5Rs framework [16]. I also drew from concrete examples of reflection questions in educational settings [8], behavioral questions from job interviews [222], and career development sources [163]. I attempted to cover the following categories with my questions, aiming to encourage workplace reflection:

Task-related questions: These questions ask about tasks and activities and how aspects of these tasks and activities may contribute to learning; for example: “How can you make the activities you planned for today more enjoyable for yourself?”

Planning and organization: These questions focus on understanding factors affecting performance and learning points from the organization of work within the scope of a day as well as a week; for example: “How satisfied are you with how you organized your work today? Is there anything you have learned?”

Short-term and long-term activities and goals: These questions focus on realizing relations between activities and goals, barriers to goal accomplishment, as well as on exploring the value of having a longer-term goal; for example: “Do you feel the activities you did today contributed to your goals?
Why or why not?”

Motivation and satisfaction at work: Questions in this category trigger exploration of sources of positive and negative emotions at work, as well as moments of satisfaction; for example: “What were some of the most satisfying moments at work for you this week and why?”

Personalized questions: Questions in this category include dynamic elements extracted from the user’s work journal entries; for example: “Did <task> help you learn anything new that could be valuable for the future? What did you learn?” Past work identified the use of a record of events as one successful way to enhance reflection [111]. Such a record can be looked at again to provide time and focus attention on different aspects of the experience on each return, especially if some guidance as to what to focus on is provided [216]. These questions further highlight the link between the journaling activity over Slack and continued engagement through the reflection questions.

Conversational Agent Design

I designed and implemented a custom conversational agent called Robota (which means “work” in Polish) to support workplace journaling and reflection. Workers interact with Robota through chat and voice, and can explore past interactions through a web dashboard. Figure 4.4 illustrates the overall architecture of the system: the core Robota logic is implemented in the cloud as a timed state machine using Python’s Flask and SQLAlchemy frameworks on top of a MySQL database. This common backend supports the chat and voice modules as well as the web dashboard, described later.

Chat Modality (Slack bot): I implemented Robota’s chat module as a “Slack bot” via the Slack API. The bot can send and respond to direct messages on Slack (a Slack bot appears just like a person in the user’s contact list). A journaling prompt, illustrated in Figure 4.5, consists of an introductory message followed by a request for accomplished activities.
Robota then asks the user to record her plans. The user responds in open, unconstrained text. In addition to journaling, the chat module is responsible for delivering chat-based reflection questions and for prompting the user to perform voice-based reflection (described next).

Voice Modality (Amazon Alexa Skill): I used the Alexa Dash Wand, a handheld, cloud-connected device with a built-in speaker and microphone that allows the user to take it to a quiet room and speak to it discreetly. The Dash Wand supports the Alexa Voice Service (AVS) and custom-built apps (called “Skills”). I implemented a custom skill using the Amazon Alexa Skills Kit API. To prompt the user for voice reflection, Robota sends a Slack message asking the user to initiate reflection. This links the Slack journaling and voice reflection as parts of the same system, but also gives the user the freedom to initiate voice interaction at a convenient time. The user then holds down the Dash’s button and says “Start Work Reflection.” Robota speaks one of the reflection questions (described later) and listens for the user’s response. The user may ask Robota to repeat the question. Robota then speaks a ‘thank you’ message and also sends a ‘thank you’ message on Slack.

Figure 4.4: System architecture of the Robota conversational agent. A common backend supports chat interaction as a Slack bot and voice interaction as a custom Amazon Alexa Skill using an Amazon Dash Wand.

Figure 4.5: An example of interaction with Robota using the chat module, in this case, a mid-day journaling prompt.

Table 4.4: Work activity journaling prompts for different journaling schedules (schedule selected by the user).

Footnotes: Slack - http://slack.com and https://api.slack.com/bot-users; Flask - https://www.fullstackpython.com/flask.html; SQLAlchemy - https://www.sqlalchemy.org/; MySQL - https://www.mysql.com/; Dash Wand - https://www.youtube.com/watch?v=s7IExS483wE
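To make the "timed state machine" idea concrete, the per-user prompting logic can be sketched in plain Python. This is an illustrative sketch only, not Robota's actual implementation: the state names, message texts, and reminder intervals are my assumptions (the real intervals are only described as "long and growing").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum, auto
from typing import Optional

class State(Enum):
    IDLE = auto()
    AWAITING_JOURNAL = auto()

@dataclass
class UserSession:
    """Per-user slice of a timed prompt/reminder state machine (illustrative)."""
    state: State = State.IDLE
    due: Optional[datetime] = None
    reminders_sent: int = 0

    def schedule_journal_prompt(self, at: datetime) -> None:
        """Arm the machine to prompt for a journal entry at the user-chosen time."""
        self.state = State.AWAITING_JOURNAL
        self.due = at
        self.reminders_sent = 0

    def tick(self, now: datetime) -> Optional[str]:
        """Called periodically by the backend; returns a message to send, if any."""
        if self.state is State.IDLE or now < self.due:
            return None
        # Growing gaps between successive reminders (hypothetical doubling).
        self.due = now + timedelta(minutes=30 * 2 ** self.reminders_sent)
        self.reminders_sent += 1
        if self.reminders_sent == 1:
            return "What work activities did you accomplish today?"
        return "Gentle reminder: your work journal entry is still waiting."

    def record_response(self, text: str) -> None:
        """A Slack reply arrived; stop reminding until the next scheduled prompt."""
        self.state = State.IDLE
        self.due = None
```

In the real system this logic would sit behind the Flask endpoints and persist state via SQLAlchemy; a plain in-memory dataclass keeps the sketch self-contained.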
Finally, information collected through both the chat and voice interactions is gathered in the user’s dashboard, described next.

Supportive Personal Informatics Elements

Web Dashboard: To allow users to review their work journal entries and their responses to reflection questions, I implemented a web-based dashboard. The dashboard uses badges to represent each day to encourage continued participation. Reviewing journal entries and reflections for a specific day is done by clicking on a badge. Due to the low performance of speech-to-text services, for user responses through the voice module I provide links to the original voice recording instead of a (likely faulty) transcription. Finally, to support sharing work reports with others, the dashboard includes a link to a weekly compilation of all journal entries.

Chat-based Reminders: An important aspect of designing successful conversational agents for the workplace is to balance engagement and interruptions. Since reflection questions were designed to follow and, in some cases, rely on journal entries, I implemented a reminder strategy that used long and growing time spans for subsequent reminders.

Footnote: Alexa Skills Kit - https://developer.amazon.com/alexa-skills-kit

4.2.3 User Study

I conducted a 3-week, within-subjects controlled deployment with 10 participants from a company lab (3 female; 7 male). Five participants were between the ages of 25 and 34, three were 35 to 44, and one each fell in the 18-24 and 45-54 age groups. None of the participants were involved in this research project. Participants included three research staff, four interns, and three developers/support staff, and represented a diverse set of accents: only 2 of the 10 participants were native English speakers; the rest included Japanese, Chinese, and French accents. This is particularly challenging given that Robota used voice in one of the conditions.
Procedure: At the beginning of the study, participants completed a short survey, and each participant also chose when they wanted Robota to prompt them to journal their activities and plans, choosing among morning, mid-day, and end-of-day journaling (as described above). During the first week of the study, participants used Robota for daily journaling only, through Slack (Journaling-only condition). In the following 2 weeks, participants responded to reflection questions (10 questions total: one question a day, for two 5-day workweeks) through chat (Chat-Reflection condition) for one week and using voice (Voice-Reflection condition) in the other week, in counterbalanced order. At the end of each week, participants were asked to compose a weekly report and respond to a survey. Finally, participants completed an end-of-study survey and took part in a short interview.

Measures: On the Friday before the beginning of the study, participants were asked to write a weekly report summarizing their work activities and to evaluate the difficulty of writing the report, the report’s clarity, and its level of detail. Every Friday afternoon throughout the study, participants similarly wrote a weekly report of their work activities and provided ratings. In addition to weekly reports, participants responded to questions regarding their interaction with Robota during the week. At the ends of weeks 1, 2, and 3, these included questions about the journaling activity. For example: “Did logging your daily activities influence your work? If so, how?”, “Did logging your daily activities influence writing the weekly reports? If so, how?”, and, on a 7-point Likert scale: “How easy or difficult was it to log daily activities?”. At the ends of weeks 2 and 3, these included questions about the modality they used. For example, open-ended: “What are the main things you liked about using the chat bot to reflect on your work?”.
And on a 7-point Likert scale: “How easy or difficult was it to respond to the reflection questions?”. For the final survey, at the end of week 3, participants were asked about the value of reflection: “What benefits, if any, did you get from reflecting on your work (using either the chat bot or Alexa)?” and to directly compare their interaction with the voice and Slack channels: “Considering the two methods for reflecting on work (the chat bot and Alexa), please compare your experience of the two.”

4.2.4 Results

I present the results of the field deployment in the work productivity setting by looking at system use, the user-reported value of work journaling support, the impact on work reflection, and differences in the impact of the voice and chat modalities.

System Use

Participants used the system consistently throughout the study, responding to 99% of the activity journaling and reflection requests. Responses arrived within a median of 31 minutes. Robota sent a total of 174 reminders for journaling. Robota also sent 98 requests for reflection, followed by 59 reminders (34 in the Chat-Reflection condition and 25 in the Voice-Reflection condition). The average length of a daily activity log was 292 characters (SD=239.62). The average length of a response to reflection questions using the chat modality was 131 characters, compared to 98 using the voice modality.

Value of Work Journaling Support via Chat

Through analysis of the end-of-week surveys and interviews, I found that all participants rated journaling as useful for composing weekly reports. Participants reported that the daily activity journaling helped them directly with work tasks by: 1) increasing their awareness and productivity and 2) helping with composing reports. Still, a number of challenges to journaling surfaced, mainly: no tasks worth recording for a day, a lack of perceived progress, and duplicate entries for long-running tasks.
Increased Awareness & Productivity: Three participants reported that journaling increased their thinking about their daily activities and work organization, as well as led to increased awareness of progress: “Sometimes it made me realize that there was little progress on some days” (P9). Two others felt that journaling positively impacted their productivity, mainly through the aforementioned awareness of limited progress: “If I found I didn’t make much progress on a day, I would try to do more on the next day.” (P4), or through concern that they would have nothing to report at the end of the day: “Maybe more productive. I don’t want to have nothing to be logged at the end of a work day.” (P10). Five other participants, when asked directly in a post-study interview, reported that journaling had no specific impact on their work awareness and productivity. In one case, it was because the participant already regularly journaled her activities (P6). In the other four cases, participants did not feel a direct impact on their work, as journaling itself didn’t suggest concrete changes. They, however, still reported an indirect impact, such as help with keeping track of time and tasks (P3, P8), assistance with work organization (P5), and help with deciding on the relevant tasks to pursue (P2).

Helped with Composing Reports: All the participants considered daily activity journaling useful for composing weekly reports. For eight individuals, activity journaling helped by making it easy to recall things done throughout the week: “I didn’t need much effort to remember this week’s activity because I logged it on Robota every day.” (P7). Some also felt it helped them make sure they did not miss any important points in their reports: “I can refer to these logs to have a better summarization without missing important points.” (P4).
For four participants, daily logs served directly as source material for copy-pasting relevant items into their weekly reports: “I simply picked the important points from the daily reports and used them.” (P2). For two people, daily logs helped with the organization of their reports: “Yes, I think it helped me to remember and organize what I have done.” (P9). Finally, for two more participants, having all the relevant information about their activities in one place helped them avoid collecting information from various sources: “It was easier to compose from Robota logs because I didn’t need to go back and forth within different sources for collecting my activities.” (P7).

Work Reflection with Robota

Eight participants rated the act of answering reflection questions as useful, somewhat useful, or neutral (eight in chat, six in voice, and six in both). Comments from the interviews suggest that the reflection aspects of the system helped participants: 1) improve work organization, 2) look at their work from different perspectives, and even 3) consider the higher-level goals of their careers.

Improved Work Organization: Three participants mentioned that the reflection prompts made them think about how they organize their daily activity: “It makes me think about the efficiency, the organization, and other things. This will further help me increase my efficiency.” (P4). In some cases, it also helped with planning activities and making sure that important things are not forgotten: “Remind me that some things are needed to do.” (P5).

Helped Gain New Perspective: Six participants indicated that reflection with Robota gave them opportunities to think about the value of activities they perform: “It made me keep track of what I have learned from my work, which was different from what I usually write on daily reports” (P9), or encouraged new ways of thinking about work: “Robota pointed out what I haven’t thought ever and it was a good chance to think about it.” (P7).
Finally, they also reported that it was valuable to find some time to think more deeply about their activity: “Helps me take a moment to be reflective, almost meditative, during the day about the process of how I work instead of just thinking about the content of the work.” (P6).

Helped Consider Higher-level Goals: Three participants also discussed how Robota helped them think about the meaning behind their work: “Force me to think about the impact of things I did.” (P5). Reflection also helped some participants consider their higher-level goals at their current workplace: “Reflection questions lead me to think about what brings me satisfaction, what I have learned. It was helpful for considering my goal at [company].” (P7).

Challenges with Reflection Questions: Not all the reflection questions were seen as equally valuable. A number of questions were considered too abstract and hard to even answer: “The questions are too general and sometimes hard to have a specific or informative answer.” (P10). The flexible and unscheduled nature of some participants’ work made questions about planning and organization irrelevant. A participant whose main job is to offer technical support for others said: “So far, I haven’t found it very useful to do work reflection, mainly because my daily task(s) are pretty ad hoc and the question posted to me may not be very relevant.” (P1). Four participants appreciated questions that explicitly referenced their logged activities: “My favorite reflection questions were the ones specific to my daily log.” (P2). However, personalized questions may sometimes incorrectly ask about tasks that are not as meaningful: “I felt that some questions were too specific and I often didn’t have anything meaningful to reflect on related to the question asked.” (P2).

Designing for Voice vs. Chat

A key goal of our work was to explore the specific value and limitations of the voice and chat modalities in the workplace.
Looking at self-report measures, a paired-samples t-test shows that responding through voice was seen as less easy (M=2.6 vs. M=4.0; t(9)=5.62, p<.001) and more annoying (M=4.3 vs. M=3.2; t(9)=−2.28, p=.05). Participants’ complaints about the voice modality mostly stemmed from (known) limitations of voice-to-text transcription and limitations of the Dash Wand in particular. Nevertheless, a number of comments revealed a potential value of the voice modality that looks past current technical limitations.

Advantages & Challenges of Voice Modality

Separate Channel for Reflection Valuable: Four participants considered the ability to use a separate voice channel for reflection useful, mainly due to being able to quickly capture some of their thoughts: “It’s good to have another means to quickly capture some useful points or thoughts.” (P1). Three participants also considered interaction via voice as being more like having a personal conversation with someone who cares about them: “[voice] has a slightly more personal feel to it” (P4), “This interaction is nice. I felt like Robota is caring about me.” (P7). This feeling even led two participants to consider the voice-based agent as more of a counselor, or even a machine they could share with: “It does make it feel more, it makes me feel more reflective. Almost like a counselor or a therapist.” (P4), “At the moment I am unhappy. That’s the moment I want to complain and the machine gives me an opportunity to complain and that’s very good.” (P8).

Easier to Answer Questions with Voice: Two participants felt they could generally answer questions faster with voice. They appreciated that they didn’t need to type anything while answering: “It doesn’t take much time to answer, is easier than writing report on Slack.” (P7).
Interactive, fun and engaging: Still, the fact that the reflection questions were revealed only after interacting with the Wand had the potential to be more engaging and even fun: “It was kind of neat to use the wand and have the voice reveal to me what the mystery reflection question was.” (P6), “Talking to a machine is somehow fun.” (P10). Perceived pressure to respond immediately: Although participants were told they could listen to a reflection question and then call the skill again after some time to respond, most felt the pressure to respond immediately after being asked: “While using voice, it seemed to encourage me to answer right away, which is a bit stressful” (P10). Such need to respond quickly made people feel they had less time to think about their answers: “You also have less time to think while speaking it aloud. So I’m not sure if the essential points are captured.” (P4). Listening to own responses inconvenient and uncomfortable: Two individuals felt that reviewing voice-based responses afterwards was not ideal: “It is not transcribed and listening to what I said many times is somehow troublesome.” (P9). There was also a dislike for hearing one’s own voice played back: “Chat-robota was easier to review my answers after logging. (Sorry I felt uncomfortable to listen to my voice...)” (P7). Advantages & Challenges of Chat Modality Easier to read questions, think about response: Half of the participants felt that it was generally easier and faster to read the question: “Reading is much faster than listening.” (P9). They also felt they could take more time to re-read the question if needed, think about it, and then respond: “It was easier to read the question and think about it” (P2). Easier to reply in own time and describe details: Seven participants felt that chat-based interaction allowed them to enter their responses at their own pace: “As you type in, you can pause and think.” (P4).
They further felt that typing makes it easier to describe the details. As most of our participants were non-native English speakers, this perceived ease of typing sometimes came from the contrast with having to describe things in voice in a foreign language: “It’s easier to answer than explaining in a voice. Since my English is not so good, I couldn’t answer to a question immediately if I have to speak.” (P9). Easier to review and change responses: Three participants liked how typed responses were editable: “I also could more easily change my response with the chatbot before submitting.” (P2). Also, having their reflections in text made it easier to review afterwards using the dashboard. Typing is time consuming: Still, needing to type responses made some participants write more concisely: “Sometimes the answers to the questions are a bit complex, but I write something that is simpler and reductive because I don’t want to spend time detailing it out on slack.” (P6) Slack seen as less personal: Two participants mentioned that reflecting on Slack, as compared to voice, felt less like having a conversation and more like formal reporting of activities: “It is slightly less personal [Slack], maybe the voice felt a bit more personal” (P4), “Typing on slack is slightly more formal I guess, it is something that goes into the record” (P7). 4.2.5 Discussion The field study provided some initial insights into workers’ behaviors and reactions to using a conversational agent via different modalities. Participants generally appreciated having a structured way of reflecting on their activities for planning and goal-setting. Unlike many existing workplace reporting tools, my design supported workers’ individual work styles by including journaling prompts for different parts of the workday. Some participants chose mid-day journaling to encourage themselves to be more active.
Interacting with the agent via chat (as designed in my system) made non-native English speakers feel they could more easily read and respond to the questions. At the same time, interacting with the agent via a separate voice channel had the potential to be more engaging and personal (e.g. voice modality seems more suited for complaining and being more reflective). These add new dimensions to consider when designing for behavior change. Here we provide further design considerations for future work based on the findings from our field study: Combining the benefits of both modalities: For the purposes of the study, I limited users to only interact with the voice or chat modality for one week each, and saw that each modality had pros and cons. However, outside of a controlled study environment, users could be provided the opportunity to choose which modality they wished to use on a day-by-day basis, based on their current context at the time of journaling. Additionally, the system could rely on contextual cues to prompt the user to log and reflect in one modality versus another based on what it infers to be the most appropriate form. My findings further suggest that certain reflection questions may also be better suited for certain modalities. For example, questions that are more personal or require a deeper level of reflection may result in more valuable reflection activities when using voice-based input. Integration with the work setting: Many participants mentioned that one benefit of using text interaction within Slack was that it was seamlessly integrated with a tool and platform (on their computer) where they were already doing much of their work. Perhaps for this reason, using a personal device that takes a person away from their desk to speak out loud and reflect upon personal topics may be less-suited for information workers whose day is primarily carried out on a computer in a public or semi-public space.
The benefits of a mobile or portable solution for journaling and reflection may be greater for different types of workers, where daily activities are more mobile and occur in different settings; for example, people who engage in site visits or inspections, or frequently travel to visit customers on sales calls. 4.2.6 Conclusion I introduced Robota, a conversational agent for workplace journaling and reflection that combines chat and voice interaction using a common backend. The three-week long deployment of Robota with knowledge workers revealed numerous benefits and challenges of conversational reflection in the work setting. Robota successfully engaged workers in activity journaling, increasing their awareness of work progress and productivity, and helping them compose reports. Robota’s reflection questions helped workers improve work organization, gain new valuable perspectives on their everyday work tasks, and even engaged them in thinking about higher-level career goals. At the same time, I identified several challenges with literature-based reflection question generation, which led some users to consider questions too abstract, irrelevant for their specific context, or related to tasks they didn’t find valuable to reflect on. These reveal context-matching challenges and provide opportunities for future design improvements. Furthermore, the comparison of reflection via voice and chat modalities highlights tradeoffs between the modalities and points to areas likely to benefit from intelligent sensing. The chat modality was reported to support easier reading of and thinking about the reflection questions, replying at a convenient time, describing details, as well as reviewing responses and making changes. On the other hand, this modality was considered more time consuming due to typing and less personal than voice.
At the same time, the voice modality was reported as a valuable separate channel just for reflection that is also more personal and caring. Furthermore, voice afforded the ability to quickly capture thoughts, respond faster to reflection prompts, and a generally more fun and engaging experience than chat. On the other hand, voice introduced more pressure to respond immediately, as well as the inconvenient and uncomfortable need to listen to one’s own responses (when revisiting reflections). My work provides a practical design for supporting conversational reflection at the workplace, tackling several challenges related to semi-public space and the use of voice. I offer insight into the benefits of reflection via different modalities and identify future improvement directions. 4.3 Discussion on Supporting Conversational Reflection in Both Settings Reflection Companion and Robota were both conversational agents designed to engage users in reflection. While sharing a common purpose, they were each designed following a different content generation process, they were deployed in different settings, and they interacted with users via different modalities. These similarities and differences offer several opportunities to make comparisons. 4.3.1 Comparison of Impact Both agents were generally successful in engaging users in reflection on physical activity and workplace productivity respectively. Users of both Reflection Companion and Robota reported benefits of increased awareness of their activities, specifically in relation to the progress they are making. In both settings conversational reflection also helped users gain new perspectives. In physical activity this related to critical examination of their current behavior and alternatives for physical activities, while in the work setting this related to new ways of thinking about work and the value of the work tasks they perform. With both agents, it seems that users also benefited from increased mindfulness.
In the physical activity context this amounted to being aware of one’s tendencies, motivations, inclinations, and barriers, while in the work setting this related to reflecting on the meaning behind work and higher-level career goals. Interestingly, some of the more nuanced uses of the agents, e.g., venting, were shared across the settings as well. 4.3.2 Comparison of Conversational Reflection Design Approaches The topics of conversation in Reflection Companion were generated via workshops with active users of physical activity trackers. The generation was guided by a structured reflection model. Reflection questions in Robota, on the other hand, relied on past literature on work reflection as well as on multiple less formal counseling and interviewing resources. While it is hard to make strong claims about the impact these different generation processes might have had on user engagement, it is worth noting that several users of Robota complained about questions being too abstract to answer or not applicable to their type of work. Such issues could have been due to the conversational reflection prompts not being sourced within the target user population, but relying on one-size-fits-all generalizations about the work context. 4.3.3 Impact of Modality & Interaction Channel Robota explored Slack- and voice-based interaction, while Reflection Companion relied on mobile SMS/MMS text messages. Users in the work setting considered voice interaction to be more personal and more like talking to someone than Slack. In the physical activity setting, the mobile SMS interaction was also reported by many as personal, due to the channel being used mostly for communicating with friends & family. On the other hand, Slack was considered much more formal and led users to spend more time on ‘refining’ their answers. Despite the fact that typing on Slack, rather than speaking, seems like an interaction more similar to mobile text messaging, the impact seemed different.
Slack was more associated with work tasks & within-office communication. It seems that the perception of a separate channel dedicated to personal reflection was the important distinction, not necessarily the interaction style itself. There were, however, differences in how people interacted with voice; specifically, they felt a pressure to respond immediately, but they were also more spontaneous in what they shared. Finally, voice had a quality of being perceived as more interactive, fun, and engaging. Typing, on the other hand, offered more of a commitment in text and something that goes ‘on the record’. 4.3.4 Private vs. Semi-public Space The physical activity reflection via Reflection Companion took place on the user’s personal mobile device, within the hours specified by the user (which could be during or outside of work hours), and revolved around individual health-related goals and activities. On the other hand, reflection on work activities via Robota was much more tied to the work context - interaction took place within office hours, on work-related tasks, and partially on a work-related medium (i.e., Slack). In the design of Robota, I was aware that given such a setting, users might be inclined to treat the reflection as another work chore rather than a personal benefit. On the other hand, they may also want the reflection time to be tangibly beneficial to their work (as it ‘consumed’ their work time). From the results, it seems that having Robota support work activity journaling & reporting (work support benefit) as well as providing reflection prompts of a more personal nature via a separate voice channel (personal benefit) was a generally successful design strategy for engaging users in reflection at work. 4.4 Summary of Contribution In this chapter I examined a conversational approach for engaging users in reflection on physical activity as well as on work tasks & productivity.
Despite reflection being an important step of behavior change [142, 73] offering multiple benefits [23], the support for engaging users in reflection is limited in current technology [22]. To address this gap I designed two conversational systems - Reflection Companion & Robota. With Reflection Companion I demonstrate an effective process for designing reflection dialogues involving workshop-based question generation informed by structured reflection models [14]. With Robota I demonstrate the use of past literature and contextual resources for informing work-related reflection dialogues. Both agents personalize their interactions by involving the user’s fitness-tracker-based activity graphs & behavior change goals, in the case of Reflection Companion, and the user’s journaled work tasks, in the case of Robota. Robota also explored the voice and Slack modalities for engaging users in reflection. Both designs contribute to informing the design of technology to support reflection and offer reusable processes to follow to achieve similar designs in different settings. I further evaluate both systems in deployment studies in personal physical activity and workplace productivity settings for 2 and 3 weeks respectively. I demonstrate that both systems were successful in engaging users in interaction and meaningful reflection leading to increased awareness, critical thinking, prompting new behaviors, and increasing motivation. I also demonstrate that these benefits were directly linked to specific aspects of the conversational design. Furthermore, I compare the trade-offs in conversational reflection design for private and semi-public space as well as with voice and chat modalities. Designers and practitioners in health & behavior change could use the proposed strategies to effectively engage users in reflection.
Also, designers of conversational systems can leverage aspects of the proposed designs to inform dialogue design with external data (e.g., activity graphs, work tasks), and for leveraging domain-specific, conversation-topic-based content diversification (e.g., sourcing reflection topics).

Chapter 5
CONVERSATIONAL DATA COLLECTION: DESIGNING HEALTH & SOCIAL NEEDS CONVERSATIONAL SURVEY

This chapter aims to address the challenge of data collection in health & behavior change settings. In this chapter I apply conversational design to help engage users with self-reporting their social needs & health data at hospital emergency departments. This data is collected for use by providers to offer patients assistance with supporting their social & health needs. While data collection in personal informatics relates to the collection of data about oneself for personal use, my work in this chapter supports the collection of personal data for potential use by a health provider and not necessarily directly by an individual. While this is a bit different from the classic personal informatics model from Li et al. [142], I note that: 1) Personal informatics already goes beyond just keeping the data personal, with users explicitly sharing their data with 3rd parties such as family and friends [4, 46], health providers [47], or implicitly with tracker manufacturers. In all these cases the challenges of trust and communication are present. 2) Challenges of engagement with data collection are prevalent in both settings, especially when collection relies on user self-reporting. Given these, I believe the setting in this chapter aligns well with the challenges of the ‘collection’ stage of Li’s personal informatics model and my findings are similarly applicable to such settings. 5.1 Background Assessing patients’ social needs is becoming a critical challenge at emergency departments (EDs) [157].
EDs are designed to attend to acute conditions (e.g., heart attacks, accidents), but increasingly, especially in public safety-net hospitals, are the first point of contact for vulnerable populations with long-term social needs (e.g., homelessness, poverty, hunger) [157]. Unprepared for these kinds of challenges, the EDs have to devote valuable resources (hospital beds, medical staff time) to understand the needs of such visitors and connect them to social services that can offer much better help. 5.1.1 Challenges of Collecting Data From Vulnerable Populations Unfortunately, engaging vulnerable populations in sharing their social needs is hard to accomplish with traditional surveys. Such populations are also often of low health and general literacy, are wary of sharing personal information in formal, impersonal surveys, and are more willing to engage with and respond to an interviewer rather than to fill out a survey [32]. However, most EDs do not have extra staff to administer survey-based screeners, and without personnel administration (by a research assistant, nurse, etc.), response rates for both paper and electronic surveys are low [32]. This is compounded by the fact that only 12% of Americans have proficient health literacy, which makes response rates especially low for low health literacy patients [136]. 5.1.2 Technology-based Data Collection vs. Human Interviewing With the growing interest in clinical screening, research has examined the use of technology-based solutions to support the self-administering of surveys via online survey platforms, mobile apps, and electronic kiosks to maximize scalability and speed of data collection while reducing cost [96]. Despite these advantages of existing technology-based solutions, face-to-face administration is still better when it comes to engagement and response rates (in one study, a 92.8% face-to-face response rate compared to a 52.2% web-survey response rate) [104].
These differences, also observed in the ED context [44], have been linked to the motivating impact of interpersonal interactions, but reproducing such effects via technology is still a challenge. 5.1.3 Potential of Conversational Approach Chatbots offer multiple potential benefits for social needs screening. Chatbots are systems designed to engage with users through natural language, mimicking a human-to-human interaction [127]. Popular examples of chatbots include Apple’s Siri, Google Now, and Microsoft’s Cortana. Extended to the context of social needs assessment, a chatbot can support the self-administering of social needs screeners to minimize personnel cost. In contrast to current form-based surveys, a conversational approach would be more “chat” like, potentially offering a sense of familiarity similar to mobile text messaging [40]. By creating a sense of interacting with another person, chatbots may also increase participant engagement [124]. Furthermore, offering text-to-speech audio output can also facilitate comprehension [96]. 5.2 Design Approach I designed and implemented a custom chatbot called HarborBot to test a conversational approach to survey administration. HarborBot interacts with users through chat and voice. It communicates via chat messages, which it can also read out loud, as if it is speaking. Users interact with the system primarily through buttons (for structured responses) and text (for text-based questions). HarborBot is implemented as a webapp and participants interact with it on tablets.
5.2.1 Design Process To create HarborBot I followed an iterative design process in which a team of 2 senior HCI researchers and 6 design students followed three general design phases: 1) Requirements gathering - based on feedback from 3 ED practitioners & literature [44], 2) Design exploration - prototyped various low-fidelity versions and gathered feedback via small-scale usability tests, and 3) Refinement - the most promising prototype was developed further and refined with positive elements from the others.
Figure 5.1: HarborBot GUI elements. On the left, “Question response types” showing the different response types available to users. On the right, “Control buttons” show the 4 controls associated with each question. “Other elements” show the HarborBot icon and the ellipsis icon HarborBot used for mimicking writing by a person in chat interaction.
5.2.2 User Interface & Response Options I used BotUI (https://botui.org/) - a Javascript framework to build conversational UIs. Users’ messages are distinguished from the bot’s by different colors. Animated ellipses are shown with a delay to denote that the bot is typing (Figure 5.1f). The interface supports different question types: skip - move to the next utterance without responding (Figure 5.1c), yes/no (Figure 5.1A), input - free text response (Figure 5.1C), options (Figure 5.1B), or many options (Figure 5.1D). 5.2.3 Persona I aimed for a balance between serious and friendly tones to help users take the conversation seriously, but also provide comfort when answering personal questions. I avoided the use of humor and sought to make HarborBot empathetic without sounding condescending. To accomplish this, HarborBot used occasional confirmatory phrases, such as: “Okay, I’m getting a better idea of where you are at.”, “Got it”, and assurances, such as: “The next questions are about your personal safety and may be tough to answer.” The use of voice was important for understandability.
HarborBot used a female voice taken from Microsoft’s Bing Voices (https://docs.microsoft.com/en-us/azure/cognitive-services/speech/api-reference-rest/bingvoiceoutput). Users could adjust the volume of the voice or mute it entirely for privacy reasons or personal preference. 5.2.4 Dialogue-Based Interaction Survey questions and user replies were presented as streams of messages in a threaded conversation akin to chat messaging. Each question could be skipped (Figure 5.1c) and the conversation would continue. HarborBot supports rephrasing the question to offer a simplified version (fifth grade reading level) for low literacy individuals (Figure 5.1b). An edit button (Figure 5.1a) is present next to each past answer in case the user needs to change it. On top of these functional aspects, HarborBot would occasionally respond with conversational remarks. These utterances were essential to developing HarborBot’s personality and engaging users in a conversation. Some of these interactions are dynamic, based on a rule-based approach. For instance, if a user indicated they did not have a steady place to live, HarborBot would not ask the remaining housing questions. If the user response indicated a negative social situation, HarborBot would acknowledge it with a sympathetic affirmation, such as “That must be stressful, I’m sorry to hear that.” 5.3 User Study I conducted a within-subjects study with 30 participants (17 male, 10 female, 3 declined to answer, mean age: 39.63, SD=12.91) to compare the experience of answering a social needs survey using two different platforms: HarborBot (Chatbot) and a more traditional interface for taking surveys - Surveygizmo (Survey). I recruited participants with high (19 participants) and low (11 participants) health literacy at two study sites: 1) Seattle metropolitan area and 2) safety net hospital in Los Angeles (Harbor-UCLA).
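The rule-based dialogue branching described in Section 5.2.4 - skipping follow-up questions and inserting sympathetic affirmations - can be sketched as below. This is an illustrative assumption, not HarborBot's actual implementation; all function names, question identifiers, and trigger rules are hypothetical.

```python
# Hypothetical sketch of rule-based dialogue branching (not HarborBot's code):
# skip housing follow-ups when the user has no steady place to live, and
# respond to answers signaling a negative situation with a sympathetic remark.

# Illustrative question ids whose "No" answer signals a negative situation
NEGATIVE_TRIGGERS = {"housing_steady", "food_enough"}
# Illustrative follow-up questions that depend on having steady housing
HOUSING_FOLLOWUPS = {"housing_type", "housing_worry"}

def remarks_for(question_id, answer):
    """Conversational remarks the bot adds after an answer, if any."""
    if question_id in NEGATIVE_TRIGGERS and answer == "No":
        return ["That must be stressful, I'm sorry to hear that."]
    return []

def should_skip(question_id, answers):
    """Skip remaining housing questions if the user reported no steady housing."""
    return (question_id in HOUSING_FOLLOWUPS
            and answers.get("housing_steady") == "No")

answers = {"housing_steady": "No"}
print(remarks_for("housing_steady", "No"))   # sympathetic affirmation
print(should_skip("housing_type", answers))  # True: follow-up is skipped
```

A table of trigger rules like this keeps the branching logic declarative, so new skip conditions or affirmations can be added without touching the dialogue loop.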
The Chatbot interface was expected to be 1) more engaging, 2) more understandable, and 3) more comfortable to share information with, while 4) preserving response quality. I also expected these effects to be more pronounced with low literacy users. Procedure: Participants interacted with both survey interfaces using a tablet’s web browser. After interacting with one interface, participants reported their perceptions and experience and then repeated the same procedure for the second interface. I randomized the order of interaction. After completing both, I conducted an interview. In both platforms users answered the social needs survey (LACHA) consisting of 36 questions related to demographics, financial situation, employment, education, housing, food, and utilities as well as questions related to physical safety, access to care, and legal needs. A number of questions can be considered sensitive, such as: “Have you ever been pressured or forced to have sex?”, “Are you scared of being hurt by your house?”, “Did you skip medications in the last year to save money?” Measures: Participants evaluated the interfaces on workload (NASA TLX survey [200]), engagement in the task (from O’Brien’s engagement survey [180]), understandability of content, and willingness to share information. These measures have been commonly used in prior studies of chatbots [31]. Health literacy was measured using the Rapid Estimate of Adult Health Literacy (REALM) [58] and the Newest Vital Sign (NVS) health literacy scale [234]. During the interviews I asked about preferences for the two survey platforms, the specific features of the platforms, participants’ comfort in sharing information in each platform, and perceptions of the personality of the chatbot.
Analysis: Analysis focused on descriptive statistics of user interactions, especially with Chatbot, and on comparison of answer equivalence for the two platforms. Differences in survey responses were assessed using paired t-tests, and interactions between interface type and participants’ health literacy levels were explored using linear mixed effects models. The interviews took between 7 and 25 minutes (M=17.56, SD=9.21), conducted by three and analyzed by four of the authors. Each researcher wrote a detailed summary of interviews they had not conducted, including quotes. I then developed a codebook following top-down and bottom-up approaches. Initial codes for the top-down pass were informed by the interview questions. I then refined the codes based on themes that emerged from the data in a bottom-up fashion. Each interview summary was coded by a researcher on the team (who had not conducted the interview, or written the summary). The coded interview summaries were used to identify themes. Three of the authors discussed the overall themes until consensus was reached. Researchers consulted the audio and transcriptions of the interviews to ensure validity of the coding. 5.4 Results 5.4.1 Preferences Low health literacy (LL) participants preferred using Chatbot over the Survey, with 8 out of 11 expressing such a preference. At the same time, 17 out of 19 high literacy (HL) participants preferred Survey. This difference was statistically significant (χ2(2, N=30)=12.5, p < .01). 5.4.2 Time to Completion Participants had to respond to 36 questions in the social needs survey, but they could also skip answers. They spent significantly (t(27)=2.23, p < 0.05) more time answering questions via Chatbot (M=9:26 min; SD=3:14 min) than via Survey (M=6:48 min; SD=6:28 min). There was no significant difference in answering time (avg. of both interfaces) between LL (M=9:43 min, SD=3:23) and HL participants (M=7:36 min, SD=4:20).
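As an illustration of the tests used here, the preference comparison can be rerun directly from the counts reported above (8/11 LL preferring Chatbot, 17/19 HL preferring Survey), while the paired timing comparison is shown on simulated per-participant data, since the raw timing data are not reproduced in this chapter. The exact chi-square value depends on how preferences were coded (e.g., whether a no-preference category was included), so the statistic below need not match the reported one exactly.

```python
# Sketch of the statistical comparisons described above, using scipy.
import numpy as np
from scipy import stats

# Chi-square test of preference by literacy group, from the reported counts.
#                 prefers Chatbot, prefers Survey
table = np.array([[8, 3],     # low health literacy (n=11)
                  [2, 17]])   # high health literacy (n=19)
chi2, p_chi, dof, _ = stats.chi2_contingency(table, correction=False)
print(f"chi2({dof}, N={table.sum()}) = {chi2:.1f}, p = {p_chi:.4f}")

# Paired t-test on completion times (minutes); values are simulated for
# illustration, roughly matching the reported means.
rng = np.random.default_rng(0)
chatbot_min = rng.normal(9.4, 3.2, 28)
survey_min = chatbot_min - rng.normal(2.6, 4.0, 28)  # paired differences
t, p_t = stats.ttest_rel(chatbot_min, survey_min)
print(f"paired t(27) = {t:.2f}, p = {p_t:.3f}")
```

A linear mixed effects model for the interface × literacy interactions (as reported for the engagement index) could be fit analogously with `statsmodels.formula.api.mixedlm`, with participant as the grouping factor.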
There was also no significant interaction between the interface and literacy level on time. 5.4.3 Equivalence of Responses An important question is whether the two interfaces result in the same data quality. I explore two measures: per-item response rates and data equivalence. On average, participants provided an almost identical number of answers via the two interfaces: 32.93 (SD=3.48) questions answered with Chatbot and 33.00 (SD=2.95) with Survey. This suggests comparable response rates. In terms of data equivalence, 87.0% (SD=11.6%) of the responses per user were the same across the two interface versions. 5.4.4 Reasons for Response Discrepancies Skipping an answer in one interface, but not the other, was the primary cause of answer discrepancy (48% of mismatches). There was, however, no significant difference between the two platforms in skipping behaviors: 25% of mismatches were a result of skipping a question in Chatbot only and another 23% due to the opposite. Furthermore, the order in which users encountered the interfaces had no significant impact on skip rates: 8.0% (SD=9.3%) when answering the survey the first time, and 7.8% (SD=8.2%) when answering the survey the second time. Hence the platforms are not different in this respect. One interesting finding from the explorations is that there seems to be an anchoring effect, with users skipping more often when starting the study with Chatbot, for their responses to both platforms: Chatbot (M=9.8%, SD=29.7%) and Survey (M=9.8%, SD=27.2%), than when starting with Survey: Chatbot (M=5.2%, SD=22.3%) and Survey (M=4.9%, SD=21.6%). This is most likely due to the skip option being more explicit in the Chatbot and users wanting to be consistent in their answers.
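The per-user response-rate and equivalence measures above can be computed as in this minimal sketch; the answer encoding (with `None` for a skipped item) is an assumption for illustration, and skipping a question in both interfaces counts as a match.

```python
# Per-user response equivalence across the two interfaces.
# A skipped item is encoded as None; skipping in both counts as a match.

def equivalence(chatbot_answers, survey_answers):
    """Return (fraction identical, items answered in chat, items answered in survey)."""
    assert len(chatbot_answers) == len(survey_answers)
    n = len(chatbot_answers)
    same = sum(a == b for a, b in zip(chatbot_answers, survey_answers))
    return (same / n,
            sum(a is not None for a in chatbot_answers),
            sum(b is not None for b in survey_answers))

chat = ["Yes", "No", None, "2000"]    # income typed as per month
surv = ["Yes", "No", None, "24,000"]  # possible misinterpretation (per year)
print(equivalence(chat, surv))  # (0.75, 3, 3)
```

Averaging the first element over users gives the per-user equivalence percentage reported above, and the last two elements give the per-interface response counts.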
Manual examination of the remaining mismatches revealed varied and non-systematic reasons for discrepancies, such as: low equivalence in the very first introductory question (53.3%), direct contradiction (e.g., user answered “Yes” in one interface and “No” in the other), similar but not exactly the same answers (e.g., “Yes, help finding work” vs. “Yes, help keeping work”), ticking an additional option in a multi-choice answer (e.g., “Unemployed - looking for work” vs. “Unemployed - looking for work, Disabled”), and a possible misinterpretation of the question (e.g., when asked for income per month, a user typed “2000” in one interface and “24,000” in the other). 5.4.5 Workload (NASA TLX) Analysis of the NASA TLX survey responses revealed a difference in task load index (avg. of all items denoting workload, α=0.83) between Chatbot and Survey. Participants reported a higher workload when using Chatbot (M=2.460, SD=1.241), compared to Survey (M=2.167, SD=1.284; t(27)=−2.020, p=0.05). Given the scale from 1–lowest to 7–highest, this still represents a low perceived workload. There was also a main effect of literacy level: a higher perception of workload across both platforms by the LL participants (M=2.955, SD=1.335) than the HL ones (M=1.921, SD=0.948; t(27)=2.439, p < 0.05). The interaction effect was not significant. 5.4.6 Engagement, Understandability, and Comfort with Sharing Analysis of the engagement index (average of O’Brien’s engagement questions, α=0.82) revealed a higher reported engagement for LL participants (M=3.920, SD=0.502) than HL ones (M=3.469, SD=0.402) (t(27)=2.672, p< 0.05). There was also a weakly significant interaction between interface and literacy, with LL participants being more engaged with the Chatbot than HL ones, but less engaged with the Survey (Chatbot∗Low, β=0.485, SE=0.262, p=0.064). This represents a half-point increase on a 5-point Likert scale for engagement.
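The composite scores above (the TLX task load index and the engagement index) are per-participant item averages whose internal consistency is summarized by Cronbach's alpha. A sketch of both computations: the alpha formula is standard, but the data below are simulated purely for illustration.

```python
# Composite index (per-participant item average) and Cronbach's alpha.
import numpy as np

def cronbach_alpha(items):
    """items: participants x items matrix.
    alpha = k/(k-1) * (1 - sum(item variances) / variance of item sums)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Simulated responses: 30 participants, 6 items on a 1-7 scale, sharing a
# common per-participant level so the items are correlated (high alpha).
rng = np.random.default_rng(1)
level = rng.normal(3, 1, (30, 1))
items = np.clip(level + rng.normal(0, 0.5, (30, 6)), 1, 7)
index = items.mean(axis=1)  # per-participant composite index
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}, index mean = {index.mean():.2f}")
```

An alpha around 0.8, as reported for both indices, is conventionally taken as adequate internal consistency for averaging the items into one score.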
Trends in the same direction, but no significant differences, were found for understandability and comfort with sharing information.

5.4.7 Interview Feedback

In this section, following a mixed-methods approach, I complement and expand on the quantitative findings. Participants varied not only in their preferences for Chatbot or Survey, but also in the particular aspects they liked about each, as well as in which design aspects were instrumental in creating particular perceptions and experiences. Participants valued the engaging conversational aspects of the Chatbot. LL participants especially found the conversational interface more caring in the context of a sensitive topic. In contrast, HL participants valued the efficiency of the SurveyGizmo interface and felt slowed down by the Chatbot. Some participants found the Chatbot more robotic, disingenuous, or pushy at times, but these perceptions seem to result from the particular way in which HarborBot implemented conversation.

Strengths of the Proposed Conversational Data Collection Approach

Engaging: Most participants, regardless of literacy level, found the chat more engaging than the Survey. Participants felt like they were having a conversation with a person when using the Chatbot. More than half of the participants attributed this perception to the use of voice: “she was reading the questions and I can answer it ... seemed like a conversation ... like someone was talking to me and it gave me the opportunity to answer back and then they answered back” (H59). Other participants felt the ellipses made it feel like having a chat with someone (H76, L77), and even referred to the Chatbot as “she” (8 participants). Some participants valued that the Chatbot felt like a person: “I liked... how it talked to you, reads you the questions ... it spoke directly at me” (L60), “I thought it was someone asking me those questions” (L72).
Aside from the voice and ellipses, the conversational utterances also contributed to the perception of interacting with a person (L75, L58, L72, H32, L60, L36, H41, H59). One participant found them motivating: “Saying ‘you got it.’ It’s giving you motivation ... nice to hear that once in a while” (H73). Another felt like the conversation was adapting to the answers to be more relevant: “seem like they tried to give you a little positiveness based on your answer” (H59).

Caring: Participants perceived the Chatbot as caring, particularly in the LL group. These participants had a generally positive attitude towards the social needs survey questions (L51, L55, L58, H73) and this topic resonated with their personal experiences: “It felt like it was telling me about my life. That was really amazing, like woow” (L71). Therefore, some of the perceptions of the Chatbot might have been accentuated by the positive perception of the survey topic. Many participants described the personality of the Chatbot using terms such as: caring, kind, patient, helpful, calm, familiar, or concerned (H35, H41, L52, L55, L57, L61, L77). Participants also reported that the voice of the Chatbot was aligned with this caring personality: it was soothing (H57), had cadence (H32), helped a nervous participant feel more comfortable (L55), and was “nice and sweet made me feel relaxed” (L77). The Chatbot was designed to provide supportive utterances in response to some of the participants’ answers. Many participants liked these utterances (L60, L36, H32, H41, H58, H59). One participant thought the utterances made him feel “comfortable to answer the questions” (L61), and another that they provided positive reinforcement to keep on answering (H59). Participants perceived Chatbot utterances such as “I am sorry to hear this” as the Chatbot “trying to be understanding” (H59). Some found these utterances to be very applicable to the conversation context.
For example, L61 considered the Chatbot response “That must be stressful” to be a reaction to the information she shared: “she probably said that because of my financial situation” (L61), which she felt would be calming for people: “to not be stressed, I would think it would be helpful” (L61). Other participants felt the supportive utterances gave them confidence: “nice lady giving me confidence ... with good tone of voice” (L75).

Understandable: Several LL participants (5 of 11) reported having trouble with reading and understanding the written questions in the Survey. They liked using the Chatbot because it facilitated their understanding, which they attributed to the audio feature: “When I hear it I have a better understanding of the question” (L61) or “just hearing it I could ... relate better to the question” (L53). Some participants reported using the feature that replayed audio to better understand a particular item (L51, L58, L61, L73). This was especially useful when they missed some words or did not fully comprehend some of the contents at first: “I didn’t get it at first, so I wanted to go back and listen to it again before answering” (L58). Several also mentioned that they would have liked the answers to be spoken via audio as well, to make them more understandable (L61, L54).

Accessible: Some participants had particular needs that the Chatbot was able to satisfy much better than the Survey. One participant who reported vision problems preferred having the questions read to them: “If it is too small I can’t see it so I prefer to have the questions read to me anyways” (H73). Another participant reported feeling very comfortable with the Chatbot because she regularly experienced panic attacks and considered the ED stressful: “I was thinking I was texting somebody ... that made me forget where I was at ... it was like texting my sister my mom and waiting for them to respond back. And that made me feel patient” (L77).
In contrast, she found it particularly difficult to take the Survey: “by myself... it felt awkward and alone” (L77).

Weaknesses of the Proposed Conversational Data Collection Approach

Inefficient: HL participants cared about efficiency, primarily reflected in the speed of completing the survey. The majority of HL participants (17 out of 19) preferred to use the Survey for that reason. Several mentioned that the traditional interface enabled them to be faster than the Chatbot (H21, H22, H24, H59), or to go at their own pace (H36). Participants attributed being slowed down to various conversational features of the Chatbot. Some felt the Chatbot was slower because they needed to wait for the ellipses before a new question would appear (H35). They were also able to read faster than the questions were read by the bot: “when she was talking at me. I felt like I was going at a slower pace” (H23). Not having to engage with additional conversational utterances was also seen as more efficient (H35, H56). The audio feature was perceived as interfering with reading and thinking (H23, H40, H70, H21, L71). One LL participant preferred the Survey because they could concentrate more: “to read is better ... Because that way I could like concentrate more and think about more and you know ... I could read my letters more and makes it better for me.” (L58).

Pushy: Somewhat surprisingly, a few participants perceived the Chatbot as being pushy, based on the tone and the speed at which questions were asked. Some participants felt the questions asked were very direct (H57, L72, H52). L72 felt like he was answering questions to a teacher and had to provide correct answers. H57 and L72 thought there could be more utterances to help prepare the survey taker for some of the very sensitive questions in the survey. H57 also felt that some of the questions were trying to repeatedly get information that he had already declined to provide: “if I say none of the above ... don’t be pushy” (H57).
Others also felt rushed in providing answers to the Chatbot. For example, the use of ellipses and the short delay between messages made it feel like the Chatbot was moving faster than the participants were comfortable with (H23, H63). Participant H63 felt like the questions kept coming and he had no control over when they would be read.

Robotic and Disingenuous Voice: Some participants, primarily in the HL group, perceived the Chatbot as being robotic. Some found the voice unnatural (H21, H22, H23, L58, H59, H63, H70, H76), for example sounding “truncated ... monotone ... seemed pretty artificial to me” (H70). Some perceived the Chatbot as disingenuous when the utterances did not meet their intended purpose (H63, H40, H23, H52): “I feel like they were trying [to make] the software to feel sympathetic, or empathetic, that was weird” (H63). Another participant perceived the utterances as defaults: “it felt like defaults rather than someone ‘feeling for you’” (H40). The perception of artificial responses led another participant to perceive the Chatbot as fake and reminded him of customer support: “kind of just programmed, recorded in, to appear to be more personal ... hell there’s nobody there somewhat disingenuous ... It reminded of ... dealing with the phone company” (H70).

Inconclusive Impact on Willingness to Disclose Information: Most participants, regardless of health literacy level, reported being comfortable sharing the information asked by the survey questions. However, the human-like interactions of the Chatbot did affect some participants’ willingness to disclose information, although participants reported effects in both directions. Some, if they thought they were interacting with a person, felt more reluctant to share sensitive information or tell the truth: “I might be more honest if I’m reading [the question] ... if someone else ask me about them, I might lie” (L72).
Another participant showed concern about the identity of the potential conversational partner: “it was a robot, I didn’t mind, but I think if it was a human being I would mind... and you really don’t know who’s on the other end” (H40). In contrast, some participants were more willing to disclose because of the human-like interactions: “If it says ‘I would like to more about you’. It gives me the confidence to open up, because each question that follow sounds so interesting and it gives me the opportunity to interact with the person on the other side ... it wettens my appetite to give out more information” (L75).

5.5 Discussion

In this chapter, I proposed the use of a chatbot (HarborBot) for social needs screening at emergency departments and compared it to a traditional survey tool (SurveyGizmo). Based on interviews, interaction logs, and survey responses, I demonstrate that the conversational approach is perceived as more engaging by all the participants, and further as more caring, understandable, and accessible among the low health literacy (LL) ones. Importantly, I also demonstrate that the conversational approach results in similar response rates and 87% equivalence in the collected data. At the same time, I found the conversational approach to be more time-consuming (in line with reports from prior work [164]) and prone to be perceived as somewhat pushy, robotic, and disingenuous, which was, however, mostly the perception of participants with high health literacy (HL).

5.5.1 Positive Design Aspects

Numerous strengths of the conversational approach for the LL population can be linked to conversational features. First, various features of the chatbot facilitate understanding. The audio output is especially valuable for participants who are less proficient readers. Second, the ability to ask the bot to rephrase the question offered a way to ask for clarification that is currently not a feature in online survey platforms.
Third, chatbots can create a sense of interacting with a human. The utterances can make survey takers feel cared for and engaged. Such positive interactions made some participants feel relaxed and even motivated to answer more questions.

5.5.2 Challenging Design Aspects

Conversational features felt pushy for some, especially HL participants. Such perception was linked to the tone of the questions and to the speed of the interaction. In terms of tone, it is possible that the literal use of the wording of the survey questions was not the most appropriate for creating a conversational feel. In terms of speed of interaction, the use of voice might be a contributing factor. As reported in prior work, an agent asking questions via voice can create a perception of response urgency [130]. This could be improved by adding assurances like “please take your time”, manipulating intonation, or making it more explicit that the ellipses represent someone typing (rather than the system waiting for a response). The second reason for the pushy feel could be related to the fixed speed of conversation. Human-human conversation involves not only exchanging information, but also coordinating various aspects of the exchange, e.g., its speed [31]. If a participant needs more time to think, a real person would pick it up from verbal and non-verbal cues and adjust the speed. HarborBot is currently incapable of making such adjustments, and such a fixed speed may feel too fast or too slow for some users. HarborBot felt “caring” for LL and “robotic” for HL participants. This might be related to different expectations and tolerance levels for voice quality, and may be improved with the use of a better quality text-to-speech service (a technical challenge), human pre-recorded audio clips (which come with limitations in flexibility), or modifications of intonation and prosody using approaches such as the Speech Synthesis Markup Language3.
Another way may be to generate more personalized and diverse utterances [131].

3https://www.w3.org/TR/speech-synthesis11/

5.5.3 Future Design Directions

Given the division of preferences for chatbot/survey between the HL and LL groups, one possibility for real-world use could be to have two versions of the tool and either intelligently assign or have patients pick the version they would prefer to engage with. While long waits in healthcare settings make it less of a problem, a number of design opportunities can still be explored to make the chatbot interactions more efficient, such as simplifying the script or providing user control over the time between messages. While I focused on examining the effects of the conversational approach for an LL population, my findings suggest a potential for accessibility-focused uses of the chatbot. Participants with vision difficulties mentioned they appreciated the audio output. Further, one participant with anxiety attacks appreciated the human-like interactions, which made them feel like chatting with a loved one at home. Finally, it is not clear based on my results how the conversational approach affects people’s comfort in responding to questions, and any potential desirability biases. Prior work suggests that self-administered screeners reduce social desirability bias and limit under-response on sensitive issues [91]. This is because people do not feel like someone is monitoring or judging them. I expected the Chatbot to strike a happy medium: perceived as human-like enough to enhance engagement, while not so person-like that people would feel uncomfortable with disclosures. It is not clear whether that balance was achieved. Some participants who thought the Chatbot was human-like did not mind sharing and commented that it was more motivating, while others who thought the Chatbot was human-like were concerned with sharing.
It is possible that the very initial greeting from the bot sets the tone for the rest of the interaction [118]. This requires additional research.

5.6 Summary of Contribution

In this chapter I demonstrated the use of a conversational approach to engage users in sharing information about their health & social needs in an emergency department setting. I specifically designed a mixed-modality (voice and text) chatbot called HarborBot for administering a social needs screening survey in a conversational manner. HarborBot included specific conversational language features to boost user engagement: social phrases, empathetic reactions, as well as conversational style and etiquette. It also supported chat-GUI-specific human-likeness features, such as a response delay and an ‘ellipses’ indicator. On top of that, it included features specific to supporting low health literacy populations, such as voice readout and question rephrasing. I evaluated HarborBot with 30 emergency department visitors (11 low literacy) and identified a number of positive design aspects of the chat-based approach: 1) improved engagement due to an increased perception of care, a calm personality, and the feeling of interacting with a human; 2) improved understandability due to the audio support and question rephrasing features (especially valuable for LL users). I also identified that the high health literacy users mostly preferred the traditional survey as it was more efficient. My work advances the understanding of the role conversational agents can play in supporting collecting data from users at scale. It particularly offers insight into designing engaging and understandable conversational interactions for low literacy, vulnerable populations and on sensitive topics. Further, it highlights the impact such interfaces can have on patient-provider information sharing. Designers of conversational interfaces can use my findings to understand the impact of different features on low and high health literacy populations.
Health professionals and social workers can also leverage my work to improve conversational agent adoption in hospital settings to reduce care costs.

Chapter 6

AUTOMATING THE DESIGN OF ENGAGING CONVERSATIONAL DATA COLLECTION

In the previous chapter I demonstrated the benefits of conversational administration of a survey on social needs in the emergency department context for improving user engagement. In earlier chapters I explored some intrinsic conversational design features contributing to improved engagement, such as contents & language diversification, contextual tailoring, socialization & empathy, as well as general human-likeness principles. At the same time, across all the chapters I have demonstrated the substantial effort and work required to design enticing conversational experiences, which has also been identified as a challenge in prior work [97]. In this chapter I explore which common design components can be reused, and how they can be applied automatically, to aid the design of engaging conversational experiences. I explore this design automation support focusing specifically on my work from the last chapter, by attempting to automate the adaptation of survey-based data collection to a more engaging conversational form. Engaging users in sharing information about their health, behavior, preferences, and other aspects is important for successful behavior change [165], health interventions [50], and also in a broad range of other domains [2]. As demonstrated in the previous Chapter 5 and in other work [124, 239], conversational survey administration can increase user engagement, resulting in numerous benefits. Yet beyond hand-crafted examples of conversions of specific surveys, there is no clear systematic way of adapting a survey to the conversational form.
To address these challenges and to systematize the knowledge about conversational design developed in previous chapters, I propose a systematic automated process for adapting any form-based survey to a chat-based conversational form, following 4 augmentation tasks informed by engaging conversation design principles distilled from prior chapters.

6.1 Background

The value and means of performing an automated adaptation of surveys to conversational form are based on several areas of work. Prior work demonstrated the specific benefits of administering a survey in a conversational format [124, 129, 239]. These works also offered examples of what conversational adaptation may look like, even if only for a particular survey and domain. The linguistic literature offers insights into what elements a conversation is composed of and hence can inform the linguistic adaptations required. Finally, several applied approaches help inform the technical solutions that can make an automated conversion feasible.

6.1.1 Engagement Benefits of Conversational Survey Administration

Recent work, including my own, has shown that survey administration in a more conversational form, such as via chat, has the potential to increase user engagement, resulting in higher quality responses and lower drop-out rates [124, 129, 239]. These benefits have often been attributed to chatbots’ ability to naturally deliver human-like interactions [239]. Kim et al. investigated in an experimental study whether it is the chat-like GUI or the specific use of language that provides the conversational administration benefits. They found that for conversational surveys to be effective, a chat-like GUI alone is not sufficient; the language needs to be in a conversational style as well [124].
In Chapter 5 I have shown the value of conversational administration of surveys for user engagement, particularly for questionnaires involving sensitive topics (e.g., about sexual abuse, violence, or financial situation) and applied in sensitive settings (e.g., hospitals) [129]. Xiao et al. emphasized the benefits of the natural and familiar ways in which conversational interaction allows users to express themselves [239]. Personified and anthropomorphic features have also been linked to increased user attention and trust [221, 77]. Even framing the questions as more personalized conversational messages has been shown to have the potential to improve user engagement and response quality [131, 43]. Yet beyond hand-crafted examples of conversational adaptations of specific surveys, there is no well-defined and systematic way of adapting any survey to conversational form. This increases the barrier to entry for survey administrators without a design background to make their surveys more conversational and engaging for their audiences. This therefore leads to my first research question:

• RQ1: How to support the systematic conversion of any survey to a more engaging conversational form?

6.1.2 Linguistic Elements of Engaging Conversational Design

Linguistic theories provide several language elements that could inform augmentation. Proposed by Austin [211] and further developed by Searle [212], speech act theory differentiates between five types of phrases, including representatives (e.g., claiming, reporting), directives (e.g., advice, request), and commissives (e.g., promise, threat). Empirical resources such as [86] offer concrete examples of some common phrases for speech act subcategories. In the conversational agent domain, past work emphasizes the importance of various elements, such as a proper agent introduction [197], ending of the exchange, and a certain level of conversational etiquette [112, 197].
Engaging relational agents incorporated social behaviors such as social dialog, empathy, or expressions of liking [30]. A recent review of social cues identified several elements used in prior work, such as thanking, praise, and many others [78]. Conversational survey administration work focused on a subset of these, such as response feedback [124, 239, 129], social acknowledgments [239], handling conversational flow and transitions [124, 239, 129], response prompting or probing [239], as well as survey question rephrasing [124]. Aside from concrete phrases, prior work also indicates that engaging conversation relies on some overarching language properties, such as ‘conversational style’ [124], avoidance of repetitions [30, 131], lexical diversity [78], degree of human-likeness [38, 39, 15], formality [78], and consistency [112]. Past work and linguistic theory provide common language elements and properties of broader conversation, as well as those specifically used for designing engaging conversational experiences. Not all of these are, however, easily and meaningfully applicable to the survey administration context. Furthermore, while broader categories are defined, concrete phrases directly usable for survey augmentation are limited. Given these indications, my two subsequent research questions focus on the impact on survey respondents and the ability to support these elements well:

• RQ2: What is the impact of a converted survey on user engagement and perceptions of concrete conversational design elements?

• RQ3: Which aspects of conversion are handled well, and which are still problematic for our approach to automation?

6.2 Making a Survey Conversational - Design & Automation

I consider the whole text of a survey to be structured as a sequence of survey items. A survey item can be a question, which requires an answer, or a non-answerable item, such as an instruction, section heading, etc. Each item is a piece of text.
Question items can be answered via selection of one or more answer options from an associated set or via free-text input. The conversational adaptation involves: 1) the use of a chat-like GUI and 2) a number of linguistic adaptations to the survey text and structure. Prior work indicated that in the survey context, a chat-like GUI alone without adjustment to the survey language may not produce the expected engagement benefits [124]. Given these design indications, linguistic adaptations in this work focus on two types of changes: 1) additions of “conversational” utterances in-between the original survey items (e.g., addition of acknowledgments, reactions, greetings) and 2) limited modifications to the existing survey question text to fit the conversational administration context (e.g., phrasing utterances as questions and modifying the language to be consistently “conversational”). At the same time, the survey context imposes certain constraints, most importantly the need to preserve the text of the original survey questions. This is to avoid changes in meaning and preserve survey validity (in the case of validated instruments) [108]. The conversational augmentation of a survey in this work is composed of 4 major tasks:

• Adding introduction & closing for the conversational interaction
• Adding reactions to user answers in question context
• Adding conversation progress communication
• Modifying survey questions to fit conversational style

6.2.1 General Design Principles

Here I describe a number of guiding principles for the design & automation of the conversational survey augmentation, as informed by prior work on conversational engagement and the requirements of the survey context.

Avoiding repetitiveness: I have shown the importance of diversification of language in the design of engaging conversational agents.
In Chapter 3, where I described the use of mobile chat for exercise promotion, repetition caused a measurable drop in user engagement and compliance with exercise suggestions [131]. Similarly, in the conversational reflection work with Reflection Companion in Chapter 4, as well as in conversational social needs screening with HarborBot, described in Chapter 5, repetition was often mentioned in qualitative evaluation as a cause of an ‘artificial feel’ and a potential factor negatively affecting engagement. Prior work on relational agents has also reported that repetition can be severely detrimental to user engagement, even leading to drop-out [30]. The level of lexical diversity is also listed as an important social cue [77]. The augmentation process tries to maximize the diversity of phrases used for augmenting the survey. This is accomplished by: 1) providing several variants for each augmentation phrase, and 2) tracking the frequency of use of particular phrase variants to minimize repetition.

Minimizing changes to the original text: Non-research surveys can be designed for a specific product or a narrow one-time purpose and do not necessarily rely on highly precise language to collect valuable information. However, many surveys used in academic research are validated instruments meant to measure specific latent variables predicted by some theoretical foundations. In such cases the exact phrasing of the questions has been carefully designed and tested to ensure validity and consistency [108]. Major changes in the language could invalidate internal consistency measures in such surveys. I try to minimize the changes to the survey phrasing by: 1) applying only minimal changes that would render a survey item applicable for use in conversation (making it 3rd person & framed as a question), and 2) prioritizing additions (e.g., the prefix “Could you tell me”), rather than deletions of the original question contents.
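As a rough illustration of this minimal-change principle, the rephrasing step can be reduced to prefix addition that leaves the original question wording intact. The function and prefix list below are an illustrative sketch under these assumptions, not the actual implementation:

```python
# Illustrative sketch: conversational rephrasing as prefix addition.
# The prefixes and the lowercasing/punctuation rules are hypothetical
# simplifications of the minimal-change principle described above.

REPHRASE_PREFIXES = [
    "Could you tell me",
    "I would like to know",
]

def rephrase(item_text: str, prefix: str) -> str:
    """Turn a survey item into a chat question by prepending a
    conversational prefix, keeping the original wording intact."""
    # Lowercase the leading letter so the item reads as a clause,
    # and make sure the result ends as a single question.
    body = item_text[0].lower() + item_text[1:]
    body = body.rstrip("?.")
    return f"{prefix} {body}?"

print(rephrase("What is your age?", "Could you tell me"))
# -> Could you tell me what is your age?
```

The key design point is that only material is *added* around the original item text, so a validated instrument's wording survives the conversion unchanged.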
Using empathy only in appropriate contexts: Prior work used varying degrees of emotional expressiveness. Kim et al. employed an expressive casual style, where the chat communicates enthusiasm (e.g., “Way to go!”, “Let’s go to the next step!”) and politeness (“Please go on to the next section”) [124]. Survey questions in that work are related to demographics and product feedback. Xiao et al. used much more personal and emotionally expressive phrasing (e.g., “I am very impressed by what you do”, “Thanks, I’m glad you are happy with me”) in a free-text survey asking for feedback on a game trailer. My own work on the social needs survey indicates that neutral reactions in the context of sensitive questions can lead to a perception of the chat as ‘robotic’, ‘fake’, and reminiscent of customer support, lowering engagement [129]. Relational agents successfully employed expressions of empathy and liking behavior for driving user engagement [30]. Yet, there are also indications that an overly expressive bot can feel disingenuous [24] and lead to heightened expectations [156]. The automated conversational augmentation tries to maintain a neutral style in most of the added utterances. For a survey to remain unbiased, the chat should not try to be too positive or negative. Expressions of empathy are reserved for the context of questions framed in a sensitive manner (e.g., “Do you feel threatened by violence?”), while neutral acknowledgments are used as reactions to questions on neutral topics framed in a neutral way (e.g., “What is your age?”).

Audience-sensitive augmentation: A study of HR chatbot use in a company setting found differences in the appreciation of the socialization aspects of a chatbot among users [147]. In Chapter 5 I have shown that high-literacy populations tend to prioritize efficiency and perceive additional social phrases as unnecessarily lengthening the interaction [129].
In a counseling setting, users may prefer a neutral interaction to avoid a judgmental tone [97]. The language style used for survey augmentation can vary from politely formal [129] to expressive [239] to casual [124]. Therefore, the extent of social chat, empathy, and the style of language may depend on the intended population and context of use for the conversational survey. The amount of social chat (i.e., how many reactions and progress phrases should be used), the tone of the interaction (e.g., whether a neutral or empathetic tone should be employed), and the augmentation style (e.g., formal, informal) need to be adaptable to the audience or even an individual. This is accomplished through: 1) parametrized application of all the augmentation tasks (e.g., the frequency of reactions or progress phrases) and 2) use of a separate augmentation phrase repository (e.g., a polite repository can rephrase “What is your age?” to “Please tell me what your age is?”, the style used in [129]; while a casual repository can rephrase it as “Plz, tell me what is your age?”, the style used in [124]).

6.2.2 Building a Repository of Augmentation Phrases

I defined the adaptation of a survey to conversational form as composed of 4 tasks (i.e., adding introduction & closing, adding reactions, adding progress communication, and question rephrasing) and also provided several design principles. In order to support these tasks, I need a repository of phrases constructed in advance, informed by speech act theory [212, 86] and prior work [78, 239]. I can pick phrases from this repository as needed and inject them between the survey items. Conversational question rephrasing can be accomplished in a similar fashion, by appending conversational prefixes to survey items to turn them into chat questions.
The selection of the best phrases to pick from the repository for a particular survey position can be determined dynamically based on local context (e.g., the question and the user answer to decide on the reaction). Such retrieval-based approaches have already been used successfully in a conversational context [244]. Augmentations such as introduction & closing as well as progress communication can be largely accomplished using simple hand-crafted rules, while the addition of appropriate empathetic reactions as well as question rephrasing require data-driven ML components. In both cases, the augmentations are retrieved from a prewritten repository. Here I describe how this repository is constructed and what elements it contains.

Augmentation Elements

As discussed in related work, various phrases and language adaptations have been used in general conversational agents and in conversational survey administration specifically. Here I focus on a subset of phrases to support via a repository. Prior work indicated the importance of a conversational agent being able to properly initiate and end the conversation [112, 97] and also to follow proper conversational etiquette [112, 197]. Several works indicated the importance of a chatbot properly communicating its purpose and capabilities [112, 127]. Hence, the repository needs to contain Introduction & Closing phrases. Furthermore, the need and expectation of response feedback and acknowledgments [239, 124, 129], as well as the demonstrated value of empathy and expressions of liking for user engagement in relational agent design [28], dictate that the repository needs to contain Acknowledgements & Empathetic Reaction phrases. Prior work has also shown that a proper ‘conversational style’ is needed for engaging conversational survey administration [124] and is important for communicating engaging social cues [78]. The repository supports this goal by providing Question Rephrasing prefixes.
Specific to the survey task, the repository also contains Progress Communication phrases. These phrases serve a dual purpose: the usability purpose is to communicate task progress (survey completion is ultimately a task), while the engagement purpose is to provide a sense of accomplishment, acknowledge user effort [239], and thank the user for their contribution [78]. Finally, to further reduce the repetitiveness [131] and monotony of the exchange, mimicking phrases used in conversational survey administration work [239, 129, 124], the repository contains phrases for Topic continuation & topic switching.

In summary, the augmentation repository needs to contain phrases for: 1) Introduction & Closing of the conversation; 2) Acknowledgments & Empathetic Reactions to user answers; 3) Progress communication; 4) Topic continuation & topic switching; and 5) conversational reformulation of survey items, in the proposed approach via Question Rephrasing prefix phrases.

Generation of Repository Phrases

In order to populate the repository with concrete examples, I extracted phrases from prior work on manual conversational survey adaptation [239, 124, 129], included phrases from linguistic repositories such as CARLA [86], and, especially for empathy expressions, drew on sources such as motivational interviewing [198], counseling, and more casual interview resources. Two main aspects had to be addressed in the process of adapting the phrases from prior work: 1) consistency of language style, and 2) generalizability of phrases to different survey contexts. The first point required rewriting the phrases taken from prior work such that they would maintain a consistent style. Xiao et al. used a very expressive style, Kim et al. used a casual, teenage-like style, while in my HarborBot work for social needs screening I used a professional and polite style.
The second point required removing or replacing any survey-specific words used in the phrases to ensure they are applicable in various survey contexts. This common generation process was used for all the repository phrases except for conversational prefix phrase generation, where a more empirical approach was used.

While the conversational prefix phrases were also inspired by examples from prior work on conversational survey administration [239, 124, 129], their generation relied more on an empirical and iterative process. For a set of survey items, I would try to find a prefix phrase that could change the item into question form in a common polite conversational style. In case the survey item was not formulated as a question, e.g., “Type in your gross income”, I would try to create a prefix that would turn it into a question, e.g., “Could you please...”. In case an item was already in question form, but phrased in too direct a language (e.g., “Are you married?”), the prefix would add consistency of polite style as well as diversification of phrasing, such as “Please tell me whether...”. Given a new question, I would try to match one of the existing prefixes, and if none would match, a new prefix would be created. Prefixes that seemed interchangeable in the same question context (e.g., “Can you tell me whether...”, “Please indicate if”) would form a prefix group. Aside from the prefix phrases themselves, a prefix group would also include empirically generated replacement rules for the original survey item, e.g., “are you” → “you are” as well as “I” → “you”. The process was designed to make the question conversational through additions to, rather than deletions from, the original survey item text.
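The prefix-group mechanism described above (interchangeable prefixes plus replacement rules, applied through additions rather than deletions) might look roughly like this; the concrete data structure and function are my own sketch, not the actual implementation:

```python
import re

# Sketch of a prefix group: prefixes judged interchangeable in the same
# question context, plus empirically derived replacement rules.
PREFIX_GROUP = {
    "prefixes": ["Please tell me whether", "Can you tell me whether"],
    "rules": [  # applied to the lower-cased item text
        (r"\bare you\b", "you are"),
        (r"\bi\b", "you"),
        (r"\bam\b", "are"),
    ],
}

def rephrase(item, group, prefix_idx=0):
    """Turn a survey item into a conversational question using a prefix group."""
    text = item.strip().rstrip("?.").strip()
    text = text[0].lower() + text[1:]  # the prefix now starts the sentence
    for pattern, repl in group["rules"]:
        text = re.sub(pattern, repl, text)
    return f'{group["prefixes"][prefix_idx]} {text}?'
```

For example, “Are you married?” would become “Please tell me whether you are married?”, keeping the original item text intact apart from the rule-driven reorderings.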
For each category, I provide several phrase variants to help avoid repetitiveness (the ‘avoiding repetitiveness’ design guideline), and also several categories of augmentations to match the local context, either from the perspective of empathy (the ‘contextual use of empathy’ guideline) or of the best-fitting minimal grammatical augmentations (the ‘minimal changes to the original text’ guideline).

Augmentation Phrases Repository

The repository contains a total of 118 different phrases distributed among different aspects of the four augmentation tasks.

Introduction & closing phrases: The repository contains 13 different templates for chatbot introduction, such as “Hi, my name is [name]. I would like to talk to you about [topic].”, “Hi, I am [name]. Let’s talk for a moment...”, and 6 different templates for conversation closing, such as “We’ve completed everything! Thanks a lot!”, “We are done! Thanks.”. These templates represent a coherent polite and professional tone to ensure the consistency important in conversational interactions [172]. The introductions also contain slots with a chatbot name and survey topic to be instantiated for a particular survey.

Empathetic reaction phrases: The repository contains three classes of empathetic reactions: ‘Neutral acknowledgment’, ‘Expression of satisfaction’, and ‘Expression of compassion’. Neutral acknowledgments play the role of emotionally non-expressive feedback to the user that the chatbot is “listening” and “receiving” user input. These reactions are meant for contexts where emotional expressions would not make sense or could lead to a judgmental tone. The repository contains 7 different phrases for this class, such as: “Thanks for sharing”, “I took a note of that”, “Okay, I’m getting a better idea of your answers”. Expressions of satisfaction are meant to communicate positive emotional valence, encourage the user, and share in the user’s positive emotion in an appropriate context.
This class contains 10 different phrases, such as: “I am glad to hear that”, “That’s good”, “That’s really great!”. Similarly, Expressions of compassion are meant to express the chatbot’s concern and empathize with the user, especially in contexts in which the user might be disappointed or otherwise disconcerted. The use of such emotional reactions is meant to make the chat more human-like and natural, drawing on work on relational agents. This class contains 6 different phrases, such as: “I am sorry to hear that”, “That sounds stressful”, “That’s hard to hear”.

Question rephrasing prefixes: The repository contains 6 classes of prefixes used for prepending to the survey items to turn them into questions, make them more conversational, provide diversification of phrasing, and ensure consistency of tone. There are a total of 37 different prefix phrases among the 6 prefix classes, such as: “Can you tell me”, “Would you”, “Have you experienced”, “Can you share if”, “Could you say that”. These also contain replacement rules meant to rephrase the remainder of the survey item text into 3rd person question form; such replacement rules are, e.g., “i” → “you”, “am” → “are”, “are you” → “you are”. Appendix B presents the classes along with example survey items.

Progress communication phrases: The repository contains 12 progress communication phrase templates, which contain slots for the current and total survey questions or a progress percentage, such as “We are currently at question [d] out of [n].” or “We are done with [percent]% of our questions.” There are also distinct progress phrases for use in the middle of the survey: “We are now in the middle of the survey”, “We’re halfway there, still [l] questions to go”, as well as close to the end of it: “We are almost done, thanks for your patience”, “We are almost at the end, thank you for staying that long”.

Topic continuation & topic switching phrases: The repository contains 12 topic continuation and topic switching phrases.
These include generic phrases without specific topic slots, such as “Let me ask you some more questions...”, “Just a few other things I wanted to ask you about...”, as well as templates with topic information, such as: “Let’s move on to questions about [section topic]”.

It is worth noting that this repository represents a particular consistent augmentation style that is meant to be polite and professional. With a repository-based approach it is possible, and quite easy, to create survey augmentation phrases that represent, e.g., an informal, teenage-like style such as the one used in [124].

6.2.3 Design of Augmentation Tasks

Here I discuss the design and automation details of each of the 4 conversational survey augmentation tasks. These tasks rely on dynamically retrieving the most appropriate phrases from the repository described earlier given the survey context. As a result of the automated application of these tasks, the form-based survey is adapted to a conversational form.

Task 1: Chatbot introduction & conversation closing

The goal of this task is to augment the survey with introduction and closing phrases selected from the repository. The position of the phrases in the resulting dialogue is fixed as the first and last utterances of the conversationally adapted survey. As described earlier, introduction phrases are designed as templates with the chat name and the domain of the survey as slots to be instantiated for a specific survey. The chat name is simply determined from the domain by appending Bot to the domain name. The closing utterance indicates to the users that the interaction is complete and politely thanks the user for their involvement. Table 6.1 contains example instantiations for specific surveys.

The chatbot name itself can communicate its purpose and capabilities and possibly set the tone for the exchange [11]. Academic and commercial applications have used varying approaches to naming their conversational agents.
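The slot instantiation and domain-based naming scheme for Task 1 can be sketched in a few lines; the helper names are mine, and the [name]/[topic] syntax follows the square-bracket slot convention used in the tables:

```python
def bot_name(domain):
    """Naming scheme from the text: append 'Bot' to the survey domain.
    Simple capitalization; multi-word domains (e.g., SocialNeedsBot)
    would need extra handling."""
    return domain.capitalize() + "Bot"

def instantiate(template, domain, topic):
    """Fill the [name] and [topic] slots of an introduction template."""
    return template.replace("[name]", bot_name(domain)).replace("[topic]", topic)

intro = instantiate(
    "Hi, my name is [name]. I would like to talk to you about [topic].",
    domain="sleep", topic="your sleep habits")
```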
Social and relational chatbots commonly use human-sounding names such as Eliza, Alice, Mitsuku, XiaoIce, or the arguably more modern-sounding Tay or Zo. Various digital assistants, whose main task is to provide functional help while maintaining a limited degree of social interaction, tend to be given less human-sounding, abstract names such as Siri, Cortana, Swelly, WoeBot, Tido. On the other hand, several non-social service bots derive their names from the companies they represent (e.g., eBay, Duolingo, Sephora) or the function they perform (e.g., PizzaBot). Overpromising on agent capabilities can potentially lead to user disappointment and decreased engagement [156]. Given the narrow purpose of survey administration, which is just to collect answers to survey questions rather than engage in free-form social chat, I chose a naming scheme that corresponds to the survey domain, such as SleepBot, SocialNeedsBot or FinanceBot (Table 6.1).

Table 6.1: Introduction & Closing template examples and their instantiations for specific surveys. Phrases in-between square brackets are survey-specific slots that are filled-in dynamically.

Task 2: Survey question augmentation

The goal of survey question augmentation is to: 1) rephrase the original survey question to make it amenable for use in a chat context, 2) preserve consistency of utterance style across the conversation, and 3) introduce variation to the phrasing to avoid repetitiveness that could decrease user engagement. At the same time, I want to make sure that the original text of the survey questions is preserved as much as possible.

To transform the originally phrased survey questions into a form amenable for use in a conversational context, I perform several steps. First, I classify the survey question text into 6 phrasing categories (Table 6.2) derived empirically as described in the repository building section.
Based on the detected phrasing class, a concrete prefix text in that class is probabilistically selected from several different prefixes so as to minimize repetition (i.e., not reusing the same prefix in consecutive utterances). The selected prefix is prepended to the question text. Each phrasing category may include sentence modification rules to ensure the original question text is correctly phrased in 3rd person form (e.g., changing “are you” → “you are” or “I” → “you”). In the final step, to further lower the sense of repetition and emphasize the natural progression of the conversation, an additional phrase such as “Moving on” or “Next” is prepended to some of the utterances with a given probability. Several examples of the original survey items and the subsequent conversational phrasing resulting from the augmentation process are presented in Table 6.2.

Table 6.2: Examples of original survey items and the rephrasing resulting from the augmentation process. Phrases in-between square brackets have been added or modified.

Task 3: Reactions to user answers

The goal of this task is to match the most appropriate chat reaction to the user answer in the question context. There are three reaction classes in the repository used for this task, each containing several concrete text phrases: Neutral acknowledgments (e.g., “Thanks for sharing”, “Got it”), Expressions of satisfaction (e.g., “Sounds great!”, “I am happy that’s the case.”), and Expressions of compassion (e.g., “That sounds stressful.”, “I am sorry to hear that.”).

To select the most appropriate reaction, I perform several steps. First, I classify the question text into 3 empathy framing categories: Positive, Neutral or Negative. This classification is somewhat similar to sentiment, but relies on how the question is framed in order to match the empathetic reaction. Specifically, questions related to demographics or potentially “judgmental” topics should be classified as Neutral from an empathy matching perspective.
In the second step, I classify the answer option into the same 3 empathy framing categories: Positive, Neutral or Negative. Positive framing for an answer communicates that the user expressed agreement with the question, while Negative framing expresses disagreement. Neutral answer framing represents an uncertain answer, a mixed answer, or a categorical option without any clear opinion or sentiment. Certain categorical answer options can have their own intrinsic valence for empathy matching purposes; e.g., “Eviction” or “Crack/Cocaine” from the social needs survey would be classified as Negative due to the negative meaning of the concepts themselves. In the third step, the results of the question and answer classifications are combined using a fixed rule that decides on the reaction category to match. Non-matching question and answer framings (i.e., Pos & Neg or Neg & Pos) match an Expression of compassion, while matching framings (i.e., Pos & Pos or Neg & Neg) match an Expression of satisfaction. The presence of Neutral framing in either matches a Neutral acknowledgment. Appendix A presents examples of question and answer contexts in which different reaction categories would be appropriate. Selection of the concrete text from that reaction class is done probabilistically, keeping track of use frequency in the same fashion as for the prefix selection described earlier.

Task 4: Progress communication

The goal of this task is to communicate the progress of the exchange to the user and also to further break the repetitiveness by injecting additional phrases related to topic & section management. This task injects text from two repository classes: Progress communication phrases and Topic continuation & topic switching phrases (Table 6.3). The addition of these phrases is not based on any data-driven ML components; it is probabilistic, with phrases injected every n-th survey item as controlled by a meta-parameter.
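The non-ML mechanics of Tasks 3 and 4, namely the fixed reaction-matching rule, the least-frequently-used phrase selection, and the every-n-th progress injection, can be sketched as follows; the function names are mine, but the logic follows the rules stated in the text:

```python
import random

def reaction_class(question_framing, answer_framing):
    """Fixed rule combining question and answer empathy framing."""
    if "Neutral" in (question_framing, answer_framing):
        return "Neutral acknowledgment"
    if question_framing == answer_framing:  # Pos & Pos or Neg & Neg
        return "Expression of satisfaction"
    return "Expression of compassion"       # Pos & Neg or Neg & Pos

def pick_phrase(phrases, use_counts, last_used=None):
    """Pick at random among the least frequently used phrases,
    excluding the phrase used last time."""
    candidates = [p for p in phrases if p != last_used]
    lowest = min(use_counts.get(p, 0) for p in candidates)
    choice = random.choice([p for p in candidates
                            if use_counts.get(p, 0) == lowest])
    use_counts[choice] = use_counts.get(choice, 0) + 1
    return choice

def inject_progress(items, every_n,
                    template="We are currently at question [d] out of [n]."):
    """Insert a progress phrase after every n-th survey item."""
    out = []
    for i, item in enumerate(items, start=1):
        out.append(item)
        if i % every_n == 0 and i < len(items):
            out.append(template.replace("[d]", str(i))
                               .replace("[n]", str(len(items))))
    return out
```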
Progress communication phrases have particular subclasses that only apply in the middle of and close to the end of the survey; this introduces additional novelty to the dialogue. Similarly to the other tasks, the selection of the concrete text being added from each class is probabilistic, keeping track of the frequency of use to minimize repetitiveness. It is important to note that topic switching phrases are not necessarily well aligned with any actual change in the survey section or topics of questions.

Table 6.3: Template examples of progress communication and topic switching phrases and their instantiations for specific surveys. Phrases in-between square brackets are survey-specific slots that are filled-in dynamically.

6.2.4 Automation of Augmentation Tasks

Out of the four conversational adaptation tasks described in the design section, only progress communication does not rely on data-driven ML components (i.e., it adds progress phrases probabilistically). The automation, however, is also controlled by several meta-parameters defining the prevalence of different augmentations. Non-repetitive selection of the concrete text phrases to be injected is based on keeping track of frequency of use and is not learned from data. Here, I describe the datasets used for training, testing and validating the ML components. I also describe the data-driven ML components themselves as well as the non-data-driven automation.

Development Survey Dataset

The dataset used for initial training and testing of the ML components included 16 surveys: 4 demographic surveys, 2 social needs surveys, a reflection survey (Kember’s reflection), a stress survey (PANAS), a physical activity motivation survey (TPB), a workload survey (NASA TLX), a depression survey, and a few others (see Appendix D). Most of the surveys were related to health, wellbeing and behavior change, given the nature of my work.
The dataset included both validated instruments used in research and informal questionnaires. All the surveys were represented in a common JSON format adapted from their original sources (PDF, website, Word document). This adaptation step is still manual at this point. The text of questions and answers was extracted from each survey and manually labelled by the author to provide the data for training and evaluation of the ML components. The 269 extracted survey questions were labelled for the rephrasing task (6 phrasing categories) and for the empathy framing task (3 categories). The phrasing labelling resulted in 138 (51%) of the questions labelled as ‘adverb-based question’ and the rest as other phrasing categories (Figure 6.1).

Figure 6.1: Distribution of 6 question phrasing categories in general and across the 16 development surveys.

Question labelling for the purpose of empathy question framing resulted in 106 (39%) of the questions receiving a Negative label, 85 (32%) receiving a Neutral label, and 78 (29%) labelled as Positively framed (Figure 6.2).

Figure 6.2: Distribution of 3 empathy question framing categories in general and across the 16 development surveys.

The answer dataset was composed of 577 answers extracted from the 16 surveys and 138 answers representing standard Likert scales extracted from [228]. The labelling of these answers for empathy framing resulted in 382 (53%) answers labelled as Neutrally framed, 186 (26%) labelled as Negatively framed, and 147 (21%) labelled as Positively framed (Figure 6.3).

Figure 6.3: Distribution of 3 empathy answer framing categories in general (this includes the 138 answer examples extracted from common Likert scales [228]) and across the 16 development surveys.

Hold-out Survey Dataset

I selected 6 additional hold-out surveys after the conversion approach was finalized to evaluate the conversion performance on a range of surveys with potentially challenging properties (see Appendix C).
Two of the surveys are informal, while the others have been featured in published research. The surveys also employ different question phrasing (e.g., 1st person, 3rd person or mixed) and rely on different answer options (Likert-scale-based vs. custom scales). In summary, the hold-out surveys are used to: 1) evaluate the amount of manual user corrections needed to make them applicable to the end-user scenario, and 2) collect user feedback on self-reported engagement, usability, and the quality of the conversational elements in the user study.

I employed the same labelling process for the hold-out surveys. The 88 questions extracted from these 6 surveys were labelled for phrasing and question empathy framing. For phrasing, 51 (58%) of the questions were labelled as ‘Noun-based statement’ (Figure 6.4 A). For empathy question framing, 37 (42%) were labelled as Positively framed, 34 (39%) as Neutrally framed, and 17 (19%) as Negatively framed (Figure 6.4 B). Labelling of the 97 answers resulted in 39 (40%) answers labelled as Neutral, 37 (38%) as Positive, and the remaining 21 (22%) as Negatively framed (Figure 6.4 C).

Figure 6.4: Distribution of labels in the 6-survey hold-out dataset. From the left: A) Distribution of question phrasing classes among surveys, B) Distribution of Empathy Question Framing classes, C) Distribution of Empathy Answer Framing classes.

Machine-Learning Components

The data-driven ML components fueling 3 of the 4 augmentation tasks (i.e., all except progress communication) have been framed as text classification problems. I use the Scikit-learn ML library as well as the spaCy NLP toolkit to process the data and to train and evaluate the text classification tasks. While all the tasks have been framed as text classification problems, the specific nature of each task and its data poses different challenges. These properties result in different combinations of optimal features, preprocessing, and data augmentation (Table 6.4).
Survey domain detection presents several challenges: 1) the survey can be on virtually any topic (open domain), 2) multiple suitable domains can be selected, e.g., a sleep survey would work well with “health”, “sleep” and “wellbeing” designations, hence the task can be seen as multi-label, and 3) the survey domain name must be suitable for use as a name for the bot, e.g., HealthBot might be a more suitable name for a survey evaluating depression than DepressionBot. I address these challenges by defining a large, but curated, set of possible survey domains. I extract 90 different domain names from the Wikipedia topics page (e.g., “culture”, “health”, “biology”) and adjust the domain names to make them usable for the employed chat naming scheme. This set is likely to cover all possible survey domains, at least at a high level of abstraction (e.g., it may not contain a “sleep” domain, but will contain a “health” one). A similar approach to handling topics has been used in [75]. Given that the surveys in my dataset represent only a handful of domains, I seed the data for each domain with the 30 most similar words generated by spaCy (split into 3-4 groups of words to provide a few multi-word examples for each domain). During survey classification, I extract only nouns and verbs from a given survey (based on POS tagging) and calculate the average of the embeddings of these extracted keywords to represent the whole survey (Table 6.4). Nouns and verbs are much more indicative of a topical domain than other parts of speech [159], and embeddings are able to capture similarity of meaning [236].

Many surveys, especially academic ones, use variations of standardized Likert scales in their answer options. I take advantage of this by adding a set of standard Likert-scale answers extracted from [228] to my training data. This subset is used in all the classifications in addition to any survey-specific data.
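The domain-detection idea, averaging keyword embeddings and comparing against seeded domain centroids, can be illustrated with toy vectors standing in for spaCy word embeddings; the vectors, seed lists, and function names here are invented purely for illustration:

```python
import math

# Toy 3-dimensional "embeddings" standing in for real spaCy word vectors.
EMB = {
    "sleep": (0.9, 0.1, 0.0), "rest": (0.8, 0.2, 0.0),
    "money": (0.0, 0.9, 0.1), "income": (0.1, 0.8, 0.1),
}
# Each domain is seeded with similar words (the text uses ~30 per domain).
DOMAIN_SEEDS = {"health": ["sleep", "rest"], "finance": ["money", "income"]}

def average(vectors):
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(3))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def detect_domain(survey_keywords):
    """Represent the survey as the average embedding of its noun/verb
    keywords and return the most similar domain centroid."""
    survey_vec = average([EMB[w] for w in survey_keywords if w in EMB])
    centroids = {d: average([EMB[w] for w in ws])
                 for d, ws in DOMAIN_SEEDS.items()}
    return max(centroids, key=lambda d: cosine(survey_vec, centroids[d]))
```

With real spaCy vectors, the same averaging-and-similarity scheme would pick a high-level domain such as “health” even for a survey whose exact topic (e.g., sleep) is not in the curated set.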
Through experimentation, I have selected different sets of features for the different classification tasks (Table 6.4). Question Language Adaptation benefits particularly from including part-of-speech tagging, while the use of word embeddings does not seem to be valuable. This is perhaps not that surprising given the language-structure-specific nature of the task. Question Empathy Framing classification, on the other hand, benefits from embeddings and not from POS tagging, which is not surprising as the meaning and contextual use of the words is important for this task. The value of word bi-grams is harder to explain. Answer Empathy Framing classification benefits from character-level bi-grams, which makes sense given the short length of most answer options. In general, the effectiveness of the combined use of word n-grams and embeddings is a bit surprising; perhaps both symbolic and neural representations convey valuable information. Logistic regression was also more effective than SVM when embeddings and n-grams were used together.

Conversion Meta-parameters & Non-Data-driven Automation

Aside from the data-driven ML automation, several components of the conversion approach are not data-driven. While the selection of the augmentation category might be based on ML text classification, the concrete text to use from a repository is selected probabilistically, keeping track of the frequency of use to minimize repetitiveness. The following procedure is used: a concrete text from the category is selected at random from the set of least frequently used texts (excluding the text used last time), and each use of a concrete text in the category is tracked.

Several meta-parameters control the conversion process at a higher level (Table 6.5). The frequency of injection of the progress communication phrases as well as the frequency of topic phrases are both controlled by meta-parameters which define that these phrases are injected every n-th survey item. The injection of the reactions and the use of empathetic
The injection of the reactions and the use of empathetic 117 Table 6.4: Summary of the setup of between different classifiers supporting the augmentation tasks. The best setups have been determined in a limited parameter exploration on development set (however, no exhaustive grid-search has not been performed). reactions (as opposed to always using only neutral acknowledgments) is controlled in a similar fashion, but by a parameter defining the probability of a survey item getting a reaction. 6.3 Evaluation The purpose of the evaluation was to: 1) understand how well the proposed automated conversion can perform to support survey administrators in engaging their audience (RQ2) and 2) further identify and understand the aspects of the conversion approach which are handled well and the ones that are still problematic (RQ3). The evaluation is organized as a 3-step process: 1) evaluation of ML components performance, 2) manual correction effort and 3) user study based evaluation. I evaluate the performance of the ML components 118 Table 6.5: Meta-parameters controlling the automated conversion using performance metrics - classification accuracy, weighted F1 score in a cross-validation and leave-one-out evaluation setups (Figure 6.5 - ML performance). This captures the data- driven automation performance. Further I carry out a user evaluation. First on a hold-out set of 6 unseen surveys I evaluate the user correction effort (e.g., grammatical or other language issues that need to be corrected manually). This represents the additional effort a survey administrator would have to put in order to make the automatically converted surveys ready for end-user administration (Figure 6.5 - Correction effort). Then I evaluate the impact of the adapted conversational surveys (with minimal corrections) on survey respondents’ self- reported engagement, usability and the quality of the conversational elements (Figure 6.5 - User study evaluation). 
6.3.1 ML Performance

Out of the four conversational adaptation tasks described in the design section, only progress communication does not rely on data-driven ML components (i.e., it adds progress phrases probabilistically). The remaining tasks of: 1) adding introduction & closing, 2) modifying questions to conversational form, and 3) adding reactions to user answers all rely on text classification ML components. Adding introduction & closing relies on domain classification for filling out the bot name and domain slots in the text templates. Modifying questions to conversational form relies on phrasing classification to select the most appropriate prefix and utterance rephrasing rules. Finally, adding reactions to user answers relies on the results of two text classifiers that classify empathy question and answer framing.

Figure 6.5: The 3-step evaluation process: 1. ML performance - evaluation via accuracy and F1 score in leave-one-out and 5-fold cross-validation setups. 2. Correction effort - manual editor effort needed to correct basic issues (e.g., grammatical errors). 3. User study - impact of the adapted conversational surveys on engagement, usability and the quality of the conversational elements.

I evaluate the performance of the classifiers by measuring accuracy and weighted F1 score (due to class imbalances in some of the tasks) in 5-fold cross-validation as well as leave-one-out evaluation setups. Cross-validation uses data across separate surveys and captures more of a within-survey performance (i.e., items from the same survey are likely present in both the training and testing data). Leave-one-out evaluation is more indicative of likely performance on new, unseen surveys. Specifics of how these metrics are calculated are given in the subsequent section on measures.
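The difference between the two setups comes down to how items are grouped into splits. A leave-one-survey-out split, where every item from the held-out survey appears only in the test set, could be generated as follows; this is a sketch, and the (survey_id, text, label) triple format is my own assumption:

```python
def leave_one_survey_out(items):
    """Yield (held_out_id, train, test) splits from (survey_id, text, label)
    triples; each survey's items form the test set exactly once."""
    for held_out in sorted({sid for sid, _, _ in items}):
        train = [it for it in items if it[0] != held_out]
        test = [it for it in items if it[0] == held_out]
        yield held_out, train, test
```

Ordinary k-fold cross-validation, by contrast, shuffles items regardless of their source survey, which is why items from the same survey can land in both the training and testing folds.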
6.3.2 Correction Effort

Accuracy and F1 score capture the performance of the ML components according to the provided problem scoping (i.e., text classification), but may not capture other potential issues. These could be related to grammatical mismatches (e.g., a prewritten prefix template does not fit well with the question despite correct classification), more nuanced aspects, such as “awkward” phrasing of a reaction in a particular context, or other unforeseen issues outside of how the design was framed for automation. From a survey administrator’s perspective, correcting these mistakes would necessitate manual edits of the text. To evaluate such editing effort, I calculate the edit distance (also known as Levenshtein distance) defined in the measures section. This is calculated between the phrasing resulting from automation and a corrected version that I developed by hand to fix the grammar with minimal edits.

The applied corrections were limited to 2 aspects: 1) corrections of grammatical errors in the question rephrasing (see Appendix F), and 2) corrections of mismatched empathetic reactions. In the case of reactions, the correction would involve replacing a mismatched reaction with a pre-written reaction text taken from the correct category in the repository (see Appendix G). I focused only on these corrections as they can be objectively evaluated and represent the minimal set of changes needed for making the conversational survey presentable to the end-users. This essentially mimics the edits that a survey administrator would need to apply at the very minimum. No other types of changes to the automatically generated conversational survey text are applied, and any other potential “issues” are presented to the users in the subsequent user study.
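The edit distance used here, the minimum number of character insertions, deletions, and substitutions needed to turn one string into another, can be computed with the standard dynamic program:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming,
    keeping only the previous row of the DP table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]
```

A distance of 0 between the automated phrasing and the hand-corrected version means no manual edits were needed for that utterance.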
6.3.3 User Study

The purpose of the user study was to: 1) capture user perception of conversationally adapted surveys, 2) evaluate the quality of specific conversational adaptation elements, and 3) collect qualitative feedback. The study followed a between-subjects setup, where each participant was exposed to only one randomly selected condition (i.e., one of the 6 hold-out surveys). The setup was approved by a university IRB and deployed on the Amazon Mechanical Turk (AMT) crowd-working platform.

Participants: 30 AMT participants were recruited (5 per conversationally adapted survey). The participants were at least 18 years old and residing in the U.S. They were also required to have completed at least 100 AMT tasks in the past with at least a 90% approval rate. Each participant interacted with only one survey. One participant completed the task on a mobile device, while the others used non-mobile devices. Participants spent on average 8:31 min (SD: 7.53 min) on the task and the compensation was $1.50 (average rate: $10.50/hour).

Procedure: Participants were first asked to read and approve a consent form, which detailed study procedures, participant rights, compensation, and research staff contact information as required by the IRB. They then completed a conversationally administered survey (each participant answered one survey). The chat interface was based on the BotUI framework with minor survey-specific usability modifications introduced in the HarborBot study [129]. Following the chat interaction, the participants were asked 6 questions about engagement (detailed in the measures) and prompted for first-impressions free-form feedback. They then answered 10 usability questions (the SUS survey detailed in measures) including an attention check question. Lastly, the participants were asked for feedback on the 4 conversational augmentations introduced to the survey. The page was divided into two parts (see Figure 6.6).
On the left side (top on mobile), the log of the conversation with red-highlighted phrases of interest was presented. The right side (bottom on mobile) included questions asking for quality evaluation and for free-form feedback. This setup was introduced to aid recall and to present the conversational phrases in their context. Participants were asked to evaluate the overall quality and give free-form feedback for: 1) chat reactions to their answers, 2) progress communication, 3) introduction & closing, and 4) chat questions. The free-form feedback prompt changed depending on the participant’s quality score. This was done so as not to make one choice less effortful than another (AMT workers can be inclined to complete the task as fast as possible [60]). The final question asked for free-form feedback about anything “missing” that could improve the survey answering experience. The whole setup was tested for proper rendering on desktop and mobile devices.

Figure 6.6: Last page of the AMT user study asking for feedback on particular conversational augmentation design elements. On the left, participants were shown the log of their exchange with red-highlighted phrases of interest. On the right, they were asked to evaluate the overall quality of the phrases as well as to give detailed free-form feedback. Pressing “Continue” would ask them to evaluate another aspect (red highlights in the conversation would change accordingly).

6.3.4 Measures

ML Evaluation Measures: All the ML components used are text classifiers. I evaluate their performance using the standard metrics of Accuracy (fraction of correct category predictions) and weighted F1 score (a support-weighted average of per-class F1 scores, each the harmonic mean of precision and recall) as provided by the Scikit-learn library¹.
I select the weighted variant of the F1 score to account for label imbalance in some of the tasks (e.g., in empathy answer framing, 53% of the examples out of 3 classes are labelled as Neutral, so consistently assigning a Neutral label could yield 53% accuracy).

Correction Effort Measures: I measure the manual correction effort by edit distance, defined as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other [246]. For example, a correction from “Would you mind sharing do you have a workout buddy?” to “Would you mind sharing whether you have a workout buddy?” would require 2 substitutions (“d”→“w”, “o”→“h”) and 5 insertions (“ether”) for a total edit cost of 7 characters.

User Study Measures: Participants evaluated the conversationally adapted surveys in terms of engagement using 6 questions adapted from O’Brien’s engagement survey [180] (e.g., “I was really drawn into answering questions”, “I felt involved in answering questions”, “This experience of answering questions was fun”). The same engagement questions were used in my prior work on social needs screening with HarborBot described in Chapter 5. Usability was evaluated using the System Usability Scale (SUS) [18] with 10 questions adapted to the chat context (e.g., “I think that I would like to use this chat interaction frequently.”, “I thought there was too much inconsistency in this chat interaction.”, “I thought the chat interaction was easy to use.”). Additionally, the 4 design aspects (i.e., reactions, progress, introduction & closing, and question phrasing) were evaluated in terms of quality on a 5-point scale (from “Very poor” to “Very good”, with the mid-point set to “Acceptable”).
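As a sketch, the edit distance used here can be computed with the standard dynamic-programming recurrence; applied to the example correction above, it yields the stated cost of 7 characters:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[len(b)]
```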
¹https://scikit-learn.org/stable/modules/model_evaluation.html

Several questions also asked for free-text feedback on overall aspects of the interaction (i.e., “Please share any aspects of the interaction that you felt were particularly bad or good for your experience”, “Please share one aspect that was missing in the interaction and that you think would be valuable for improving your experience”) as well as on specific conversational augmentation elements (e.g., regarding reactions to answers, “Please give an example and share what felt wrong about it.”).

6.3.5 Analysis

ML Evaluation Analysis: In the leave-one-out evaluation, the accuracy and weighted F1 scores are calculated per validation survey (i.e., the survey not used for training) and then averaged across all the surveys. In the 5-fold cross-validation, the data is randomly split into 5 parts, with 4 used for training and the 5th used for testing.

User Study Analysis: The 6 engagement questions showed high internal consistency (α=0.86) and were averaged to form an engagement score (same process as in [129]). Given the meaning of the 5-point likert scale, average values above 3 represent positive engagement. The 10 SUS items also showed high internal consistency (α=0.83). Scoring the SUS involves converting each item to a 0–4 contribution (the item score minus 1 for positively worded items; 5 minus the item score for negatively worded items), summing the contributions, and multiplying the result by 2.5 to form a score from 0 to 100. Past research indicates that SUS scores above 68 represent above-average usability². I also used a mixed-effects model with the quality ratings for the 4 design aspects (intro & closing, reactions, progress, question phrasing) as predictors and the engagement rating as the predicted outcome. I control for survey repetition by including survey id as a random effect. I used the model to examine the impact of the augmentations on user engagement.
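The standard SUS computation can be sketched as follows (a minimal illustration assuming 1–5 item ratings in standard SUS order, with odd-numbered items positively worded and even-numbered items negatively worded):

```python
def sus_score(responses):
    """responses: 10 item ratings on a 1-5 scale, in SUS item order.
    Odd items (index 0, 2, ...) contribute (score - 1); even items,
    which are negatively worded, contribute (5 - score)."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses):
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # scale the 0-40 sum to 0-100
```

For instance, an all-neutral response pattern (all 3s) yields a score of 50, while scores above 68 are conventionally read as above-average usability.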
The qualitative feedback was coded and grouped into themes relating to strengths and weaknesses of the applied conversational adaptations, as well as based on feedback for particular conversational augmentation aspects.

²Scoring System Usability Scale (SUS) - https://www.usability.gov/how-to-and-tools/methods/system-usability-scale.html

6.4 Results

The results are presented according to the 3-step evaluation process: 1) evaluation of ML component performance, 2) manual correction effort, and 3) user study based evaluation. Conceptually, the ML performance evaluation estimates how well the proposed conversational adaptation design can be executed automatically on unseen, but similar, surveys. Manual correction effort estimates how much work is needed from a human survey administrator to correct the basic language mistakes and empathy mismatches resulting from the proposed automation (this is evaluated as if corrections were made on raw text output). Finally, the user study evaluation captures the impact of the automatically augmented surveys, after applying the minimal manual corrections, on survey respondents. I present the findings from each evaluation step, and discuss insights and potential improvements for the future.

6.4.1 ML Performance

Table E.1 presents an evaluation of the performance of the different classifiers used for the survey adaptation tasks. Question Language Adaptation classification selects the most appropriate prefix to be used for rephrasing the survey item and ensuring “conversational style” (2nd person and question form). The 83% accuracy in cross-validation indicates that the classifier performs much better than random (17%) or simple majority class selection (51%). The small discrepancy between the weighted F1 and accuracy scores indicates that class imbalance (51% of the data is labeled as only one of 6 classes - ‘adverb-based questions’) is not negatively affecting performance.
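The two baselines referenced above can be read directly off the label distribution; a small sketch (with hypothetical counts) for a 6-class task where the majority class holds 51% of the labels:

```python
from collections import Counter

def baseline_accuracies(labels):
    """Expected accuracy of a uniform-random guesser and of always
    predicting the majority class, given the observed labels."""
    counts = Counter(labels)
    random_acc = 1 / len(counts)  # uniform guess over observed classes
    majority_acc = counts.most_common(1)[0][1] / len(labels)
    return random_acc, majority_acc

# Hypothetical 6-class label distribution with a 51% majority class.
labels = (["adverb"] * 51 + ["a"] * 10 + ["b"] * 10
          + ["c"] * 10 + ["d"] * 10 + ["e"] * 9)
```

With this distribution, random guessing yields roughly 17% accuracy and majority selection yields 51%, matching the reference points used in the text.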
The fairly large gap between cross-validation (83%) and leave-one-out evaluation (68%) performance suggests that some question phrasing categories are present in a small subset of surveys and are not common across all the surveys (e.g., indeed the PANAS survey [232] provides 19 of the 30 examples for ‘request-action’ question phrasing).

Question Empathy Framing and Answer Empathy Framing are two text classifiers used jointly to decide on the empathetic reaction class to present to the user (detailed in the design section). Both classifiers select from 3 classes (positive, negative and neutral).

Table 6.6: Classification performance for the 4 text classification tasks (+1 derived) used in automated conversational survey adaptation. Question Empathy Framing and Answer Empathy Framing classifications are part of the empathetic addition - the results of these two classifications taken together are used to decide on the reaction class.

In the case of Question Empathy Framing the classes are fairly balanced, which suggests that the achieved 81% accuracy is better than expected at random (∼33%). Similarly to Question Language Adaptation, the better performance in cross-validation (81%) than in leave-one-out evaluation (69%) suggests that a particular question framing tends to be overrepresented in a subset of surveys (indeed, the demographics surveys have questions mostly labelled as neutrally framed). The Answer Empathy Framing classification also performs better than expected at random (∼33%) or with majority class selection (53%), with an average accuracy of 89%. Although the answer classes are fairly imbalanced in the data, the lack of discrepancy between the weighted F1 and accuracy scores does not suggest this to be a problem. Contrary to the other classifiers, the Answer Empathy Framing classification performs similarly in cross-validation and leave-one-out evaluations, suggesting that answer framing is more reusable across surveys.
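The joint use of the two framing classifiers can be illustrated with a small decision function. The exact mapping is given in the design section; the version below is a hypothetical reconstruction consistent with the error analysis later in this chapter (a neutral question or answer yields a plain acknowledgment, and a negatively framed question reverses the reaction valence):

```python
def reaction_class(question_framing: str, answer_framing: str) -> str:
    """Pick an empathetic reaction class from the two framing labels.
    Each label is one of: 'positive', 'negative', 'neutral'.
    Hypothetical mapping, not the dissertation's exact rule set."""
    if question_framing == "neutral" or answer_framing == "neutral":
        return "neutral"       # plain acknowledgment, e.g. "Got it."
    if question_framing == "positive":
        return answer_framing  # e.g. "That's great!" / "Sorry to hear that."
    # Negatively framed question flips the answer valence
    # (agreeing with "I have difficulty falling asleep" is bad news).
    return "positive" if answer_framing == "negative" else "negative"
```

Under this mapping, misclassifying a question's framing flips or neutralizes the reaction for every one of its answer options, which is consistent with the oversized correction cost of question framing errors discussed below.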
This classifier also benefits from being seeded with additional external cross-survey data for common likert-scales, containing 135 examples and comprising ∼17% of the dataset. Removing this data from training slightly decreases performance in leave-one-out evaluation (Acc=0.86±.15, F1=0.86±.17), but increases performance in cross-validation (Acc=0.90±.01, F1=0.89±.01), introducing some of the same imbalance present in the other classifiers.

In general, the leave-one-out evaluation yields lower performance than cross-validation, suggesting unique aspects of specific surveys not necessarily shared across all the surveys in the dataset. Furthermore, the large standard deviations of 0.33 for Question Language Adaptation and 0.24 for Question Empathy Framing in leave-one-out evaluation indicate that performance varies considerably across surveys.

Table 6.7: Classification performance on the 6 hold-out surveys used for correction effort estimation and in the user study.

Table 6.7 presents the classification evaluation results on the set of 6 hold-out surveys. The better accuracy in Question Language Adaptation suggests the hold-out surveys are very similar to most of the development dataset surveys in that respect. I explore some of the performance differences on hold-out surveys in the subsequent section on correction effort. Performance on the full dataset (the combined 16 development surveys & 6 hold-out surveys) can be found in Appendix E.

In summary, the classifiers used for the various conversational adaptation tasks perform better than random or majority class selection. Answer Empathy Framing classification seems more reusable across surveys than Question Language Adaptation or Question Empathy Framing. This classifier also benefits from additional cross-survey training data from standardized likert-scales, which contributes to its better performance in leave-one-out evaluation.

6.4.2 Correction Effort

Each of the hold-out surveys required some corrections (Table 6.8).
Normalized by the characters automatically added in the conversational adaptation, the manual edits comprised 30.06% of the automatically added text on average. The Political Views survey required the most edits (42.12%), while the PVQ survey required the fewest (18.75%). It is worth noting that while misclassifications almost certainly necessitate manual correction (i.e., unless the rephrasing resulting from a misclassification is still grammatically correct from the editor's perspective), some corrections might also be needed even when the classification is correct (i.e., when the rephrased question has grammatical issues despite a seemingly correct classification). In fact, editing needs in spite of correct classification may signal a systematic issue in the problem definition. In total, question language adaptation corrections accounted for 13.65% of all the edits. This is the sum of 2 sources of such corrections: 1) language adaptation misclassification (5.15%) and 2) missing replacement rules (8.48%). Empathetic reaction corrections accounted for 84.05% of the edits, caused by: 1) Question Empathy Framing misclassification (71.55%) and 2) Answer Empathy Framing misclassification³ (12.48%). The remaining 2.31% of corrections were due to survey domain misclassification in the introductions.

³In case both the question and the answer were misclassified, the correction effort is counted under question misclassification to avoid double counting (from the editor’s perspective a reaction would need editing only once).

Table 6.8: Correction effort quantified as character edits per hold-out survey. The corrections represent the minimal changes to the grammar and empathetic reactions needed from a survey administrator to present the conversational survey to end users.

Question Language Adaptation Corrections: Question phrasing corrections were required when automated conversational rephrasing resulted in grammatical errors.
Out of the 88 questions in the 6 hold-out surveys, 37 (∼42%) required some form of editing (see Appendix F).

Question Phrasing Misclassification: The Personal Finance and Sleep Quality surveys required the most editing effort due to phrasing misclassification (16% and 6% of the survey correction effort respectively). In the case of the Sleep Quality survey, 6 questions were consistently misclassified as ‘verb-based statement’ instead of ‘noun-based statement’, resulting in rephrasings such as “I have difficulty falling asleep.” to “Next, have you experienced i have difficulty falling asleep?” instead of “Next, would you say that you have difficulty falling asleep?” Both phrasing classes are the least represented in the training data (only 11 and 17 example sentences respectively), which likely explains the misclassification. A similar situation takes place for the Personal Finance survey, where 4 questions were misclassified as ‘noun-based statements’ instead of the more appropriate ‘verb-based questions’.

Missing Text Replacement Rules: The PVQ survey required the most question edits due to missing replacement rules (43% of all the correction effort). The unique aspect of this survey was its 3rd person question framing, which was not the case for any other survey in the dataset. The lack of rules rephrasing 3rd person to 2nd person word use (i.e., “he” → “you”, “himself” → “yourself”) resulted in, e.g., the survey item “It’s important to him to show his abilities. He wants people to admire what he does.” being rewritten as “Do you think that it’s important to him to show his abilities. He wants people to admire what he does?”, which required 21 manual character edits (based on edit distance) to correct to “Do you think that it’s important to you to show your abilities? Do you want people to admire what you do?”. It is important to note that the replacement rules are not learned from the data and require manual specification.
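A minimal sketch of such hand-specified replacement rules (illustrative only; the mapping below covers a few masculine pronouns and preserves sentence-initial capitalization):

```python
import re

# Hypothetical 3rd-person -> 2nd-person pronoun rules; a full rule set
# would also cover feminine/plural pronouns and verb agreement.
PRONOUN_RULES = {
    "he": "you", "him": "you", "his": "your", "himself": "yourself",
}

def apply_pronoun_rules(text: str) -> str:
    def repl(match):
        word = match.group(0)
        new = PRONOUN_RULES[word.lower()]
        # Keep a leading capital letter (e.g. sentence-initial "He").
        return new.capitalize() if word[0].isupper() else new

    pattern = r"\b(" + "|".join(PRONOUN_RULES) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)
```

For example, `apply_pronoun_rules("It's important to him to show his abilities.")` yields “It's important to you to show your abilities.” Note that verb agreement (e.g., “wants” → “want”) is deliberately not handled here and, as noted in the text, would require POS-tagging.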
There is, however, a finite set of replacement rules, as they are based on personal, possessive and reflexive pronouns, which are a closed class. Also, the replacement of 3rd person verb forms (e.g., “thinks”, “runs”) can be accomplished via POS-tagging.

Verbosity & Multiple Viable Question Rephrasings: Two additional observations emerge from the question phrasing corrections. First, the conversational rephrasing might in some cases be unnecessarily verbose; for example, “I feel in control of my current financial situation.” is rephrased as “So, is it fair to say that you feel in control of your current financial situation?” while a more concise rephrasing would simply be “Do you feel in control of your financial situation?”. This is partially an artifact of how the rephrasing was designed to add diversification and ensure a consistently polite tone. An additional phrasing category could offer more concise rephrasing. Second, it seems that the same survey item could match more than one category of rephrasing; for example, “My finances are a significant source of worry for me.” could be rewritten using the ‘verb-based question’ category as “Can you tell me whether your finances are a significant source of worry for you?” or the ‘noun-based statement’ category as “Would you say that your finances are a significant source of worry for you?”. Given that the answer options for this question are a likert-scale from “Not at all true” to “Very true”, the second rewrite seems more appropriate. This suggests that Question Language Adaptation classification could benefit from taking the context of the answer options into account.

Empathetic Reaction Corrections: Reaction corrections represented the largest portion of the correction effort (84.05%). On a per-reaction basis, this represents 149 of 382 (39.0%) automatically provided reactions needing corrections.
Answer Framing Misclassification: The Fitness and Sleep Quality surveys required the most rewriting effort due to answer misclassification (38% and 32% of all the rewrites respectively). The Sleep Quality survey uses a variation of a 4-point frequency scale for each question. One of the answer options (“Sometimes: 1-2 times a week”) was consistently misclassified as Positive as opposed to Neutral, which resulted in the need to correct the subsequent empathetic reaction for every survey question. In the case of the Fitness survey, the answer options are custom 4-point scales distinct for each question. The majority of the misclassified answers have fairly ambiguous framing for empathetic reaction purposes. For example, the answer option “Yes but I don’t always stick to it.” was misclassified as Positive, but in the context of the question “Do you have an exercise plan?” this answer option would not go well with a chat reaction of either “Great to hear that’s the case” (in case of a Positive classification) or “I am sorry to hear that” (in case of a Negative classification), and would better fit a Neutral reaction such as “Thanks for letting me know.” Several answer options were misclassified in this survey due to answer options with mixed sentiment.

Question Framing Misclassification: The Political Views survey scored particularly low for question framing classification (44% accuracy), and corrections of these misclassifications comprised 92% of the editing effort in this survey. Indeed, several neutrally framed questions were misclassified as negatively framed (e.g., “A good government should aim chiefly at more aid for the poor, sick, and old.”, likely due to the presence of keywords with negative sentiment such as “poor” and “sick”) or positively framed (e.g., “I would prefer a friend who is practical, efficient, and hard working.”, likely due to the keywords “efficient” and “practical”).
Framing classification is used for deciding whether a chat reaction should be neutral or empathetic and is not equivalent to sentiment. In the context of a survey asking about political views, empathetic reactions would not be appropriate (i.e., they would be judgmental). While the questions themselves contain keywords which can reveal something about the survey domain (e.g., “government”), the lack of similar examples in the training data likely makes this challenging. The Question Empathy Framing in particular (as labelled in the data) implicitly relies on broader context and may benefit from explicit contextual information (i.e., an explicit domain) or more training data (e.g., surveys representing different domains).

Nuance in Empathy Labelling & Oversized Impact of Certain Misclassifications: Two additional observations emerge from the corrections of automatic empathetic reactions. First, the question and answer framing classification, as labeled in the data, relies on a more nuanced understanding of a broader context. With limited context and data it is hard to differentiate between classifying “I feel vigorous after sleep.” in the sleep quality survey context as appropriate for an empathetic reaction, and recognizing that the question “Someone who works all week would best spend the weekend trying to win at golf or other sport” in the context of a values survey is likely best matched with a neutral reaction (i.e., to avoid a judgmental tone regarding someone’s values). It is also worth noting that even appropriately labelling such data for the empathy purpose can be challenging in itself. Second, misclassification of question framing incurs a high correction cost in raw-text editing. For a misclassified question, the reactions to all the answer options likely require rewriting.
For example, “I feel vigorous after sleep.” classified as negatively framed would result in reversed reaction valence for all the answer options (i.e., the answer “Rarely” would result in “That’s great!” and “Almost always” in “So sorry about that”). A similar situation happens if a survey heavily reuses a specific answer option that happens to be misclassified. While the editing cost is high with raw text, these misclassification scenarios offer an opportunity for editing tool support. With such support, relabelling the question or a repeatedly used answer would require only one or two clicks (i.e., all the reactions under such a question could be automatically updated, and changing the label for one answer could be automatically propagated to all the identical answers throughout the survey).

6.4.3 User Study: Quantitative Results

The user study results comprise quantitative self-reported evaluations of engagement, usability and the quality of conversational augmentations, as well as free-form qualitative feedback on the general interaction experience and on specific conversational design elements.

Engagement

Participants reported positive average engagement with the conversational surveys of 3.73 (SD=0.90), where 3 represents the mid-point rating on the 5-point likert scales used in the engagement questions. For comparison, in my prior work the manual conversational adaptation of the social needs screening survey (Chapter 5) received a mean score of 3.59 (SD=0.77) on the same engagement scale. Average reported engagement for all the surveys was above 3, with the conversational version of the Big5 personality survey rated as most engaging at 4.27 (SD=0.42) and the Fitness survey rated least engaging at 3.08 (SD=1.40). These differences were not statistically significant. Only 4 of 30 participants reported engagement lower than 3, and each for a different survey, suggesting no systematic issues with the augmentation of a particular survey.
Usability

Participants reported an average usability of 70.17 (SD=19.11) on the SUS survey, on a scale from 0 to 100. Any score above 68 can be interpreted as above average according to [175]. The conversational version of the Big5 personality survey received the highest usability score of 86.5 (SD=20.89), while the usability of the Sleep Quality survey was rated the lowest with a score of 59.0 (SD=10.55). These differences were not statistically significant. 13 of 30 participants rated the usability of their surveys below the above-average threshold of 68, indicating some potential usability issues. I look at the qualitative feedback to understand these potential issues.

Quality of Conversational Augmentations

Participants were also asked to rate the quality of the different augmentation utterances on a 5-point likert scale (from “Very poor” to “Very good”). These were highlighted in the context of their interaction (see Figure 6.6). All augmentations were rated high on the quality scale (see Figure 6.7). The introduction and closing phrases were rated the highest, with 87% of users rating them as “Good” or “Very good”. The quality of the reactions to user answers was rated the lowest, with 10% of the participants rating them as “Very poor” or “Poor” and only 70% as “Good” or “Very good”. To check whether the added conversational augmentations indeed positively impact user engagement, I used a mixed-effects model to predict engagement from the user-reported quality ratings for the different conversational elements (Table 6.9). To control for differences across surveys I include survey as a random effect. Empathetic reaction quality has a significant positive impact on engagement (β=0.34, p<0.05), while Introduction & Closing quality (β=0.34, p=0.067) as well as Question quality (β=0.34, p=0.071) are only marginally significant. The overall model fit is R²=0.302. Given that engagement is measured on a 5-point likert scale, the effect sizes are all within a half-point increase.
It is hard to directly compare the effect sizes, as reactions are much more frequent in a given survey than Introduction & Closing or Progress utterances.

Figure 6.7: User-rated quality of conversational elements in the AMT study on a 5-point likert scale.

Table 6.9: Mixed-effects model predicting engagement by conversational element quality rating.

Survey respondents generally rated the quality of the conversational augmentations high, and the quality of these augmentations seems to positively impact engagement. It is interesting to note that usability was not correlated with engagement (r=0.125, p=0.51).

6.4.4 User Study: Qualitative Feedback

Here I present the themes from the qualitative user feedback on the interaction as a whole as well as on specific conversational augmentation aspects.

Positive Perceptions

Most of the participants reported a positive experience answering a survey with conversational chat. Several specifically described the chat as interactive & responsive (P1, P6, P15, P25), e.g., “it felt very interactive did not really felt like chatting with a bot” (P1). Participants also reported feeling comfortable and engaged in the interaction (P4, P9), e.g., “All good. I felt comfortable and eager to engage.” (P4), and that the chatbot felt natural and easy to talk to (P5, P8, P10, P12, P18): “It was a cool experience, the bot felt very natural and easy to talk too.” (P13). Several also described the interaction as easy to follow and straight-forward (P16, P19, P29): “I felt it was easy to follow along and answer the questions, this was good.” (P16). I further report the positive feedback for specific design aspects.

Empathetic Reactions Perceived as Good Quality: Several users perceived the empathetic reactions positively, reporting that they were of good quality (P5, P14, P26, P29), e.g., “I think it is OK now” (P5), and that they would not change anything (P4, P5, P14, P16, P18, P21, P25), e.g., “I don’t think they need to be improved upon.” (P16).
One of the participants specifically described the reactions as natural, encouraging and pleasant: “No it sounds natural and encouraging and pleasant really in my opinion” (P15), and another considered them ‘cute’: “I liked that there was a reaction, it was cute...” (P9).

Progress Phrases Helpful & Appropriately Timed: Most participants considered the progress update phrases, such as “We are currently at question 4 out of 19.”, as not needing any improvements (27 of 30). Some specifically reported them as being natural, e.g., “They sounded completely natural to me.” (P3), as well as helpful, e.g., “Nothing needs to be improved, it is simple and helpful.” (P16), and reported them as concise and informative: “This was on point and thoroughly appreciated” (P12). They also reported the progress updates to be provided at just the right frequency: “I thought it was just the right amount of updating.” (P27) or “The progress info was provided timely and effectively, I liked it” (P30).

Introduction & Closing Uniformly Perceived as Good Quality: The introduction and closing were considered of good quality by almost everyone. Several participants explicitly described them as appropriate: “It was perfectly fine and appropriate.” (P27) and not needing any improvements: “No need for improvement.” (P16). Only two participants suggested possible additions. One related to the name and more information about the bot: “Any formal name? How the Bot was created?” (P5), and the other comment suggested an improved conversation closing: “It could say goodbye.” (P7).

Conversational Question Adaptation Natural & Human-like: Participants gave feedback on the survey questions in their final conversational form (i.e., 2nd person question rephrasing and prepended prefix), without being aware of what the questions looked like in the original form-based survey.
Several participants reported the questions being ‘natural’ and feeling like a part of the conversation (P3, P5, P14, P15, P30), e.g., “They sounded completely natural and like a normal part of conversation” (P3), and reported the questions to be well formulated, which suggests the adaptations blended in well with the whole question text: “The questions were well formulated and gorgeous, nothing to say about them” (P30). One participant referred specifically to the conversational prefixes forming a good connection between questions: “I thought the reaction into the next question was the more natural, human like of the whole thing” (P15). Several other participants explicitly reported the questions to be of ‘good’ quality (P8, P10, P17, P29), and others reported that no changes were needed (P4, P11, P18, P19, P21, P24, P25), e.g., “Great questions, no improvement.” (P19).

Remaining Challenges

While most users perceived their chat-based survey experience positively, those that did not pointed mostly to issues with the empathetic reactions. One of the participants found the reactions to every answer a bit artificial and awkward when coming from a stranger in a personality survey: “The feedback on every answer was the only thing that sounded artificial. ‘That’s hard to hear’ is a somewhat strange response from a stranger when answering personality questions.” (P3). Two others felt they were judgmental: “I hate that he would go ‘so sorry to hear that’ like buzz off with your judgemental self.” (P24) and “I didn’t like it when the bot said ‘that’s hard to hear.’ Like gee thanks, it’s good to hear parts of my personality disgust you.” (P27). The second issue is actually a mislabeling problem, rather than an intrinsic design issue.
Two of the participants also pointed to the interaction still feeling a bit mechanical: “I thought the chat was a bit mechanical and didn’t feel personal.” (P11) and a little repetitive: “It was a little repetitive, but not bad overall, and surely interesting” (P30). In the subsequent sections I report more detailed improvement opportunities for specific conversational aspects. Reactions Suffer from Mismatched Empathy & Could Make Better Use of Contents: A few participants reported wanting the reactions to be more elaborate (P8, P11), e.g., “Make it so that the sentences seem more complete.” (P11). This likely refers to short neutral reactions present in the repository such as “Got it” or “Noted”. Others felt that the reactions could make better use of the specific answer context (P17, P2, P19, P25), e.g., “It could mention that it received your answer (repeating the answer back to you).” (P2) or “Maybe a unique reply to my preference in relation to the topic asked.” (P19). The biggest reported issue with the reactions arose when participants felt judged or patronized by an attempted empathetic reaction in a seemingly wrong context (P3, P7, P24, P30). Due to this perception the chat was described as over-familiar: “It’s awkward and maybe a little over-familiar” (P3) and “judgey” (P24). One user was specifically unhappy about an empathetic reaction in the context of a personality survey: “I did not like the phrase “that’s hard to hear” because it’s who I am, why would that be hard to hear. Also, why is the bot sorry to hear that I’m not trusting?” (P28). Another participant felt that there is actually no need for empathy and that acknowledgments alone would be sufficient: “There’s no need for too much empathy, acknowledging my replies is enough” (P30), or that the empathetic reactions make the chat less natural by making it too polite: “not interacting like a normal human too polite” (P1).
Progress Repetitive & May Decrease Respondents’ Attention: Despite generally positive perceptions of progress communication, a few users reported specific improvement opportunities. For some, the specific topic switching phrases such as “Let’s move on to talking about a few more things...” felt out of place, e.g., “it acts like it is going into something else but basically asks me what it should already know from the above” (P7). These phrases are indeed treated as part of the progress update and added probabilistically, not necessarily separating questions on different topics. For others the progress utterances felt a bit repetitive: “seems repetitive” (P1) and one user reported that having such information in general can actually rush answering and negatively impact attention: “I don’t think these percentages should be used because it may rush workers on answering rather than giving their full and undivided attention.” (P22). Conversational Question Adaptation Slightly Repetitive & Can Be Personalized: There were very few negative perceptions of question phrasing, from just a few users. One user reported the conversational survey questions still felt repetitive: “questions are repetitive” (P1) and another felt that the chat seems to ask for the same information repeatedly: “questions seem to ignore the preceding answer instead of relating to my input.” (P7). This is likely an artifact of a survey asking similar questions to measure the same latent construct. One participant also suggested further personalization of the questions with his/her name, e.g., “Possibly asking for a name and then personalizing the messages each time.” (P22). 6.5 Discussion I first discuss the results in relation to my research questions.
For RQ1, about supporting conversational adaptation with automation, I proposed an automated process consisting of 4 augmentation tasks: 1) adding an introduction & closing, 2) adding reactions to user answers in question context, 3) adding conversation progress communication, and 4) modifying survey questions to fit a conversational style. These tasks rely on retrieval of phrases from a reusable augmentation repository. My approach led to conversational surveys that can be deployed with respondents after applying only minor tweaks. In relation to RQ2, about the impact on survey respondents, I have used the proposed process to automatically generate conversational versions of 6 unseen surveys. I further quantified the remaining survey administrator’s correction effort (to manually fix misclassifications & grammar issues) and evaluated the impact of such adapted surveys with 30 participants. Mixed-methods results demonstrated: 1) positive self-reported engagement (comparable to manual conversational survey adaptation in my prior work), 2) a positive impact of conversational adaptation elements on engagement (via a mixed-effects regression model), and 3) a nuanced understanding of the engagement impact of specific augmentation aspects based on thematic analysis of qualitative feedback. Finally, in relation to RQ3, about which conversion aspects are handled well & which are problematic, I have shown that a fairly simple approach, involving a repository and trained on a limited set of 16 surveys, can achieve reasonable results leading to positive user engagement with only about 30% of automated augmentations needing manual corrections. Further discussion focuses on: 1) design definition & manual correction effort improvement opportunities, 2) automation performance & capability improvements, 3) intrinsic trade-offs between survey requirements and an ideal conversational experience, and 4) augmentation task expansion as well as support for prototyping and tailoring.
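The 4-task pipeline summarized above can be sketched in miniature. The snippet below is an illustrative simplification only: the repository phrases and the keyword-based framing classifier stand in for the actual 118-phrase repository and trained ML classifiers, and question rephrasing is omitted.

```python
import random

# Toy stand-in for the reusable augmentation repository (the real one
# contains 118 phrases across categories like these).
REPOSITORY = {
    "introduction": ["Hi, my name is {name}. I would like to talk to you about {topic}."],
    "closing": ["That is all of my questions. Thank you for talking with me!"],
    "progress": ["We are currently at question {i} out of {n}."],
    "reaction_positive": ["Great to hear!"],
    "reaction_negative": ["That's hard to hear."],
    "reaction_neutral": ["Got it.", "Noted."],
}

def classify_framing(answer_text):
    """Keyword stand-in for the trained empathy-framing classifier."""
    text = answer_text.lower()
    if any(w in text for w in ("never", "rarely", "disagree", "unemployed")):
        return "negative"
    if any(w in text for w in ("always", "often", "agree strongly", "yes")):
        return "positive"
    return "neutral"

def augment_survey(question_answer_pairs, name="SurveyBot", topic="your sleep"):
    """Apply the 4 tasks: intro/closing, progress updates, questions, reactions."""
    turns = [REPOSITORY["introduction"][0].format(name=name, topic=topic)]
    n = len(question_answer_pairs)
    for i, (question, answer) in enumerate(question_answer_pairs, start=1):
        turns.append(REPOSITORY["progress"][0].format(i=i, n=n))
        turns.append(question)  # conversational rephrasing omitted in this sketch
        turns.append(random.choice(REPOSITORY["reaction_" + classify_framing(answer)]))
    turns.append(REPOSITORY["closing"][0])
    return turns
```

Calling `augment_survey` on a list of question/answer pairs yields a scripted dialogue: introduction, then per-question progress update, question, and matched reaction, followed by the closing.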
6.5.1 Design Definition Improvement Opportunities Evaluation results highlighted several opportunities for design improvements, to the empathetic reactions specifically, but also to broader aspects. Improvements to Empathetic Reactions: Empathetic reaction matching suffers from 3 major issues: 1) lack of an appropriate reaction class for specific scenarios, 2) insufficient use of broader context, and 3) lack of specificity to survey contents. In relation to the first point, just 3 empathetic reaction categories might not be sufficient for some contexts. Prior work points to the richness of empathy expressions [243, 214, 215]. Some limitations are apparent when an answer option is ambiguous (e.g., “Yes but I don’t always stick to it.”) or the context implies a reaction beyond simple empathizing (e.g., Q: “Do you want help with school or training?”, A: “Yes”). These examples are, however, rare in the dataset and hence pose a challenge for automated matching. Secondly, proper matching of reactions may rely on broader context than just the question and answer. Even the labelling itself implicitly incorporated the broader context, with demographic questions about age, gender, ethnicity, and education being labeled as neutrally framed to ensure ‘Neutral acknowledgments’ would be matched as reactions. The situation becomes more difficult when questions on an otherwise neutral topic, such as employment, are asked in a sensitive survey context, such as social needs (e.g., Q: “Which of the following describes your employment situation right now?”, A: “Unemployed - looking for work”), in which case an empathetic reaction might be appropriate. Thirdly, reactions could benefit from directly incorporating user answers (e.g., “Thanks for saying yes”) or specific mention of the question content (e.g., “Thanks for letting me know about your housing situation”).
Directly referencing user input is in line with indications from prior work [78, 242] and also aligns with findings that users prefer sophisticated choices of words, as well as well-constructed and long sentences [154, 224]. Rushing Towards Completion, Limited Personal Feel & Language Verbosity: Other challenges could be grouped into: 1) issues with rushing towards completion, 2) the need for more personal interaction, and 3) verbosity of question rephrasing. In relation to progress repetitiveness and rushing towards completion, the informational part (i.e., which question the user is at) could always be paired with interaction encouragement. For more personal interaction, use of the person’s name, sharing the bot’s ‘background’, and improved politeness suggested in user feedback are in line with relational agent design and could easily be included [28, 131]. Finally, the verbosity of question rephrasing is a by-product of addition rather than removal of contents. This is to avoid question text modification and is also a technical limitation of ‘deep’ rewriting [133, 241]. Lengthening of the interaction and more reading effort can be detrimental to some users, as shown in my prior work [129]. 6.5.2 Correction Effort Reduction More than 70% of the correction effort was due to question empathy framing misclassification alone. Such misclassification almost certainly invalidates all the empathetic reactions applied to all the answer options for a question. As shown in an actual example from the Sleep Quality survey in Figure 6.8-left, misclassifying the question “Next, could you say that you fall into a deep sleep?” as Negatively framed results in the answer “Rarely: None or 1-3 times a month” being matched with an incorrect reaction: “Sounds good”. This is the case for all the reactions, even if all the individual answer options are classified correctly. The manual effort of rewriting all these reactions amounts to 66 character edits (rewriting to the reactions in Figure 6.8-right is assumed).
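The cascading cost of a single question misclassification follows from how the separate framing labels are composed into a reaction. The polarity-product rule below is my assumption, chosen to be consistent with the Figure 6.8 example, and is not necessarily the exact rule used:

```python
# Framing labels mapped to signs; the reaction follows the product of signs,
# so a negative answer to a negatively framed question reads as good news.
SIGN = {"positive": 1, "negative": -1, "neutral": 0}
REACTION = {1: "Sounds good", -1: "That's hard to hear", 0: "Got it"}

def match_reaction(question_framing, answer_framing):
    """Compose separately classified framings into a reaction class."""
    return REACTION[SIGN[question_framing] * SIGN[answer_framing]]
```

Under such a rule, flipping the question label flips the matched reaction for every answer option at once: with the deep-sleep question correctly labeled positive, the answer “Rarely...” gets “That's hard to hear”, but with the question misclassified as negative the same answer gets “Sounds good” — the Figure 6.8 failure.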
A simple editing tool, which would allow a survey administrator to correct the empathy framing classification via a GUI drop-down, could reduce such effort to just 2 clicks and also further provide training data for improving future classification accuracy (see Figure 6.8-right). Figure 6.8: Correcting reactions - Left: manual correction of a question misclassification results in the need to rewrite all the reactions (a cost of 66 character edits). Right: with GUI support from an editing tool, the correction could involve just re-labeling the question (a cost of 2 mouse clicks). A similar magnitude of correction effort reduction applies to scenarios where a particular misclassified answer option is frequently reused throughout the survey. This happened in the Sleep Quality survey when a misclassified answer option “Sometimes: 1-2 times a week” was used in all 18 questions, leading to a need for manual correction of all associated reactions (213 character edits, representing close to 32% of all the manual correction effort for this survey). With tool support, such effort could be reduced to just 2 clicks, as a correction in one answer could be propagated to all its occurrences in the survey. 6.5.3 Automation Performance and Capability Improvements To enable ML automation with limited domain-specific training data, I intentionally reduced the adaptation problem to a series of simple text classification tasks. The classical ML algorithms employed, with some custom feature tuning, are in line with the performance that could be achieved using some off-the-shelf end-user tools such as LUIS4 or MonkeyLearn5. The automation can further be improved by 1) increasing the accuracy within the current problem framing (i.e., keeping tasks as text classification problems) or by 2) relaxing the problem beyond this definition (e.g., unconstrained text generation) to enable modeling more complex relations. I discuss these two ideas further.
Accuracy Improvements: Higher accuracy can be achieved by: 1) collecting more labelled data and 2) using more capable algorithms. The current dataset (including development and hold-out) contains 22 surveys with 357 question examples labelled for phrasing and empathy framing, as well as 812 answer examples labelled for empathy framing. More survey examples could be automatically scraped from various sources such as SurveyMonkey templates6 and QuestionPro templates7. These repositories, while convenient for scraping, contain only informal surveys. Validated tools included in research papers and white papers are harder to obtain at scale. The text classification simplification I used would make the labeling task easy to automate via a crowd-sourcing approach. Beyond and in addition to obtaining more labelled data, more capable text classification algorithms can be employed. Unfortunately, a more capable algorithm in the presence of limited data is unlikely to yield better results [99] and can be prone to overfitting. A promising solution to this limitation is a domain adaptation approach, where a very capable language model comes already pre-trained on vast amounts of general language data [34]. An initial exploration of such an approach using a pre-trained BERT model8 fine-tuned on my limited dataset shows promising results, increasing reaction matching accuracy from 0.57 ± .15 to 0.68 ± .11 and question language adaptation from 0.68 ± .33 to 0.81 ± .24 in a leave-one-out setup (see Table 6.10).
Going forward, models pre-trained on sentiment analysis can be used as a basis for empathy framing classification [20] and survey domain detection can leverage topic models [245]. An empathetic reaction generation model from a mental health domain could also potentially be adapted [170].
4LUIS - a machine learning-based service to build natural language into applications - https://www.luis.ai/
5MonkeyLearn - a machine learning service for designers and developers - https://monkeylearn.com/
6https://www.surveymonkey.com/mp/university-student-satisfaction-survey-template/
7https://www.questionpro.com/survey-templates/
8bert-base-uncased from the Huggingface library: https://huggingface.co/transformers
Table 6.10: Comparison of accuracy for the classical ML model used and a pre-trained deep learning model fine-tuned on the task dataset
Capability Improvements: Improved automation capability (e.g., being able to include broader context or even generate reactions from scratch word by word) can be achieved by re-framing and relaxing the problem definition in various ways. For example, instead of separately classifying the question and answer for empathy framing and then selecting an appropriate empathetic reaction using a fixed rule, the classification could use combined features to directly select the best empathetic reaction class. Initial exploration of such joint classification showed an improvement in reaction classification accuracy from 0.57 ± .15 to 0.70 ± .29 in a leave-one-out evaluation. Additionally, the reaction selection context could include information about the survey domain in some form. Further expanded context would unfortunately also require more training data and may not accommodate the GUI-supported corrections described in the previous section equally well. In the most unconstrained fashion, conversational question rephrasing could be seen as a translation task (similar to language translation [68]).
Existing survey items would be treated as text in one language that needs to be translated into a “conversational” language. Similarly, empathetic reactions could be treated as a general chat utterance generation problem [213]. Both approaches would not impose any design constraints, but would require large amounts of domain-specific data and would likely provide less control over the output [233]. 6.5.4 Intrinsic Challenges of Conversational Survey Adaptation Several challenges seem to be particularly difficult due to conflicts with surveying practices, or because the type of data needed to address them is unlikely to ever be available. Rephrasing 2nd and 1st Person Survey Items: The current design tries to make sure that all conversational utterances are questions in 3rd person form. This choice is, however, not the only possibility. In some cases survey questions are intentionally written in 1st or 2nd person to facilitate more honest answers [108]. In other cases the questions might have been tested in the exact form presented and changes could invalidate the survey. From a design perspective, it might make sense to provide alternative ways of rephrasing such questions, e.g., the question “I feel in control of my current financial situation.” could be rephrased as “Is it fair to say that you feel in control of your current financial situation?”, but also as “How would you respond to the following statement: ‘I feel in control of my current financial situation.’?” This challenge requires further research. Addressing Survey-Intrinsic Question Repetition: Repetitiveness of conversational questions has been reported by some users. The design approach tried to minimize it by attaching dynamically varying conversational prefixes, but some repetitiveness is likely intrinsic and intentional, especially in validated surveys [108]. It is possible that users tolerate the repetition more in form-based surveys than in conversation, where it is not ‘natural’.
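The two alternative rephrasings discussed above can be produced by simple templates. The sketch below is illustrative only: the naive pronoun substitution handles the example item, but a deployed tool would need real grammatical transformation.

```python
import re

def rephrase_direct(item):
    """'I feel X about my Y.' -> 'Is it fair to say that you feel X about your Y?'
    Naive sketch: only handles items starting with 'I' and simple possessives."""
    body = item.rstrip(". ")
    body = re.sub(r"^I\s+", "you ", body)  # naive 1st -> 2nd person shift
    body = body.replace(" my ", " your ")  # naive possessive shift
    return f"Is it fair to say that {body}?"

def rephrase_quoted(item):
    """Preserve the validated wording by quoting the item verbatim."""
    return f"How would you respond to the following statement: '{item}'?"
```

The quoted variant has the advantage of leaving the validated item text untouched, at the cost of a less natural conversational flow.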
It is also possible that the increased engagement and attention with conversational administration [124] makes users notice repetitions more. It seems that two approaches could be possible: 1) either a new standard for obtaining validity could be defined (as suggested in [124]), or 2) the chatbot could manage user expectations by explicitly announcing in the introduction, e.g., “I will ask you some questions that might seem repetitive, but this is intentional to make sure we have a common understanding of the topic”, or preface such a similar question with “I know I asked you a similar question before, but...”. Conversational Form of Tabular Questions: Several surveys organize questions in a tabular form to help streamline answering, especially when all the questions are answered on a common scale (e.g., Likert). A good example is the Big 5 personality survey, which in the PDF format is composed of a prefix phrase: “I see myself as someone who...”, and each table row then refers to a specific concept, e.g., “...is reserved”, “...is generally trusting”, answered on a common 5-point Likert scale from “Disagree strongly” to “Agree strongly”. The current adaptation approach concatenates the prefix and each concept to form a separate survey item, e.g., “I see myself as someone who is reserved”, “I see myself as someone who is generally trusting”. Although this is the approach taken in prior work using manual adaptation [124], it may introduce additional repetition, even after the diversified conversational prefix is appended, rendering: “Moving on, is it fair to say that you see yourself as someone who is reserved?”, “Do you think it’s fair to say that you see yourself as someone who tends to be lazy?” A different approach might be to support such questions with a rich GUI element rendered directly in the chat window. Rich GUI elements are suggested in prior work [127].
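The tabular expansion and prefix diversification described above amount to simple string operations. In this toy sketch, the conversational prefixes and the pronoun shift are illustrative assumptions, not the repository's actual phrases:

```python
from itertools import cycle

# Illustrative conversational prefixes; the real repository diversifies these.
PREFIXES = ["Moving on, is it fair to say that", "Do you think it's fair to say that"]

def expand_table(shared_prefix, rows):
    """Concatenate the shared table prefix with each row, then attach a
    rotating conversational prefix and shift the item to 2nd person."""
    prefix_iter = cycle(PREFIXES)
    items = []
    for row in rows:
        statement = f"{shared_prefix} {row}".replace("I see myself", "you see yourself")
        items.append(f"{next(prefix_iter)} {statement}?")
    return items
```

Even with the rotating prefixes, every expanded item repeats the shared stem, which is exactly the residual repetition the rich-GUI alternative would avoid.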
Matching Socialization and Empathy to User Characteristics: Some users did not seem to appreciate empathetic reactions in conversational surveys in general, indicating that “There’s no need for too much empathy, acknowledging my replies is enough” (P30). This echoes general findings from prior work reporting that some users do not react well to, or expect, socialization in chat [147]. Similarly, some user groups may appreciate a different style of conversational adaptation (e.g., a casual style for teenagers as in [124] or a formal style for older audiences [129]). This is in principle supported through meta-parameters and a repository of phrases (discussed in the subsequent section), but the ability to automatically tailor the style to an individual would require user-specific data that might not be available. 6.5.5 Additional Augmentation Tasks; Prototyping & Tailoring Support The automation approach I proposed in this work provides an opportunity not only to lower the conversational design effort, but also to easily support additional extensions and to facilitate research and practice of designing conversational interactions. Expanding the Set of Augmentation Tasks: The initial exploration performed in this work limited the conversational adaptation aspects to a set of key tasks, which can arguably be seen as representing a minimal set of needed adaptations. In other chapters I have shown that different conversational design aspects can offer a tailored experience to serve the needs of different populations. The automation framework I proposed can easily be extended with such additional components. Text-to-speech generation can be added to provide voice capabilities, as in Chapter 5, using commercial tools9. Similarly, the language paraphrasing for understandability employed in Chapter 5 could be integrated as an additional adaptation task using off-the-shelf models [248, 160].
Additional augmentation tasks, such as adding a domain-matching chat avatar icon, could be supported as well by leveraging existing work on matching text and images [231]. These could be added as modules that can be turned on or off as needed to easily render conversational interactions with different properties. Leveraging the Repository for Altering Chatbot Language Personality: In a similar fashion, more substantial changes to the language can be supported by leveraging the provided repository-based approach. The ML models used to retrieve phrases from the repository rely on survey content and general augmentation categories, but not on the concrete text of the repository utterances. What this means is that a different set of augmentation phrases can be provided without the need to retrain any of the models. This can be used to render a different chatbot language personality. Instead of the current professional and polite style (i.e., “Hi, my name is {name}. I would like to talk to you about {topic}.”), a different repository could be linked with, e.g., an informal teenage introduction style, such as “Hey, awesome to meet you, I am {name}, let’s chat about {topic}”, and similar expressions of compassion (e.g., “Wow, that really sucks :(”) and satisfaction (e.g., “That’s really awesome!”). The augmentation categories provided by the current repository inform the types of phrases that need to be provided to render a consistently different language personality.
9https://aws.amazon.com/polly/
Support for Prototyping & Tailoring: Taken together, the ability to turn different augmentation aspects on and off, as well as the ability to easily replace the chatbot language without the need to retrain any of the ML models, offers several opportunities. It enables quick prototyping of different variations of conversational interactions for the purpose of design exploration.
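The repository-swap idea can be sketched as follows: the models select augmentation categories, while the linked repository supplies the concrete wording, so replacing the repository re-renders the whole interaction in a new voice without retraining. The category names and phrases below are illustrative assumptions:

```python
# Two hypothetical phrase repositories keyed by the same augmentation categories.
PROFESSIONAL = {
    "introduction": "Hi, my name is {name}. I would like to talk to you about {topic}.",
    "compassion": "I am sorry to hear that.",
    "satisfaction": "Great to hear!",
}
TEEN = {
    "introduction": "Hey, awesome to meet you, I am {name}, let's chat about {topic}",
    "compassion": "Wow, that really sucks :(",
    "satisfaction": "That's really awesome!",
}

def render(category, repository, **slots):
    """Render the category selected by the ML models with the active repository."""
    return repository[category].format(**slots)
```

Switching the chatbot's personality is then just a matter of passing a different repository to `render`, e.g., `render("introduction", TEEN, name="Robo", topic="sleep")`, with no model changes.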
It can be used for quick A/B testing of different designs with different user populations to help answer some of the persistent questions in conversational design related to the level of socialization [146] and chatbot personality [41]. A fully automated approach helps ensure consistency of comparisons in such use cases. Finally, detailed parametric control over the augmentation aspects provides a step towards automated tailoring of conversational interactions to different populations, in case user profiles are available. 6.6 Summary of Contribution In this chapter I explored the feasibility of supporting the design of engaging conversational survey administration with automation, in order to automate the design process I had to perform manually in Chapter 5. To do that, I built upon the common linguistic and dialogue aspects of an engaging conversational interaction identified in my prior work, such as content diversification, conversational language style, contextual reactions, social dialogue & empathy, as well as conversational etiquette. I defined the conversational adaptation of a survey as composed of 4 tasks: 1) addition of introduction & closing, 2) addition of contextual empathetic reactions, 3) addition of progress communication & topic handling, and 4) adaptation of question language to conversational style. This adaptation is also guided by 4 design principles: 1) avoiding repetitiveness, 2) minimizing changes to the original survey items, 3) contextual use of empathy, and 4) audience-sensitive augmentation. The tasks rely on retrieving phrases from a reusable repository of 118 conversational augmentation phrases informed by prior work [129, 78, 239, 125] and linguistic resources [86]. I further employed data-driven machine learning (ML) techniques to enable the automation to learn from example survey data.
In an evaluation with 30 respondents from a crowd-sourcing platform, I show that the proposed approach can produce engaging conversational surveys (comparable to my manual design used in Chapter 5) with only 30% additional manual correction effort (as opposed to the 100% effort of performing the adaptation from scratch). I also discuss the remaining issues and further improvements related to 1) design definition improvements, 2) user manual correction effort reduction opportunities, 3) automation performance & capability improvements, and 4) intrinsic challenges related to trade-offs between survey requirements and an ideal conversational experience. My work contributes to the understanding of what it means for a survey to be conversational. It systematizes and automates the steps needed to make any survey conversational, advancing prior work which relied on one-time manual redesigns. I also highlight some of the intrinsic trade-offs between survey administration requirements (i.e., dictated by validity) and an engaging conversational experience. Outcomes of my work can directly support survey administrators, particularly those without a design background, in creating more engaging data collection experiences for their respondents with less effort. My work can also help engineers of data-driven conversational systems train their approaches with less data by leveraging the survey domain knowledge insights I provided. Chapter 7 DISCUSSION Technology has been used to support health & behavior change applications, but despite offering valuable support for automating various aspects [165, 93], it has generally struggled with supporting user engagement [45, 82]. Conversational agents, whose recently resurgent popularity has been fueled by advancements in technology, enable rethinking the support technology can offer in this context to engage and motivate users.
Unique human-likeness aspects of conversational agents can make them more engaging [27, 239], and coupled with the ease of incorporating additional application-specific features (e.g., use of voice, adaptive behavior), make them valuable for supporting various user groups (e.g., low literacy) in different contexts (e.g., workplace) and for otherwise challenging purposes (e.g., reflection on behavior). In this dissertation, I demonstrated how conversational interfaces can be designed to improve user engagement in the key health & behavior change challenges of activity promotion, learning from past behaviors (reflection), and data collection. My work further expands our understanding of user perceptions of such systems, as well as their strengths & weaknesses. I propose concrete design & implementation artifacts, as well as several well-documented reusable design processes for reproducing the proposed designs in different contexts. Finally, I also address the substantial effort required to design an engaging conversational experience, by exploring the use of automation to support non-designers in applying the findings of my work in their applications with ease. Through these findings I have demonstrated the claims made in my thesis statement. Here I summarize the key benefits and challenges conversational design can offer, especially in the health & behavior change domain, which I have identified in my work. 7.1 Benefits of Conversational Design in Health & Behavior Change Across the conversational applications I have designed, implemented & evaluated in this work, I identified several common benefits a well-designed conversational interaction can offer. Through the exploration of these benefits I have confirmed my thesis statement, showing that conversational interactions can be designed to support user engagement in health & behavior change applications.
7.1.1 Engagement in Interaction Across the studies, users consistently reported the benefits of increased engagement when interacting with the conversational systems I designed. In Chapter 3, users reported engagement with the conversational prompts’ contents, specifically crediting the personalization and content diversification aspects. Users reported increased attention (an engagement dimension [180]), informational value, and personal relevance. All of these factors led them to be more engaged in interaction with conversationally redesigned prompts as compared to the baseline non-conversational ones. Similar engagement was reported in Chapter 4 in the context of workspace reflection with Robota. Users reported being more engaged due to perceived benefits such as increased awareness of work tasks, improvements in work organization & productivity, as well as new perspectives and understanding of their high-level career goals. In this context, the engagement mechanism based on perceived benefits of interaction echoes a dimension from the relationship investment model [201]. Several participants also felt more engaged with the voice, finding it fun and enjoying the ‘surprise’ aspects of changing voice interaction. Some of this might be associated with novelty. The field study with Reflection Companion in Chapter 4 provided strong behavioral evidence of user engagement, when 11 of the 33 participants elected to use the system for an additional 2 weeks without any compensation (this is on top of the 2-week paid study period). They also used the system very actively, putting in the effort to type free-text replies to 83% of the daily dialogues they received during this time. In Chapter 5, direct measurements with an engagement survey from O’Brien’s engagement scale [180] revealed conversational engagement benefits for a vulnerable user group (low health literacy) in a particularly sensitive data collection setting.
I complemented these findings with evidence of high engagement on the same scale across broader survey-based conversational data collection with the general population in Chapter 6. Engagement benefits, or the potential for such benefits, have been reported by various studies, especially from Bickmore et al., using embodied conversational agents in the medical health domain [30]. My work further expands on these findings in the context of health & well-being with non-embodied, more cost-effective chat-like interactions. Furthermore, several studies reported issues of interaction repetitiveness leading users to lose motivation to continue using the agent and follow recommendations [28]. A similar effect has been reported in computer counseling, where the approach itself can be effective if users stay with the system, but high drop-out rates limit the positive impact [190]. My work contributes important content diversification techniques that help keep users engaged. I also offer a detailed understanding of the mechanisms leading to increased engagement in the novel context of reflection. 7.1.2 Motivation to Perform Activity Across several studies, users reported increased motivation to perform activities directly or indirectly promoted by the conversational agent. In Chapter 3, the diversified conversational prompts were promoting physical exercise challenges 4 times a day. I have shown through quantitative analysis that this significantly increased user activity, making users 3.7 times more likely to exercise in the 2-week study period as compared to baseline. In Chapter 4, although Reflection Companion did not directly promote activity, several users had their own goals related to being more active. These users reported that through conversational reflection they gained increased motivation thanks to a sense of accountability to the agent and an improved understanding of their behavior barriers.
They also discovered small, concrete, attainable steps, and were able to construct better thought-out action plans. Several users directly reported new behaviors, usually small changes in routine or returning to positive past behaviors that had been abandoned. Users also reported additional side activities facilitating physical activity, such as wearing their fitness tracker more consistently or scheduling classes at a gym. In the workspace setting in Chapter 4, reflection with Robota also indirectly increased motivation and productivity. Users reported that awareness of the limited progress they felt they had made led them to try to be more productive. They reported being worried they would have nothing to write in their communication with the agent at the end of the day (this echoes the accountability from the physical activity setting). Users also reported that the interaction helped with side tasks facilitating work. It specifically helped with composing reports due to the ability to quickly recall things and the fact that the interaction logs served as a source of concrete information. Several users also reported that work reflection helped them organize daily activities by facilitating planning and making sure important things were not forgotten. Prior work reported persuasive capabilities of conversational agents [29, 37, 207]. These works usually relied on direct persuasion techniques, which can be effective, but can also lead to undesired side effects such as reactance [238] (where a user pressed with excessive persuasion does the opposite) and with time have been shown to suffer from high drop-out rates [28]. My work shows an important indirect route to motivation and activity promotion through conversational reflection.
Further, the ability to indirectly promote an activity, which I introduced in Chapter 3, is valuable for sensitive settings where mentioning the topic of persuasion directly can be especially detrimental (e.g., antismoking campaigns talking about smoking may actually induce more smoking among smokers [95]). Finally, my work shows that some benefits similar to those of human counseling sessions can be achieved with well-designed conversational agents, which can lower the cost and fill the gap in support for some users. My work also raises some ethical questions about the impact agents could have on user stress at work, which should be investigated further.

7.1.3 Accessibility: Familiarity & Understandability

Throughout my work I have demonstrated several accessibility benefits of conversational interaction. In Chapter 3 and in Chapter 4 with Reflection Companion, I used mobile text communication (SMS/MMS), which had the benefit of familiarity and ease of use. Users specifically reported the lower barrier to starting to use the system, as they did not have to install additional applications on their mobile devices or learn any new interface. These benefits have been reported in prior work [85]. In Chapter 4, in the example of workspace reflection with Robota, I used a company-internal Slack-like chat platform, which was also easy to use and familiar. In this work I also used voice-assistant-like interaction with an Alexa-enabled personal mobile device to support reflection. For the voice part, users specifically commented on the ease and speed of answering questions using voice. They also praised the ability to quickly capture points or thoughts with voice. Voice recording was also considered easier for non-native speakers than writing. Finally, in Chapter 5, I augmented the standard chat interface with voice readout (which was also available on demand) and an ability to rephrase the question asked by the agent.
These features were specifically praised by the low health literacy users, who directly reported that the system facilitated their understanding. These features also occasionally helped high literacy users who had vision problems or felt fatigued and found that the audio feature lowered their interaction effort. In general, the accessibility benefits of conversational interaction I explored can be categorized as related to familiarity & ease of use and to understandability. The first type of benefit has been indicated in prior work [13] praising frictionless and natural interaction that can replace mobile apps [127]. Naturally, not all interactions are easier via mobile text, and a few reviews pointed to the difficulty of properly designing mobile text prompts to exploit the benefits and avoid the challenges [80, 102]. My work offers a confirmation of the familiarity benefits of chat-based interaction in new settings, particularly reflection. Prior work reported the potential for audio to improve understandability among low literacy users in the medical domain [96]. My work expands on these findings and improves our understanding of the benefits of the use of audio for sensitive data collection in a hospital setting. Further, I contribute conversational question rephrasing as another understandability-enhancing feature.

7.1.4 Comfort & Sharing

In several applications, users were willing to share and disclose personal information to the agent, even when not directly asked to do so. In interactions with Reflection Companion in Chapter 4, users shared various personal aspects with the agent in their daily mini-dialogues. They often described their personal and work plans, activities, and detailed schedules. They shared many personal aspects, such as their relationships with friends and family. They also often freely shared their emotional states, 'telling' Reflection Companion they felt stressed, lazy, annoyed, or even jealous of the physical fitness of their friends.
Several users also felt comfortable using the interactions for venting. It is worth noting that the system did not directly ask about any of these personal aspects. Surprisingly, some similar sharing also took place in the semi-public workspace setting with Robota in Chapter 4. This was especially the case with the dedicated voice channel, where several users felt the interaction was more personal and reported feeling like Robota 'cares about them'. This feeling also made some users consider Robota more as a counselor or therapist to whom they can vent when they are unhappy and want to complain. While in the reflection setting the agents never asked about specific personal or sensitive details, my work in Chapter 5 on social needs screening involved highly sensitive direct questions about 'sexual abuse', 'violence', and 'extreme poverty' asked of a vulnerable population. The agent also featured empathetic design. In this setting, users' willingness to share was somewhat inconclusive, with some users reporting that human-likeness features made them more comfortable to share, while others felt just the opposite. Still, even in this highly sensitive setting, many participants described the agent as 'caring', 'helpful', and 'concerned'. Several also reported that the empathetic aspects were 'calming' and gave them 'confidence'. These findings were echoed in Chapter 6, where I applied conversational empathy design to a broader set of surveys, which revealed that the impact of empathy design on user comfort is highly contextual and possibly user-specific. These indications support the strong potential of the conversational approach to establish trust & encourage sharing. Prior work indicated such potential in controlled lab experiments with embodied agents [59], in a job interview lab setting [145], as well as in stress relief in a crowd-sourced setting [204].
My work builds on these findings by exploring sharing and self-disclosure in the novel reflection setting as well as in a semi-public space, in the naturalistic settings of longer-term field studies. I also provide novel insights into the complex mechanisms by which human-likeness can affect willingness to share for different users. My findings suggest the need to understand individual user characteristics and the potential for tailoring empathy features. They also highlight important ethical issues of user self-disclosure which should be considered.

7.1.5 Guidance

Conversational interaction offers natural support for sequential interaction, which could be particularly beneficial in health & behavior change. In Chapter 4, I leveraged this aspect by supporting reflection on physical activity. I designed the dialogue progression to mimic the progression of a structured reflection process based on a theoretical model [14]. I found that the dialogue guidance encouraged deeper thinking and more meaningful answers in reflection, and also extended the time users spent reflecting. Users also reported that having a bigger, overwhelming reflection task split into small, more manageable pieces guided by the dialogue lowered the effort of reflection for them. In that sense, dialogue can be used to decompose a complex task into smaller, more manageable activities, following the recommendations of goal-setting theory [152]. Another benefit of dialogue guidance is helping users avoid being stuck in negative perceptions [10]. This is something I found in the pre-study workshop, when one of the participants reported being discouraged from looking at their own activity graph due to fear of low performance. In the design of Robota, I also used dialogue progression to support reflection on workspace productivity. In this setting, the dialogue was designed to connect separate activities (i.e., journaling & reflection).
The journaling was a task beneficial for work and report writing, while reflection was meant to engage users in a personally meaningful task that could benefit their professional development. Users appreciated these benefits and also liked the connection between the dialogue stages by means of mentioning the work tasks they scheduled. Given that activity reporting is something users might need to do regularly, a dialogue connecting such aspects can help form a habit [227]. Several theoretical conceptualizations of user behavior in health & behavior change propose processes (e.g., personal informatics [142], structured reflection [14]) or cyclical processes (e.g., stages of change [191], lived informatics model [73]) to capture the user journey. The sequential aspect of conversational interaction seems particularly well suited to support such progression at macro and micro scales. Furthermore, taking inspiration from human health coaches, a conversational interaction can proactively guide users in specific beneficial directions [198]. My work specifically contributes to our knowledge of how to design dialogue-based support for such processes and how to make going through the cycles novel, personalized, and engaging.

7.2 Challenges of Explored Conversational Design for Health & Behavior Change

My work uncovered several challenges that designers of conversational systems would likely have to address in their designs. It is important to note that the challenges I identified are to some extent related to the applications and implementations of conversational interaction I provided in my work. I therefore relate these challenges to the broader literature on conversational interfaces.

7.2.1 Efficiency

Several conversational interactions I designed in this work revealed the somewhat lower efficiency and higher interaction effort conversational interfaces can introduce in some situations.
In Chapter 4, users interacted with Reflection Companion by typing in responses to the agent on their mobile phone. While they perceived this as valuable for their engagement, they also felt it required additional typing effort. It is worth noting that this design choice is arguably intertwined with the reflection support purpose, where free expression could be particularly valuable. In interactions with Robota, which offered voice-based and text-based communication, users considered it easier to read than to listen to voice, especially when the questions were long or complex. At the same time, typing was considered more time consuming and effortful than providing responses with voice. This shows how voice and text modalities can influence interaction efficiency. Finally, the application of HarborBot with high and low literacy populations in Chapter 5 provided the most insight into the efficiency challenges. This application focused on data collection and hence required the highest amount of input from the users. While HarborBot supported structured graphical input to lower effort, the use of voice readout, reaction delays, and additional socialization utterances lowered interaction speed, which was perceived negatively by the high literacy users. It is worth noting that low literacy users did not mind the lower speed, as the understandability benefits outweighed these shortcomings for them. Efficiency being an issue is not uncommon for conversational agents. Prior work indicated that waiting for audio readout can be less efficient [223]. Several works, specifically in conversational survey data collection, reported longer completion times and lower perceived efficiency compared to form-based methods [124, 239]. Perceived efficiency in the broader context of conversational interaction is an important underlying theme of many task-oriented uses of such systems [156, 112, 97].
My work, however, shows that efficiency is not a universal problem for all populations. Furthermore, the detailed understanding of the specific causes of perceived inefficiency uncovered by my work could make it possible to optimize waiting times and tailor the interaction (specifically the use of audio) to satisfy all user groups.

7.2.2 Artificial Feel

Across all the studies, users reported certain aspects of interaction that felt 'artificial'. In Chapter 3, the conversational triggers relied on topic and lexical diversification mimicking the conversational diversity reported in [78], which felt more natural in general. This diversification, however, invited higher scrutiny of content. When the content started repeating after some time, almost all the users noticed it and considered it artificial. This, paired with the fact that people remember the negative more than the positive [21], led to a quantitatively lower rating of helpfulness than for the fixed repetitive prompt from the baseline. This shows that the illusion of natural conversational interaction can easily be broken. In Chapter 4, the Reflection Companion mini-dialogues created an expectation of an 'intelligent' and 'meaningful' follow-up to the user's free-text response in the first part of the dialogue. If this did not materialize to the user's satisfaction, the follow-up was reported as 'generic' and 'computerized'. Similarly, several aspects of the HarborBot system for social needs screening in Chapter 5 felt artificial to the users. The biggest contributor to the artificial feel in that work was the text-to-speech voice used, which was described as 'truncated' & 'monotone'. Such limitations of voice, combined with the sensitive questions, made some users describe HarborBot as 'pushy' and the interaction as coming from a teacher. Secondly, due to the underlying survey contents, users felt some information was asked for repeatedly, even after they declined to answer.
Finally, in this system, issues with contextual empathy matching led users to see the agent's reactions as 'defaults' and the agent to feel 'fake' and reminiscent of customer support. Several of these aspects were echoed in Chapter 6, where users also complained about the questionable use of empathy in some contexts, the artificiality of being asked the same or similar question repeatedly, and the repetitive nature of the progression communication utterances. There are a few things here to consider. First, despite these negative perceptions, the conversational systems were still largely successful in accomplishing their goals of engaging users. Secondly, several of these challenges are related to the provided implementations, and most of them can feasibly be resolved with current technology for particular applications. Some are arguably more technically challenging, such as the quality of text-to-speech [193]. Thirdly, the resolution of these issues seems to be a trade-off between quality and cost & design effort (e.g., text-to-speech can be replaced with human voice recordings; richer content can be crowd-sourced to avoid repetition).

7.2.3 High Expectations, Contextual & Social Intelligence

While the artificial feel I described earlier relates to relatively small aspects that felt unnatural or disappointing about the interaction, users also reported more fundamental issues related to the agent's contextual knowledge and the fundamental acceptance of a computerized system acting socially or emotionally. In Chapter 3, the users expected the conversational prompts promoting physical activity to be somewhat 'intelligent' and 'meaningful'. They expected the agent to be aware of their status and the prompts' contents to fit the context of their activity, location, and schedule. Furthermore, they expected the contents of the conversational prompts to always supply new and unexpected information they could learn from.
In the workspace setting, Robota asked several personalized questions which mentioned the user's previously journaled work tasks. While this specificity was appreciated for its personal focus, the participants complained that the agent picked tasks that were not meaningful for them (e.g., routine tasks or tasks that were not challenging). Users expected the system to be aware of the specifics of their work and also capable of deciding which tasks are the most meaningful for them to reflect on (it is worth noting that we used a wizard-of-oz approach for this task, which shows that it can be fundamentally challenging even for a human). The HarborBot agent for social needs screening I described in Chapter 5 employed empathy to ease users into answering sensitive questions. Aside from challenges with matching empathy to context, several users reported that even if they felt the social and empathetic utterances were well designed, they would simply not subscribe to the 'illusion' that a computer system can or should exhibit such qualities. Similar findings were echoed in Chapter 6. There are several aspects here to consider. In some cases, user expectations of the conversational system's capabilities could arguably exceed what a human could do; hence user expectations of conversational agents may go beyond human-provided assistance. High expectations have been reported in the conversational agent context [156]. While rich contextual sensing is possible in principle, it is challenging in practice due to possible misinterpretations [203], and such sensing could raise several ethical and data privacy issues users might not take under full consideration. Finally, regarding users' fundamental acceptance of the social aspects of the agents, some past works pointed to a possibly fundamental individual preference for socialization in the agent context [147, 146]. This aspect has yet to be well explored.
My work contributes specific case studies further improving our understanding of the deeper challenges involved in designing conversational agents and users' varying expectations of their performance.

7.2.4 Effort of Creating Engaging Content

I explored several approaches to generating enticing content for conversational agent interaction in this work. In Chapter 3, I used crowd-sourcing, past literature, computational semantic relatedness, and theoretical models to create content that is diverse and novel, to engage users in repeated interactions. I used value profiling to make the conversations personalized. Similarly, in Chapter 4, I addressed the problem of diverse domain-specific content with workshops, past literature, and informal resources. Supporting personalization in the dialogue required content generation to incorporate user goals, work tasks, and fitness tracker data. In the HarborBot design, the core dialogue data relied on a predefined survey, but the additional conversational content required design in consultation with domain experts and careful empathy crafting. These examples show that the creation of engaging conversational content requires substantial effort, which has been acknowledged in prior work [97]. Part of the challenge, especially in the long-term behavior change domain, is the need for creating diverse and novel dialogues to keep users engaged, as I have demonstrated in Chapter 3. Another challenge is making the interaction personalized and specific to the domain of application. In Chapter 6, I separated out some of the common reusable parts of a conversational experience (e.g., acknowledgments, transitions, introduction) to lower the design effort with automation, but this offers just a first step in lowering such effort. Fully data-driven approaches are hard to apply as they require substantial dialogue data in a particular domain [88], are hard to control [110], and can still suffer from repetition and consistency issues [240, 247].
Recent developments in data-driven approaches leverage models pre-trained on large-scale generic dialogue data and fine-tune them to a specific domain. While these approaches are still being researched for dialogue systems [184], their successful application relies on the existence of small- to medium-scale domain-specific dialogue data. My work offers various design-driven processes to support data generation for such approaches.

Chapter 8 LIMITATIONS

There are several limitations of my work, some that apply to all of the applications and some specific to the particular contexts. First of all, while I evaluated all of the systems I developed in field deployments, which boosts their validity, the deployments have been fairly short (2-3 weeks), which makes it hard to estimate their long-term impact. Somewhat associated with the study lengths is the issue of novelty. Conversational interfaces are still fairly new and could attract additional attention due to this aspect alone. The fact that in my studies I complemented the quantitative findings with interview feedback linking the impact to particular conversational design elements mitigates this worry to some extent. Another common limitation relates to the sizes of the user groups, which varied from as few as 10 (Robota) to 33 users (Reflection Companion) in field studies. Small user groups could have limited my ability to statistically detect some true effects and could have introduced an outsized impact of outliers. Finally, while the challenges I addressed are general, I tested the conversational approaches on examples of particular applications in specific settings and with limited, specific user groups. This raises the worry that the results may not generalize to other user groups. This can be a worry especially in workspace reflection, which took place in a particular company. A similar limitation may apply to data collection in the emergency department (although it was conducted at two sites in different cities).
Chapter 9 FUTURE WORK

The findings in this dissertation reveal several possibilities for building on the work as well as new avenues for future research in the use of conversational interaction for health & behavior change:

Unified conversational support for multiple stages of behavior change: In this work I have explored applying conversational design to several concrete applications in health behavior change loosely aligned with the personal informatics stages of data collection, learning from data (reflection), and activity promotion & maintenance [142]. While these applications share several common challenges, such as repetition and engagement, I have applied separate conversational systems to support them. Naturally, complete behavior change support could benefit from a unification of these conversational systems under one coherent conversational-agent-driven support.

Exploring other challenges in behavior change: Similarly, while I explored the challenges and applications mentioned above, personal informatics models identified other areas that could benefit from conversational support, such as lapsing and re-engagement [73]. Similarly, the stages of change model [191] identifies the 'precontemplation' and 'contemplation' stages in the behavior change cycle. In these stages, people are unaware that their behavior is problematic or produces negative consequences. A conversational approach could try to engage users in these early stages as well. Particular challenges here would be to attract user attention, provide informational value to sustain user interest, and help guide the user to 'discover' a behavioral problem.

Studying long-term impact: Behavior change is a complex, difficult, long-term process. While this dissertation shows that short-term conversational support can improve user engagement and motivation, and even lead to increased behaviors, there is still a need to understand how to support this longitudinally.
While this need not be a study of a couple of years to understand the efficacy of the technology [126], the longitudinal nature of behavior change might surface different needs and support that users have as they work towards maintaining behavior [191]. How to design for longitudinal interactions that continuously keep people engaged is an open question. Especially the aspects of content diversification and novelty would need to be addressed. One possibility could be to support forming a habit of interaction with the conversational agent [225]. This requires future work.

Improving content diversification for the long term: One of the main challenges I identified in Chapter 3, and repeatedly encountered in other chapters, is the challenge of repetitiveness in long-term interactions. While I introduced several successful strategies to increase both the diversity of topics and the lexical diversity of the language, the problem was never entirely solved. Ultimately, the number of unique topics that the user could be presented with as motivation or prompted to reflect on is finite. As I found in Chapter 3, when users recognize repetition it has a detrimental impact on engagement. Future work could explore two different approaches to address this: 1) Given a sufficiently large set of topics, people might start forgetting past discussions. There might be an optimal threshold for total topic count dependent on the frequency of interaction. 2) Interaction could be designed to reuse and build on the same topics over time, possibly incorporating information the user shared in the past. Remembering information from users' past responses (e.g., a shared barrier of "not having a person to run with") would allow the agent to bring back such information in future interactions, e.g., "What could you do to try to find someone to run with?" or "Is not having someone to run with still an issue for you?".
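The two approaches above could be prototyped together in a minimal sketch. This is a hypothetical illustration only, not the implementation of any system in this dissertation; the `TopicScheduler` class, its `recency_window` parameter, and the follow-up phrasing are my assumptions for demonstration:

```python
import random
from collections import deque

class TopicScheduler:
    """Illustrative sketch: select conversation topics while avoiding
    recently used ones (approach 1), and store user-shared facts so the
    agent can refer back to them later (approach 2)."""

    def __init__(self, topics, recency_window=10):
        self.topics = list(topics)
        # Topics used within the window are excluded from selection;
        # the window size would be tuned to the frequency of interaction.
        self.recent = deque(maxlen=recency_window)
        self.memory = []  # facts the user shared, e.g., barriers

    def next_topic(self):
        candidates = [t for t in self.topics if t not in self.recent]
        if not candidates:  # pool exhausted: allow the oldest repeats
            candidates = self.topics
        topic = random.choice(candidates)
        self.recent.append(topic)
        return topic

    def remember(self, fact):
        """Store a fact extracted from a user's free-text response."""
        self.memory.append(fact)

    def follow_up(self):
        """Re-surface a past user-shared fact as a personalized prompt."""
        if not self.memory:
            return None
        fact = random.choice(self.memory)
        return f"Last time you mentioned {fact}. Is that still an issue for you?"
```

Here the recency window plays the role of the "forgetting threshold" from approach 1, while the stored memory entries enable the personalized follow-up prompts of approach 2. A real system would need far richer paraphrasing of both topics and follow-ups to avoid the lexical repetitiveness issues discussed in Chapter 3.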
Improving contextual understanding with sensing: In several of my deployments, users expected a certain level of contextual awareness from the conversational agent. In Chapter 3, users expected the motivational prompts to be tailored to their activity, location, and schedule. In the workspace reflection setting in Chapter 4, intelligently sensing a worker's context, recent activity, and main accomplishments would help workers derive greater meaning and insights and would likely lead to improved productivity and work satisfaction. A promising future direction could try to incorporate such contextual sensing to enhance the conversation.

Improved automation for design support: In Chapter 6, I explored an automated approach to lowering the design effort of conversational survey-based data collection. I showed the basic feasibility of automating the addition of some reusable components of the conversational approach: social etiquette, acknowledgments, empathy, and conversational language style. Yet, as I have demonstrated in other chapters, a lot of effort is required to generate domain-specific data via workshops, literature search, and crowd-sourcing. Future work could explore how such effort could be lowered further with the use of modern deep learning or other automation technologies.

Chapter 10 CONCLUSION

Health & well-being is increasingly important in modern society, with aging populations, obesity, mental health issues, and multitudes of other challenges. Technology has successfully supported many crucial aspects of behavior change, such as automated activity logging, visual analytics of behavior data, as well as facilitating health communication with peers and health professionals. Yet it has in many cases struggled with keeping users engaged, especially over longer time periods. Conversational interactions have demonstrated the potential for supporting user engagement and motivation, as well as providing various accessibility benefits for vulnerable populations in need.
In this work I take advantage of the technical advancements in conversational technology and the growing popularity of conversational systems to explore how they could play a role in addressing various health & behavior change challenges. In this thesis I designed and implemented four conversational systems: Fitness Challenges, Reflection Companion, Robota, and HarborBot, as well as a process to lower the design effort of engaging conversational data collection: Survey Converter. I evaluated these systems in multiple deployment studies in personal, workspace, as well as hospital settings. I demonstrate that conversational design can be used to successfully engage people in various aspects of health & behavior change, such as physical activity promotion, reflection on behavior, and increasing the understandability and comfort of sharing sensitive social needs data among vulnerable populations. Using mixed methods in my studies, I also take advantage of qualitative findings to understand the mechanisms by which specific aspects of conversational interactions affect users. My work identifies and provides evidence for several benefits of the use of conversational interactions in this space, pointing to engagement in interaction, improved motivation for performing activities, accessibility benefits related to familiarity, ease of use, comfort with sharing, and an ability to guide users in the behavior change process via dialogue. I also identify several important challenges, such as perceptions of artificiality, managing high expectations of contextual knowledge and social intelligence, as well as lower efficiency that could negatively affect the experience for some user groups. I further investigate the concrete links between conversational design elements and these benefits and challenges. My thesis demonstrates various design processes that can lower the effort of designing conversational experiences.
As technology progresses, conversational interactions can offer valuable support complementing the existing automated activity tracking and the efforts of health coaches. My work offers an important contribution to our understanding of how conversational interactions can play such a beneficial role.

BIBLIOGRAPHY

[1] The measurement of communication processes: Galileo theory and method. Contemporary Sociology, 11:328, 1982.

[2] Johan S Abildgaard, Per Ø Saksvik, and Karina Nielsen. How to measure the intervention process? An assessment of qualitative and quantitative approaches to data collection in the process evaluation of organizational interventions. Frontiers in Psychology, 7:1380, 2016.

[3] W. Abrahamse, L. Steg, C. Vlek, and T. Rothengatter. A review of intervention studies aimed at household energy conservation. Journal of Environmental Psychology, 25:273–291, 2005.

[4] Elena Agapie, Lucas Colusso, Sean A Munson, and Gary Hsieh. Plansourcing: Generating behavior change plans with friends and crowds. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 119–133, 2016.

[5] Lionbridge AI. 15 Best Chatbot Datasets for Machine Learning. https://lionbridge.ai/datasets/15-best-chatbot-datasets-for-machine-learning/, 2019. [Online; Retrieved September 27, 2020].

[6] I. Ajzen. The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50:179–211, 1991.

[7] Dolores Albarracín, Kristina Wilson, Man-pui Sally Chan, Marta Durantini, and Flor Sanchez. Action and inaction in multi-behaviour recommendations: a meta-analysis of lifestyle interventions. Health Psychology Review, 12(1):1–24, 2018.

[8] Rusul Alrubail. Scaffolding student reflections + sample questions. https://www.edutopia.org/discussion/scaffolding-student-reflections-sample-questions, 2015. [Online; Retrieved September 28, 2020].

[9] Rusul Alrubail. Scaffolding student reflections + sample questions.
Edutopia. Retrieved January, 8, 2018. 170 [10] Jessica S Ancker, Holly O Witteman, Baria Hafeez, Thierry Provencher, Mary Van de Graaf, and Esther Wei. “you get reminded you’re a sick person”: personal data tracking and patients with multiple chronic conditions. Journal of medical Internet research, 17(8):e202, 2015. [11] Otto Antikainen et al. Effective chatbot conversations: Experiments with bot identity and tone of voice. 2020. [12] Zahra Ashktorab, Mohit Jain, Q Vera Liao, and Justin D Weisz. Resilient chatbots: Repair strategy preferences for conversational breakdowns. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2019. [13] Julie A Ask, Michael Facemire, Andrew Hogan, and HB Conversations. The state of chatbots. Forrester. com report, 20, 2016. [14] Sue Atkins and Kathy Murphy. Reflection: a review of the literature. Journal of advanced nursing, 18(8):1188–1192, 1993. [15] Jeremy N Bailenson and Nick Yee. Digital chameleons: Automatic assimilation of non- verbal gestures in immersive virtual environments. Psychological science, 16(10):814– 819, 2005. [16] John D Bain, Roy Ballantyne, Jan Packer, and Colleen Mills. Using journal writing to enhance student teachers’ reflectivity during field experience placements. Teachers and Teaching, 5(1):51–73, 1999. [17] A. Bandura. Self-efficacy: toward a unifying theory of behavioral change. Psychological review, 84 2:191–215, 1977. [18] Aaron Bangor, Philip T Kortum, and James T Miller. An empirical evaluation of the system usability scale. Intl. Journal of Human–Computer Interaction, 24(6):574–594, 2008. [19] Yehuda Baruch. Career systems in transition. Personnel review, 2003. [20] Enkhbold Bataa and Joshua Wu. An investigation of transfer learning-based sentiment analysis in japanese. arXiv preprint arXiv:1905.09642, 2019. [21] R. Baumeister, E. Bratslavsky, C. Finkenauer, and K. Vohs. Bad is stronger than good. Review of General Psychology, 5:323 – 370, 2001. 
171 [22] Eric Baumer, Vera D. Khovanskaya, M. Matthews, Lindsay Reynolds, Victo- ria Schwanda Sosik, and G. Gay. Reviewing reflection: on the use of reflection in interactive system design. Proceedings of the 2014 conference on Designing interactive systems, 2014. [23] Eric PS Baumer. Reflective informatics: conceptual dimensions for designing tech- nologies of reflection. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 585–594, 2015. [24] Austin Beattie, Autumn P Edwards, and Chad Edwards. A bot and a smile: In- terpersonal impressions of chatbots and humans using emoji in computer-mediated communication. Communication Studies, 71(3):409–427, 2020. [25] E. Bessarabova, Edward L. Fink, and M. Turner. Reactance, restoration, and cognitive structure: Comparative statics. Human Communication Research, 39:339–364, 2013. [26] T. Bickmore, A. Gruber, and Rosalind W. Picard. Establishing the computer-patient working alliance in automated health behavior change interventions. Patient education and counseling, 59 1:21–30, 2005. [27] T. Bickmore, Daniel Mauer, F. Crespo, and T. Brown. Persuasion, task interruption and health regimen adherence. In PERSUASIVE, 2007. [28] T. Bickmore and Rosalind W. Picard. Establishing and maintaining long-term human- computer relationships. ACM Trans. Comput. Hum. Interact., 12:293–327, 2005. [29] Timothy Bickmore and Toni Giorgino. Health dialog systems for patients and con- sumers. Journal of biomedical informatics, 39(5):556–571, 2006. [30] Timothy Bickmore, Daniel Schulman, and Langxuan Yin. Maintaining engagement in long-term interventions with relational agents. Applied Artificial Intelligence, 24(6):648–666, 2010. [31] Timothy W Bickmore, Laura M Pfeifer, Donna Byron, Shaula Forsythe, Lori E Henault, Brian W Jack, Rebecca Silliman, and Michael K Paasche-Orlow. Usabil- ity of conversational agents by patients with inadequate health literacy: evidence from two clinical trials. 
Journal of health communication, 15(S2):197–210, 2010. [32] A. Bowling. Mode of questionnaire administration can have serious effects on data quality. Journal of public health, 27 3:281–91, 2005. 172 [33] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015. [34] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [35] John T Cacioppo and Richard E Petty. Effects of message repetition and position on cognitive response, recall, and persuasion. Journal of personality and Social Psychology, 37(1):97, 1979. [36] Ana Caraban, Evangelos Karapanos, Daniel Gonc¸alves, and Pedro Campos. 23 ways to nudge: A review of technology-mediated nudging in human-computer interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2019. [37] Valentina Carfora, Francesca Di Massimo, Rebecca Rastelli, Patrizia Catellani, and Marco Piastra. Dialogue management in conversational agents through psychology of persuasion and machine learning. Multimedia Tools and Applications, 79(47):35949– 35971, 2020. [38] Justine Cassell. Embodied conversational interface agents. Communications of the ACM, 43(4):70–78, 2000. [39] Justine Cassell and Kristinn R Thorisson. The power of a nod and a glance: En- velope vs. emotional feedback in animated conversational agents. Applied Artificial Intelligence, 13(4-5):519–538, 1999. [40] Pew Research Center. Demographics of mobile device ownership and adoption in the united states. https://www.pewresearch.org/internet/fact-sheet/mobile/, 2019. [Online; Retrieved September 29, 2020]. [41] Ana Paula Chaves and Marco Aurelio Gerosa. How should my chatbot interact? a survey on human-chatbot interaction design. 
arXiv preprint arXiv:1904.02743, 2019. [42] Jilin Chen, Gary Hsieh, Jalal U Mahmud, and Jeffrey Nichols. Understanding indi- viduals’ personal values from social media word use. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pages 405–414, 2014. 173 [43] Yukina Chen. The Effects of Question Customization on the Quality of an Open-Ended Question. Nebraska Department of Education, Data, Research, and Evaluation, 2017. [44] H. Chiu, Nadia Batara, R. Stenstrom, L. Carley, C. Jones, L. Cuthbertson, and E. Graf- stein. Feasibility of using emergency department patient experience surveys as a proxy for equity of care. Patient Experience Journal, 1:78–86, 2014. [45] Eun Kyoung Choe, Bongshin Lee, Matthew Kay, Wanda Pratt, and Julie A Kientz. Sleeptight: low-burden, self-monitoring technology for capturing and reflecting on sleep behaviors. In Proceedings of the 2015 ACM International Joint Conference on Perva- sive and Ubiquitous Computing, pages 121–132, 2015. [46] Chia-Fang Chung, Elena Agapie, Jessica Schroeder, Sonali Mishra, James Fogarty, and Sean A Munson. When personal tracking becomes social: Examining the use of instagram for healthy eating. In Proceedings of the 2017 CHI Conference on human factors in computing systems, pages 1674–1687, 2017. [47] Chia-Fang Chung, Kristin Dew, Allison Cole, Jasmine Zia, James Fogarty, Julie A Kientz, and Sean A Munson. Boundary negotiating artifacts in personal informat- ics: patient-provider collaboration with patient-generated data. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 770–786, 2016. [48] Chia-Fang Chung, N. Jensen, Irina A. Shklovski, and S. Munson. Finding the right fit: Understanding health tracking in workplace wellness programs. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017. 
[49] Leigh Clark, Nadia Pantidi, Orla Cooney, Philip Doyle, Diego Garaialde, Justin Ed- wards, Brendan Spillane, Emer Gilmartin, Christine Murad, Cosmin Munteanu, et al. What makes a good conversation? challenges in designing truly conversational agents. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2019. [50] Andrew Clarke and Robert Steele. A smartphone-based system for population-scale anonymized public health data collection and intervention. In 2014 47th Hawaii In- ternational Conference on System Sciences, pages 2908–2917. IEEE, 2014. [51] Heather L Coley, Rajani S Sadasivam, Jessica H Williams, Julie E Volkman, Yu- Mei Schoenberger, Connie L Kohler, Heather Sobko, Midge N Ray, Jeroan J Allison, Daniel E Ford, et al. Crowdsourced peer-versus expert-written smoking-cessation mes- sages. American journal of preventive medicine, 45(5):543–550, 2013. 174 [52] Mark Conner and Paul Norman. Health behaviour: Current issues and challenges, 2017. [53] Sunny Consolvo, Predrag Klasnja, David W McDonald, Daniel Avrahami, Jon Froehlich, Louis LeGrand, Ryan Libby, Keith Mosher, and James A Landay. Flowers or a robot army? encouraging awareness & activity with personal, mobile displays. In Proceedings of the 10th international conference on Ubiquitous computing, pages 54–63, 2008. [54] Sunny Consolvo, Predrag V. Klasnja, D. W. McDonald, and James A. Landay. Design- ing for healthy lifestyles: Design considerations for mobile technologies to encourage consumer health and wellness. Found. Trends Hum. Comput. Interact., 6:167–315, 2014. [55] Sunny Consolvo, D. W. McDonald, Tammy Toscos, Mike Y. Chen, Jon Froehlich, B. Harrison, Predrag V. Klasnja, A. LaMarca, L. LeGrand, R. Libby, I. Smith, and James A. Landay. Activity sensing in the wild: a field trial of ubifit garden. In CHI, 2008. [56] Justin Cranshaw, Emad Elwany, Todd Newman, Rafal Kocielnik, Bowen Yu, Sandeep Soni, Jaime Teevan, and Andre´s Monroy-Herna´ndez. Calendar. 
help: Designing a workflow-based scheduling agent with humans in the loop. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 2382–2393, 2017. [57] R. Davis, R. Campbell, Z. Hildon, L. Hobbs, and S. Michie. Theories of behaviour and behaviour change across the social and behavioural sciences: a scoping review. Health Psychology Review, 9:323 – 344, 2015. [58] Terry C Davis, Sandra W Long, Robert H Jackson, EJ Mayeaux, Ronald B George, Peggy W Murphy, and Michael A Crouch. Rapid estimate of adult literacy in medicine: a shortened screening instrument. Family medicine, 25(6):391–395, 1993. [59] Ewart J De Visser, Samuel S Monfort, Ryan McKendrick, Melissa AB Smith, Patrick E McKnight, Frank Krueger, and Raja Parasuraman. Almost human: Anthropomor- phism increases trust resilience in cognitive agents. Journal of Experimental Psychol- ogy: Applied, 22(3):331, 2016. [60] Xuefei Nancy Deng and Kshiti D Joshi. Why individuals participate in micro-task crowdsourcing work environment: Revealing crowdworkers’ perceptions. Journal of the Association for Information Systems, 17(10):3, 2016. 175 [61] Laura Dennison, Leanne Morrison, Gemma Conway, and Lucy Yardley. Opportuni- ties and challenges for smartphone applications in supporting health behavior change: qualitative study. Journal of medical Internet research, 15(4):e86, 2013. [62] Beant Dhillon, Rafal Kocielnik, Ioannis Politis, Marc Swerts, and Dalila Szostak. Cul- ture and facial expressions: A case study with a speech interface. In IFIP Conference on Human-Computer Interaction, pages 392–404. Springer, 2011. [63] Giada Di Stefano, Francesca Gino, Gary P Pisano, and Bradley Staats. Learning by thinking: Overcoming the bias for action through reflection. Harvard Business School Cambridge, MA, USA, 2015. [64] Giada Di Stefano, Francesca Gino, Gary P Pisano, Bradley Staats, and Giada Di- Stefano. Learning by thinking: How reflection aids performance. 
Harvard Business School Boston, MA, 2014. [65] A. Dijkstra. The persuasive effects of personalization through: name mentioning in a smoking cessation message. User Modeling and User-Adapted Interaction, 24:393–411, 2014. [66] J. Dillard and L. Shen. On the nature of reactance and its role in persuasive health communication. Communication Monographs, 72:144 – 168, 2005. [67] Leslie D. Dinauer and Edward L. Fink. Interattitude structure and attitude dynamics a comparison of the hierarchical and galileo spatial-linkage models. Human Commu- nication Research, 31:1–32, 2005. [68] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, 2015. [69] Mateusz Dubiel, Alessandra Cervone, and Giuseppe Riccardi. Inquisitive mind: A conversational news companion. In Proceedings of the 1st International Conference on Conversational User Interfaces, pages 1–3, 2019. [70] William Ebben and Laura Brudzynski. Motivations and barriers to exercise among college students. Journal of Exercise Physiology Online, 11(5), 2008. [71] Ofer Egozi, S. Markovitch, and Evgeniy Gabrilovich. Concept-based information re- trieval using explicit semantic analysis. ACM Trans. Inf. Syst., 29:8:1–8:34, 2011. 176 [72] Daniel A Epstein, Felicia Cordeiro, James Fogarty, Gary Hsieh, and Sean A Munson. Crumbs: lightweight daily food challenges to promote engagement and mindfulness. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5632–5644, 2016. [73] Daniel A Epstein, An Ping, James Fogarty, and Sean A Munson. A lived informatics model of personal informatics. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 731–742, 2015. [74] Paul Falcone. 
96 great interview questions to ask before you hire. Amacom, 2018. [75] Hao Fang, Hao Cheng, Elizabeth Clark, Ariel Holtzman, Maarten Sap, Mari Ostendorf, Yejin Choi, and Noah A Smith. Sounding board–university of washington’s alexa prize submission. Alexa prize proceedings, 2017. [76] Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A Smith, and Mari Ostendorf. Sounding board: A user-centric and content- driven social chatbot. arXiv preprint arXiv:1804.10202, 2018. [77] Jasper Feine, Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. A taxonomy of social cues for conversational agents. International Journal of Human-Computer Studies, 132:138–161, 2019. [78] Jasper Feine, Stefan Morana, and Alexander Maedche. A chatbot response generation system. In Proceedings of the Conference on Mensch und Computer, pages 333–341, 2020. [79] B. Fjeldsoe, A. Marshall, and Y. Miller. Behavior change interventions delivered by mobile telephone short-message service. American journal of preventive medicine, 36 2:165–73, 2009. [80] Brianna S Fjeldsoe, Alison L Marshall, and Yvette D Miller. Behavior change inter- ventions delivered by mobile telephone short-message service. American journal of preventive medicine, 36(2):165–173, 2009. [81] J. Flavell. Metacognition and cognitive monitoring: A new area of cognitive- developmental inquiry. American Psychologist, 34:906–911, 1979. [82] R. Fleck and G. Fitzpatrick. Reflecting on reflection: framing a design landscape. In OZCHI ’10, 2010. 177 [83] James Fogarty. Code and contribution in interactive systems research. In Workshop HCITools: Strategies and Best Practices for Designing, Evaluating and Sharing Tech- nical HCI Toolkits at CHI, 2017. [84] B. J. Fogg. A behavior model for persuasive design. In Persuasive ’09, 2009. [85] Asbjørn Følstad, Petter Bae Brandtzæg, Tom Feltwell, Effie LC Law, Manfred Tsche- ligi, and Ewa A Luger. Sig: chatbots for social good. 
In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–4, 2018. [86] Center for Advanced Research on Language Acquisition. The center for advanced research on language acquisition (carla): Pragmatics and speech acts. http://carla. umn.edu/speechacts/thanks/american.html, 2020. [Online; Retrieved September 29, 2020]. [87] Mirta Galesic and Rocio Garcia-Retamero. Graph literacy: A cross-cultural compari- son. Medical Decision Making, 31(3):444–457, 2011. [88] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374, 2018. [89] G. Gibbs. Learning by doing: A guide to teaching and learning methods. 1988. [90] K. Glanz, B. Rimer, and K. Viswanath. Health behavior and health education : theory, research, and practice. 1991. [91] Laura Gottlieb, Danielle Hessler, Dayna Long, Anais Amaya, and Nancy Adler. A randomized trial on screening for social determinants of health: the iscreen study. Pediatrics, 134(6):e1611–e1618, 2014. [92] R. Gouveia, Fa´bio Pereira, E. Karapanos, S. Munson, and M. Hassenzahl. Exploring the design space of glanceable feedback for physical activity trackers. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2016. [93] Ru´ben Gouveia, Evangelos Karapanos, and Marc Hassenzahl. How do we engage with activity trackers? a longitudinal study of habito. In Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing, pages 1305–1316, 2015. 178 [94] Jefferson Graham. Alexa is for fun, siri is because typing is hard: survey. https://www.usatoday.com/story/tech/talkingtech/2017/06/05/ alexa-fun-siri-because-typing-hard-survey/102436072/, 2017. [Online; Retrieved September 28, 2020]. [95] Joseph Grandpre, Eusebio M Alvaro, Michael Burgoon, Claude H Miller, and John R Hall. 
Adolescent reactance and anti-smoking campaigns: A theoretical approach. Health communication, 15(3):349–366, 2003. [96] James N Gribble, Heather G Miller, Susan M Rogers, and Charles F Turner. Interview mode and measurement of sexual behaviors: Methodological issues. Journal of Sex research, 36(1):16–24, 1999. [97] Jonathan Grudin and Richard Jacques. Chatbots, humbots, and the quest for artificial general intelligence. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–11, 2019. [98] Kent L Gustafson and Winston Bennett Jr. Promoting learner reflection: Issues and difficulties emerging from a three-year study. Technical report, GEORGIA UNIV ATHENS DEPT OF INSTRUCTIONAL TECHNOLOGY, 2002. [99] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009. [100] A. Harrison and R. Crandall. Heterogeneity-homogeneity of exposure sequence and the attitudinal effects of exposure. Journal of personality and social psychology, 21 2:234–8, 1972. [101] Matthias R Hastall and Silvia Knobloch-Westerwick. Severity, efficacy, and evidence type as determinants of health message exposure. Health Communication, 28(4):378– 388, 2013. [102] Katharine J Head, Seth M Noar, Nicholas T Iannarino, and Nancy Grant Harrington. Efficacy of text messaging-based interventions for health promotion: a meta-analysis. Social science & medicine, 97:41–48, 2013. [103] health.gov. Physical activity guidelines for americans. https://health.gov/our-work/ physical-activity/previous-guidelines/2008-physical-activity-guidelines, 2008. [Online; Retrieved September 27, 2020]. [104] Dirk Heerwegh and Geert Loosveldt. Face-to-face versus web surveying in a high- internet-coverage population: Differences in response quality. Public opinion quarterly, 72(5):836–846, 2008. 179 [105] Guillaume Hervet, Katherine Gue´rard, Se´bastien Tremblay, and M. Chtourou. Is ban- ner blindness genuine? 
eye tracking internet text advertising. Applied Cognitive Psy- chology, 25:708–716, 2011. [106] Hyehyun Hong. Scale development for measuring health consciousness: Re- conceptualization. that Matters to the Practice, page 212, 2009. [107] Floris Hooglugt and Geke DS Ludden. A mobile app adopting an identity focus to promote physical activity (movedaily): iterative design study. JMIR mHealth and uHealth, 8(6):e16720, 2020. [108] I-Han Hsiao, Shuguang Han, Manav Malhotra, Hui Soo Chae, and Gary Natriello. Survey sidekick: Structuring scientifically sound surveys. In International conference on intelligent tutoring systems, pages 516–522. Springer, 2014. [109] Gary Hsieh, Ian Li, Anind Dey, Jodi Forlizzi, and Scott E Hudson. Using visualizations to increase compliance in experience sampling. In Proceedings of the 10th international conference on Ubiquitous computing, pages 164–167, 2008. [110] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. To- ward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2017. [111] Barry Hutchinson and Peter Bryson. Video, reflection and transformation: action research in vocational education and training in a european context. Educational action research, 5(2):283–303, 1997. [112] Mohit Jain, Pratyush Kumar, Ramachandra Kota, and Shwetak N Patel. Evaluating and informing the design of chatbots. In Proceedings of the 2018 Designing Interactive Systems Conference, pages 895–906, 2018. [113] K. Jolly, Amanda K. Lewis, J. Beach, J. Denley, P. Adab, J. Deeks, A. Daley, and P. Aveyard. Comparison of range of commercial or primary care led weight reduction programmes with minimal intervention control for weight loss in obesity: Lighten up randomised controlled trial. The BMJ, 343, 2011. [114] D. Kahneman. Maps of bounded rationality: Psychology for behavioral economics. The American Economic Review, 93:1449–1475, 2003. [115] Prashant Kale and Harbir Singh. 
Building firm capabilities through learning: the role of the alliance learning process in alliance capability and firm-level alliance success. Strategic management journal, 28(10):981–1000, 2007. 180 [116] Jie Kang, Kyle Condiff, Shuo Chang, Joseph A Konstan, Loren Terveen, and F Maxwell Harper. Understanding how people use natural language to ask for recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pages 229– 237, 2017. [117] Evangelos Karapanos, Ru´ben Gouveia, Marc Hassenzahl, and Jodi Forlizzi. Wellbeing in the making: peoples’ experiences with wearable activity trackers. Psychology of well-being, 6(1):1–17, 2016. [118] Yuta Katsumi, Suhkyung Kim, Keen Sung, Florin Dolcos, and Sanda Dolcos. When nonverbal greetings “make it or break it”: the role of ethnicity and gender in the effect of handshake on social appraisals. Journal of Nonverbal Behavior, 41(4):345–365, 2017. [119] J. Kaye, Mary McCuistion, Rebecca Gulotta, and D. Shamma. Money talks: track- ing personal finances. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2014. [120] M. Kelly and M. Barker. Why is changing health-related behaviour so difficult? Public health, 136:109–16, 2016. [121] R. Kelly, S. Zyzanski, and S. Alemagno. Prediction of motivation and behavior change following health promotion: role of health beliefs, social support, and self-efficacy. Social science & medicine, 32 3:311–20, 1991. [122] David Kember, Doris YP Leung, Alice Jones, Alice Yuen Loke, Jan McKay, Kit Sin- clair, Harrison Tse, Celia Webb, Frances Kam Yuet Wong, Marian Wong, et al. De- velopment of a questionnaire to measure the level of reflective thinking. Assessment & evaluation in higher education, 25(4):381–395, 2000. [123] Tom Kenter and M. Rijke. Short text similarity with word embeddings. Proceedings of the 24th ACM International on Conference on Information and Knowledge Manage- ment, 2015. [124] Soomin Kim, Joonhwan Lee, and G. Gweon. 
Comparing data from chatbot and web surveys: Effects of platform and conversational style on survey response quality. Pro- ceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019. [125] Young Hoon Kim, Dan J Kim, and Kathy Wachter. A study of mobile user engage- ment (moen): Engagement motivations, perceived value, satisfaction, and continued engagement intention. Decision support systems, 56:361–370, 2013. 181 [126] Predrag Klasnja, Sunny Consolvo, and Wanda Pratt. How to evaluate technologies for health behavior change in hci research. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 3063–3072, 2011. [127] Lorenz Cuno Klopfenstein, Saverio Delpriori, Silvia Malatini, and Alessandro Bogliolo. The rise of bots: A survey of conversational interfaces, patterns, and paradigms. In Proceedings of the 2017 conference on designing interactive systems, pages 555–565, 2017. [128] Ahmet Baki Kocaballi, Juan C Quiroz, Dana Rezazadegan, Shlomo Berkovsky, Farah Magrabi, Enrico Coiera, and Liliana Laranjo. Responses of conversational agents to health and lifestyle prompts: investigation of appropriateness and presentation struc- tures. Journal of medical Internet research, 22(2):e15823, 2020. [129] Rafal Kocielnik, Elena Agapie, Alexander Argyle, Dennis T Hsieh, Kabir Yadav, Breena Taira, and Gary Hsieh. Harborbot: A chatbot for social needs screening. In AMIA Annual Symposium Proceedings, volume 2019, page 552. American Medical Informatics Association, 2019. [130] Rafal Kocielnik, Daniel Avrahami, Jennifer Marlow, Di Lu, and Gary Hsieh. Designing for workplace reflection: a chat and voice-based conversational agent. In Proceedings of the 2018 Designing Interactive Systems Conference, pages 881–894, 2018. [131] Rafal Kocielnik and Gary Hsieh. Send me a different message: utilizing cognitive space to create engaging message triggers. 
In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pages 2193–2207, 2017. [132] Rafal Kocielnik, F. M. Maggi, and N. Sidorova. Enabling self-reflection with lifelogex- plorer: Generating simple views from complex data. 2013 7th International Conference on Pervasive Computing Technologies for Healthcare and Workshops, pages 184–191, 2013. [133] Barbara Konat, Katarzyna Budzynska, and Patrick Saint-Dizier. Rephrase in argument structure. In Proceedings of the Foundations of the Language of Argumentation (FLA) Workshop, pages 32–39, 2016. [134] Birgit R Krogstie, Michael Prilla, Daniel Wessel, Kristin Knipfer, and Viktoria Pam- mer. Computer support for reflective learning in the workplace: A model. In 2012 IEEE 12th International Conference on Advanced Learning Technologies, pages 151– 153. IEEE, 2012. 182 [135] Kimberly Kulavic, Cherilyn N. Hultquist, and J. McLester. A comparison of motiva- tional factors and barriers to physical activity among traditional versus nontraditional college students. Journal of American College Health, 61:60 – 66, 2013. [136] Mark Kutner, Elizabeth Greenburg, Ying Jin, and Christine Paulsen. The health literacy of america’s adults: Results from the 2003 national assessment of adult literacy. nces 2006-483. National Center for Education Statistics, 2006. [137] Andre´ Sousa Lago, Joa˜o Pedro Dias, and Hugo Sereno Ferreira. Conversational inter- face for managing non-trivial internet-of-things systems. In International Conference on Computational Science, pages 384–397. Springer, 2020. [138] Margaret D LeCompte. Analyzing qualitative data. Theory into practice, 39(3):146– 154, 2000. [139] Min Kyung Lee, Junsung Kim, Jodi Forlizzi, and Sara Kiesler. Personalization revis- ited: a reflective approach helps people better personalize health services and motivates them to increase physical activity. 
In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 743–754, 2015. [140] James Lester, Karl Branting, and Bradford Mott. Conversational agents. the practical handbook of internet computing. Chapman & Hall. ISBN-10: 9781584883814, 8:2–3, 2004. [141] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computa- tional Linguistics, 3:211–225, 2015. [142] I. Li, Anind K. Dey, and J. Forlizzi. A stage-based model of personal informatics systems. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010. [143] I. Li, Anind K. Dey, and J. Forlizzi. Understanding my data, myself: supporting self-reflection with ubicomp technologies. In UbiComp ’11, 2011. [144] I. Li, J. Forlizzi, and Anind K. Dey. Know thyself: monitoring and reflecting on facets of one’s life. CHI ’10 Extended Abstracts on Human Factors in Computing Systems, 2010. [145] Jingyi Li, Michelle X Zhou, Huahai Yang, and Gloria Mark. Confiding in and listening to virtual agents: The effect of personality. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 275–286, 2017. 183 [146] Q Vera Liao, Matthew Davis, Werner Geyer, Michael Muller, and N Sadat Shami. What can you do? studying social-agent orientation and agent proactive interactions with an agent for employees. In Proceedings of the 2016 acm conference on designing interactive systems, pages 264–275, 2016. [147] Q Vera Liao, Muhammed Mas-ud Hussain, Praveen Chandar, Matthew Davis, Yasaman Khazaeni, Marco Patricio Crasso, Dakuo Wang, Michael Muller, N Sadat Shami, and Werner Geyer. All work and no play? In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2018. [148] James J Lin, Lena Mamykina, Silvia Lindtner, Gregory Delajoux, and Henry B Strub. 
Fish’n’steps: Encouraging physical activity with an interactive computer game. In International conference on ubiquitous computing, pages 261–278. Springer, 2006. [149] M. Lindstro¨m, A. St˚ahl, K. Ho¨o¨k, P. Sundstro¨m, Jarmo Laaksolahti, Marco Combetto, A. Taylor, and Roberto Bresin. Affective diary: designing for bodily expressiveness and self-reflection. In CHI EA ’06, 2006. [150] Yang Liu and Mingyan Liu. An online learning approach to improving the quality of crowd-sourcing. IEEE/ACM Transactions on Networking, 25(4):2166–2179, 2017. [151] I. Llovera, M. Ward, J. Ryan, Thalia LaTouche, and A. Sama. A survey of the emer- gency department population and their interest in preventive health education. Aca- demic emergency medicine : official journal of the Society for Academic Emergency Medicine, 10 2:155–60, 2003. [152] Edwin A Locke and Gary P Latham. New directions in goal-setting theory. Current directions in psychological science, 15(5):265–268, 2006. [153] Robert Loo and Karran Thorpe. Using reflective learning journals to improve individual and team performance. Team performance management: an international journal, 2002. [154] Catherine L Lortie and Matthieu J Guitton. Judgment of the humanness of an inter- locutor is in the eye of the beholder. PLoS One, 6(9):e25085, 2011. [155] Gale M Lucas, Jonathan Gratch, Aisha King, and Louis-Philippe Morency. It’s only a computer: Virtual humans increase willingness to disclose. Computers in Human Behavior, 37:94–100, 2014. [156] Ewa Luger and Abigail Sellen. ” like having a really bad pa” the gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 5286–5297, 2016. 184 [157] P. Malecha, J. Williams, N. Kunzler, L. Goldfrank, H. Alter, and K. Doran. Material needs of emergency department patients: A systematic review. Academic Emergency Medicine, 25:330–359, 2018. 
[158] Lena Mamykina, Elizabeth Mynatt, Patricia Davidson, and Daniel Greenblatt. Mahi: investigation of social scaffolding for reflective thinking in diabetes management. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 477–486, 2008. [159] Fiona Martin and Mark Johnson. More efficient topic modelling through a noun only approach. In Proceedings of the Australasian Language Technology Association Work- shop 2015, pages 111–115, 2015. [160] Louis Martin, Benoˆıt Sagot, Eric de la Clergerie, and Antoine Bordes. Controllable sentence simplification. arXiv preprint arXiv:1910.02677, 2019. [161] Daniel McDuff, Amy Karlson, Ashish Kapoor, Asta Roseway, and Mary Czerwinski. Affectaura: an intelligent system for emotional memory. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 849–858, 2012. [162] Graeme McLean and Kofi Osei-Frimpong. Hey alexa. . . examine the variables influ- encing the use of artificial intelligent in-home voice assistants. Computers in Human Behavior, 99:28–37, 2019. [163] Mary McMahon, Wendy Patton, and Mark Watson. Creating career stories through reflection: An application of the systems theory framework of career development. Australian Journal of Career Development, 13(3):13–17, 2004. [164] David S Metzger, Beryl Koblin, Charles Turner, Helen Navaline, Francesca Valenti, Sarah Holte, Michael Gross, Amy Sheon, Heather Miller, Philip Cooley, et al. Ran- domized controlled trial of audio computer-assisted self-interviewing: utility and ac- ceptability in longitudinal studies. American journal of epidemiology, 152(2):99–106, 2000. [165] Jochen Meyer, Steven Simske, Katie A Siek, Cathal G Gurrin, and Hermie Hermens. Beyond quantified self: Data for wellbeing. In CHI’14 Extended Abstracts on Human Factors in Computing Systems, pages 95–98. 2014. [166] Sallyanne Miller. What it’s like being the ‘holder of the space’: a narrative on working with reflective practice in groups. 
Reflective Practice, 6(3):367–377, 2005. 185 [167] S. Milne, S. Orbell, and P. Sheeran. Combining motivational and volitional interven- tions to promote exercise participation: protection motivation theory and implemen- tation intentions. British journal of health psychology, 7 Pt 2:163–84, 2002. [168] J. Moon. Reflection in learning & professional development: Theory & practice. 1999. [169] Y. Moon. Personalization and personality: Some effects of customizing message style based on consumer personality. Journal of Consumer Psychology, 12:313–325, 2002. [170] Robert R Morris, Kareem Kouddous, Rohan Kshirsagar, and Stephen M Schueller. Towards an artificially empathic conversational agent for mental health applications: system design and user perceptions. Journal of medical Internet research, 20(6):e10148, 2018. [171] Andreea Muresan and Henning Pohl. Chats with bots: balancing imitation and en- gagement. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–6, 2019. [172] Clifford Nass, Katherine Isbister, Eun-Ju Lee, et al. Truth is beauty: Researching embodied conversational agents. Embodied conversational agents, pages 374–402, 2000. [173] Roni Neff and Jillian Fry. Periodic prompts and reminders in health promotion and health behavior interventions: systematic review. Journal of medical Internet research, 11(2):e16, 2009. [174] Christine M Neuwirth, Ravinder Chandhok, David Charney, Patricia Wojahn, and Loel Kim. Distributed collaborative writing: A comparison of spoken and written modalities for reviewing and revising documents. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 51–57, 1994. [175] Annie WY Ng, HW Lo, and AH Chan. Measuring the usability of safety signs: A use of system usability scale (sus). In proceedings of the International MultiConference of Engineers and Computer Scientists, volume 2, pages 1296–1301. Citeseer, 2011. [176] Hien Nguyen and J. Masthoff. 
Designing persuasive dialogue systems: Using argumen- tation with care. In PERSUASIVE, 2008. [177] J. Norcross, Marci S Mrykalo, and M. Blagys. Auld lang syne: success predictors, change processes, and self-reported outcomes of new year’s resolvers and nonresolvers. Journal of clinical psychology, 58 4:397–405, 2002. 186 [178] P. Norris. Digital divide: Civic engagement, information poverty, and the internet worldwide. 2001. [179] Jekaterina Novikova, Oliver Lemon, and Verena Rieser. Crowd-sourcing nlg data: Pictures elicit better data. arXiv preprint arXiv:1608.00339, 2016. [180] Heather L O’Brien and Elaine G Toms. The development and evaluation of a survey to measure user engagement. Journal of the American Society for Information Science and Technology, 61(1):50–69, 2010. [181] Shereen Oraby, Pritam Gundecha, Jalal Mahmud, Mansurul Bhuiyan, and Rama Akki- raju. ” how may i help you?” modeling twitter customer serviceconversations using fine-grained dialogue acts. In Proceedings of the 22nd international conference on in- telligent user interfaces, pages 343–355, 2017. [182] Marcia G Ory, Matthew Lee Smith, Nelda Mier, and Meghan M Wernicke. The science of sustaining health behavior change: the health maintenance consortium. American journal of health behavior, 34(6):647–659, 2010. [183] Mina Park, Milam Aiken, and Laura Salvador. How do humans interact with chatbots?: An analysis of transcripts. International Journal of Management and Information Technology, 14:3338–3350, 2018. [184] Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328, 2020. [185] J. Pennebaker, M. Mehl, and Kate Niederhoffer. Psychological aspects of natural language. use: our words, our selves. Annual review of psychology, 54:547–77, 2003. [186] Rifca Peters, Joost Broekens, and Mark A Neerincx. Guidelines for tree-based collabo- rative goal setting. 
In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 401–405, 2017. [187] R. Petty and J. Cacioppo. Attitudes and persuasion: Classic and contemporary ap- proaches. 1981. [188] Afarin Pirzadeh, Li He, and Erik Stolterman. Personal informatics and reflection: a critical examination of the nature of reflection. In CHI’13 Extended Abstracts on Human Factors in Computing Systems, pages 1979–1988. 2013. 187 [189] Mark Pope. A brief history of career counseling in the united states. The career development quarterly, 48(3):194–211, 2000. [190] David B Portnoy, Lori AJ Scott-Sheldon, Blair T Johnson, and Michael P Carey. Computer-delivered interventions for health promotion and behavioral risk reduction: a meta-analysis of 75 randomized controlled trials, 1988–2007. Preventive medicine, 47(1):3–16, 2008. [191] J. Prochaska and W. Velicer. The transtheoretical model of health behavior change. American Journal of Health Promotion, 12:38 – 48, 1997. [192] Kiran Ramesh, Surya Ravishankaran, Abhishek Joshi, and K Chandrasekaran. A survey of design techniques for conversational agents. In International conference on information, communication and computing technology, pages 336–350. Springer, 2017. [193] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019. [194] R. Rickenberg and Byron Reeves. The effects of animated characters on anxiety, task performance, and evaluations of user interfaces. Proceedings of the SIGCHI conference on Human Factors in Computing Systems, 2000. [195] J. Riedel. Using a health and productivity dashboard: A case example. American Journal of Health Promotion, 22:1 – 12, 2007. [196] Vero´nica Rivera-Pelayo, Valentin Zacharias, Lars Mu¨ller, and Simone Braun. Applying quantified self approaches to support reflective learning. 
In Proceedings of the 2nd international conference on learning analytics and knowledge, pages 111–114, 2012. [197] Susan Robinson, David R Traum, Midhun Ittycheriah, and Joe Henderer. What would you ask a conversational agent? observations of human-agent dialogues in a museum setting. In LREC, 2008. [198] Stephen Rollnick, William R Miller, and Christopher Butler. Motivational interviewing in health care: helping patients change behavior. Guilford Press, 2008. [199] Catherine A Roster, Robert D Rogers, Gerald Albaum, and Darin Klein. A comparison of response characteristics from web and telephone surveys. International Journal of Market Research, 46(3):359–373, 2004. 188 [200] Susana Rubio, Eva Dı´az, Jesu´s Mart´ın, and Jose´ M Puente. Evaluation of subjec- tive mental workload: A comparison of swat, nasa-tlx, and workload profile methods. Applied psychology, 53(1):61–86, 2004. [201] Caryl E Rusbult, Stephen M Drigotas, and Julie Verette. The investment model: An interdependence analysis of commitment processes and relationship maintenance phenomena. 1994. [202] K. Ryokai, F. Michahelles, M. Kritzler, and Suhaib Syed. Communicating and inter- preting wearable sensor data with health coaches. 2015 9th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth), pages 221–224, 2015. [203] Gu¨nther Sagl, Bernd Resch, and Thomas Blaschke. Contextual sensing: Integrating contextual information with human and technical geo-sensor information for smart cities. Sensors, 15(7):17013–17035, 2015. [204] Shruti Sannon, Brett Stoll, Dominic DiFranzo, Malte Jung, and Natalya N Bazarova. How personification and interactivity influence stress-related disclosures to conver- sational agents. In companion of the 2018 ACM conference on computer supported cooperative work and social computing, pages 285–288, 2018. [205] Donald A Schon. The reflective practitioner: How professionals think in action, volume 5126. Basic books, 1984. [206] D. 
Schulman and T. Bickmore. Persuading users through counseling dialogue with a conversational agent. In Persuasive ’09, 2009. [207] Daniel Schulman and Timothy Bickmore. Persuading users through counseling dialogue with a conversational agent. In Proceedings of the 4th international conference on persuasive technology, pages 1–8, 2009. [208] David Schumann, R. Petty, and D. Clemons. Predicting the effectiveness of differ- ent strategies of advertising variation: A test of the repetition-variation hypotheses. Journal of Consumer Research, 17:192–202, 1990. [209] S. Schwartz, Jan Cieciuch, M. Vecchione, E. Davidov, R. Fischer, C. Beierlein, A. Ramos, M. Verkasalo, J. Lo¨nnqvist, Kursad Demirutku, Ozlem Dirilen-Gumus, and Mark Konty. Refining the theory of basic individual values. Journal of personality and social psychology, 103 4:663–88, 2012. 189 [210] A. Schwerdtfeger, C. Schmitz, and M. Warken. Using text messages to bridge the intention-behavior gap? a pilot study on the use of text message reminders to increase objectively assessed physical activity in daily life. Frontiers in Psychology, 3, 2012. [211] John R Searle. Austin on locutionary and illocutionary acts. The philosophical review, 77(4):405–424, 1968. [212] John R Searle, Ferenc Kiefer, Manfred Bierwisch, et al. Speech act theory and prag- matics, volume 10. Springer, 1980. [213] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364, 2015. [214] Ashish Sharma, Inna W Lin, Adam S Miner, David C Atkins, and Tim Althoff. Towards facilitating empathic conversations in online mental health support: A reinforcement learning approach. arXiv preprint arXiv:2101.07714, 2021. [215] Ashish Sharma, Adam S Miner, David C Atkins, and Tim Althoff. A computational approach to understanding empathy expressed in text-based mental health support. arXiv preprint arXiv:2009.08441, 2020. [216] Miriam Sherin and Elizabeth van Es. 
Using video to support teachers’ ability to inter- pret classroom interactions. In society for information technology & teacher education international conference, pages 2532–2536. Association for the Advancement of Com- puting in Education (AACE), 2002. [217] Ingo Siegert. “alexa in the wild”–collecting unconstrained conversations with a modern voice assistant in a public environment. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 615–619, 2020. [218] P. Slova´k, C. Frauenberger, and G. Fitzpatrick. Reflective practicum: A framework of sensitising concepts to design for transformative reflection. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017. [219] Kenny Smith, Amy Perfors, O. Feher, Anna Samara, K. Swoboda, and E. Wonnacott. Language learning, language use and the evolution of linguistic variation. Philosophical Transactions of the Royal Society B: Biological Sciences, 372, 2017. [220] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Mar- garet Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network ap- proach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714, 2015. 190 [221] Lee Sproull, Mani Subramani, Sara Kiesler, Janet H Walker, and Keith Waters. When the interface is a face. Human-computer interaction, 11(2):97–124, 1996. [222] Jessica Stillman. Hiring a Remote Worker? 7 Interview Questions to Ask. https: //www.inc.com/jessica-stillman/hiring-remote-workers-interview-questions-to-ask. html, 2013. [Online; Retrieved September 28, 2020]. [223] Victor J Strecher, Saul Shiffman, and Robert West. Randomized controlled trial of a web-based computer-tailored smoking cessation program as a supplement to nicotine patch therapy. Addiction, 100(5):682–688, 2005. [224] Nina Svenningsson and Montathar Faraon. 
Artificial intelligence in conversational agents: A study of factors related to perceived humanness in chatbots. In Proceedings of the 2019 2nd Artificial Intelligence and Cloud Computing Conference, pages 151– 161, 2019. [225] Robert Tobias. Changing behavior by memory aids: A social psychological model of prospective memory and habit development tested with dynamic field data. Psycho- logical review, 116(2):408, 2009. [226] US VA. Sample pbi questions - performance based interviewing (pbi). https://www. va.gov/PBI/Questions.asp, 2018. [Online; Retrieved September 28, 2020]. [227] Aukje AC Verhoeven, Marieke A Adriaanse, Denise TD De Ridder, Emely De Vet, and Bob M Fennis. Less is more: The effect of multiple implementation intentions targeting unhealthy snacking habits. European Journal of Social Psychology, 43(5):344–354, 2013. [228] M Vagias Wade et al. Likert-type scale response anchors. Clemson international insti- tute for tourism & research development, department of parks, recreation and tourism management, clemson university, 2006. [229] Harald Walach, Nina Buchheld, Valentin Buttenmu¨ller, Norman Kleinknecht, and Ste- fan Schmidt. Measuring mindfulness—the freiburg mindfulness inventory (fmi). Per- sonality and individual differences, 40(8):1543–1555, 2006. [230] G. Walsh and J. Golbeck. Stepcity: a preliminary investigation of a personal informatics-based social game on behavior change. CHI ’14 Extended Abstracts on Human Factors in Computing Systems, 2014. [231] Tan Wang, Xing Xu, Yang Yang, Alan Hanjalic, Heng Tao Shen, and Jingkuan Song. Matching images and text with multi-modal tensor fusion and re-ranking. In Proceed- ings of the 27th ACM international conference on multimedia, pages 12–20, 2019. 191 [232] David Watson, Lee Anna Clark, and Auke Tellegen. Development and validation of brief measures of positive and negative affect: the panas scales. Journal of personality and social psychology, 54(6):1063, 1988. 
[233] Wei Wei, Bei Zhou, and Georgios Leontidis. A hybrid natural language generation sys- tem integrating rules and deep learning algorithms. arXiv preprint arXiv:2006.09213, 2020. [234] Barry D Weiss, Mary Z Mays, William Martz, Kelley Merriam Castro, Darren A DeWalt, Michael P Pignone, Joy Mockbee, and Frank A Hale. Quick assessment of literacy in primary care: the newest vital sign. The Annals of Family Medicine, 3(6):514–522, 2005. [235] J. Weizenbaum. Eliza — a computer program for the study of natural language com- munication between man and machine. Commun. ACM, 26:23–28, 1983. [236] Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. How well sen- tence embeddings capture meaning. In Proceedings of the 20th Australasian document computing symposium, pages 1–8, 2015. [237] WHO. Who — prevalence of insufficient physical activity. http://www.who.int/gho/ ncd/risk factors/physical activity/en/, 2008. [Online; Retrieved September 27, 2020]. [238] Robert A Wicklund. Freedom and reactance. Lawrence Erlbaum, 1974. [239] Ziang Xiao, M. Zhou, Q. V. Liao, G. Mark, Changyan Chi, W. Chen, and H. Yang. Tell me about yourself: Using an ai-powered chatbot to conduct conversational surveys. arXiv: Human-Computer Interaction, 2019. [240] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. A new chatbot for customer service on social media. In Proceedings of the 2017 CHI conference on human factors in computing systems, pages 3506–3510, 2017. [241] Qiongkai Xu, Chenchen Xu, and Lizhen Qu. Alter: Auxiliary text rewriting tool for natural language generation. arXiv preprint arXiv:1909.06564, 2019. [242] Xinnuo Xu, Ondrˇej Dusˇek, Ioannis Konstas, and Verena Rieser. Better conversations by modeling, filtering, and optimizing for coherence and diversity. arXiv preprint arXiv:1809.06873, 2018. [243] O¨zge Nilay Yalc¸ın. Empathy framework for embodied conversational agents. Cognitive Systems Research, 59:123–132, 2020. 
192 [244] Rui Yan, Yiping Song, and Hua Wu. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 55–64, 2016. [245] Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. Diverse few-shot text classification with multiple metrics. arXiv preprint arXiv:1805.07513, 2018. [246] Li Yujian and Liu Bo. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095, 2007. [247] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018. [248] Xingxing Zhang and Mirella Lapata. Sentence simplification with deep reinforcement learning. arXiv preprint arXiv:1703.10931, 2017. [249] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1):53–93, 2020. [250] Maurizio Zollo and Sidney G Winter. Deliberate learning and the evolution of dynamic capabilities. Organization science, 13(3):339–351, 2002. 193 Appendix A EXAMPLES OF REACTIONS MATCHED TO QUESTION-ANSWER CONTEXT 194 Table A.1: Examples of empathetic reactions matched to local question-answer context. Phrases in-between square brackets have been added or modified. 195 Appendix B PHRASING CATEGORIES USED FOR QUESTION REPHRASING IN AUTOMATION Table B.1: Phrasing categories for survey questions derived empirically from survey data. Each category is composed from a prefix that is preprended to the original text of survey questions and a set of modification rules which change the text of the question to fit the 3rd person & question form. 
196 Appendix C HOLD-OUT SURVEYS USED IN THE USER STUDY EVALUATION • Big Five Inventory-10 (BFI-10) • Informal Fitness survey • Personal Finance Survey (SurveyMonkey) • Values Survey • Portrait Values Questionnaire (PVQ) • Sleep Quality Scale (SQS) 197 Appendix D SURVEYS USED FOR ML DEVELOPMENT • Vulnerable Elders Survey (VES-13) • Theory of Planned Behavior Survey • Student Satisfaction • SDOH Short Screener • Kember’s Reflection Survey • Positive and Negative Affect Schedule (PANAS-SF) • NASA TLX • Health Literacy • 3 Minute Depression Test • Demographics #1 • Demographics #2 • Demographics #3 • Demographics #4 • Climate Change • Harbor 198 Appendix E ML PERFORMANCE ON THE FULL DATASET Table E.1: Classification performance for the 4 text classification tasks (+1 derived) on a full dataset of 22 surveys (combined 16 development and 6 hold-out surveys). Question Empathy Framing and Answer Empathy Framing classifications are part of empathetic addition - the results of these two classifications taken together are used to decide on reaction class 199 Appendix F MANUAL QUESTION CORRECTIONS IN CHAPTER 6 Sleep Quality - 6 of 19 questions corrected # Auto generated After manual correction Edit Cause 1 Next, did you experience i have difficulty falling asleep? Next, have you experienced difficulty falling asleep? 7 Wrong class 2 Let’s carry on, can you share whether you have experienced poor sleep gives me headaches? Let’s carry on, can you share whether you have experienced poor sleep giving you headaches? 6 Wrong class 3 Please tell me if you have experienced poor sleep makes me irritated? Please tell me if you have experienced poor sleep making you irritated? 6 Wrong class 4 Moving on, did you experience poor sleep makes me lose my appetite? Moving on, did you experience poor sleep making you lose your appetite? 10 Wrong class 5 So, please tell me if you’ve experienced poor sleep makes me lose interest in work or others? 
So, please tell me if you’ve experienced poor sleep making you lose interest in work or others? 6 Wrong class 6 Can you share whether you’ve experienced my fatigue is relieved after sleep? Can you share whether you’ve experienced your fatigue being relieved after sleep? 8 Wrong class Big 5 survey - 10 of 11 questions corrected 200 # Auto generated After manual correction Edit Cause 1 Moving on, is it fair to say that you see myself as someone who is reserved Moving on, is it fair to say that you see yourself as someone who is reserved 4 No rule: ‘myself’  ‘yourself’ 2 Further, please indicate the extent to which you see myself as someone who is generally trusting? Further, please indicate the extent to which you see yourself as someone who is generally trusting? 4 No rule: ‘myself’  ‘yourself’ 3 Do you think it’s fair to say that you see myself as someone who tends to be lazy? Do you think it’s fair to say that you see yourself as someone who tends to be lazy? 4 No rule: ‘myself’  ‘yourself’ 4 Next, does it make sense to say that you see myself as someone who is relaxed, handles stress well? Next, does it make sense to say that you see yourself as someone who is relaxed, handles stress well? 4 No rule: ‘myself’  ‘yourself’ 5 Let’s carry on, do you think it’s fair to say that you see myself as someone who has few artistic interests? Let’s carry on, do you think it’s fair to say that you see yourself as someone who has few artistic interests? 4 No rule: ‘myself’  ‘yourself’ 6 Let’s carry on, is it true that you see myself as someone who is outgoing, sociable Let’s carry on, is it true that you see yourself as someone who is outgoing, sociable 4 No rule: ‘myself’  ‘yourself’ 7 Do you think that you see myself as someone who tends to find fault with others? Do you think that you see yourself as someone who tends to find fault with others? 4 No rule: ‘myself’  ‘yourself’ 8 Moving on, would you say that you see myself as someone who does a thorough job? 
Moving on, would you say that you see yourself as someone who does a thorough job? 4 No rule: ‘myself’  ‘yourself’ 9 Is it fair to say that you see myself as someone who gets nervous easily? Is it fair to say that you see yourself as someone who gets nervous easily? 4 No rule: ‘myself’  ‘yourself’ 10 Further, does it make sense to say that you see myself as someone who has an active imagination? Further, does it make sense to say that you see yourself as someone who has an active imagination? 4 No rule: ‘myself’  ‘yourself’ 201 Finance survey - 5 of 17 questions corrected # Auto generated After manual correction Edit Cause 1 So, can you say that you have an emergency savings fund established to cover 3 to 6 months of expenses should you lose the ability to work? Do you have an emergency savings fund established to cover 3 to 6 months of expenses should you lose the ability to work? 19 Wrong class 2 Going forward, can I ask you to i feel capable of handling my financial future overall? Going forward, do you feel capable of handling your financial future overall? 17 Wrong class 3 Further, please indicate the extent to which you believe you have adequate information to help make the best financial decisions for you and your family? Further, do you believe you have adequate information to help make the best financial decisions for you and your family? 33 Wrong class 4 Continuing, could you say that you feel you have a good grasp on the importance of insurance in all of its forms. (life, health, disability, Long Term Care)? Continuing, do you feel you have a good grasp on the importance of insurance in all of its forms. (life, health, disability, Long Term Care)? 16 Wrong class 5 Do you feel that you feel comfortable about your financial future because you have adequately planned for it? Do you feel comfortable about your financial future because you have adequately planned for it? 
14 Wrong class Values & Politics survey - 2 of 16 questions corrected 202 # Auto generated After manual correction Edit Cause 1 Continuing, could you tell me a good government should aim chiefly at introducing the highest ethical principles into its policies? Continuing, could you say a good government should aim chiefly at introducing the highest ethical principles into its policies? 7 Wrong class 2 Moving on, can I ask you a good government should aim chiefly at introducing the highest ethical principles into its policies? Moving on, should a good government should aim chiefly at introducing the highest ethical principles into its policies? 12 No rule: ‘I’ ‘you’ for sentence start Fitness survey - 2 of 11 questions corrected # Auto generated After manual correction Edit Cause 1 Would you mind sharing do you have a workout buddy? Would you mind sharing whether you have a workout buddy? 7 Wrong class 2 Would you mind sharing how do you feel after a workout? Would you mind sharing how you feel after a workout? 3 Wrong class PVQ values survey - 12 of 14 questions corrected 203 # Auto generated After manual correction Edit Cause 1 Would you say that thinking up new ideas and being creative is important to him. He likes to do things in his own original way. Would you say that thinking up new ideas and being creative is important to you. You like to do things in your own original way? 12 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 2 Next, is it fair to say that it is important to him to be rich. He wants to have a lot of money and expensive things. Next, is it fair to say that it is important to you to be rich. You want to have a lot of money and expensive things? 8 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 3 He thinks it is important that every person in the world should be treated equally. He believes everyone should have equal opportunities in life. Do you think it is important that every person in the world should be treated equally. 
You believe everyone should have equal opportunities in life? 12 Wrong class + No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 4 Do you think that it’s important to him to show his abilities. He wants people to admire what he does. Do you think that it’s important to you to show your abilities. You want people to admire what you do? 17 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 5 Does it make sense to say that it is important to him to live in secure surroundings. He avoids anything that might endanger his safety. Does it make sense to say that it is important to you to live in secure surroundings. You avoid anything that might endanger your safety? 12 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 6 Do you think it’s fair to say that he likes surprises and is always looking for new things to do. He thinks it is important to do lots of different things in life. Do you think it’s fair to say that you like surprises and are always looking for new things to do. You think it is important to do lots of different things in life? 12 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 204 # Auto generated After manual correction Edit Cause 7 So, could you tell me he believes that people should do what they are told. He thinks people should follow rules at all times, even when no-one is watching. So, could you tell me whether you believe that people should do what they are told. You think people should follow rules at all times, even when no-one is watching? 15 Wrong class + No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 8 Then, could you say that it is important to him to listen to people who are different from him. Even when he disagrees with them, he still wants to understand them. Then, could you say that it is important to you to listen to people who are different from you. Even when you disagree with them, you still want to understand them. 
14 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 9 Further, can you say that it is important to him to be humble and modest. He tries not to draw attention to himself. Further, can you say that it is important to you to be humble and modest. You try not to draw attention to yourself? 14 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 10 Please indicate the extent to which having a good time is important to him. He likes to “spoil” himself. Please indicate the extent to which having a good time is important to you. You like to “spoil” yourself? 12 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 11 Continuing, do you feel that it is important to him to make his own decisions about what he does. He likes to be free and not depend on others. Continuing, do you feel that it is important to you to make your own decisions about what you do. You like to be free and not depend on others? 17 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 12 Moving on, do you think it’s fair to say that it’s very important to him to help the people around him. He wants to care for their well-being. Moving on, do you think it’s fair to say that it’s very important to you to help the people around you. You want to care for their well-being? 11 No rules: ‘he’‘you’, ‘him’‘you’, ‘his’‘your’, VERB3rd 205 Appendix G MANUAL REACTION CORRECTIONS IN CHAPTER 6 PVQ survey - 6 of 72 (8.3%) reactions corrected # Auto generated After manual correction Edit Cause Q: He likes surprises and is always looking for new things to do. He thinks 236 Wrong Q cat it is important to do lots of different things in life. 
Question / Answer | # | Auto generated | After manual correction | Edit | Cause
A: Very much like me | 1 | Okay, I’m getting a better idea of your answers | That sounds positive | 40 |
A: Like me | 2 | Okay, I’m getting a better idea of your answers | That sounds positive | 40 |
A: Somewhat like me | 3 | Okay, I’m getting a better idea of your answers | That sounds positive | 40 |
A: A little like me | 4 | Okay, I’m getting a better idea of your answers | That sounds positive | 40 |
A: Not like me | 5 | Okay, I’m getting a better idea of your answers | That sounds stressful | 38 |
A: Not like me at all | 6 | Okay, I’m getting a better idea of your answers | That sounds stressful | 38 |

Sleep survey - 34 of 72 (47.2%) reactions corrected

Question / Answer | # | Auto generated | After manual correction | Edit | Cause
Q: I have difficulty falling asleep | | | | 18 |
A: Sometimes: 1-2 times a week | 1 | That sounds stressful | Sure | 18 | Wrong A cat
Q: I fall into a deep sleep | | | | 66 | Wrong Q cat
A: Rarely: None or 1-3 times a month | 2 | Sounds good | That is frustrating | 17 |
A: Sometimes: 1-2 times a week | 3 | I am sorry to hear that | Got it | 19 | Wrong A cat
A: Often: 3-5 times a week | 4 | I am sorry to hear that | I am happy that’s the case | 15 |
A: Almost always: 6-7 times a week | 5 | I am sorry to hear that | I am happy that’s the case | 15 |
Q: I have difficulty getting back to sleep once you wake up in middle of the night. | | | | 12 |
A: Sometimes: 1-2 times a week | 6 | That’s hard to hear | Thanks for sharing | 12 | Wrong A cat
Q: I wake up easily because of noise. | | | | 22 |
A: Sometimes: 1-2 times a week | 7 | So sorry about that | Thanks for letting me know | 22 | Wrong A cat
Q: I toss and turn. | | | | 21 |
A: Sometimes: 1-2 times a week | 8 | I am happy that’s the case | Got it! Thanks for sharing | 21 | Wrong A cat
Q: I never go back to sleep after awakening during sleep. | | | | 15 |
A: Sometimes: 1-2 times a week | 9 | That is frustrating | Got it. | 15 | Wrong A cat
Q: I feel refreshed after sleep | | | | 67 | Wrong Q cat
A: Rarely: None or 1-3 times a month | 10 | That’s really great! | I am sorry to hear that | 17 |
A: Sometimes: 1-2 times a week | 11 | So sorry about that | Got it! Thanks for sharing | 20 | Wrong A cat
A: Often: 3-5 times a week | 12 | So sorry about that | Sounds good | 15 |
A: Almost always: 6-7 times a week | 13 | So sorry about that | Sounds good | 15 |
Q: I feel unlikely to sleep after sleep | | | | 53 | Wrong Q cat
A: Rarely: None or 1-3 times a month | 14 | That’s really great! | That’s hard to hear | 11 |
A: Sometimes: 1-2 times a week | 15 | That is frustrating | Noted | 18 | Wrong A cat
A: Often: 3-5 times a week | 16 | That is frustrating | That sounds positive | 12 |
A: Almost always: 6-7 times a week | 17 | That is frustrating | That sounds positive | 12 |
Q: Poor sleep gives me headaches. | | | | 38 |
A: Sometimes: 1-2 times a week | 18 | That sounds stressful | Okay, I’m getting a better idea of your answers | 38 | Wrong A cat
Q: Poor sleep makes me irritated. | | | | 12 |
A: Sometimes: 1-2 times a week | 19 | That’s hard to hear | Thanks for sharing | 12 | Wrong A cat
Q: My sleep hours are enough. | | | | 75 | Wrong Q cat
A: Rarely: None or 1-3 times a month | 20 | Sounds nice | That is frustrating | 16 |
A: Sometimes: 1-2 times a week | 21 | I am sorry to hear that | Sure | 21 | Wrong A cat
A: Often: 3-5 times a week | 22 | I am sorry to hear that | Okay, that’s good | 19 |
A: Almost always: 6-7 times a week | 23 | I am sorry to hear that | Okay, that’s good | 19 |
Q: Poor sleep makes me lose my appetite. | | | | 16 |
A: Sometimes: 1-2 times a week | 24 | That is frustrating | Thank you for your answer | 16 | Wrong A cat
Q: Poor sleep makes hard for me to think. | | | | 39 |
A: Sometimes: 1-2 times a week | 25 | Thanks for sharing that | Okay, I’m getting a better idea of your answers | 39 | Wrong A cat
Q: I feel vigorous after sleep. | | | | 62 | Wrong Q cat
A: Rarely: None or 1-3 times a month | 26 | Great! | So sorry about that | 16 |
A: Sometimes: 1-2 times a week | 27 | That sounds stressful | Got it | 18 | Wrong A cat
A: Often: 3-5 times a week | 28 | That sounds stressful | Sounds nice | 14 |
A: Almost always: 6-7 times a week | 29 | That sounds stressful | Sounds nice | 14 |
Q: Poor sleep makes me lose interest in work or others. | | | | 20 |
A: Sometimes: 1-2 times a week | 30 | I am sorry to hear that | Noted | 20 | Wrong A cat
Q: My fatigue is relieved after sleep. | | | | 65 | Wrong Q cat
A: Rarely: None or 1-3 times a month | 31 | That’s good | That is frustrating | 13 |
A: Sometimes: 1-2 times a week | 32 | So sorry about that | Got it! Thanks for sharing | 20 | Wrong A cat
A: Often: 3-5 times a week | 33 | So sorry about that | Great! | 16 |
A: Almost always: 6-7 times a week | 34 | So sorry about that | Great! | 16 |

Big5 survey - 20 of 50 (40.0%) reactions corrected

Question / Answer | # | Auto generated | After manual correction | Edit | Cause
Q: I see myself as someone who is generally trusting. | | | | 84 | Wrong Q cat
A: Disagree strongly | 1 | Thanks for letting me know | I am sorry to hear that | 20 |
A: Disagree a little | 2 | Thanks for letting me know | I am sorry to hear that | 20 |
A: Agree a little | 3 | Thanks for letting me know | I am glad to hear that | 22 |
A: Agree strongly | 4 | Thanks for letting me know | I am glad to hear that | 22 |
Q: I see myself as someone who has few artistic interests. | | | | 50 | Wrong Q cat
A: Disagree strongly | 5 | Sure | Sounds nice | 8 |
A: Disagree a little | 6 | Sure | Sounds nice | 8 |
A: Agree a little | 7 | Sure | So sorry about that | 17 |
A: Agree strongly | 8 | Sure | So sorry about that | 17 |
Q: I see myself as someone who is outgoing, sociable | | | | 76 | Wrong Q cat
A: Disagree strongly | 9 | Noted | That sounds stressful | 18 |
A: Disagree a little | 10 | Noted | That sounds stressful | 18 |
A: Agree a little | 11 | Noted | That sounds positive | 18 |
A: Agree strongly | 12 | Noted | That sounds positive | 18 |
Q: I see myself as someone who tends to find fault with others | | | | 66 | Wrong Q cat
A: Disagree strongly | 13 | So sorry about that | Great! | 16 |
A: Disagree a little | 14 | So sorry about that | Great! | 16 |
A: Agree a little | 15 | That sounds positive | Thanks for sharing that | 17 |
A: Agree strongly | 16 | That sounds positive | Thanks for sharing that | 17 |
Q: I see myself as someone who does a thorough job | | | | 70 | Wrong Q cat
A: Disagree strongly | 17 | Sure | That is frustrating | 17 |
A: Disagree a little | 18 | Sure | That is frustrating | 17 |
A: Agree a little | 19 | Sure | That sounds positive | 18 |
A: Agree strongly | 20 | Sure | That sounds positive | 18 |

Values survey - 38 of 64 (59.4%) reactions corrected

Question / Answer | # | Auto generated | After manual correction | Edit | Cause
Q: A good government should aim chiefly at more aid for the poor, sick, and old. | | | | 51 | Wrong Q cat
A: highest preference | 1 | So sorry about that | That’s really great! | 17 |
A: second preference | 2 | So sorry about that | That’s really great! | 17 |
A: third preference | 3 | So sorry about that | Thanks for sharing. | 17 |
Q: A good government should aim chiefly at the development of manufacturing and trade. | | | | 9 |
A: lowest preference | 4 | Good to hear that | I am sorry to hear that | 9 | Wrong A cat
Q: A good government should aim chiefly at introducing the highest ethical principles into its policies. | | | | 15 |
A: lowest preference | 5 | Sounds good | Sorry to hear that | 15 | Wrong A cat
Q: A good government should aim chiefly at establishing a position of power and respect among nations. | | | | 16 |
A: lowest preference | 6 | Sounds nice | That is frustrating | 16 | Wrong A cat
Q: Someone who works all week would best spend the weekend keeping up on the latest in scientific advances. | | | | 84 | Wrong Q cat
A: highest preference | 7 | I am happy that’s the case | Thanks for sharing | 21 |
A: second preference | 8 | I am happy that’s the case | Thanks for sharing | 21 |
A: third preference | 9 | I am happy that’s the case | Thanks for sharing | 21 |
A: lowest preference | 10 | I am happy that’s the case | Thanks for sharing | 21 | Wrong A cat
Q: Someone who works all week would best spend the weekend trying to win at golf or other sport. | | | | 72 | Wrong Q cat
A: highest preference | 11 | That’s really great! | Sure | 18 |
A: second preference | 12 | That’s really great! | Sure | 18 |
A: third preference | 13 | That’s really great! | Sure | 18 |
A: lowest preference | 14 | That’s really great! | Sure | 18 | Wrong A cat
Q: Someone who works all week would best spend the weekend going to a classical music concert or art museum. | | | | 80 | Wrong Q cat
A: highest preference | 15 | I am glad to hear that | Noted | 20 |
A: second preference | 16 | I am glad to hear that | Noted | 20 |
A: third preference | 17 | I am glad to hear that | Noted | 20 |
A: lowest preference | 18 | I am glad to hear that | Noted | 20 | Wrong A cat
Q: If I could influence the educational policies of the public schools of some city, I would try to promote the study of and participation in music and the fine arts. | | | | 164 | Wrong Q cat
A: highest preference | 19 | That’s good | Okay, I’m getting a better idea of your answers | 41 |
A: second preference | 20 | That’s good | Okay, I’m getting a better idea of your answers | 41 |
A: third preference | 21 | That’s good | Okay, I’m getting a better idea of your answers | 41 |
A: lowest preference | 22 | That’s good | Okay, I’m getting a better idea of your answers | 41 | Wrong A cat
Q: If I could influence the educational policies of the public schools of some city, I would try to encourage the study of social problems. | | | | 84 | Wrong Q cat
A: highest preference | 23 | I am happy that’s the case | Thanks for letting me know | 21 |
A: second preference | 24 | I am happy that’s the case | Thanks for letting me know | 21 |
A: third preference | 25 | I am happy that’s the case | Thanks for letting me know | 21 |
A: lowest preference | 26 | I am happy that’s the case | Thanks for letting me know | 21 | Wrong A cat
Q: If I could influence the educational policies of the public schools of some city, I would try to provide additional laboratory facilities. | | | | 80 | Wrong Q cat
A: highest preference | 27 | Sounds nice | Thank you for your answer | 20 |
A: second preference | 28 | Sounds nice | Thank you for your answer | 20 |
A: third preference | 29 | Sounds nice | Thank you for your answer | 20 |
A: lowest preference | 30 | Sounds nice | Thank you for your answer | 20 | Wrong A cat
Q: If I could influence the educational policies of the public schools of some city, I would try to increase the practical value of courses. | | | | 68 | Wrong Q cat
A: highest preference | 31 | Good to hear that | Got it! Thanks for sharing | 17 |
A: second preference | 32 | Good to hear that | Got it! Thanks for sharing | 17 |
A: third preference | 33 | Good to hear that | Got it! Thanks for sharing | 17 |
A: lowest preference | 34 | Good to hear that | Got it! Thanks for sharing | 17 | Wrong A cat
Q: I would prefer a friend who is practical, efficient, and hard working. | | | | 20 | Wrong Q cat
A: highest preference | 35 | Great! | Got it | 5 |
A: second preference | 36 | Great! | Got it | 5 |
A: third preference | 37 | Great! | Got it | 5 |
A: lowest preference | 38 | Great! | Got it | 5 | Wrong A cat

Fitness survey - 21 of 40 (52.5%) reactions corrected

Question / Answer | # | Auto generated | After manual correction | Edit | Cause
Q: How often do you workout? | | | | 54 | Wrong Q cat
A: Not at all | 1 | Great! | That’s hard to hear | 17 |
A: 2-3 times a week | 2 | Got it! Thanks for sharing | That’s good | 20 | Wrong A cat
A: Every day | 3 | Thanks for sharing that | That’s good | 17 |
Q: How healthy is the food you eat? | | | | 47 |
A: Not sure: I eat whatever is in front of me. | 4 | Noted | I am sorry to hear that | 20 | Wrong A cat
A: Okay: I count calories but I’m not too strict. | 5 | That sounds stressful | Noted | 18 | Wrong A cat
A: Excellent: I’ve adopted the perfect plate and feel great. | 6 | Noted | Sounds nice | 9 | Wrong A cat
Q: Do you know what the perfect plate is? | | | | 65 | Wrong Q cat
A: Yes: it balances food intake for I never use it. | 7 | Thanks for letting me know | Got it | 23 | Wrong A cat
A: Yes: 1/2 vegetables, 1/4 protein & 1/4 carbohydrates. I use it when I remember. | 8 | Thanks for letting me know | Good to hear that | 21 |
A: Yes: It sums up my approach to every meal. | 9 | Thanks for letting me know | Good to hear that | 21 |
Q: How hard do you work out? | | | | 71 | Wrong Q cat
A: Not very hard, it depends on the mood I’m in. | 10 | Good to hear that | So sorry about that | 12 |
A: I start hard but usually tail off part of the way through. | 11 | That’s hard to hear | Sure | 17 | Wrong A cat
A: I make sure to feel the burn by the end. | 12 | Thank you for your answer | Okay, that’s good | 21 | Wrong A cat
A: Hard enough to ensure I have given it my all. By the end I can barely stand. | 13 | Thank you for your answer | Okay, that’s good | 21 | Wrong A cat
Q: Do you have a workout buddy? | | | | 21 |
A: Yes but I only see them once every month or two. | 14 | I am sorry to hear that | Got it! Thanks for sharing | 21 | Wrong A cat
Q: How do you feel after a workout? | | | | 39 |
A: Relieved that I made it through the session. | 15 | Thanks for letting me know | That sounds stressful | 21 | Wrong A cat
A: Shattered like I just pushed myself to the limit and maybe passed it. | 16 | Thanks for sharing that | Got it | 19 | Wrong A cat
Q: Do you have an exercise plan? | | | | 63 |
A: Yes but I don’t always stick to it. | 17 | Great! | Thank you for your answer | 23 | Wrong A cat
A: Yes, it’s designed to work all muscle groups with alternating exercises. | 18 | Okay, I’m getting a better idea of your answers | That sounds positive | 40 | Wrong A cat
Q: What would you say is your current level of fitness? | | | | 59 | Wrong Q cat
A: Pretty poor; I get out of breath just walking up the stairs. | 19 | Sure | So sorry about that | 17 |
A: Good; I exercise regularly and watch what I eat. | 20 | Sure | I am glad to hear that | 21 |
A: Excellent; I’m always at the gym and avoid all foods that are bad. | 21 | Sure | I am glad to hear that | 21 |

Finance survey - 28 of 84 (33.3%) reactions corrected

Question / Answer | # | Auto generated | After manual correction | Edit | Cause
Q: As it relates to matters of personal finance, what topics do you feel you could use more information on? | | | | 224 | Wrong Q cat
A: Budgeting | 1 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Credit | 2 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Wills | 3 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Life Insurance | 4 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Disability Insurance | 5 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Health Insurance | 6 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Long Term Care Insurance | 7 | I am happy that’s the case | Thanks for letting me know | 21 | Wrong A cat
A: Loans/Debt | 8 | I am sorry to hear that | Thanks for letting me know | 20 | Wrong A cat
A: Saving | 9 | I am happy that’s the case | Thanks for letting me know | 21 | Wrong A cat
A: Investing | 10 | I am happy that’s the case | Thanks for letting me know | 21 | Wrong A cat
A: Other (please specify) | 11 | I am happy that’s the case | Thanks for letting me know | 21 | Wrong A cat
Q: I feel in control of my current financial situation. | | | | 56 | Wrong Q cat
A: Not at all true | 12 | Sounds good | That is frustrating | 17 |
A: Somewhat untrue | 13 | Sounds good | That is frustrating | 17 |
A: Somewhat true | 14 | That’s hard to hear | That’s good | 11 |
A: Very true | 15 | That’s hard to hear | That’s good | 11 |
Q: I feel capable of handling my financial future overall. | | | | 70 | Wrong Q cat
A: Not at all true | 16 | I am glad to hear that | That is frustrating | 17 |
A: Somewhat untrue | 17 | I am glad to hear that | That is frustrating | 17 |
A: Somewhat true | 18 | That sounds stressful | Okay, that’s good | 18 |
A: Very true | 19 | That sounds stressful | Okay, that’s good | 18 |
Q: I have the following types of insurance. | | | | 99 | Wrong Q cat
A: Life | 20 | That is frustrating | Thank you for your answer | 16 | Wrong A cat
A: Health | 21 | That is frustrating | Thank you for your answer | 16 | Wrong A cat
A: Auto | 22 | That is frustrating | Thank you for your answer | 16 | Wrong A cat
A: Homeowner’s/Renter’s | 23 | That is frustrating | Thank you for your answer | 16 | Wrong A cat
A: Disability | 24 | That is frustrating | Thank you for your answer | 16 | Wrong A cat
A: Long Term Care | 25 | I am glad to hear that | Thank you for your answer | 19 | Wrong A cat
Q: How necessary or important do you feel it is for you to work with a financial advisor? | | | | 44 | Wrong Q cat
A: Not important | 26 | That sounds stressful | Got it | 18 |
A: Somewhat unimportant | 27 | That sounds stressful | Got it | 18 |
A: Very important | 28 | Sounds nice | Got it | 8 |
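The two correction causes in the tables above share a common structure: an affective reaction is chosen from the valence of the question framing combined with the polarity of the answer, so misclassifying either one ("Wrong Q cat" or "Wrong A cat") flips the reaction's valence. The sketch below is illustrative only, under the assumption of that two-factor lookup; the function and category names are hypothetical and do not reproduce the system's actual implementation.

```python
# Illustrative sketch of valence-based reaction selection (hypothetical names).
# A "Wrong Q cat" error corresponds to a flipped question_framing input;
# a "Wrong A cat" error corresponds to a flipped answer_polarity input.

# Reaction phrasings drawn from the correction tables above.
POSITIVE_REACTIONS = ["That sounds positive", "Sounds good", "I am glad to hear that"]
NEGATIVE_REACTIONS = ["That sounds stressful", "I am sorry to hear that", "So sorry about that"]
NEUTRAL_REACTIONS = ["Got it", "Noted", "Thanks for sharing"]

def pick_reaction(question_framing: str, answer_polarity: str) -> str:
    """Select a reaction from the question framing and answer polarity.

    question_framing: "positive" (e.g. "I feel refreshed after sleep")
                      or "negative" (e.g. "I have difficulty falling asleep").
    answer_polarity:  "affirm", "deny", or "neutral".
    """
    if answer_polarity == "neutral":
        # Factual answers (e.g. "A: Budgeting") get an acknowledgement, not valence.
        return NEUTRAL_REACTIONS[0]
    # Affirming a positively framed item, or denying a negative one, is good news.
    good_news = (question_framing == "positive") == (answer_polarity == "affirm")
    return POSITIVE_REACTIONS[0] if good_news else NEGATIVE_REACTIONS[0]
```

Under this framing, a question like "I feel refreshed after sleep" misclassified as negatively framed would produce "So sorry about that" for an affirming answer, which matches the pattern of the "Wrong Q cat" rows above.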