The Complexity of Collecting Digital and Social Media Data in Ephemeral Contexts

Shawn Walker

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington
2017

Reading Committee:
Emma Spiro, Chair
W. Lance Bennett
Nicholas Weber

Program Authorized to Offer Degree: Information School

© Copyright 2017 Shawn Walker

Abstract

Chair of the Supervisory Committee: Assistant Professor Emma Spiro, Information School

Just as social media has permeated communication in our public and private lives, it has also become a widely used source of data and object of study in academic and commercial research. Despite widespread use, relatively little is known about how social media datasets change when observed at different points over time or how collection methods may impact the data at the core of our research projects. For example: Will results differ if social media data are collected in real-time, a few minutes after production, hours, days, or weeks later? What happens to the metadata, links to web pages, photos, and videos embedded in this content over time? If data collection methods do not preserve and archive social media posts, metadata, and linked content, are researchers venturing into a new dataset each time they engage with it?

In this dissertation, a combination of quantitative and qualitative approaches is used to examine how social media datasets change over time and how that change impacts the reliability and authenticity of these data. Three Twitter-based case studies, each exhibiting prototypical elements social scientists encounter in their research, are used to demonstrate the impact of research design and data collection choices.
This work advances the field of information science by empirically investigating how the ephemeral nature of social media data, metadata, and linked content has significant and lasting effects on the reliability and authenticity of datasets used in research. By situating research design decisions of how and when to observe data within the frameworks of process theory, infrastructure studies, and archival theory, this work brings the importance of methodological considerations to the forefront of studies of digital and social media. Empirical observations inform a set of implications for social media research, offering researchers practical considerations to inform their research designs.

TABLE OF CONTENTS

Chapter 1. Introduction
1.1 Methodological Issues Surrounding Social Media Data Collection
1.2 The Process of Collecting Social Media Data
1.3 Gaps in the Social Media Literature
1.4 Research Questions
1.5 Chapter Summaries
Chapter 2. Social Media as a Data Source
2.1 Social Media Sites as Infrastructure and Platforms
2.2 Collecting Social Media Data
2.3 The Dimensions of Latency and Level of Automation
2.3.1 Latency of Data Collection (Temporal)
2.3.2 Level of Automation (Method)
2.4 Process Theory
2.4.1 The Social Science Research Process
2.4.2 Social Media Data Collection as Process
2.5 Chapter Summary
Chapter 3. Social Media as a Record
3.1 Archival Theory and Preservation
3.2 Applicable Concepts From Archival Theory
3.3 Social Media as a Record
3.4 Chapter Summary
Chapter 4. Ephemerality
4.1 Conceptualizing Ephemerality
4.2 Ephemerality and Social Media Data
4.3 Research Design & Methods
4.4 Data Collection
4.4.1 Real-Time Data Collection
4.4.2 Nightly Availability
4.4.3 Semi-Real-Time Collection
4.4.4 Summary of Data Collection
4.5 Case Study and Data Description
4.5.1 Occupy Wall Street - Topic Based Dataset
4.5.2 West Coast Departments of Transportation - Account Based
4.5.3 RuPaul's Drag Race - Mixed Account/Topic Based Dataset
4.6 Summary of Case Study Data Collection and Analysis
4.6.1 Analysis of Occupy Wall Street Case Study
4.6.2 Analysis of DoT and Drag Race Case Studies
4.6.3 Case Descriptive Statistics
4.7 Chapter Summary
Chapter 5. Reliability
5.1 Operationalizing Reliability
5.2 Reliability Analysis of Each Case Study
5.3 Mechanisms of Inaccessibility
5.4 Chapter Summary
Chapter 6. Authenticity
6.1 Operationalizing Authenticity
6.2 Authenticity Analysis of Tweet and User Metadata
6.3 Tweet Linked Data
6.4 Chapter Summary
Chapter 7. Impacts of Ephemerality
7.1 Reliability - The Relationship Between Time and Ephemerality
7.2 Authenticity: The impact of the prototypical features
7.3 Limitations
7.4 Contributions
7.5 Future Work
7.6 Conclusion
Works Cited
Appendix A: Implications of this Research for Social Media Research
Appendix B: Case Study Query Terms

LIST OF FIGURES

Figure 2.1: Layers impacting the social media data collection process
Figure 2.2: Examples of quantification offered in the US National Park Service public profile on Facebook (left) and Instagram (right) from May 2017.
Figure 2.3: Example of the display of activity metrics and affordances in the Twitter interface.
Figure 2.4: Tweet (top), Twitter API request, and API output of tweet by @TwitterAPI.
Figure 2.5: Spectrum of Social Media Data Collection Methods by Latency.
Figure 2.6: The Research Process from The Practice of Social Research (Babbie, 2007, p. 108).
Figure 3.1: Diagram of the social media as a record framework.
Figure 4.1: Summary of the data collection process and timeline for Occupy Wall Street case study.
Figure 4.2: Summary of the data collection process and timeline for Departments of Transportation and RuPaul’s Drag Race case studies.
Figure 4.3: Daily tweet volume collected for each case study during real-time collection. Timestamps are in UTC.
Figure 4.4: Visualization of overlap between data collected real-time (Twitter Streaming API) and semi-real-time (Twitter REST API).
Figure 5.1: Tweets inaccessible per time period during the 90-day observation period for the Departments of Transportation and RuPaul’s Drag Race case studies.
Figure 5.2: Illustration of how a tweet inherits the accessibility properties of the tweets it is related to. In the example shown, a retweet is deleted because the account that produced the original retweet was deleted.
Figure 6.1: Tweet from the US National Park Service as displayed on the Twitter website in May 2017 with metadata fields labeled.
Figure 6.2: Distribution of mean edit distance of change to user profile metadata - user description, user name, and user location. Only users with an edit distance > 0 are displayed.
Figure 6.3: Distribution of mean change in user metrics per user.
Figure 6.4: Distribution of mean simhash distance between the content of weekly archives of URLs within tweets selected for archiving. Content in all URLs for each tweet was grouped into one unit. Tweet URLs archived for less than two weeks excluded.
Figure 7.1: Simple regression of tweet accessibility at time points t0 - t90 for the Departments of Transportation and RuPaul’s Drag Race case studies.

LIST OF TABLES

Table 2.1: Description of the Twitter API Ecosystem
Table 2.2: Data Collection Approaches by Time and Method
Table 4.1: Case Collection and Analysis Summary
Table 4.2: Summary of Case Study Descriptive Statistics
Table 4.3: Proportion of tweets with entities: hashtags, URLs, and mentions in each case study.
Table 4.4: Summary of Case Study Descriptive Statistics - User Statistics
Table 4.5: Visualization of overlap between data collected real-time (Twitter Streaming API) and semi-real-time (Twitter REST API).
Table 5.1: Proportion Tweets Accessible After Time Periods Under Investigation.
Table 5.2: Reason for tweet inaccessibility - Departments of Transportation and RuPaul’s Drag Race
Table 5.3: Tweet inaccessibility categorized by changes to user account vs. a retweeted account.
Table 6.1: Number and proportion of users changing profile metadata
Table 6.2: Extent to which users changed profile metadata as measured by mean edit distance between changes.
Table 6.3: Mean change in user-level metrics
Table 6.4: Top 10 URLs by volume.
Table 6.5: Descriptive statistics for archived URLs.

ACKNOWLEDGEMENTS

Like all dissertations, this process has been a long journey for me with a lot of support from friends, family, and colleagues along the way. For everyone who supported me: THANK YOU! I made it! I would like to express my gratitude:
• To my advisor, Dr. Emma Spiro, I am grateful for your deep understanding, unwavering support without judgement, and tireless help in defending and crossing this finish line.
• To Dr. Nicholas Weber for stepping in to support and help me with the final edits this summer.
• To Dr. W. Lance Bennett for always helping me with the big picture, providing support, and pushing my work out of my comfort zone.
• To Dr. Karine Nahon for being my first advisor, shaping so much of my work and who I am as a scholar, and providing support from afar.
• To Dr. Robert Mason for showing me how to put students first, how to write my first grant, helping a group of grad students start a lab around a crazy idea, and always championing our work.
• To my SoMe Lab partners in crime, Joe Eckert and Jeff Hemsley, where a large portion of this dissertation began to develop.
• To my friends and colleagues in the sunshine cohort: you provided such an inspiring and supportive environment, and I feel so lucky to have gone through this process with all of you. Amazingly, 8 of us started the program and 8 of us completed the program!
• To my Sunday library writing group and partners in crime.
• To my PhD buddies: Liz Mills, Norah Abokhodair, and Jordan Eschler.
• To the Social Media Collective at MSRNE and the Oxford Internet Institute’s Summer Doctoral Program for their support. The support system and lifelong connections I formed have already taken me far.
• To Dr. Sheetal Agarwal for being a source of support, inspiration, patience, understanding, and an all-around amazing person.
Watching you finish your PhD was such an amazing gift. Thanks for being my source of support.
• To Dr. Kristen Shinohara and Dr. John Mario for all of the blood, sweat, and tears. Thanks for going through the job process with me and watching my job talk 150 times to the point where you could present it better than I!
• To my adopted family who went on this journey with me (Elyse and Kevin Lewis, Jean Donohue, and Fred Johnson): I don’t know what I’d do without you all.
• To Linda Dolive for making space for me and “adopting” me into your family, being a mom to me, and being such a champion.
• To Darwin for all of the support, wags, and unconditional love. As promised, you stuck with me through the PhD. I’ll miss you on the next stage of this journey, buddy.

DEDICATION

To my mom, even though she didn’t get to see this, I know she’s proud.

CHAPTER 1: INTRODUCTION

For the first time, we can follow [the] imaginations, opinions, ideas, and feelings of hundreds of millions of people. We can see the images and the videos they create and comment on, monitor the conversations they are engaged in, read their blog posts and tweets, navigate their maps, listen to their track lists, and follow their trajectories in physical space (Manovich, 2013, p. 461).

“Big” and “social” data bring a substantial increase in the scale and types of data that academic researchers and practitioners can access. This digital trace data, in the form of social media posts from Facebook or Twitter for example, allows automated observation and collection of the online activities of millions of users by simply writing a short program to collect data (Freelon, 2014). Researchers use social media data to study and make claims about human activity (Zimmer & Proferes, 2014). Businesses and governments (Parmelee & Bichard, 2013) use social media data to make decisions about which potholes to repair or customers to serve.
In this new research environment, how does one conduct empirical research with methodological rigor? In this dissertation I focus on one aspect underlying this question: how social media data sets change over time.

1.1 METHODOLOGICAL ISSUES SURROUNDING SOCIAL MEDIA DATA COLLECTION

Motivated by direct prior experience with the challenges of collecting and analyzing social media data, videos, and web links in three prior projects, this dissertation directly addresses this question. The first project examined YouTube videos and blogs during the 2008 US Presidential Election to understand the role these platforms play in propagating viral political videos (Nahon, Hemsley, Walker, & Hussain, 2011). The second project focused on the use of social media by the Occupy Wall Street movement, where 31,000 seed URLs embedded in tweets related to the movement were coded based on the type of resource (e.g. mainstream media site, celebrity site, government site, etc.) each referred to (Agarwal, Bennett, Johnson, & Walker, 2014; Bennett, Segerberg, & Walker, 2014). The third study used tweets and links embedded in tweets to examine rumor propagation after the Boston Marathon Bombings (Starbird, Maddock, Orand, Achterman, & Mason, 2014). During each of these projects, the research team encountered an ephemeral and unstable social media dataset, leading to a host of non-trivial methodological issues that needed to be addressed in order to meet project aims and answer core scientific questions.

While social media data allows researchers to “study social and cultural processes and dynamics in new ways” on an unprecedented scale (Manovich, 2013, p. 461), our tendency as researchers making use of such data is often to focus on the phenomena under examination; less attention is paid to understanding the dynamic nature of these data themselves.
Social media datasets present a number of challenges for researchers including, but not limited to: (1) the need for new methods and/or the adaptation of existing methods for data collection and analysis, (2) issues of representation and sampling (Boyd & Crawford, 2012; Liang & Fu, 2015), and (3) ethical implications and risks to those whose social media data is being collected and analyzed (Light & McGrath, 2010; Zimmer, 2010; Zimmer & Proferes, 2014). Underlying these important issues are fundamental questions about the nature of social media data itself. Relatively little is known about how social media datasets change when observed at different points over time or how choices of collection method may impact the data at the core of our research projects, and subsequent research findings. For example: Will results measuring the prevalence of rumors over time differ if social media data are collected as they are produced in real-time, a few minutes after production, hours, days, or weeks later? What happens to the metadata (links to web pages, photos, and videos) embedded in and documenting this content over time? If data collection methods do not preserve and archive social media posts, metadata, and linked content, are researchers venturing into a different dataset each time they engage with it?

The findings in this dissertation show that latency, a delay in the collection of data, changes the resulting social media dataset, its metadata, and linked data. Social media posts are deleted or become inaccessible, users change their profiles, and embedded links change. To illustrate, consider the case of a researcher using the #YesAllWomen campaign to study misogyny online. The #YesAllWomen hashtag and social media campaign was used to share stories of misogyny and violence against women following the 2014 Isla Vista killings (Valenti, 2014).
Elliot Rodger, a twenty-two-year-old man, went on a shooting spree in Isla Vista, near the University of California Santa Barbara, killing six people before committing suicide. In the weeks leading up to the killings, Rodger posted a series of YouTube videos and a 137-page autobiographical “manifesto,” declaring his hatred of all women for the rejection and disdain he claims they dealt him throughout his life. Responses to the campaign ranged from support and personal stories to hateful and sexist comments. Since hate speech is against the abusive behavior policies of social media sites like Twitter,¹ Facebook,² and Instagram, many of these posts were deleted and accounts suspended by the platforms. This example demonstrates a number of challenges for researchers:
• What search terms should the researcher use to collect content related to #YesAllWomen? While the hashtag #YesAllWomen may seem like an obvious choice, a single hashtag will not contain all of the content relevant to the research project. Some users may have tweeted content related to #YesAllWomen without a hashtag, used a related hashtag, or posted content on other social media platforms outside of Twitter. The researcher must decide how to bound the project using a set of query terms and social media platforms based upon their research questions and understanding of the context of the campaign. This would include observing what keywords users include in their posts and what platforms they are using.
• Will all content collected with the query terms be relevant to the research at hand? Or will some posts contain the query terms, but be irrelevant to the project?
• What processes should be used to filter out irrelevant posts and how should these processes be documented?

Since the campaign was not planned in advance, data collection cannot be set up prospectively to capture the campaign from its inception.
As a result, at least a portion, if not all, of the social media data related to the campaign would need to be collected retrospectively. The delay in data collection introduces additional questions:
• During the delay between when tweets and images were posted and when a researcher started collecting, content, accounts, images, and links may change or be deleted.
• How will this affect the dataset and the findings from the dataset?
• Will retrospectively collected data represent the actual discussions and posts within the #YesAllWomen campaign or will missing posts provide a false account of the campaign?

¹ https://support.twitter.com/articles/18311
² https://www.facebook.com/help/216782648341460

Despite these challenges, there has been an explosion of research using social media data to study human behavior in almost every domain of social science. Examples range from death and memorialization (Acker & Brubaker, 2014), social movements (Agarwal et al., 2014; Bastos, Mercea, & Charpentier, 2015; Bennett et al., 2014), disasters (Starbird & Palen, 2010; 2012), and epidemiology (Malik, Gumel, Thompson, Strome, & Mahmud, 2011) to many others. This body of literature is growing at a staggering rate (Williams, Terras, & Warwick, 2013; Zimmer & Proferes, 2014), but the accompanying body of methodological contributions describing and examining the process of conducting research with social media data (SMD) is very thin. Existing methodological literature is typically tool or technologically driven, not a result of empirical examination of the data collection process (Felt, 2016; Miller, Ginnis, Stobart, Krasodomski-Jones, & Clemence, 2015), leaving researchers without an understanding of how to approach or evaluate the social media data collection process. As a result, researchers, practitioners, and students are left to continually re-invent the wheel, learning through a process of trial and error (Brooks, 2015).
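The questions raised in the #YesAllWomen example about bounding a collection with query terms, filtering out irrelevant posts, and documenting that filtering can be made concrete with a short sketch. The following is an illustration only, not the procedure used in this dissertation: the function name `filter_posts`, the `Decision` record, and the idea of an explicit list of exclusion terms are all assumptions introduced here, and simple substring matching stands in for whatever relevance judgment a real project would require.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Decision:
    """One logged filtering decision (hypothetical record format)."""
    post_id: str
    kept: bool
    reason: str


def filter_posts(posts: Iterable[Tuple[str, str]],
                 query_terms: Iterable[str],
                 exclude_terms: Iterable[str] = ()) -> Tuple[List[str], List[Decision]]:
    """Return the IDs of kept posts plus a log entry for every post examined.

    `posts` is an iterable of (post_id, text) pairs. Matching is a
    case-insensitive substring test, which is a simplifying assumption.
    """
    query = [term.lower() for term in query_terms]
    exclude = [term.lower() for term in exclude_terms]
    kept: List[str] = []
    log: List[Decision] = []
    for post_id, text in posts:
        lowered = text.lower()
        matched = next((t for t in query if t in lowered), None)
        blocked = next((t for t in exclude if t in lowered), None)
        if matched is None:
            log.append(Decision(post_id, False, "no query term matched"))
        elif blocked is not None:
            log.append(Decision(post_id, False, f"excluded by term: {blocked}"))
        else:
            kept.append(post_id)
            log.append(Decision(post_id, True, f"matched term: {matched}"))
    return kept, log
```

Because every post receives a log entry, including those that were dropped, the log itself becomes a record of how the dataset was bounded, which is the kind of documentation the third question above asks for.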
Complementary to the social media methodology literature, a body of literature coined “critical data studies” by Dalton and Thatcher (2014) has emerged. Critical data studies focuses on concerns related to the use, analysis, and ethics of big and social media data (Boyd & Crawford, 2012). boyd and Crawford (2012) provide the most cited critique of big data with six provocations ranging from questions about the (lack of) contextualization of big data to how big data changes our³ definitions of knowledge. Often, however, such literature provides cautions and criticism without tangible solutions to issues raised.

³ In this document when referencing ‘our’ or ‘we’ I am referring to the community of researchers using social media data.

1.2 THE PROCESS OF COLLECTING SOCIAL MEDIA DATA

The process of collecting social media data, while seemingly simple on the surface, requires numerous competencies (Brooks, 2015; Driscoll & Walker, 2014; Felt, 2016), both technical and research design related. The process is made more complex as it involves a mixture of theory, data, and computational processes (see Goble et al., 2008, for a bioinformatics perspective) filled with many “black-boxes” (Driscoll & Walker, 2014; Goble et al., 2008, p. 510). An algorithmic system underlies the multitude of interfaces users and programs use to consume and interact with information from social media platforms. These algorithmic systems (Ananny & Crawford, 2017) are an assemblage of “institutionally situated code, practices, and norms with the power to create, sustain, and signify relationships among people and data through minimally observable, semiautonomous action” (Ananny, 2015, p. 93). To users and researchers outside of the platform, these algorithmic systems and databases seem like black boxes taking input from a user’s action and outputting posts without giving any details of how data is processed or changed.
The lack of transparency only adds complexity to the research process since the impact of forces assembling and acting on data are unknown to us.

The process often begins with the research design of a project, linking a set of research questions with the appropriate social media data. After matching a research design and data source, a data collection plan must be developed and executed. It is important to use data collection methods that preserve the elements of each social media post and its accompanying metadata that are essential to answering the research questions. Developing and executing this plan requires a deep understanding of the phenomena and social media site(s) under examination, that is to say the recipes used to “cook” (the collecting, cleaning, processing, and analysis of the data) (Bowker, 2013), and the technical skills to carry out the “cooking”. The amount of data collected ranges from a few hundred posts to larger datasets consisting of thousands or millions of information artifacts. Methods of collecting social media data range from manually copying and pasting content from social media web sites to large-scale automated data collection via complex scripts. Large or small, manual or programmatic, the processes of research design and social media data collection require a set of empirically informed principles to guide researchers through the many choices that must be made throughout the data collection process. Literature focusing on this process is lacking and this dissertation contributes to this area.

Social media datasets continue to present challenges for researchers after data collection. These datasets also push at the boundaries of traditional research methods (Hargittai & Sandvig, 2015; Karpf, 2012).
Oftentimes researchers attempt to apply existing, more traditional methods in this space, but this approach may be problematic if researchers do so blindly without first adapting methods to the unique properties of social media datasets and platforms. For example, consider the application of stratified random sampling techniques traditionally used in survey research (de Leeuw, Hox, & Dillman, 2012; Lynch, 2008): how can this method be applied to social media data without a proper sampling frame or observed characteristics with which to determine the stratum into which each account or individual falls? How do we account for representation (Miller et al., 2015), political power (Nahon, 2015), algorithmic and platform bias (Gillespie, 2010), presentation of self (Goffman, 1990), and the context within which these posts are generated (Seaver, 2015)? These are important questions to address in order to assess the validity of research using social media data. A precursor to this

In the rush to collect data, the implications of observing dynamic content at a particular (arbitrary) point in time and issues of preservation of social media posts and their accompanying linked metadata are not generally considered and are rarely discussed in research publications. At its core, social media data is ephemeral, a term often used but rarely defined in research (see Bernstein, Monroy-Hernández, Harry, & André, 2011 for an example of use without a definition). When researchers use the term ephemeral in the context of social media data it is often shorthand for instability; data is constantly changing, being updated, or deleted. As a result, it is difficult for two researchers to collect the same exact dataset in real-time and practically impossible for them to collect the same dataset retrospectively via a purchase of data from a reseller or by scraping (Burgess & Bruns, 2014). Further, researchers are often forbidden from sharing full datasets by the Terms of Service (ToS) of many platforms.
While some platforms, such as Twitter, allow for sharing of each post’s unique identification number, this still requires researchers to “rehydrate”, or go back to the platform to recollect, the most current post content and metadata, if available. The difficulty of collecting and/or sharing datasets makes it impossible to validate or replicate studies using social media data (Felt, 2016).

1.3 GAPS IN THE SOCIAL MEDIA LITERATURE

Few studies have focused on the dynamic nature of social media data itself; those that have done so primarily looked at specific tools or software interfaces for data collection (e.g. Driscoll & Walker, 2014; Felt, 2016; Gaffney & Puschmann, 2014; González-Bailón, Wang, Rivero, Borge-Holthoefer, & Moreno, 2014), and have not considered the impact of the ephemeral nature of social media data on the data collection process and resulting dataset. In this dissertation, I examine the impact of the ephemeral nature of social media data on research datasets: how posts, the accompanying metadata documenting each post, and linked content such as videos, images, and web pages change over time. Our research community needs ways of acknowledging, understanding, stabilizing, and addressing the ephemerality of social media datasets. Without ways to measure the impact ephemerality has on datasets, researchers are unable to determine the subsequent impact on research designs and findings. If ephemerality does have an impact, how can researchers quantify and counteract it, or at least address it and understand the limitations of their results? The first step in addressing these questions is to gather empirical data to measure the level of ephemerality over time within multiple social media based case studies at the post, metadata, and linked content levels.
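As a minimal sketch of what a post-level measurement of this kind could look like (an illustration under stated assumptions, not the procedure used in this dissertation), the share of posts that remain accessible can be computed by comparing the identifiers collected in real-time with the identifiers that can still be retrieved at later points in time. The function names and the sample identifiers below are hypothetical, and the sketch assumes the researcher already has, for each later point in time, the list of post identifiers that were successfully recollected.

```python
def accessibility_rate(collected_ids, accessible_ids):
    """Share of originally collected posts that are still accessible.

    collected_ids: identifiers of posts captured in real-time (hypothetical input).
    accessible_ids: identifiers successfully recollected at a later time.
    """
    collected = set(collected_ids)
    if not collected:
        raise ValueError("collected_ids must not be empty")
    # Only posts from the original collection count; anything else is ignored.
    still_accessible = collected & set(accessible_ids)
    return len(still_accessible) / len(collected)


def accessibility_over_time(collected_ids, snapshots):
    """Compute the accessibility rate for each later observation.

    snapshots: mapping of a label (e.g. "day 7") to the identifiers
    that could still be retrieved at that observation.
    """
    return {
        label: accessibility_rate(collected_ids, ids)
        for label, ids in snapshots.items()
    }


if __name__ == "__main__":
    # Made-up identifiers, used only to show the calculation.
    collected = [101, 102, 103, 104]
    later = {
        "day 1": [101, 102, 103, 104],
        "day 7": [101, 102, 104],
        "day 14": [101, 104],
    }
    for label, rate in accessibility_over_time(collected, later).items():
        print(label, rate)
```

Set operations are used here so that duplicate identifiers, or identifiers that were never part of the original collection, do not distort the rate. Measuring change at the metadata and linked content levels would require comparing the content of each post across observations rather than only its presence.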
1.4 RESEARCH QUESTIONS

In this dissertation I use the lenses of process theory (Crowston, 2000), infrastructure studies, and archival theory surrounding electronic records (Duranti, 1995; 1997) to examine how the ephemeral nature of social media impacts collected data, situating social media data collection within the social science research process. Quantitative approaches are applied to examine how social media datasets change over time through an examination of social media posts surrounding three Twitter-based case studies. The first case, the Occupy Wall Street movement, was a world-wide social movement observed over a 3-year timeframe. The second case, Departments of Transportation on the West Coast (Washington, Oregon, and California), represents everyday political interactions with official government accounts. The third case, the reality TV show RuPaul’s Drag Race, represents an entertainment context with a high level of image and video content due to the show’s visual nature. Each case study was chosen because it represents prototypical features of the types of data collection scenarios researchers experience when collecting social media data. Examples of these dimensions include time-scale (short to long), population bounding (tight to loose), level of political contention (highly contentious to the everyday political context), and inclusion of links to media such as images and videos (high to low level of media and linking).

The aim of this research is to contribute an empirically informed framework for the study of social media data. As such, the research questions to be addressed in this dissertation are as follows:

RQ: How does the ephemeral nature of social media data affect social media data (SMD) sets?
• RQA: How does the ephemerality of SMD interact with the process of data collection to impact the reliability of social media data sets?
• RQB: How does the ephemerality of SMD interact with the process of data collection to impact the authenticity of social media data sets?

1.5 CHAPTER SUMMARIES

The rest of this dissertation is organized as follows:

Chapter 2 describes social media as a source of research data, examining social media platforms, social media data collection, and the social science research process through the theoretical lenses of process theory, platform studies, and infrastructure studies. The lens of process theory situates the collection of social media data as a subprocess within the process of conducting social science research. Through the lens of process theory, research and data collection are seen as a series of linked steps, allowing for the testing and comparison of alternative choices at each step, for example the time between an event and the collection of social media data, or the choice of collection method. This process is impacted by the affordances, or features, of each social media platform and the data collection infrastructure these platforms provide.

Chapter 3 introduces a framework to discuss and approach the design of systems to collect social media data. The framework, Social Media as Record, brings relevant concepts from archival and electronic records theory to social media and its preservation. When viewing social media posts as records within a collection, posts are connected via an archival bond and not just seen as individual posts. Posts and embedded content are also bound by time, taking into account not just the text of the social media post itself, but the metadata documenting each post and the linked content within each post as well as the related social media posts and the context within which the posts were created.

Chapter 4 introduces the overarching concept of ephemerality drawn from media studies, archival theory and practice, web archiving, and data curation.
Three Twitter-based case studies, each exhibiting prototypical elements social scientists encounter in their research, and the methods of data collection used to quantify the impact of the ephemerality of social media data are described in detail. The three case studies are: 1) the Occupy Wall Street movement, 2) Departments of Transportation on the West Coast of the US, and 3) the reality TV show RuPaul’s Drag Race. Tweets, metadata, and archives of web links embedded in tweets were collected for each case study (see Appendix B) in real-time for a period of two weeks. Descriptive statistics for each case study are also provided in this chapter.

Chapter 5 develops the concept of a reliable social media dataset. Drawn from the concept of statistical repeated measures, a reliable social media dataset is one in which the corpus of social media data collected by a researcher is impervious to change: collecting a dataset with the same parameters at different points in time should yield the same dataset. I operationalize the concept of reliability as the number of tweets still accessible at any point in time compared with tweets collected in real-time. Tweets that were inaccessible at the end of the observation period were examined to determine the cause of inaccessibility. Approximately 7 to 12% of tweets were no longer accessible at the end of the observation period, with over 90% of these inaccessible due to either the deletion of the tweet itself or deletion/protection of the user’s account. Less than 10% of these tweets were inaccessible due to the deletion of a related retweet or account. Over 40% of tweet inaccessibility occurred within the first 48 hours.

Chapter 6 develops the concept of authenticity, pertaining to the stability of the metadata and linked data surrounding a social media post.
I measure authenticity by comparing nightly changes in tweet metadata such as the user description, account username, number of followers, and number of retweets as well as the change in the content of hyperlinks embedded in the tweet text. In the Departments of Transportation and RuPaul’s Drag Race case studies, over 50% of users changed profile information and there was pervasive linking to other tweets and social media platforms.

Chapter 7 summarizes the main findings of this dissertation as well as the differences and similarities of each case study. Conclusions, limitations, and future work are also discussed.

Appendix A presents a general set of implications, based upon the framework and findings of this dissertation, for researchers and practitioners looking to collect and analyze social media data as part of a research project. Appendix B lists the keywords and accounts used as query terms for data collection in each case study.

Chapter 2. SOCIAL MEDIA AS A DATA SOURCE

This chapter focuses on social media as a data source for academic, industry, and practitioner research. Social media platforms, social media data collection, and the social science research process are examined through the theoretical lenses of process theory, platform studies, and infrastructure studies. The lens of process theory is used to situate the collection of social media data as a subprocess within the process of conducting social science research. Through the lens of process theory, research and data collection are a series of linked steps, allowing for the testing and comparison of different choices at the same step, for example the time between an event and the collection of social media data, or the choice of collection method. This process of data collection is impacted by the affordances of each social media platform and the data collection infrastructure these platforms provide.
2.1 SOCIAL MEDIA SITES AS INFRASTRUCTURE AND PLATFORMS

Within the context of social media data collection, research questions and the research process intersect with the infrastructure and affordances offered by social media platforms. Affordances are the features a platform offers to users. Facebook offers users the ability to “like” posts. Other affordances are not offered (Facebook does not offer a “dislike” button), placing constraints on the activities of a user. As a result, the affordances of a platform create a set of activities and interactions a user can and cannot perform within the platform.[4]

I argue that, as illustrated in Figure 2.1, research involving social media datasets is impacted by each of these layers. Some layers, such as the databases and algorithms within each platform that process and translate the activities of users into the data structures and interfaces we see when accessing the platform, are hidden from public view. These algorithms and data structures shape the possibilities within the system (Ananny & Crawford, 2017; Gillespie, 2010; Vis, 2013) and, through their function, constrain what information researchers can easily consume and process. For example, each platform quantifies certain actions and offers these as part of the interface users see (Grosser, 2014). Figure 2.2 shows examples of this quantification for two social media platforms, Facebook and Instagram, for the US National Park Service. The National Park Service’s Instagram profile shows that the account has (1) 204 posts, (2) 554k followers, and (3) follows 430 other Instagram accounts, as well as the most recent posts from this account.
Their Facebook profile categorizes the National Park Service as a (1) ‘Government Organization’ with a 4.6 star rating and quantifies the number of Facebook users who have (2) liked the National Park Service, (3) followed their posts, and (4) visited the page, as well as (5) the number of the viewer’s friends who have liked the National Park Service page.

[4] For a more lengthy discussion of the affordances of social media platforms, see Taina Bucher & Anne Helmond’s article (2017).

Figure 2.1: Layers impacting the social media data collection process

Figure 2.2: Examples of quantification offered in the US National Park Service public profile on Facebook (left) and Instagram (right) from May 2017.

The design choices made by social media sites to provide metrics for certain activities within their platform privilege some activities while limiting or preventing the visibility of other types of activities. For example, the number of times a tweet was retweeted is often used as a measure of the popularity or reach of a tweet (Starbird & Palen, 2012; Zimmer & Proferes, 2014). The number of retweets is displayed prominently when viewing a tweet on the Twitter website. It is important to note that no other measures related to the number of times a tweet has been seen by users are displayed. Using the number of retweets as a measure of popularity or influence privileges the production of posts over other types of participation, such as listening (Crawford, 2009).

The affordances and metrics offered in the Twitter interface are labeled in the tweet from the University of Washington shown in Figure 2.3. Twitter offers three metrics that can be used as proxies for tweet popularity[5]: (1) number of retweets, (2) number of likes, and (3) a visual proxy made up of the profile images of users who have retweeted this tweet.
Below the tweet timestamp, three affordances to interact with this tweet are offered (highlighted in Figure 2.3): (1) the curved arrow allows users to reply to this tweet, (2) the square arrows allow users to retweet this tweet, and (3) the heart allows users to like this tweet.

[5] See https://dev.twitter.com/overview/api/tweets for a list of fields in a tweet, archived at https://perma.cc/439D-PKXJ.

Figure 2.3: Example of the display of activity metrics and affordances in the Twitter interface.

Social media sites take on the roles of both a platform and an infrastructure (Plantin, Lagoze, Edwards, & Sandvig, 2016) in research. Infrastructure and platform studies both refer to underlying features and structures; combined, they ‘take account of how rapidly “infrastructuralized platforms” have arisen in the digital age’ (Plantin et al., 2016). Through these lenses, social media sites are seen as research infrastructures offering a rigid set of affordances, or entry points, constraining our ability to access, query, format, and collect data. Entry points take two forms: (1) interfaces for human consumption (e.g. Facebook.com, Twitter.com, and mobile applications) and (2) software interfaces designed for consumption by computer programs, called Application Programming Interfaces (APIs) (e.g. Facebook Graph API, Twitter Streaming API, Instagram API, and Amazon’s e-commerce APIs) (Helmond, 2015). Social media sites also offer these interfaces to websites on the open web to extend their reach, decentralize data production, and centralize data collection and processing (Gerlitz & Helmond, 2013). Algorithms underlie the interfaces, mediating between users and databases. The impact of these underlying features of social media sites on research design and data collection needs to be taken into account, as each of these layers processes and shapes the resulting datasets.
In this section, I briefly examine social media sites through these two lenses, as illustrated in Figure 2.1 by the bottom two layers. The bottom layer of Figure 2.1 illustrates visible and invisible layers of infrastructures of social media sites. An infrastructure lens “makes the fundamental qualities of endurance, reliability, and the taken-for-grantedness of a technical and institutional base supporting everyday work and action” (Edwards, 2010) visible. In infrastructure studies, Ribes developed the kernel as a unit of analysis offering a lens through which to investigate the enabling capacities of an infrastructure, specifically a research infrastructure (Ribes, 2014). The kernel, a concept borrowed from computer science operating system design, is composed of (1) the “core resources and services that an infrastructure makes available” and (2) “the work, techniques, and technologies that seek to sustain the availability of those resources over time” (Ribes, 2014). In the kernel, resources and services are entangled with the techniques and technologies used to make the resources and services available, thus acknowledging the blurred nature between layers. Social media sites offer services in the form of APIs and interfaces allowing researchers to access and query data within the site. These services offer resources in the form of rendered data about users, posts, and interactions with content. While some reverse engineering of algorithms within social media sites is possible (Ananny, 2015; Ananny & Crawford, 2017), due to the lack of transparency and speed of evolution of social media sites, the majority of the processes that shape data underlying the sites’ publicly accessible interfaces remain invisible and unknowable. What we can do is be cognizant of how these invisible layers constrain our ability to conduct research through the metrics, formats, and query parameters of accessible data from these platforms (Grosser, 2014; Vis, 2013).
Examining social media sites through the lens of the infrastructure kernel, only two of the kernel components are visible: the resources and services that the infrastructure makes available. These resources and services take the form of the APIs and web-based interfaces that sites make available to both users and researchers, as well as the data presented within these interfaces. The activities of users within the site act as input into the invisible bottom layer of algorithms and data structures. These invisible components of each site’s infrastructure process, shape, and render these activities into web pages and API data when researchers and users access the sites via public interfaces and APIs.

Moving up to the second layer in Figure 2.1, the platform lens, social media sites act as platforms offering a set of affordances, or features, that allow users and researchers to generate and interact with data held in the data structures and algorithms of the infrastructure layer. Key features of platforms include programmability, affordances or features that allow and constrain the activities of users (Bucher & Helmond, 2017; Gibson, 1977), and accessibility of data and logic through application programming interfaces (APIs). Public APIs and web interfaces offer a set of affordances, or features, which constrain or enable users to act and interact in certain ways. For example, an API may provide only items posted within the last 7 days, and the Facebook web interface provides a limited number of “reactions” (like, love, haha, wow, sad, and angry) for users to respond to posts. In some cases researchers use the same web-based interfaces that users use (for example, viewing or scraping data from a user’s profile or social media post), or they may use publicly available APIs to collect data from the platform. The affordances of these interfaces allow researchers entry into the infrastructure of the platform.
2.2 COLLECTING SOCIAL MEDIA DATA

Twitter, a social media platform founded in 2006, offers a set of computer-focused Application Programming Interfaces (APIs) for automated data collection and user-focused public web interfaces for manual data collection that researchers may use (Zimmer & Proferes, 2014). These APIs offer interfaces for scripts and applications to request information and interact with the platform. Responses are returned as a JSON document, a computer readable format of key-value pairs. The key uniquely identifies a field and the value contains the field’s data. Figure 2.4 shows a tweet rendered on the Twitter website as well as the JSON document retrieved from the API. The “name” field of the tweet displayed in Figure 2.4 contains the value “Twitter API” in the API output. It is important to note that APIs render social media posts as textual documents while the user-facing web interfaces render social media content as web documents. The JSON API output contains links to embedded content such as images and videos. The web interfaces render this content inside of the web document, as can be seen with the Twitter logo in the upper left-hand corner of the rendered tweet in Figure 2.4. This is an important consideration because:
1. API data does not render content in the same format as the web interfaces platform users interact with. As a result, the interface and data researchers collect differ from the experience of platform users.
2. While the JSON API output provides pointers to linked content such as images, URLs, and videos, the content of the links is not contained within the data returned by the API. If a researcher plans to include linked content as part of their analyses, the content may change or become inaccessible between the time of data collection from the API and when linked content is accessed at a later date during analysis.
For example, a researcher may collect tweets from the Streaming API in real-time and then access links when coders analyze the content weeks or months later. As a result, the content may not accurately reflect the content users posted at the time of data collection, breaking the time bound between the social media posts and the embedded content.

API Request

    GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=twitterapi&count=1

API Response (some fields have been removed)

    {
      "created_at": "Wed Aug 29 17:12:58 +0000 2012",
      "contributors": null,
      "text": "Introducing the Twitter Certified Products Program: https://t.co/MjJ8xAnT",
      "retweet_count": 123,
      "id": 240859602684612608,
      "retweeted": false,
      "in_reply_to_user_id": null,
      "user": {
        "name": "Twitter API",
        "created_at": "Wed May 23 06:01:13 +0000 2007",
        "location": "San Francisco, CA",
        "favourites_count": 90,
        "utc_offset": -28800,
        "followers_count": 1212864,
        "time_zone": "Pacific Time (US & Canada)",
        "description": "The Real Twitter API. I tweet about API changes, service issues and happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
        "statuses_count": 3333,
        "screen_name": "twitterapi"
        ...
      }
      ...
    }

Figure 2.4: Tweet (top), Twitter API request, and API output of tweet by @TwitterAPI.

Twitter offers API endpoints, or connections, for posting tweets, modifying user accounts, and requesting information about a specific user or tweet. Researchers select the endpoint relevant to the data and time period they need to collect. Twitter’s API interfaces are similar to the data collection interfaces offered by other social media platforms such as Facebook (Graph API[6]), Instagram[7], and Baidu[8]. As described in Table 2.1, each Twitter API provides access to a type of data within a specific timeframe. For example, the Streaming API provides access to tweets that match a set of keywords (hashtags, usernames, text, or URLs) as tweets are being posted to the platform, and it is rate-limited to 1% of the entire Twitter stream. If a set of keywords matches more than 1% of the Twitter stream, the excess tweets are not delivered and a rate limit notice is returned. Each API offers access to a specific time period of data, so the time period covered by the API must be matched with the time period of the data to be accessed. For example, the Streaming API only allows real-time access to tweets as they are posted to Twitter, so if a researcher does not know about an event of interest in advance and has not set up the data collection infrastructure prior to the event, another API must be used for data collection.

[6] https://developers.facebook.com/docs/graph-api
[7] https://www.instagram.com/developer/
[8] http://developer.baidu.com/wiki/index.php?title=docs

Table 2.1: Description of the Twitter API Ecosystem

API[9] | Time Period | Access | Description
REST API | N/A | Public | Provides access to the current state of user profiles, timelines, and tweets. A user’s screen name or a tweet’s unique identifier must be known in order to be retrieved.
REST API - Search API | 7 days | Public | Provides access to tweets from the last 7 days via keyword search matching the tweet’s username, text, URLs, or hashtags. The documentation states that the Search API focuses on “relevance and not completeness”, noting that some tweets and users may be missing from search results from the Search API. Twitter points developers and researchers to the Streaming API or GNIP for more complete datasets. The API is currently rate-limited[10] to 180 requests every 15 minutes. Each request may contain up to 100 tweets.
Streaming API - Filter | Real-time | Public | Provides real-time access to tweets via keyword matching (400 keywords), username/id (5,000 users), or a geographic bounding box (25 boxes). Researchers must maintain a constant connection to the API in order to receive data. Any disconnection will result in missing data. The API is currently rate-limited to 1% of the entire Twitter stream.
Streaming API - Sample | Real-time | Public | Provides a small random sample in real-time of all public tweets. Information is not provided on how the sample is generated or if the sample is statistically random and representative.
GNIP PowerTrack | Real-time | Commercial | A commercial service from Twitter providing full real-time access to the entire Twitter “firehose”. Query options are more granular.
GNIP Historical PowerTrack | Historical | Commercial | A commercial service from Twitter providing access to all non-deleted tweets from the start of Twitter to the present.

[9] See https://dev.twitter.com/products for full technical documentation of Twitter’s APIs.
[10] https://dev.twitter.com/rest/public/rate-limiting

Researchers may also collect data directly from Twitter’s public-facing website. This has the advantage of accessing data in the same rendered format that the platform’s users experience as well as the inclusion of some linked content (images and videos).
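As a rough illustration of how the Search API limits listed in Table 2.1 bound automated collection, the sketch below estimates the minimum time a script must spend waiting on rate limits to retrieve a given number of tweets. It is a back-of-the-envelope calculation, not a description of any particular collection tool: the function name and its parameters are hypothetical, the default values are the figures from Table 2.1 (180 requests every 15 minutes, up to 100 tweets per request), and it assumes every request returns a full page of results and ignores network and processing time.

```python
import math


def min_rate_limit_wait_minutes(total_posts, posts_per_request=100,
                                requests_per_window=180, window_minutes=15):
    """Lower bound on minutes spent waiting for rate-limit windows to reset.

    Defaults follow the Search API figures in Table 2.1. Assumes each
    request returns posts_per_request posts (an optimistic assumption).
    """
    if total_posts <= 0:
        return 0
    requests_needed = math.ceil(total_posts / posts_per_request)
    # The first window's requests can be sent immediately; every further
    # block of requests requires waiting for one more window to elapse.
    windows_to_wait = (requests_needed - 1) // requests_per_window
    return windows_to_wait * window_minutes


if __name__ == "__main__":
    for n in (18_000, 18_001, 1_000_000):
        print(n, min_rate_limit_wait_minutes(n))
```

Under these assumptions, up to 18,000 tweets fit within a single rate-limit window, while one million matching tweets would require at least 825 minutes of waiting. Real collections will usually take longer, since requests rarely return full pages.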
The search interface on the Twitter website provides access to all public tweets and is not limited to a 7-day window like the Search API. Collecting data directly from the Twitter website does not lend itself to large-scale data collection like the APIs offer. Each collection method (API vs. website) offers its own advantages and disadvantages, so researchers should choose a data collection strategy that most closely meets the requirements of their research questions and design. The best strategy may include a combination of data collection from public APIs, the public website, and archiving linked content such as media and URLs embedded in each post.

2.3 THE DIMENSIONS OF LATENCY AND LEVEL OF AUTOMATION

In the social media research space, researchers are applying existing methods to the collection and analysis of social media data. In a content analysis of the abstracts of over 500 papers focusing on Twitter from 2007 to 2011, Williams et al. (2013) found that the analysis of tweets, rather than Twitter users or the Twitter site itself, was the most common focus of these papers. Building on this work, Zimmer and Proferes coded 382 studies, published between 2006 and 2012, that focused on Twitter for their primary data collection and analysis. They created a typology of Twitter research related to the “disciplines and methods of analysis, amount of tweets and users under analysis, the methods used to collect Twitter data, and accounts of ethical considerations related to these projects” (2014). Their findings show the amount of research utilizing Twitter data grew from two studies in 2007 to 145 studies in 2011, with a dip to 109 studies in 2012. The fields of computer science, information science, and communications dominated. Content analysis of the text of the tweet itself was the dominant analysis, with nearly two-thirds of all studies examined using this method, and a majority of studies used Twitter APIs for data collection.
Of the papers not using the Twitter API, manual capture or the use of a tool such as TwapperKeeper was popular. Similar work with papers focusing on Facebook as their data source found that content analysis of posts also dominated as the primary method for analysis (2016).

Based on the meta studies of social media research approaches and my experiences working with social media data, it is useful to think about the social media data collection process across two dimensions: 1) Latency (real-time vs. historical) and 2) Automation (manual vs. automated). Figure 2.5 shows data collection methods along a latency (or delay) continuum from the lowest latency (data collection in real-time) to the highest latency (data collection from a historical archive). In the middle are low latency (semi-real-time) data collection methods enabling the collection of data in near real-time, seconds to minutes after a post has been created.

Figure 2.5: Spectrum of Social Media Data Collection Methods by Latency.

2.3.1 Latency of Data Collection (Temporal)

The time of data collection refers to whether social media data is collected at the time of production (real-time) or with some delay after a post has been produced (historical). Here I borrow the concept of latency, a computer networking term related to the delay in transferring information from one part of a network to another (Gummadi, Saroiu, & Gribble, 2002; B. Zhang et al., 2006), to refer to the delay between the production of a social media post, metadata, or reference to linked content and its collection. This is separate from, but related to, the time period under investigation. For real-time data collection, posts are collected immediately after they are produced or “posted”; in historical data collection, there is a latency or delay between when a post was produced and its collection.

2.3.2 Level of Automation (Method)

The level of automation refers to the level of manual intervention required by the method of data collection.
Automated collection of social media data is normally accomplished through a small program or script, freeing up researchers or their assistants from completing the process by hand. The level of automation is separate from, but related to, the method of data collection, as most methods can be accomplished in an automated or manual fashion. For example, if a researcher chooses to take screenshots of social media posts, this can be done by manually loading each page and taking a screenshot or through a script that automates the process. In many cases, automation allows for the templated collection of higher volumes of data over longer periods of time, since scripts can execute processes faster than humans and for longer periods of time. For many researchers who prefer high-volume, real-time data collection, automated data collection has become the “gold standard” for social media research.[11] Methods such as grounded theory based coding (Patton, 2001) require a level of human decision making and nuance that is less amenable to automation. Merging the temporal and automation dimensions results in 4 possible approaches, as displayed in Table 2.2.

[11] While some researchers treat real-time data collection as the “gold standard”, I do not take a normative stance in this dissertation. Researchers should use the findings of this dissertation as they see fit in their own research.

Table 2.2: Data Collection Approaches by Time and Method

Real-Time / Manual | Real-Time / Automated
Historical / Manual | Historical / Automated

Real-Time/Manual

In this scenario, a researcher or proxy is collecting social media data at the time of production using a manual process. The data collection may be done using copy and paste, screenshots, or by viewing the posts as they appear on the screen. For example, during a political debate, a researcher could follow specific keywords and users as the debate progresses.
Real-Time/Automated

In this scenario, a researcher is collecting social media data in real-time using a social media site’s Streaming API via an automated script. With a streaming API, a script maintains an open connection to an API in order to receive posts related to the query in real-time, moments after they have been posted.

Historical/Manual

In this scenario, a researcher or proxy is collecting social media data after its production using a manual process. The data collection may be done using copy and paste, screenshots, or by viewing the posts/profiles minutes to months after they were posted. For example, months after a political debate, a researcher could search for specific hashtags and users.

Historical/Automated

In this scenario, a researcher collects social media data after its production using a social media site’s API via an automated script. With a REST API, a script polls an API with a query in order to receive posts related to the query. Each query returns a certain number of posts and the script must query the API multiple times in order to receive all of the posts related to the query. This could take seconds or days depending on the rate limits imposed by the API and the number of posts matching the query.

2.4 PROCESS THEORY

The research questions guiding this dissertation are concerned with the impact of choices made during the research process, specifically how and when the collection of social media data is performed, on the resulting datasets. Process theory “argues for a patterned sequence of events [focusing on] … questions of the order and sequence of events and about the effects of that order [to determine if more] preferable outcomes can be associated with particular sequences of activities” (Abbott, 1990). A sequence is an ‘ordered sample of things’ that can be temporal or spatial in nature with properties of a continuous or discrete variable.
These become events when tied together into temporal sequences (Abbott, 1990). Using process theory, a given set of sequence patterns can be examined to understand why they are the way they are, or to understand the effect of a certain set of sequence patterns. Examples of the former include: “Does education determine the characteristic sequence of career? Does the size of an organization determine the shape of the status rankings we find in it?” An example of the latter is whether “those promoted before acquiring certain kinds of expertise are helped or hindered in their ultimate career success.” I focus on the second kind of question: the effect of a certain set of sequences on a particular outcome. The research questions to be addressed focus on the effect of the ephemerality of social media data on the research process and on the characteristics of the resulting datasets. The sequence of events I focus on is the research process and the data collection design choices made during this process. Within this data collection subsequence, I am interested in quantifying the effect of ephemerality on the reliability and authenticity of the social media data collected using these processes. Thus, through the lens of process theory, social media data collection by researchers becomes a sequence of events. A process theory approach allows for the examination of the impact of changes to the sequence, and thus of the impact of ephemerality on social media datasets collected using different data collection procedures.

2.4.1 The Social Science Research Process

The research process is the process researchers go through in order to achieve their desired research outcome. The lens of process theory (Crowston, 2000) views processes as a way of accomplishing goals and transforming inputs into outputs, allowing the subprocess of social media data collection to be situated within the research process.
This approach stands as an alternative to existing work that focuses on the use of tools to collect social media data, which often obscures the methodological choices and epistemologies embodied and hidden within the tool.

Figure 2.6: The Research Process from The Practice of Social Research (Babbie, 2007, p. 108).

Consider the above diagram (Figure 2.6), which presents a high-level overview of the research process as described by Babbie in The Practice of Social Research (2007). When viewed through a process theory lens, ideas, interests, and theories act as inputs to the research process, leading to outputs: research findings and applications. Generally, the genesis of research is the ideation phase, ignited by an idea, some interest in a phenomenon, and/or a theoretical frame. From that point, a researcher may do exploratory work and/or read prior studies to better understand the phenomenon and the sites of observation for data collection. Thus, the diagram begins with “interests, ideas, and theory,” with double arrows between them representing the bidirectional movement among the three. For example, an interest may lead to an idea, which is further developed through theory, generating new ideas. Once a researcher’s ideas, interests, and theories are honed into a more well-defined purpose and list of outcomes, the conceptualization, choice of research method, population and sampling methods, and operationalization of variables must be determined. Again, this process is iterative, with each step influencing the others and the steps occurring in any order. Conceptualization involves specifying the meaning of each concept in the research. In research designs using highly structured methods, such as surveys and experiments, concepts may need to be well-defined in advance. In other cases, such as with open-ended interviews, the goal of the research may be to uncover the meaning of certain concepts, so these concepts may not be well-defined at the start of the project.
Single or multiple research methods are then chosen based on their appropriateness for addressing the research question(s), the constraints of the available data, and the skills of the researchers involved in the project. Operationalization is the process of determining the measurement techniques for each variable. The population and sampling methodology details the group under investigation. Since it is normally not possible to observe every member of a population (complete data), researchers specify a sample to be collected and analyzed. At this point, the researcher has decided what to study, among what population, and through which method or set of methods. Observations and data collection can now commence. Once data has been collected, it is often not in a form that lends itself to analysis or interpretation, so it must be processed and cleaned. “Unprocessed” data is cleaned and reformatted for analysis. In the cleaning step, any erroneous and invalid data is filtered out; it should be noted that erroneous data is different from outlying data. The “processed” dataset, if necessary, can now be reformatted for analysis and the analyses performed. The final step and output of the process, application, involves packaging and communicating the results of the study. Methods of communicating the results of a research study include, but are not limited to, publishing peer-reviewed articles, presenting at conferences or public forums, or writing a blog post.

2.4.2 Social Media Data Collection as Process

Data collection is a subprocess occurring within the larger research process described in the previous section. Determining what data to collect, what platform(s) to collect data from, and how to collect it are precursors to starting data collection; these steps occur during the development of the research idea and goals, conceptualization, operationalization of variables, selection of the population and sample, and selection of the research method(s).
As shown in Figure 2.5, each of these steps informs the process of data collection or observation of a phenomenon. The data collection process occurs within this subprocess of observation but, as mentioned, it does not occur in a vacuum; it is informed by all of the steps occurring before it. While social media data are just one type of research data source, the context has unique features that make the data collection choices extremely impactful. Consider the example from the introduction of a researcher interested in using the #YesAllWomen campaign to study misogyny online. Before collecting data, a researcher must conceptualize the research, choose one or more research methods, bound a population and sample, and operationalize the variables. Let’s run through an example project using this case in order to show how the social media research process is situated within and connected to the other elements of the research process. Refining these ideas into a more tangible research project, imagine that the researcher is interested in discovering common factors between Twitter accounts that are targets of hate and misogyny within the #YesAllWomen hashtag. Important terms such as misogyny and prominent users would need to be conceptualized and operationalized. After determining the indicators of a misogynistic tweet, content analysis may arise as the most appropriate research method. Tweets could then be collected and classified as to whether they contain misogynistic content in order to produce a corpus of misogynistic tweets. This corpus could be examined to find the most mentioned user accounts. Prominent users’ public profiles, including the profile image, profile description, and timeline of tweets, could be examined to determine common factors in how the accounts present themselves or in the types of tweets in their timelines.
This scenario illustrates multiple issues of ephemerality during the social media data collection process:

• When coding tweets for misogynistic content, what content will be examined? Will linked content or media embedded in the tweet be included in the coding process, or will the coding focus only on the text of the tweet? If included in the coding process, will embedded media and URLs be archived at the time of data collection, or will coders open the URLs and media while coding the tweet? Will the content of the URLs and images change between the time of tweet collection and the time of coding?
• Are the assumptions inherent in the methods used met? For example, content analysis assumes a certain level of stability in the dataset (Karlsson, 2012; Krippendorff, 2012; Saltzis, 2012) and parametric statistics assume certain normalized distributions of data. Also, after determining which accounts are the most prominent in the misogynistic corpus, might some accounts be deleted or made private, or might public details such as profile images and descriptions change, due to the level of harassment the accounts received? How might these changes impact the research findings?

These examples point to the central issues of the reliability and authenticity of social media data that lie at the heart of this dissertation.

2.5 CHAPTER SUMMARY

In this chapter I have discussed social media as a data source for research and provided a framework for understanding social media platforms as data collection infrastructure. The framework helps researchers understand how the affordances of social media platforms constrain and shape their ability to collect data from these platforms. This chapter also discussed methods of data collection from social media platforms, with a specific focus on Twitter. The process of data collection was then situated within the larger social science research process.

Chapter 3.
SOCIAL MEDIA AS A RECORD

Often the analysis of social media posts focuses on either volume-based metrics or the text of posts within bounded sets of keywords/accounts (Zimmer & Proferes, 2014) on a single platform (for a discussion of “hashtag studies,” see Burgess & Bruns, 2014). These approaches do not take into account that social media posts contain an assemblage of text, images, metadata, and hyperlinks. When a post is accessed via a platform’s website or API, the post and its accompanying metadata are assembled at the time of the request. Embedded content and metadata surrounding a post may change independently of the text of the post, breaking the time-based bond between a post and its surrounding metadata and content. Since researchers often use social media data as a historical record or documentation of an event or phenomenon occurring at a specific time, changes in the accessibility of posts or in the content of embedded metadata may have an impact on research findings, because the content may no longer reflect what was posted by the user. Some changes to web pages, such as hourly updates to the BBC homepage or 404 Not Found error pages, may be more easily recognizable and quantifiable. Other changes, such as a link redirecting to a new location or the deletion of a user’s account, may be less obvious. Current practices focus solely on collecting social media posts and often assume a high level of stability in social media datasets, which does not reflect the experience of many researchers, as evidenced by the treatment of real-time data collection as a “gold standard” in social media research (Burgess & Bruns, 2014; Driscoll & Walker, 2014; González-Bailón et al., 2014).
Current informal research practices derive from the choices described in the methods sections of early publications exploring the use of social media data (e.g., Bruns, 2012; Bruns & Burgess, 2011), which were written when the collection of social media data was experimental and novel and the platforms were just starting to emerge. These practices coalesce around the “large-scale” collection of social media data via automated scripts and the public APIs offered by social media platforms. Often the choices made in bounding a case, selecting the related keywords and accounts, choosing the method of collection, and cleaning the data are only briefly described in the methods sections of these papers. While such descriptions shed some light on the reasoning behind and the implementation of these choices, they normally do not “include enough detail about how the studies were actually conducted on the ground to allow for their replication” (Hargittai & Sandvig, 2015, p. 2), leaving researchers without a comprehensive framework through which to employ similar methods or to determine the best approach for their own research questions. In this chapter I develop a framework for social media data collection based on relevant concepts borrowed from archival and electronic records theory. Archival theory and practice focus on the acquisition, arrangement, description, and preservation of objects and records in library collections. Electronic records theory builds on approaches in archival science for the management and preservation of the integrity of legal and business records in an electronic environment. I draw on elements from both theories to develop a framework that expands our strategies for collecting data, incorporating multiple components (post text, metadata, linked content) and the greater environment in which posts are generated by users and platforms.

3.1 ARCHIVAL THEORY AND PRESERVATION

There are two dominant models of the curation process. The older lifecycle model conceives of records as living organisms.
It is heavily used in the records management literature and practice and is based on a cradle-to-grave understanding of records in which archives are part of the “end-of-life” management. In the lifecycle model, records pass through stages until they die. While the life cycle concept has been taken up in studies of data (the "data life cycle"), many have pointed out that the model is troubling, namely in its assumption that things are born and eventually die, and in that records may not mirror life stages of development. In contrast, Australian archival scholars (McKemmish, 2001) have developed the "records continuum model", which suggests that records live on in many iterations, perhaps even after they end/die. In this model, “records are 'fixed' in time and space from the moment of their creation, but record-keeping regimes carry them forward and enable their use for multiple purposes by delivering them to people living in different times and spaces” (Pearce-Moses, 2005). Both models include at least four main stages: 1) appraising the historical value of a record, 2) accessioning an item into an archive, 3) arranging and describing items in the archive, and 4) preserving items in the archive (Acker, 2014; Daniels, Walch, & Service, 1984). Appraisal is “the process of establishing the value of documents made or received in the course of the conduct of affairs, qualifying that value, and determining its duration” (Duranti, 1994). An archivist uses this assessment to determine whether an item should become part of the collection; if so, the item moves on to the next stage of the process. Inherent in archival practice is the recognition of the impossibility of collecting and preserving every record; only items deemed to have high value and relevance are accessioned, or brought into the archive. Also, an archive is limited to the records available. As a result, all archives are incomplete, with gaps in their record (Thumim, 2002). After an item is accessioned, it is arranged and described.
Arrangement involves the physical placement of records, often mirroring the arrangement and ordering that was given to the archive. This physical placement of records also represents the association and relation with all of the other documents received as part of that collection (Holmes, 1964). Collections are then integrated into the larger arrangement at the depository, record group, and filing unit. A description of the record is then recorded, which includes information related to the creator, dates, and content to facilitate the management and finding of the record. Finally, the physical or digital item is preserved to prevent further degradation. This is part of an ongoing process. As an example, consider a prominent politician donating her collection of letters to the local university library. An archivist would first visit the collection to select and accession the appropriate items. After accessioning, the items are taken back to the library to be arranged, and each record is described. The archival process mirrors the social media data collection process, making it a good model from which to draw relevant concepts. Researchers appraise the value of data collection and develop and execute a research design to collect and analyze those records; after collection, data is arranged and described for analysis; and, as part of collection, data is preserved in a format necessary for analysis. Social media data collection, like web archiving to a certain extent, collapses the traditional archival lifecycle into one step.

3.2 APPLICABLE CONCEPTS FROM ARCHIVAL THEORY

In the following sections, I describe the relevant concepts from archival and electronic records theory, applying each concept to social media data.

• Action. A core component of every record is that it participates in some action.
Actions fall into four types: dispositive (the action comes into existence with the creation of a record, e.g., a contract of sale, or the entry of relevant information in a patient record that substantiates admittance to a hospital), probative (the record acts as proof that an action took place, such as a marriage document), narrative (records that are the substance of non-legal actions, e.g., most email), and supporting (records that help carry out an oral action, such as lecture notes or a meeting agenda) (Duranti, Eastwood, & MacNeil, 2013). Social media posts serve as a record of the act of producing and interacting with posts and social media platforms.
• Archival bond. The archival bond is the web of relationships that each record has, at the moment it was made or received, with the records that belong in the same aggregation. In a traditional collection, the archival bond is carried by the implicit physical arrangement of records. The archival bond in an electronic record is made up of the classification codes assigned to records, connecting each record to other records belonging to the same class (Duranti, 1997; Duranti et al., 2013). In social media datasets, the archival bond consists of the (inter)relationships between social media posts in the same reply/retweet/hashtag stream and the same content aggregation (about the same topic). In social science research, a content aggregation is analogous to the bounds of a case study.
• Context. The context is anything outside a record that has significance for its meaning. Relevant contexts could include the legal and organizational system in which the record creation took place, the procedures used in the course of creating a record, and the documentary context: the fonds, that is, the whole of the records that a person naturally accumulates through their activities and the byproducts of them, and its internal structure. Electronic records theory expands the documentary context to include the technological context, which comprises the technological characteristics of the record-keeping system (Duranti et al., 2013).
Within social media datasets, the context includes documentation of the technological context of the platform that the user used to produce the post; this is especially important as platforms evolve over time. Consider that the Twitter of 2011 had different features and configurations than the Twitter of 2017.
• Physical Form. In an electronic record, the physical form includes the configuration and architecture of the electronic operating system, the architecture of electronic records, the software, all those parts of the technological framework that determine what the document will look like and how it will be accessed, and digital signatures and time-stamps (Duranti et al., 2013). The majority of these are invisible to the user, but any migration or small change would generate a new and different record. In the case of social media data, this would include information about the form or format in which the researcher collected the data.
• Content. The content consists of the textual, symbolic, and/or visual message that is meant to be conveyed. Content must be fixed and stable in order for a record to exist and cannot be separated from its form or its medium (Duranti et al., 2013). In the context of social media data, content includes the post text, links, video, or images, making an argument for the content of a social media post to be more than just the ‘text’ of the post.

3.3 SOCIAL MEDIA AS A RECORD

Figure 3.1: Diagram of the social media as a record framework.

A framework derived from the relevant concepts discussed in the previous section is illustrated in Figure 3.1. The framework has five elements: 1) layers of context and infrastructure in which a post is produced and embedded, 2) the social media post, 3) linked content contained in the post, 4) metadata surrounding the post, and 5) related social media posts. At the center of the framework is the content of the social media post, often rendered as “the text of the post” through social media platform APIs.
The second layer consists of linked content: URLs and media content (images, videos, etc.) that are often embedded within this text. Surrounding, documenting, and describing the post is the metadata associated with the post. This can take the form of metadata about the time the post was created, what client was used to create the post, metrics about the post such as the number of likes, and user information. The social media post, its linked elements, and its metadata are all linked by a specific point in time. The post refers to a specific state of the user profile, embedded links and images, and other metadata connected to the post at a single moment in time. Viewing these items disconnected from that bound moment in time may result in viewing a different post and different content than the user intended. In the final layer, the post is connected to other posts, within the same platform and on other platforms, being produced at that same moment in time within a specific human context.

3.4 CHAPTER SUMMARY

Seeing social media posts through the lens of a record offers researchers a guide for approaching both the collection and analysis of social media posts. This framework, based on relevant concepts in archival theory, expands the conception of a social media post beyond “just its text” and illustrates the interconnectedness of the elements of a post as well as the time-bound nature of those elements.

Chapter 4. EPHEMERALITY

The concept of ephemerality is often used in Internet and Social Media Studies, but rarely defined. In this chapter I develop and operationalize the concept of ephemerality and describe the core data collection methods and cases used in this dissertation. The chapter concludes with descriptive statistics for each case under examination.

4.1 CONCEPTUALIZING EPHEMERALITY

Researchers use the term ephemerality when discussing the ever-changing or impermanent nature of social media.
Issues of ephemerality in research are not new or unique to digital or social media data; scholars in many fields, including film and internet studies, have wrestled with the issue. In the case of film studies, consider representations of early histories of 1950s British broadcasting. When the BBC started archiving footage in the 1950s, it prioritized the recording of documentaries for inclusion in its early archives (Thumim, 2002). As a result, the absence of audiovisual archives created false assumptions about the reality of early television because only footage deemed “important” enough to record was archived. The “decision of what to archive and not archive created ‘particular bundles of silences’” (Thumim, 2002). These silences exist in the gap between what was broadcast by the BBC and what was archived, presenting two different images of reality; this is the ephemerality of early television. Similar issues of ephemerality are faced by researchers using web archives. At the root of the internet is a hypertext system in which data is stored in a network of nodes connected by links (Smith & Weiss, 1998). These links, commonly called hyperlinks (URLs), serve as the primary mechanism that connects nodes in the web to one another, and are technological affordances that allow seamless connection between one website and another (Park & Thelwall, 2003). Estimates of link decay (also known as half-life, death, accessibility, persistence, and link rot) mainly come from studies of links in journal articles and range from 31% to 39% (Dimitrova & Bugeja, 2007; Goh & Ng, 2007; Moghaddam, Saberi, & Esmaeel, 2012; Sanderson, Phillips, & Van de Sompel, 2011). These studies rely on automated analysis of error codes returned by web servers, on researchers visiting each of the URLs, or on a combination of the two. These methods only detect obvious cases where the destination of the URL returns an error.
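The automated error-code approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used by any of the cited studies: the `LinkObservation` structure, the `classify_link` function, and the category labels are hypothetical names introduced here, and the redirect and fingerprint checks are assumptions added to show what an error-code check alone leaves out.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkObservation:
    """One visit to a URL (hypothetical structure, for illustration only)."""
    requested_url: str
    final_url: str   # URL reached after following any redirects
    status: int      # HTTP status code of the final response
    body: bytes      # raw response body


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 digest used to compare two copies of a page."""
    return hashlib.sha256(body).hexdigest()


def classify_link(obs: LinkObservation,
                  archived_fingerprint: Optional[str] = None) -> str:
    """Assign a coarse status label to a link observation."""
    # Error-code check: the only case a status-based study detects.
    if obs.status >= 400:
        return "error"
    # Assumed extra check: the link resolves, but to a different location.
    if obs.final_url != obs.requested_url:
        return "redirected"
    # Without a copy saved at collection time, change cannot be assessed.
    if archived_fingerprint is None:
        return "reachable"
    # Assumed extra check: same location, different content.
    if fingerprint(obs.body) != archived_fingerprint:
        return "changed"
    return "unchanged"
```

An exact digest comparison will flag any byte-level difference, including rotating advertisements or timestamps, so a label of "changed" is best read as a prompt for manual inspection and not as evidence of substantive change.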
In addition, a focus on “404 not found” errors for links in academic journals tells us little about the other ways links can change, decay, disappear, or erode in other contexts. Publicly accessible web archives, such as the Internet Archive’s Wayback Machine (https://archive.org/web/), do not provide a “magic” solution to the issue of link decay and change, since only a small portion of the web is archived. Even these ‘saved’ pages create issues such as “broken links, missing images, and code written for outdated pages” (Ankerson, 2012). Dynamic content, including Flash animations, provides another challenge because some types of embedded content are not archived by the automated systems used by web crawlers such as the Wayback Machine. This results in a “broken flash image” and inaccessible content (Ankerson, 2012). Ankerson encourages scholars working with web histories and archives to look at strategies used by broadcast historians who, as with the aforementioned BBC example, understand “well the difficulties in piecing together the past when so much of what was broadcast was sent out live and unrecorded” (2012). It is important to note, however, that most broadcast historians rely on centralized corporate and institutional archives,12 which often archive ephemera related to broadcast shows, such as internal memoranda, letters, press cuttings, reports, and much more. This differs from web archives which are (partially) “preserved digital files with proper contextualization” (Ankerson, 2012). As Ankerson acknowledges, these two examples experience different problems due to the ephemeral nature of the content (web and broadcast), with the BBC archives containing large amounts of contextual resources but lacking extensive archives of full broadcasts, while web archives offer a plethora of preserved sites, with their own complexities, but often lack contextual information.
12 Interview with Brewster Kahle. RLG DigiNews 6(3). Available at: http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070519/viewer/file3096.ml

Going back to the case of social media, researchers using social media data will stumble upon both a lack of context and a lack of archived material. Social media inherits many of the characteristics and complexities of the web, since many platforms have web-based interfaces and allow the inclusion of linked elements. A post or account may be deleted, and linked content such as images, videos, or web pages may change, decay, or disappear, but this is rarely discussed in papers using social media data. Consider two cases: the Occupy Wall Street Movement (OWS) and the Boston Marathon bombings. In both cases, social media were used to share event-specific content. During the Occupy protests, protesters posted images of police actions in order to assist other protesters in avoiding these actions. Once police actions had ceased, posts and their accompanying images were deleted by users, who used deletion as a protest tactic (Neumayer & Stald, 2014). Similar behavior was seen after the Boston Marathon bombings, when rumors about the identity and location of the bombers were abundant (Starbird et al., 2014). As more information was released by responding organizations, posts containing misinformation (i.e., false rumors) were deleted from timelines. In both of these cases, the real-time record differed from the retrospective record, leading to ‘bundles of silences’ similar to those experienced by researchers using early BBC archives. Researchers collecting social media posts in real time likely ended up with a different dataset than those who collected or purchased data weeks or months later. As a result, protest tactics during OWS would look significantly different between these two datasets, since some practices were meant to be temporary in nature.
Similar issues would emerge in the Boston Marathon bombings dataset: after correction, some rumors and misinformation might disappear entirely. These two examples do not take linked content such as web links into account; news articles are updated as a story progresses (Saltzis, 2012), and web forums, such as reddit, may delete or modify content over time. An important concept developed in Chapters 2 and 3 is that social media posts are made up of more than just the text of the post itself; they often contain links to web content, videos, and images that extend the post and are an integral part of it. Without the content of the link, whether that is a picture or a web page, it may be impossible to understand the post. The post and link may change together, or one may change while the other remains untouched, but they are both bounded by a specific time: the time when the post is created. If the two are out of sync, then the relationship, meaning, and context of the post and link may be disrupted. Gray, Szalay, Thakar, and Stoughton (2002) describe ephemeral data as data, and metadata describing that data, which cannot be replaced, reproduced, or reconstructed, therefore necessitating the archiving of that data. Stable data, in contrast, only requires the preservation of the metadata documenting its creation so that it can more easily be reconstructed. Social media data easily fits into the category of data that cannot be reconstructed after the fact. The ability to reconstruct a dataset is important and distinct from the ability to acquire a dataset. In the case of social media data, while it is possible to collect it using a myriad of methods (manual copying and pasting, automated collection from APIs, reading of profiles/walls/timelines, or purchasing data from aggregators), these methods do not necessarily preserve the data for future data collection endeavors.
For example, the SoMe Lab has an archive of over 350 million tweets related to the Occupy Wall Street movement (Agarwal et al., 2014), but it would not be possible to purchase the exact same dataset that the lab collected, even using the same collection parameters. That is because accounts, posts, and links have been deleted or modified. It is also important to note that deletes cascade: when someone deletes an account, any retweets of their tweets are also deleted from the timelines of the users who retweeted them. So while it is possible to go to a social media data aggregator such as GNIP (now a part of Twitter and the only authorized provider of Twitter data13) to purchase tweets from its historical collection, the tweets are filtered through a deletion list before being delivered to the customer. This is in contrast to real-time data collection from Twitter’s Streaming API. With the Streaming API, tweets matching the search criteria are delivered shortly after a user posts content (often within seconds). Through its Terms of Service (ToS), Twitter14, like most social media platforms, restricts the ability of researchers to share the data they have collected from the service. The restrictions are similar to those placed on confidential datasets such as the US Census (Abowd, Vilhuber, & Block, 2012). These restrictions mean researchers and librarians cannot publish and archive data as suggested by Gray et al. While it is possible to share aggregations of social media data, such as the number of posts, likes, and retweets, and the results of analysis, the ToS for most platforms only allow for the sharing of the unique identifier associated with each post (e.g., Tweet ID or Facebook post ID). It is not possible to share the actual content of, or the metadata associated with, a post. Using the unique ID of the post, researchers with the requisite technical skills can use a social media site’s public APIs to programmatically “rehydrate” each of the social media posts.
The API returns the current content and metadata associated with the post, if it is still accessible. While this solves the issue of telling other researchers “what posts are in my dataset” and provides a method of comparing the posts in different datasets, it creates a series of problems of its own. Three major issues emerge across many social media platforms: 1) deleted posts and posts from deleted accounts cannot be retrieved from the API, so we can be left with orphaned data, 2) modified posts are not flagged by the API, so we do not have a way to determine if a post has changed since its creation, and 3) large datasets are difficult and time-consuming to rehydrate due to API request limits. For example, the Twitter REST API is rate limited to 150 requests per hour, returning a maximum of 100 tweets per request. As a result, a researcher with a single Twitter account can, at best, “rehydrate” up to 15,000 tweets per hour. While it is possible to get around these limits by using multiple accounts simultaneously, doing so increases the technical complexity of the rehydration process. The result is that rehydration is not a viable solution. The post IDs themselves fall under Gray et al.’s definition of ephemeral data since, in most instances, they cannot be reconstructed. When a post changes or disappears, it may end up being a research “opportunity lost forever” (Lynch, 2008) or present a false account of the phenomena under study.

[13] While Twitter gifted a copy of the Twitter archive to the Library of Congress, it has not been made available to researchers or members of the public. The limitations and filtering of deleted/inaccessible content would apply to both the GNIP and LoC Twitter archives.
[14] See section 6b of the Twitter Developer Policy at https://dev.twitter.com/overview/terms/policy, archived at https://perma.cc/Y64D-JDLC.
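As a rough illustration of the third issue, the arithmetic behind the 15,000-tweets-per-hour ceiling can be sketched in Python. The request and batch limits are the figures quoted above; the function name, the `accounts` parameter, and the one-million-tweet example are hypothetical and serve only to show how rehydration time scales with dataset size.

```python
def rehydration_hours(n_tweets, requests_per_hour=150, tweets_per_request=100, accounts=1):
    """Estimate the hours needed to rehydrate n_tweets by ID.

    The defaults are the limits quoted in the text for a single account.
    `accounts` is a hypothetical knob for the multi-account workaround and
    assumes throughput scales linearly, which is an idealization.
    """
    if n_tweets < 0:
        raise ValueError("n_tweets must be non-negative")
    if requests_per_hour <= 0 or tweets_per_request <= 0 or accounts <= 0:
        raise ValueError("limits and account count must be positive")
    tweets_per_hour = requests_per_hour * tweets_per_request * accounts
    return n_tweets / tweets_per_hour


if __name__ == "__main__":
    hours = rehydration_hours(1_000_000)  # hypothetical dataset size
    print(f"{hours:.1f} hours (about {hours / 24:.1f} days) on a single account")
```

Under these assumptions the estimate grows linearly with the number of IDs, so a dataset in the tens of millions of tweets would take months of uninterrupted requests on one account.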
4.2 EPHEMERALITY AND SOCIAL MEDIA DATA

Some scholars, such as Herring (2010) and Karlsson (2012), express the concern that structural features of new media (such as hyperlinks) and embedded media content created through them are simply too ‘new’ to be addressed by ‘old’ or existing methods of content analysis alone. Content analysis, as described by Krippendorff (2012), assumes a high level of stability in the data being coded. Scholars also note that sampling procedures in the context of social media analysis are far from being understood (Gerlitz & Rieder, 2013). Without data concerning the stability of social media datasets, scholars are often forced to use the methods they are already familiar with and are unable to adapt existing methods to this new space.

The majority of the literature surrounding social media datasets and ephemerality focuses on deleted posts, especially tweets. Almuhimedi et al. (2013) performed a large-scale analysis of tweet deletion. In their dataset of over 67 million tweets, only 2.4% of tweets were deleted; however, 50% of roughly 300,000 users had deleted a tweet. Petrovic et al. (2013) used tweet and account features (number of words, presence of curse words, number of followers, number of tweets, etc.) to predict deletions. This is similar to the methods used to predict deleted emails via certain features of the email message (Dabbish, Venolia, & Cadiz, 2003). Of the 200,000 tweets examined, 85.2% were manually deleted by the user, 12.2% were inaccessible due to the account changing from public to private, and the remaining 2.6% were due to deletion of the account. The authors posited that account deletions were due to Twitter acting on violations of its spam policy.
Other studies have examined deletions due to government censorship in Chinese social media platforms (Bamman, O'Connor, & Smith, 2012), deletions as a protest tactic (Bamman et al., 2012; Neumayer & Stald, 2014), and deletions or changes by site administrators related to bullying or the posting of inappropriate content (W. Phillips, 2011). Changing policies of sites can also lead to deletions and changes of content and profiles. For example, some public libraries created a Facebook “user” account to interact with patrons, but this violated Facebook’s policy, so the user was deleted (Roblyer, McDaniel, Webb, Herman, & Witty, 2010).

A specific class of deletions that has been studied is ‘regret tweets’. Sleeper et al. (2013) used an Amazon Mechanical Turk task to understand the types of regret users experience regarding content they have tweeted. Of the 474 responses (using the regret categories from Knapp et al. (1986)), the most common cause for regret was revealing too much in the tweet (e.g. personal information or a secret), followed by direct criticism of a specific person. Participants rarely reported experiencing regret due to lying or ‘behavioral edict’. Of the participants who experienced regret due to a specific tweet, only 52% of the tweets were actually deleted. This further highlights the difficulty of deletion prediction: even if a tweet has cause for deletion, it may very well remain on Twitter. Similar work by Zhou et al. (2016) also focuses on responses to regrettable tweets.

The majority of social media deletion studies gather platform-wide data by connecting to a public API for a number of days to ingest publicly available deletion notices. This shows the gap between how content deletions are studied and how researchers often conduct their research: existing work has focused on “all deletes” from a stream rather than the topic-bounded case studies many researchers use.
Also, these studies only take the deletion of a post into account, missing edits to a post (if a platform offers such affordances), changes to a user’s profile or presentation, embedded media such as images and videos, and linked content.

4.3 RESEARCH DESIGN & METHODS

As discussed in the previous sections, social media data is often labeled ephemeral, or unstable over time, based on researchers’ experiences; however, we lack empirical data to support or refute this claim, especially through a social science lens. Existing studies have focused on deletions, only one aspect of ephemerality, by monitoring platform-wide deletion notifications from social media APIs (Almuhimedi et al., 2013; Petrovic et al., 2013; Zhou et al., 2016). As a result, researchers have little understanding of ephemerality and of the impact of delays in data collection on social media datasets and, therefore, on research findings. My research design addresses this gap through the investigation of the following research questions:

• RQA: How does the ephemerality of SMD interact with the process of data collection to impact the reliability of social media data sets?
• RQB: How does the ephemerality of SMD interact with the process of data collection to impact the authenticity of social media data sets?

To address these research questions, I conducted an empirical analysis of three Twitter-based case studies. A case study approach was chosen because it emulates the conditions, contexts, methods, and topic/population bounding commonly used in a social science approach. It also provides a foundation for generalized guidelines or best practices for social media based research designs for future researchers. The studies and their collection parameters are described in the next chapter.
The case studies focus on the social media platform Twitter because: 1) the use of Twitter as an object of study and source of observational data is pervasive in academic research (Williams et al., 2013; Zimmer & Proferes, 2014), 2) Twitter is less susceptible to algorithmic filtering, also called ‘filter bubbles’ (Bozdag, 2013; Bruns & Stieglitz, 2012; Bucher, 2012; Flaxman, Goel, & Rao, 2013; van Dijck, 2013, p. 75), than other platforms since the public APIs return all public, non-deleted statuses matching query terms, theoretically producing a more “accurate” record [15] (Driscoll & Walker, 2014), and 3) concepts, structures, metadata, and links (URLs) easily generalize beyond Twitter to other social media services and platforms.

Each case study was chosen because it represents specific prototypical features (e.g. time scale, account stability, context, level of image and media usage) of the types of cases, and therefore the types of social media datasets, social science researchers might encounter. This design allows for a combination of within- and between-case analysis to understand ephemerality within the prototypical cases. Across-case analysis allows for generalization to other social media platforms beyond Twitter.

[15] https://dev.twitter.com/streaming/overview, archived at https://perma.cc/AT9U-KEWJ.

4.4 DATA COLLECTION

Since the concepts of reliability and authenticity in social media datasets refer to the stability of specific components of a social media post over time, I collected tweets related to each case study at three different points in time. Using the framework described in Chapter 3, I collected data in real-time, semi-real-time, and nightly. To allow for longitudinal data collection over a period of two weeks, I used the appropriate Twitter APIs matching each of the concepts in the framework. For example,
it would be difficult to manually collect data 24/7 for a period of two weeks, so automated data collection from the Twitter Streaming API is used to collect data in real-time. This should not be seen as a privileging of automated data collection, but as a recognition that automated data collection most closely aligns with the continuous, longitudinal data collection strategy needed.

Tweets were collected in real-time from the Twitter Streaming API over a period of two weeks. Semi-real-time data was collected from the Twitter Search API (part of the REST API) two weeks after real-time collection concluded. The Twitter REST API was queried nightly to check the availability of each tweet for a period of 90 days after the tweet was collected in real-time. This process is summarized in Figure 4.1 below and described in more detail in the following sections. For each case study, the same query terms and parameters were used for each data collection method.

As mentioned in the second chapter, it is not possible to use the same API or collection method to collect tweets at different points in time. Each API is specifically designed to support a certain level of latency. For example, the Twitter Streaming API only supports real-time data collection and, as a result, cannot be used to collect tweets after they have been posted. The publicly accessible Streaming API provides access to up to 1% of the current Twitter stream; tweet volumes over 1% are rate limited and not accessible [16]. The REST API provides access to the past 14 days of non-deleted, public tweets (Driscoll & Walker, 2014; González-Bailón et al., 2014) using query terms, or access to all non-deleted tweets and users via their unique identifiers.

[16] While the two case studies requiring new data collection have been chosen to avoid rate-limiting by the Streaming API, it is theoretically possible for it to be an issue during data collection.
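To make the collection timeline concrete, the following Python sketch derives the real-time window and the per-tweet availability checks from a start date. It is only an illustration: the function names are hypothetical, the start date is the one reported later in this chapter for two of the cases, and the assumption that the first check happens the night after collection is mine rather than something the text specifies.

```python
from datetime import date, timedelta


def realtime_window(start, days=14):
    """Return the first and last calendar day of real-time collection."""
    return start, start + timedelta(days=days - 1)


def nightly_check_dates(collected_on, days=90):
    """Dates on which one tweet's availability is re-checked.

    Assumes the first check happens the night after collection; the text
    does not say whether day zero is included.
    """
    return [collected_on + timedelta(days=offset) for offset in range(1, days + 1)]


if __name__ == "__main__":
    start = date(2016, 9, 19)  # start date reported for the DoT and Drag Race cases
    first_day, last_day = realtime_window(start)
    checks = nightly_check_dates(last_day)
    print("real-time window:", first_day, "to", last_day)
    print("last availability check for a tweet from the final day:", checks[-1])
```

Because the 90-day window is anchored to each tweet rather than to the start of collection, the overall checking period runs longer than 90 days for any multi-day collection.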
Figure 4.1: Summary of the data collection process and timeline for Occupy Wall Street case study.

Figure 4.2: Summary of the data collection process and timeline for Departments of Transportation and RuPaul’s Drag Race case studies.

4.4.1 Real-Time Data Collection

For real-time collection from the Streaming API, a script maintained a constant connection to the API for a period of two weeks. Since the Streaming API delivers tweets in real-time, a connection to the Twitter API must be maintained at all times during the data collection period. Any drop in connection between the script and API will result in missed tweets during the period of disconnect. As each tweet was received by the script from the API, the metadata associated with each tweet was examined for URLs and media. Tweets were randomly selected for immediate and ongoing weekly archiving of embedded URLs and media through three processes: 1) crawling by the Heritrix web crawler (Mohr, Stack, Ranitovic, Avery, & Kimpton, 2004), 2) automated screenshots of URLs as rendered by a web browser via the Phantom Js Cloud service (https://phantomjscloud.com/), and 3) extraction of content contained in web pages using the Phantom Js Cloud API. Heritrix is the web archiving engine developed by the Internet Archive to create archival-quality crawls of web sites based on the standards from the International Internet Preservation Consortium (IIPC). Since some dynamic and Flash content is difficult for the Heritrix engine to archive, automated screenshots of each selected URL were also taken.
In addition, the main content of each URL, or the readable text on the page, was extracted using the Phantom Js Cloud API. Phantom Js Cloud is a commercial cloud-based service which renders screenshots of web pages, producing an image of what a web page looks like in a desktop web browser. Once all randomly selected URLs were archived using the processes outlined above, the tweet and its associated metadata were stored for later analysis. Archiving URLs at the time a tweet was posted allowed for gathering a baseline of what the URL looked like at the time the tweet was produced. Weekly archiving of selected URLs continued for a period of two months after real-time data collection ended in order to track changes in linked content over time. This is illustrated by the section above the blue “two week” bar in Figures 4.1 and 4.2.

In summary, the data collection procedures during real-time data collection were as follows:

1. A script submitted the query keywords, described in Appendix B, to the Twitter Streaming API for each case study and maintained an open connection.
2. As tweets matching the query terms were received via the Twitter Streaming API, each tweet was stored in a text file for later analysis.
3. The metadata of each tweet was examined for URLs (entities.urls) and embedded media (entities.media).
4. Upon encountering a URL, the script randomly selected the tweet and its accompanying URLs for archiving. URLs and media within tweets selected for archiving were submitted to the Heritrix crawler for immediate archiving, and a rendering of the URL in a web browser was also preserved via the Phantom Js Cloud service.
5. URLs selected for archiving were re-archived on a weekly basis for a period of two months.

4.4.2 Nightly Availability

From the start of real-time data collection to 90 days after each tweet was collected, the Twitter REST API was queried for the status and current version of each tweet collected.
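A minimal Python sketch of one night's availability pass is given here for illustration. It is not the script used in this study: `lookup` is a hypothetical stand-in for a batched lookup-by-ID request, the batch size of 100 follows the per-request maximum mentioned earlier in this chapter, and comparing only the `text` field is a simplification of comparing a tweet's full metadata. The sample tweets are invented.

```python
def check_availability(baseline, lookup, batch_size=100):
    """Classify each collected tweet as unchanged, changed, or unavailable.

    baseline: dict mapping tweet ID -> text captured in real time.
    lookup:   hypothetical callable taking a list of IDs and returning a dict
              of ID -> current tweet dict; IDs that are no longer accessible
              are simply absent from the result.
    """
    ids = list(baseline)
    statuses = {}
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        current = lookup(batch)
        for tweet_id in batch:
            tweet = current.get(tweet_id)
            if tweet is None:
                statuses[tweet_id] = "unavailable"
            elif tweet.get("text") != baseline[tweet_id]:
                statuses[tweet_id] = "changed"
            else:
                statuses[tweet_id] = "unchanged"
    return statuses


if __name__ == "__main__":
    # Invented sample data, used only to exercise the function.
    baseline = {"1": "road closed on I-5", "2": "lip sync tonight", "3": "traffic alert"}
    live = {"1": {"text": "road closed on I-5"}, "2": {"text": "lip sync tomorrow"}}
    print(check_availability(baseline, lambda ids: {i: live[i] for i in ids if i in live}))
```

Note that an absent ID only shows that a tweet is inaccessible; distinguishing a deleted tweet from a protected or deleted account would need additional information from the API response.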
Each tweet and its associated metadata were stored for later analysis.

4.4.3 Semi-Real-Time Collection

At the conclusion of two weeks of real-time data collection, tweets were again collected for each case study using the Twitter REST API. A script connected to the Search API and polled (repeatedly asked) for all tweets with the keywords/accounts from the two-week time period of real-time data collection. Each tweet and its associated metadata were stored for later analysis. This is illustrated by the data collection steps between the blue “two week” and yellow “two month” bars (semi-real-time) in Figure 4.2.

4.4.4 Summary of Data Collection

At the conclusion of data collection, multiple datasets for the Departments of Transportation and RuPaul’s Drag Race case studies had been collected, including:

• Tweets and metadata collected in real-time using the Twitter Streaming API for a period of two weeks (14 days).
• Tweets and metadata collected in semi-real-time using the Twitter Search API two weeks after real-time collection.
• Tweets collected nightly by requesting each tweet by its unique ID via the Twitter REST API for 90 days after real-time data collection.
• Randomly selected URLs archived in real-time and on a weekly basis for two months.

For the Occupy Wall Street case study, preexisting data from an earlier study was used:

• Tweets and metadata collected in real-time from the Twitter Streaming API for a period of 12 days.
• Historical purchase of tweets from GNIP three years after real-time data collection (June 2014).

Together these datasets allow for the examination of the reliability and authenticity of the social media datasets contained within the three case studies.

4.5 CASE STUDY AND DATA DESCRIPTION

The data used in this dissertation is derived from three Twitter-based case studies.
A case study approach was chosen to closely replicate the bounded, event/phenomena focus of the social science approach to research, allowing the findings to apply to a wider range of research designs from a social science point of view. Case studies also work well when a “how” question is being asked about a set of contemporary events (Yin, 2014), which in my case included a recent world-wide social movement, interactions between the public and West Coast Departments of Transportation, and a reality TV show. This approach also preserves the connection between the phenomenon and its context (Yin, 2014) while retaining the capacity to address a case study’s complexity (Simons, 2006).

Each case study was chosen because it represents prototypical features of the types of data collection scenarios researchers experience. Examples of these dimensions include time scale (short to long), population bounding (tight to loose), level of political contention (highly contentious to the everyday political context), and inclusion of links to media such as images and videos (high to low level of media and linking). The range of case studies provides for a triangulation of different contexts (social movements, daily interactions with government, and a reality TV show) to examine the ephemerality of social media data within, between, and across cases. Different levels of case study analysis provide different types of insights. Within-case analysis informs researchers in situations where their case studies/data match one or more of the prototypical dimensions of one of my case studies. Between-case analysis provides information about the impact of different prototypical dimensions on the ephemerality of social media data sets. Across-case analysis allowed me to generalize across Twitter and to other social media platforms.
For example, due to its nature as a social movement and the use of deletion as a protest tactic (Neumayer & Stald, 2014), the Occupy Wall Street case may exhibit a different level of ephemerality than the other cases. As a result, the Occupy Wall Street case acts as a model for researchers working with politically contentious social media datasets, but may not be a good model for researchers working with non-political data. The other case studies counter this. For example, the inclusion of popular culture and entertainment through the Drag Race case study acts as a model for a variety of events/phenomena social science researchers encounter when working with social media data. Combined, the three case studies allow for a more general understanding of ephemerality across Twitter and other social media platforms.

A description of each case study, as well as the associated data collection procedures and proposed analyses, is given below. The query terms for each case study are listed in Appendix B.

4.5.1 Occupy Wall Street - Topic Based Dataset

On June 2nd, 2011, Adbusters proposed a peaceful demonstration, “Occupy Wall Street”, to take place on September 17th to demand a separation of money from politics. Over the course of the next three months, face-to-face working groups met in NYC to create a General Assembly to coordinate and organize action. Around the same time, in August 2011, a Tumblr site called “We are the 99%” was created in which individuals were able to upload their personal narratives as they related to the actions and message of Occupy Wall Street. The Tumblr site provided an opportunity for geographically distant supporters to participate and connect with the localized efforts in NYC. On September 17th, roughly 1,000 protesters marched on Wall Street and set up a camp in Zuccotti Park, “occupying” the space to demonstrate their dissent.
Over the course of a few weeks, the Occupy Wall Street demonstrations grew into the global Occupy movement, as camps were set up around the world, including in the United States, United Kingdom, Japan, Italy, Canada, and Mexico, in solidarity with the philosophy of the movement. This digitally enabled action network closely mirrored the indignados movement of Spain, in that established political organizations so common to traditional social movements, such as unions and political parties, were replaced by technology platforms and applications (Bennett & Segerberg, 2012). Facebook pages for Occupy city camps sprang up in early October, accumulating tens of thousands of likes in major cities such as Philadelphia and Chicago (Caren & Gaby, 2012). Twitter handles and hashtags such as #occupywallst, #ows, and #occupy emerged to facilitate the coordination and exchange of relevant information. A battery of local city-focused websites also grew in conjunction with umbrella websites that organized cross-city coordination and information.

As the movement grew and matured, new tools and platforms were incorporated into the local and national levels of the Occupy information ecosystem. Livestream, a tool for broadcasting live events to the web, provided an opportunity for those not on the ground to bear witness and connect to the movement by watching real-time events such as protest marches, General Assembly meetings, speeches, arrests, and evictions. Protestors used photo sharing tools such as Twitpic and yfrog (both now defunct) [17] to share pictures and document events unfolding on the ground. While on-the-ground efforts and participation were critical to sustaining and growing the movement, digital tools and technologies in these networked organizations acted as communication infrastructure, providing channels to share information, organize events, coordinate activity, and connect participants and camps to one another (Agarwal et al., 2014).
[17] See https://blog.twitpic.com/2014/10/twitpics-future/, archived at https://perma.cc/WR7V-QB3V; https://en.wikipedia.org/wiki/Yfrog.

The real-time #OWS Twitter case consists of 64,298,104 tweets collected from October 19, 2011 - June 9, 2011. Tweets were collected using Twitter’s Streaming API, which returns tweets matching any of the search keywords occurring in the text, hashtags, @mentions, or URLs within a tweet. A panel of faculty and graduate students curated a list of popular hashtags, keywords, and Occupy city accounts related to the Occupy movement. The resulting data stream was examined at regular intervals for emerging terms. New terms were added to the keyword list after being reviewed by the entire research team, resulting in a dynamic archive based on a list of 355 keywords as data collection continued through the summer of 2012.

A companion historical #OWS Twitter dataset collected from GNIP consists of tweets from October 17, 2011 - October 31, 2011. Clemson University collected this dataset using the GNIP PowerTrack Historical Search API, which returns non-deleted tweets matching any of the search keywords occurring in the text, hashtags, @mentions, or URLs within a tweet. The initial list of 205 keyword terms [18] from the contemporaneous data collection was used to collect this data in June of 2014.

4.5.2 West Coast Departments of Transportation - Account Based

This case focuses on Department of Transportation (DoT) accounts on the West Coast of the United States, including WA, OR, and CA. It represents an everyday form of political talk (J. Kim & Kim, 2008) about traffic and transportation issues with government (S. Zhang, 2015), less contentious than a social movement but still involving communication with government. Unlike the other two cases, this case does not consist of a set of keywords, but a list of accounts.
Tweets collected consist of tweets produced by each account and tweets from users interacting with these accounts (replies, mentions, and retweets). This case study is prototypical of research projects with a well-defined population of users, high metadata stability, and high link stability. I posit this level of account metadata and link stability since government account properties such as profile text and usernames are unlikely to change and the majority of links tweeted by these accounts point to public-facing .gov sites.

This case contains only official staffed and automated accounts used by the Departments of Transportation in Washington, Oregon, and California. Twitter accounts were gathered from public-facing websites of the Departments of Transportation and include the primary account for the state DoT, regional DoT accounts, and automated traffic bots. Account profiles were reviewed to ensure that the account was active at the time of data collection and related to the state DoT. Unofficial, county, and city (non-state) Department of Transportation accounts were excluded from the list. Accounts without any activity in the last year were also excluded from the list.

[18] A list of keywords used to collect data can be found at https://github.com/somelab/SoMeToolkit/blob/master/collection.terms, archived at https://perma.cc/7E58-PWGB.

Data was collected in real-time for two weeks from September 19, 2016 - October 2, 2016 via the Twitter Streaming API using the list of accounts and hashtags as generated above. During real-time collection, tweets with URLs and media (embedded images and video) were randomly selected for archiving as the tweet was received from the API. Embedded content for the selected tweets was archived within minutes of receipt of the tweet via the Twitter Streaming API and at regular weekly intervals for two months, for a possible total of 9 archives: 1 real-time archive plus 8 additional weekly archives.
Upon conclusion of the two-week real-time data collection period, the Twitter REST API was queried using the account and hashtag list. The Twitter REST API was queried nightly for 90 days after each tweet was collected to determine the accessibility of each tweet and any changes to its associated metadata.

4.5.3 RuPaul's Drag Race - Mixed Account/Topic Based Dataset

RuPaul’s Drag Race is a reality competition television show in which a group of 12 drag queen contestants seek the title of “America’s next drag superstar” [19] with a grand prize of $100,000. The show, currently in its 8th season, is the highest-rated television show on its parent network, Logo [20], and also airs in Australia, Canada, and the United Kingdom. A panel of regular and guest judges, led by RuPaul, critique contestants as they progress through a series of weekly challenges. Each show concludes with the top two contestants competing in a “lip sync for your legacy” event to win the weekly challenge and to select a competitor for elimination. The winner of the lip sync competition selects a competitor for elimination from the bottom two contestants as selected by RuPaul. The contestant selected for elimination “sashays away”, while the competitor not selected for elimination stays in the contest for another week.

[19] http://www.logotv.com/shows/rupauls-drag-race/cast

This case study is prototypical of research projects with a semi-defined population of users bounded by a set of Twitter accounts and hashtags, a high level of media (image and video) usage, or a focus on an entertainment context. The query terms for this case contain a combination of Twitter accounts and hashtags related to the show, including the Twitter accounts of judges, contestants, and guests appearing in the episodes during data collection. The list of judges, contestants, and guests was gathered from the show’s website, and Twitter account handles were obtained via Google and Twitter searches.
Hashtags related to the show were identified after observing a sample of tweets from the official Drag Race Twitter account, @RuPaulsDragRace. A full list of accounts, hashtags, and keywords is provided in Appendix B.

[20] http://www.etonline.com/tv/160480_for_rupauls_drag_race_mainstream_is_jumping_the_shark/

Data was collected in real-time for two weeks from September 19, 2016 - October 2, 2016 via the Twitter Streaming API using the list of accounts and hashtags as generated above. During real-time collection, tweets with URLs and media (embedded images and video) were randomly selected for archiving as the tweet was received from the API. Embedded content for the selected tweets was archived within minutes of receipt of the tweet via the Twitter Streaming API and at regular weekly intervals for two months, for a possible total of 9 archives: 1 real-time archive plus 8 additional weekly archives. Upon conclusion of the two-week real-time data collection period, the Twitter REST API was queried using the account and hashtag list. The Twitter REST API was queried nightly for 90 days after each tweet was collected to determine the accessibility of each tweet and any changes to its associated metadata.

4.6 SUMMARY OF CASE STUDY DATA COLLECTION AND ANALYSIS

Because data was collected for the Occupy Wall Street case in prior work, only a subset of analyses can be applied to that case study. Specifically, the archiving of URLs and the nightly collection of tweet availability were not part of the preexisting dataset. All data collection procedures and analyses were applied to the other two case studies. A short summary and table are below.

Table 4.1: Case Collection and Analysis Summary

Dataset | Streaming API | GNIP API | REST API | URL Archiving | Prototypical Dimensions | Duration of Analysis
1. OWS | X | X | | | multi-year time scale, contentious political context, keyword based query, high account and metadata instability, unbounded population | 3 years
2. DoT | X | X | X | X | bounded population, account based query, high metadata and account stability, high URL stability, every-day political context | 2 Months
3. Drag Race | X | X | X | X | mixed keyword and account query, entertainment context, media intensive, semi-bounded population | 2 Months

4.6.1 Analysis of Occupy Wall Street Case Study

A subset of both datasets with the same timeframe and hashtags was compared. Tweet IDs and metadata from October 19, 2011 to October 31, 2011 with the hashtags #ows and #occupy were compared between the real-time and historical datasets. Since this is a preexisting dataset, URLs were not archived during real-time data collection, so the analysis of the URLs cannot be performed in this case study.

The comparison of these Occupy Wall Street datasets provides insight into the ephemerality of social media data from a social movement over a three-year timeframe, from its (near) inception to three years later. This case study is prototypical due to the timespan between real-time and historical collection (three years) and its semi-contentious political nature, thus providing insight into the ephemerality of situations when researchers collect social media data about a historical political event multiple years after it occurred.

Tweets missing from either set were further examined using a process similar to the one developed by Petrovic et al. (2013). Missing tweets were requested via the Twitter REST API to determine if the tweet was deleted, modified, or still available, or if the user account was deleted. Since URLs were not captured in real-time during data collection, it was not possible to track the change of URLs over time. It would be possible to do an automated analysis of HTTP codes or code a random sample to get the current state of URLs, but this would provide little insight into the change over time. Also, the protest nature of this dataset may lead to different patterns of ephemerality than other, less political datasets.
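The first step of this comparison, finding which tweets appear in only one of the two collections, amounts to set operations on tweet IDs. The Python sketch below illustrates the idea under that assumption; the function names and the toy IDs are hypothetical and are not taken from the study's own scripts.

```python
def compare_datasets(realtime_ids, historical_ids):
    """Partition tweet IDs by which collection(s) they appear in."""
    realtime, historical = set(realtime_ids), set(historical_ids)
    return {
        "in_both": realtime & historical,
        "realtime_only": realtime - historical,    # candidates for later deletion or removal
        "historical_only": historical - realtime,  # candidates for tweets missed in real time
    }


def retention_rate(realtime_ids, historical_ids):
    """Share of real-time tweets still present in the historical collection."""
    realtime = set(realtime_ids)
    if not realtime:
        return None  # undefined for an empty real-time collection
    return len(realtime & set(historical_ids)) / len(realtime)


if __name__ == "__main__":
    realtime = ["101", "102", "103", "104"]  # toy IDs for illustration only
    historical = ["101", "103", "105"]
    print(compare_datasets(realtime, historical))
    print(retention_rate(realtime, historical))
```

IDs in the `realtime_only` and `historical_only` groups are the ones that would then be requested individually to determine why they are missing.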
4.6.2 Analysis of DoT and Drag Race Case Studies

Since the DoT and Drag Race cases were specifically collected for this dissertation, all of the data collection and analysis procedures outlined in the previous chapter have been conducted on these case studies. This includes a comparison of the real-time (Streaming API), semi-real-time (Search API), and nightly availability check (REST) datasets within each case study.

4.6.3 Case Descriptive Statistics

Data for each case study was collected in real-time for a period of 12 to 14 days via the Twitter Streaming API using a set of query terms including keywords and/or accounts, as described in the previous section. The set of query terms for each case study is listed in Appendix A. Figure 4.3 shows a graph of the number of tweets collected each day during real-time data collection. Due to the high volume of tweets collected in the OWS case study, the number of tweets exceeded the rate limits of the Twitter Streaming API. When rate-limiting occurs, the API reports an estimated number of rate-limited tweets since the connection to the API was opened. This number is reset upon reconnecting to the API. The estimated number of rate-limited tweets reported by the Twitter Streaming API for this case was 25,052. As seen in Figure 4.3, no tweets were collected on October 29, 2011 due to API connection issues.

Figure 4.3: Daily tweet volume collected for each case study during real-time collection. Timestamps are in UTC.

Table 4.2: Summary of Case Study Descriptive Statistics

Case | Start Date | Duration (days) | Unique Users | Tweets | Retweets | Replies
Occupy Wall Street | 10/19/11 | 12 | 536,912 | 2,310,038 | 972,208 (42.1%) | 142,452 (6.2%)
Departments of Transportation | 9/19/16 | 14 | 3,464 | 13,330 | 5,567 (41.8%) | 2,032 (15.2%)
Drag Race | 9/19/16 | 14 | 106,602 | 356,147 | 179,346 (50.3%) | 38,477 (10.8%)

Table 4.2 displays basic descriptive statistics for each case study. The graph in Figure 4.3 shows the daily tweet volume collected for each of the case studies.
The Occupy Wall Street case contains the most tweets because the collection period took place as the movement was ramping up, during a period of high new media coverage. The Departments of Transportation case contains the fewest tweets of the three cases since the majority of the accounts in the dataset were informational accounts automatically tweeting information related to traffic conditions. The lower volume of tweets is also due to the query parameters used: tweets in the case are either from the set of accounts mentioned in Appendix B or from an account retweeting or replying to one of those accounts.

Table 4.3: Proportion of tweets with entities: hashtags, URLs, and mentions in each case study.

Case | With hashtags | With URLs | With mentions
Occupy Wall Street | 65.1% | 52.5% | 65.4%
Departments of Transportation | 24.7% | 27.3% | 65.1%
Drag Race | 35.4% | 21.6% | 78.5%

Table 4.3 lists the proportion of tweets containing hashtags, URLs, and mentions of other users for each case study. The RuPaul’s Drag Race case study contains the highest proportion of tweets with mentions (78.5%), as Twitter users watching the show mentioned the Twitter handles of judges and contestants in their tweets. The OWS (65.4%) and Departments of Transportation (65.1%) case studies contain a similar percentage of mentions. The percentages of hashtags and mentions are highly impacted by the construction of case keywords. For example, the lower proportion of tweets with hashtags in the Departments of Transportation case study (24.7%) may be due to the fact that the query terms for the case include only accounts and not keywords or hashtags. The high proportion of tweets with URLs in the Occupy Wall Street case study (52.5%) may be due to the nature of a social movement, as protesters tweet out different types of informational resources in response to external events (Bennett et al., 2014; Segerberg & Bennett, 2011).
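As an illustration of how the proportions in Table 4.3 could be computed, the following is a minimal sketch rather than the script used in this study. It assumes tweets are available as parsed JSON objects with the `entities` field (`hashtags`, `urls`, `user_mentions`) returned by the Twitter API; the function name `entity_proportions` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Mapping

# Entity lists found under the "entities" key of a tweet object.
ENTITY_KEYS = ("hashtags", "urls", "user_mentions")


def entity_proportions(tweets: Iterable[Mapping]) -> Dict[str, float]:
    """Return the percentage of tweets that carry each entity type."""
    counts = {key: 0 for key in ENTITY_KEYS}
    total = 0
    for tweet in tweets:
        total += 1
        entities = tweet.get("entities", {})
        for key in ENTITY_KEYS:
            if entities.get(key):  # a non-empty list means the entity is present
                counts[key] += 1
    if total == 0:
        return {key: 0.0 for key in ENTITY_KEYS}
    return {key: round(100.0 * counts[key] / total, 1) for key in ENTITY_KEYS}


# Hypothetical sample of four tweets (not data from the case studies).
sample = [
    {"entities": {"hashtags": [{"text": "ows"}], "urls": [], "user_mentions": []}},
    {"entities": {"hashtags": [], "urls": [{"expanded_url": "http://example.org"}],
                  "user_mentions": [{"screen_name": "a"}]}},
    {"entities": {"hashtags": [{"text": "occupy"}], "urls": [],
                  "user_mentions": [{"screen_name": "b"}]}},
    {"entities": {"hashtags": [], "urls": [], "user_mentions": []}},
]
proportions = entity_proportions(sample)
```

A tweet is counted once per entity type regardless of how many hashtags, URLs, or mentions it contains, which matches the "proportion of tweets with" framing of the table.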
Table 4.4: Summary of Case Study Descriptive Statistics - User Statistics

Case | Mean tweets/user (SD) | Max tweets/user | Min tweets/user
Occupy Wall Street | 4.3 (28.3) | 7,364 | 1
Departments of Transportation | 3.8 (32.5) | 1,214 | 1
Drag Race | 3.4 (27) | 8,250 | 1

Table 4.4 lists descriptive statistics related to user accounts in each of the case studies. Mean tweets per user ranges from 3.4 to 4.3 across the case studies, with the majority of users having only one tweet. As shown by the high standard deviation and the high maximum tweets per user, users central to each case often posted many tweets, while the majority of users entered the dataset with only 1 tweet.

Table 4.5: Comparison of data collected in real-time (Twitter Streaming API) and semi-real-time (Twitter REST API).

Case | Streaming | Search | Overlap | Difference
Departments of Transportation | 13,330 | 6,457 | 5,490 (41.2%) | 7,840 (58.9%)
Drag Race | 356,147 | 179,131 | 163,220 (45.8%) | 192,927 (54.2%)

Figure 4.4: Visualization of overlap between data collected real-time (Twitter Streaming API) and semi-real-time (Twitter REST API).

The Twitter Streaming API only supports real-time data collection, so any delay in data collection forces researchers to use other data collection methods such as scraping the Twitter website, purchasing data, or using the Twitter REST API. The Twitter REST API supports the collection of tweets via a search interface [21] against a sampling of tweets posted in the last 7 days. The documentation states that the Search API focuses on “relevance and not completeness”, noting that some tweets and users may be missing from Search API results. The documentation points developers and researchers to the Streaming API or GNIP for more complete datasets. The lack of “completeness” and the short result window (7 days) make the Search API problematic for research data collection (Driscoll & Walker, 2014; González-Bailón et al., 2014).
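The overlap and difference columns of Table 4.5 amount to set operations on tweet ids. The sketch below is illustrative only and is not the code used in this study; the function name `compare_collections` and the toy ids are hypothetical, and percentages are expressed relative to the Streaming API dataset, as in the table.

```python
from typing import Dict, Iterable


def compare_collections(streaming_ids: Iterable[str],
                        search_ids: Iterable[str]) -> Dict[str, float]:
    """Match tweet ids between a Streaming API and a Search API dataset."""
    streaming = set(streaming_ids)
    search = set(search_ids)
    overlap = streaming & search   # ids present in both datasets
    missing = streaming - search   # ids collected in real-time but absent from search
    n = len(streaming)
    return {
        "streaming": n,
        "search": len(search),
        "overlap": len(overlap),
        "difference": len(missing),
        "overlap_pct": round(100.0 * len(overlap) / n, 1) if n else 0.0,
        "difference_pct": round(100.0 * len(missing) / n, 1) if n else 0.0,
    }


# Toy ids for illustration; "6" appears only in the search results.
summary = compare_collections(["1", "2", "3", "4", "5"], ["2", "4", "6"])
```

Note that the Search column can exceed the Overlap column, because search results may include tweets that were never delivered by the Streaming API.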
Figure 4.4 and Table 4.5 compare tweets collected in real-time via the Twitter Streaming API and in semi-real-time via the Twitter REST API for the Departments of Transportation and RuPaul’s Drag Race case studies. The same query terms were used for the Streaming API and Search API, as listed in Appendix A. Search API data was collected for each case study after real-time data collection had concluded, or 15 days after data collection began. Tweet ids were matched between the Streaming API and Search API datasets, as listed in the overlap column in Table 4.5. The difference column lists the number of tweets missing from the Search API dataset. For both case studies, approximately 40% - 45% of tweets were accessible via both the Streaming and Search APIs.

[21] See https://dev.twitter.com/rest/public/search for more information about the Twitter Search API.

4.7 CHAPTER SUMMARY

In this chapter I introduced the concept of ephemerality, or the unstable nature, of social media data over time. This concept is important for researchers working with social media data since any latency, or delay, in data collection may lead to changes in the resulting dataset. These changes may be caused by posts becoming inaccessible, changes to a user’s profile, or changes to linked content embedded in posts and user profiles. Depending on the scope of the changes and their interaction with the research design, the content of the social media dataset may no longer accurately reflect the phenomena under investigation. This chapter also described my research design, data collection procedures, and the case studies under investigation. Each case study was chosen because it represents prototypical features of the types of data collection scenarios researchers experience.
Examples of these dimensions include time-scale (short to long), population bounding (tight to loose), level of political contention (highly contentious to the everyday political context), and inclusion of links to media such as images and videos (high to low level of media and linking). The range of case studies provides a triangulation of different contexts (social movements, daily interactions with government, and a reality TV show) to examine the ephemerality of social media data within, between, and across cases. Descriptive statistics were provided for each case study. Finally, the overlap between real-time (Streaming API) and semi-real-time (Search API) collection was calculated for the Departments of Transportation and RuPaul’s Drag Race case studies. For both case studies, approximately 40% - 45% of tweets were accessible via both the Streaming and Search APIs, showing that the Search API is a poor choice for data collection when a research design requires the collection of all tweets related to a set of keywords.

Chapter 5. RELIABILITY

The first concept closely connected with ephemerality is reliability. From a statistical standpoint, reliability is concerned with consistency in obtaining the same measurement or finding upon repeated measurements under similar conditions with a research instrument. Within the context of social media, high reliability would imply capturing a social media dataset in such a way that the number of posts in the dataset is stable and each post, after collection, is the same post each time it is accessed. In other words, a social media dataset that is observed with a particular measurement device is impervious to change: deletion, modification, or change in privacy settings (public/private).
I measure the reliability of a social media dataset through the level of change in tweet identification numbers (appearance/disappearance) in the dataset over time when the same parameters are used to collect the data at different points in time (real-time, semi-real-time, and historical). To measure the level of change, I compare the unique tweet ids, identifiers which uniquely identify the individual tweets over time, between these datasets.

5.1 OPERATIONALIZING RELIABILITY

To examine the reliability of the social media datasets in each case study, I tracked the availability or unavailability of each individual tweet throughout the three periods of data collection (see Chapter 4). Since Twitter does not offer users the ability to modify a posted tweet, this analysis focuses only on the accessibility or inaccessibility of a tweet. This approach is similar to the one Driscoll and Walker (2014) employed to compare tweets collected via different real-time APIs. The availability of each tweet was determined by matching unique tweet ids from the real-time and nightly datasets. Tweets found in the real-time dataset but not in the nightly datasets were further analyzed to determine the cause of their inaccessibility. The Twitter API was queried to determine if the individual tweet was deleted, the account was deleted, or the account was made private. This is similar to the method used by Petrović, Osborne, and Lavrenko (2013). If the inaccessible tweet was a retweet, the original tweet was also queried via the Twitter REST API to determine if the inaccessibility was the result of a deletion cascade.
In a deletion cascade, the deletion of a retweet in another user’s timeline is not related to the timeline owner’s actions, but is a propagation of the deletion of a tweet, an account deletion/suspension, or a privacy change by the producer of the original tweet. Without classification, a cascade therefore magnifies the impact of a single deletion by the number of times the tweet was retweeted. This analysis, along with basic descriptive statistics, provides a picture of the reliability within each case study as the availability of each tweet collected in real-time is monitored for a period of 90 days after collection. This approach shows the data a researcher would be able to collect as the latency of data collection increases from 1 to 90 days.

5.2 RELIABILITY ANALYSIS OF EACH CASE STUDY

The accessibility of each tweet collected in real-time was checked each day for a period of 90 days after collection. Each night a script requested every tweet collected during the real-time data collection period for the Departments of Transportation and RuPaul’s Drag Race case studies from the Twitter REST API via its unique id. After the 90-day data collection period, the data was processed as follows:

• Since tweets were collected in real-time over a two-week period, time points were calculated for each tweet. The date of collection was represented as time point t0, the 1st day as time point t1, and the 90th day as time point t90. This transformation allows the accessibility of each tweet to be compared while taking the two-week collection window into account.
• For each time point (t1 - t90), the accessibility of each tweet was determined by noting whether each individual tweet id appeared in the nightly data corresponding to the time point for that tweet.
• Any gaps in the accessibility of a tweet were backfilled and the missing time point was assumed to be due to an error in the data returned from the Twitter API.
For example, if a tweet was accessible at time points t9 and t12 but not at time points t10 and t11, time points t10 and t11 were re-coded as accessible. Backfilling missing time points masks changes users made to their privacy settings, such as moving their account from protected to not protected. More granular data on the reason for the inaccessibility of each tweet would be required in order to detect daily changes in account protection settings.

Figure 5.1: Tweets inaccessible per time period during the 90-day observation period for the Departments of Transportation and RuPaul’s Drag Race case studies.

The proportions of tweets available 90 days after real-time collection for the Departments of Transportation and RuPaul’s Drag Race case studies are shown in Table 5.1. After 90 days, 6.7% of tweets collected in the Departments of Transportation case study were unavailable. For the RuPaul’s Drag Race case study, 10.7% of tweets were inaccessible after 90 days.

Table 5.1: Proportion Tweets Accessible After Time Periods Under Investigation.

Case | Time Period | Total Tweets Collected in Real-Time | Tweets Accessible | Tweets Inaccessible
Occupy Wall Street | 3 years | 2,310,038 | 2,029,074 | 280,964 (12.7%)
Departments of Transportation | 90 days | 13,330 | 12,448 (93.4%) | 882 (6.7%)
Drag Race | 90 days | 356,147 | 276,204 (89.3%) | 38,943 (10.7%)

For the Occupy Wall Street case study, a different process was used to assess the reliability of the dataset. Tweets were collected at only two points in time: (1) in real-time and (2) purchased from GNIP three years later in June 2014, requiring a different approach for comparison. The same process was used as noted above, but only two time points were compared (see Chapter 4 for the data collection methods). After three years, 280,964 (12.7%) tweets were no longer accessible. Table 5.1 confirms the expected ranking of cases with respect to the proportion of inaccessible tweets when considering the prototypical features of each case.
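The time-point alignment and backfilling steps listed in the processing steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in this study: the function names `time_point` and `backfill` are hypothetical, and each tweet's nightly checks are assumed to be stored as a list of booleans ordered by time point.

```python
from datetime import date
from typing import List


def time_point(collected_on: date, checked_on: date) -> int:
    """Days elapsed between real-time collection (t0) and a nightly check."""
    return (checked_on - collected_on).days


def backfill(accessible: List[bool]) -> List[bool]:
    """Re-code gaps as accessible when a later check found the tweet again.

    Assumption: any miss that precedes a later successful check is treated
    as an API error, so every point up to the last successful check is
    coded as accessible.
    """
    last_seen = max((i for i, ok in enumerate(accessible) if ok), default=-1)
    return [i <= last_seen for i in range(len(accessible))]


# A tweet collected on the first day of the 2016 collection window.
t_first = time_point(date(2016, 9, 19), date(2016, 9, 20))
t_last = time_point(date(2016, 9, 19), date(2016, 12, 18))

# Checks at t9..t13: accessible at t9 and t12, missing at t10, t11, and t13.
filled = backfill([True, False, False, True, False])
```

Because every gap before the last successful check is overwritten, an account that was protected and then unprotected again within the window is indistinguishable from an API error, which is the masking effect noted in this section.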
When constructing the prototypical features of the case studies, it was expected that the Occupy Wall Street case study would exhibit the highest level of inaccessible tweets (12.7%) due to the highly contentious nature of the social movement, the solely keyword-based query terms used to collect data (no accounts were followed), and the three-year timeframe. RuPaul’s Drag Race was expected to have the second-highest level of inaccessible tweets (10.7%) due to its reality TV and entertainment context. The Departments of Transportation case study was expected to have the lowest proportion of inaccessible tweets (6.7%) due to the “everyday political context” and its focus on tweets from or replying to official government accounts.

5.3 MECHANISMS OF INACCESSIBILITY

Tweets become inaccessible due to a variety of mechanisms [22] related to the tweet itself as well as the tweets and accounts a tweet is related to. Each tweet inherits the accessibility properties of the Twitter account that created the tweet. As illustrated in Figure 5.2, when a user deletes their account, this action cascades, deleting all tweets contained in the account. This same process takes place with retweets, but a retweet is also impacted by the availability of the original tweet that was retweeted.

[22] See https://support.twitter.com/articles/18906?lang=en for a description of what happens when a tweet is deleted. Link archived at https://perma.cc/38XY-RQFR.

Figure 5.2: Illustration of how a tweet inherits the accessibility properties of the tweets it is related to. In the example shown, a retweet is deleted because the account that produced the original tweet was deleted. Stages shown: Account deleted, Deletes original tweet, Deletes retweet.

Tweets become inaccessible through actions of the user/account that produced the tweet. The following actions impact the accessibility of a tweet:

1. Tweet deletion. A user deletes a tweet from their timeline.
2. User account deletion. A user deletes their account, thereby removing all their tweets.
3. Account set to protected. Tweets in an account are set to ‘protected’, thereby effectively making all tweets by that user inaccessible.

If the tweet is a retweet, the retweet also inherits the accessibility properties of the tweet that was retweeted. The following actions impact the accessibility of a retweet:

1. Deletion of an original tweet. If the tweet is a retweet and the retweeted (original) tweet is deleted, the deletion of the original tweet cascades to all retweets of the original tweet. This is illustrated in Figure 5.2, starting at the middle, “deletes original tweet” stage.
2. Deletion of the account of the user who created the original tweet. A user whose tweet was retweeted deletes their account, thereby removing all of the tweets in their account and triggering the previous step (#1).
3. Protection of the account of the user who created the original tweet. A user whose tweet was retweeted protects their account, thereby hiding all of the tweets in their account and triggering the previous step (#1). The difference between protection and deletion of an account is that all tweets become publicly accessible again if the protection setting is turned off.

In order to determine the reason a tweet became unavailable, a list of tweets inaccessible 90 days after collection was generated for the Departments of Transportation and RuPaul’s Drag Race case studies. It was not possible to complete this analysis on the Occupy Wall Street case study since nightly data was not collected for that case study. Further data was collected on each inaccessible tweet by querying the Twitter REST API to determine the accessibility of the account that produced the tweet and its public or protected status. If the tweet was a retweet, the status of the original tweet and of the account that produced the original tweet was also queried.
Using the 6 actions (deleted tweet, deleted user, protected user, deleted retweet, deleted retweet user, and protected retweet user) that can cause a tweet to become inaccessible, each inaccessible tweet was analyzed using the data retrieved from the Twitter REST API. In cases where there could be multiple causes for a tweet to be inaccessible, the reason for inaccessibility was attributed to the highest-level cause. For example, if a tweet was deleted and the account was also deleted, then the cause would be attributed to the deletion of the account. If the tweet was a retweet, and the account of the user who retweeted the tweet was deleted and the original tweet was also deleted, the cause of the inaccessibility of the tweet is attributed to the deletion of the original tweet. This process will not detect instances in which a retweet was deleted before the original tweet was deleted during the 90-day window of data collection.

Table 5.2: Reason for tweet inaccessibility - Departments of Transportation and RuPaul’s Drag Race

Case | Total Inaccessible Tweets | Deleted Tweet | Deleted User | Protected User | Deleted Retweet | Deleted Retweet User | Protected Retweet User
Departments of Transportation | 882 | 475 (53.8%) | 326 (37%) | 21 (2.4%) | 26 (2.9%) | 33 (3.7%) | 1 (0.1%)
Drag Race | 38,943 | 20,811 (53.4%) | 10,536 (27%) | 3,816 (9.8%) | 2,280 (5.9%) | 980 (2.5%) | 520 (1.3%)

In both cases, the main cause (53%) of tweet inaccessibility is the deletion of the tweet itself. This indicates that the majority of inaccessible tweets were deleted by users themselves. The second highest cause of tweet inaccessibility is the deletion of the user’s account. It is important to note that in both cases, the deleted accounts represented transient users who either mentioned the accounts included in each case study or, in the case of RuPaul’s Drag Race, used the keywords contained in the query parameters. None of the accounts contained in the query parameters were deleted during data collection.
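The attribution rule described above for Table 5.2 can be sketched as an ordered series of checks. This is a hedged illustration, not the classification code used in this study. The text fixes only two orderings (account deletion outranks tweet deletion, and deletion of the original tweet outranks deletion of the retweeting account); the full ordering below, the `Status` record, and the function name `attribute_cause` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Status:
    """Hypothetical summary of what follow-up REST API queries returned."""
    tweet_deleted: bool = False
    user_deleted: bool = False
    user_protected: bool = False
    is_retweet: bool = False
    original_deleted: bool = False
    original_user_deleted: bool = False
    original_user_protected: bool = False


def attribute_cause(s: Status) -> Optional[str]:
    """Return a single cause label, checking higher-level causes first."""
    if s.is_retweet:
        # Causes tied to the original (retweeted) tweet take precedence.
        if s.original_user_deleted:
            return "deleted retweet user"
        if s.original_user_protected:
            return "protected retweet user"
        if s.original_deleted:
            return "deleted retweet"
    # Account-level causes outrank deletion of the individual tweet.
    if s.user_deleted:
        return "deleted user"
    if s.user_protected:
        return "protected user"
    if s.tweet_deleted:
        return "deleted tweet"
    return None  # still accessible, or cause unknown


cause_a = attribute_cause(Status(tweet_deleted=True, user_deleted=True))
cause_b = attribute_cause(
    Status(is_retweet=True, user_deleted=True, original_deleted=True)
)
```

Assigning exactly one label per tweet keeps the columns of the table mutually exclusive, at the cost of hiding lower-level causes that co-occur with a higher-level one.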
Table 5.3: Tweet inaccessibility categorized by changes to user account vs. a retweeted account.

Case | Total Inaccessible Tweets | Inaccessibility due to changes in user account | Inaccessibility due to changes to a retweeted account
Departments of Transportation | 882 | 823 (93.3%) | 59 (6.7%)
Drag Race | 38,943 | 35,163 (90.3%) | 3,780 (9.7%)

It is helpful to think of the root cause of a tweet’s inaccessibility: is it due to changes made by a user to their own account or to changes to other accounts? This collapses the reasons for inaccessibility into two categories: (1) changes to the user account (deleted tweet, deleted user, and protected user) and (2) changes to the account related to an original tweet (deleted retweet, deleted retweet user, and protected retweet user). The 6 categories from Table 5.2 have been collapsed into these two categories in Table 5.3. In both cases, the majority (over 90%) of tweets were inaccessible due to changes to the account of the user who posted the tweet.

5.4 CHAPTER SUMMARY

In this chapter I introduced the first concept closely connected with ephemerality: reliability. From a statistical standpoint, reliability is concerned with consistency in obtaining the same measurement or finding upon repeated measurements under similar conditions with a research instrument. Within the context of social media, high reliability would imply capturing a social media dataset in such a way that the number of posts in the dataset is stable and each post, after collection, is the same post each time it is accessed. In other words, a social media dataset that is observed with a particular measurement device is impervious to change: deletion, modification, or change in privacy settings (public/private).
I measured the reliability of a social media dataset through the level of change in tweet identification numbers (appearance/disappearance) in the dataset over time when the same parameters are used to collect the data at different points in time (real-time, semi-real-time, and historical), comparing the unique tweet ids between these datasets. At the end of the time period under observation, 6.7% - 10.7% of tweets were no longer accessible. The cases followed the expected ranking with respect to the percentage of tweets that were inaccessible: 1) Occupy Wall Street, 2) RuPaul’s Drag Race, and 3) Departments of Transportation. Two of the case studies, Departments of Transportation and RuPaul’s Drag Race, were examined to determine the mechanism that caused each tweet to become inaccessible. In both cases, the main cause (53%) of tweet inaccessibility is the deletion of the tweet itself. This indicates that the majority of inaccessible tweets were deleted by users themselves.

Chapter 6. AUTHENTICITY

The second concept closely connected with ephemerality is authenticity. Authenticity of a dataset is “the extent to which it accurately (precision) and faithfully (fidelity) represents what it is meant to. Establishing and documenting data quality and veracity is a key aspect of data lineage” (Kitchin, 2014, p. 153). Similarly, from an archival point of view, it is the authority and trustworthiness of a record as proof and memory of the activity of which it constitutes a natural byproduct (as social media trace data is a “natural” byproduct of a post). In other words, a “record that can stand for the facts it is about” (Duranti, 1995). This is “linked to a record’s state, mode and form of transmission, and to the manner of its preservation and custody” (Duranti et al., 2013, Ch. 2). Within the context of social media, the authenticity of a post involves capturing and stabilizing the surrounding metadata so a post or digital object can stand on its own.
I measure the authenticity of a social media dataset through the level of change in metadata and linked URLs embedded in a post over time when the same parameters are used to collect the data at different points in time. To measure the level of change, I compared specific metadata elements (e.g. username, profile description, post statistics) and URLs of the same tweets collected at different points in time. Textual metadata elements such as username and profile description were compared at different collection times to quantify the extent of change. For example, changing cats to cat requires one change: the removal of the ’s’. The difference in numeric metadata elements was calculated over time. Weekly archives of URLs will be manually compared over time using a qualitative coding process. Text content of URLs will also be extracted and compared using a method similar to the one used for the textual metadata elements.

6.1 OPERATIONALIZING AUTHENTICITY

The authenticity of a social media dataset is concerned with the stability of the metadata and linked URLs surrounding each post. This analysis focuses on the stability of the metadata embedded in each tweet and the URLs randomly selected during real-time data collection. For example, when requesting a tweet via one of the Twitter APIs or viewing a tweet via the Twitter website or mobile client, the current, not historical, metadata, such as a user’s profile description, is displayed. As a result, there could be a mismatch in the metadata between a tweet collected in real-time and one displayed or requested at a later point in time.

Figure 6.1: Tweet from the US National Park Service as displayed on the Twitter website in May 2017 with metadata fields labeled.

To examine this change, I compared the tweet and user profile metadata of the real-time and nightly data from the Departments of Transportation and RuPaul’s Drag Race case studies.
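The edit-based comparison of textual metadata introduced at the start of this chapter (changing cats to cat requires one change) corresponds to the Levenshtein, or edit, distance. The following is a minimal textbook sketch of that distance, offered as an illustration rather than the implementation used in this study; the function name `levenshtein` and the sample profile descriptions are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical profile descriptions captured at two collection times.
old_description = "Official traffic updates"
new_description = "Official traffic update"
change = levenshtein(old_description, new_description)
```

A distance of 0 means the field was unchanged between two collection times, so the same function can both flag whether a field changed and quantify by how much.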
For reference, the metadata fields analyzed are labeled in the tweet in Figure 6.1. Since the screen name (Twitter handle), profile description, and location are editable by the user, these fields were compared using edit or Levenshtein distance (Levenshtein, 1966). The Levenshtein distance measures the number of insertions, deletions, and substitutions required to change one string into another. For example, a change from ‘cat’ to ‘bat’ has a Levenshtein distance of 1 since 1 edit is required to change the ‘c’ in ‘cat’ to a ‘b’. This provides a measure of the extent of change, if any, between each metadata field in the real-time and nightly datasets within the two cases. In addition, I also checked for changes in the profile picture URL and homepage URL listed on each user’s profile by comparing the values returned in each dataset. Tweet and user statistics were also compared over time, including the number of followers, following, retweets, and favorites.

It is important to note that the backfilling of missing days of data collection, as described in Section 5.2, may have an impact on the above measure of authenticity since backfilled days are treated as if there was no change in a user’s profile or metadata. Any changes occurring on backfilled days will be detected on the next non-backfilled day, thus shifting the recorded day of change by the number of backfilled days. In the case of indicators measuring the extent of change each day to the profile of a user (user description, location, and name), the extent of change may be overrepresented on days occurring after backfilled days since that day also records changes made to a user’s profile during those gaps.

To measure the stability of linked URLs, the web archives and extracted page content were compared across the weekly sampling points.
The process involved comparing the baseline URLs archived at the time of tweeting to the weekly archived pages using web archives produced by the Heritrix crawler. The extracted page content was compared using a similarity hash. This analysis provides a measure of the change in the content of URLs embedded in each tweet, thus providing empirical data on the level of change of linked content embedded in social media datasets.

6.2 AUTHENTICITY ANALYSIS OF TWEET AND USER METADATA

Authenticity of the metadata associated with each tweet was analyzed at two levels: (1) the metadata associated with the tweet, including the number of times a tweet was retweeted and favorited by other users, and (2) the metadata associated with the user posting the tweet. As mentioned in the previous section, changes in tweet metadata were tracked for a period of 90 days following its collection. Since some users posted multiple tweets in each case study, user profile information was tracked for 90 days from the first time a user tweeted in each case. Nightly user profile information was extracted from the last tweet the user posted each day. As shown in Table 4.2, 3,464 users were tracked as part of the Departments of Transportation case study and 106,602 users were tracked as part of the RuPaul’s Drag Race case study.

Table 6.1: Number and proportion of users changing profile metadata

Case | Unique Users | Screen Name | Name | Profile Image | Location | Description
Departments of Transportation | 3,464 | 41 (1.2%) | 325 (9.4%) | 731 (21.1%) | 173 (5%) | 809 (23.3%)
Drag Race | 106,602 | 5,978 (5.5%) | 34,313 (31.5%) | 57,846 (53.3%) | 13,689 (12.6%) | 45,763 (42.2%)

Figure 6.2: Distribution of mean edit distance of change to user profile metadata - user description, user name, and user location. Only users with an edit distance > 0 are displayed.

Table 6.2: Extent to which users changed profile metadata as measured by mean edit distance between changes.
Case | Description | Name | Location
Departments of Transportation | 0.99 (1.96) | 0.28 (7.28) | 13.5 (0.48)
Drag Race | 1.47 (2.57) | 0.5 (0.79) | 12.92 (6.1)

As shown in Table 6.1, the highest percentage of users in both case studies made changes to their profile image. In the RuPaul’s Drag Race case study, the second most edited piece of user metadata was the description field (42.2%), with only 23.3% of users editing their profile description in the Departments of Transportation case study. This points to the impact that the construction of each case study and its prototypical features have on the level of authenticity. The Departments of Transportation case has a lower level of user profile change due to the more stable nature of official government accounts.

User-level metrics such as the number of followers are often used as proxies of user centrality and popularity. The three user-level metrics tracked across the 90-day period were:

1. Followers Count: indicates the number of users following this user’s posts. This is often used as an indicator of a user’s potential reach, or the number of other Twitter users who have the potential to see a user’s tweets in their timelines.
2. Friends Count: indicates the number of users that this user follows. While used less often than the followers count, this is normally used as an indicator of which users a user is connected to.
3. Statuses Count: indicates the number of non-deleted tweets a user has posted. This is often used as an indicator of the level of activity of a user. Note that deleting and posting tweets will decrease or increase this number.

Figure 6.3: Distribution of mean change in user metrics per user.
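As a hedged illustration of how a per-user mean change in one of these metrics might be derived from nightly observations, consider the sketch below. It assumes that "mean change" is the average of the differences between consecutive nightly values of a metric such as `followers_count`; that definition, the function name `mean_daily_change`, and the sample values are assumptions for the example and are not taken from the study's code.

```python
from statistics import mean
from typing import Sequence


def mean_daily_change(counts: Sequence[int]) -> float:
    """Average difference between consecutive nightly observations."""
    if len(counts) < 2:
        return 0.0  # a single observation carries no change information
    differences = [later - earlier for earlier, later in zip(counts, counts[1:])]
    return float(mean(differences))


# Hypothetical nightly followers_count values for one user.
followers_by_night = [100, 102, 101, 106]
followers_change = mean_daily_change(followers_by_night)
```

Because differences can be negative, gains and losses offset each other in the mean, so a user who gains and then loses followers can show a mean change near zero.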
Table 6.3: Mean change in user-level metrics

Case | Followers Count | Friends Count | Statuses Count
Departments of Transportation | 2.8 (24.06) | 0.47 (6.53) | 36.7 (120.0)
Drag Race | 1.97 (82.75) | 0.47 (19.9) | 25.57 (117.67)

Table 6.3 lists the mean change in these user-level metrics. It is important for researchers to understand how user-level metrics change over time since each time a tweet is viewed or retrieved from the Twitter API, the user-level metrics reflect the statistics at the time of the request. For example, a user may not be very popular as measured by the number of followers at the time of posting a tweet, but if that tweet went viral (Nahon et al., 2011) after being posted, the number of followers and retweets may increase significantly due to the viral event. As a result, the updated statistics may show the user as being more popular than they actually were at the time of posting. Statuses count has the highest average change, which would be expected as users posted more tweets over time during the 90-day window of data collection and observation. For the Departments of Transportation case study, the change in statuses count ranged from -52 to 2,297. It is important to note that each of these metrics may increase or decrease as users lose followers, unfollow users, or delete tweets.

6.3 TWEET LINKED DATA

Table 6.4: Top 10 URLs by volume.

Departments of Transportation | RuPaul’s Drag Race
twitter.com: 1,635 | twitter.com: 36,289
bit.ly: 686 | dlvr.it: 8,679
tpck.us: 369 | youtu.be: 4,698
ow.ly: 197 | instagram.com: 2,847
remmont.com: 148 | vine.co: 2,588
wa.gov: 59 | youtube.com: 2,333
usa.gov: 46 | bit.ly: 2,172
ca.gov: 38 | fb.me: 1,236
goo.gl: 38 | logo.to: 1,195
youtube.com: 33 | nyti.ms: 1,195

Starting at the URL level, Table 6.4 lists the top 10 domains linked to from all tweets with URLs in the Departments of Transportation and RuPaul’s Drag Race case studies. For both case studies, other tweets (twitter.com) were the top destination.
Other social media sites such as YouTube, Instagram, Vine, and Facebook were also popular link destinations. As expected, government domains were in the top 10 for the Departments of Transportation case study. It is important to note that the top linking destinations for both case studies point to the interconnectedness of tweets with other tweets and social media sites.

Table 6.5: Descriptive statistics for archived URLs.

Case                             Contain URLs    Selected for Archiving    Successfully Archived    Average Weeks Archivable
Departments of Transportation    3,639           2,189 (60.2%)             1,984 (90.6%)            7.23
Drag Race                        76,928          48,767 (63.4%)            43,421 (89%)             7.16

Moving to the tweet level, tweets with URLs were randomly selected for archiving during real-time data collection. Table 6.5 provides descriptive statistics for the tweets with URLs that were selected. Approximately 60% of all tweets with links were selected to have their links archived. All URLs within a tweet were grouped as a single unit. Of those, approximately 90% were able to be archived. For those that were successfully archived, the content was extracted from each URL embedded in a tweet. For tweets with multiple links, the content of each link was combined into a single document. A simhash (SalahEldeen & Nelson, 2013; Sood & Loguinov, 2011), or similarity hash, was then calculated for each document. A simhash was used instead of an edit distance because many web pages will have slight changes in content, for example an automatically updated date, to which an edit distance is too sensitive. The distance between two simhashes indicates how similar two documents are to each other.23 The lower the simhash distance, the more similar the two documents. Two identical documents would have a simhash distance of 0. A simhash distance of 40 would indicate that the two documents are not very similar. The simhash distance between weekly archives will give us an idea of the extent of change each week.
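To make the idea of a simhash distance concrete, the following is a minimal sketch, not the implementation used in this dissertation. It assumes 64-bit fingerprints, word-level tokens with equal weight, and MD5 as the per-token hash; the function names `simhash` and `hamming_distance` and the sample strings are illustrative only.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide simhash fingerprint of `text`.

    Assumptions: tokens are lowercase word characters, every token has
    weight 1, and `bits` is a multiple of 8 no larger than 128.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical page content from two weekly archives of the same URL.
week1 = "Road closed on I-5 northbound. Updated 2015-03-01."
week2 = "Road closed on I-5 northbound. Updated 2015-03-08."
print(hamming_distance(simhash(week1), simhash(week2)))
```

Because every token contributes to every bit, a change to a single token (such as a date stamp) tends to flip only a few bits, whereas an edit distance would register each changed character.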
The mean simhash distance between the content in weekly archives of all URLs in tweets selected for archiving is displayed in Figure 6.4.

23 See http://matpalm.com/resemblance/simhash/ for a non-technical description of the simhash algorithm.

Figure 6.4: Distribution of mean simhash distance between the content of weekly archives of URLs within tweets selected for archiving. Content in all URLs for each tweet was grouped into one unit. Tweet URLs archived for less than two weeks are excluded.

In the Departments of Transportation case study, 90% of all the URLs within a tweet (where all URLs within a tweet were treated as a single unit) were archived for the full 8 weeks. Over 98% of URLs that were archivable in real-time were also archivable over the entire 8-week period. As shown in Figure 6.4, the majority of URLs were identical over the archivable time period. The mean simhash distance between archives was 9.8. For the RuPaul’s Drag Race case study, 89% of URLs were archivable for the entire 8-week period. As expected, the stability of URLs is higher in the Departments of Transportation case.

6.4 CHAPTER SUMMARY

This chapter introduced the second concept closely connected with ephemerality: authenticity. Within the context of social media, the authenticity of a post involves capturing and stabilizing the surrounding metadata so a post or digital object can stand on its own. The authenticity of a social media dataset was measured via the level of change in metadata and linked URLs embedded in a post over time when the same parameters are used to collect the data at different points in time. The descriptive statistics for metadata change in the RuPaul’s Drag Race and Departments of Transportation case studies support the hypothesis that the level of metadata stability would differ for each case study due to its context.
Official government accounts changed their profile information far less often, while over 50% of users in the Drag Race case study changed their profile image and over 42% of users changed their profile description. Follower, friend, and status counts had a high rate of change in each of the case studies. For research analyzing user profile information or tweet statistics, these results point to the importance of preserving the state of this metadata at the moment of data collection or taking the high rate of change into account during data analysis.

The link analysis points to the interconnected nature of social media platforms and posts through the links users embed within the post content. This is demonstrated by the top 10 domains for the Departments of Transportation and RuPaul’s Drag Race case studies. The top 10 domains included other posts within Twitter as well as links out to other platforms including Instagram, YouTube, Vine, and Facebook. Links within the Departments of Transportation case also included many .gov domains or URL shorteners offered by the US Federal Government when linking to government content. This points to the limitations of current methods focusing on a single social media platform for analysis, which does not accurately reflect how users actually use social media platforms.

Chapter 7. IMPACTS OF EPHEMERALITY

In this chapter, I summarize findings related to the impacts of ephemerality on the reliability and authenticity of the social media datasets within and across each of the three case studies. The observations collected in this dissertation are descriptive in nature. This was a necessary first step to understand the effects of ephemerality on social media datasets since we lack empirical data relating to the impact, if any, of latency on the social media posts, metadata, and linked data collected for research purposes.
The analysis is based on a combination of the descriptive statistics described in Chapters 4-6, my experience collecting and analyzing social media data, and, when possible, inferential statistics.

7.1 RELIABILITY - THE RELATIONSHIP BETWEEN TIME AND EPHEMERALITY

Figure 7.1: Simple regression of tweet accessibility at time points t0 - t90 for the Departments of Transportation and RuPaul’s Drag Race case studies.

As illustrated in Figure 7.1, latency in data collection is a significant predictor of tweet accessibility. A simple regression model was calculated to predict the number of tweets accessible in each case based on the days since a tweet was posted. A significant regression equation was found (F(1, 89) = 676, p < .001), with an R² of .884, for the RuPaul’s Drag Race case study. The predicted number of tweets available is equal to 347,300 - 5.7857 (days), where days is the number of days since a tweet was posted. The number of tweets available decreased by 5.7857 for each day since posting. A significant regression equation was also found (F(1, 89) = 651.6, p < .001), with an R² of .880, for the Departments of Transportation case study. The predicted number of tweets available is equal to 13,050 - 219.4831 (days), where days is the number of days since a tweet was posted. The number of tweets available decreased by 219.4831 for each day since posting. A regression equation was not calculated for the Occupy Wall Street case study because data was only collected for two time points (real-time and three years later).

As latency in data collection increases, the number of inaccessible posts increases. The highest number of inaccessible posts occurs within the first 48 hours after tweets were created in both the Departments of Transportation (38%) and RuPaul’s Drag Race (44%) case studies. This points to the importance of collecting social media data in real-time and indicates that datasets may contain large gaps with as little as a 24-hour latency in data collection.
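For readers less familiar with simple regression, the following is a minimal sketch of how an intercept, slope, R², and F statistic of the kind reported above can be computed from daily accessibility counts. It is an illustration under stated assumptions, not the analysis code used in this dissertation: the function names are hypothetical and the data are synthetic, invented only to exercise the code. The 91 daily observations (t0 through t90) match the F(1, 89) degrees of freedom, since a simple regression has n - 2 residual degrees of freedom.

```python
def fit_simple_regression(days, accessible):
    """Ordinary least squares fit of accessible = intercept + slope * days.

    Returns (intercept, slope, r_squared).
    """
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(accessible) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, accessible))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in accessible)
    ss_residual = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(days, accessible)
    )
    r_squared = 1 - ss_residual / ss_total
    return intercept, slope, r_squared


def f_statistic(r_squared, n):
    """F statistic for a simple regression with n observations: F(1, n - 2)."""
    return r_squared * (n - 2) / (1 - r_squared)


# Synthetic data (hypothetical): 91 daily counts with a small alternating wobble.
days = list(range(91))
accessible = [1000 - 2 * d + (3 if d % 2 else -3) for d in days]

intercept, slope, r_squared = fit_simple_regression(days, accessible)
print(intercept, slope, r_squared, f_statistic(r_squared, len(days)))
```

A negative slope in such a fit is read as the average number of tweets that become inaccessible per additional day of collection latency.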
Large gaps in social media datasets may result in a dataset that no longer represents the social media posts at the time under investigation. For example, when examining the spread of rumors after disasters, users may delete posts containing misinformation, resulting in the disappearance of rumors in datasets collected hours or days later. This is a concern for researchers as social media data collected with a high latency may no longer accurately reflect the social media posts at that time.

7.2 AUTHENTICITY: THE IMPACT OF THE PROTOTYPICAL FEATURES

As noted in Chapter 4, a case study approach was chosen to closely replicate the prototypical features of data collection scenarios social science researchers commonly use. While it is not possible to determine which prototypical features have the most impact, it is possible to determine if the differences between the three cases are statistically significant. A chi-square test of independence was calculated across the three case studies on the core measure of reliability, the number of tweets inaccessible at the end of the period of observation (described in Table 5.1), finding a significant association between case study and tweet inaccessibility, χ²(2, N = 2,667,515), p < .001.24 The significant results of the test can be interpreted to mean that the differences between cases are unlikely to be due to chance alone and may instead stem from the prototypical features of each case study (described in Table 4.1). If the differences were not significant, then the differences in the cases would not be meaningful. I posit that the following prototypical features have an impact on the level of ephemerality in each case study:

• Context. The context of each case study is related to how users within the case study utilize the platform, including the types of content posted as well as their relationship to other users.
For example, the Departments of Transportation case study has the lowest level of inaccessible tweets, and this may be due to the low level of political contention in the everyday political context: the discussion and sharing of information related to state transportation infrastructure is potentially less contentious than the other case contexts. The Occupy Wall Street case study has the highest level of inaccessible tweets. This is unsurprising due to the highly contentious nature of social movements and, as the literature has shown, deletion is used as a protest tactic (Neumayer & Stald, 2014). The reality TV context of RuPaul’s Drag Race sits between the other two case studies, resulting in the second-highest level of tweet inaccessibility. The impact of context confirms work examining rumor spread (Starbird et al., 2014) and behavior around “regret” posts (Knapp et al., 1986; Petrovic et al., 2013; Sleeper et al., 2013) on social media sites.

24 With an N of over 2 million, it would be expected to find significance.

• Query Terms. While each case study was collected in real-time via Twitter’s Streaming API, the construction of query terms differed significantly. The Occupy Wall Street case study was based solely on keywords, the Departments of Transportation case study was based solely on following a fixed set of official government Twitter accounts, and the RuPaul’s Drag Race case study was based on a combination of keywords and a bound set of accounts. A focus on following users vs. keywords may also relate to the impact of cascading account and tweet deletion (where a deleted account causes retweets in other accounts to be deleted), which is slightly higher in the Departments of Transportation case. Notably, query construction may have an impact on the distribution and change of tweets with certain types of entities (URLs, mentions, and hashtags) as evidenced by the results in Tables 6.1 and 6.2.

• Metadata Stability.
Part of the criteria for each case study was an expectation that there would be a different level of stability of the metadata surrounding each case study. The descriptive statistics in the RuPaul’s Drag Race and Departments of Transportation case studies support those assumptions. I discuss the implications of changes to user-level and tweet-level metadata separately:

  ⁃ User-Level Metadata: Screen Name. Of special importance is the small percentage of users who changed their screen name during the time of observation. The screen name serves a dual purpose on most social media platforms: (1) a descriptive element similar to other items in a user’s profile such as the profile image or user description and (2) a unique identifier for each user that can be used to collect data. In Twitter, and other social media platforms, a user can be identified by a screen name and a unique numerical identifier. This unique numerical identifier is not editable by a user, but the screen name is. If researchers use a user’s screen name as a query term in their data collection, the user may drop out of their data collection if they change their screen name. For the two case studies, 1-5.5% of users changed their screen names. This supports a best practice of using a user’s unique numerical identifier instead of their screen name as a query term, as any change to the user’s screen name will impact account-focused data collection. Users who change their screen name would no longer be part of data collection after the change unless the unique identifier is used.

  ⁃ User-Level Metadata: Profile Metadata. User-level metadata such as the screen name, name, profile image, location, and description are often used to categorize users. Over half of the users in the Drag Race case study changed their profile image or profile description and a quarter of users changed their profile information in the Departments of Transportation case.
The implication is that a user may change their presentation of themselves, thereby impacting how a user is categorized, pictured in their profile photo, or described in their profile text. If the profile data analyzed is not part of the initial data collection (for example, when user profile photos are viewed at a later date), the bond between the user profile image and the post may be broken. If a research design utilizes the user profile metadata, a delay in collection of this information may result in incorrect categorization.

  ⁃ Tweet-Level Metadata. Tweet-level metadata such as the number of retweets or favorites provide a similar set of challenges as user-level metadata. Since this data can change very quickly, it is important to note when it was collected for analysis and to contextualize it within that timeframe.

• Linked Content. Social media platforms are interconnected through the links users embed within the post content. This is demonstrated by the top 10 domains for the Departments of Transportation and RuPaul’s Drag Race case studies. The top 10 domains included other posts within Twitter as well as links out to other platforms including Instagram, YouTube, Vine, and Facebook. Links within the Departments of Transportation case also included many .gov domains or URL shorteners offered by the US Federal Government when linking to government content. This points to the limitations of current methods focusing on a single social media platform for analysis, which does not accurately reflect how users actually use social media platforms.

7.3 LIMITATIONS

The focus of this dissertation was to further describe and explore the problem space around the ephemerality of social media datasets, specifically focusing on the reliability of posts and the authenticity of metadata surrounding those posts.
Heavy reliance on descriptive statistics, the data collection environment, and case construction, while providing new insights, also creates a set of limitations:

• In the Occupy Wall Street case, data collection was limited to two time points (real-time and three years later). As a result, the most politically contentious case study was excluded from the majority of the within-case-study analyses.
• Due to the daily granularity of nightly data collection, it is not possible to detect multiple changes occurring within each day or to distinguish posts missing due to API errors from those missing due to changes in user privacy settings. It was not possible to collect data more often than once a day due to API request rate limits, since checking the status of all tweets in the larger case studies took 6-7 hours.
• While the differences between case studies were significant, it is not possible to determine which prototypical feature has the highest impact on the ephemerality of each dataset.
• The descriptive and exploratory nature of this work limited the use of inferential statistics to determine causality.
• While Twitter shares many concepts, structures, metadata, and links (URLs) with other social media platforms, patterns of user activity within Twitter may differ from those on other platforms, limiting my ability to generalize outside of Twitter.
• Between and during periods of data collection, Twitter made changes to the affordances of the platform. For example, Twitter introduced simplified replies and media attachments, where the @mentions at the beginning of a tweet and URLs linking to media (photos, videos, and GIFs) at the end of a tweet do not count toward the 140-character limit.25 While none of the metadata fields analyzed in this dissertation were changed, changes in affordances may have impacted user behavior.

7.4 CONTRIBUTIONS

Returning to the guiding research questions of this work, the findings address the interaction between ephemerality and the process of data collection.
This dissertation advances the field of information science by empirically investigating how the ephemeral nature of social media data, metadata, and linked content has significant and lasting effects on the reliability and authenticity of datasets used in research. By situating research design decisions, specifically choices made on how and when to observe data, within the frameworks of process theory and archival theory, this work brings the importance of methodological considerations to the forefront of studies of digital and social media. Key contributions of this work include:

• The introduction of a new framework detailing typical methodologies for sampling data from digital and social media platforms.
• A demonstration, through an empirical analysis of descriptive data related to reliability and authenticity across three illustrative case studies, of how the challenges of ephemerality of social media translate to consequences for research.
• A design and technical system for archiving links embedded in social media datasets during data collection.
• Guidelines and design considerations for social media-based research studies to aid in limiting and understanding the impact of ephemerality.
• Limitations of social media datasets and the importance of these limitations within the field of information science.

25 See https://dev.twitter.com/overview/api/upcoming-changes-to-tweets for a description of changes made to tweets, archived at http://perma.cc/K8D6-BHM3.

7.5 FUTURE WORK

This research addressed gaps in the current literature related to social media methods and data collection. Investigation of this problem space revealed directions for future work including:

• Integrating the ongoing re-conceptualization of the archival record to take into account the concept of non-fixed, event-oriented records, seeing social media as a performance that cannot be separated from its creator (Anderson, 2013, p. 362).
• Examining the extent to which the changes in reliability and authenticity due to latency in data collection impact the results of an analysis within the same research project. This could take the form of analyzing social media data addressing the same research question at different data collection latencies.
• Examining which prototypical features have the most impact. While this work found that the prototypical features of a case study result in different patterns of inaccessibility, it was not possible to determine which have the most impact.
• Conducting sensitivity testing and random modeling of inaccessibility to better determine the extent of impact within each case study.
• Addressing ethical questions surrounding ephemeral social media data sets, from both a research methods and a human-subjects angle.

7.6 CONCLUSION

The process of collecting social media data presents a number of challenges for researchers as we attempt to add rigor to the field. In this dissertation I developed the concept of ephemerality as it relates to social media data sets, quantifying the levels of reliability and authenticity within three case studies observed over a 90-day to 3-year timeframe. To me, the most surprising results were the levels of change of user profile metadata, with over 50% of users in the RuPaul’s Drag Race case study changing their profile images and a large number of users completely rewriting their profile descriptions. The empirical results lay the foundation for future work examining what impact latency in data collection and the resulting change in a social media data set have on findings.

The results point to the need for researchers to more closely align the latency and methods of data collection with their research design. Some researchers may see these results as a call to strengthen their data collection methods to prevent change and stabilize their data sets.
Other researchers may see ephemerality as an inherent property of social media data itself. I do not have a normative stance on either view, but the results point to the importance of taking the impact of data set change into account when describing findings and limitations of research using social media data. These findings also point to the importance of more clearly describing our data collection procedures when publishing research so readers may evaluate findings in light of the research design choices that were made. Both of these steps will go a long way toward increasing the rigor of social media research.

WORKS CITED

Abbott, A. (1990). A Primer on Sequence Methods. Organization Science, 1(4), 375–392. http://doi.org/10.1287/orsc.1.4.375
Abowd, J. M., Vilhuber, L., & Block, W. (2012). A proposed solution to the archiving and curation of confidential scientific inputs. Privacy in Statistical Databases.
Acker, A. (2014). Born networked records: A history of the short message service format (Order No. 3623371). Available from ProQuest Dissertations & Theses Global. (1549977698). Retrieved from https://search.proquest.com/docview/1549977698?accountid=14784
Acker, A., & Brubaker, J. R. (2014). Death, Memorialization, and Social Media: A Platform Perspective for Personal Archives. Archivaria, (77), 1–23.
Agarwal, S. D., Bennett, W. L., Johnson, C. N., & Walker, S. (2014). A Model of Crowd Enabled Organization: Theory and Methods for Understanding the Role of Twitter in the Occupy Protests. International Journal of Communication, 8, 27.
Almuhimedi, H., Wilson, S., Liu, B., Sadeh, N., & Acquisti, A. (2013). Tweets are forever: a large-scale quantitative analysis of deleted tweets. In Proceedings of the 2013 conference on Computer supported cooperative work (pp. 897-908). ACM.
Ananny, M. (2015). Toward an Ethics of Algorithms: Convening, Observation, Probability, and Timeliness. Science, Technology & Human Values, 41(1), 93–117.
http://doi.org/10.1177/0162243915606523
Ananny, M., & Crawford, K. (2017). Seeing without knowing: Limitations of the transparency ideal and its application to algorithmic accountability. New Media & Society, 33(4), 146144481667664–17. http://doi.org/10.1177/1461444816676645
Anderson, K. (2013). The footprint and the stepping foot: archival records, evidence, and time. Archival Science, 13(4), 349–371. http://doi.org/10.1007/s10502-012-9193-2
Ankerson, M. S. (2012). Writing web histories with an eye on the analog past. New Media & Society, 14(3), 384–400. http://doi.org/10.1177/1461444811414834
Babbie, E. R. (2007). The Practice of Social Research (11th ed.). Belmont: Thomson Wadsworth.
Bamman, D., O'Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3), 259. http://doi.org/10.5210/fm.v17i3.3943
Bastos, M. T., Mercea, D., & Charpentier, A. (2015). Tents, tweets, and events: The interplay between ongoing protests and social media. Journal of Communication, 65(2), 320-350.
Bennett, W. L., & Segerberg, A. (2012). The logic of connective action: Digital media and the personalization of contentious politics. Information, Communication & Society, 15(5), 739-768.
Bennett, W. L., Segerberg, A., & Walker, S. (2014). Organization in the crowd: peer production in large-scale networked protests. Information, Communication & Society, 17(2), 232–260. http://doi.org/10.1080/1369118X.2013.870379
Bernstein, M. S., Monroy-Hernández, A., Harry, D., André, P., Panovich, K., & Vargas, G. G. (2011, July). 4chan and /b/: An Analysis of Anonymity and Ephemerality in a Large Online Community. In ICWSM (pp. 50-57).
Bowker, G. C. (2013). Data flakes: An afterword to “Raw Data” is an oxymoron. In Raw Data Is an Oxymoron. MIT Press.
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662-679.
Bozdag, E. (2013).
Bias in algorithmic filtering and personalization. Ethics and Information Technology, 15(3), 209–227. http://doi.org/10.1007/s10676-013-9321-6
Brooks, M. (2015). Human centered tools for analyzing online social data (Order No. 10000022). Available from ProQuest Dissertations & Theses Global. (1760603707). Retrieved from https://search.proquest.com/docview/1760603707?accountid=14784
Bruns, A. (2012). How Long Is a Tweet? Mapping Dynamic Conversation Networks on Twitter Using Gawk and Gephi. Information, Communication & Society, 15(9), 1323–1351. http://doi.org/10.1080/1369118X.2011.635214
Bruns, A., & Burgess, J. (2011). #Ausvotes: How twitter covered the 2010 Australian federal election. Communication, Politics & Culture, 44(2), 37.
Bruns, A., & Stieglitz, S. (2012). Quantitative Approaches to Comparing Communication Patterns on Twitter. Journal of Technology in Human Services, 30(3-4), 160–185. http://doi.org/10.1080/15228835.2012.744249
Bucher, T. (2012). Want to be on the top? Algorithmic power and the threat of invisibility on Facebook. New Media & Society, 14(7), 1164–1180. http://doi.org/10.1177/1461444812440159
Bucher, T., & Helmond, A. (2017). The affordances of social media platforms. In J. Burgess, T. Poell, & A. E. Marwick (Eds.), The SAGE Handbook of Social Media. London and New York: Sage Publications Ltd.
Burgess, J., & Bruns, A. (2014). Easy Data, Hard Data: The Politics and Pragmatics of Twitter Research after the Computational Turn. In G. Elmer, J. Langlois, & J. Redden (Eds.), Compromised Data From Social Media to Big Data (pp. 1–27).
Caren, N., & Gaby, S. (2012). Sociologist Tracks Social Media's Role in Occupy Wall Street Movement. University of North Carolina.
Crawford, K. (2009). Following you: Disciplines of listening in social media. Continuum, 23(4), 525–535. http://doi.org/10.1080/10304310903003270
Crowston, K. (2000). Process as Theory in Information Systems Research.
In Organizational and Social Perspectives on Information Technology (pp. 149–164). Boston, MA: Springer US. http://doi.org/10.1007/978-0-387-35505-4_10
Dabbish, L., Venolia, G., & Cadiz, J. J. (2003). Marked for deletion: an analysis of email data. CHI '03 extended abstracts (pp. 924–925). New York, New York, USA: ACM. http://doi.org/10.1145/765891.766073
Dalton, C., & Thatcher, J. (2014). What does a critical data studies look like, and why do we care? Seven points for a critical approach to ‘big data’. Society and Space open site.
Daniels, M. F., & Walch, T. (1984). Modern archives reader. Washington, DC: National Archives and Records Service, US General Services Administration, 1984.
de Leeuw, E. D., Hox, J., & Dillman, D. (2012). International Handbook of Survey Methodology. Routledge.
Dimitrova, D. V., & Bugeja, M. (2007). Raising the dead: Recovery of decayed online citations. American Communication Journal, 9(2), 2.
Driscoll, K., & Walker, S. (2014). Big Data, Big Questions Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data. International Journal of Communication, 8, 20.
Duranti, L. (1994). The concept of appraisal and archival theory. The American Archivist, 57(2), 328-344.
Duranti, L. (1995). Reliability and authenticity: the concepts and their implications. Archivaria, 39.
Duranti, L. (1997). The Archival Bond. Archives and Museum Informatics, 11(3-4), 213–218. http://doi.org/10.1023/A:1009025127463
Duranti, L., Eastwood, T., & MacNeil, H. (2013). Preservation of the Integrity of Electronic Records. Dordrecht: Springer Science & Business Media. http://doi.org/10.1007/978-94-015-9892-7
Edwards, P. N. (2010). A Vast Machine. MIT Press.
Felt, M. (2016). Social media and the social sciences: How researchers employ Big Data analytics. Big Data & Society, 3(1). http://doi.org/10.1177/2053951716645828
Flaxman, S., Goel, S., & Rao, J. M. (2013).
Ideological Segregation and the Effects of Social Media on News Consumption. SSRN Electronic Journal. http://doi.org/10.2139/ssrn.2363701
Freelon, D. (2014). On the interpretation of digital trace data in communication and social computing research. Journal of Broadcasting & Electronic Media, 58(1), 59-75.
Gaffney, D., & Puschmann, C. (2014). Data collection on Twitter. Twitter and Society. New York.
Gerlitz, C., & Helmond, A. (2013). The like economy: Social buttons and the data-intensive web. New Media & Society, 15(8), 1348–1365. http://doi.org/10.1177/1461444812472322
Gerlitz, C., & Rieder, B. (2013). Mining one percent of Twitter: collections, baselines, sampling. M/C Journal, 16(2).
Gibson, J. J. (1977). The theory of affordances. Hillsdale, USA.
Gillespie, T. (2010). The politics of “platforms.” New Media & Society, 12(3), 347–364. http://doi.org/10.1177/1461444809342738
Goble, C., Stevens, R., Hull, D., Wolstencroft, K., & Lopez, R. (2008). Data curation + process curation=data integration + science. Briefings in Bioinformatics, 9(6), 506–517. http://doi.org/10.1093/bib/bbn034
Goffman, E. (1990). The Presentation of Self in Everyday Life. Penguin Books, Limited (UK).
Goh, D. H. L., & Ng, P. K. (2007). Link decay in leading information science journals. Journal of the American Society for Information Science and Technology, 58(1), 15–24. http://doi.org/10.1002/asi.20513
González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. http://doi.org/10.1016/j.socnet.2014.01.004
Gray, J., Szalay, A. S., Thakar, A. R., Stoughton, C., & Vandenberg, J. (2002). Online scientific data curation, publication, and archiving. arXiv preprint cs/0208012.
Grosser, B. (2014). What do metrics want? How quantification prescribes social interaction on Facebook. Computational Culture: a journal of software studies, 4.
Gummadi, K. P., Saroiu, S., & Gribble, S. D.
(2002, November). King: Estimating latency between arbitrary internet end hosts. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment (pp. 5-18). ACM.
Hargittai, E., & Sandvig, C. (2015). Digital Research Confidential. MIT Press.
Helmond, A. (2015). The Platformization of the Web: Making Web Data Platform Ready. Social Media & Society, 1(2). http://doi.org/10.1177/2056305115603080
Herring, S. C. (2010). Web Content Analysis: Expanding the Paradigm. In Web content analysis: Expanding the paradigm (pp. 233–249). Dordrecht: Springer Netherlands. http://doi.org/10.1007/978-1-4020-9789-8_14
Holmes, O. (1964). Archival Arrangement—Five Different Operations at Five Different Levels. The American Archivist, 27(1), 21-42.
Karlsson, M. (2012). Charting the liquidity of online news: Moving towards a method for content analysis of online news. International Communication Gazette, 74(4), 385–402. http://doi.org/10.1177/1748048512439823
Karpf, D. (2012). Social Science Research Methods in Internet Time. Information, Communication & Society, 15(5), 639–661. http://doi.org/10.1080/1369118X.2012.665468
Kim, J., & Kim, E. J. (2008). Theorizing Dialogic Deliberation: Everyday Political Talk as Communicative Action and Dialogue. Communication Theory, 18(1), 51–70. http://doi.org/10.1111/j.1468-2885.2007.00313.x
Kitchin, R. (2014). The Data Revolution. SAGE.
Knapp, M. L., Stafford, L., & Daly, J. A. (1986). Regrettable Messages: Things People Wish They Hadn't Said. Journal of Communication, 36(4), 40–58. http://doi.org/10.1111/j.1460-2466.1986.tb01449.x
Krippendorff, K. (2012). Content analysis: An introduction to its methodology. Sage.
Levenshtein, V. I. (1966, February). Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady, 10(8), 707-710.
Liang, H., & Fu, K.-W. (2015). Testing Propositions Derived from Twitter Studies: Generalization and Replication in Computational Social Science. PLoS ONE, 10(8), e0134270.
http://doi.org/10.1371/journal.pone.0134270
Light, B., & McGrath, K. (2010). Ethics and social networking sites: a disclosive analysis of Facebook. Information Technology & People, 23(4), 290–311. http://doi.org/10.1108/09593841011087770
Lynch, C. (2008). Big data: How do your data grow? Nature, 455(7209), 28–29. http://doi.org/10.1038/455028a
Malik, M. T., Gumel, A., Thompson, L. H., Strome, T., & Mahmud, S. M. (2011). “Google Flu Trends” and Emergency Department Triage Data Predicted the 2009 Pandemic H1N1 Waves in Manitoba. Canadian Journal of Public Health / Revue Canadienne de Santé Publique, 102(4), 294–297. http://doi.org/10.2307/41995614?ref=no-x-route:af550292fd45cb5ef4f567324da26478
Manovich, L. (2013). Software takes command (Vol. 5). A&C Black.
McKemmish, S. (2001). Placing records continuum theory and practice. Archival Science, 1(4), 333–359. http://doi.org/10.1007/BF02438901
Miller, C., Ginnis, S., Stobart, R., Krasodomski-Jones, A., & Clemence, M. (2015). The road to representivity, a Demos and Ipsos MORI report on sociological research using Twitter. London: Demos. Available at: http://www.demos.co.uk/files/Road_to_representivity_final.pdf
Moghaddam, A. I., Saberi, M. K., & Esmaeel, S. M. (2012). Availability and half-life of web references cited in Information Research Journal: a citation study. International Journal of Information Science and Management (IJISM), 8(2), 57–75.
Mohr, G., Stack, M., Ranitovic, I., Avery, D., & Kimpton, M. (2004). An Introduction to Heritrix: An open source archival quality web crawler. In IWAW’04, 4th International Web Archiving Workshop.
Nahon, K. (2015). Where there is Social Media there is Politics. In A. Bruns, E. Skogerbo, C. Christensen, O. A. Larsson, & G. S. Enli (Eds.), Routledge Companion to Social Media and Politics (pp. 39–55). NYC, NY.
Nahon, K., Hemsley, J., Walker, S., & Hussain, M. (2011).
Fifteen Minutes of Fame: The Power of Blogs in the Lifecycle of Viral Political Information. Policy & Internet, 3(1), 1–28. http://doi.org/10.2202/1944-2866.1108
Neumayer, C., & Stald, G. (2014). The mobile phone in street protest: Texting, tweeting, tracking, and tracing. Mobile Media & Communication, 2(2), 117–133. http://doi.org/10.1177/2050157913513255
Park, H. W., & Thelwall, M. (2003). Hyperlink Analyses of the World Wide Web: A Review. Journal of Computer-Mediated Communication, 8(4). http://doi.org/10.1111/j.1083-6101.2003.tb00223.x
Parmelee, J. H., & Bichard, S. L. (2013). Politics and the Twitter Revolution.
Patton, M. Q. (2001). Qualitative Research & Evaluation Methods (3rd ed.). Thousand Oaks, Calif: SAGE Publications, Inc.
Pearce-Moses, R. (2005). A glossary of archival and records terminology. Society of American Archivists.
Petrovic, S., Osborne, M., & Lavrenko, V. (2013). I wish i didn't say that! analyzing and predicting deleted messages in twitter. arXiv preprint arXiv:1305.3107.
Phillips, W. (2011). LOLing at tragedy: Facebook trolls, memorial pages and resistance to grief online. First Monday, 16(12). http://doi.org/10.5210/fm.v16i12.3168
Plantin, J.-C., Lagoze, C., Edwards, P. N., & Sandvig, C. (2016). Infrastructure studies meet platform studies in the age of Google and Facebook. New Media & Society, 18(1), 1–18. http://doi.org/10.1177/1461444816661553
Ribes, D. (2014). The kernel of a research infrastructure. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (pp. 574–587). ACM. http://doi.org/10.1145/2531602.2531700
Roblyer, M. D., McDaniel, M., Webb, M., Herman, J., & Witty, J. V. (2010). Findings on Facebook in higher education: A comparison of college faculty and student uses and perceptions of social networking sites. The Internet and Higher Education, 13(3), 134–140. http://doi.org/10.1016/j.iheduc.2010.03.002
SalahEldeen, H. M., & Nelson, M. L. (2013).
Reading the correct history?: modeling temporal intention in resource sharing. the 13th ACM/IEEE-CS joint conference (pp. 257–266). New York, New York, USA: ACM. http://doi.org/10.1145/2467696.2467721
Saltzis, K. (2012). Breaking News Online: How news stories are updated and maintained around-the-clock. Journalism Practice, 6(5-6), 702–710. http://doi.org/10.1080/17512786.2012.667274
Sanderson, R., Phillips, M., & Van de Sompel, H. (2011). Analyzing the persistence of referenced web resources with memento. arXiv preprint arXiv:1105.3459.
Seaver, N. (2015). The nice thing about context is that everyone has it. Media, Culture & Society, 37(7), 1101–1109. http://doi.org/10.1177/0163443715594102
Segerberg, A., & Bennett, W. L. (2011). Social Media and the Organization of Collective Action: Using Twitter to Explore the Ecologies of Two Climate Change Protests. Communication Review, 14(3), 197–215.
Simons, H. (2006). The Paradox of Case Study. Cambridge Journal of Education, 26(2), 225–240. http://doi.org/10.1080/0305764960260206
Sleeper, M., Cranshaw, J., Kelley, P. G., Ur, B., Acquisti, A., Cranor, L. F., & Sadeh, N. (2013). i read my Twitter the next morning and was astonished: a conversational perspective on Twitter regrets. the SIGCHI Conference (pp. 3277–3286). New York, New York, USA: ACM. http://doi.org/10.1145/2470654.2466448
Sood, S., & Loguinov, D. (2011, October). Probabilistic near-duplicate detection using simhash. In Proceedings of the 20th ACM international conference on Information and knowledge management (pp. 1117–1126). ACM.
Starbird, K., & Palen, L. (2010). Pass it on?: Retweeting in mass emergency (pp. 1–10). International Community on Information Systems for Crisis Response and Management.
Starbird, K., & Palen, L. (2012). (How) will the revolution be retweeted?: information diffusion and the 2011 Egyptian uprising. the ACM 2012 conference (pp. 7–16). New York, New York, USA: ACM.
http://doi.org/10.1145/2145204.2145212
Starbird, K., Maddock, J., Orand, M., Achterman, P., & Mason, R. M. (2014). Rumors, False Flags, and Digital Vigilantes: Misinformation on Twitter after the 2013 Boston Marathon Bombing. iConference 2014 Proceedings: Breaking Down Walls. Culture - Context - Computing. http://doi.org/10.9776/14308
Thumim, J. (2002). ‘Mrs. Knight must be balanced’: Methodological problems in researching early British television. In S. Allan, B. G, & C. C (Eds.), News, Gender, and Power (pp. 91–104). London: News.
Valenti, J. (2014, May 28). #YesAllWomen Reveals the Constant Barrage of Sexism That Women Face. The Guardian. Retrieved from http://www.theguardian.com/commentisfree/2014/may/28/yesallwomen-barage-sexism-elliot-rodger
van Dijck, J. (2013). The Culture of Connectivity. Oxford University Press.
Vis, F. (2013). A critical reflection on Big Data: Considering APIs, researchers and tools as data makers. First Monday, 18(10). http://doi.org/10.5210/fm.v18i10.4878
Williams, S. A., Terras, M. M., & Warwick, C. (2013). What do people study when they study Twitter? Classifying Twitter related academic papers. Journal of Documentation, 69(3), 384–410. http://doi.org/10.1108/JD-03-2012-0027
Yin, R. K. (2014). Case Study Research. SAGE Publications.
Zhang, B., Ng, T. S. E., Nandi, A., Riedi, R., Druschel, P., & Wang, G. (2006). Measurement based analysis, modeling, and synthesis of the internet delay space. the 6th ACM SIGCOMM (pp. 85–98). New York, New York, USA: ACM. http://doi.org/10.1145/1177080.1177091
Zhang, S. (2015). Using Twitter to Enhance Traffic Incident Awareness. 2015 IEEE 18th International Conference on Intelligent Transportation Systems - (ITSC 2015), 2941–2946. http://doi.org/10.1109/ITSC.2015.471
Zhou, L., Wang, W., & Chen, K. (2016). Tweet Properly: Analyzing Deleted Tweets to Understand and Identify Regrettable Ones (pp. 603–612). International World Wide Web Conferences Steering Committee.
http://doi.org/10.1145/2872427.2883052
Zimmer, M. (2010). “But the data is already public”: on the ethics of research in Facebook. Ethics and Information Technology, 12, 313–325. http://doi.org/10.1007/s10676-010-9227-5
Zimmer, M., & Proferes, N. J. (2014). A topology of Twitter research: disciplines, methods, and ethics. Aslib Journal of Information Management, 66(3), 250–261. http://doi.org/10.1108/AJIM-09-2013-0083

APPENDIX A: IMPLICATIONS OF THIS RESEARCH FOR SOCIAL MEDIA RESEARCH

In this appendix, I move beyond the context of Twitter to discuss general implications for social media research arising out of this work. For researchers who have not yet started data collection, these implications act as a set of considerations for approaching data collection. For researchers who have already collected data, these implications describe a set of limitations of their data collection procedures. These implications emerge from the findings in this dissertation as well as my direct experience with the challenges of collecting and analyzing social media data.

As previously noted in Chapter 5, the data in this dissertation come from three Twitter-based case studies. Twitter was chosen as the object of study because: 1) the use of Twitter as an object of study and source of observational data is pervasive in academic research (Williams et al., 2013; Zimmer & Proferes, 2014), 2) Twitter is less susceptible to algorithmic filtering, also called ‘filter bubbles’ (Bozdag, 2013; Bruns & Stieglitz, 2012; Bucher, 2012; Flaxman et al., 2013; van Dijck, 2013, p. 75), than other platforms, since the public APIs return all public, non-deleted statuses matching query terms, theoretically producing a more “accurate” record[26] (Driscoll & Walker, 2014), and 3) the concepts, structures, metadata, and links (URLs) easily generalize beyond Twitter to other social media services and platforms.
Focusing on the concepts, structures, metadata, and links within each case study yielded a general set of implications for social media research.
[26] https://dev.twitter.com/streaming/overview

1. The Impact of Latency in Data Collection.
The time between the event or phenomenon under investigation and data collection is an important consideration, since latency is a significant factor in predicting the availability of posts within social media datasets. Data from the case studies in this dissertation point to the first 48 hours as a critical time period, but posts continue to become unavailable over time. In addition to the posts, the metadata surrounding posts also changes over time. Users change their profiles, images, and locations. Statistics related to users and posts also change over time. This content is bound to a specific point in time, so breaking that binding may result in the analysis of data unrelated to the original post content. For example, if an Instagram post was collected in September but the content of the user profile was viewed and analyzed in January of the following year, the user profile may no longer represent the user at the time of posting. To address this issue, collection of the Instagram post as well as the user profile would need to be integrated into the research design.

2. Posts are Assemblages of Content.
Social media platforms are not just filled with text. Posts and user profiles are assemblages of content made up of the post content, post metadata, user metadata, and linked content. Depending on the platform, the content of the post could be text (tweets are limited to 140 characters) or images and text (Instagram posts contain an image as well as descriptive text). Metadata about the user and other users’ interactions with the posts are also displayed, such as the number of retweets on Twitter or the number of likes for an Instagram post. Content is also linked to via URLs or embedded into the rendered post.
When a post is rendered for a platform’s API or user-facing web interface, subsets of the content are rendered and organized by the logic of the algorithms within the platform. When collecting data, the interface used by a researcher may render all or only part of this content and metadata, leaving the researcher with a partial view of the post.

3. The Affordances of Platforms Create Constraints for Users and Researchers.
The affordances, or features, of social media platforms create constraints for users as well as for researchers collecting data from the platforms. Platforms offer users a specific set of interactions; as a result, users are unable to perform activities and actions not offered by a site. For example, Facebook users are provided with a set of emotional reactions (like, love, haha, cry, angry, and wow) to respond to each post, tweets are limited to 140 characters, and Snapchat messages can be viewed for a limited amount of time before self-destructing. Users sometimes develop practices to get around these limitations, such as including links in tweets or taking screenshots of snaps. The affordances and practices within a platform must be closely matched to the research questions and phenomena under investigation.

Researchers are also constrained by the affordances of the interfaces each platform offers for data collection. Data can be collected from user-focused web interfaces and software-focused APIs, each offering its own set of constraints and subset of data. For example, collecting screenshots of posts would provide a rendering of the post similar to the experience of platform users, but may contain a limited set of metadata about the post and user. Collecting the same post via a platform’s API may provide extended metadata, but the format does not provide information about how posts are rendered and presented to users of the platform.
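The considerations in items 1 through 3 can be sketched as a collection record in which the raw API response, a profile snapshot taken at collection time, and a pointer to a rendered capture are stored together with both timestamps, so that collection latency can be computed later. This is only a minimal illustration: the class, field, and method names are hypothetical and not drawn from any platform's API, and the 48-hour default reflects the critical period observed in this dissertation's case studies rather than a general rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


@dataclass
class CollectedPost:
    """One observation of a post (hypothetical schema, for illustration only)."""
    post_id: str
    created_at: datetime            # when the post was produced on the platform
    collected_at: datetime          # when this observation was made
    api_payload: Dict[str, Any]     # raw API response, kept unmodified
    user_snapshot: Dict[str, Any]   # profile fields as they were at collection time
    rendered_capture: Optional[str] = None  # path to a screenshot or archived page

    def latency(self) -> timedelta:
        """Time elapsed between production of the post and its collection."""
        return self.collected_at - self.created_at

    def within_window(self, hours: int = 48) -> bool:
        """True if the post was collected within `hours` of being produced."""
        return self.latency() <= timedelta(hours=hours)


# Example values are invented placeholders.
post = CollectedPost(
    post_id="example-1",
    created_at=datetime(2016, 9, 1, 12, 0, tzinfo=timezone.utc),
    collected_at=datetime(2016, 9, 2, 18, 30, tzinfo=timezone.utc),
    api_payload={"text": "example post"},
    user_snapshot={"screen_name": "example_user", "location": "Seattle"},
)
print(post.latency())
print(post.within_window())
```

Storing the profile snapshot with each observation, rather than looking the profile up at analysis time, keeps the user metadata tied to the moment the post was collected.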
Similarly, the metrics provided by each platform privilege some activities while limiting or preventing the visibility of other types of activities. For example, the number of times a tweet was retweeted is often used as a measure of the popularity or reach of a tweet. The number of retweets is displayed prominently when viewing a tweet on the Twitter website. It is important to note that no other measures related to the number of times a tweet has been seen by users are displayed. Using the number of retweets as a measure privileges the production of posts over other types of activity, such as listening.

4. What data may I want to collect and analyze?
Below is a list of components of a social media post. Considerations for data collection are given for each component:
• Post content. The content of posts on most social media platforms consists of more than just the text: posts often include images and links. If this content will be included in your analysis, it should also be included as part of your data collection. If you're using an API to collect data, mapping the fields received from the API to the rendered post on the public web interface can help locate missed content and differences between the text-based API response and the rendered version of the post.
• User profiles. Users change the name, description, image, location, and URLs within their profiles. If user profile information is included in your analysis, consider whether your data collection strategy includes all of this information. As users change their profile information, their roles and presentations may change significantly.
• URLs. Embedding URLs in posts and user profiles is a common affordance of social media platforms. Links extend the reach and content of a post, so examining the content of URLs should be considered when analyzing the content of a social media post. URLs change or become inaccessible over time, so an archiving strategy should be considered as part of your data collection strategy.
• Query terms.
A platform may also allow users to change their usernames, which affects your data collection if you use usernames as part of your query terms. Most platforms assign users unique numeric ids that do not change when a user updates their screen name. The unique user id is a more stable reference to use as a query term than a username.

APPENDIX B: CASE STUDY QUERY TERMS

Occupy Wall Street - Keywords
A list of keywords used to collect tweets surrounding the Occupy Wall Street movement can be found at https://github.com/somelab/SoMeToolkit/blob/master/collection.terms.

RuPaul’s Drag Race Case - List of Accounts and Keywords Followed
Keywords
#DragRace, #DragRaceAllStars, #AllStars, #AllStars2, #RPDR, #RuPaul - popular show hashtags
DragRace OR DragRaceAllStars OR AllStars OR AllStars2 OR RPDR OR RuPaul OR RuPaulsDragRace OR DragRaceAllStar OR RuPaul OR michellevisage OR AdoreDelano OR Alaska5000 OR AlyssaEdwards_1 OR cocomontrese OR TheOnlyDetox OR TheGingerMinj OR katya_zamo OR PhiPhiOhara OR roxxxyandrews OR TATIANNANOW - hashtags used to express support for the final 3 contestants
Accounts
@RuPaulsDragRace - OFFICIAL @LogoTV #DragRace Twitter account
@RuPaul - RuPaul’s official account and head judge
@Michellevisage - Drag Race Judge
Show contestants: @AdoreDelano @Alaska5000 @AlyssaEdwards_1 @Cocomontrese @TheOnlyDetox @TheGingerMinj @katya_zamo @PhiPhiOhara @Roxxxyandrews @TATIANNANOW

Departments of Transportation Case - List of Accounts Followed
State Accounts
Washington
@wsdot - Statewide updates
@wsdot_traffic - Traffic and construction reports for King, Snohomish, Skagit and Whatcom counties
@wsdot_sw - Traffic reports for Vancouver and southwest Washington
@wsdot_passes - Mountain pass reports
@wsdot_tacoma - Traffic and construction reports for Pierce, Thurston, Mason and Kitsap counties
@goodtogowsdot - Good To Go!
tolling information
@snoqualmiepass - Snoqualmie Pass conditions and project info
@wsferries - Ferry alerts and updates
@wsdot_east - Traffic and highway news and information east of the Cascade Mountains
@GoodToGoWSDOT - Washington state’s toll system
@BerthaDigsSR99 - Official account of the tunneling machine digging the SR 99 tunnel to replace Seattle’s Alaskan Way Viaduct
@wsdot_520 - Official WSDOT feed for 520 construction updates
Oregon
@OregonDOT - Official Oregon Dept. of Transportation Twitter account
@MyOReGO - Road usage charge program of the Oregon Department of Transportation
@TripCheckPDX - TripCheck Portland
@TripCheckSalem - Tripcheck Salem
@TripCheckEugne - TripCheck Eugene
@TripCheckNCascd - TripCheck Cascades
@TripCheckS_OR - Tripcheck S Oregon
@TripCheckI_84 - TripCheck I-84
@TripCheckSE_OR - TripCheck SE Oregon
California
@CaltransHQ - The official Twitter of Caltrans
@CaltransDist1 - Del Norte, Humboldt, Lake, and Mendocino
@CaltransD2 - Counties of Shasta, Siskiyou, Trinity, Tehama, Modoc, Lassen, Plumas, and parts of Butte and Sierra
@CaltransDist3 - Butte, Colusa, El Dorado, Glenn, Nevada, Placer, Sacramento, Sierra, Sutter, Yolo and Yuba
@CaltransD4 - Bay Area
@CaltransD5 - Monterey, San Benito, San Luis Obispo, Santa Barbara, and Santa Cruz Counties
@Caltransdist6 - Fresno, Madera, Kings, Tulare and Kern counties
@CaltransDist7 - Los Angeles & Ventura County
@Caltrans8 - Riverside and San Bernardino Counties
@Caltrans9 - Eastern Sierra Nevada and California
@CaltransDist10 - Alpine, Amador, Calaveras, Mariposa, Merced, San Joaquin, Stanislaus and Tuolumne Counties
@SDCaltrans - San Diego and Imperial Counties
@Caltrans12 - Orange County
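
Appendix A recommends unique numeric user ids as more stable query terms than usernames. For account lists such as those in this appendix, one possible way to follow that advice is to key the follow list by id and keep a history of the screen names observed for each id. The sketch below assumes that design; the class is hypothetical, the numeric ids and the renamed handle are made-up placeholders rather than real Twitter identifiers, and resolving a handle to its id would require a platform lookup that is not shown.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


class FollowList:
    """Accounts keyed by numeric id, with the screen names seen for each id."""

    def __init__(self) -> None:
        self._names: DefaultDict[int, List[str]] = defaultdict(list)

    def observe(self, user_id: int, screen_name: str) -> None:
        """Record a screen name for an id when it differs from the last one seen."""
        names = self._names[user_id]
        # Compare case-insensitively so a change in capitalization alone
        # is not recorded as a rename.
        if not names or names[-1].lower() != screen_name.lower():
            names.append(screen_name)

    def current_name(self, user_id: int) -> Optional[str]:
        """Most recently observed screen name for an id, or None if unknown."""
        names = self._names.get(user_id)
        return names[-1] if names else None

    def ids(self) -> List[int]:
        """Stable ids to use as query terms."""
        return sorted(self._names)


follow = FollowList()
follow.observe(1001, "wsdot")              # placeholder id
follow.observe(1002, "OregonDOT")          # placeholder id
follow.observe(1002, "OregonDOT_renamed")  # hypothetical rename, for illustration
print(follow.ids())
print(follow.current_name(1002))
```

Because queries are built from `ids()` rather than from handles, a rename changes only the recorded name history and not the set of accounts being followed.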