Collaborative Librarians

Data don't tell the whole story.

CI Article: Synergizing in Cyberinfrastructure Development January 9, 2012

Filed under: CI Article,Coordinating Centers,Cyberinfrastructure,eScience — Betsy Rolland @ 10:53 am

Bietz, M. J., E. P. S. Baumer, C. P. Lee. (2010). “Synergizing in Cyberinfrastructure Development.” Computer Supported Cooperative Work, 19(3-4): 245-281.

Bietz et al. studied a nascent marine metagenomics collaboration called Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), focusing on the work of the developers in creating infrastructure for the group. This paper takes the authors’ earlier work on human infrastructure (Lee et al., 2006) and expands it to include notions of synergizing, leveraging and aligning. They define synergizing as the “active, strategic work of managing multiple relationships for infrastructure development” (p. 251) and relate it to the concept of the embeddedness of the developers as both a constraint and a resource. Developers, defined as anyone involved in the development of a new infrastructure, are required to work within the rules and limitations of the various infrastructures in which they are already embedded (e.g., a university, a development team, an academic discipline). At the same time, they are able to take advantage of the relationships they have at their disposal thanks to those infrastructures (e.g., coworkers from former development projects, existing technology transfer agreements with other universities). Developers leverage existing relationships and technologies in service of their goals, while also aligning themselves with others to get work done.

The bottom line here is that CI cannot be fully understood without taking into account both the social and technological issues inherent in building new infrastructure. For example, the authors demonstrate how some tech decisions are made for social reasons, such as choosing the software the university already supports even if it’s not the most robust or sharing server space with collaborators rather than purchasing one’s own.

As with Lee et al.’s original human infrastructure paper, I find this work very useful for my own research on coordinating centers because of its focus on the messiness of science. I think it’s a myth that it’s possible to implement scientific research according to a 5-year plan; the very raison d’être of science is exploring something we don’t fully understand. In fact, it would be an interesting study to compare the timelines proposed in grant proposals with what actually happened in the projects! A research project needs to retain enough flexibility to respond to changes in not only the science and technology but also the people involved. Can we embrace the messiness of science instead of trying to control it with arbitrary schedules and deadlines?

Lee, C. P., Dourish, P., & Mark, G. (2006). The human infrastructure of cyberinfrastructure. In Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work (pp. 483–492). New York: ACM.

 

Science of Team Science Conference February 2, 2011

I recently registered to attend the 2011 Science of Team Science Conference, hosted by Northwestern University’s NUCATS Institute and its Research Team Support & Development office. I couldn’t be more excited. I wasn’t able to attend last year’s conference because of my heavy travel schedule for the SLA research grant, so I’m thrilled to be able to attend this year.

I’ve been virtually attending last year’s conference via the PPT and MP3 recordings they’ve posted for each session. This is a treasure trove of information and worth perusing. I’ve listened to several presentations so far and have read the minutes, which are well done and really capture the essence of each conversation. They even captured the Q&A sessions!

I think it will be especially interesting to attend in my dual role as social science researcher and practitioner, as this doesn’t seem to be very common. I have to admit, I’ve been a little disappointed about the lack of discussion about libraries, librarians or even information management. I may submit a poster on that topic, just to make sure it makes it onto the radar.

 

CI Article: “New Knowledge from Old Data: The Role of Standards in the Sharing and Reuse of Ecological Data” February 1, 2011

Filed under: CI Article,Curation,Cyberinfrastructure,Data,eScience — Betsy Rolland @ 7:00 am

Zimmerman, A. (2008). “New Knowledge from Old Data.” Science, Technology, & Human Values 33(5): 631-652.

Zimmerman interviewed 13 ecologists about their use of secondary data (i.e., data they did not collect themselves) in order to tease out the role standards might play in the process of re-using data for new analyses. She found that the primary determinant in an ecologist’s decision to use the data was the researcher’s own ability to understand the data. This understanding was heavily contingent upon the researcher’s field experience and knowledge of collecting similar data. If the ecologist considered the data to be generally difficult to collect or the kind of data that was frequently poorly understood, the data were not used. A second consideration was the reputation of the data collectors themselves or a personal relationship with the data collectors.

Zimmerman concludes that standards, while potentially useful, would be difficult to develop because the collection of data is so context-dependent. In short, the research questions determine how the data are collected and which data points are important. It would be a staggering task to try to develop standards that would cover every context and approach. Even if that were possible, science moves so quickly that the standards would likely be obsolete by the time they were approved.

There was no mention in this article about the potential for others to help with the curation or development of understanding of the data. Does the individual investigator need to be involved or is this a question that can be delegated to graduate students or a data manager? Was it a collective decision or one made by the lead researcher? The participants described a process of repeatedly going back to the journal article where the secondary data are described. I would have liked to know more about what types of information they were looking for when they did that. Which types of contextual information were most important to them? Could they even tell us or is that another form of tacit knowledge they find difficult to articulate?

 

Data sharing plans January 31, 2011

Filed under: Curation,Data,eScience — Betsy Rolland @ 8:00 am

An upcoming commentary in the Lancet (Walport, M. and P. Brest. “Sharing research data to improve public health.” The Lancet, In Press, Corrected Proof), signed by the leaders of key funding agencies, makes clear that these agencies will join major journals in demanding that data be deposited as a condition of funding or article publication. But then what? What do the agencies plan to do with the data sets they receive? How will they safeguard them, and how will they provide and monitor access? What are the plans to protect patient privacy? This is especially crucial in the case of genome-wide association studies (GWAS) where genetic data would be deposited. As I understand it, a patient could theoretically be identified by those data alone.

Assuming such questions get ironed out, who is the intended audience for such data sets? As discussed in many key articles on data sharing, data sets can’t simply be handed over with no further explanation. Absent standards for data curation, it’s difficult to believe many data sets can be downloaded by a new research team and used without an investment of time from the original data collectors. How many researchers will be willing to take the time to help someone else understand the context of the study and even the specific meaning of each variable? Often, especially in cancer epidemiology, data are collected over a period of time, during which the protocol may change, producing a data set with one variable with multiple meanings.

Without contact with the data collection team or investigator, researchers will have a difficult time assessing the trustworthiness, reliability or appropriateness of any given data set. So, what is the goal of the funding agencies and journals in demanding deposit of data sets? Without a focus on the social aspects of data, as discussed by Birnholtz & Bietz, among others, and a greater understanding of how scientists actually use data, it’s hard to see how these data deposit initiatives move science forward.

 

CI Article: Data at Work: Supporting Sharing in Science and Engineering January 29, 2011

Filed under: CI Article,Cyberinfrastructure,Data,eScience — Betsy Rolland @ 3:07 pm

Birnholtz, J. P. and M. J. Bietz (2003). Data at work: supporting sharing in science and engineering. Proceedings of the 2003 international ACM SIGGROUP conference on Supporting group work. Sanibel Island, Florida, USA, ACM: 339-348.

“Recent calls for open science and data sharing suggest that funding agencies believe that groundbreaking scientific research requires more data sharing among scientists. Even if we provide the technical means to move data from one lab to another, however, there may be social barriers to effectively using this data in practice. To design technologies that truly support the conduct of science, and not just the sharing of a data set, we argue that the designer must understand both the scientific role that data play in producing knowledge, and the social role that data play in the conduct of scientific work.” (p. 340)

In this article on sharing scientific data, Birnholtz & Bietz discuss the social nature of data in collaborative research, describing some of the problems inherent in trying to share data between and among researchers, as well as what CSCW researchers can learn by thinking of data in this way.

The authors group the difficulties of sharing data into three categories: “1) willingness to share, 2) locating shared data, and 3) using shared data.” Data are often the end result of years of hard work and represent a scientist’s work product. From the scientist’s point of view, giving that hard work away to someone else doesn’t make sense. Finding data is a huge challenge, given the lack of a central registry. In my experience, scientists use a variety of strategies, including journal papers, government data repositories and colleagues. Once data have been located and received (a process with its own set of issues), actually using the data is fraught with difficulties. Assessing quality and trustworthiness is especially difficult if an investigator doesn’t have access to the original data collectors. Simply knowing what the data actually represent is also a challenge. Many data sets, especially older, legacy data, may not have a relevant data dictionary or anyone who remembers what a specific variable meant. As Birnholtz & Bietz point out, “Even if documentation is provided, however, it is often the case that much of the knowledge needed to make sense of data sets is tacit.” How can we capture that information?

The authors go on to discuss how data sharing and data practices vary among the three fields they study: earthquake engineering, space physics and HIV/AIDS research. They finish with a set of recommendations for CSCW researchers.

An intriguing thread not further developed in this article is the idea that the level of task uncertainty in a given field affects or influences the frequency or types of data sharing that occur. This makes intuitive sense to me, as I think of my experience in cancer epidemiology. While epidemiological data aren’t standardized, by any stretch, they seem to describe a finite universe — characteristics of people and their environments and habits. Physical activity, smoking, diseases, environment are known concepts about which to collect data. A more fluid, less established field may have more variation in the data collected. I would really like to see this area further developed, as I think it has the potential to really help us think about data sharing.

I really appreciate this article’s emphasis on data and science as socially constructed, because I think it gives us the opportunity to think of supporting science in ways that lie outside of technological solutions. It’s not enough to construct a database that combines two disparate data sets if the context and tacit knowledge inherent in the data sets aren’t taken into consideration. Without a true understanding of the data, harmonization fails and, worse, leads to bad science.

 

CI Article: Tensions across the scales: Planning infrastructure for the long-term January 17, 2011

Filed under: CI Article,Cyberinfrastructure,eScience — Betsy Rolland @ 12:05 pm

Ribes, D., & Finholt, T. A. (2007). Tensions across the scales: Planning infrastructure for the long-term. In Proceedings of the 2007 International ACM Conference on Supporting Group Work (pp. 229–238). New York: ACM.

Ribes & Finholt describe nine tensions inherent in the move from short-term to long-term infrastructure for science. These tensions are the intersection of three “concerns of actors” and three “scales of infrastructure.” Their aim is not to prescribe how to build infrastructure for the long-term, as no one yet knows how to do that, but to define a set of researchable questions around this topic so that we can begin to get an idea of what to pay attention to.

The first tension Ribes & Finholt discuss is “Project vs. facility,” noting that most CI endeavors are funded as projects, with finite timelines and scopes and no clear path to renewal of funding. This discourages the kind of long-term planning and thinking that could add stability to a CI infrastructure and most likely leads to wasting money. Rather than investing in one CI project for a domain community, funding agencies fund smaller projects, each of which builds its own CI.

Ribes & Finholt’s second tension speaks to “Individual vs. community interests.” This is a common theme in discussions of CI: building large infrastructure projects to support science requires not only computer scientists but also domain experts. Yet the reward system for scientists doesn’t give credit for that type of work. If only a domain expert can generate appropriate metadata for a database of genetic structures but the time s/he spends on that task doesn’t help in the race toward tenure, the expert won’t be able to justify the time spent. But then the whole community loses out. This same argument applies to proactively preparing data to share, submitting to open access journals that aren’t yet valued by the community, etc. Some of these issues are also explored in the tension “Research vs. development.”

After describing the other tensions, Ribes & Finholt conclude with an emphasis on the human side of infrastructure, drawing upon the Lee et al. paper on human infrastructure (reference below). Ribes & Finholt note: “[h]owever, while the work of design and development is ‘human,’ the challenges are more comprehensively described as technical, organizational and institutional. In considering design and enactment of infrastructure it is best to address ‘hard and soft’ foundations hand-in-hand, they are usually more intimately entwined than any raw distinction would suggest (236)” (emphasis in original).

One of the things I like about this article is that Ribes & Finholt focus not only on the domain scientists and computer scientists themselves but also on the project managers. This group is often hidden or forgotten in the writing on CI but is critical to the success or failure of a project.

 

Lee, C. P., Dourish, P., & Mark, G. (2006). The human infrastructure of cyberinfrastructure. In Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work (pp. 483–492). New York: ACM.

 

Making sense of Cyberinfrastructure January 17, 2011

Filed under: Cyberinfrastructure,eScience — Betsy Rolland @ 9:59 am

As I try to make sense of the literature in the area of CI, I’ve begun compiling a reading list and plan to start writing notes here on the blog about the various articles I’m reading. Even as a trained, though non-practicing, librarian, I struggle with wrapping my arms around this body of literature, keeping everything organized and holding the various arguments straight in my head.

The field of CI is especially daunting, in my opinion, because it’s so new and so interdisciplinary. Its leaders hail from a wide variety of disciplines, including information science, technical communication, the hard sciences, anthropology, sociology, social psychology, and more. Each brings his/her disciplinary background to their writing, leaving new students to tussle with not only the new CI material but whatever schools of thought the author represents. It adds great diversity and depth to the field but can be a bit intimidating for new students, I think.

How do others approach immersing themselves in a new field? What do you do first, how do you organize the actual literature, how do you build knowledge in a new domain?

 

 